Hidden Markov Models
COSI 114 – Computational Linguistics
James Pustejovsky
February 27, 2018
Brandeis University
Slides thanks to David Blei
Markov Models
Set of states: {s_1, s_2, ..., s_N}
Process moves from one state to another, generating a sequence of states: s_{i1}, s_{i2}, ..., s_{ik}, ...
Markov chain property: the probability of each subsequent state depends only on the previous state: P(s_{ik} | s_{i1}, ..., s_{ik-1}) = P(s_{ik} | s_{ik-1})
To define a Markov model, the following probabilities have to be specified: transition probabilities a_ij = P(s_i | s_j) and initial probabilities π_i = P(s_i).
Example of Markov Model
[State diagram: 'Rain' and 'Dry', with self-loops 0.3 and 0.8 and cross transitions 0.7 and 0.2]
Two states: 'Rain' and 'Dry'.
Transition probabilities: P('Rain'|'Rain')=0.3, P('Dry'|'Rain')=0.7, P('Rain'|'Dry')=0.2, P('Dry'|'Dry')=0.8
Initial probabilities: say P('Rain')=0.4, P('Dry')=0.6.
Calculation of sequence probability
By the Markov chain property, the probability of a state sequence can be found by the formula:
P(s_{i1}, s_{i2}, ..., s_{ik}) = P(s_{ik} | s_{ik-1}) P(s_{ik-1} | s_{ik-2}) ... P(s_{i2} | s_{i1}) P(s_{i1})
Suppose we want to calculate the probability of a sequence of states in our example, {'Dry','Dry','Rain','Rain'}:
P({'Dry','Dry','Rain','Rain'}) = P('Rain'|'Rain') P('Rain'|'Dry') P('Dry'|'Dry') P('Dry') = 0.3*0.2*0.8*0.6 = 0.0288
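A minimal sketch of this calculation in plain Python (the function name and dictionary layout are mine; the probabilities are the example's):

# Rain/Dry Markov chain from the example above.
initial = {'Rain': 0.4, 'Dry': 0.6}
# transition[prev][cur] = P(cur | prev)
transition = {
    'Rain': {'Rain': 0.3, 'Dry': 0.7},
    'Dry':  {'Rain': 0.2, 'Dry': 0.8},
}

def sequence_probability(states):
    """P(s_1, ..., s_k) = P(s_1) * product of P(s_k | s_{k-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(['Dry', 'Dry', 'Rain', 'Rain']))  # 0.6*0.8*0.2*0.3 = 0.0288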
Hidden Markov models
Set of states: {s_1, s_2, ..., s_N}
Process moves from one state to another, generating a sequence of states: s_{i1}, s_{i2}, ..., s_{ik}, ...
Markov chain property: the probability of each subsequent state depends only on the previous state: P(s_{ik} | s_{i1}, ..., s_{ik-1}) = P(s_{ik} | s_{ik-1})
States are not visible, but each state randomly generates one of M observations (or visible states) {v_1, v_2, ..., v_M}.
To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A=(a_ij), a_ij = P(s_i | s_j); the matrix of observation probabilities B=(b_i(v_m)), b_i(v_m) = P(v_m | s_i); and a vector of initial probabilities π=(π_i), π_i = P(s_i). The model is represented by M=(A, B, π).
Example of Hidden Markov Model
[Diagram: hidden states 'Low' and 'High' with transitions 0.3, 0.7, 0.2, 0.8, emitting observations 'Rain' and 'Dry' with probabilities 0.6 and 0.4 from 'Low' and 0.4 and 0.6 from 'High']
Example of Hidden Markov Model
Two states: 'Low' and 'High' atmospheric pressure.
Two observations: 'Rain' and 'Dry'.
Transition probabilities: P('Low'|'Low')=0.3, P('High'|'Low')=0.7, P('Low'|'High')=0.2, P('High'|'High')=0.8
Observation probabilities: P('Rain'|'Low')=0.6, P('Dry'|'Low')=0.4, P('Rain'|'High')=0.4, P('Dry'|'High')=0.6
Initial probabilities: say P('Low')=0.4, P('High')=0.6.
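Written out as data, the model above is just three tables. A minimal sketch in Python (the dictionary layout is my choice; the numbers are the slide's):

# The Low/High HMM as plain Python dictionaries.
states = ['Low', 'High']
observations = ['Rain', 'Dry']

pi = {'Low': 0.4, 'High': 0.6}               # initial probabilities
A = {'Low':  {'Low': 0.3, 'High': 0.7},      # A[prev][cur] = P(cur | prev)
     'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low':  {'Rain': 0.6, 'Dry': 0.4},      # B[state][obs] = P(obs | state)
     'High': {'Rain': 0.4, 'Dry': 0.6}}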
What is an HMM?
Graphical model: circles indicate states; arrows indicate probabilistic dependencies between states.
What is an HMM?
Green circles are hidden states, dependent only on the previous state.
"The past is independent of the future given the present."
What is an HMM?
Purple nodes are observed states, dependent only on their corresponding hidden state.
HMM Formalism
{S, K, Π, A, B}
S : {s_1 ... s_N} are the values for the hidden states
K : {k_1 ... k_M} are the values for the observations
[Diagram: a chain of hidden states S, each emitting an observation K]
HMM Formalism
{S, K, Π, A, B}
Π = {π_i} are the initial state probabilities
A = {a_ij} are the state transition probabilities
B = {b_ik} are the observation state probabilities
[Diagram: the same chain, with A labeling the state-to-state arcs and B labeling the state-to-observation arcs]
Inference in an HMM
Compute the probability of a given observation sequence.
Given an observation sequence, compute the most likely hidden state sequence.
Given an observation sequence and a set of possible models, which model most closely fits the data?
Decoding
[Trellis diagram, repeated across several slides: hidden states x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T, each emitting an observation o_1, ..., o_T]
Given an observation sequence and a model, compute the probability of the observation sequence.
Forward Procedure
Special structure gives us an efficient solution using dynamic programming.
Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences.
Define: α_t(i) = P(o_1 ... o_t, x_t = i)
Forward Procedure
Working the definition through the trellis gives the recursion:
α_1(i) = π_i b_i(o_1)
α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(o_{t+1})
P(o_1 ... o_T) = Σ_i α_T(i)
Backward Procedure
Probability of the rest of the observations given the current state:
β_t(i) = P(o_{t+1} ... o_T | x_t = i)
β_T(i) = 1
β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
Decoding Solution
Forward procedure: P(O) = Σ_i α_T(i)
Backward procedure: P(O) = Σ_i π_i b_i(o_1) β_1(i)
Combination: P(O) = Σ_i α_t(i) β_t(i), for any t
Best State Sequence
Find the state sequence that best explains the observations: argmax_X P(X | O)
Viterbi algorithm
Viterbi Algorithm
δ_t(j) = max_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j, o_t)
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm
Recursive computation:
δ_{t+1}(j) = max_i δ_t(i) a_ij b_j(o_{t+1})
ψ_{t+1}(j) = argmax_i δ_t(i) a_ij b_j(o_{t+1})
Viterbi Algorithm
Compute the most likely state sequence by working backwards:
x̂_T = argmax_i δ_T(i), then x̂_t = ψ_{t+1}(x̂_{t+1})
Parameter Estimation
Given an observation sequence, find the model that is most likely to produce that sequence.
No analytic method.
Given a model and observation sequence, update the model parameters to better fit the observations.
Parameter Estimation
Probability of traversing an arc from state i to state j at time t:
p_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O)
Probability of being in state i at time t: γ_t(i) = Σ_j p_t(i, j)
Parameter Estimation
Now we can compute the new estimates of the model parameters:
â_ij = Σ_t p_t(i, j) / Σ_t γ_t(i)
b̂_j(k) = Σ_{t : o_t = k} γ_t(j) / Σ_t γ_t(j)
HMM Applications
Generating parameters for n-gram models
Tagging speech
Speech recognition
The Most Important Thing
We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.
Calculation of observation sequence probability
Suppose we want to calculate the probability of a sequence of observations in our example, {'Dry','Rain'}.
Consider all possible hidden state sequences:
P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'}) + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'})
where the first term is:
P({'Dry','Rain'}, {'Low','Low'}) = P({'Dry','Rain'} | {'Low','Low'}) P({'Low','Low'})
= P('Dry'|'Low') P('Rain'|'Low') P('Low') P('Low'|'Low')
= 0.4*0.6*0.4*0.3 = 0.0288
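The full sum is easy to verify by brute force. A sketch in Python that enumerates all N^K hidden state sequences for the Low/High model (same tables as the earlier sketch):

from itertools import product

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def observation_probability(obs):
    total = 0.0
    for seq in product(pi, repeat=len(obs)):        # all N^K hidden sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= A[seq[k - 1]][seq[k]] * B[seq[k]][obs[k]]
        total += p
    return total

print(observation_probability(['Dry', 'Rain']))     # 0.0288+0.0448+0.0432+0.1152 = 0.232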
Main issues using HMMs:
Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O.
Learning problem. Given some training observation sequences O=o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B, π) that best fit the training data.
O=o_1 ... o_K denotes a sequence of observations o_k ∈ {v_1, ..., v_M}.
Word recognition example (1)
Typed word recognition; assume all characters are separated.
[Image: the typed word "Amherst"]
The character recognizer outputs the probability of the image being a particular character, P(image | character):
[Table: for an image of 'A', e.g. P(image|'a')=0.5, P(image|'b')=0.03, P(image|'c')=0.005, ..., P(image|'z')=0.31]
Hidden state (character) → Observation (image)
Word recognition example (2)
Hidden states of the HMM = characters.
Observations = typed images of characters segmented from the image. Note that there is an infinite number of observations.
Observation probabilities = character recognizer scores.
Transition probabilities will be defined differently in the two subsequent models.
Word recognition example (3)
If a lexicon is given, we can construct separate HMM models for each lexicon word.
[Diagram: left-to-right HMMs spelling out "Amherst" (a-m-h-e-r-s-t) and "Buffalo" (b-u-f-f-a-l-o), with recognizer scores such as 0.5, 0.03, 0.4, 0.6 attached to the character images]
Here recognition of the word image is equivalent to the problem of evaluating a few HMM models.
This is an application of the Evaluation problem.
Word recognition example (4)
We can construct a single HMM for all words.
Hidden states = all characters in the alphabet.
Transition probabilities and initial probabilities are calculated from a language model.
Observations and observation probabilities are as before.
[Diagram: interconnected character states a, m, h, e, r, s, t, b, v, f, o]
Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
This is an application of the Decoding problem.
Character recognition with HMM example
[Image: the character 'A' divided into vertical slices]
The structure of hidden states is chosen.
Observations are feature vectors extracted from vertical slices.
Probabilistic mapping from hidden state to feature vectors:
1. use a mixture of Gaussian models
2. quantize the feature vector space
Exercise: character recognition with HMM (1)
The structure of hidden states: s_1 → s_2 → s_3 (left-to-right).
Observation = number of islands in the vertical slice.

HMM for character 'A':
Transition probabilities: {a_ij} =
  .8 .2  0
   0 .8 .2
   0  0  1
Observation probabilities: {b_jk} =
  .9 .1  0
  .1 .8 .1
  .9 .1  0

HMM for character 'B':
Transition probabilities: {a_ij} =
  .8 .2  0
   0 .8 .2
   0  0  1
Observation probabilities: {b_jk} =
  .9 .1  0
   0 .2 .8
  .6 .4  0
Exercise: character recognition with HMM (2)
Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}.
Which HMM is more likely to generate this observation sequence, the HMM for 'A' or the HMM for 'B'?
Exercise: character recognition with HMM (3)
Consider the likelihood of generating the given observation for each possible sequence of hidden states:

HMM for character 'A':
s1 s1 s2 s3: transitions .8*.2*.2, observations .9*0*.8*.9 → 0
s1 s2 s2 s3: transitions .2*.8*.2, observations .9*.1*.8*.9 → 0.0020736
s1 s2 s3 s3: transitions .2*.2*1, observations .9*.1*.1*.9 → 0.000324
Total = 0.0023976

HMM for character 'B':
s1 s1 s2 s3: transitions .8*.2*.2, observations .9*0*.2*.6 → 0
s1 s2 s2 s3: transitions .2*.8*.2, observations .9*.8*.2*.6 → 0.0027648
s1 s2 s3 s3: transitions .2*.2*1, observations .9*.8*.4*.6 → 0.006912
Total = 0.0096768
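The table can be checked by brute-force enumeration. A sketch in Python, assuming (as the table implies) that each model starts in s1 and must end in s3:

from itertools import product

A_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]     # trans[i][j] = P(s_{j+1} | s_{i+1})
A_emit  = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]  # emit[i][k] = P(k+1 islands | s_{i+1})
B_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
B_emit  = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]

def likelihood(trans, emit, obs):
    total = 0.0
    for path in product(range(3), repeat=len(obs)):
        if path[0] != 0 or path[-1] != 2:           # must start in s1 and end in s3
            continue
        p = emit[path[0]][obs[0] - 1]
        for k in range(1, len(obs)):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][obs[k] - 1]
        total += p
    return total

obs = [1, 3, 2, 1]
print(likelihood(A_trans, A_emit, obs))   # 0.0023976
print(likelihood(B_trans, B_emit, obs))   # 0.0096768 -> 'B' is more likely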
Evaluation Problem
Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
Trying to find the probability of observations O=o_1 o_2 ... o_K by considering all hidden state sequences (as was done in the example) is impractical: N^K hidden state sequences give exponential complexity. Use the forward-backward HMM algorithms for efficient calculation.
Define the forward variable α_k(i) as the joint probability of the partial observation sequence o_1 o_2 ... o_k and the hidden state at time k being s_i: α_k(i) = P(o_1 o_2 ... o_k, q_k = s_i)
Trellis representation of an HMM
[Trellis diagram: a column of states s_1, s_2, ..., s_i, ..., s_N at each time 1, ..., k, k+1, ..., K; arcs a_1j, a_2j, a_ij, a_Nj enter state s_j at time k+1; the observations o_1, ..., o_k, o_{k+1}, ..., o_K run along the bottom]
Forward recursion for HMM
Initialization: α_1(i) = P(o_1, q_1 = s_i) = π_i b_i(o_1), 1 <= i <= N.
Forward recursion:
α_{k+1}(j) = P(o_1 o_2 ... o_{k+1}, q_{k+1} = s_j)
= Σ_i P(o_1 o_2 ... o_{k+1}, q_k = s_i, q_{k+1} = s_j)
= Σ_i P(o_1 o_2 ... o_k, q_k = s_i) a_ij b_j(o_{k+1})
= [Σ_i α_k(i) a_ij] b_j(o_{k+1}), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_K = s_i) = Σ_i α_K(i)
Complexity: N^2 K operations.
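A direct transcription of this recursion as a sketch in plain Python, run on the Low/High example from earlier:

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = {i: pi[i] * B[i][obs[0]] for i in pi}
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) a_ij] * b_j(o_{k+1})
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in alpha) * B[j][o] for j in pi}
    # Termination: P(O) = sum_i alpha_K(i)
    return sum(alpha.values())

print(forward(['Dry', 'Rain']))   # 0.232, matching the brute-force sum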
Backward recursion for HMM
Define the backward variable β_k(i) as the probability of the partial observation sequence o_{k+1} o_{k+2} ... o_K given that the hidden state at time k is s_i: β_k(i) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_i)
Initialization: β_K(i) = 1, 1 <= i <= N.
Backward recursion:
β_k(j) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_j)
= Σ_i P(o_{k+1} o_{k+2} ... o_K, q_{k+1} = s_i | q_k = s_j)
= Σ_i P(o_{k+2} o_{k+3} ... o_K | q_{k+1} = s_i) a_ji b_i(o_{k+1})
= Σ_i β_{k+1}(i) a_ji b_i(o_{k+1}), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_1 = s_i)
= Σ_i P(o_1 o_2 ... o_K | q_1 = s_i) P(q_1 = s_i) = Σ_i β_1(i) b_i(o_1) π_i
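The matching sketch for the backward recursion; its termination formula gives the same P(O) as the forward pass:

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def backward(obs):
    # Initialization: beta_K(i) = 1
    beta = {i: 1.0 for i in pi}
    # Recursion: beta_k(j) = sum_i a_ji * b_i(o_{k+1}) * beta_{k+1}(i)
    for o in reversed(obs[1:]):
        beta = {j: sum(A[j][i] * B[i][o] * beta[i] for i in beta) for j in pi}
    # Termination: P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in pi)

print(backward(['Dry', 'Rain']))  # 0.232 again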
Decoding problem
Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence.
We want to find the state sequence Q = q_1 ... q_K which maximizes P(Q | o_1 o_2 ... o_K), or equivalently P(Q, o_1 o_2 ... o_K). Brute-force consideration of all paths takes exponential time. Use the efficient Viterbi algorithm instead.
Define the variable δ_k(i) as the maximum probability of producing the observation sequence o_1 o_2 ... o_k when moving along any hidden state sequence q_1 ... q_{k-1} and getting into q_k = s_i:
δ_k(i) = max P(q_1 ... q_{k-1}, q_k = s_i, o_1 o_2 ... o_k), where max is taken over all possible paths q_1 ... q_{k-1}.
Viterbi algorithm (1)
General idea: if the best path ending in q_k = s_j goes through q_{k-1} = s_i, then it should coincide with the best path ending in q_{k-1} = s_i.
[Diagram: states s_1, ..., s_i, ..., s_N at time k-1, with arcs a_1j, a_ij, a_Nj converging on state s_j at time k]
δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k)
= max_i [ a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1}) ]
To backtrack the best path, keep the info that the predecessor of s_j was s_i.
Viterbi algorithm (2)
Initialization: δ_1(i) = max P(q_1 = s_i, o_1) = π_i b_i(o_1), 1 <= i <= N.
Forward recursion:
δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k)
= max_i [ a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1}) ]
= max_i [ a_ij b_j(o_k) δ_{k-1}(i) ], 1 <= j <= N, 2 <= k <= K.
Termination: choose the best path ending at time K: max_i [δ_K(i)]. Backtrack the best path.
This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
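A sketch of the recursion with backtracking in plain Python, again on the Low/High example (psi holds the backpointers):

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def viterbi(obs):
    delta = {i: pi[i] * B[i][obs[0]] for i in pi}   # delta_1(i) = pi_i b_i(o_1)
    psi = []                                        # backpointers, one dict per step
    for o in obs[1:]:
        new_delta, back = {}, {}
        for j in pi:
            # delta_k(j) = max_i [ delta_{k-1}(i) * a_ij ] * b_j(o_k)
            best = max(delta, key=lambda i: delta[i] * A[i][j])
            new_delta[j] = delta[best] * A[best][j] * B[j][o]
            back[j] = best
        delta, psi = new_delta, psi + [back]
    # Termination: best final state, then backtrack.
    state = max(delta, key=delta.get)
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    return list(reversed(path)), max(delta.values())

print(viterbi(['Dry', 'Rain']))  # (['High', 'High'], 0.1152)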
Learning problem (1)
Learning problem. Given some training observation sequences O=o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data, that is, maximize P(O | M).
There is no algorithm producing optimal parameter values.
Use the iterative expectation-maximization algorithm to find a local maximum of P(O | M): the Baum-Welch algorithm.
Learning problem (2)
If the training data has information about the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of the parameters:
a_ij = P(s_i | s_j) = (number of transitions from state s_j to state s_i) / (number of transitions out of state s_j)
b_i(v_m) = P(v_m | s_i) = (number of times observation v_m occurs in state s_i) / (number of times in state s_i)
Baum-Welch algorithm
General idea:
a_ij = P(s_i | s_j) = (expected number of transitions from state s_j to state s_i) / (expected number of transitions out of state s_j)
b_i(v_m) = P(v_m | s_i) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i)
π_i = P(s_i) = expected frequency in state s_i at time k=1.
Baum-Welch algorithm: expectation step (1)
Define the variable ξ_k(i,j) as the probability of being in state s_i at time k and in state s_j at time k+1, given the observation sequence o_1 o_2 ... o_K:
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K)
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K)
= P(q_k = s_i, o_1 o_2 ... o_k) a_ij b_j(o_{k+1}) P(o_{k+2} ... o_K | q_{k+1} = s_j) / P(o_1 o_2 ... o_K)
= α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j) / Σ_i Σ_j α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j)
Baum-Welch algorithm: expectation step (2)
Define the variable γ_k(i) as the probability of being in state s_i at time k, given the observation sequence o_1 o_2 ... o_K:
γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K)
γ_k(i) = P(q_k = s_i, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K) = α_k(i) β_k(i) / Σ_i α_k(i) β_k(i)
Baum-Welch algorithm: expectation step (3)
We calculated ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K) and γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K).
Expected number of transitions from state s_i to state s_j = Σ_k ξ_k(i,j)
Expected number of transitions out of state s_i = Σ_k γ_k(i)
Expected number of times observation v_m occurs in state s_i = Σ_k γ_k(i), where k is such that o_k = v_m
Expected frequency in state s_i at time k=1: γ_1(i).
Baum-Welch algorithm: maximization step
a_ij = (expected number of transitions from state s_j to state s_i) / (expected number of transitions out of state s_j) = Σ_k ξ_k(i,j) / Σ_k γ_k(i)
b_i(v_m) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i) = Σ_{k : o_k = v_m} γ_k(i) / Σ_k γ_k(i)
π_i = (expected frequency in state s_i at time k=1) = γ_1(i).
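These E- and M-step formulas translate almost line for line into code. A sketch of one Baum-Welch iteration in Python with NumPy; here A is row-stochastic (A[i, j] = P(state j at k+1 | state i at k), the transpose of the slides' a_ij convention), and the starting numbers in the usage lines are arbitrary placeholders:

import numpy as np

def baum_welch_step(A, B, pi, obs):
    # A: (N, N) transitions, B: (N, M) emissions, obs: list of symbol indices.
    N, K = A.shape[0], len(obs)
    # E-step: forward and backward variables.
    alpha = np.zeros((K, N))
    beta = np.zeros((K, N))
    alpha[0] = pi * B[:, obs[0]]
    for k in range(1, K):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    beta[K - 1] = 1.0
    for k in range(K - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    p_obs = alpha[K - 1].sum()                      # P(o_1 ... o_K)
    gamma = alpha * beta / p_obs                    # gamma[k, i]
    xi = np.zeros((K - 1, N, N))                    # xi[k, i, j]
    for k in range(K - 1):
        xi[k] = alpha[k][:, None] * A * B[:, obs[k + 1]] * beta[k + 1] / p_obs
    # M-step: re-estimate the parameters from the expected counts.
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for m in range(B.shape[1]):
        new_B[:, m] = gamma[np.array(obs) == m].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, gamma[0]                   # gamma[0] is the new pi

# Placeholder two-state, two-symbol model; iterating climbs to a local maximum of P(O | M).
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
for _ in range(10):
    A, B, pi = baum_welch_step(A, B, pi, obs=[0, 1, 0, 0, 1])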
The Noisy Channel Model
Search through the space of all possible sentences.
Pick the one that is most probable given the waveform.
The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
Treat acoustic input O as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t
Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n
Noisy Channel Model (III)
Probabilistic implication: pick the highest-probability W:
Ŵ = argmax_{W ∈ L} P(W | O)
We can use Bayes' rule to rewrite this:
Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
Ŵ = argmax_{W ∈ L} P(O | W) P(W)
Noisy channel model
Ŵ = argmax_{W ∈ L} P(O | W) P(W), where P(O | W) is the likelihood and P(W) is the prior.
The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source).
Speech Architecture meets Noisy Channel
HMMs for speech
Phones are not homogeneous!
Each phone has 3 subphones
Resulting HMM word model for "six"
HMMs more formally
Markov chains: a kind of weighted finite-state automaton
Another Markov chain
Another view of Markov chains
An example with numbers
What is the probability of:
hot hot hot hot
cold hot cold hot
Hidden Markov Models
Bakis network
Ergodic (fully-connected) network
Left-to-right network
The Jason Eisner task
You are a climatologist in 2799 studying the history of global warming.
You can't find records of the weather in Baltimore for summer 2006.
But you do find Jason Eisner's diary, which records how many ice creams he ate each day.
Can we use this to figure out the weather?
Given a sequence of observations O, each observation an integer = number of ice creams eaten, figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream.
HMMs more formally
Three fundamental problems (Jack Ferguson at IDA in the 1960s):
1. Given a specific HMM, determine the likelihood of an observation sequence.
2. Given an observation sequence and an HMM, discover the best (most probable) hidden state sequence.
3. Given only an observation sequence, learn the HMM parameters (A, B matrices).
The Three Basic Problems for HMMs
Problem 1 (Evaluation): Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we choose a corresponding state sequence Q=(q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters λ=(A,B) to maximize P(O | λ)?
Problem 1: computing the observation likelihood
Given the following HMM, how likely is the sequence 3 1 3?
How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities.
But for an HMM, we don't know what the states are!
So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence.
Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e. P(3 1 3 | H H C).
Computing likelihood for one given hidden state sequence
With the state sequence fixed, the likelihood is just a product of emission probabilities: P(3 1 3 | H H C) = P(3|H) P(1|H) P(3|C)
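A tiny sketch of that product, with placeholder emission probabilities (the actual values live in the slide's figure, not in this text):

# P(ice creams | weather): placeholder numbers for illustration only.
P_ice_given_weather = {'H': {1: 0.2, 2: 0.4, 3: 0.4},
                       'C': {1: 0.5, 2: 0.4, 3: 0.1}}

obs, states = [3, 1, 3], ['H', 'H', 'C']
p = 1.0
for o, s in zip(obs, states):
    p *= P_ice_given_weather[s][o]   # P(3|H) * P(1|H) * P(3|C)
print(p)                             # 0.4 * 0.2 * 0.1 = 0.008 with these placeholders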
Computing total likelihood of 3 1 3
We would need to sum over:
hot hot cold
hot hot hot
hot cold hot
...
How many possible hidden state sequences are there for this sequence? How about in general for an HMM with N hidden states and a sequence of T observations? N^T
So we can't just do a separate computation for each hidden state sequence.
Instead: the Forward algorithm
A kind of dynamic programming algorithm: uses a table to store intermediate values.
Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do this efficiently by folding all the sequences into a single trellis.
The Forward Trellis
The forward algorithm
Each cell of the forward algorithm trellis, α_t(j), represents the probability of being in state j after seeing the first t observations, given the automaton. Each cell thus expresses the following probability: α_t(j) = P(o_1 o_2 ... o_t, q_t = j | λ)
We update each cell
The Forward Recursion
The Forward Algorithm
Decoding
Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence.
Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we choose a corresponding state sequence Q=(q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?
Decoding
One possibility: for each hidden state sequence (HHH, HHC, HCH, ...), run the forward algorithm to compute the likelihood of the observations for that sequence. Why not? There are N^T state sequences.
Instead: the Viterbi algorithm, again a dynamic programming algorithm, using a trellis similar to the Forward algorithm's.
The Viterbi trellis
Viterbi intuition: process the observation sequence left to right, filling out the trellis; each cell holds the probability of the best path ending there.
Viterbi Algorithm
Viterbi backtrace
Viterbi Recursion
Reminder: a word looks like this:
HMM for digit recognition task
The Evaluation (forward) problem for speech
The observation sequence O is a series of MFCC vectors. The hidden states W are the phones and words. For a given phone/word string W, our job is to evaluate P(O | W).
Intuition: how likely is the input to have been generated by just that word string W?

Evaluation for speech: summing over all different paths!
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v
The forward lattice for "five"
The forward trellis for "five"
Viterbi trellis for "five"
Search space with bigrams
Viterbi trellis with 2 words and uniform LM
Viterbi backtrace
Part-of-speech tagging
Parts of Speech
Perhaps starting with Aristotle in the West (384–322 BCE), the idea of having parts of speech: lexical categories, word classes, "tags", POS.
Dionysius Thrax of Alexandria (c. 100 BCE): 8 parts of speech. Still with us! But his 8 aren't exactly the ones we are taught today.
Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection
Open class (lexical) words:
  Nouns: Proper (IBM, Italy), Common (cat/cats, snow)
  Verbs (Main): see, registered
  Adjectives: old, older, oldest
  Adverbs: slowly
  Numbers: 122,312, one
  ... more
Closed class (functional) words:
  Verbs (Modals): can, had
  Prepositions: to, with
  Particles: off, up
  Determiners: the, some
  Conjunctions: and, or
  Pronouns: he, its
  Interjections: Ow, Eh
  ... more
Open vs. Closed classes
Closed:
  determiners: a, an, the
  pronouns: she, he, I
  prepositions: on, under, over, near, by, ...
  Why "closed"?
Open: Nouns, Verbs, Adjectives, Adverbs.
POS Tagging
Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
POS Tagging
Input: Plays well with others
Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
Output: Plays/VBZ well/RB with/IN others/NNS
Uses:
  MT: reordering of adjectives and nouns (say from Spanish to English)
  Text-to-speech (how do we pronounce "lead"?)
  Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
  Input to a syntactic parser

The Penn Treebank Tagset
Penn Treebank tags
POS tagging performance
How many tags are correct? (Tag accuracy)
  About 97% currently
  But the baseline is already 90%
Baseline is the performance of the stupidest possible method: tag every word with its most frequent tag; tag unknown words as nouns.
Partly easy because many words are unambiguous, and you get points for them (the, a, etc.) and for punctuation marks!
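A sketch of that baseline in Python (toy training data; function names are mine):

from collections import Counter, defaultdict

def train_baseline(tagged_sentences):            # [[(word, tag), ...], ...]
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # Most frequent tag per known word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_freq_tag):
    # Unknown words default to 'NN' (noun).
    return [(w, most_freq_tag.get(w.lower(), 'NN')) for w in words]

train = [[('the', 'DT'), ('back', 'NN'), ('door', 'NN')],
         [('win', 'VB'), ('the', 'DT'), ('voters', 'NNS'), ('back', 'RB')],
         [('on', 'IN'), ('my', 'PRP$'), ('back', 'NN')]]
model = train_baseline(train)
print(tag_baseline(['the', 'back', 'room'], model))
# [('the', 'DT'), ('back', 'NN'), ('room', 'NN')] -- 'back' gets its most frequent tag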
Deciding on the correct part of speech can be difficult even for people
Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
How difficult is POS tagging?
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech.
But they tend to be very common words. E.g., that:
  I know that he is honest = IN
  Yes, that play was nice = DT
  You can't go that far = RB
40% of the word tokens are ambiguous.
Sources of information
What are the main sources of information for POS tagging?
Knowledge of neighboring words:
  Bill  saw    that  man  yesterday
  NNP   NN     DT    NN   NN
  VB    VB(D)  IN    VB   NN
Knowledge of word probabilities: man is rarely used as a verb.
The latter proves the most useful, but the former also helps.
More and Better Features
Feature-based tagger: can do surprisingly well just looking at a word by itself:
  Word: the → DT
  Lowercased word: Importantly → importantly → RB
  Prefixes: unfathomable → un- → JJ
  Suffixes: Importantly → -ly → RB
  Capitalization: Meridian → CAP → NNP
  Word shapes: 35-year → d-x → JJ
Then build a classifier to predict the tag.
Maxent P(t|w): 93.7% overall / 82.6% unknown
Overview: POS Tagging Accuracies
Rough accuracies (overall / unknown words):
  Most freq tag:              ~90%  / ~50%
  Trigram HMM:                ~95%  / ~55%
  Maxent P(t|w):              93.7% / 82.6%
  TnT (HMM++):                96.2% / 86.0%
  MEMM tagger:                96.9% / 86.9%
  Bidirectional dependencies: 97.2% / 90.0%
  Upper bound:                ~98% (human agreement)
Most errors are on unknown words.
POS tagging as a sequence classification task
We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow
  She promised to back the bill
What is the best sequence of tags which corresponds to this sequence of observations?
Probabilistic view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 ... w_n.
How do we apply classification to sequences?
Sequence Labeling as Classification
Classify each token independently, but use as input features information about the surrounding tokens (sliding window). (Slide from Ray Mooney)
John saw the saw and decided to take it to the table.
Sliding the classifier across the sentence, one token per step, yields: John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN.
Sequence Labeling as Classification: Using Outputs as Inputs
Better input features are usually the categories of the surrounding tokens, but these are not available yet.
Can use the category of either the preceding or succeeding tokens by going forward or back and using the previous output. (Slide from Ray Mooney)
Forward Classification (Slide from Ray Mooney)
John saw the saw and decided to take it to the table.
Moving left to right, each prediction is added to the features for the next step:
NNP → NNP VBD → NNP VBD DT → NNP VBD DT NN → NNP VBD DT NN CC → NNP VBD DT NN CC VBD → NNP VBD DT NN CC VBD TO → NNP VBD DT NN CC VBD TO VB → ...
Backward Classification (Slide from Ray Mooney)
Disambiguating "to" in this case would be even easier backward.
John saw the saw and decided to take it to the table.
Moving right to left, each prediction is added to the features for the next step:
DT NN → IN DT NN → PRP IN DT NN → VB PRP IN DT NN → TO VB PRP IN DT NN → VBD TO VB PRP IN DT NN → CC VBD TO VB PRP IN DT NN → VBD CC VBD TO VB PRP IN DT NN → DT VBD CC VBD TO VB PRP IN DT NN → VBD DT VBD CC VBD TO VB PRP IN DT NN → NNP VBD DT VBD CC VBD TO VB PRP IN DT NN
The Maximum Entropy Markov Model (MEMM)
A sequence version of the logistic regression (also called maximum entropy) classifier.
Find the best series of tags: T̂ = argmax_T P(T | W) = argmax_T Π_i P(t_i | w_i, t_{i-1})
Features for the classifier at each tag
More features
MEMM computes the best tag sequence
MEMM Decoding
Simplest algorithm: greedily choose the best tag for each word, left to right.
What we use in practice: the Viterbi algorithm, a version of the same dynamic programming algorithm we used to compute minimum edit distance.
The Stanford Tagger
A bidirectional version of the MEMM called a cyclic dependency network.
Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml