Hidden Markov Models
COSI 114 – Computational Linguistics
James Pustejovsky
February 27, 2018
Brandeis University
Slides thanks to David Blei
Markov Models
Set of states: {s_1, s_2, ..., s_N}
Process moves from one state to another, generating a sequence of states: s_{i1}, s_{i2}, ..., s_{ik}, ...
Markov chain property: the probability of each subsequent state depends only on the previous state: P(s_{ik} | s_{i1}, ..., s_{ik-1}) = P(s_{ik} | s_{ik-1})
To define a Markov model, the following probabilities have to be specified: transition probabilities a_ij = P(s_i | s_j) and initial probabilities π_i = P(s_i).
Example of Markov Model
[State diagram: 'Rain' and 'Dry', with self-loops 0.3 and 0.8 and cross transitions 0.7 and 0.2]
Two states: 'Rain' and 'Dry'.
Transition probabilities: P('Rain'|'Rain')=0.3, P('Dry'|'Rain')=0.7, P('Rain'|'Dry')=0.2, P('Dry'|'Dry')=0.8
Initial probabilities: say P('Rain')=0.4, P('Dry')=0.6.
Calculation of sequence probability
By the Markov chain property, the probability of a state sequence can be found by the formula:
P(s_{i1}, s_{i2}, ..., s_{ik}) = P(s_{ik} | s_{ik-1}) P(s_{ik-1} | s_{ik-2}) ... P(s_{i2} | s_{i1}) P(s_{i1})
Suppose we want to calculate the probability of a sequence of states in our example, {'Dry','Dry','Rain','Rain'}:
P({'Dry','Dry','Rain','Rain'}) = P('Rain'|'Rain') P('Rain'|'Dry') P('Dry'|'Dry') P('Dry') = 0.3*0.2*0.8*0.6 = 0.0288
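A minimal sketch of this calculation in plain Python (the function name and dictionary layout are mine; the probabilities are the example's):

# Rain/Dry Markov chain from the example above.
initial = {'Rain': 0.4, 'Dry': 0.6}
# transition[prev][cur] = P(cur | prev)
transition = {
    'Rain': {'Rain': 0.3, 'Dry': 0.7},
    'Dry':  {'Rain': 0.2, 'Dry': 0.8},
}

def sequence_probability(states):
    """P(s_1, ..., s_k) = P(s_1) * product of P(s_k | s_{k-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(['Dry', 'Dry', 'Rain', 'Rain']))  # 0.6*0.8*0.2*0.3 = 0.0288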
Hidden Markov models
Set of states: {s_1, s_2, ..., s_N}
Process moves from one state to another, generating a sequence of states: s_{i1}, s_{i2}, ..., s_{ik}, ...
Markov chain property: the probability of each subsequent state depends only on the previous state: P(s_{ik} | s_{i1}, ..., s_{ik-1}) = P(s_{ik} | s_{ik-1})
States are not visible, but each state randomly generates one of M observations (or visible states) {v_1, v_2, ..., v_M}.
To define a hidden Markov model, the following probabilities have to be specified: the matrix of transition probabilities A=(a_ij), a_ij = P(s_i | s_j); the matrix of observation probabilities B=(b_i(v_m)), b_i(v_m) = P(v_m | s_i); and a vector of initial probabilities π=(π_i), π_i = P(s_i). The model is represented by M=(A, B, π).
Example of Hidden Markov Model
[Diagram: hidden states 'Low' and 'High' with transitions 0.3, 0.7, 0.2, 0.8, emitting observations 'Rain' and 'Dry' with probabilities 0.6 and 0.4 from 'Low' and 0.4 and 0.6 from 'High']
Example of Hidden Markov Model
Two states: 'Low' and 'High' atmospheric pressure.
Two observations: 'Rain' and 'Dry'.
Transition probabilities: P('Low'|'Low')=0.3, P('High'|'Low')=0.7, P('Low'|'High')=0.2, P('High'|'High')=0.8
Observation probabilities: P('Rain'|'Low')=0.6, P('Dry'|'Low')=0.4, P('Rain'|'High')=0.4, P('Dry'|'High')=0.6
Initial probabilities: say P('Low')=0.4, P('High')=0.6.
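Written out as data, the model above is just three tables. A minimal sketch in Python (the dictionary layout is my choice; the numbers are the slide's):

# The Low/High HMM as plain Python dictionaries.
states = ['Low', 'High']
observations = ['Rain', 'Dry']

pi = {'Low': 0.4, 'High': 0.6}               # initial probabilities
A = {'Low':  {'Low': 0.3, 'High': 0.7},      # A[prev][cur] = P(cur | prev)
     'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low':  {'Rain': 0.6, 'Dry': 0.4},      # B[state][obs] = P(obs | state)
     'High': {'Rain': 0.4, 'Dry': 0.6}}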
What is an HMM?
Graphical model: circles indicate states; arrows indicate probabilistic dependencies between states.
What is an HMM?
Green circles are hidden states, dependent only on the previous state.
"The past is independent of the future given the present."
What is an HMM?
Purple nodes are observed states, dependent only on their corresponding hidden state.
HMM Formalism
{S, K, Π, A, B}
S : {s_1 ... s_N} are the values for the hidden states
K : {k_1 ... k_M} are the values for the observations
[Diagram: a chain of hidden states S, each emitting an observation K]
HMM Formalism
{S, K, Π, A, B}
Π = {π_i} are the initial state probabilities
A = {a_ij} are the state transition probabilities
B = {b_ik} are the observation state probabilities
[Diagram: the same chain, with A labeling the state-to-state arcs and B labeling the state-to-observation arcs]
Inference in an HMM
Compute the probability of a given observation sequence.
Given an observation sequence, compute the most likely hidden state sequence.
Given an observation sequence and a set of possible models, which model most closely fits the data?
Decoding
[Trellis diagram, repeated across several slides: hidden states x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T, each emitting an observation o_1, ..., o_T]
Given an observation sequence and a model, compute the probability of the observation sequence.
Forward Procedure
Special structure gives us an efficient solution using dynamic programming.
Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences.
Define: α_t(i) = P(o_1 ... o_t, x_t = i)
Forward Procedure
Working the definition through the trellis gives the recursion:
α_1(i) = π_i b_i(o_1)
α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(o_{t+1})
P(o_1 ... o_T) = Σ_i α_T(i)
Backward Procedure
Probability of the rest of the observations given the current state:
β_t(i) = P(o_{t+1} ... o_T | x_t = i)
β_T(i) = 1
β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
Decoding Solution
Forward procedure: P(O) = Σ_i α_T(i)
Backward procedure: P(O) = Σ_i π_i b_i(o_1) β_1(i)
Combination: P(O) = Σ_i α_t(i) β_t(i), for any t
Best State Sequence
Find the state sequence that best explains the observations: argmax_X P(X | O)
Viterbi algorithm
Viterbi Algorithm
δ_t(j) = max_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j, o_t)
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm
Recursive computation:
δ_{t+1}(j) = max_i δ_t(i) a_ij b_j(o_{t+1})
ψ_{t+1}(j) = argmax_i δ_t(i) a_ij b_j(o_{t+1})
Viterbi Algorithm
Compute the most likely state sequence by working backwards:
x̂_T = argmax_i δ_T(i), then x̂_t = ψ_{t+1}(x̂_{t+1})
Parameter Estimation
Given an observation sequence, find the model that is most likely to produce that sequence.
No analytic method.
Given a model and observation sequence, update the model parameters to better fit the observations.
Parameter Estimation
Probability of traversing an arc from state i to state j at time t:
p_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O)
Probability of being in state i at time t: γ_t(i) = Σ_j p_t(i, j)
Parameter Estimation
Now we can compute the new estimates of the model parameters:
â_ij = Σ_t p_t(i, j) / Σ_t γ_t(i)
b̂_j(k) = Σ_{t : o_t = k} γ_t(j) / Σ_t γ_t(j)
HMM Applications
Generating parameters for n-gram models
Tagging speech
Speech recognition
The Most Important Thing
We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.
Calculation of observation sequence probability
Suppose we want to calculate the probability of a sequence of observations in our example, {'Dry','Rain'}.
Consider all possible hidden state sequences:
P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'}) + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'})
where the first term is:
P({'Dry','Rain'}, {'Low','Low'}) = P({'Dry','Rain'} | {'Low','Low'}) P({'Low','Low'})
= P('Dry'|'Low') P('Rain'|'Low') P('Low') P('Low'|'Low')
= 0.4*0.6*0.4*0.3 = 0.0288
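The full sum is easy to verify by brute force. A sketch in Python that enumerates all N^K hidden state sequences for the Low/High model (same tables as the earlier sketch):

from itertools import product

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def observation_probability(obs):
    total = 0.0
    for seq in product(pi, repeat=len(obs)):        # all N^K hidden sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= A[seq[k - 1]][seq[k]] * B[seq[k]][obs[k]]
        total += p
    return total

print(observation_probability(['Dry', 'Rain']))     # 0.0288+0.0448+0.0432+0.1152 = 0.232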
Main issues using HMMs:
Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O.
Learning problem. Given some training observation sequences O=o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B, π) that best fit the training data.
O=o_1 ... o_K denotes a sequence of observations o_k ∈ {v_1, ..., v_M}.
Word recognition example (1)
Typed word recognition; assume all characters are separated.
[Image: the typed word "Amherst"]
The character recognizer outputs the probability of the image being a particular character, P(image | character):
[Table: for an image of 'A', e.g. P(image|'a')=0.5, P(image|'b')=0.03, P(image|'c')=0.005, ..., P(image|'z')=0.31]
Hidden state (character) → Observation (image)
Word recognition example (2)
Hidden states of the HMM = characters.
Observations = typed images of characters segmented from the image. Note that there is an infinite number of observations.
Observation probabilities = character recognizer scores.
Transition probabilities will be defined differently in the two subsequent models.
Word recognition example (3)
If a lexicon is given, we can construct separate HMM models for each lexicon word.
[Diagram: left-to-right HMMs spelling out "Amherst" (a-m-h-e-r-s-t) and "Buffalo" (b-u-f-f-a-l-o), with recognizer scores such as 0.5, 0.03, 0.4, 0.6 attached to the character images]
Here recognition of the word image is equivalent to the problem of evaluating a few HMM models.
This is an application of the Evaluation problem.
Word recognition example (4)
We can construct a single HMM for all words.
Hidden states = all characters in the alphabet.
Transition probabilities and initial probabilities are calculated from a language model.
Observations and observation probabilities are as before.
[Diagram: interconnected character states a, m, h, e, r, s, t, b, v, f, o]
Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.
This is an application of the Decoding problem.
Character recognition with HMM example
[Image: the character 'A' divided into vertical slices]
The structure of hidden states is chosen.
Observations are feature vectors extracted from vertical slices.
Probabilistic mapping from hidden state to feature vectors:
1. use a mixture of Gaussian models
2. quantize the feature vector space
Exercise: character recognition with HMM (1)
The structure of hidden states: s_1 → s_2 → s_3 (left-to-right).
Observation = number of islands in the vertical slice.

HMM for character 'A':
Transition probabilities: {a_ij} =
  .8 .2  0
   0 .8 .2
   0  0  1
Observation probabilities: {b_jk} =
  .9 .1  0
  .1 .8 .1
  .9 .1  0

HMM for character 'B':
Transition probabilities: {a_ij} =
  .8 .2  0
   0 .8 .2
   0  0  1
Observation probabilities: {b_jk} =
  .9 .1  0
   0 .2 .8
  .6 .4  0
Exercise: character recognition with HMM (2)
Suppose that after character image segmentation the following sequence of island numbers in 4 slices was observed: {1, 3, 2, 1}.
Which HMM is more likely to generate this observation sequence, the HMM for 'A' or the HMM for 'B'?
Exercise: character recognition with HMM (3)
Consider the likelihood of generating the given observation for each possible sequence of hidden states:

HMM for character 'A':
s1 s1 s2 s3: transitions .8*.2*.2, observations .9*0*.8*.9 → 0
s1 s2 s2 s3: transitions .2*.8*.2, observations .9*.1*.8*.9 → 0.0020736
s1 s2 s3 s3: transitions .2*.2*1, observations .9*.1*.1*.9 → 0.000324
Total = 0.0023976

HMM for character 'B':
s1 s1 s2 s3: transitions .8*.2*.2, observations .9*0*.2*.6 → 0
s1 s2 s2 s3: transitions .2*.8*.2, observations .9*.8*.2*.6 → 0.0027648
s1 s2 s3 s3: transitions .2*.2*1, observations .9*.8*.4*.6 → 0.006912
Total = 0.0096768
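The table can be checked by brute-force enumeration. A sketch in Python, assuming (as the table implies) that each model starts in s1 and must end in s3:

from itertools import product

A_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]     # trans[i][j] = P(s_{j+1} | s_{i+1})
A_emit  = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]  # emit[i][k] = P(k+1 islands | s_{i+1})
B_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
B_emit  = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]

def likelihood(trans, emit, obs):
    total = 0.0
    for path in product(range(3), repeat=len(obs)):
        if path[0] != 0 or path[-1] != 2:           # must start in s1 and end in s3
            continue
        p = emit[path[0]][obs[0] - 1]
        for k in range(1, len(obs)):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][obs[k] - 1]
        total += p
    return total

obs = [1, 3, 2, 1]
print(likelihood(A_trans, A_emit, obs))   # 0.0023976
print(likelihood(B_trans, B_emit, obs))   # 0.0096768 -> 'B' is more likely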
Evaluation Problem
Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
Trying to find the probability of observations O=o_1 o_2 ... o_K by considering all hidden state sequences (as was done in the example) is impractical: N^K hidden state sequences give exponential complexity. Use the forward-backward HMM algorithms for efficient calculation.
Define the forward variable α_k(i) as the joint probability of the partial observation sequence o_1 o_2 ... o_k and the hidden state at time k being s_i: α_k(i) = P(o_1 o_2 ... o_k, q_k = s_i)
Trellis representation of an HMM
[Trellis diagram: a column of states s_1, s_2, ..., s_i, ..., s_N at each time 1, ..., k, k+1, ..., K; arcs a_1j, a_2j, a_ij, a_Nj enter state s_j at time k+1; the observations o_1, ..., o_k, o_{k+1}, ..., o_K run along the bottom]
Forward recursion for HMM
Initialization: α_1(i) = P(o_1, q_1 = s_i) = π_i b_i(o_1), 1 <= i <= N.
Forward recursion:
α_{k+1}(j) = P(o_1 o_2 ... o_{k+1}, q_{k+1} = s_j)
= Σ_i P(o_1 o_2 ... o_{k+1}, q_k = s_i, q_{k+1} = s_j)
= Σ_i P(o_1 o_2 ... o_k, q_k = s_i) a_ij b_j(o_{k+1})
= [Σ_i α_k(i) a_ij] b_j(o_{k+1}), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_K = s_i) = Σ_i α_K(i)
Complexity: N^2 K operations.
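A direct transcription of this recursion as a sketch in plain Python, run on the Low/High example from earlier:

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = {i: pi[i] * B[i][obs[0]] for i in pi}
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) a_ij] * b_j(o_{k+1})
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in alpha) * B[j][o] for j in pi}
    # Termination: P(O) = sum_i alpha_K(i)
    return sum(alpha.values())

print(forward(['Dry', 'Rain']))   # 0.232, matching the brute-force sum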
Backward recursion for HMM
Define the backward variable β_k(i) as the probability of the partial observation sequence o_{k+1} o_{k+2} ... o_K given that the hidden state at time k is s_i: β_k(i) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_i)
Initialization: β_K(i) = 1, 1 <= i <= N.
Backward recursion:
β_k(j) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_j)
= Σ_i P(o_{k+1} o_{k+2} ... o_K, q_{k+1} = s_i | q_k = s_j)
= Σ_i P(o_{k+2} o_{k+3} ... o_K | q_{k+1} = s_i) a_ji b_i(o_{k+1})
= Σ_i β_{k+1}(i) a_ji b_i(o_{k+1}), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_1 = s_i)
= Σ_i P(o_1 o_2 ... o_K | q_1 = s_i) P(q_1 = s_i) = Σ_i β_1(i) b_i(o_1) π_i
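The matching sketch for the backward recursion; its termination formula gives the same P(O) as the forward pass:

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def backward(obs):
    # Initialization: beta_K(i) = 1
    beta = {i: 1.0 for i in pi}
    # Recursion: beta_k(j) = sum_i a_ji * b_i(o_{k+1}) * beta_{k+1}(i)
    for o in reversed(obs[1:]):
        beta = {j: sum(A[j][i] * B[i][o] * beta[i] for i in beta) for j in pi}
    # Termination: P(O) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in pi)

print(backward(['Dry', 'Rain']))  # 0.232 again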
Decoding problem
Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence.
We want to find the state sequence Q = q_1 ... q_K which maximizes P(Q | o_1 o_2 ... o_K), or equivalently P(Q, o_1 o_2 ... o_K). Brute-force consideration of all paths takes exponential time. Use the efficient Viterbi algorithm instead.
Define the variable δ_k(i) as the maximum probability of producing the observation sequence o_1 o_2 ... o_k when moving along any hidden state sequence q_1 ... q_{k-1} and getting into q_k = s_i:
δ_k(i) = max P(q_1 ... q_{k-1}, q_k = s_i, o_1 o_2 ... o_k), where max is taken over all possible paths q_1 ... q_{k-1}.
Viterbi algorithm (1)
General idea: if the best path ending in q_k = s_j goes through q_{k-1} = s_i, then it should coincide with the best path ending in q_{k-1} = s_i.
[Diagram: states s_1, ..., s_i, ..., s_N at time k-1, with arcs a_1j, a_ij, a_Nj converging on state s_j at time k]
δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k)
= max_i [ a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1}) ]
To backtrack the best path, keep the info that the predecessor of s_j was s_i.
Viterbi algorithm (2)
Initialization: δ_1(i) = max P(q_1 = s_i, o_1) = π_i b_i(o_1), 1 <= i <= N.
Forward recursion:
δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k)
= max_i [ a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1}) ]
= max_i [ a_ij b_j(o_k) δ_{k-1}(i) ], 1 <= j <= N, 2 <= k <= K.
Termination: choose the best path ending at time K: max_i [δ_K(i)]. Backtrack the best path.
This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
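A sketch of the recursion with backtracking in plain Python, again on the Low/High example (psi holds the backpointers):

pi = {'Low': 0.4, 'High': 0.6}
A = {'Low': {'Low': 0.3, 'High': 0.7}, 'High': {'Low': 0.2, 'High': 0.8}}
B = {'Low': {'Rain': 0.6, 'Dry': 0.4}, 'High': {'Rain': 0.4, 'Dry': 0.6}}

def viterbi(obs):
    delta = {i: pi[i] * B[i][obs[0]] for i in pi}   # delta_1(i) = pi_i b_i(o_1)
    psi = []                                        # backpointers, one dict per step
    for o in obs[1:]:
        new_delta, back = {}, {}
        for j in pi:
            # delta_k(j) = max_i [ delta_{k-1}(i) * a_ij ] * b_j(o_k)
            best = max(delta, key=lambda i: delta[i] * A[i][j])
            new_delta[j] = delta[best] * A[best][j] * B[j][o]
            back[j] = best
        delta, psi = new_delta, psi + [back]
    # Termination: best final state, then backtrack.
    state = max(delta, key=delta.get)
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    return list(reversed(path)), max(delta.values())

print(viterbi(['Dry', 'Rain']))  # (['High', 'High'], 0.1152)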
Learning problem (1)
Learning problem. Given some training observation sequences O=o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data, that is, maximize P(O | M).
There is no algorithm producing optimal parameter values.
Use the iterative expectation-maximization algorithm to find a local maximum of P(O | M): the Baum-Welch algorithm.
Learning problem (2)
If the training data has information about the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of the parameters:
a_ij = P(s_i | s_j) = (number of transitions from state s_j to state s_i) / (number of transitions out of state s_j)
b_i(v_m) = P(v_m | s_i) = (number of times observation v_m occurs in state s_i) / (number of times in state s_i)
Baum-Welch algorithm
General idea:
a_ij = P(s_i | s_j) = (expected number of transitions from state s_j to state s_i) / (expected number of transitions out of state s_j)
b_i(v_m) = P(v_m | s_i) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i)
π_i = P(s_i) = expected frequency in state s_i at time k=1.
Baum-Welch algorithm: expectation step (1)
Define the variable ξ_k(i,j) as the probability of being in state s_i at time k and in state s_j at time k+1, given the observation sequence o_1 o_2 ... o_K:
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K)
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K)
= P(q_k = s_i, o_1 o_2 ... o_k) a_ij b_j(o_{k+1}) P(o_{k+2} ... o_K | q_{k+1} = s_j) / P(o_1 o_2 ... o_K)
= α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j) / Σ_i Σ_j α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j)
Baum-Welch algorithm: expectation step (2)
Define the variable γ_k(i) as the probability of being in state s_i at time k, given the observation sequence o_1 o_2 ... o_K:
γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K)
γ_k(i) = P(q_k = s_i, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K) = α_k(i) β_k(i) / Σ_i α_k(i) β_k(i)
Baum-Welch algorithm: expectation step (3)
We calculated ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K) and γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K).
Expected number of transitions from state s_i to state s_j = Σ_k ξ_k(i,j)
Expected number of transitions out of state s_i = Σ_k γ_k(i)
Expected number of times observation v_m occurs in state s_i = Σ_k γ_k(i), where k is such that o_k = v_m
Expected frequency in state s_i at time k=1: γ_1(i).
Baum-Welch algorithm: maximization step
a_ij = (expected number of transitions from state s_j to state s_i) / (expected number of transitions out of state s_j) = Σ_k ξ_k(i,j) / Σ_k γ_k(i)
b_i(v_m) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i) = Σ_{k : o_k = v_m} γ_k(i) / Σ_k γ_k(i)
π_i = (expected frequency in state s_i at time k=1) = γ_1(i).
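These E- and M-step formulas translate almost line for line into code. A sketch of one Baum-Welch iteration in Python with NumPy; here A is row-stochastic (A[i, j] = P(state j at k+1 | state i at k), the transpose of the slides' a_ij convention), and the starting numbers in the usage lines are arbitrary placeholders:

import numpy as np

def baum_welch_step(A, B, pi, obs):
    # A: (N, N) transitions, B: (N, M) emissions, obs: list of symbol indices.
    N, K = A.shape[0], len(obs)
    # E-step: forward and backward variables.
    alpha = np.zeros((K, N))
    beta = np.zeros((K, N))
    alpha[0] = pi * B[:, obs[0]]
    for k in range(1, K):
        alpha[k] = (alpha[k - 1] @ A) * B[:, obs[k]]
    beta[K - 1] = 1.0
    for k in range(K - 2, -1, -1):
        beta[k] = A @ (B[:, obs[k + 1]] * beta[k + 1])
    p_obs = alpha[K - 1].sum()                      # P(o_1 ... o_K)
    gamma = alpha * beta / p_obs                    # gamma[k, i]
    xi = np.zeros((K - 1, N, N))                    # xi[k, i, j]
    for k in range(K - 1):
        xi[k] = alpha[k][:, None] * A * B[:, obs[k + 1]] * beta[k + 1] / p_obs
    # M-step: re-estimate the parameters from the expected counts.
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for m in range(B.shape[1]):
        new_B[:, m] = gamma[np.array(obs) == m].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, gamma[0]                   # gamma[0] is the new pi

# Placeholder two-state, two-symbol model; iterating climbs to a local maximum of P(O | M).
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
for _ in range(10):
    A, B, pi = baum_welch_step(A, B, pi, obs=[0, 1, 0, 0, 1])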
The Noisy Channel Model
Search through the space of all possible sentences.
Pick the one that is most probable given the waveform.
The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
Treat acoustic input O as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t
Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n
Noisy Channel Model (III)
Probabilistic implication: pick the highest-probability W:
Ŵ = argmax_{W ∈ L} P(W | O)
We can use Bayes' rule to rewrite this:
Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
Ŵ = argmax_{W ∈ L} P(O | W) P(W)
Noisy channel model
Ŵ = argmax_{W ∈ L} P(O | W) P(W), where P(O | W) is the likelihood and P(W) is the prior.
The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source).
Speech Architecture meets Noisy Channel
HMMs for speech
Phones are not homogeneous!
Each phone has 3 subphones
Resulting HMM word model for "six"
HMMs more formally
Markov chains: a kind of weighted finite-state automaton
Another Markov chain
Another view of Markov chains
An example with numbers
What is the probability of:
hot hot hot hot
cold hot cold hot
Hidden Markov Models
Bakis network
Ergodic (fully-connected) network
Left-to-right network
The Jason Eisner task
You are a climatologist in 2799 studying the history of global warming.
You can't find records of the weather in Baltimore for summer 2006.
But you do find Jason Eisner's diary, which records how many ice creams he ate each day.
Can we use this to figure out the weather?
Given a sequence of observations O, each observation an integer = number of ice creams eaten, figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream.
HMMs more formally
Three fundamental problems (Jack Ferguson at IDA in the 1960s):
1. Given a specific HMM, determine the likelihood of an observation sequence.
2. Given an observation sequence and an HMM, discover the best (most probable) hidden state sequence.
3. Given only an observation sequence, learn the HMM parameters (A, B matrices).
The Three Basic Problems for HMMs
Problem 1 (Evaluation): Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we choose a corresponding state sequence Q=(q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters λ=(A,B) to maximize P(O | λ)?
Problem 1: computing the observation likelihood
Given the following HMM, how likely is the sequence 3 1 3?
How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities.
But for an HMM, we don't know what the states are!
So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence.
Suppose we knew the weather and wanted to predict how much ice cream Jason would eat, i.e. P(3 1 3 | H H C).
Computing likelihood for one given hidden state sequence
With the state sequence fixed, the likelihood is just a product of emission probabilities: P(3 1 3 | H H C) = P(3|H) P(1|H) P(3|C)
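A tiny sketch of that product, with placeholder emission probabilities (the actual values live in the slide's figure, not in this text):

# P(ice creams | weather): placeholder numbers for illustration only.
P_ice_given_weather = {'H': {1: 0.2, 2: 0.4, 3: 0.4},
                       'C': {1: 0.5, 2: 0.4, 3: 0.1}}

obs, states = [3, 1, 3], ['H', 'H', 'C']
p = 1.0
for o, s in zip(obs, states):
    p *= P_ice_given_weather[s][o]   # P(3|H) * P(1|H) * P(3|C)
print(p)                             # 0.4 * 0.2 * 0.1 = 0.008 with these placeholders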
Computing total likelihood of 3 1 3
We would need to sum over:
hot hot cold
hot hot hot
hot cold hot
...
How many possible hidden state sequences are there for this sequence? How about in general for an HMM with N hidden states and a sequence of T observations? N^T
So we can't just do a separate computation for each hidden state sequence.
Instead: the Forward algorithm
A kind of dynamic programming algorithm: uses a table to store intermediate values.
Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, but do this efficiently by folding all the sequences into a single trellis.
The Forward Trellis
The forward algorithm
Each cell of the forward algorithm trellis, α_t(j), represents the probability of being in state j after seeing the first t observations, given the automaton. Each cell thus expresses the following probability: α_t(j) = P(o_1 o_2 ... o_t, q_t = j | λ)
We update each cell
The Forward Recursion
The Forward Algorithm
Decoding
Given an observation sequence (3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence.
Given the observation sequence O=(o_1 o_2 ... o_T) and an HMM model λ=(A,B), how do we choose a corresponding state sequence Q=(q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)?
Decoding
One possibility: for each hidden state sequence (HHH, HHC, HCH, ...), run the forward algorithm to compute the likelihood of the observations for that sequence. Why not? There are N^T state sequences.
Instead: the Viterbi algorithm, again a dynamic programming algorithm, using a trellis similar to the Forward algorithm's.
The Viterbi trellis
Viterbi intuition: process the observation sequence left to right, filling out the trellis; each cell holds the probability of the best path ending there.
Viterbi Algorithm
Viterbi backtrace
Viterbi Recursion
Reminder: a word looks like this:
HMM for digit recognition task
The Evaluation (forward) problem for speech
The observation sequence O is a series of MFCC vectors. The hidden states W are the phones and words. For a given phone/word string W, our job is to evaluate P(O | W).
Intuition: how likely is the input to have been generated by just that word string W?

Evaluation for speech: summing over all different paths!
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v
The forward lattice for "five"
The forward trellis for "five"
Viterbi trellis for "five"
Search space with bigrams
Viterbi trellis with 2 words and uniform LM
Viterbi backtrace
Part-of-speech tagging
Parts of Speech
Perhaps starting with Aristotle in the West (384–322 BCE), the idea of having parts of speech: lexical categories, word classes, "tags", POS.
Dionysius Thrax of Alexandria (c. 100 BCE): 8 parts of speech. Still with us! But his 8 aren't exactly the ones we are taught today.
Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection
Open class (lexical) words:
  Nouns: Proper (IBM, Italy), Common (cat/cats, snow)
  Verbs (Main): see, registered
  Adjectives: old, older, oldest
  Adverbs: slowly
  Numbers: 122,312, one
  ... more
Closed class (functional) words:
  Verbs (Modals): can, had
  Prepositions: to, with
  Particles: off, up
  Determiners: the, some
  Conjunctions: and, or
  Pronouns: he, its
  Interjections: Ow, Eh
  ... more
Open vs. Closed classes
Closed:
  determiners: a, an, the
  pronouns: she, he, I
  prepositions: on, under, over, near, by, ...
  Why "closed"?
Open: Nouns, Verbs, Adjectives, Adverbs.
POS Tagging
Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
POS Tagging
Input: Plays well with others
Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
Output: Plays/VBZ well/RB with/IN others/NNS
Uses:
  MT: reordering of adjectives and nouns (say from Spanish to English)
  Text-to-speech (how do we pronounce "lead"?)
  Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
  Input to a syntactic parser

The Penn Treebank Tagset
Penn Treebank tags
POS tagging performance
How many tags are correct? (Tag accuracy)
  About 97% currently
  But the baseline is already 90%
Baseline is the performance of the stupidest possible method: tag every word with its most frequent tag; tag unknown words as nouns.
Partly easy because many words are unambiguous, and you get points for them (the, a, etc.) and for punctuation marks!
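A sketch of that baseline in Python (toy training data; function names are mine):

from collections import Counter, defaultdict

def train_baseline(tagged_sentences):            # [[(word, tag), ...], ...]
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # Most frequent tag per known word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_freq_tag):
    # Unknown words default to 'NN' (noun).
    return [(w, most_freq_tag.get(w.lower(), 'NN')) for w in words]

train = [[('the', 'DT'), ('back', 'NN'), ('door', 'NN')],
         [('win', 'VB'), ('the', 'DT'), ('voters', 'NNS'), ('back', 'RB')],
         [('on', 'IN'), ('my', 'PRP$'), ('back', 'NN')]]
model = train_baseline(train)
print(tag_baseline(['the', 'back', 'room'], model))
# [('the', 'DT'), ('back', 'NN'), ('room', 'NN')] -- 'back' gets its most frequent tag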
Deciding on the correct part of speech can be difficult even for people
Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
How difficult is POS tagging?
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech.
But they tend to be very common words. E.g., that:
  I know that he is honest = IN
  Yes, that play was nice = DT
  You can't go that far = RB
40% of the word tokens are ambiguous.
Sources of information
What are the main sources of information for POS tagging?
Knowledge of neighboring words:
  Bill  saw    that  man  yesterday
  NNP   NN     DT    NN   NN
  VB    VB(D)  IN    VB   NN
Knowledge of word probabilities: man is rarely used as a verb.
The latter proves the most useful, but the former also helps.
More and Better Features
Feature-based tagger: can do surprisingly well just looking at a word by itself:
  Word: the → DT
  Lowercased word: Importantly → importantly → RB
  Prefixes: unfathomable → un- → JJ
  Suffixes: Importantly → -ly → RB
  Capitalization: Meridian → CAP → NNP
  Word shapes: 35-year → d-x → JJ
Then build a classifier to predict the tag.
Maxent P(t|w): 93.7% overall / 82.6% unknown
Overview: POS Tagging Accuracies
Rough accuracies (overall / unknown words):
  Most freq tag:              ~90%  / ~50%
  Trigram HMM:                ~95%  / ~55%
  Maxent P(t|w):              93.7% / 82.6%
  TnT (HMM++):                96.2% / 86.0%
  MEMM tagger:                96.9% / 86.9%
  Bidirectional dependencies: 97.2% / 90.0%
  Upper bound:                ~98% (human agreement)
Most errors are on unknown words.
POS tagging as a sequence classification task
We are given a sentence (an "observation" or "sequence of observations"):
  Secretariat is expected to race tomorrow
  She promised to back the bill
What is the best sequence of tags which corresponds to this sequence of observations?
Probabilistic view: consider all possible sequences of tags; out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 ... w_n.
How do we apply classification to sequences?
Sequence Labeling as Classification
Classify each token independently, but use as input features information about the surrounding tokens (sliding window). (Slide from Ray Mooney)
John saw the saw and decided to take it to the table.
Sliding the classifier across the sentence, one token per step, yields: John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN.
Sequence Labeling as Classification: Using Outputs as Inputs
Better input features are usually the categories of the surrounding tokens, but these are not available yet.
Can use the category of either the preceding or succeeding tokens by going forward or back and using the previous output. (Slide from Ray Mooney)
Forward Classification (Slide from Ray Mooney)
John saw the saw and decided to take it to the table.
Moving left to right, each prediction is added to the features for the next step:
NNP → NNP VBD → NNP VBD DT → NNP VBD DT NN → NNP VBD DT NN CC → NNP VBD DT NN CC VBD → NNP VBD DT NN CC VBD TO → NNP VBD DT NN CC VBD TO VB → ...
Backward Classification (Slide from Ray Mooney)
Disambiguating "to" in this case would be even easier backward.
John saw the saw and decided to take it to the table.
Moving right to left, each prediction is added to the features for the next step:
DT NN → IN DT NN → PRP IN DT NN → VB PRP IN DT NN → TO VB PRP IN DT NN → VBD TO VB PRP IN DT NN → CC VBD TO VB PRP IN DT NN → VBD CC VBD TO VB PRP IN DT NN → DT VBD CC VBD TO VB PRP IN DT NN → VBD DT VBD CC VBD TO VB PRP IN DT NN → NNP VBD DT VBD CC VBD TO VB PRP IN DT NN
The Maximum Entropy Markov Model (MEMM)
A sequence version of the logistic regression (also called maximum entropy) classifier.
Find the best series of tags: T̂ = argmax_T P(T | W) = argmax_T Π_i P(t_i | w_i, t_{i-1})
Features for the classifier at each tag
More features
MEMM computes the best tag sequence
MEMM Decoding
Simplest algorithm: greedily choose the best tag for each word, left to right.
What we use in practice: the Viterbi algorithm, a version of the same dynamic programming algorithm we used to compute minimum edit distance.
The Stanford Tagger
A bidirectional version of the MEMM called a cyclic dependency network.
Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml