Presentation Transcript

Slide1

Albert Gatt

Corpora and Statistical Methods

Lecture 8

Slide2

Markov and Hidden Markov Models: Conceptual Introduction

Part 2

Slide3

In this lecture

We focus on (Hidden) Markov Models

conceptual intro to Markov Models

relevance to NLP

Hidden Markov Models

algorithms

Slide4

Acknowledgement

Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass

Slide5

Talking about the weather

Suppose we want to predict tomorrow’s weather. The possible predictions are:

sunny

foggy

rainy

We might decide to predict tomorrow's outcome based on earlier weather

if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it had been rainy all week

how far back do we want to go to predict tomorrow's weather?

Slide6

Statistical weather model

Notation:

S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)

X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days

t is an integer standing for time

(X_1, X_2, X_3, ..., X_T) models the value of a series of random variables

each takes a value from S with a certain probability P(X = s_i)

the entire sequence tells us the weather over T days

Slide7

Statistical weather model

If we want to predict the weather for day t+1, our model might look like this:

P(X_{t+1} = s_{t+1} | X_1, ..., X_t)

E.g. P(weather tomorrow = sunny), conditional on the weather in the past t days.

Problem: the larger t gets, the more calculations we have to make.

Slide8

Markov Properties I: Limited horizon

The probability that we're in state s_i at time t+1 only depends on where we were at time t:

P(X_{t+1} = s_i | X_1, ..., X_t) = P(X_{t+1} = s_i | X_t)

Given this assumption, the probability of any sequence is just:

P(X_1, ..., X_T) = P(X_1) * P(X_2 | X_1) * P(X_3 | X_2) * ... * P(X_T | X_{T-1})

Slide9

Markov Properties II: Time invariance

The probability of being in state s_i given the previous state does not change over time:

P(X_{t+1} = s_i | X_t = s_j) = P(X_2 = s_i | X_1 = s_j), for all t

Slide10

Concrete instantiation

              Day t+1
Day t         sunny    rainy    foggy
sunny         0.8      0.05     0.15
rainy         0.2      0.6      0.2
foggy         0.2      0.3      0.5

This is essentially a transition matrix, which gives us probabilities of going from one state to the other.

We can denote state transition probabilities as a_ij (prob. of going from state i to state j).
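A minimal Python sketch of this transition matrix and of the sequence probability under the limited-horizon assumption (the names TRANSITION and sequence_probability are purely illustrative, not from the slides):

```python
# Transition probabilities a_ij from the table above: P(X_{t+1} = j | X_t = i).
TRANSITION = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

def sequence_probability(states, initial=None):
    """P(X_1, ..., X_T) under the limited-horizon assumption:
    P(X_1) * product over t > 1 of P(X_t | X_{t-1})."""
    # If no initial distribution is given, we condition on the first state.
    p = 1.0 if initial is None else initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANSITION[prev][cur]
    return p
```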

Slide11

Graphical view

Components of the model:

states (s)

transitions

transition probabilities

initial probability distribution for states

Essentially, a non-deterministic finite state automaton.

Slide12

Example continued

If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy?

By the Markov assumption:

P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny)
= P(X_{t+1} = sunny | X_t = sunny) * P(X_{t+2} = rainy | X_{t+1} = sunny)
= 0.8 * 0.05 = 0.04
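The same number can be checked with the illustrative sequence_probability sketch from the earlier slide:

```python
# sunny -> sunny (0.8), then sunny -> rainy (0.05): 0.8 * 0.05 = 0.04
print(sequence_probability(["sunny", "sunny", "rainy"]))  # ~0.04
```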

Slide13

Formal definition

A Markov Model is a triple (S, Π, A) where:

S is the set of states

Π are the probabilities of being initially in some state

A are the transition probabilities

Slide14

Hidden Markov Models

Slide15

A slight variation on the example

You’re locked in a room with no windows

You can’t observe the weather directly

You only observe whether the guy who brings you food is carrying an umbrella or not

Need a model telling you the probability of seeing the umbrella, given the weather

distinction between observations and their underlying emitting state.

Define:

O_t as an observation at time t

K = {+umbrella, -umbrella} as the possible outputs

We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at time t, given that the underlying weather state at t is s_i.

Slide16

Symbol emission probabilities

weather    Probability of umbrella
sunny      0.1
rainy      0.8
foggy      0.3

This is the hidden model, telling us the probability that O_t = k given that X_t = s_i.

We assume that each underlying state X_t = s_i emits an observation with a given probability.
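A minimal Python sketch of this emission table (P_UMBRELLA and emission_probability are illustrative names, not from the slides):

```python
# P(O_t = +umbrella | X_t = s_i), from the table above.
P_UMBRELLA = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

def emission_probability(observation, state):
    """P(O_t = k | X_t = s_i) for k in {+umbrella, -umbrella}."""
    p = P_UMBRELLA[state]
    return p if observation == "+umbrella" else 1.0 - p
```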

Slide17

Using the hidden model

The model gives: P(O_t = k | X_t = s_i)

Then, by Bayes' Rule we can compute: P(X_t = s_i | O_t = k)

This generalises easily to an entire sequence.
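A sketch of the Bayes' Rule step, reusing the illustrative emission_probability above. The prior over weather states is not given in the slides, so a uniform prior is assumed here purely for illustration:

```python
def posterior_state_given_observation(observation, prior=None):
    """P(X_t = s_i | O_t = k) via Bayes' Rule:
    P(k | s_i) * P(s_i) / sum_j P(k | s_j) * P(s_j)."""
    states = list(P_UMBRELLA)
    if prior is None:
        # Assumed uniform prior over weather states (illustration only).
        prior = {s: 1.0 / len(states) for s in states}
    joint = {s: emission_probability(observation, s) * prior[s] for s in states}
    total = sum(joint.values())
    return {s: joint[s] / total for s in states}

# e.g. which weather state is most likely if the umbrella is observed?
print(posterior_state_given_observation("+umbrella"))
```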

Slide18

HMM in graphics

Circles indicate states

Arrows indicate probabilistic dependencies between states

Slide19

HMM in graphics

Green nodes are hidden states

Each hidden state depends only on the previous state (Markov assumption)

Slide20

Why HMMs?

HMMs are a way of thinking of underlying events probabilistically generating surface events.

Example: Parts of speech

a POS is a class or set of words

we can think of language as an underlying Markov Chain of parts of speech from which actual words are generated ("emitted")

So what are our hidden states here, and what are the observations?

Slide21

HMMs in POS Tagging

ADJ

N

V

DET

Hidden layer (constructed through training)

Models the sequence of POSs in the training corpus

Slide22

HMMs in POS Tagging

ADJ: tall

N: lady

V: is

DET: the

Observations are words.

They are “emitted” by their corresponding hidden state.

The state depends on its previous state.

Slide23

Why HMMs

There are efficient algorithms to train HMMs using Expectation Maximisation

General idea:

training data is assumed to have been generated by some HMM (parameters unknown)

try and learn the unknown parameters in the data

A similar idea is used in finding the parameters of some n-gram models, especially those that use interpolation.

Slide24

Formalisation of a Hidden Markov model

Slide25

Crucial ingredients (familiar)

Underlying states: S = {s_1, ..., s_N}

Output alphabet (observations): K = {k_1, ..., k_M}

State transition probabilities: A = {a_ij}, i, j ∈ S

State sequence: X = (X_1, ..., X_{T+1}), plus a function mapping each X_t to a state s

Output sequence: O = (O_1, ..., O_T), where each o_t ∈ K

Slide26

Crucial ingredients (additional)

Initial state probabilities: Π = {π_i}, i ∈ S (tell us the initial probability of each state)

Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K (tell us the probability b of seeing observation O_t = k, given that X_t = s_i and X_{t+1} = s_j)
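A minimal Python container for these ingredients (illustrative only, following the notation above; note that B here is indexed by the state pair and the output symbol, as on this slide):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list    # S = {s_1, ..., s_N}
    alphabet: list  # K = {k_1, ..., k_M}
    A: dict         # A[i][j]    = P(X_{t+1} = s_j | X_t = s_i)
    B: dict         # B[i][j][k] = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
    pi: dict        # pi[i]      = P(X_1 = s_i)
```

In the umbrella example the emission depends only on the current state, so b_ijk would take the same value for every successor state s_j.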

Slide27

Trellis diagram of an HMM

(Figure: states s_1, s_2, s_3 with transition arcs labelled a_1,1, a_1,2, a_1,3)

Slide28

Trellis diagram of an HMM

(Figure: states s_1, s_2, s_3 with transition arcs a_1,1, a_1,2, a_1,3; observation sequence o_1, o_2, o_3 at times t_1, t_2, t_3)

Slide29

Trellis diagram of an HMM

(Figure: as above, with symbol emission probabilities b_1,1,k, b_1,2,k, b_1,3,k linking the transitions to the observations)

Slide30

The fundamental questions for HMMs

1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)?

2. Given an observation sequence O, and a model μ, which is the state sequence (X_1, ..., X_{t+1}) that best explains the observations? This is the decoding problem.

3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
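For question 1, a brute-force sketch that sums P(O, X | μ) over every possible state sequence (purely illustrative; efficient algorithms are the topic of the next lecture). For simplicity it assumes emissions depend only on the current state, as in the umbrella example:

```python
from itertools import product

def likelihood(observations, states, pi, A, emit):
    """P(O | mu): sum of P(O, X | mu) over every state sequence X.
    Exponential in the number of observations; for illustration only."""
    total = 0.0
    for seq in product(states, repeat=len(observations)):
        # P(X_1) * P(O_1 | X_1), then P(X_t | X_{t-1}) * P(O_t | X_t) for t > 1
        p = pi[seq[0]] * emit(observations[0], seq[0])
        for t in range(1, len(seq)):
            p *= A[seq[t - 1]][seq[t]] * emit(observations[t], seq[t])
        total += p
    return total
```

With the illustrative weather tables above, this could be called as likelihood(["+umbrella", "-umbrella"], list(TRANSITION), pi, TRANSITION, emission_probability), where pi is an assumed initial distribution over the three weather states.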

Slide31

Application of question 1 (ASR)

Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)?

Input of an ASR system: a continuous stream of sound waves, which is ambiguous

Need to decode it into a sequence of phones:

is the input the sequence [n iy d] or [n iy]?

which sequence is the most probable?

Slide32

Application of question 2 (POS Tagging)

Given an observation sequence O, and a model μ, which is the state sequence (X_1, ..., X_{t+1}) that best explains the observations?

this is the decoding problem

Consider a POS Tagger

Input observation sequence: I can read

need to find the most likely sequence of underlying POS tags:

e.g. is "can" a modal verb, or the noun?

how likely is it that "can" is a noun, given that the previous word is a pronoun?
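For question 2 (decoding), a matching brute-force sketch that returns the highest-probability state sequence by enumeration, reusing the argument conventions of the previous sketch (again purely illustrative; the efficient algorithms come in the next lecture):

```python
from itertools import product

def decode(observations, states, pi, A, emit):
    """arg max over state sequences X of P(O, X | mu), by enumeration."""
    best_seq, best_p = None, -1.0
    for seq in product(states, repeat=len(observations)):
        p = pi[seq[0]] * emit(observations[0], seq[0])
        for t in range(1, len(seq)):
            p *= A[seq[t - 1]][seq[t]] * emit(observations[t], seq[t])
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p
```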

Slide33

Summary

HMMs are a way of representing:

sequences of observations arising from sequences of states

states are the variables of interest, giving rise to the observations

Next up: algorithms for answering the fundamental questions about HMMs