Albert Gatt
Corpora and Statistical Methods
Lecture 8
Markov and Hidden Markov Models: Conceptual Introduction
Part 2
In this lecture
We focus on (Hidden) Markov Models
conceptual intro to Markov Models
relevance to NLP
Hidden Markov Models
algorithms
Acknowledgement
Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass
Talking about the weather
Suppose we want to predict tomorrow’s weather. The possible predictions are:
sunny
foggy
rainy
We might decide to predict tomorrow's outcome based on earlier weather:
if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week
how far back do we want to go to predict tomorrow's weather?
Statistical weather model
Notation:
S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)
X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days
t is an integer standing for time
(X_1, X_2, X_3, ..., X_T) models the value of a series of random variables
each takes a value from S with a certain probability P(X_t = s_i)
the entire sequence tells us the weather over T days
Statistical weather model
If we want to predict the weather for day t+1, our model conditions on the whole history so far:
P(X_{t+1} = s_{t+1} | X_1, ..., X_t)
E.g. P(weather tomorrow = sunny), conditional on the weather in the past t days.
Problem: the larger t gets, the more calculations we have to make.
Markov Properties I: Limited horizon
The probability that we're in state s_i at time t+1 only depends on where we were at time t:
P(X_{t+1} = s_i | X_1, ..., X_t) = P(X_{t+1} = s_i | X_t)
Given this assumption, the probability of any sequence is just:
P(X_1, ..., X_T) = P(X_1) × P(X_2 | X_1) × ... × P(X_T | X_{T-1})
Markov Properties II: Time invariance
The probability of being in state s_j given the previous state s_i does not change over time:
P(X_{t+1} = s_j | X_t = s_i) = P(X_2 = s_j | X_1 = s_i), for all t
Concrete instantiation
Transition probabilities from day t (rows) to day t+1 (columns):

Day t \ Day t+1   sunny   rainy   foggy
sunny             0.8     0.05    0.15
rainy             0.2     0.6     0.2
foggy             0.2     0.3     0.5

This is essentially a transition matrix, which gives us the probabilities of going from one state to the other.
We can denote state transition probabilities as a_ij (the probability of going from state i to state j)
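The table maps directly onto a nested dictionary, from which we can also simulate weather sequences. A minimal sketch in Python (function names like sample_next are illustrative, not from the slides):

```python
import random

# Transition probabilities a_ij from the table above:
# outer key = today's state i, inner key = tomorrow's state j.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}

def sample_next(state):
    """Sample tomorrow's weather given today's state."""
    states, probs = zip(*A[state].items())
    return random.choices(states, weights=probs, k=1)[0]

def sample_sequence(start, days):
    """Simulate a weather sequence of the given length,
    one Markov step at a time."""
    seq = [start]
    for _ in range(days - 1):
        seq.append(sample_next(seq[-1]))
    return seq
```

Because the process is a Markov chain, each call to sample_next needs only the current state, never the full history.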
Graphical view
Components of the model:
states (s)
transitions
transition probabilities
initial probability distribution for states
Essentially, a non-deterministic finite state automaton.
Example continued
If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy?
By the Markov assumption:
P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny)
= P(X_{t+1} = sunny | X_t = sunny) × P(X_{t+2} = rainy | X_{t+1} = sunny)
= 0.8 × 0.05 = 0.04
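The same calculation can be sketched in code, multiplying one-step transition probabilities (the dictionary repeats the table from the earlier slide; sequence_probability is an illustrative name):

```python
# Transition probabilities from the weather table.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}

def sequence_probability(states):
    """Probability of a weather sequence given its first state:
    by the Markov assumption, a product of one-step transitions."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

# P(sunny tomorrow, rainy the day after | sunny today)
print(sequence_probability(["sunny", "sunny", "rainy"]))  # ≈ 0.04
```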
Formal definition
A Markov Model is a triple (S, Π, A) where:
S is the set of states
Π are the probabilities of being initially in some state
A are the transition probabilities
Hidden Markov Models
A slight variation on the example
You’re locked in a room with no windows
You can’t observe the weather directly
You only observe whether the guy who brings you food is carrying an umbrella or not
We need a model telling us the probability of seeing the umbrella, given the weather
a distinction between observations and their underlying emitting state.
Define:
O_t as an observation at time t
K = {+umbrella, -umbrella} as the possible outputs
We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at t given that the underlying weather state at t is s_i
Symbol emission probabilities
weather   probability of umbrella
sunny     0.1
rainy     0.8
foggy     0.3

This is the hidden model, telling us the probability that O_t = k given that X_t = s_i
We assume that each underlying state X_t = s_i emits an observation with a given probability.
Using the hidden model
The model gives: P(O_t = k | X_t = s_i)
Then, by Bayes' Rule we can compute:
P(X_t = s_i | O_t = k) = P(O_t = k | X_t = s_i) P(X_t = s_i) / P(O_t = k)
This generalises easily to an entire sequence.
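A sketch of the Bayes inversion with the umbrella numbers. The slides don't give a prior P(X_t = s_i), so the uniform prior below is an assumption made purely for illustration:

```python
# Emission probabilities P(+umbrella | weather) from the table above.
B = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Assumption: uniform prior over weather states (not given in the slides).
prior = {s: 1 / 3 for s in B}

def posterior(umbrella=True):
    """P(X_t = s_i | O_t) via Bayes' rule."""
    # Likelihood of the observation under each state.
    like = {s: (B[s] if umbrella else 1 - B[s]) for s in B}
    # Evidence P(O_t): marginalise over states.
    evidence = sum(like[s] * prior[s] for s in B)
    return {s: like[s] * prior[s] / evidence for s in B}

print(posterior(True))  # rainy comes out as by far the most probable state
```

Under these assumptions, seeing the umbrella makes rain the most likely hidden state (posterior 2/3), even though each state was equally likely a priori.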
HMM in graphics
Circles indicate states
Arrows indicate probabilistic dependencies between states
HMM in graphics
Green nodes are hidden states
Each hidden state depends only on the previous state (Markov assumption)
Why HMMs?
HMMs are a way of thinking of underlying events probabilistically generating surface events.
Example: parts of speech
a POS is a class or set of words
we can think of language as an underlying Markov Chain of parts of speech from which actual words are generated ("emitted")
So what are our hidden states here, and what are the observations?
HMMs in POS Tagging
(Diagram: hidden states ADJ, N, V, DET linked by transitions.)
Hidden layer (constructed through training)
Models the sequence of POSs in the training corpus
HMMs in POS Tagging
(Diagram: hidden states ADJ, N, V, DET emitting the words tall, lady, is, the.)
Observations are words.
They are "emitted" by their corresponding hidden state.
The state depends on its previous state.
Why HMMs
There are efficient algorithms to train HMMs using Expectation Maximisation
General idea:
training data is assumed to have been generated by some HMM (parameters unknown)
try and learn the unknown parameters in the data
A similar idea is used in finding the parameters of some n-gram models, especially those that use interpolation.
Formalisation of a Hidden Markov model
Crucial ingredients (familiar)
Underlying states: S = {s_1, ..., s_N}
Output alphabet (observations): K = {k_1, ..., k_M}
State transition probabilities: A = {a_ij}, i, j ∈ S
State sequence: X = (X_1, ..., X_{T+1}), plus a function mapping each X_t to a state s
Output sequence: O = (O_1, ..., O_T), where each o_t ∈ K
Crucial ingredients (additional)
Initial state probabilities: Π = {π_i}, i ∈ S (tell us the initial probability of each state)
Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K (tell us the probability b of seeing observation O_t = k, given that X_t = s_i and X_{t+1} = s_j)
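These ingredients can be collected into plain Python structures for the weather/umbrella example. Two assumptions to flag: the slides define arc emissions b_ijk, but the earlier tables are state emissions (depending on X_t only), so that simpler form is used here; and the uniform initial distribution Π is mine, since the slides don't give one.

```python
S = ["sunny", "rainy", "foggy"]        # underlying states
K = ["+umbrella", "-umbrella"]         # output alphabet

# Assumption: uniform initial distribution (not given in the slides).
Pi = {s: 1 / 3 for s in S}

# State transition probabilities a_ij (from the weather table).
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5}}

# State emission probabilities b_ik (simplification of the slides' b_ijk).
B = {"sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
     "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
     "foggy": {"+umbrella": 0.3, "-umbrella": 0.7}}

# Sanity checks: every probability distribution must sum to 1.
assert abs(sum(Pi.values()) - 1) < 1e-9
for s in S:
    assert abs(sum(A[s].values()) - 1) < 1e-9
    assert abs(sum(B[s].values()) - 1) < 1e-9
```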
Trellis diagram of an HMM
(Figure: a trellis with states s_1, s_2, s_3 and transition arcs a_{1,1}, a_{1,2}, a_{1,3} out of s_1.)
Trellis diagram of an HMM
(Figure: the same trellis, now with the observation sequence o_1, o_2, o_3 aligned below the states at times t_1, t_2, t_3.)
Trellis diagram of an HMM
(Figure: the same trellis, with emission probabilities b_{1,1,k}, b_{1,2,k}, b_{1,3,k} labelling the arcs along which the observations are emitted.)
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O | μ)?
2. Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations? This is the decoding problem.
3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?
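For question 1, the definition can be turned directly into a brute-force computation: sum P(X) · P(O | X) over every possible state sequence X. This naive version, exponential in the sequence length, is only a sketch to make the quantity concrete; the efficient forward algorithm is the subject of the next lecture. The weather/umbrella numbers, the uniform Π, and the use of state emissions (rather than the slides' arc emissions b_ijk) are assumptions carried over from the earlier examples.

```python
from itertools import product

S = ["sunny", "rainy", "foggy"]
Pi = {s: 1 / 3 for s in S}             # assumed uniform initial distribution
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5}}
B = {"sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
     "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
     "foggy": {"+umbrella": 0.3, "-umbrella": 0.7}}

def likelihood(obs):
    """P(O | mu): marginalise P(X, O) over all state sequences X."""
    total = 0.0
    for X in product(S, repeat=len(obs)):
        p = Pi[X[0]] * B[X[0]][obs[0]]              # start, emit first symbol
        for t in range(1, len(obs)):
            p *= A[X[t - 1]][X[t]] * B[X[t]][obs[t]]  # transition, then emit
        total += p
    return total

print(likelihood(["+umbrella", "-umbrella"]))
```

The loop visits |S|^T sequences, which is exactly the blow-up the forward algorithm's dynamic programming avoids.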
Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)?
The input of an ASR system is a continuous stream of sound waves, which is ambiguous.
We need to decode it into a sequence of phones:
is the input the sequence [n iy d] or [n iy]?
which sequence is the most probable?
Application of question 2 (POS Tagging)
Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations?
This is the decoding problem.
Consider a POS tagger:
Input observation sequence: I can read
We need to find the most likely sequence of underlying POS tags:
e.g. is can a modal verb, or the noun?
how likely is it that can is a noun, given that the previous word is a pronoun?
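The POS probabilities aren't given in the slides, so here is a decoding sketch using the earlier weather/umbrella model instead: enumerate every state sequence and keep the one maximising the joint probability P(X, O). This brute force is exponential in the sequence length; the Viterbi algorithm covered in the next lecture does the same job efficiently. As before, the uniform Π and the state-emission simplification are assumptions.

```python
from itertools import product

S = ["sunny", "rainy", "foggy"]
Pi = {s: 1 / 3 for s in S}             # assumed uniform initial distribution
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5}}
B = {"sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
     "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
     "foggy": {"+umbrella": 0.3, "-umbrella": 0.7}}

def joint(X, obs):
    """P(X, O): probability of a state sequence together with the observations."""
    p = Pi[X[0]] * B[X[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][obs[t]]
    return p

def decode(obs):
    """Brute-force decoding: argmax of P(X, O) over all |S|^T sequences."""
    return max(product(S, repeat=len(obs)), key=lambda X: joint(X, obs))

# Two umbrella days followed by none: the model blames rain, then a sunny day.
print(decode(["+umbrella", "+umbrella", "-umbrella"]))  # ('rainy', 'rainy', 'sunny')
```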
Summary
HMMs are a way of representing:
sequences of observations arising from
sequences of states
states are the variables of interest, giving rise to the observations
Next up:
algorithms for answering the fundamental questions about HMMs