Slide 1: A Revealing Introduction to Hidden Markov Models
Mark Stamp
Slide 2: Hidden Markov Models
- What is a hidden Markov model (HMM)?
  - A machine learning technique
  - A discrete hill climb technique
- Where are HMMs used?
  - Speech recognition
  - Malware detection, IDS, etc., etc.
- Why are they useful?
  - Efficient algorithms
Slide 3: Markov Chain
- A Markov chain is a "memoryless random process"
- Transitions depend only on the current state and the transition probability matrix
- Example on next slide...
Slide 4: Markov Chain
- We are interested in average annual temperature
- Only consider Hot (H) and Cold (C)
- From recorded history, we obtain the transition probabilities (state diagram omitted):
  H -> H: 0.7, H -> C: 0.3, C -> H: 0.4, C -> C: 0.6
Slide 5: Markov Chain
- Transition probability matrix:
      A = [ 0.7  0.3 ]
          [ 0.4  0.6 ]
  (rows and columns ordered H, C)
- The matrix is denoted A
- Note that A is "row stochastic": each row sums to 1
Markov Chain
Can also include begin
, end statesBegin state matrix is
π
In this example,
Note that
π
is row stochastic
HMM
6
H
C
0.7
0.6
0.3
0.4
begin
end
0.6
0.4Slide7
Slide 7: Hidden Markov Model
- An HMM includes a Markov chain, but this Markov process is "hidden"
- We cannot observe the Markov process directly
- Instead, we observe something related to the hidden states
- It's as if there is a "curtain" between the Markov chain and the observations
- Example on next slide
Slide 8: HMM Example
- Consider the H/C temperature example
- Suppose we want to know whether the temperature was H or C in the distant past
  - Before humans (or thermometers) existed
  - OK if we can just decide Hot versus Cold
- We assume transitions between Hot and Cold years are the same as today
  - That is, the A matrix is the same as today
Slide 9: HMM Example
- Temperature in the past was determined by a Markov process
- But we cannot observe the temperature in the past
- Instead, we note that tree ring size is related to temperature
  - Look at historical data to see the connection
- We consider 3 tree ring sizes: Small, Medium, Large (S, M, L, respectively)
- Measure tree ring sizes and recorded temperatures to determine the relationship
Slide 10: HMM Example
- We find that tree ring sizes and temperature are related by the B matrix (values as in the paper version of this example):
      B = [ 0.1  0.4  0.5 ]
          [ 0.7  0.2  0.1 ]
  (rows ordered H, C; columns ordered S, M, L)
- Note that B is also row stochastic
Slide 11: HMM Example
- Can we now find temperatures in the distant past?
- We cannot measure (observe) temperature
- But we can measure tree ring sizes...
- ...and tree ring sizes are related to temperature by the B matrix
- So, we ought to be able to say something about temperature
Slide 12: HMM Notation
- A lot of notation is required
- Notation may be the most difficult part
Slide 13: HMM Notation
- To simplify notation, observations are taken from the set {0, 1, ..., M-1}
- The matrix A = {a_ij} is N x N, where a_ij = P(state q_j at t+1 | state q_i at t)
- The matrix B = {b_j(k)} is N x M, where b_j(k) = P(observation k at t | state q_j at t)
Slide 14: HMM Example
- Consider our temperature example...
- What are the observations? V = {0, 1, 2}, which corresponds to S, M, L
- What are the states of the Markov process? Q = {H, C}
- What are A, B, π, and T?
  - A, B, π are on the previous slides
  - T is the number of tree rings measured
- What are N and M? N = 2 and M = 3
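The example's parameters can be written down directly. A minimal sketch in Python; the B values are taken from the paper version of this example, since the slides render the matrices as images, so treat them as assumed:

```python
A  = [[0.7, 0.3],        # H -> H, H -> C
      [0.4, 0.6]]        # C -> H, C -> C
B  = [[0.1, 0.4, 0.5],   # P(S|H), P(M|H), P(L|H)  (assumed from the paper)
      [0.7, 0.2, 0.1]]   # P(S|C), P(M|C), P(L|C)
pi = [0.6, 0.4]          # P(start H), P(start C)

def is_row_stochastic(matrix, tol=1e-9):
    """Every row must sum to 1, with all entries non-negative."""
    return all(abs(sum(row) - 1.0) < tol and min(row) >= 0 for row in matrix)

assert is_row_stochastic(A) and is_row_stochastic(B) and is_row_stochastic([pi])
N, M = len(A), len(B[0])
print(N, M)   # 2 3
```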
Slide 15: Generic HMM
- Generic view of an HMM (diagram omitted): the hidden Markov process generates states, and each state generates an observation via B
- An HMM is defined by A, B, and π
- We denote the HMM "model" as λ = (A, B, π)
Slide 16: HMM Example
- Suppose that we observe tree ring sizes S, M, S, L for a 4-year period of interest
- Then O = (0, 1, 0, 2)
- What is the most likely (hidden) state sequence?
- We want the most likely X = (x_0, x_1, x_2, x_3)
- Let π_x0 be the probability of starting in state x_0
- Then b_x0(O_0) is the probability of the initial observation
- And a_x0,x1 is the probability of the transition from x_0 to x_1
- And so on...
Slide 17: HMM Example
- Bottom line? We can compute P(X) for any X
- For X = (x_0, x_1, x_2, x_3) we have
      P(X) = π_x0 b_x0(O_0) a_x0,x1 b_x1(O_1) a_x1,x2 b_x2(O_2) a_x2,x3 b_x3(O_3)
- Suppose we observe (0, 1, 0, 2); then what is the probability of, say, HHCC?
- Plug into the formula above to find
      P(HHCC) = 0.6(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) ≈ 0.000212
Slide 18: HMM Example
- Do the same for all 4-state sequences
- We find the winner is... CCCH
- Not so fast, my friend...
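The exhaustive enumeration just described can be sketched as follows (a sketch, with the B values assumed from the paper version of this example):

```python
from itertools import product

A  = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B  = {'H': [0.1, 0.4, 0.5], 'C': [0.7, 0.2, 0.1]}   # assumed values
pi = {'H': 0.6, 'C': 0.4}
O  = [0, 1, 0, 2]    # S, M, S, L

def path_prob(X):
    """P(X, O) = pi_{x0} b_{x0}(O_0) a_{x0,x1} b_{x1}(O_1) ..."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

scores = {''.join(X): path_prob(X) for X in product('HC', repeat=4)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 6))   # CCCH 0.002822
print(round(scores['HHCC'], 6))       # 0.000212
```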
Slide 19: HMM Example
- The path CCCH scores the highest
- In dynamic programming (DP), we find the highest scoring path
- But the HMM solution maximizes the expected number of correct states
  - Sometimes called the "EM algorithm", for "Expectation Maximization"
- How does the HMM approach work in this example?
Slide 20: HMM Example
- For the first position...
  - Sum the probabilities of all paths with H in the 1st position
  - Compare to the sum of probabilities of paths with C in the 1st position
  - The biggest wins
- Repeat for each position to find the most likely state at each position
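The position-by-position sums can be sketched as follows (B values again assumed from the paper version of this example):

```python
from itertools import product

A  = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B  = {'H': [0.1, 0.4, 0.5], 'C': [0.7, 0.2, 0.1]}   # assumed values
pi = {'H': 0.6, 'C': 0.4}
O  = [0, 1, 0, 2]

def path_prob(X):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

# For each position t, sum P(X, O) over all paths with H (resp. C) at t,
# then keep whichever state has the larger sum.
hmm_answer = []
for t in range(len(O)):
    sums = {s: sum(path_prob(X) for X in product('HC', repeat=4) if X[t] == s)
            for s in 'HC'}
    hmm_answer.append(max(sums, key=sums.get))
print(''.join(hmm_answer))   # CHCH
```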
Slide 21: HMM Example
- So, the HMM solution gives us CHCH
- While the dynamic programming solution is CCCH
- Which solution is better? Neither!
- Why is that? Different definitions of "best"
Slide 22: HMM Paradox?
- The HMM solution maximizes the expected number of correct states
- Whereas DP chooses the "best" overall path
- It is possible for the HMM solution to choose a "path" that is impossible
  - There could be a transition probability of 0
- We cannot get an impossible path with DP
- Is this a flaw with the HMM approach? No, it's a feature...
Slide 23: The Three Problems
- HMMs are used to solve 3 problems
- Problem 1: Given a model λ = (A, B, π) and an observation sequence O, find P(O|λ)
  - That is, we score an observation sequence to see how well it fits the given model
- Problem 2: Given λ = (A, B, π) and O, find an optimal state sequence
  - Uncover the hidden part (as in the previous example)
- Problem 3: Given O, N, and M, find the model λ that maximizes the probability of O
  - That is, train a model to fit the observations
Slide 24: HMMs in Practice
- Typically, HMMs are used as follows
- Given an observation sequence
  - Assume a hidden Markov process exists
  - Train a model based on the observations (Problem 3; determine N by trial and error)
- Then, given a new sequence of observations, score it against the model from the previous step (Problem 1; a high score implies it is similar to the training data)
Slide 25: HMMs in Practice
- The previous slide gives the sense in which an HMM is a "machine learning" technique
  - We do not need to specify anything except the parameter N
  - And the "best" N is found by trial and error
- That is, we don't have to think too much
  - Just train the HMM and then use it
- Best of all, there are efficient algorithms for HMMs
Slide 26: The Three Solutions
- We give detailed solutions to the three problems
- Note: we must have efficient solutions
- Recall the three problems:
  - Problem 1: Score an observation sequence versus a given model
  - Problem 2: Given a model, "uncover" the hidden part
  - Problem 3: Given an observation sequence, train a model
Slide 27: Solution 1
- Score observations versus a given model
- Given a model λ = (A, B, π) and an observation sequence O = (O_0, O_1, ..., O_{T-1}), find P(O|λ)
- Denote the hidden states as X = (x_0, x_1, ..., x_{T-1})
- Then from the definition of B,
      P(O|X,λ) = b_x0(O_0) b_x1(O_1) ... b_x{T-1}(O_{T-1})
- And from the definitions of A and π,
      P(X|λ) = π_x0 a_x0,x1 a_x1,x2 ... a_x{T-2},x{T-1}
Slide 28: Solution 1
- Elementary conditional probability fact: P(O,X|λ) = P(O|X,λ) P(X|λ)
- Sum over all possible state sequences X:
      P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
             = Σ π_x0 b_x0(O_0) a_x0,x1 b_x1(O_1) ... a_x{T-2},x{T-1} b_x{T-1}(O_{T-1})
- This "works", but is way too costly: it requires about 2TN^T multiplications
  - Why? There are N^T state sequences, each costing about 2T multiplications
- There had better be a better way...
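The brute-force sum above can be sketched directly; it is O(N^T) work, which is exactly why it does not scale (parameter values assumed from the temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

p_O = 0.0
for X in product(range(N), repeat=T):   # all N^T state sequences
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, T):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    p_O += p
print(p_O)
```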
Slide 29: Forward Algorithm
- Instead of brute force: the forward algorithm, or "alpha pass"
- For t = 0, 1, ..., T-1 and i = 0, 1, ..., N-1, let
      α_t(i) = P(O_0, O_1, ..., O_t, x_t = q_i | λ)
- That is, α_t(i) is the probability of the partial observation sequence up to t, with the Markov process in state q_i at step t
- This can be computed recursively and efficiently
Slide 30: Forward Algorithm
- Let α_0(i) = π_i b_i(O_0), for i = 0, 1, ..., N-1
- For t = 1, 2, ..., T-1 and i = 0, 1, ..., N-1, let
      α_t(i) = (Σ_j α_{t-1}(j) a_ji) b_i(O_t)
  where the sum is from j = 0 to N-1
- From the definition of α_t(i) we see that
      P(O|λ) = Σ_i α_{T-1}(i)
  where the sum is from i = 0 to N-1
- Note this requires only about N^2 T multiplications
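A minimal sketch of the alpha pass, using the temperature-example values (B assumed from the paper version of the example); it reproduces the brute-force result with N^2 T work instead of N^T:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]

def forward(A, B, pi, O):
    """Return the table of alphas; P(O|lambda) is the sum of the last row."""
    N, T = len(A), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                     # t = 0 base case
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):                  # recursion
        for i in range(N):
            alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
    return alpha

alpha = forward(A, B, pi, [0, 1, 0, 2])
p_O = sum(alpha[-1])
print(round(p_O, 7))   # 0.0096296
```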
Slide 31: Solution 2
- Given a model, find the "most likely" hidden states
- Given λ = (A, B, π) and O, find an optimal state sequence
  - Recall that optimal means "maximize the expected number of correct states"
  - In contrast, DP finds the best scoring path
- For the temperature/tree ring example, we solved this by brute force
  - But that is a hopelessly inefficient approach
- A better way: the backward algorithm, or "beta pass"
Slide 32: Backward Algorithm
- For t = 0, 1, ..., T-1 and i = 0, 1, ..., N-1, let
      β_t(i) = P(O_{t+1}, O_{t+2}, ..., O_{T-1} | x_t = q_i, λ)
- That is, β_t(i) is the probability of the partial observation sequence from t+1 to the end, given that the Markov process is in state q_i at step t
- Analogous to the forward algorithm
- As with the forward algorithm, this can be computed recursively and efficiently
Slide 33: Backward Algorithm
- Let β_{T-1}(i) = 1, for i = 0, 1, ..., N-1
- For t = T-2, T-3, ..., 0 and i = 0, 1, ..., N-1, let
      β_t(i) = Σ_j a_ij b_j(O_{t+1}) β_{t+1}(j)
  where the sum is from j = 0 to N-1
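A minimal sketch of the beta pass, with a sanity check that it also reproduces P(O|λ) (temperature-example values assumed):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]

def backward(A, B, O):
    N, T = len(A), len(O)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                     # t = T-1 base case
        beta[T-1][i] = 1.0
    for t in range(T-2, -1, -1):           # recurse backward down to t = 0
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
    return beta

beta = backward(A, B, O)
# Sanity check: sum_i pi_i b_i(O_0) beta_0(i) equals P(O|lambda)
p_O = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(len(A)))
print(round(p_O, 7))   # 0.0096296
```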
Slide 34: Solution 2
- For t = 0, 1, ..., T-1 and i = 0, 1, ..., N-1, define
      γ_t(i) = P(x_t = q_i | O, λ)
- The most likely state at time t is the q_i that maximizes γ_t(i)
- Note that γ_t(i) = α_t(i) β_t(i) / P(O|λ)
  - And recall that P(O|λ) = Σ α_{T-1}(i)
- The bottom line?
  - The forward algorithm solves Problem 1
  - The forward and backward algorithms together solve Problem 2
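Putting the two passes together for Solution 2, as a sketch (temperature-example values assumed); it recovers the CHCH answer from the brute-force computation earlier:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

alpha = [[0.0] * N for _ in range(T)]
beta  = [[0.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]
    beta[T-1][i] = 1.0
for t in range(1, T):
    for i in range(N):
        alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
for t in range(T-2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))

p_O = sum(alpha[T-1])
# gamma_t(i) = alpha_t(i) beta_t(i) / P(O|lambda); take the argmax per step
gamma = [[alpha[t][i] * beta[t][i] / p_O for i in range(N)] for t in range(T)]
states = ''.join('HC'[max(range(N), key=lambda i: gamma[t][i])] for t in range(T))
print(states)   # CHCH
```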
Slide 35: Solution 3
- Train a model: given O, N, and M, find the λ that maximizes the probability of O
- Here, we iteratively adjust λ = (A, B, π) to better fit the given observations O
- The sizes of the matrices are fixed (by N and M), but the elements of the matrices can change
- It is amazing that this works!
- And even more amazing that it's efficient
Slide 36: Solution 3
- For t = 0, 1, ..., T-2 and i, j in {0, 1, ..., N-1}, define the "di-gammas" as
      γ_t(i,j) = P(x_t = q_i, x_{t+1} = q_j | O, λ)
- Note that γ_t(i,j) is the probability of being in state q_i at time t and transiting to state q_j at time t+1
- Then
      γ_t(i,j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O|λ)
- And γ_t(i) = Σ_j γ_t(i,j), where the sum is from j = 0 to N-1
Slide 37: Model Re-estimation
- Given the di-gammas and gammas...
- For i = 0, 1, ..., N-1, let
      π_i = γ_0(i)
- For i = 0, 1, ..., N-1 and j = 0, 1, ..., N-1,
      a_ij = Σ γ_t(i,j) / Σ γ_t(i)
  where both sums are from t = 0 to T-2
- For j = 0, 1, ..., N-1 and k = 0, 1, ..., M-1,
      b_j(k) = Σ γ_t(j) / Σ γ_t(j)
  where both sums are from t = 0 to T-2, but only those t for which O_t = k are counted in the numerator
- Why does this work? Each re-estimate is a ratio of expected counts; for example, a_ij is the expected number of transitions from q_i to q_j divided by the expected number of transitions out of q_i
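One (unscaled) re-estimation step can be sketched as follows; a practical implementation would use the scaled version described later, and the parameter values here are again assumed from the temperature example:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, M, T = 2, 3, len(O)

# forward and backward passes (unscaled is fine for this tiny T)
alpha = [[0.0] * N for _ in range(T)]
beta  = [[0.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = pi[i] * B[i][O[0]]
    beta[T-1][i] = 1.0
for t in range(1, T):
    for i in range(N):
        alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
for t in range(T-2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
p_O = sum(alpha[T-1])

# di-gammas and gammas, exactly as defined on the slides
digamma = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / p_O
             for j in range(N)] for i in range(N)] for t in range(T-1)]
gamma = [[sum(digamma[t][i]) for i in range(N)] for t in range(T-1)]

# re-estimated model: each value is a ratio of expected counts
new_pi = gamma[0][:]
new_A = [[sum(digamma[t][i][j] for t in range(T-1)) /
          sum(gamma[t][i] for t in range(T-1)) for j in range(N)] for i in range(N)]
new_B = [[sum(gamma[t][j] for t in range(T-1) if O[t] == k) /
          sum(gamma[t][j] for t in range(T-1)) for k in range(M)] for j in range(N)]
print(new_pi)
```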
Slide 38: Solution 3
- To summarize...
  1. Initialize λ = (A, B, π)
  2. Compute α_t(i), β_t(i), γ_t(i,j), γ_t(i)
  3. Re-estimate the model λ = (A, B, π)
  4. If P(O|λ) increases, goto 2
Slide 39: Solution 3
- Some fine points...
- Model initialization
  - If we have a good guess for λ = (A, B, π), then we can use it for initialization
  - If not, let π_i ≈ 1/N, a_ij ≈ 1/N, b_j(k) ≈ 1/M, subject to the row stochastic conditions
  - Note: do not initialize to exactly uniform values
- Stopping conditions
  - Stop after some number of iterations
  - Stop if the increase in P(O|λ) is "small"
Slide 40: HMM as Discrete Hill Climb
- The algorithm on the previous slides shows that HMM training is a "discrete hill climb"
- An HMM consists of discrete parameters: specifically, the elements of the matrices
- The re-estimation process improves the model by modifying these parameters
- So, the process "climbs" toward an improved model
- This happens in a high-dimensional space
Slide 41: Dynamic Programming
- A brief detour...
- For λ = (A, B, π) as above, it's easy to define a dynamic program (DP)
- Executive summary: DP is the forward algorithm, with "sum" replaced by "max"
- Precise details on the next slides
Slide 42: Dynamic Programming
- Let δ_0(i) = π_i b_i(O_0), for i = 0, 1, ..., N-1
- For t = 1, 2, ..., T-1 and i = 0, 1, ..., N-1, compute
      δ_t(i) = max_j (δ_{t-1}(j) a_ji) b_i(O_t)
  where the max is over j in {0, 1, ..., N-1}
- Note that at each t, the DP computes the best path ending in each state, up to that point
- So, the probability of the best path is max_j δ_{T-1}(j)
- This max only gives the best probability, not the best path; for that, see the next slide
Slide 43: Dynamic Programming
- To determine the optimal path:
  - While computing the δ values, keep pointers to the best previous state
  - When finished, construct the optimal path by tracing back the pointers
- For example, consider the temperature example
- Probabilities for the paths of length 1: δ_0(H) = 0.6(0.1) = 0.06 and δ_0(C) = 0.4(0.7) = 0.28
- These are the only "paths" of length 1
Slide 44: Dynamic Programming
- Probabilities for each path of length 2:
      HH: 0.06(0.7)(0.4) = 0.0168    CH: 0.28(0.4)(0.4) = 0.0448
      HC: 0.06(0.3)(0.2) = 0.0036    CC: 0.28(0.6)(0.2) = 0.0336
- The best path of length 2 ending with H is CH
- The best path of length 2 ending with C is CC
Slide 45: Dynamic Program
- Continuing, we compute the best path ending at H and at C at each step
- And save the pointers --- why?
Slide 46: Dynamic Program
- The best final score is 0.002822
- And, thanks to the pointers, the best path is CCCH
- But what about underflow?
  - A serious problem in bigger cases
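The DP with backpointers can be sketched as follows (temperature-example values assumed); it reproduces the CCCH path and the 0.002822 score:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[0.0] * N for _ in range(T)]
back  = [[0] * N for _ in range(T)]
for i in range(N):
    delta[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):
    for i in range(N):
        j_best = max(range(N), key=lambda j: delta[t-1][j] * A[j][i])
        delta[t][i] = delta[t-1][j_best] * A[j_best][i] * B[i][O[t]]
        back[t][i] = j_best               # pointer to the best previous state

# Trace back from the best final state
i = max(range(N), key=lambda j: delta[T-1][j])
path = [i]
for t in range(T-1, 0, -1):
    i = back[t][i]
    path.append(i)
path.reverse()
print(''.join('HC'[s] for s in path), round(max(delta[T-1]), 6))   # CCCH 0.002822
```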
Slide 47: Underflow Resistant DP
- A common trick to prevent underflow:
  - Instead of multiplying probabilities...
  - ...we add logarithms of probabilities
- Why does this work? Because log(xy) = log x + log y
- And adding logs does not tend to 0
- Note that we must avoid probabilities of 0
Slide 48: Underflow Resistant DP
- Underflow resistant DP algorithm:
- Let δ_0(i) = log(π_i b_i(O_0)), for i = 0, 1, ..., N-1
- For t = 1, 2, ..., T-1 and i = 0, 1, ..., N-1, compute
      δ_t(i) = max_j (δ_{t-1}(j) + log(a_ji) + log(b_i(O_t)))
  where the max is over j in {0, 1, ..., N-1}
- The score of the best path is max_j δ_{T-1}(j)
- As before, we must also keep track of the paths
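A sketch of the log-space variant (temperature-example values assumed; note that all probabilities in this example are nonzero, as the trick requires):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

delta = [[0.0] * N for _ in range(T)]
for i in range(N):
    delta[0][i] = math.log(pi[i] * B[i][O[0]])
for t in range(1, T):
    for i in range(N):
        # sums of logs replace products of probabilities
        delta[t][i] = max(delta[t-1][j] + math.log(A[j][i]) for j in range(N)) \
                      + math.log(B[i][O[t]])

best_log = max(delta[T-1])
print(round(math.exp(best_log), 6))   # 0.002822 -- same best score as before
```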
Slide 49: HMM Scaling
- It is trickier to prevent underflow in HMM training
- We consider Solution 3, since it includes Solutions 1 and 2
- Recall that for t = 1, 2, ..., T-1 and i = 0, 1, ..., N-1,
      α_t(i) = (Σ_j α_{t-1}(j) a_ji) b_i(O_t)
- The idea is to normalize the alphas so that they sum to one
- Algorithm on the next slide
Slide 50: HMM Scaling
- Given α_t(i) = (Σ_j α_{t-1}(j) a_ji) b_i(O_t)
- Let â_0(i) = α_0(i), for i = 0, 1, ..., N-1
- Let c_0 = 1/Σ_j â_0(j)
- For i = 0, 1, ..., N-1, let â_0(i) = c_0 â_0(i)
- This takes care of the t = 0 case
- Algorithm continued on the next slide...
Slide 51: HMM Scaling
- For t = 1, 2, ..., T-1, do the following:
  - For i = 0, 1, ..., N-1,
        â_t(i) = (Σ_j â_{t-1}(j) a_ji) b_i(O_t)
  - Let c_t = 1/Σ_j â_t(j)
  - For i = 0, 1, ..., N-1, let â_t(i) = c_t â_t(i)
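The scaled alpha pass can be sketched as follows (temperature-example values assumed); the scale factors c_t then give log P(O|λ), as derived on the following slides:

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]   # assumed values
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

a_hat = [[0.0] * N for _ in range(T)]
c = [0.0] * T
for i in range(N):                         # t = 0 case
    a_hat[0][i] = pi[i] * B[i][O[0]]
c[0] = 1.0 / sum(a_hat[0])
a_hat[0] = [c[0] * x for x in a_hat[0]]
for t in range(1, T):                      # each row is normalized to sum to 1
    for i in range(N):
        a_hat[t][i] = sum(a_hat[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
    c[t] = 1.0 / sum(a_hat[t])
    a_hat[t] = [c[t] * x for x in a_hat[t]]

log_p = -sum(math.log(ct) for ct in c)     # log P(O|lambda) = -sum(log c_t)
print(round(math.exp(log_p), 7))           # 0.0096296 -- matches the unscaled value
```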
Slide 52: HMM Scaling
- It is easy to show that â_t(i) = c_0 c_1 ... c_t α_t(i)   (♯)
  - Simple proof by induction
- So, c_0 c_1 ... c_t is the scaling factor at step t
- Also, it is easy to show that â_t(i) = α_t(i) / Σ_j α_t(j)
- Which implies that Σ_i â_{T-1}(i) = 1   (♯♯)
Slide 53: HMM Scaling
- By combining (♯) and (♯♯), we have
      1 = Σ_i â_{T-1}(i) = c_0 c_1 ... c_{T-1} Σ_i α_{T-1}(i) = c_0 c_1 ... c_{T-1} P(O|λ)
- Therefore, P(O|λ) = 1 / (c_0 c_1 ... c_{T-1})
- To avoid underflow, we compute
      log P(O|λ) = -Σ_j log(c_j)
  where the sum is from j = 0 to T-1
Slide 54: HMM Scaling
- Similarly, scale the betas as c_t β_t(i)
- For re-estimation, compute γ_t(i,j) and γ_t(i) using the original formulas, but with the scaled alphas and betas
- This gives us new values for λ = (A, B, π)
- It is an "easy exercise" to show the re-estimates are exact when the scaled alphas and betas are used
  - Also, P(O|λ) cancels from the formula
- Use log P(O|λ) = -Σ log(c_j) to decide whether an iteration improves the model
Slide 55: All Together Now
- Complete pseudo code for Solution 3
- Given: O = (O_0, O_1, ..., O_{T-1}) and N and M
- Initialize: λ = (A, B, π)
  - A is N x N, B is N x M, and π is 1 x N
  - π_i ≈ 1/N, a_ij ≈ 1/N, b_j(k) ≈ 1/M
  - Each matrix row stochastic, but not uniform
- Initialize:
  - maxIters = maximum number of re-estimation steps
  - iters = 0
  - oldLogProb = -∞
Slide 56: Forward Algorithm
- Forward algorithm, with scaling (pseudo code omitted)
Slide 57: Backward Algorithm
- Backward algorithm, or "beta pass", with scaling (pseudo code omitted)
- Note: the betas use the same scaling factors as the alphas
Slide 58: Gammas
- Here, use the scaled alphas and betas
- So the formulas are unchanged
Slide 59: Re-Estimation
- Again, use the scaled gammas
- So the formulas are unchanged
Slide 60: Stopping Criteria
- Check that the probability increases
  - In practice, we want logProb > oldLogProb + ε
- And don't exceed the maximum number of iterations
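The complete training loop can be put together as a compact sketch (scaled passes throughout; the observation sequence below is a made-up toy input, and the near-uniform random initialization is an assumption in the spirit of Slide 39):

```python
import math
import random

def train_hmm(O, N, M, max_iters=100, eps=1e-6, seed=0):
    """Baum-Welch with scaling, following the pseudo code on these slides."""
    rng = random.Random(seed)
    def rand_row(n):
        # near-uniform but NOT uniform, row stochastic (see Slide 39)
        row = [1.0 + 0.1 * rng.random() for _ in range(n)]
        s = sum(row)
        return [x / s for x in row]
    A  = [rand_row(N) for _ in range(N)]
    B  = [rand_row(M) for _ in range(N)]
    pi = rand_row(N)
    T = len(O)
    old_log_prob = -math.inf
    for _ in range(max_iters):
        # scaled forward pass
        alpha, c = [[0.0] * N for _ in range(T)], [0.0] * T
        for i in range(N):
            alpha[0][i] = pi[i] * B[i][O[0]]
        c[0] = 1.0 / sum(alpha[0])
        alpha[0] = [c[0] * x for x in alpha[0]]
        for t in range(1, T):
            for i in range(N):
                alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
            c[t] = 1.0 / sum(alpha[t])
            alpha[t] = [c[t] * x for x in alpha[t]]
        # scaled backward pass (same scale factors as the alphas)
        beta = [[0.0] * N for _ in range(T)]
        beta[T-1] = [c[T-1]] * N
        for t in range(T-2, -1, -1):
            for i in range(N):
                beta[t][i] = c[t] * sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                                        for j in range(N))
        # di-gammas and gammas; P(O|lambda) cancels with scaled alphas/betas
        digamma = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                     for j in range(N)] for i in range(N)] for t in range(T-1)]
        gamma = [[sum(digamma[t][i]) for i in range(N)] for t in range(T-1)]
        # re-estimate pi, A, B as ratios of expected counts
        pi = gamma[0][:]
        for i in range(N):
            denom = sum(gamma[t][i] for t in range(T-1))
            for j in range(N):
                A[i][j] = sum(digamma[t][i][j] for t in range(T-1)) / denom
        for j in range(N):
            denom = sum(gamma[t][j] for t in range(T-1))
            for k in range(M):
                B[j][k] = sum(gamma[t][j] for t in range(T-1) if O[t] == k) / denom
        # stopping criterion: log P(O|lambda) = -sum(log c_t) must keep increasing
        log_prob = -sum(math.log(ct) for ct in c)
        if log_prob <= old_log_prob + eps:
            break
        old_log_prob = log_prob
    return (A, B, pi), old_log_prob

O = [0, 1, 0, 2, 2, 1, 0, 0, 2, 1, 0, 2] * 4   # made-up toy observation sequence
(A, B, pi), log_prob = train_hmm(O, N=2, M=3)
print(round(log_prob, 3))
```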
Slide 61: English Text Example
- Suppose a Martian arrives on Earth
  - Sees written English text
  - Wants to learn something about it
  - Martians know about HMMs
- So, strip out all non-letters and make all letters lower-case
- That leaves 27 symbols (26 letters, plus word-space)
- Train an HMM on a long sequence of these symbols
Slide 62: English Text
- For the first training case, initialize:
  - N = 2 and M = 27
  - Elements of A and π are each about 1/2
  - Elements of B are each about 1/27
- We use 50,000 symbols for training
- After the 1st iteration: log P(O|λ) ≈ -165097
- After the 100th iteration: log P(O|λ) ≈ -137305
Slide 63: English Text
- The matrices A and π converge (values omitted)
- What does this tell us?
  - The process started in hidden state 1 (not state 0)
  - And we know the transition probabilities between the hidden states
- Nothing too interesting here; we don't care about the hidden states themselves
Slide 64: English Text
- What about the B matrix? This is much more interesting... Why?
- The two hidden states effectively sort the symbols into vowels and consonants, with no such structure specified in advance
Slide 65: A Security Application
- Suppose we want to detect metamorphic computer viruses
  - Such viruses vary their internal structure
  - But the function of the malware stays the same
  - If sufficiently variable, standard signature detection will fail
- Can we use an HMM for detection?
  - What do we use as the observation sequence?
  - Is there really a "hidden" Markov process?
  - What about N, M, and T?
  - How many Os are needed for training and scoring?
Slide 66: HMM for Metamorphic Detection
- Split a set of "family" viruses into 2 subsets
- Extract opcodes from each virus
- Append the opcodes from subset 1 to make one long sequence
  - Train an HMM on this opcode sequence (Problem 3)
  - Obtain a model λ = (A, B, π)
- Set a threshold: score the opcodes from files in subset 2 and from "normal" files (Problem 1)
- Can you set a threshold that separates the two sets?
  - If so, we may have a viable detection method
Slide 67: HMM for Metamorphic Detection
- Virus detection results from a recent paper (plot omitted)
- Note the separation between the two sets of scores
- This is good!
Slide 68: HMM Generalizations
- Here, we assumed a Markov process of order 1
  - The current state depends only on the previous state and the transition matrix
- Can use a higher order Markov process
  - The current state depends on the n previous states
  - Higher order versus increased N?
- Can have the A and B matrices depend on t
- HMMs are often combined with other techniques (e.g., neural nets)
Slide 69: Generalizations
- In some cases, a big limitation of HMMs is that positional information is not used
- In many applications this is OK, or even desirable
- In some applications, this is a serious limitation
  - Bioinformatics applications: DNA sequencing, protein alignment, etc.
  - Sequence alignment is crucial
  - These use "profile HMMs" (PHMMs) instead of standard HMMs
- PHMM is the next topic...
Slide 70: References
- M. Stamp, A revealing introduction to hidden Markov models
  http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
- L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition
  http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
Slide 71: References
- W. Wong and M. Stamp, Hunting for metamorphic engines
  Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229
- D. Lin and M. Stamp, Hunting for undetectable metamorphic viruses
  Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214