
A Revealing Introduction to Hidden Markov Models

Mark Stamp


Hidden Markov Models

What is a hidden Markov model (HMM)?
A machine learning technique, and a discrete hill-climb technique.

Where are HMMs used?
Speech recognition, malware detection, intrusion detection (IDS), and many other areas.

Why are HMMs useful?
Because there are efficient algorithms for all of the key HMM computations.

Markov Chain

A Markov chain is a "memoryless" random process: transitions depend only on the current state and the transition probability matrix.
Example on the next slide…

Markov Chain

We are interested in average annual temperature, and we consider only two states, Hot (H) and Cold (C). From recorded history, we obtain transition probabilities; see the diagram below.

[State diagram: H stays at H with probability 0.7 and moves to C with 0.3; C stays at C with probability 0.6 and moves to H with 0.4.]

Markov Chain

The transition probability matrix is denoted A. Note that A is "row stochastic": every row sums to 1.

For the temperature example (rows and columns ordered H, C):

A = | 0.7  0.3 |
    | 0.4  0.6 |

Markov Chain

We can also include begin and end states. The begin-state (initial) distribution is the matrix π; in this example, π = (0.6, 0.4). Note that π is also row stochastic.

[State diagram as before, plus a begin state that enters H with probability 0.6 and C with probability 0.4, and an end state.]
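The chain above is easy to express in code. Here is a minimal Python sketch (state 0 = H, state 1 = C; the `simulate` helper is an illustrative addition, not part of the slides):

```python
import random

# Temperature Markov chain: state 0 = H, state 1 = C.
A  = [[0.7, 0.3],   # transitions out of H
      [0.4, 0.6]]   # transitions out of C
pi = [0.6, 0.4]     # begin-state probabilities

# "Row stochastic": every row sums to 1.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A + [pi])

def simulate(T, rng=random.Random(0)):
    """Draw a length-T state sequence from the chain."""
    x = rng.choices([0, 1], weights=pi)[0]
    path = [x]
    for _ in range(T - 1):
        x = rng.choices([0, 1], weights=A[x])[0]
        path.append(x)
    return path

print("".join("HC"[s] for s in simulate(10)))
```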

Hidden Markov Model

An HMM includes a Markov chain, but this Markov process is "hidden": we cannot observe it directly. Instead, we observe something related to the hidden states. It is as if there were a "curtain" between the Markov chain and the observations.
Example on the next slide.

HMM Example

Consider the H/C temperature example. Suppose we want to know whether the temperature was H or C in the distant past, before humans (or thermometers) existed. It is OK if we can just decide Hot versus Cold. We assume that transitions between Hot and Cold years were the same as today, that is, the A matrix is the same as today.

HMM Example

Temperature in the past was determined by a Markov process, but we cannot observe past temperatures. Instead, we note that tree ring size is related to temperature, and we look at historical data to see the connection. We consider three tree ring sizes: Small, Medium, and Large (S, M, L, respectively). We measure tree ring sizes against recorded temperatures to determine the relationship.

HMM Example

We find that tree ring sizes and temperature are related by the B matrix (rows ordered H, C; columns ordered S, M, L; the values shown here are the ones used in Stamp's paper):

B = | 0.1  0.4  0.5 |
    | 0.7  0.2  0.1 |

Note that B is also row stochastic.

HMM Example

Can we now find temperatures in the distant past? We cannot measure (observe) temperature, but we can measure tree ring sizes, and tree ring sizes are related to temperature by the B matrix. So we ought to be able to say something about temperature.

HMM Notation

A lot of notation is required; the notation may be the most difficult part.

HMM Notation

To simplify notation, observations are taken from the set V = {0, 1, …, M−1}. Here T is the length of the observation sequence, N is the number of states in the model, M is the number of observation symbols, and Q = {q_0, q_1, …, q_{N−1}} is the set of states of the Markov process.

The matrix A = {a_ij} is N × N, where a_ij = P(state q_j at t+1 | state q_i at t).

The matrix B = {b_j(k)} is N × M, where b_j(k) = P(observation k at t | state q_j at t).

HMM Example

Consider our temperature example…
What are the observations? V = {0, 1, 2}, corresponding to S, M, L.
What are the states of the Markov process? Q = {H, C}.
What are A, B, π, and T? A, B, and π are on the previous slides, and T is the number of tree rings measured.
What are N and M? N = 2 and M = 3.

Generic HMM

Generic view of an HMM: a hidden state sequence X_0, X_1, …, X_{T−1} driven by A, with observations O_0, O_1, …, O_{T−1} emitted according to B. The HMM is defined by A, B, and π, and we denote the HMM "model" as λ = (A, B, π).

HMM Example

Suppose that we observe the tree ring sizes S, M, S, L for a 4-year period of interest. Then O = (0, 1, 0, 2). What is the most likely (hidden) state sequence? We want the most likely X = (x_0, x_1, x_2, x_3).
Let π_{x_0} be the probability of starting in state x_0. Then b_{x_0}(O_0) is the probability of the initial observation, a_{x_0,x_1} is the probability of the transition from x_0 to x_1, and so on…

HMM Example

Bottom line? We can compute P(X) for any X. For X = (x_0, x_1, x_2, x_3) we have

P(X) = π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) a_{x_1,x_2} b_{x_2}(O_2) a_{x_2,x_3} b_{x_3}(O_3)

Suppose we observe (0, 1, 0, 2); then what is the probability of, say, HHCC? Plug into the formula above to find out.
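This calculation is mechanical enough to script. A short Python sketch of P(O, X); the B-matrix values are the ones used in Stamp's paper (rows H, C; columns S, M, L):

```python
# Joint probability P(O, X) for the temperature example.
A  = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B  = {'H': [0.1, 0.4, 0.5], 'C': [0.7, 0.2, 0.1]}   # columns S, M, L
pi = {'H': 0.6, 'C': 0.4}
O  = [0, 1, 0, 2]   # observed tree rings S, M, S, L

def joint(X):
    """P(O, X) = pi_{x0} b_{x0}(O_0) * product of a_{x(t-1),x(t)} b_{x(t)}(O_t)."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

print(round(joint("HHCC"), 6))   # 0.000212
print(round(joint("CCCH"), 6))   # 0.002822
```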


HMM Example

Do the same for all 16 four-state sequences, and we find… the winner is? CCCH. Not so fast, my friend…

HMM Example

The path CCCH scores the highest. In dynamic programming (DP), we find the highest-scoring path. But the HMM solution maximizes the expected number of correct states. This approach is sometimes called the "EM algorithm", for "Expectation Maximization". How does the HMM solution work in this example?

HMM Example

For the first position, sum the probabilities of all paths that have H in the 1st position and compare to the sum for paths with C in the 1st position: the biggest wins. Repeat for each position to find the most likely state at each step.
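These per-position sums can be computed directly by enumerating all 2^4 = 16 state sequences (a sketch, with A, B, and π as in the running example):

```python
from itertools import product

A  = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B  = {'H': [0.1, 0.4, 0.5], 'C': [0.7, 0.2, 0.1]}
pi = {'H': 0.6, 'C': 0.4}
O  = [0, 1, 0, 2]

def joint(X):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

paths = {"".join(X): joint(X) for X in product("HC", repeat=4)}
total = sum(paths.values())            # this sum is P(O | lambda)

# DP-style "best": the single highest-scoring path.
dp_best = max(paths, key=paths.get)

# HMM-style "best": per position, the state with the larger summed probability.
hmm_best = ""
for t in range(4):
    pH = sum(p for X, p in paths.items() if X[t] == 'H')
    pC = total - pH
    hmm_best += 'H' if pH > pC else 'C'

print(dp_best, hmm_best)   # CCCH CHCH
```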


HMM Example

So, the HMM solution gives us CHCH, while the dynamic programming solution is CCCH. Which solution is better? Neither! Why? The two use different definitions of "best".

HMM Paradox?

The HMM solution maximizes the expected number of correct states, whereas DP chooses the "best" overall path. It is possible for the HMM solution to choose a "path" that is impossible, for example one containing a transition of probability 0. We cannot get an impossible path with DP. Is this a flaw in the HMM approach? No, it's a feature…

The Three Problems

HMMs are used to solve three problems.
Problem 1: Given a model λ = (A, B, π) and an observation sequence O, find P(O|λ). That is, we score an observation sequence to see how well it fits the given model.
Problem 2: Given λ = (A, B, π) and O, find an optimal state sequence. That is, uncover the hidden part (as in the previous example).
Problem 3: Given O, N, and M, find the model λ that maximizes the probability of O. That is, train a model to fit the observations.

HMMs in Practice

Typically, HMMs are used as follows. Given an observation sequence, assume a hidden Markov process exists and train a model based on the observations, determining N by trial and error (Problem 3). Then, given another sequence of observations, score it against the model from the previous step (Problem 1); a high score implies that it is similar to the training data.

HMMs in Practice

The previous slide gives the sense in which an HMM is a "machine learning" technique: we do not need to specify anything except the parameter N, and the "best" N is found by trial and error. That is, we don't have to think too much; we just train an HMM and then use it. Best of all, there are efficient algorithms for HMMs.

The Three Solutions

We now give detailed solutions to the three problems. Note that we must have efficient solutions. Recall the three problems:
Problem 1: Score an observation sequence versus a given model.
Problem 2: Given a model, "uncover" the hidden part.
Problem 3: Given an observation sequence, train a model.

Solution 1

Score observations versus a given model: given the model λ = (A, B, π) and an observation sequence O = (O_0, O_1, …, O_{T−1}), find P(O|λ).
Denote the hidden states as X = (x_0, x_1, …, x_{T−1}). Then from the definition of B,

P(O|X,λ) = b_{x_0}(O_0) b_{x_1}(O_1) … b_{x_{T−1}}(O_{T−1})

and from the definitions of A and π,

P(X|λ) = π_{x_0} a_{x_0,x_1} a_{x_1,x_2} … a_{x_{T−2},x_{T−1}}

Solution 1

Elementary conditional probability fact: P(O,X|λ) = P(O|X,λ) P(X|λ).
Summing over all possible state sequences X,

P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
       = Σ π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) … a_{x_{T−2},x_{T−1}} b_{x_{T−1}}(O_{T−1})

This "works", but it is far too costly: it requires about 2T·N^T multiplications (why?). There had better be a better way…

Forward Algorithm

Instead of brute force, we use the forward algorithm, or "alpha pass".
For t = 0, 1, …, T−1 and i = 0, 1, …, N−1, let

α_t(i) = P(O_0, O_1, …, O_t, x_t = q_i | λ)

That is, α_t(i) is the probability of the partial observation sequence up to time t, with the Markov process in state q_i at step t. The α_t(i) can be computed recursively, and efficiently.

Forward Algorithm

Let α_0(i) = π_i b_i(O_0), for i = 0, 1, …, N−1.
For t = 1, 2, …, T−1 and i = 0, 1, …, N−1, let

α_t(i) = ( Σ_{j=0}^{N−1} α_{t−1}(j) a_{ji} ) b_i(O_t)

From the definition of α_t(i) we see that

P(O|λ) = Σ_{i=0}^{N−1} α_{T−1}(i)

Note that this requires only about N²T multiplications.
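A sketch of the alpha pass, cross-checked against the brute-force sum from Solution 1 (model values as in the running temperature example):

```python
from itertools import product

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# alpha[t][i] = P(O_0..O_t, x_t = q_i | lambda)
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
p_forward = sum(alpha[T-1])

# Brute force: P(O) = sum over all N^T state sequences X of P(O, X).
p_brute = 0.0
for X in product(range(N), repeat=T):
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, T):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    p_brute += p

assert abs(p_forward - p_brute) < 1e-12
print(p_forward)
```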


Solution 2

Given a model, find the "most likely" hidden states: given λ = (A, B, π) and O, find an optimal state sequence. Recall that here "optimal" means maximizing the expected number of correct states; in contrast, DP finds the best-scoring path. For the temperature/tree ring example we solved this by brute force, a hopelessly inefficient approach. A better way uses the backward algorithm, or "beta pass".

Backward Algorithm

For t = 0, 1, …, T−1 and i = 0, 1, …, N−1, let

β_t(i) = P(O_{t+1}, O_{t+2}, …, O_{T−1} | x_t = q_i, λ)

That is, β_t(i) is the probability of the partial observation sequence from t+1 to the end, given that the Markov process is in state q_i at step t. This is analogous to the forward algorithm, and as with the forward algorithm, it can be computed recursively and efficiently.

Backward Algorithm

Let β_{T−1}(i) = 1, for i = 0, 1, …, N−1.
For t = T−2, T−3, …, 0 and i = 0, 1, …, N−1, let

β_t(i) = Σ_{j=0}^{N−1} a_{ij} b_j(O_{t+1}) β_{t+1}(j)
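A sketch of the beta pass; as a sanity check, Σ π_i b_i(O_0) β_0(i) must equal the P(O|λ) computed by the forward algorithm:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# Forward pass (for the cross-check at the end).
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])

# Backward pass: beta_{T-1}(i) = 1, then recurse from t = T-2 down to 0.
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                         for j in range(N))

# Both passes give P(O | lambda).
p_fwd = sum(alpha[T-1])
p_bwd = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(N))
assert abs(p_fwd - p_bwd) < 1e-12
```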


Solution 2

For t = 0, 1, …, T−1 and i = 0, 1, …, N−1, define

γ_t(i) = P(x_t = q_i | O, λ)

The most likely state at time t is the q_i that maximizes γ_t(i). Note that

γ_t(i) = α_t(i) β_t(i) / P(O|λ)

and recall that P(O|λ) = Σ_{i=0}^{N−1} α_{T−1}(i).
The bottom line? The forward algorithm solves Problem 1, and the forward and backward algorithms together solve Problem 2.
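Putting the two passes together for the running temperature example reproduces the CHCH answer from the earlier slides (a sketch; model values as before):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))

pO = sum(alpha[T-1])
# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda)
gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
best = "".join("HC"[max(range(N), key=lambda i: gamma[t][i])] for t in range(T))
print(best)   # CHCH
```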


Solution 3

Train a model: given O, N, and M, find the λ that maximizes the probability of O. Here we iteratively adjust λ = (A, B, π) to better fit the given observations O. The sizes of the matrices are fixed by N and M, but the elements of the matrices can change. It is amazing that this works, and even more amazing that it is efficient!

Solution 3

For t = 0, 1, …, T−2 and i, j in {0, 1, …, N−1}, define the "di-gammas" as

γ_t(i,j) = P(x_t = q_i, x_{t+1} = q_j | O, λ)

That is, γ_t(i,j) is the probability of being in state q_i at time t and transiting to state q_j at time t+1. Then

γ_t(i,j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O|λ)

and

γ_t(i) = Σ_{j=0}^{N−1} γ_t(i,j)

Model Re-estimation

Given the di-gammas and gammas…
For i = 0, 1, …, N−1, let π_i = γ_0(i).
For i = 0, 1, …, N−1 and j = 0, 1, …, N−1, let

a_ij = Σ γ_t(i,j) / Σ γ_t(i)

where both sums are from t = 0 to T−2.
For j = 0, 1, …, N−1 and k = 0, 1, …, M−1, let

b_j(k) = Σ γ_t(j) / Σ γ_t(j)

where both sums are from t = 0 to T−2, but only those t for which O_t = k are counted in the numerator.
Why does this work?

Solution 3

To summarize…
1. Initialize λ = (A, B, π)
2. Compute α_t(i), β_t(i), γ_t(i,j), γ_t(i)
3. Re-estimate the model λ = (A, B, π)
4. If P(O|λ) increases, goto 2
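The loop above can be sketched in Python (no scaling, so only suitable for short observation sequences like this one; the B re-estimate here follows the fuller pseudo-code in Stamp's paper, which also uses a gamma at T−1):

```python
# One run of the re-estimation loop for the temperature example.
def forward(A, B, pi, O):
    T, N = len(O), len(pi)
    al = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        al.append([sum(al[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                   for i in range(N)])
    return al

def backward(A, B, O, N):
    T = len(O)
    be = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            be[t][i] = sum(A[i][j] * B[j][O[t+1]] * be[t+1][j] for j in range(N))
    return be

def reestimate(A, B, pi, O):
    T, N, M = len(O), len(pi), len(B[0])
    al, be = forward(A, B, pi, O), backward(A, B, O, N)
    pO = sum(al[T-1])
    # di-gammas for t = 0..T-2, plus the special gamma at T-1
    dig = [[[al[t][i] * A[i][j] * B[j][O[t+1]] * be[t+1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gam = [[sum(dig[t][i]) for i in range(N)] for t in range(T - 1)]
    gam.append([al[T-1][i] / pO for i in range(N)])
    new_pi = gam[0][:]
    new_A = [[sum(dig[t][i][j] for t in range(T - 1)) /
              sum(gam[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gam[t][j] for t in range(T) if O[t] == k) /
              sum(gam[t][j] for t in range(T)) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi, pO

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
history = []
for _ in range(10):
    A, B, pi, pO = reestimate(A, B, pi, O)
    history.append(pO)
# P(O | lambda) never decreases from one iteration to the next.
assert all(b >= a - 1e-12 for a, b in zip(history, history[1:]))
```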


Solution 3

Some fine points…
Model initialization: if we have a good guess for λ = (A, B, π), we can use it for initialization. If not, let π_i ≈ 1/N, a_ij ≈ 1/N, and b_j(k) ≈ 1/M, subject to the row stochastic conditions. Note: do not initialize to exactly uniform values.
Stopping conditions: stop after some number of iterations, and/or stop if the increase in P(O|λ) is "small".
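A sketch of the "approximately uniform, but not exactly" initialization (the ±0.01 jitter is an arbitrary choice for illustration):

```python
import random

def init_row(n, rng):
    """Row that is approximately, but not exactly, uniform."""
    row = [1.0 / n + rng.uniform(-0.01, 0.01) for _ in range(n)]
    s = sum(row)
    return [x / s for x in row]   # renormalize so the row is stochastic

rng = random.Random(1)
N, M = 2, 3
pi = init_row(N, rng)
A  = [init_row(N, rng) for _ in range(N)]
B  = [init_row(M, rng) for _ in range(N)]
assert all(abs(sum(r) - 1.0) < 1e-12 for r in [pi] + A + B)
```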


HMM as Discrete Hill Climb

The algorithm on the previous slides shows that HMM training is a "discrete hill climb". The HMM consists of discrete parameters, specifically the elements of the matrices, and the re-estimation process improves the model by modifying those parameters. So the process "climbs" toward an improved model. This happens in a high-dimensional parameter space.

Dynamic Programming

A brief detour… For λ = (A, B, π) as above, it is easy to define a dynamic program (DP). Executive summary: the DP is the forward algorithm with "sum" replaced by "max". Precise details on the next slides.

Dynamic Programming

Let δ_0(i) = π_i b_i(O_0), for i = 0, 1, …, N−1.
For t = 1, 2, …, T−1 and i = 0, 1, …, N−1, compute

δ_t(i) = max_{j ∈ {0,1,…,N−1}} ( δ_{t−1}(j) a_{ji} ) b_i(O_t)

Note that at each t, the DP computes the best path ending at each state, up to that point. So the probability of the best overall path is max_j δ_{T−1}(j). This max only gives the best probability, not the best path itself; for that, see the next slide.

Dynamic Programming

To determine the optimal path: while computing the δ values, keep track of pointers to the best previous state. When finished, construct the optimal path by tracing back the pointers. For example, consider the temperature example: the probabilities for paths of length 1 are δ_0(i) = π_i b_i(O_0), and these are the only "paths" of length 1.

Dynamic Programming

Computing the probabilities for each path of length 2, we find that the best path of length 2 ending with H is CH, and the best path of length 2 ending with C is CC.

Dynamic Program

Continuing, we compute the best path ending at H and at C at each step, and we save the pointers. Why?

Dynamic Program

The best final score is 0.002822 and, thanks to the saved pointers, the best path is CCCH. But what about underflow? Underflow is a serious problem in bigger cases.
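The DP with back-pointers can be sketched as follows; for the running example it recovers the path CCCH with score 0.002822, as stated above:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# delta[t][i]: probability of the best path ending in state i at step t.
delta = [[pi[i] * B[i][O[0]] for i in range(N)]]
back = []                      # back-pointers, one row per step t >= 1
for t in range(1, T):
    row, ptr = [], []
    for i in range(N):
        j = max(range(N), key=lambda j: delta[t-1][j] * A[j][i])
        row.append(delta[t-1][j] * A[j][i] * B[i][O[t]])
        ptr.append(j)
    delta.append(row)
    back.append(ptr)

# Trace the pointers backwards to recover the best path.
i = max(range(N), key=lambda j: delta[T-1][j])
path = [i]
for ptr in reversed(back):
    i = ptr[i]
    path.append(i)
path.reverse()
print("".join("HC"[s] for s in path), round(max(delta[T-1]), 6))  # CCCH 0.002822
```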


Underflow Resistant DP

A common trick to prevent underflow: instead of multiplying probabilities, we add logarithms of probabilities. Why does this work? Because log(xy) = log x + log y, and adding logs does not tend toward 0. Note that we must avoid probabilities of 0.

Underflow Resistant DP

The underflow-resistant DP algorithm:
Let δ_0(i) = log( π_i b_i(O_0) ), for i = 0, 1, …, N−1.
For t = 1, 2, …, T−1 and i = 0, 1, …, N−1, compute

δ_t(i) = max_{j ∈ {0,1,…,N−1}} ( δ_{t−1}(j) + log(a_{ji}) + log(b_i(O_t)) )

The score of the best path is max_j δ_{T−1}(j). As before, we must also keep track of the paths.
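The same DP in log space, as a sketch; exponentiating the best log-score recovers the same 0.002822:

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# Same DP as before, but adding logs instead of multiplying probabilities.
delta = [[math.log(pi[i] * B[i][O[0]]) for i in range(N)]]
for t in range(1, T):
    delta.append([max(delta[t-1][j] + math.log(A[j][i]) for j in range(N))
                  + math.log(B[i][O[t]]) for i in range(N)])

best_log = max(delta[T-1])
print(round(math.exp(best_log), 6))   # 0.002822 -- same answer, no underflow
```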


HMM Scaling

It is trickier to prevent underflow in the HMM algorithms themselves. We consider Solution 3, since it includes Solutions 1 and 2. Recall that for t = 1, 2, …, T−1 and i = 0, 1, …, N−1,

α_t(i) = ( Σ α_{t−1}(j) a_{ji} ) b_i(O_t)

The idea is to normalize the alphas so that they sum to one; the algorithm is on the next slide.

HMM Scaling

Given α_t(i) = ( Σ α_{t−1}(j) a_{ji} ) b_i(O_t), write a_t(i) for the scaled version of α_t(i):
Let a_0(i) = α_0(i), for i = 0, 1, …, N−1.
Let c_0 = 1 / Σ_j a_0(j).
For i = 0, 1, …, N−1, let a_0(i) = c_0 a_0(i).
This takes care of the t = 0 case; the algorithm continues on the next slide…

HMM Scaling

For t = 1, 2, …, T−1, do the following:
For i = 0, 1, …, N−1, let a_t(i) = ( Σ_j a_{t−1}(j) a_{ji} ) b_i(O_t).
Let c_t = 1 / Σ_j a_t(j).
For i = 0, 1, …, N−1, let a_t(i) = c_t a_t(i).

HMM Scaling

It is easy to show, by induction, that

a_t(i) = c_0 c_1 … c_t α_t(i)    (♯)

So c_0 c_1 … c_t is the scaling factor at step t. It is also easy to show that a_t(i) = α_t(i) / Σ_j α_t(j), which implies that

Σ_i a_{T−1}(i) = 1    (♯♯)

HMM Scaling

Combining (♯) and (♯♯), we have

1 = Σ_i a_{T−1}(i) = c_0 c_1 … c_{T−1} Σ_i α_{T−1}(i) = c_0 c_1 … c_{T−1} P(O|λ)

Therefore, P(O|λ) = 1 / (c_0 c_1 … c_{T−1}). To avoid underflow, we compute

log P(O|λ) = − Σ_{j=0}^{T−1} log(c_j)
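A sketch of the scaled alpha pass; at T = 4 we can still run the unscaled version and confirm that −Σ log(c_j) equals log P(O|λ):

```python
import math

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 2]
N, T = 2, len(O)

# Scaled forward pass: normalize each alpha row to sum to 1, keeping the c_t.
c = []
a = [pi[i] * B[i][O[0]] for i in range(N)]
c.append(1.0 / sum(a))
scaled = [[c[0] * x for x in a]]
for t in range(1, T):
    a = [sum(scaled[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
         for i in range(N)]
    ct = 1.0 / sum(a)
    c.append(ct)
    scaled.append([ct * x for x in a])

log_pO = -sum(math.log(ct) for ct in c)

# Cross-check against the unscaled forward pass (fine at T = 4).
alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
assert abs(log_pO - math.log(sum(alpha[T-1]))) < 1e-12
```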


HMM Scaling

Similarly, we scale the betas, using the same factors c_t. For re-estimation, compute γ_t(i,j) and γ_t(i) using the original formulas, but with the scaled alphas and betas. This gives new values for λ = (A, B, π). It is an "easy exercise" to show that the re-estimates are exact when the scaled alphas and betas are used, since P(O|λ) cancels from the formulas. Use log P(O|λ) = −Σ log(c_j) to decide whether an iteration improves the model.

All Together Now

Complete pseudo-code for Solution 3.
Given: the observation sequence (O_0, O_1, …, O_{T−1}), and N and M.
Initialize λ = (A, B, π), where A is N×N, B is N×M, and π is 1×N, with π_i ≈ 1/N, a_ij ≈ 1/N, and b_j(k) ≈ 1/M; each matrix row is stochastic, but not uniform.
Also initialize maxIters = the maximum number of re-estimation steps, iters = 0, and oldLogProb = −∞.

Forward Algorithm

Run the forward algorithm (alpha pass), with scaling.

Backward Algorithm

Run the backward algorithm, or "beta pass", with scaling. Note: the betas use the same scaling factors as the alphas.

Gammas

Compute the gammas and di-gammas, using the scaled alphas and betas; the formulas are otherwise unchanged.

Re-Estimation

Re-estimate the model, again using the scaled gammas, so the formulas are unchanged.

Stopping Criteria

Check that the probability increases: in practice, we want logProb > oldLogProb + ε. Also, do not exceed the maximum number of iterations.

English Text Example

Suppose a Martian arrives on Earth, sees written English text, and wants to learn something about it. Martians know about HMMs, so the Martian strips out all non-letters, makes all letters lower-case, and is left with 27 symbols (26 letters plus word-space). It then trains an HMM on a long sequence of these symbols.
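The Martian's pre-processing step is simple to sketch (the sample text here is a stand-in for the 50,000-symbol corpus):

```python
import string

# Hypothetical toy text standing in for the real training corpus.
text = "The Quick Brown Fox, jumps over the lazy dog!"
clean = "".join(ch for ch in text.lower()
                if ch in string.ascii_lowercase + " ")
# 27 observation symbols: a..z map to 0..25, word-space maps to 26.
O = [26 if ch == " " else ord(ch) - ord("a") for ch in clean]
assert min(O) >= 0 and max(O) <= 26
print(len(O), "symbols")
```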


English Text

For the first training case, initialize N = 2 and M = 27, with the elements of A and π each about 1/2 and the elements of B each about 1/27. We use 50,000 symbols for training.
After the 1st iteration: log P(O|λ) ≈ −165097
After the 100th iteration: log P(O|λ) ≈ −137305

English Text

The matrices A and π converge during training. What does this tell us? The process started in hidden state 1 (not state 0), and we learn the transition probabilities between the hidden states. Nothing too interesting here; we don't care about the hidden states.

English Text

What about the B matrix? This is much more interesting… Why?

A Security Application

Suppose we want to detect metamorphic computer viruses. Such viruses vary their internal structure, but the function of the malware stays the same. If they are sufficiently variable, standard signature detection will fail. Can we use an HMM for detection? What should we use as the observation sequence? Is there really a "hidden" Markov process? What about N, M, and T? How many observation sequences are needed for training and scoring?

HMM for Metamorphic Detection

Split a set of "family" viruses into two subsets. Extract the opcodes from each virus, and append the opcodes from subset 1 to make one long sequence. Train an HMM on this opcode sequence (Problem 3), obtaining a model λ = (A, B, π). Then set a threshold: score the opcodes from the files in subset 2 and from "normal" files (Problem 1). Can you set a threshold that separates the two sets? If so, we may have a viable detection method.

HMM for Metamorphic Detection

Virus detection results from a recent paper show clear separation between the family-virus scores and the normal-file scores. This is good!

HMM Generalizations

Here we have assumed a Markov process of order 1: the current state depends only on the previous state and the transition matrix. We can instead use a higher-order Markov process, where the current state depends on the n previous states (higher order versus increased N is a trade-off). We can also let the A and B matrices depend on t. HMMs are often combined with other techniques (e.g., neural nets).

Generalizations

In some cases, a big limitation of HMMs is that positional information is not used. In many applications this is OK, or even desirable. But in some applications it is a serious limitation, for example in bioinformatics (DNA sequencing, protein alignment, and so on), where sequence alignment is crucial. Such applications use "profile HMMs" (PHMMs) instead of standard HMMs. The PHMM is the next topic…

References

A revealing introduction to hidden Markov models, M. Stamp, http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
A tutorial on hidden Markov models and selected applications in speech recognition, L.R. Rabiner, http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf

References

Hunting for metamorphic engines, W. Wong and M. Stamp, Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211-229
Hunting for undetectable metamorphic viruses, D. Lin and M. Stamp, Journal in Computer Virology, Vol. 7, No. 3, August 2011, pp. 201-214