William W. Cohen, Machine Learning 10-601

Slide 1
Fast sampling and LDA

Slide 2
Announcements – 1/3
12/7 is the in-class final exam
12/5 is presentation day for 10-805 students:
  10 projects at 5 min/project
  Only self-directed 805 projects; reproducibility projects will not be presented
  The final review session will be abbreviated (maybe just questions)
  It's fine if not everything is finished on 12/5; I know the writeup is due 132 hours later

Slide 3
Announcements – 2/3
Do you need AWS $$ for your project?
Email William and cc Rose

Slide 4
Announcements – 3/3
Anant
Bo Chen
Chen Hu
Minxing
Ning
Janini
Rose
Tao
Yifan
Yuhan
I need good TAs for 10-405 this spring and 10-605 next fall!

Slide 5
Directed Graphical Models

Slide 6
DGMs: The "Burglar Alarm" example
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, …, Xi-1, Xi+1, …, Xn) = Pr(Xi | parents(Xi))

Slide 7
DGMs: The "Burglar Alarm" example
Generative story (and joint distribution):
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, …, Xi-1, Xi+1, …, Xn) = Pr(Xi | parents(Xi))

Slide 8
DGMs: The "Burglar Alarm" example
Generative story:
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

You can also compute other quantities, e.g.:
Pr(Burglar=true | PhoneCall=true)
Pr(Burglar=true | PhoneCall=true, Earthquake=true)
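
To make the generative story concrete, here is a minimal sketch in Python; the CPT numbers are made up for illustration (the slides do not give any), and the query Pr(Burglar=true | PhoneCall=true) is estimated by simple rejection sampling rather than by the exact inference discussed next.

```python
import random

def sample_alarm_world():
    # Made-up CPT values for illustration; the slides don't give numbers.
    b = random.random() < 0.01                                   # Pr(Burglar)
    e = random.random() < 0.02                                   # Pr(Earthquake)
    p_alarm = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}       # four binomials
    a = random.random() < p_alarm[(b, e)]                        # Pr(Alarm | b, e)
    c = random.random() < (0.9 if a else 0.05)                   # Pr(PhoneCall | Alarm)
    return b, e, a, c

# Estimate Pr(Burglar=true | PhoneCall=true) by rejection sampling:
kept = burglaries = 0
for _ in range(200_000):
    b, e, a, c = sample_alarm_world()
    if c:                        # keep only samples consistent with the evidence
        kept += 1
        burglaries += b
print(burglaries / max(kept, 1))
```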

Slide 9
Inference in DGMs
General problem: given evidence E1, …, Ek, compute P(X | E1, …, Ek) for any X
Big assumption: the graph is a "polytree" (<= 1 undirected path between any nodes X, Y)

Notation: [Figure: a node X with parents U1, U2 and children Y1, Y2; Z1, Z2 are the other parents of the Y's]

Slide 10
DGM Inference: P(X|E)
E+: causal support (evidence connected to X through its parents)
E-: evidential support (evidence connected to X through its children)

Slide 11
DGM Inference: P(X|E+)
[Figure: the recursive formula for P(X|E+), annotated as a CPT table lookup combined with a recursive call to P(.|E+) for each parent Uj, using the evidence for Uj that doesn't go thru X]
So far: a simple way of propagating requests for "belief due to causal evidence" up the tree; i.e., info on Pr(X|E+) flows down.

Slide 12
Inference in Bayes nets: P(E-|X)
[Figure: the recursive formula for P(E-|X), simplified using d-separation + the polytree assumption; each term involves a recursive call to P(E-|.)]
So far: a simple way of propagating requests for "belief due to evidential support" down the tree; i.e., info on Pr(E-|X) flows up.

Slide 13
Message Passing for BP
We reduced P(X|E) to the product of two recursively calculated parts:
P(X=x|E+), i.e., the CPT for X and the product of "forward" messages from its parents
P(E-|X=x), i.e., a combination of "backward" messages from its children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E)
This can also be implemented by message passing (belief propagation). Messages are distributions, i.e., vectors.

Slide 14
Message Passing for BP
Top-level algorithm:
Pick one vertex as the "root"
Any node with only one edge is a "leaf"
Pass messages from the leaves to the root
Pass messages from the root to the leaves
Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E)
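
As a sanity check on the decomposition P(X|E) proportional to P(X|E+) * P(E-|X), here is a minimal sketch on a three-node chain X1 -> X2 -> X3 (a trivially small polytree) with invented CPTs: the "forward" message into X2 carries the causal support from X1, the "backward" message carries the evidential support from the observed X3, and their normalized product is P(X2 | X3).

```python
import numpy as np

# Invented CPTs for a tiny chain X1 -> X2 -> X3, all variables binary.
p_x1 = np.array([0.6, 0.4])                   # P(X1)
p_x2_given_x1 = np.array([[0.9, 0.1],         # P(X2 | X1=0)
                          [0.3, 0.7]])        # P(X2 | X1=1)
p_x3_given_x2 = np.array([[0.8, 0.2],         # P(X3 | X2=0)
                          [0.1, 0.9]])        # P(X3 | X2=1)

x3_obs = 1                                    # evidence: X3 = 1

# "Forward" (causal) message into X2: P(X2 | E+) = sum_x1 P(x1) P(X2 | x1)
forward = p_x1 @ p_x2_given_x1

# "Backward" (evidential) message into X2: P(E- | X2) = P(X3 = x3_obs | X2)
backward = p_x3_given_x2[:, x3_obs]

# Combine and normalize: P(X2 | E) is proportional to P(X2 | E+) * P(E- | X2)
posterior = forward * backward
posterior /= posterior.sum()
print(posterior)                              # P(X2 | X3 = 1)
```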

Slide 15
NAÏVE BAYES AS A DGM

Slide 16
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

Pr(Y=y):
  onion      0.3
  economist  0.7

Pr(X|Y=y), for every X:
  onion      aardvark   0.034
  onion      ai         0.0067
  …          …          …
  economist  aardvark   0.0000003
  ….
  economist  zymurgy    0.01000

[Figure: DGM with Y pointing to each word X; two toy documents labeled o (onion) and e (economist), with words like "zymurgy forever!", "aardvarks?", "learn ai!"]

Slide 17
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, with X inside a plate repeated Nd times (the words of d), nested inside a plate repeated D times (the documents). The same Pr(Y=y) and Pr(X|Y=y) tables as on the previous slide, for every X.]
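
A minimal sketch of this generative story in Python; the class prior matches the 0.3/0.7 table above, but the per-class word probabilities are toy values, since the full Pr(X|Y=y) table is only partially shown.

```python
import random

pr_y = {"onion": 0.3, "economist": 0.7}                  # Pr(Y), from the table above
pr_x_given_y = {                                         # Pr(X | Y=y), toy values
    "onion":     {"aardvark": 0.4, "ai": 0.3, "zymurgy": 0.3},
    "economist": {"aardvark": 0.1, "ai": 0.6, "zymurgy": 0.3},
}

def draw(dist):
    """Draw one item from a {value: probability} dict."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value                                         # guard against rounding

def generate_document(n_words):
    y = draw(pr_y)                                       # pick a label y_d from Pr(Y)
    words = [draw(pr_x_given_y[y]) for _ in range(n_words)]  # each word from Pr(X|Y=y_d)
    return y, words

print(generate_document(5))
```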

Slide 18
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, plates Nd and D, as before]

Not described: how do we smooth for classes? For multinomials? How many classes are there? ….

Slide 19
Recap: smoothing for a binomial
Estimate θ = Pr(heads) for a binomial with MLE as:
  θ = #heads / (#heads + #tails)
and with MAP as:
  θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
MLE: maximize Pr(D|θ)
MAP: maximize Pr(D|θ)Pr(θ)
Smoothing = a prior over the parameter θ
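
In code, the difference between the two estimates is just the imaginary counts. A sketch, with the prior interpreted as add-pseudo-count smoothing:

```python
def mle_heads(heads, tails):
    # MLE: maximize Pr(D | theta)
    return heads / (heads + tails)

def smoothed_heads(heads, tails, imaginary_heads=1, imaginary_tails=1):
    # MAP-style smoothing: add the imaginary counts from the prior
    return (heads + imaginary_heads) / (heads + tails + imaginary_heads + imaginary_tails)

print(mle_heads(3, 0))         # 1.0  (overfits a tiny sample)
print(smoothed_heads(3, 0))    # 0.8  (pulled back toward the prior)
```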

Slide 20
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> D, where γ = (α1, α2) are the imaginary counts]
MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ under the posterior distribution.
Also: inference in a simple graph can be intractable if the conditional distributions are complicated.

Slide 21
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> F (the coin flips), with the prior's pseudo-counts written as α0 + α1]
Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 22
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> R (rolls of a die), with prior pseudo-counts α0 + α1 + α2, where αi = # imaginary rolls of i]
Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 23
Recap: Naïve Bayes as a DGM
[Plate diagram: Y -> X, plates Nd and D]
Now: let's turn Bayes up to 11 for naïve Bayes….

Slide 24
A more Bayesian Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Plate diagram: α -> π -> Y -> W <- γ <- β, with W inside plate Nd, Y and W inside plate D, and γ inside plate K]
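
A sketch of this fuller generative story using numpy's Dirichlet and multinomial samplers; K, the vocabulary, and the hyperparameter values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2
vocab = ["aardvark", "ai", "zymurgy"]
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet hyperparameters (made up)

pi = rng.dirichlet([alpha] * K)                                   # pi ~ Dirichlet(alpha)
gamma = [rng.dirichlet([beta] * len(vocab)) for _ in range(K)]    # gamma[y] ~ Dirichlet(beta)

def generate_document(n_words):
    y = rng.choice(K, p=pi)                                    # one label y_d per document
    words = list(rng.choice(vocab, size=n_words, p=gamma[y]))  # x_id ~ gamma[y_d]
    return y, words

print(generate_document(6))
```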

Slide 25
Unsupervised Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Same plate diagram as the previous slide]

Slide 26
Unsupervised Naïve Bayes
The generative story is the same; what we do with it is different.
We want to both find values of γ and π and find values of Y for each example.
EM is a natural algorithm.
[Same plate diagram as before]

Slide 27
LDA (As a DGM)

Slide 28
The LDA Topic Model

Slide 29
[Figure]

Slide 30
Unsupervised NB vs LDA
[Two plate diagrams side by side, both with α -> π, β -> γ (K of them), and W inside plates Nd and D. Unsupervised NB: one Y per doc, one class prior. LDA: one Y per word, a different class distribution for each doc.]

Slide 31
Unsupervised NB vs LDA
[Two plate diagrams. Unsupervised NB: one Y per doc, drawn from a single class prior π. LDA: one Z per word (Zdi), drawn from a different class distribution θd for each doc; words Wd come from topic multinomials γk, k=1..K, with Dirichlet priors α (on θd) and β (on γk).]

Slide 32
LDA
Blei's motivation: start with the BOW assumption.
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)
For each document d = 1,…,M:
  Generate θd ~ D1(…)
  For each word n = 1,…,Nd:
    Generate wdn ~ D2(. | θd)
Now pick your favorite distributions for D1, D2.
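
For contrast with the naïve Bayes sketches above, here is the LDA generative story in the same style, with D1 a Dirichlet over topic proportions and D2 the topic-specific word multinomial; all values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
vocab = ["gene", "brain", "data", "model", "neuron"]
alpha, eta = 0.5, 0.1                                  # hyperparameters (made up)

# One word distribution per topic: D2's parameters, one Dirichlet draw per topic.
topic_word = rng.dirichlet([eta] * len(vocab), size=K)

def generate_document(n_words):
    theta_d = rng.dirichlet([alpha] * K)               # theta_d ~ D1: per-doc topic mixture
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta_d)                   # a fresh topic for every word
        doc.append(rng.choice(vocab, p=topic_word[z])) # w ~ D2(. | topic z)
    return doc

print(generate_document(8))
```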

Slide 33
Unsupervised NB vs LDA
[Plate diagram for LDA: α -> θd -> Zdi -> Wdi <- γk <- β, with plates Nd, D, and K]
Unsupervised NB clusters documents into latent classes.
LDA clusters word occurrences into latent classes (topics).
The smoothness of βk: the same word suggests the same topic.
The smoothness of θd: the same document suggests the same topic.

Slide 34
LDA's view of a document
Mixed membership model
[Figure]

Slide 35
LDA topics: top words w by Pr(w|Z=k)
[Figure: four example topics, Z=13, Z=22, Z=27, Z=19, each shown with its top words]

Slide 36
SVM using 50 features: Pr(Z=k|θd)
[Figure: results comparing 50 topics vs all words as SVM features]

Slide 37
Gibbs Sampling for LDA

Slide 38
LDA = Latent Dirichlet Allocation
Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs sampling
Wait, why is sampling called "learning" here? Here's the idea….

Slide 39
LDA
Gibbs sampling works for any directed model!
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
The sequence of samples comprises a Markov chain; the stationary distribution of the chain is the joint distribution.
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

Slide 40
[Figure: a toy model α -> θ -> Z, with Z -> X and Z -> Y; unrolled over two examples as Z1 -> (X1, Y1) and Z2 -> (X2, Y2), with both Z's sharing the same θ]
I'll assume we know the parameters for Pr(X|Z) and Pr(Y|Z).

Slide 41
Initialize all the hidden variables randomly, then….
Pick Z1 ~ Pr(Z | x1, y1, θ)
Pick Z2 ~ Pr(Z | x2, y2, θ)
Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
Pick Z1 ~ Pr(Z | x1, y1, θ)
Pick Z2 ~ Pr(Z | x2, y2, θ)
Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
...
In a broad range of cases these will eventually converge to samples from the true joint distribution, so we will have (a sample of) the true θ.
[Same figure as the previous slide]
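
A sketch of this alternating Gibbs sampler for the toy model, assuming binary Z with θ = Pr(Z=1), a Beta prior, and known (made-up) emission probabilities Pr(X|Z) and Pr(Y|Z). Each sweep resamples every Zi from its conditional and then resamples θ from its Beta posterior given the current Z's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known emission probabilities (made-up values): Pr(X=1 | Z) and Pr(Y=1 | Z).
p_x = {0: 0.2, 1: 0.8}
p_y = {0: 0.3, 1: 0.9}
alpha = (1.0, 1.0)                        # Beta prior on theta = Pr(Z=1)

# Observed (x_i, y_i) pairs, made up for the sketch.
data = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def bern(p, value):
    return p if value == 1 else 1.0 - p

z = rng.integers(0, 2, size=len(data))    # initialize the hidden Z's randomly
theta = 0.5

for sweep in range(1000):
    # Pick Z_i ~ Pr(Z | x_i, y_i, theta) for every example
    for i, (x, y) in enumerate(data):
        w1 = theta * bern(p_x[1], x) * bern(p_y[1], y)
        w0 = (1 - theta) * bern(p_x[0], x) * bern(p_y[0], y)
        z[i] = rng.random() < w1 / (w0 + w1)
    # Pick theta ~ Pr(theta | z_1..z_n, alpha): the Beta posterior
    theta = rng.beta(alpha[0] + z.sum(), alpha[1] + len(z) - z.sum())

print(theta, z)                            # (a sample of) theta and the Z assignments
```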

Slide 42
Why does Gibbs sampling work?
Basic claim: when you sample x ~ P(X | y1,…,yk), then if y1,…,yk were sampled from the true joint, x will also be sampled from the true joint.
So the true joint is a "fixed point": you tend to stay there if you ever get there.
How long does it take to get there? That depends on the structure of the space of samples: how well connected are the samples by the sampling steps?

Slide 43
[Figure]

Slide 44
[Figure]

Slide 45
LDA = Latent Dirichlet Allocation
Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs sampling
What is collapsed Gibbs sampling?

Slide 46
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Pick Z2 ~ Pr(Z2 | z1, z3, α)
Pick Z3 ~ Pr(Z3 | z1, z2, α)
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Pick Z2 ~ Pr(Z2 | z1, z3, α)
Pick Z3 ~ Pr(Z3 | z1, z2, α)
...
Converges to samples from the true joint … and then we can estimate Pr(θ | α, sample of Z's).
[Figure: the same toy model with three examples (Z1, X1, Y1), (Z2, X2, Y2), (Z3, X3, Y3) sharing θ; θ is no longer sampled, it has been integrated ("collapsed") out]

Slide 47
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
What's this distribution?
[Same figure as the previous slide]

Slide 48
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Simpler case (just α -> θ -> Z1, Z2, Z3, with no X's or Y's): what's this distribution?
It's called a Dirichlet-multinomial, and it looks like this: if there are k values for the Z's and nk = # Z's with value k, then it turns out:
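
For reference, the standard Dirichlet-multinomial (the marginal of Z1,…,Zm with θ integrated out) is, in the slide's notation:

```latex
\Pr(Z_1,\dots,Z_m \mid \alpha)
  = \int \Pr(Z_1,\dots,Z_m \mid \theta)\,\Pr(\theta \mid \alpha)\,d\theta
  = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(m + \sum_k \alpha_k\right)}
    \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
```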

Slide 49
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation:
  k values for the Z's
  nk = # Z's with value k
  Z = (Z1,…, Zm)
  Z(-i) = (Z1,…, Zi-1, Zi+1,…, Zm)
  nk(-i) = # Z's with value k, excluding Zi
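
In this notation, the standard conditional for a single Zi given the rest (which is what makes the sampling easy) is:

```latex
\Pr\!\left(Z_i = k \mid Z^{(-i)}, \alpha\right)
  = \frac{n_k^{(-i)} + \alpha_k}{m - 1 + \sum_{k'} \alpha_{k'}}
```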

Slide 50
What about with downstream evidence?
[Figure: the model extended with observations: α -> θ -> Zi -> Xi, where each topic's word distribution β[k] (with prior η) generates the Xi's]
The collapsed conditional captures the constraints on Zi via θ (from "above", the "causal" direction); what about the constraints via β?
Pr(Z) = (1/c) * Pr(Z|E+) Pr(E-|Z)

Slide 51
Sampling for LDA
[Plate diagram: α -> θd -> Zdi -> Wdi <- βk <- η, with plates Nd, D, and K]
Notation:
  k values for the Zd,i's
  Z(-d,i) = all the Z's but Zd,i
  nw,k = # Zd,i's with value k that are paired with Wd,i = w
  n*,k = # Zd,i's with value k
  nw,k(-d,i) = nw,k excluding Zd,i
  n*,k(-d,i) = n*,k excluding Zd,i
  n*,k d,(-i) = n*,k from doc d, excluding Zd,i
The sampling distribution is the product Pr(Z|E+) · Pr(E-|Z): the first factor is the fraction of the time Z=k in doc d, the second is the fraction of the time W=w in topic k.
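
Written out, the standard collapsed-Gibbs conditional for LDA that this product corresponds to is (with w = Wd,i, V the vocabulary size, and a symmetric prior η assumed on the topic-word multinomials):

```latex
\Pr\!\left(Z_{d,i}=k \mid Z^{(-d,i)}, W, \alpha, \eta\right)
  \;\propto\;
  \left(n_{*,k}^{\,d,(-i)} + \alpha_k\right)
  \cdot
  \frac{n_{w,k}^{(-d,i)} + \eta}{\,n_{*,k}^{(-d,i)} + V\eta\,}
```

The first factor is the (smoothed) count of how often Z=k in doc d; the second is the smoothed fraction of topic k's words that are w.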

Slide 52
LDA (in Too Much Detail)

Slide 53
Way way more detail
[Figure]

Slide 54
More detail
[Figure]

Slide 55
[Figure]

Slide 56
[Figure]

Slide 57
What gets learned…..

Slide 58
In A Math-ier Notation
[Figure: the sampling formula written with count tables N[d,k], M[w,k], N[*,k], and N[*,*] = V]

Slide 59
for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++, where w = id of the j-th word in d

Slide 60
for each pass t = 1, 2, ….:
    for each document d and word position j in d:
        z[d,j] = k, a new random topic
        update N, W to reflect the new assignment of z:
            N[d,k]++; N[d,k'] --, where k' is the old z[d,j]
            W[w,k]++; W[w,k'] --, where w is w[d,j]
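
Putting the pseudocode and the sampling formula together, here is a compact sketch of the collapsed Gibbs sampler; it uses symmetric priors α and η, takes docs as a list of lists of integer word ids, and all of the names are mine rather than from the slides.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, n_passes=50, seed=0):
    rng = np.random.default_rng(seed)
    N = np.zeros((len(docs), K))           # N[d,k]: topic counts per document
    W = np.zeros((V, K))                   # W[w,k]: topic counts per word type
    Wsum = np.zeros(K)                     # total number of words assigned to each topic
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]   # random initial topics

    for d, doc in enumerate(docs):         # fill in the count tables
        for j, w in enumerate(doc):
            k = z[d][j]
            N[d, k] += 1; W[w, k] += 1; Wsum[k] += 1

    for _ in range(n_passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]            # remove the old assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wsum[k_old] -= 1
                # unnormalized Pr(z = k | everything else), for all K topics at once
                p = (N[d] + alpha) * (W[w] + eta) / (Wsum + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new            # record the new assignment in the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wsum[k_new] += 1
    return z, N, W
```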

Slide 61
[Figure]

Slide 62
[Figure: the unnormalized probabilities for z=1, z=2, z=3, … stacked against a unit-height random draw]
You spend a lot of time sampling.
There's a loop over all topics here in the sampler.
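
The "unit height / random" picture is the usual trick for sampling from an unnormalized distribution: draw one uniform number and walk the cumulative weights, which is exactly where the loop over all topics shows up. A minimal sketch:

```python
import random

def sample_topic(weights):
    """Sample an index k proportionally to the (unnormalized) weights."""
    r = random.random() * sum(weights)     # one uniform draw, scaled to the total weight
    acc = 0.0
    for k, w in enumerate(weights):        # the loop over all topics
        acc += w
        if r <= acc:
            return k
    return len(weights) - 1                # guard against floating-point rounding
```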