Latent Dirichlet Allocation and Directed Graphical Models
William W. Cohen, Machine Learning 10-601


Presentation Transcript

Slide 1

Fast sampling and LDA

Slide 2

Announcements – 1/3

12/7 is the in-class final exam.

12/5 is presentation day for 10-805 students:
10 projects at 5 min/project
only self-directed 805 projects; reproducibility projects will not be presented
The final review session will be abbreviated, maybe just questions.
It's fine if not everything is finished on 12/5; I know the writeup is due 132 hours later.

Slide 3

Announcements – 2/3

Do you need AWS $$ for your project?

Email William and cc Rose.

Slide 4

Announcements – 3/3

Anant
Bo Chen
Chen Hu
Minxing
Ning
Janini
Rose
Tao
Yifan
Yuhan

I need good TAs for 10-405 this spring and 10-605 next fall!

Slide 5

Directed Graphical Models

Slide 6

DGMs: The "Burglar Alarm" example

Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

[Graph: Burglar and Earthquake are parents of Alarm; Alarm is the parent of Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, ..., Xi-1, Xi+1, ..., Xn) = Pr(Xi | parents(Xi))

Slide 7

DGMs: The "Burglar Alarm" example

Generative story (and joint distribution):
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm)

[Graph: Burglar and Earthquake are parents of Alarm; Alarm is the parent of Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, ..., Xi-1, Xi+1, ..., Xn) = Pr(Xi | parents(Xi))

Slide 8

DGMs: The "Burglar Alarm" example

Generative story:
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm)

[Graph: Burglar and Earthquake are parents of Alarm; Alarm is the parent of Phone Call]

You can also compute other quantities, e.g.:
Pr(Burglar=true | PhoneCall=true)
Pr(Burglar=true | PhoneCall=true, Earthquake=true)
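
A minimal sketch of this generative story in Python. The CPT numbers below are made up for illustration (the slides don't give any), and the last few lines estimate Pr(Burglar=true | PhoneCall=true) by crude rejection sampling rather than the exact inference discussed next.

import random

# Invented CPTs for the burglar-alarm network (illustration only).
P_BURGLAR, P_QUAKE = 0.01, 0.02
P_ALARM = {(True, True): 0.95, (True, False): 0.90,    # Pr(Alarm=true | b, e)
           (False, True): 0.20, (False, False): 0.01}
P_CALL = {True: 0.8, False: 0.05}                       # Pr(PhoneCall=true | Alarm=a)

def sample_world():
    """Run the generative story once, sampling b, e, a, c in topological order."""
    b = random.random() < P_BURGLAR
    e = random.random() < P_QUAKE
    a = random.random() < P_ALARM[(b, e)]
    c = random.random() < P_CALL[a]
    return b, e, a, c

# Rejection-sampling estimate of Pr(Burglar=true | PhoneCall=true).
samples = [sample_world() for _ in range(200000)]
calls = [s for s in samples if s[3]]
print(sum(s[0] for s in calls) / len(calls))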

Slide 9

Inference in DGMs

General problem: given evidence E1, ..., Ek, compute P(X | E1, ..., Ek) for any X

Big assumption: the graph is a polytree
(<= 1 undirected path between any nodes X, Y)

[Notation diagram: node X with parents U1, U2, children Y1, Y2, and the children's other parents Z1, Z2]

Slide 10

DGM Inference: P(X|E)

E+: causal support
E-: evidential support

Slide 11

DGM Inference: P(X|E+)

[Equation figure; annotations: "CPT table lookup", "recursive call to P(. | E+)", "evidence for Uj that doesn't go thru X"]

So far: a simple way of propagating requests for belief due to causal evidence up the tree.
I.e., info on Pr(X|E+) flows down.

Slide 12

Inference in Bayes nets: P(E-|X)

[Equation figure; annotations: "simplified using d-sep + polytree", "recursive call to P(E- | .)"]

So far: a simple way of propagating requests for belief due to evidential support down the tree.
I.e., info on Pr(E-|X) flows up.

Slide 13

Message Passing for BP

We reduced P(X|E) to a product of two recursively calculated parts:

P(X=x|E+)
  i.e., the CPT for X and a product of "forward" messages from parents

P(E-|X=x)
  i.e., a combination of "backward" messages from children, CPTs, and P(Z|E_Z\Yk), a simpler instance of P(X|E)

This can also be implemented by message-passing (belief propagation).
Messages are distributions – i.e., vectors.

Slide 14

Message Passing for BP

Top-level algorithm:
Pick one vertex as the "root"
Any node with only one edge is a "leaf"
Pass messages from the leaves to the root
Pass messages from the root to the leaves
Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E)
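
A minimal numeric sketch of the two-message idea on a tiny chain A -> B -> C, with all probability tables invented for illustration: the causal message into B is P(B|E+), the evidential message from C is P(E-|B), and their normalized product is P(B|E).

import numpy as np

# Toy chain A -> B -> C over binary variables; all numbers are made up.
P_A = np.array([0.6, 0.4])                  # Pr(A)
P_B_given_A = np.array([[0.7, 0.3],         # rows: a, cols: b
                        [0.2, 0.8]])
P_C_given_B = np.array([[0.9, 0.1],         # rows: b, cols: c
                        [0.3, 0.7]])

# Evidence: C = 1. Root the tree at B; messages flow from the leaves to the root.
pi_B = P_A @ P_B_given_A          # "forward"/causal message: P(B|E+) = sum_a Pr(a) Pr(B|a)
lam_B = P_C_given_B[:, 1]         # "backward"/evidential message: P(E-|B) = Pr(C=1|B)

posterior_B = pi_B * lam_B        # P(B|E) is proportional to P(B|E+) * P(E-|B)
posterior_B /= posterior_B.sum()
print(posterior_B)                # [Pr(B=0|C=1), Pr(B=1|C=1)]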

Slide 15

NAÏVE BAYES AS A DGM

Slide 16

Naïve Bayes as a DGM

For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Graph: Y -> X]

Pr(Y=y):
  onion      0.3
  economist  0.7

Pr(X=x|Y=y):
  onion      aardvark   0.034
  onion      ai         0.0067
  economist  aardvark   0.0000003
  ....
  economist  zymurgy    0.01000
  (a row for every word X)

[Figure: example documents labeled "o" (onion) and "e" (economist), e.g. "zymurgy forever!", "aardvarks?", "learn ai!"]

Slide 17

Naïve Bayes as a DGM

For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, with X inside a plate of size Nd and both inside a plate of size D]

Pr(Y=y):
  onion      0.3
  economist  0.7

Pr(X=x|Y=y):
  onion      aardvark   0.034
  onion      ai         0.0067
  economist  aardvark   0.0000003
  ....
  economist  zymurgy    0.01000
  (a row for every word X)
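
A short sketch of this generative story in Python, using the Pr(Y) values from the table above but an invented four-word vocabulary in place of the full Pr(X|Y) table.

import numpy as np

rng = np.random.default_rng(0)

classes = ["onion", "economist"]
prior = [0.3, 0.7]                                  # Pr(Y) from the slide
vocab = ["aardvark", "ai", "zymurgy", "economy"]    # toy vocabulary (invented)
word_probs = {"onion":     [0.40, 0.40, 0.10, 0.10],   # invented stand-ins for Pr(X|Y)
              "economist": [0.05, 0.15, 0.30, 0.50]}

def generate_doc(n_words=5):
    """One pass of the Naive Bayes generative story for a single document."""
    y = rng.choice(classes, p=prior)                          # yd ~ Pr(Y)
    words = rng.choice(vocab, size=n_words, p=word_probs[y])  # each xid ~ Pr(X|Y=yd)
    return y, list(words)

for _ in range(3):
    print(generate_doc())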

Slide 18

Naïve Bayes as a DGM

For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, with X inside a plate of size Nd and both inside a plate of size D]

Not described: how do we smooth for classes? For multinomials? How many classes are there? ....

Slide 19

Recap: smoothing for a binomial

Estimate θ = P(heads) for a binomial with MLE as

  θ = #heads / (#heads + #tails)

and with MAP as

  θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)

MLE: maximize Pr(D|θ)
MAP: maximize Pr(D|θ)Pr(θ)

Smoothing = a prior over the parameter θ
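
A quick numeric check of the two estimates, with invented counts (3 observed heads, 7 observed tails, plus 2 imaginary heads and 2 imaginary tails):

heads, tails = 3, 7           # observed counts (invented)
i_heads, i_tails = 2, 2       # imaginary counts supplied by the prior

theta_mle = heads / (heads + tails)
theta_map = (heads + i_heads) / (heads + tails + i_heads + i_tails)
print(theta_mle, theta_map)   # 0.3 vs about 0.357: the prior pulls the estimate toward 0.5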

Slide 20

Smoothing for a binomial as a DGM

MAP for dataset D with α1 heads and α2 tails:

[MAP formula; annotations: "#imaginary heads", "#imaginary tails"]

[Graph over γ, θ, and the dataset D]

MAP is a Viterbi-style inference: we want to find the max-probability parameter θ according to the posterior distribution.

Also: inference in a simple graph can be intractable if the conditional distributions are complicated.

Slide 21

Smoothing for a binomial as a DGM

MAP for dataset D with α1 heads and α2 tails:

[MAP formula; annotations: "#imaginary heads", "#imaginary tails", "α0 + α1"]

[Graph over γ, θ, and a flip F]

Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 22

Smoothing for a binomial as a DGM

MAP for dataset D with α1 heads and α2 tails:

[MAP formula; annotations: "α0 + α1 + α2", "αi = # imaginary rolls of i"]

[Graph over γ, θ, and a roll R]

Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 23

Recap: Naïve Bayes as a DGM

[Plate diagram: Y -> X, with X inside a plate of size Nd and both inside a plate of size D]

Now: let's turn Bayes up to 11 for naïve Bayes....

Slide 24

A more Bayesian Naïve Bayes

From a Dirichlet α:
  Draw a multinomial π over the K classes
From a Dirichlet β, for each class y=1...K:
  Draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Plate diagram: α -> π -> Y -> W <- γ <- β, with γ inside a plate of size K, W inside a plate of size Nd, and Y and W inside a plate of size D]
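
A compact numpy sketch of this fully Bayesian generative story; the sizes K, V, D, Nd and the concentration values below are made-up placeholders.

import numpy as np

rng = np.random.default_rng(1)

K, V, D, N_d = 3, 20, 5, 8          # classes, vocabulary size, documents, words per doc
alpha = np.full(K, 0.5)             # Dirichlet prior over class proportions
beta = np.full(V, 0.1)              # Dirichlet prior over each class's word distribution

pi = rng.dirichlet(alpha)                                    # pi ~ Dirichlet(alpha)
gamma = np.array([rng.dirichlet(beta) for _ in range(K)])    # gamma[y] ~ Dirichlet(beta)

docs, labels = [], []
for d in range(D):
    y = rng.choice(K, p=pi)                      # yd ~ Multinomial(pi)
    words = rng.choice(V, size=N_d, p=gamma[y])  # each xid ~ Multinomial(gamma[yd])
    labels.append(y)
    docs.append(list(words))
print(labels[0], docs[0])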

Slide 25

Unsupervised Naïve Bayes

From a Dirichlet α:
  Draw a multinomial π over the K classes
From a Dirichlet β, for each class y=1...K:
  Draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Plate diagram: same as the previous slide]

Slide 26

Unsupervised Naïve Bayes

The generative story is the same; what we do is different.

We want to both find values of γ and π and find values of Y for each example.

EM is a natural algorithm.

[Plate diagram: same as the previous slide]
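
A minimal EM sketch for this model, operating on a document-word count matrix; the function name, the pseudo-count treatment of the priors, and the tiny demo matrix are illustrative assumptions, not the course's reference implementation.

import numpy as np

def nb_em(counts, K, n_iters=50, alpha=1.0, beta=0.1, seed=0):
    """EM for unsupervised Naive Bayes: counts is a (D, V) word-count matrix."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    resp = rng.dirichlet(np.ones(K), size=D)           # random soft class assignments
    for _ in range(n_iters):
        # M-step: re-estimate pi and gamma from the soft assignments (with pseudo-counts).
        pi = (resp.sum(axis=0) + alpha) / (D + K * alpha)
        gamma = resp.T @ counts + beta                  # (K, V) expected word counts
        gamma /= gamma.sum(axis=1, keepdims=True)
        # E-step: recompute responsibilities in log space for numerical stability.
        log_p = np.log(pi) + counts @ np.log(gamma).T   # (D, K)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, gamma, resp

demo = np.array([[5, 1, 0, 0], [4, 2, 0, 1], [0, 0, 6, 2], [1, 0, 5, 3]])
pi, gamma, resp = nb_em(demo, K=2)
print(resp.round(2))    # soft cluster memberships for the four demo "documents"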

Slide 27

LDA (As a DGM)

Slide 28

The LDA Topic Model

Slide 29

Slide 30

Unsupervised NB vs LDA

[Two plate diagrams side by side, each with α -> π, β -> γ (in a K plate), and Y -> W (in Nd and D plates)]

Unsupervised NB: one Y per doc, one class prior.
LDA: one Y per word, a different class distribution for each doc.

Slide 31

Unsupervised NB vs LDA

[Left plate diagram (unsupervised NB): α -> π -> Y -> W <- γ <- β, with γ in a K plate, W in an Nd plate, Y and W in a D plate]
[Right plate diagram (LDA): α -> θd -> Zdi -> Wd <- γk <- β, with γk in a K plate, Zdi and Wd in an Nd plate, θd in a D plate]

Unsupervised NB: one Y per doc, one class prior.
LDA: one Z per word, a different class distribution θd for each doc.

Slide 32

LDA

Blei's motivation: start with the BOW assumption.

Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)

For each document d = 1, ..., M:
  Generate θd ~ D1(...)
  For each word n = 1, ..., Nd:
    generate wn ~ D2(. | θdn)

Now pick your favorite distributions for D1, D2.
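
Picking D1 = Dirichlet and D2 = the multinomial selected by a per-word topic gives LDA. A small sketch of that choice, with made-up sizes and hyperparameters:

import numpy as np

rng = np.random.default_rng(2)

K, V, M, N_d = 4, 30, 6, 10          # topics, vocabulary size, documents, words per doc
alpha, eta = 0.5, 0.1                # symmetric Dirichlet hyperparameters (invented)

topics = np.array([rng.dirichlet(np.full(V, eta)) for _ in range(K)])  # word dist per topic

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))    # per-document topic proportions
    words = []
    for n in range(N_d):
        z = rng.choice(K, p=theta_d)              # topic for this word position
        words.append(rng.choice(V, p=topics[z]))  # word drawn from that topic
    corpus.append(words)
print(corpus[0])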

Slide 33

Unsupervised NB vs LDA

[LDA plate diagram: α -> θd -> Zdi -> Wd <- γk <- β, with γk in a K plate, Zdi and Wd in an Nd plate, θd in a D plate]

Unsupervised NB clusters documents into latent classes.
LDA clusters word occurrences into latent classes (topics).

The smoothness of βk: the same word suggests the same topic.
The smoothness of θd: the same document suggests the same topic.

Slide 34

LDA's view of a document

Mixed membership model

Slide 35

LDA topics: top words w by Pr(w|Z=k)

[Figure: example topics Z=13, Z=22, Z=27, Z=19 with their top words]

Slide 36

SVM using 50 features: Pr(Z=k|θd)

[Figure: 50 topics vs. all words, SVM]

Slide 37

Gibbs Sampling for LDA

Slide 38

LDA

Latent Dirichlet Allocation

Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs Sampling

Wait, why is sampling called "learning" here? Here's the idea....

Slide 39

LDA

Gibbs sampling – works for any directed model!

Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
The sequence of samples comprises a Markov chain.
The stationary distribution of the chain is the joint distribution.

Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

Slide 40

[Model: α -> θ -> Zi, with Zi -> Xi and Zi -> Yi; shown both as a plate of size 2 and unrolled into Z1, X1, Y1 and Z2, X2, Y2]

I'll assume we know the parameters for Pr(X|Z) and Pr(Y|Z).

Slide 41

Initialize all the hidden variables randomly, then....

Pick Z1 ~ Pr(Z|x1, y1, θ)
Pick θ ~ Pr(θ|z1, z2, α)   (pick from the posterior)
Pick Z2 ~ Pr(Z|x2, y2, θ)
.
.
.
Pick Z1 ~ Pr(Z|x1, y1, θ)
Pick θ ~ Pr(θ|z1, z2, α)   (pick from the posterior)
Pick Z2 ~ Pr(Z|x2, y2, θ)

[Same model as the previous slide: α -> θ -> Z1, Z2; Zi -> Xi, Yi]

In a broad range of cases these will eventually converge to samples from the true joint distribution.
So we will have (a sample of) the true θ.
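
A concrete Gibbs sampler for a toy version of this model: Z is binary, theta = Pr(Z=1) gets a Beta prior, and the "known" Pr(X|Z), Pr(Y|Z) tables and the two observations are invented for illustration.

import numpy as np

rng = np.random.default_rng(3)

a1, a2 = 1.0, 1.0                # Beta prior (playing the role of alpha)
p_x = np.array([0.2, 0.8])       # Pr(X=1 | Z=0), Pr(X=1 | Z=1)
p_y = np.array([0.3, 0.9])       # Pr(Y=1 | Z=0), Pr(Y=1 | Z=1)
x = np.array([1, 0])             # observed x1, x2
y = np.array([1, 0])             # observed y1, y2

theta = rng.random()             # random initialization of the hidden variables
z = rng.integers(0, 2, size=2)
theta_samples = []
for sweep in range(5000):
    for i in range(2):           # Pick Zi ~ Pr(Z | xi, yi, theta)
        lik = np.array([1 - theta, theta])
        lik *= np.where(x[i] == 1, p_x, 1 - p_x)
        lik *= np.where(y[i] == 1, p_y, 1 - p_y)
        z[i] = rng.choice(2, p=lik / lik.sum())
    theta = rng.beta(a1 + z.sum(), a2 + len(z) - z.sum())   # theta from its posterior (Beta is conjugate)
    theta_samples.append(theta)
print(np.mean(theta_samples[1000:]))   # posterior-mean estimate of theta after burn-in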

Slide 42

Why does Gibbs sampling work?

Basic claim: when you sample x ~ P(X|y1, ..., yk), then if y1, ..., yk were sampled from the true joint, x will also be sampled from the true joint.

So the true joint is a "fixed point": you tend to stay there if you ever get there.

How long does it take to get there? It depends on the structure of the space of samples: how well-connected are the samples by the sampling steps?

Slide 43

Slide 44

Slide 45

LDA

Latent Dirichlet Allocation

Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs Sampling

What is collapsed Gibbs sampling?

Slide 46

[Same model, now with three examples: α -> θ -> Z1, Z2, Z3; Zi -> Xi, Yi. θ is integrated out ("collapsed").]

Initialize all the Z's randomly, then....

Pick Z1 ~ Pr(Z1|z2, z3, α)
Pick Z2 ~ Pr(Z2|z1, z3, α)
Pick Z3 ~ Pr(Z3|z1, z2, α)
.
.
.
Pick Z1 ~ Pr(Z1|z2, z3, α)
Pick Z2 ~ Pr(Z2|z1, z3, α)
Pick Z3 ~ Pr(Z3|z1, z2, α)

Converges to samples from the true joint ... and then we can estimate Pr(θ | α, sample of Z's).

Slide 47

[Same model: α -> θ -> Z1, Z2, Z3; Zi -> Xi, Yi]

Initialize all the Z's randomly, then....

Pick Z1 ~ Pr(Z1|z2, z3, α)

What's this distribution?

Slide 48

[Simpler case, dropping the X's and Y's: α -> θ -> Z1, Z2, Z3]

Initialize all the Z's randomly, then....

Pick Z1 ~ Pr(Z1|z2, z3, α)

Simpler case: what's this distribution?

It's called a Dirichlet-multinomial, and it looks like this: if there are k values for the Z's and nk = # Z's with value k, then it turns out:
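
The formula itself did not survive extraction; the standard Dirichlet-multinomial (Polya) marginal, which is presumably what the slide shows, is

\Pr(Z_1,\dots,Z_m \mid \alpha)
  = \int \Pr(Z_1,\dots,Z_m \mid \theta)\,\Pr(\theta \mid \alpha)\,d\theta
  = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(m + \sum_k \alpha_k\right)}
    \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}

where m is the number of Z's.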

Slide 49

[Simpler case: α -> θ -> Z1, Z2, Z3]

Initialize all the Z's randomly, then....

Pick Z1 ~ Pr(Z1|z2, z3, α)

It turns out that sampling from a Dirichlet-multinomial is very easy!

Notation:
  k values for the Z's
  nk = # Z's with value k
  Z = (Z1, ..., Zm)
  Z(-i) = (Z1, ..., Zi-1, Zi+1, ..., Zm)
  nk(-i) = # Z's with value k, excluding Zi
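
The slide's formula image is also missing here; with this notation, the standard collapsed conditional is

\Pr(Z_i = k \mid Z^{(-i)}, \alpha)
  = \frac{n_k^{(-i)} + \alpha_k}{(m - 1) + \sum_{k'} \alpha_{k'}}

so resampling Zi only requires the current counts of the other Z's.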

Slide 50

What about with downstream evidence?

[Model: α -> θ -> Zi -> Xi for i = 1..3, where each Xi also depends on topic parameters β[k] with prior η]

The Dirichlet-multinomial term captures the constraints on Zi via θ (from above, the causal direction); what about the constraints that come via β?

Pr(Z) = (1/c) * Pr(Z|E+) Pr(E-|Z)

Slide 51

Sampling for LDA

[LDA plate diagram: α -> θd -> Zdi -> Wd <- βk <- η, with βk in a K plate, Zdi and Wd in an Nd plate, and θd in a D plate]

Notation:
  k values for the Zd,i's
  Z(-d,i) = all the Z's but Zd,i
  nw,k = # Zd,i's with value k paired with Wd,i = w
  n*,k = # Zd,i's with value k
  nw,k(-d,i) = nw,k excluding Zd,i
  n*,k(-d,i) = n*,k excluding Zd,i
  n*,k d,(-i) = n*,k from doc d, excluding Zd,i

The update is a product of Pr(Z|E+) and Pr(E-|Z): the first factor is the fraction of the time Z=k in doc d, the second the fraction of the time W=w in topic k.
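
The formula image is not in the transcript; with the notation above, the usual collapsed Gibbs update, which matches the two "fraction of time" annotations, is

\Pr(Z_{d,i} = k \mid Z^{(-d,i)}, W, \alpha, \eta)
  \;\propto\;
  \left(n_{*,k}^{d,(-i)} + \alpha_k\right)
  \cdot
  \frac{n_{w,k}^{(-d,i)} + \eta}{n_{*,k}^{(-d,i)} + V\eta}

where w = Wd,i and V is the vocabulary size; the first factor is the (smoothed) fraction of the time topic k occurs in doc d, and the second the (smoothed) fraction of the time word w occurs in topic k.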

Slide 52

LDA (in Too Much Detail)

Slide 53

Way way more detail

Slide 54

More detail

Slide 55

Slide 56

Slide 57

What gets learned.....

Slide 58

In A Math-ier Notation

[Figure: the count tables N[*,k], N[d,k], and M[w,k]; N[*,*]=V]

Slide 59

for each document d and word position j in d:
  z[d,j] = k, a random topic
  N[d,k]++
  W[w,k]++, where w = id of the j-th word in d

Slide 60

for each pass t = 1, 2, ....:
  for each document d and word position j in d:
    z[d,j] = k, a new random topic
    update N, W to reflect the new assignment of z:
      N[d,k]++; N[d,k'] -- where k' is the old z[d,j]
      W[w,k]++; W[w,k'] -- where w is w[d,j]
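
A runnable sketch of the whole sampler, filling in "a new random topic" with the collapsed Gibbs update shown earlier; the array names follow the slides' N and W counts, while the tiny corpus, sizes, and hyperparameters are invented.

import numpy as np

rng = np.random.default_rng(4)

def lda_gibbs(docs, K, V, alpha=0.5, eta=0.1, n_passes=200):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    D = len(docs)
    N = np.zeros((D, K))            # N[d,k]: topic counts per document
    W = np.zeros((V, K))            # W[w,k]: topic counts per word type
    Wtot = np.zeros(K)              # total tokens assigned to each topic
    z = []
    for d, doc in enumerate(docs):  # initialization: random topic per word position
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            N[d, k] += 1; W[w, k] += 1; Wtot[k] += 1
    for _ in range(n_passes):       # sampling passes
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]     # remove this token's current assignment
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wtot[k_old] -= 1
                # collapsed conditional: (doc-topic + alpha) * (word-topic + eta) / (topic + V*eta)
                p = (N[d] + alpha) * (W[w] + eta) / (Wtot + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new     # record the new assignment and restore the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wtot[k_new] += 1
    return N, W, z

docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 5], [3, 4, 5, 5]]   # word ids, vocab size 6
N, W, z = lda_gibbs(docs, K=2, V=6)
print(N)    # how often each document used each topic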

Slide 61

Slide 62

[Figure: picking a topic z=1, z=2, or z=3 by dropping a unit-height uniform random draw onto the stacked topic probabilities]

You spend a lot of time sampling.
There's a loop over all topics here in the sampler.