William W. Cohen, Machine Learning 10-601

Slide 1
Fast sampling and LDA

Slide 2
Announcements – 1/3
12/7 is the in-class final exam
12/5 is presentation day for 10-805 students:
  10 projects at 5 min/project
  Only self-directed 805 projects; reproducibility projects will not be presented
  The final review session will be abbreviated (maybe just questions)
  It's fine if not everything is finished on 12/5; I know the writeup is due 132 hours later

Slide 3
Announcements – 2/3
Do you need AWS $$ for your project?
Email William and cc Rose

Slide 4
Announcements – 3/3
Anant
Bo Chen
Chen Hu
Minxing
Ning
Janini
Rose
Tao
Yifan
Yuhan
I need good TAs for 10-405 this spring and 10-605 next fall!

Slide 5
Directed Graphical Models

Slide 6
DGMs: The "Burglar Alarm" example
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, …, Xi-1, Xi+1, …, Xn) = Pr(Xi | parents(Xi))

Slide 7
DGMs: The "Burglar Alarm" example
Generative story (and joint distribution):
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

Node ~ random variable
Arcs define the form of the probability distribution:
Pr(Xi | X1, …, Xi-1, Xi+1, …, Xn) = Pr(Xi | parents(Xi))

Slide 8
DGMs: The "Burglar Alarm" example
Generative story:
Pick b ~ Pr(Burglar), a binomial
Pick e ~ Pr(Earthquake), a binomial
Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
Pick c ~ Pr(PhoneCall | Alarm=a)

[Figure: Burglar -> Alarm <- Earthquake; Alarm -> Phone Call]

You can also compute other quantities, e.g.:
Pr(Burglar=true | PhoneCall=true)
Pr(Burglar=true | PhoneCall=true, Earthquake=true)
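
To make the generative story concrete, here is a minimal sketch in Python; the CPT numbers are made up for illustration (the slides do not give any), and the query Pr(Burglar=true | PhoneCall=true) is estimated by simple rejection sampling rather than by the exact inference discussed next.

```python
import random

def sample_alarm_world():
    # Made-up CPT values for illustration; the slides don't give numbers.
    b = random.random() < 0.01                                   # Pr(Burglar)
    e = random.random() < 0.02                                   # Pr(Earthquake)
    p_alarm = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}       # four binomials
    a = random.random() < p_alarm[(b, e)]                        # Pr(Alarm | b, e)
    c = random.random() < (0.9 if a else 0.05)                   # Pr(PhoneCall | Alarm)
    return b, e, a, c

# Estimate Pr(Burglar=true | PhoneCall=true) by rejection sampling:
kept = burglaries = 0
for _ in range(200_000):
    b, e, a, c = sample_alarm_world()
    if c:                        # keep only samples consistent with the evidence
        kept += 1
        burglaries += b
print(burglaries / max(kept, 1))
```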

Slide 9
Inference in DGMs
General problem: given evidence E1, …, Ek, compute P(X | E1, …, Ek) for any X
Big assumption: the graph is a "polytree" (<= 1 undirected path between any nodes X, Y)

Notation: [Figure: a node X with parents U1, U2 and children Y1, Y2; Z1, Z2 are the other parents of the Y's]

Slide 10
DGM Inference: P(X|E)
E+: causal support (evidence connected to X through its parents)
E-: evidential support (evidence connected to X through its children)

Slide 11
DGM Inference: P(X|E+)
[Figure: the recursive formula for P(X|E+), annotated as a CPT table lookup combined with a recursive call to P(.|E+) for each parent Uj, using the evidence for Uj that doesn't go thru X]
So far: a simple way of propagating requests for "belief due to causal evidence" up the tree; i.e., info on Pr(X|E+) flows down.

Slide 12
Inference in Bayes nets: P(E-|X)
[Figure: the recursive formula for P(E-|X), simplified using d-separation + the polytree assumption; each term involves a recursive call to P(E-|.)]
So far: a simple way of propagating requests for "belief due to evidential support" down the tree; i.e., info on Pr(E-|X) flows up.

Slide 13
Message Passing for BP
We reduced P(X|E) to the product of two recursively calculated parts:
P(X=x|E+), i.e., the CPT for X and the product of "forward" messages from its parents
P(E-|X=x), i.e., a combination of "backward" messages from its children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E)
This can also be implemented by message passing (belief propagation). Messages are distributions, i.e., vectors.

Slide 14
Message Passing for BP
Top-level algorithm:
Pick one vertex as the "root"
Any node with only one edge is a "leaf"
Pass messages from the leaves to the root
Pass messages from the root to the leaves
Now every X has received P(X|E+) and P(E-|X) and can compute P(X|E)
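
As a sanity check on the decomposition P(X|E) proportional to P(X|E+) * P(E-|X), here is a minimal sketch on a three-node chain X1 -> X2 -> X3 (a trivially small polytree) with invented CPTs: the "forward" message into X2 carries the causal support from X1, the "backward" message carries the evidential support from the observed X3, and their normalized product is P(X2 | X3).

```python
import numpy as np

# Invented CPTs for a tiny chain X1 -> X2 -> X3, all variables binary.
p_x1 = np.array([0.6, 0.4])                   # P(X1)
p_x2_given_x1 = np.array([[0.9, 0.1],         # P(X2 | X1=0)
                          [0.3, 0.7]])        # P(X2 | X1=1)
p_x3_given_x2 = np.array([[0.8, 0.2],         # P(X3 | X2=0)
                          [0.1, 0.9]])        # P(X3 | X2=1)

x3_obs = 1                                    # evidence: X3 = 1

# "Forward" (causal) message into X2: P(X2 | E+) = sum_x1 P(x1) P(X2 | x1)
forward = p_x1 @ p_x2_given_x1

# "Backward" (evidential) message into X2: P(E- | X2) = P(X3 = x3_obs | X2)
backward = p_x3_given_x2[:, x3_obs]

# Combine and normalize: P(X2 | E) is proportional to P(X2 | E+) * P(E- | X2)
posterior = forward * backward
posterior /= posterior.sum()
print(posterior)                              # P(X2 | X3 = 1)
```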

Slide 15
NAÏVE BAYES AS A DGM

Slide 16
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

Pr(Y=y):
  onion      0.3
  economist  0.7

Pr(X|Y=y), for every X:
  onion      aardvark   0.034
  onion      ai         0.0067
  …          …          …
  economist  aardvark   0.0000003
  ….
  economist  zymurgy    0.01000

[Figure: DGM with Y pointing to each word X; two toy documents labeled o (onion) and e (economist), with words like "zymurgy forever!", "aardvarks?", "learn ai!"]

Slide 17
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, with X inside a plate repeated Nd times (the words of d), nested inside a plate repeated D times (the documents). The same Pr(Y=y) and Pr(X|Y=y) tables as on the previous slide, for every X.]
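
A minimal sketch of this generative story in Python; the class prior matches the 0.3/0.7 table above, but the per-class word probabilities are toy values, since the full Pr(X|Y=y) table is only partially shown.

```python
import random

pr_y = {"onion": 0.3, "economist": 0.7}                  # Pr(Y), from the table above
pr_x_given_y = {                                         # Pr(X | Y=y), toy values
    "onion":     {"aardvark": 0.4, "ai": 0.3, "zymurgy": 0.3},
    "economist": {"aardvark": 0.1, "ai": 0.6, "zymurgy": 0.3},
}

def draw(dist):
    """Draw one item from a {value: probability} dict."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value                                         # guard against rounding

def generate_document(n_words):
    y = draw(pr_y)                                       # pick a label y_d from Pr(Y)
    words = [draw(pr_x_given_y[y]) for _ in range(n_words)]  # each word from Pr(X|Y=y_d)
    return y, words

print(generate_document(5))
```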

Slide 18
Naïve Bayes as a DGM
For each document d in the corpus (of size D):
  Pick a label yd from Pr(Y)
  For each word in d (of length Nd):
    Pick a word xid from Pr(X|Y=yd)

[Plate diagram: Y -> X, plates Nd and D, as before]

Not described: how do we smooth for classes? For multinomials? How many classes are there? ….

Slide 19
Recap: smoothing for a binomial
Estimate θ = Pr(heads) for a binomial with MLE as:
  θ = #heads / (#heads + #tails)
and with MAP as:
  θ = (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
MLE: maximize Pr(D|θ)
MAP: maximize Pr(D|θ)Pr(θ)
Smoothing = a prior over the parameter θ
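
In code, the difference between the two estimates is just the imaginary counts. A sketch, with the prior interpreted as add-pseudo-count smoothing:

```python
def mle_heads(heads, tails):
    # MLE: maximize Pr(D | theta)
    return heads / (heads + tails)

def smoothed_heads(heads, tails, imaginary_heads=1, imaginary_tails=1):
    # MAP-style smoothing: add the imaginary counts from the prior
    return (heads + imaginary_heads) / (heads + tails + imaginary_heads + imaginary_tails)

print(mle_heads(3, 0))         # 1.0  (overfits a tiny sample)
print(smoothed_heads(3, 0))    # 0.8  (pulled back toward the prior)
```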

Slide 20
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> D, where γ = (α1, α2) are the imaginary counts]
MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ under the posterior distribution.
Also: inference in a simple graph can be intractable if the conditional distributions are complicated.

Slide 21
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> F (the coin flips), with the prior's pseudo-counts written as α0 + α1]
Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 22
Smoothing for a binomial as a DGM
MAP for dataset D with α1 imaginary heads and α2 imaginary tails:
[Figure: DGM γ -> θ -> R (rolls of a die), with prior pseudo-counts α0 + α1 + α2, where αi = # imaginary rolls of i]
Comment: the conjugate prior for a multinomial is called a Dirichlet.

Slide 23
Recap: Naïve Bayes as a DGM
[Plate diagram: Y -> X, plates Nd and D]
Now: let's turn Bayes up to 11 for naïve Bayes….

Slide 24
A more Bayesian Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Plate diagram: α -> π -> Y -> W <- γ <- β, with W inside plate Nd, Y and W inside plate D, and γ inside plate K]
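
A sketch of this fuller generative story using numpy's Dirichlet and multinomial samplers; K, the vocabulary, and the hyperparameter values are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2
vocab = ["aardvark", "ai", "zymurgy"]
alpha, beta = 0.5, 0.1                 # symmetric Dirichlet hyperparameters (made up)

pi = rng.dirichlet([alpha] * K)                                   # pi ~ Dirichlet(alpha)
gamma = [rng.dirichlet([beta] * len(vocab)) for _ in range(K)]    # gamma[y] ~ Dirichlet(beta)

def generate_document(n_words):
    y = rng.choice(K, p=pi)                                    # one label y_d per document
    words = list(rng.choice(vocab, size=n_words, p=gamma[y]))  # x_id ~ gamma[y_d]
    return y, words

print(generate_document(6))
```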

Slide 25
Unsupervised Naïve Bayes
From a Dirichlet α: draw a multinomial π over the K classes
From a Dirichlet β: for each class y=1…K, draw a multinomial γ[y] over the vocabulary
For each document d=1..D:
  Pick a label yd from π
  For each word in d (of length Nd):
    Pick a word xid from γ[yd]

[Same plate diagram as the previous slide]

Slide 26
Unsupervised Naïve Bayes
The generative story is the same; what we do with it is different.
We want to both find values of γ and π and find values of Y for each example.
EM is a natural algorithm.
[Same plate diagram as before]

Slide 27
LDA (As a DGM)

Slide 28
The LDA Topic Model

Slide 29
[Figure]

Slide 30
Unsupervised NB vs LDA
[Two plate diagrams side by side, both with α -> π, β -> γ (K of them), and W inside plates Nd and D. Unsupervised NB: one Y per doc, one class prior. LDA: one Y per word, a different class distribution for each doc.]

Slide 31
Unsupervised NB vs LDA
[Two plate diagrams. Unsupervised NB: one Y per doc, drawn from a single class prior π. LDA: one Z per word (Zdi), drawn from a different class distribution θd for each doc; words Wd come from topic multinomials γk, k=1..K, with Dirichlet priors α (on θd) and β (on γk).]

Slide 32
LDA
Blei's motivation: start with the BOW assumption.
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words)
For each document d = 1,…,M:
  Generate θd ~ D1(…)
  For each word n = 1,…,Nd:
    Generate wdn ~ D2(. | θd)
Now pick your favorite distributions for D1, D2.
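
For contrast with the naïve Bayes sketches above, here is the LDA generative story in the same style, with D1 a Dirichlet over topic proportions and D2 the topic-specific word multinomial; all values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
vocab = ["gene", "brain", "data", "model", "neuron"]
alpha, eta = 0.5, 0.1                                  # hyperparameters (made up)

# One word distribution per topic: D2's parameters, one Dirichlet draw per topic.
topic_word = rng.dirichlet([eta] * len(vocab), size=K)

def generate_document(n_words):
    theta_d = rng.dirichlet([alpha] * K)               # theta_d ~ D1: per-doc topic mixture
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta_d)                   # a fresh topic for every word
        doc.append(rng.choice(vocab, p=topic_word[z])) # w ~ D2(. | topic z)
    return doc

print(generate_document(8))
```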

Slide 33
Unsupervised NB vs LDA
[Plate diagram for LDA: α -> θd -> Zdi -> Wdi <- γk <- β, with plates Nd, D, and K]
Unsupervised NB clusters documents into latent classes.
LDA clusters word occurrences into latent classes (topics).
The smoothness of βk: the same word suggests the same topic.
The smoothness of θd: the same document suggests the same topic.

Slide 34
LDA's view of a document
Mixed membership model
[Figure]

Slide 35
LDA topics: top words w by Pr(w|Z=k)
[Figure: four example topics, Z=13, Z=22, Z=27, Z=19, each shown with its top words]

Slide 36
SVM using 50 features: Pr(Z=k|θd)
[Figure: results comparing 50 topics vs all words as SVM features]

Slide 37
Gibbs Sampling for LDA

Slide 38
LDA = Latent Dirichlet Allocation
Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs sampling
Wait, why is sampling called "learning" here? Here's the idea….

Slide 39
LDA
Gibbs sampling works for any directed model!
Applicable when the joint distribution is hard to evaluate but the conditional distributions are known.
The sequence of samples comprises a Markov chain; the stationary distribution of the chain is the joint distribution.
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

Slide 40
[Figure: a toy model α -> θ -> Z, with Z -> X and Z -> Y; unrolled over two examples as Z1 -> (X1, Y1) and Z2 -> (X2, Y2), with both Z's sharing the same θ]
I'll assume we know the parameters for Pr(X|Z) and Pr(Y|Z).

Slide 41
Initialize all the hidden variables randomly, then….
Pick Z1 ~ Pr(Z | x1, y1, θ)
Pick Z2 ~ Pr(Z | x2, y2, θ)
Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
Pick Z1 ~ Pr(Z | x1, y1, θ)
Pick Z2 ~ Pr(Z | x2, y2, θ)
Pick θ ~ Pr(θ | z1, z2, α)   (pick from the posterior)
...
In a broad range of cases these will eventually converge to samples from the true joint distribution, so we will have (a sample of) the true θ.
[Same figure as the previous slide]
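
A sketch of this alternating Gibbs sampler for the toy model, assuming binary Z with θ = Pr(Z=1), a Beta prior, and known (made-up) emission probabilities Pr(X|Z) and Pr(Y|Z). Each sweep resamples every Zi from its conditional and then resamples θ from its Beta posterior given the current Z's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known emission probabilities (made-up values): Pr(X=1 | Z) and Pr(Y=1 | Z).
p_x = {0: 0.2, 1: 0.8}
p_y = {0: 0.3, 1: 0.9}
alpha = (1.0, 1.0)                        # Beta prior on theta = Pr(Z=1)

# Observed (x_i, y_i) pairs, made up for the sketch.
data = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def bern(p, value):
    return p if value == 1 else 1.0 - p

z = rng.integers(0, 2, size=len(data))    # initialize the hidden Z's randomly
theta = 0.5

for sweep in range(1000):
    # Pick Z_i ~ Pr(Z | x_i, y_i, theta) for every example
    for i, (x, y) in enumerate(data):
        w1 = theta * bern(p_x[1], x) * bern(p_y[1], y)
        w0 = (1 - theta) * bern(p_x[0], x) * bern(p_y[0], y)
        z[i] = rng.random() < w1 / (w0 + w1)
    # Pick theta ~ Pr(theta | z_1..z_n, alpha): the Beta posterior
    theta = rng.beta(alpha[0] + z.sum(), alpha[1] + len(z) - z.sum())

print(theta, z)                            # (a sample of) theta and the Z assignments
```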

Slide 42
Why does Gibbs sampling work?
Basic claim: when you sample x ~ P(X | y1,…,yk), then if y1,…,yk were sampled from the true joint, x will also be sampled from the true joint.
So the true joint is a "fixed point": you tend to stay there if you ever get there.
How long does it take to get there? That depends on the structure of the space of samples: how well connected are the samples by the sampling steps?

Slide 43
[Figure]

Slide 44
[Figure]

Slide 45
LDA = Latent Dirichlet Allocation
Parameter learning:
  Variational EM (not covered in 601-B)
  Collapsed Gibbs sampling
What is collapsed Gibbs sampling?

Slide 46
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Pick Z2 ~ Pr(Z2 | z1, z3, α)
Pick Z3 ~ Pr(Z3 | z1, z2, α)
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Pick Z2 ~ Pr(Z2 | z1, z3, α)
Pick Z3 ~ Pr(Z3 | z1, z2, α)
...
Converges to samples from the true joint … and then we can estimate Pr(θ | α, sample of Z's).
[Figure: the same toy model with three examples (Z1, X1, Y1), (Z2, X2, Y2), (Z3, X3, Y3) sharing θ; θ is no longer sampled, it has been integrated ("collapsed") out]

Slide 47
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
What's this distribution?
[Same figure as the previous slide]

Slide 48
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
Simpler case (just α -> θ -> Z1, Z2, Z3, with no X's or Y's): what's this distribution?
It's called a Dirichlet-multinomial, and it looks like this: if there are k values for the Z's and nk = # Z's with value k, then it turns out:
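
For reference, the standard Dirichlet-multinomial (the marginal of Z1,…,Zm with θ integrated out) is, in the slide's notation:

```latex
\Pr(Z_1,\dots,Z_m \mid \alpha)
  = \int \Pr(Z_1,\dots,Z_m \mid \theta)\,\Pr(\theta \mid \alpha)\,d\theta
  = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\Gamma\!\left(m + \sum_k \alpha_k\right)}
    \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
```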

Slide 49
Initialize all the Z's randomly, then….
Pick Z1 ~ Pr(Z1 | z2, z3, α)
It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation:
  k values for the Z's
  nk = # Z's with value k
  Z = (Z1,…, Zm)
  Z(-i) = (Z1,…, Zi-1, Zi+1,…, Zm)
  nk(-i) = # Z's with value k, excluding Zi
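
In this notation, the standard conditional for a single Zi given the rest (which is what makes the sampling easy) is:

```latex
\Pr\!\left(Z_i = k \mid Z^{(-i)}, \alpha\right)
  = \frac{n_k^{(-i)} + \alpha_k}{m - 1 + \sum_{k'} \alpha_{k'}}
```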

Slide 50
What about with downstream evidence?
[Figure: the model extended with observations: α -> θ -> Zi -> Xi, where each topic's word distribution β[k] (with prior η) generates the Xi's]
The collapsed conditional captures the constraints on Zi via θ (from "above", the "causal" direction); what about the constraints via β?
Pr(Z) = (1/c) * Pr(Z|E+) Pr(E-|Z)

Slide 51
Sampling for LDA
[Plate diagram: α -> θd -> Zdi -> Wdi <- βk <- η, with plates Nd, D, and K]
Notation:
  k values for the Zd,i's
  Z(-d,i) = all the Z's but Zd,i
  nw,k = # Zd,i's with value k that are paired with Wd,i = w
  n*,k = # Zd,i's with value k
  nw,k(-d,i) = nw,k excluding Zd,i
  n*,k(-d,i) = n*,k excluding Zd,i
  n*,k d,(-i) = n*,k from doc d, excluding Zd,i
The sampling distribution is the product Pr(Z|E+) · Pr(E-|Z): the first factor is the fraction of the time Z=k in doc d, the second is the fraction of the time W=w in topic k.
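
Written out, the standard collapsed-Gibbs conditional for LDA that this product corresponds to is (with w = Wd,i, V the vocabulary size, and a symmetric prior η assumed on the topic-word multinomials):

```latex
\Pr\!\left(Z_{d,i}=k \mid Z^{(-d,i)}, W, \alpha, \eta\right)
  \;\propto\;
  \left(n_{*,k}^{\,d,(-i)} + \alpha_k\right)
  \cdot
  \frac{n_{w,k}^{(-d,i)} + \eta}{\,n_{*,k}^{(-d,i)} + V\eta\,}
```

The first factor is the (smoothed) count of how often Z=k in doc d; the second is the smoothed fraction of topic k's words that are w.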

Slide 52
LDA (in Too Much Detail)

Slide 53
Way way more detail
[Figure]

Slide 54
More detail
[Figure]

Slide 55
[Figure]

Slide 56
[Figure]

Slide 57
What gets learned…..

Slide 58
In A Math-ier Notation
[Figure: the sampling formula written with count tables N[d,k], M[w,k], N[*,k], and N[*,*] = V]

Slide 59
for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++, where w = id of the j-th word in d

Slide 60
for each pass t = 1, 2, ….:
    for each document d and word position j in d:
        z[d,j] = k, a new random topic
        update N, W to reflect the new assignment of z:
            N[d,k]++; N[d,k'] --, where k' is the old z[d,j]
            W[w,k]++; W[w,k'] --, where w is w[d,j]
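
Putting the pseudocode and the sampling formula together, here is a compact sketch of the collapsed Gibbs sampler; it uses symmetric priors α and η, takes docs as a list of lists of integer word ids, and all of the names are mine rather than from the slides.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, eta=0.01, n_passes=50, seed=0):
    rng = np.random.default_rng(seed)
    N = np.zeros((len(docs), K))           # N[d,k]: topic counts per document
    W = np.zeros((V, K))                   # W[w,k]: topic counts per word type
    Wsum = np.zeros(K)                     # total number of words assigned to each topic
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]   # random initial topics

    for d, doc in enumerate(docs):         # fill in the count tables
        for j, w in enumerate(doc):
            k = z[d][j]
            N[d, k] += 1; W[w, k] += 1; Wsum[k] += 1

    for _ in range(n_passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]            # remove the old assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wsum[k_old] -= 1
                # unnormalized Pr(z = k | everything else), for all K topics at once
                p = (N[d] + alpha) * (W[w] + eta) / (Wsum + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new            # record the new assignment in the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wsum[k_new] += 1
    return z, N, W
```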

Slide 61
[Figure]

Slide 62
[Figure: the unnormalized probabilities for z=1, z=2, z=3, … stacked against a unit-height random draw]
You spend a lot of time sampling.
There's a loop over all topics here in the sampler.
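
The "unit height / random" picture is the usual trick for sampling from an unnormalized distribution: draw one uniform number and walk the cumulative weights, which is exactly where the loop over all topics shows up. A minimal sketch:

```python
import random

def sample_topic(weights):
    """Sample an index k proportionally to the (unnormalized) weights."""
    r = random.random() * sum(weights)     # one uniform draw, scaled to the total weight
    acc = 0.0
    for k, w in enumerate(weights):        # the loop over all topics
        acc += w
        if r <= acc:
            return k
    return len(weights) - 1                # guard against floating-point rounding
```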