Fast sampling for LDA
William Cohen
MORE LDA SPEEDUPS

First - RECAP LDA DETAILS
Called “collapsed Gibbs sampling” since you’ve marginalized away some variables
(From: "Parameter estimation for text analysis", Gregor Heinrich.)

The collapsed Gibbs update for one token (word w in doc d) is

P(z = k | …) ∝ (nw|k + β)/(n.|k + βV) × (nk|d + αk)

where the first factor is the prob. this word/term is assigned to topic k, and the second factor is the prob. this doc contains topic k.
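To make the update concrete, here is a minimal sketch of one collapsed-Gibbs sweep in Python; the count arrays n_wt (word-topic), n_td (doc-topic), n_t (topic totals) and all names are my illustration, not code from the slides:

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_td, n_t, alpha, beta, rng):
    """One collapsed-Gibbs sweep over every token in the corpus."""
    V, K = n_wt.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            t = z[d][i]
            # remove this token's current assignment from all counts
            n_wt[w, t] -= 1; n_td[d, t] -= 1; n_t[t] -= 1
            # P(z=k|...) proportional to (nw|k + b)/(n.|k + bV) * (nk|d + a)
            p = (n_wt[w] + beta) * (n_td[d] + alpha) / (n_t + beta * V)
            t = rng.choice(K, p=p / p.sum())
            # record the new assignment and restore the counts
            z[d][i] = t
            n_wt[w, t] += 1; n_td[d, t] += 1; n_t[t] += 1
```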
More detail
[Figure: sample z by throwing a dart at a unit-height line segment split into regions for z=1, z=2, z=3, …; the region a uniform-random point lands in is the sampled topic.]
SPEEDUP 1 - SPARSITY
KDD 2008
Running total of P(z=k|…), i.e. P(z<=k)
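In code, the baseline sampler is exactly this running total: draw U uniform on [0, Z) and walk the cumulative sums until you pass it. A minimal sketch (names are mine):

```python
import random

def sample_linear(p):
    """O(K) sampling from unnormalized weights p via a running total."""
    u = random.random() * sum(p)   # U ~ Uniform(0, Z)
    total = 0.0
    for k, pk in enumerate(p):
        total += pk
        if u < total:              # first k where P(z<=k) passes U
            return k
    return len(p) - 1              # guard against float round-off
```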
Discussion….

Where do you spend your time?
- sampling the z's: each sampling step involves a loop over all topics, and this seems wasteful
- even with many topics, words are often assigned to only a few different topics
- low-frequency words appear < K times … and there are lots and lots of them!
- even frequent words are not in every topic
Discussion….

What's the solution?

Idea: come up with approximations Zi to the normalizer Z at each stage - then you might be able to stop early…
- computationally, this is like working with a sparser vector
- we want Zi >= Z, so stopping early never under-counts the remaining mass
Tricks

How do you compute and maintain the bound?
- see the paper

What order do you go in?
- want to pick the large P(k)'s first
- … so we want large P(k|d) and P(k|w)
- … so we maintain the k's in sorted order, which only changes a little bit after each flip - a bubble-sort pass fixes up the almost-sorted array (see the sketch below)
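The re-sorting step is the easiest part to show in code: after one count changes by ±1, a couple of adjacent swaps restore descending order. A sketch (array names assumed):

```python
def fix_up(topics, score, i):
    """Restore descending order of `topics` by `score` after the entry at
    position i changed by +/-1; the array is almost sorted, so this is
    the one-pass bubble-sort fix-up from the slide."""
    while i > 0 and score[topics[i]] > score[topics[i - 1]]:
        topics[i - 1], topics[i] = topics[i], topics[i - 1]
        i -= 1
    while i + 1 < len(topics) and score[topics[i]] < score[topics[i + 1]]:
        topics[i], topics[i + 1] = topics[i + 1], topics[i]
        i += 1
    return i
```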
Results
SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY
KDD 2009
z = s + r + q

(t = topic (k), w = word, d = doc)

The unnormalized probability of each topic splits into three buckets:

P(z=t | …) ∝ (nw|t + β)(αt + nt|d) / (βV + n.|t)
           = αt β / (βV + n.|t)                 [s: smoothing-only bucket]
           + nt|d β / (βV + n.|t)               [r: document-topic bucket]
           + (αt + nt|d) nw|t / (βV + n.|t)     [q: topic-word bucket]

and s, r, q are these terms summed over all t.
z = s + r + q

Draw U ~ Uniform(0, s+r+q).

If U < s: look up U on the line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
If s < U < s+r: look up U on the line segment for r - only need to check t such that nt|d > 0

(t = topic (k), w = word, d = doc)
z = s + r + q

If U < s: look up U on the line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
If s < U < s+r: look up U on the line segment for r
If s+r < U: look up U on the line segment for q - only need to check t such that nw|t > 0
z = s + r + q

- for q: only need to check t such that nw|t > 0
- for r: only need to check t such that nt|d > 0
- for s: only need to do the full O(K) walk occasionally - U lands in the smoothing bucket < 10% of the time

A sketch of the whole bucket sampler follows.
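This is a hedged sketch of the SparseLDA-style bucket sampling; the sparse maps n_td (nonzero nt|d) and n_wt (nonzero nw|t) and all names are illustrative, and a real implementation caches s and the per-topic coefficients instead of recomputing them:

```python
import random

def sample_sparse(alpha, beta, V, n_t, n_td, n_wt):
    """n_t: dense topic totals; n_td: dict t -> n_{t|d} for the current doc;
    n_wt: dict t -> n_{w|t} for the current word (both sparse)."""
    K = len(n_t)
    s = sum(alpha[t] * beta / (beta * V + n_t[t]) for t in range(K))
    r = sum(n * beta / (beta * V + n_t[t]) for t, n in n_td.items())
    coef = {t: (alpha[t] + n_td.get(t, 0)) / (beta * V + n_t[t]) for t in n_wt}
    q = sum(coef[t] * n for t, n in n_wt.items())

    u = random.random() * (s + r + q)
    if u < s:                                   # rare: dense O(K) walk
        for t in range(K):
            u -= alpha[t] * beta / (beta * V + n_t[t])
            if u <= 0:
                return t
    elif u < s + r:                             # only t with n_{t|d} > 0
        u -= s
        for t, n in n_td.items():
            u -= n * beta / (beta * V + n_t[t])
            if u <= 0:
                return t
    else:                                       # only t with n_{w|t} > 0
        u -= s + r
        for t, n in n_wt.items():
            u -= coef[t] * n
            if u <= 0:
                return t
    return K - 1                                # float round-off guard
```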
z = s + r + q

Need to store nw|t for each word-topic pair …???
- only need to store nt|d for the current d
- only need to store (and maintain) the total words per topic, and the α's, β, V
- trick: count up nt|d for d when you start working on d, then update it incrementally (sketch below)
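A sketch of that incremental trick (hypothetical helpers, not from the slides): build nt|d once per document, then patch it on each flip so it stays sparse:

```python
from collections import Counter

def start_doc(z_d):
    """Count up n_{t|d} once, when work on doc d starts."""
    return Counter(z_d)

def flip(n_td, old_t, new_t):
    """Maintain n_{t|d} incrementally when one token moves topics."""
    n_td[old_t] -= 1
    if n_td[old_t] == 0:
        del n_td[old_t]        # drop zero entries to keep the map sparse
    n_td[new_t] += 1
```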
z = s + r + q

Need to store nw|t for each word-topic pair …???

1. Precompute, for each t, the q-bucket coefficient (αt + nt|d)/(βV + n.|t). Most (>90%) of the time and space is here…
2. Quickly find the t's such that nw|t is large for w
Need to store nw|t for each word-topic pair …???

1. Precompute, for each t, the q-bucket coefficient - most (>90%) of the time and space is here…
2. Quickly find the t's such that nw|t is large for w:
- map w to an int array, no larger than the frequency of w, and no larger than #topics
- encode (t,n) as a bit vector: n in the high-order bits, t in the low-order bits
- keep the ints sorted in descending order (sketch below)
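A sketch of the bit-packing; the bit widths are my assumption, the slide only fixes the layout (count high, topic low):

```python
TOPIC_BITS = 10                      # enough for K <= 1024 topics (assumed)
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    """Pack (topic, count) into one int, count in the high-order bits,
    so sorting the plain ints in descending order sorts by count."""
    return (n << TOPIC_BITS) | t

def decode(x):
    return x >> TOPIC_BITS, x & TOPIC_MASK   # (count, topic)

row = sorted((encode(t, n) for t, n in [(3, 7), (12, 2), (0, 41)]),
             reverse=True)                   # largest n_{w|t} first
assert decode(row[0]) == (41, 0)
```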
Outline

- LDA/Gibbs algorithm details
- How to speed it up by parallelizing
- How to speed it up by faster sampling
  - Why sampling is key
  - Some sampling ideas for LDA:
    - the Mimno/McCallum decomposition (SparseLDA)
    - alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)
Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Basic problem: how can we sample from a biased K-sided coin quickly?

If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and use a binary tree over the cumulative probabilities (e.g. locating r in an interval such as (23/40, 7/10]): O(K) preprocessing, O(log2 K) per sample.
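A minimal proof-of-concept of the binary-tree idea, using a flattened cumulative array plus binary search (names assumed):

```python
import bisect
import random
from itertools import accumulate

def preprocess(p):
    """O(K) preprocessing: cumulative sums play the role of the tree."""
    return list(accumulate(p))

def sample(cdf):
    """O(log2 K) per draw: binary-search for the interval containing r."""
    r = random.random() * cdf[-1]
    return bisect.bisect_right(cdf, r)
```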
Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Another idea… simulate the dart with two drawn values:
- rx = int(u1*K), the column the dart lands in
- ry = u2*pmax, the height of the dart
and keep throwing until you hit a stripe.
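The dart simulation as a sketch (I read the height draw as a second uniform u2, correcting the apparent typo on the slide):

```python
import random

def dart_sample(p, p_max):
    """Rejection-sample: throw darts at a K x p_max board until one lands
    under a stripe; the column is the sampled topic."""
    K = len(p)
    while True:
        rx = int(random.random() * K)     # column: int(u1 * K)
        ry = random.random() * p_max      # height: u2 * p_max
        if ry < p[rx]:                    # hit the stripe
            return rx
```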
Alias tables

http://www.keithschwarz.com/darts-dice-coins/

An even more clever idea: minimize the brown space (where the dart "misses") by sizing the rectangle's height to the average probability, not the maximum probability, and cutting and pasting a bit. You can always do this using only two colors in each column of the final alias table, and - mathematically speaking - the dart never misses!
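A compact sketch of the cut-and-paste construction (this is Vose's variant of Walker's alias method, the standard algorithm, not code from the slides):

```python
import random

def build_alias(p):
    """O(K) build: every column of average height ends up with at most
    two 'colors' - its own mass plus one alias topic."""
    K = len(p)
    scaled = [x * K / sum(p) for x in p]             # mean becomes 1.0
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, x in enumerate(scaled) if x < 1.0]
    large = [i for i, x in enumerate(scaled) if x >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l             # paste l's excess onto s
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                          # leftovers are ~1.0
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """O(1) per draw, and the dart never misses."""
    i = int(random.random() * len(prob))
    return i if random.random() < prob[i] else alias[i]
```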
KDD 2014

Key ideas:
- use a variant of the Mimno/McCallum decomposition
- use alias tables to sample from the dense parts
- since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs
KDD 2014

- q is the stale, easy-to-draw-from (proposal) distribution
- p is the up-to-date distribution
- computing the ratios p(i)/q(i) is cheap
- usually the ratio is close to one; when it isn't, the dart missed
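One Metropolis-Hastings step with the stale proposal, as a sketch (p and q here are callables returning unnormalized weights; all names are mine):

```python
import random

def mh_step(t_old, draw_from_q, p, q):
    """Draw t_new from the stale alias table q, then accept with
    probability min(1, p(t_new)q(t_old) / (p(t_old)q(t_new)))."""
    t_new = draw_from_q()
    accept = min(1.0, p(t_new) * q(t_old) / (p(t_old) * q(t_new)))
    return t_new if random.random() < accept else t_old
```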
KDD 2014
SPEEDUP 3 - ONLINE LDA
Pilfered from…

NIPS 2010: Online Learning for LDA - Matthew Hoffman, Francis Bach & David Blei
ASIDE: VARIATIONAL INFERENCE FOR LDA
[Figures: the per-document (local) variational updates use γ; the global topic updates use λ.]
BACK TO: SPEEDUP 3 - ONLINE LDA
SPEEDUP 3.5 - ONLINE SPARSE LDA
Compute expectations over the z's any way you want….
Technical Details

Variational distribution: q(zd) is over the whole vector of topic assignments for doc d, not factored over the individual zdi's!

Approximate it using Gibbs: after sampling for a while, estimate the needed expectations from the samples (sketch below).

Evaluate using time and "coherence", where D(w) = # docs containing word w.
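A sketch of the Gibbs-based estimate (shapes and names are my assumptions; z_samples are post-burn-in assignment vectors for one document):

```python
import numpy as np

def expected_assignments(z_samples, n_tokens, K):
    """Estimate E[z_{d,i} = k] by averaging indicators over Gibbs samples."""
    counts = np.zeros((n_tokens, K))
    for z in z_samples:              # z[i] = sampled topic of token i
        for i, t in enumerate(z):
            counts[i, t] += 1
    return counts / len(z_samples)
```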
Summary of LDA speedup tricks

Gibbs sampler:
- O(N*K*T), and K grows with N
- need to keep the corpus (and the z's) in memory

You can parallelize:
- each worker keeps only a slice of the corpus
- but you need to synchronize K multinomials over the vocabulary - AllReduce helps

You can sparsify the sampling and the topic-counts:
- Mimno's trick greatly reduces memory

You can do the computation on-line:
- only need to keep K multinomials and one document's worth of corpus and z's in memory

You can combine some of these methods:
- online sparsified LDA
- parallel online sparsified LDA?
SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION
What if you try to parallelize?

Split the document/term matrix randomly, distribute the pieces to p processors … then run "Approximate Distributed LDA".

The aggregate-and-redistribute step here is a common subtask in parallel versions of LDA, SGD, …
Introduction

Common pattern:
- do some learning in parallel
- aggregate local changes from each processor to shared parameters
- distribute the new shared parameters back to each processor
- and repeat….

This map-then-AllReduce pattern is implemented in MPI, and recently in the VW code (John Langford) in a Hadoop-compatible scheme. (A minimal example follows.)
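A minimal AllReduce example using mpi4py; mpi4py and the toy counts are my choice here, the slides describe VW's Hadoop-based implementation, not this code:

```python
# run with e.g.: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.random.randint(0, 5, size=10).astype(np.int64)  # local count deltas
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)   # every worker gets the summed counts
print(comm.Get_rank(), total)
```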
Gory details of VW Hadoop-AllReduce

Spanning-tree server:
- a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server

Worker nodes ("fake" mappers):
- input for each worker is locally cached
- workers all connect to the spanning-tree server
- workers all execute the same code, which might contain AllReduce calls
- workers synchronize whenever they reach an all-reduce
Hadoop AllReduce: don't wait for duplicate jobs - take whichever speculatively-executed copy finishes first
Second-order method - like Newton's method
2^24 features, ~100 non-zeros/example, 2.3B examples

An example is a user/page/ad combination plus conjunctions of these, positive if there was a click-through on the ad.
50M examples; explicitly constructed kernel with 11.7M features, ~3,300 non-zeros/example

Old method: SVM, 3 days (reporting time to reach a fixed test error).