Fast sampling for LDA
Presentation Transcript

Slide1

Fast sampling for LDA

William Cohen

Slide2

MORE LDA SPEEDUPS

First - RECAP LDA DETAILS

Slide3
Slide4

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

From: Parameter estimation for text analysis, Gregor Heinrich

prob. this word/term is assigned to topic k

prob. this doc contains topic k

Slide5
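The two annotations above label the factors of the standard collapsed Gibbs update. A reconstruction in LaTeX, following Heinrich's notes (counts $n$ exclude the current token $i$; $V$ is the vocabulary size; $\alpha, \beta$ are symmetric priors):

$$P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \frac{n^{\neg i}_{w_i|k} + \beta}{n^{\neg i}_{\cdot|k} + V\beta} \cdot \bigl(n^{\neg i}_{k|d_i} + \alpha\bigr)$$

The first factor is the "prob. this word/term is assigned to topic k"; the second is the "prob. this doc contains topic k".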

More detail

Slide6
Slide7

[Figure: P(z|…) drawn as a line segment of unit height, divided into regions z=1, z=2, z=3; a random draw on the segment selects a topic]

Slide8

SPEEDUP 1 - Sparsity

Slide9

KDD 2008

Slide10

[Figure (recap): unit-height line segment divided into regions z=1, z=2, z=3; a random draw selects a region]

Slide11
Slide12

Running total of P(z=k|…), or P(z<=k)

Slide13
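A minimal sketch of this running-total sampler in Python (an inverse-CDF draw by linear scan; probs stands in for the unnormalized P(z=k|…) values):

    import random

    def sample_z(probs):
        # probs[k] is the unnormalized P(z=k|...); one O(K) pass per draw
        total = sum(probs)               # Z, the normalizer
        u = random.random() * total      # a random point on the segment
        running = 0.0
        for k, p in enumerate(probs):
            running += p                 # running total: P(z <= k) once normalized
            if u < running:
                return k
        return len(probs) - 1            # guard against floating-point round-off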

Discussion….

Where do you spend your time?

sampling the z's
each sampling step involves a loop over all topics
this seems wasteful
even with many topics, words are often only assigned to a few different topics
low-frequency words appear < K times … and there are lots and lots of them!
even frequent words are not in every topic

Slide14

Discussion….

What's the solution?

Idea: come up with approximations to Z at each stage - then you might be able to stop early…

computationally like a sparser vector

Want Zi >= Z

Slide15

Tricks

How do you compute and maintain the bound?
see the paper

What order do you go in?
want to pick large P(k)'s first
… so we want large P(k|d) and P(k|w)
… so we maintain k's in sorted order
which only change a little bit after each flip, so a bubble-sort will fix up the almost-sorted array (see the sketch below)

Slide16
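A hedged sketch of that fix-up pass (names are ours; counts is assumed indexable by topic id):

    def fixup(topics, counts, i):
        # topics is kept sorted by counts[topic], descending; after the count of
        # topics[i] changes by +/-1, a few adjacent swaps restore the order
        while i > 0 and counts[topics[i]] > counts[topics[i - 1]]:
            topics[i], topics[i - 1] = topics[i - 1], topics[i]
            i -= 1
        while i < len(topics) - 1 and counts[topics[i]] < counts[topics[i + 1]]:
            topics[i], topics[i + 1] = topics[i + 1], topics[i]
            i += 1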

Results

Slide17

SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY

Slide18

KDD 09

Slide19

z = s + r + q

t = topic (k)
w = word
d = doc

Slide20

z = s + r + q

If U < s: look up U on a line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
If s < U < s+r: look up U on the line segment for r
Only need to check t such that nt|d > 0

t = topic (k), w = word, d = doc

Slide21

z = s + r + q

If U < s: look up U on a line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
If s < U < s+r: look up U on the line segment for r
If s+r < U: look up U on the line segment for q
Only need to check t such that nw|t > 0

Slide22

z = s + r + q

Only need to check t such that nw|t > 0 (for the q segment)
Only need to check t such that nt|d > 0 (for the r segment)
Only need to check occasionally, < 10% of the time (for the s segment)

(sketched below)

Slide23
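A hedged sketch of the s+r+q bucket sampler, assuming a symmetric α and dict-based sparse counts (function and variable names are ours; a real implementation caches s and r incrementally instead of recomputing them):

    import random

    def sample_srq(alpha, beta, V, K, n_td, n_wt, n_t):
        # n_td: {topic: count} for the current doc; n_wt: {topic: count} for the
        # current word; n_t: dense list of total tokens per topic
        denom = [beta * V + n_t[t] for t in range(K)]
        s = sum(alpha * beta / denom[t] for t in range(K))       # smoothing bucket
        r = sum(c * beta / denom[t] for t, c in n_td.items())    # doc bucket
        q = sum((alpha + n_td.get(t, 0)) * c / denom[t]          # word bucket
                for t, c in n_wt.items())
        u = random.random() * (s + r + q)
        if u < s:                        # rare (< 10%): walk the dense segment
            for t in range(K):
                u -= alpha * beta / denom[t]
                if u <= 0:
                    return t
        elif u < s + r:                  # only topics with n_t|d > 0
            u -= s
            for t, c in n_td.items():
                u -= c * beta / denom[t]
                if u <= 0:
                    return t
        else:                            # only topics with n_w|t > 0
            u -= s + r
            for t, c in n_wt.items():
                u -= (alpha + n_td.get(t, 0)) * c / denom[t]
                if u <= 0:
                    return t
        return K - 1                     # floating-point guard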

z = s + r + q

Need to store nw|t for each word, topic pair …???

Only need to store nt|d for the current d
Only need to store (and maintain) the total words per topic, and the α's, β, V
Trick: count up nt|d for d when you start working on d and update it incrementally

Slide24

z = s + r + q

Need to store nw|t for each word, topic pair …???

1. Precompute, for each t, …
Most (>90%) of the time and space is here…
2. Quickly find t's such that nw|t is large for w

Slide25

Need to store nw|t for each word, topic pair …???

1. Precompute, for each t, …
Most (>90%) of the time and space is here…
2. Quickly find t's such that nw|t is large for w
map w to an int array
no larger than the frequency of w
no larger than #topics
encode (t,n) as a bit vector
n in the high-order bits
t in the low-order bits
keep ints sorted in descending order (see the sketch below)

Slide26
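A sketch of the (t, n) encoding, assuming topics fit in the low 10 bits (the counts shown are toy values):

    TOPIC_BITS = 10                        # assumes K <= 2**10 topics
    TOPIC_MASK = (1 << TOPIC_BITS) - 1

    def encode(t, n):
        # n in the high-order bits, t in the low-order bits: sorting the packed
        # ints in descending order sorts by count n in descending order
        return (n << TOPIC_BITS) | t

    def decode(x):
        return x & TOPIC_MASK, x >> TOPIC_BITS    # (t, n)

    # per-word int array, sorted descending, so large n_w|t come first
    toy_counts = {3: 17, 12: 5, 40: 1}            # toy n_w|t values for one word
    postings = sorted((encode(t, n) for t, n in toy_counts.items()), reverse=True)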
Slide27

Outline

LDA/Gibbs algorithm details
How to speed it up by parallelizing
How to speed it up by faster sampling
Why sampling is key
Some sampling ideas for LDA
The Mimno/McCallum decomposition (SparseLDA)
Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Slide28

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Basic problem: how can we sample from a biased coin quickly?

If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and use a binary tree (sketch below).

r in (23/40, 7/10]

O(K) per draw becomes O(log2 K)

Slide29
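A proof-of-concept sketch: precompute the running totals once, then binary-search them per draw (Python's bisect plays the role of the binary tree):

    import bisect, itertools, random

    def build_cdf(probs):
        # one O(K) preprocessing pass; reusable while the distribution is unchanged
        return list(itertools.accumulate(probs))

    def sample_bst(cdf):
        # O(log2 K) per draw: binary-search for the segment containing r
        r = random.random() * cdf[-1]
        return bisect.bisect_right(cdf, r)

    cdf = build_cdf([0.1, 0.475, 0.125, 0.3])
    z = sample_bst(cdf)   # e.g. any r strictly inside (23/40, 7/10) yields z = 2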

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Another idea…

Simulate the dart with two drawn values:
rx ← int(u1*K)
ry ← u2*pmax
keep throwing till you hit a stripe (sketch below)

Slide30
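A sketch of the dart simulation, reading the two draws as independent uniforms u1, u2:

    import random

    def sample_dart(probs):
        # throw darts at a K-wide, pmax-tall box until one lands under a stripe
        K, pmax = len(probs), max(probs)
        while True:
            rx = int(random.random() * K)   # rx <- int(u1*K): which column
            ry = random.random() * pmax     # ry <- u2*pmax: how high
            if ry < probs[rx]:              # hit the stripe: accept topic rx
                return rx
            # miss (brown space): keep throwing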

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

An even more clever idea: minimize the brown space (where the dart "misses") by sizing the rectangle's height to the average probability, not the maximum probability, and cutting and pasting a bit.

You can always do this using only two colors in each column of the final alias table, and the dart never misses! (sketch below)

mathematically speaking…

Slide31
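A sketch of this construction (Vose's variant of Walker's alias method) and the resulting O(1), miss-free sampler:

    import random

    def build_alias(probs):
        # rescale so the average height is 1.0, then cut-and-paste the excess
        # of "large" columns onto "small" ones; each column ends with <= 2 colors
        K = len(probs)
        total = sum(probs)
        scaled = [p * K / total for p in probs]
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        prob, alias = [0.0] * K, [0] * K
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:          # whatever is left is exactly full
            prob[i] = 1.0
        return prob, alias

    def sample_alias(prob, alias):
        # O(1): pick a column, then one of its (at most) two colors
        k = int(random.random() * len(prob))
        return k if random.random() < prob[k] else alias[k]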

KDD 2014

Key ideas:
use a variant of the Mimno/McCallum decomposition
use alias tables to sample from the dense parts
since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

Slide32

KDD 2014

q is the stale, easy-to-draw-from distribution
p is the updated distribution
computing the ratios p(i)/q(i) is cheap
usually the ratio is close to one
else the dart missed
(sketch below)

Slide33
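A hedged sketch of one such Metropolis-Hastings step; p and q are callables returning (possibly unnormalized) probabilities, and sample_q would be an alias-table draw:

    import random

    def mh_step(z_current, p, q, sample_q):
        # propose from the stale alias-table distribution q (an O(1) draw)
        z_new = sample_q()
        # independence-sampler acceptance ratio: cheap, and usually close to one
        accept = (p(z_new) * q(z_current)) / (p(z_current) * q(z_new))
        if random.random() < min(1.0, accept):
            return z_new       # accept the move
        return z_current       # reject: "the dart missed", keep the old topic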

KDD 2014

Slide34

SPEEDUP 3 - Online LDA

Slide35

Pilfered from…

NIPS 2010: Online Learning for LDA, Matthew Hoffman, Francis Bach & David Blei

Slide36

ASIDE: VARIATIONAL INFERENCE FOR LDA

Slide37
Slide38
Slide39
Slide40
Slide41
Slide42
Slide43

uses γ

uses λ

Slide44
Slide45
Slide46
Slide47
Slide48
Slide49
Slide50
Slide51

Back to: SPEEDUP 3 - Online LDA

Slide52
Slide53
Slide54
Slide55
Slide56
Slide57
Slide58

SPEEDUP 3.5 - Online Sparse LDA

Slide59
Slide60
Slide61
Slide62

Compute expectations over the z's any way you want….

Slide63
Slide64

Technical Details

Variational distribution: q(zd) over the document's whole topic assignment - not factored per token!

Approximate using Gibbs: after sampling for a while, estimate: …

Evaluate using time and "coherence":
D(w) = # docs containing word w

Slide65
Slide66

better

Slide67

Summary of LDA speedup tricks

Gibbs sampler:
O(N*K*T), and K grows with N
need to keep the corpus (and z's) in memory

You can parallelize:
you need to keep a slice of the corpus
but you need to synchronize K multinomials over the vocabulary
AllReduce helps

You can sparsify the sampling and topic-counts:
Mimno's trick - greatly reduces memory

You can do the computation on-line:
only need to keep K multinomials and one document's worth of corpus and z's in memory

You can combine some of these methods:
online sparsified LDA
parallel online sparsified LDA?

Slide68

SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION

Slide69
Slide70

What if you try and parallelize?

Split the document/term matrix randomly and distribute to p processors … then run "Approximate Distributed LDA"

Common subtask in parallel versions of: LDA, SGD, ….

Slide71

Introduction

Common pattern:
do some learning in parallel
aggregate local changes from each processor to shared parameters
distribute the new shared parameters back to each processor
and repeat…. (see the sketch below)

AllReduce: implemented in MPI, recently in VW code (John Langford) in a Hadoop-compatible scheme

MAP / ALLREDUCE

Slide72
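A sketch of the pattern with mpi4py (VW's spanning-tree AllReduce has the same shape); V, K, and local_gibbs_pass are hypothetical stand-ins:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    V, K = 10000, 100                       # stand-in vocabulary and topic sizes
    n_wt = np.zeros((V, K))                 # every worker's copy of the shared counts

    for _ in range(10):                     # repeat: learn, aggregate, redistribute
        delta = local_gibbs_pass(n_wt)      # hypothetical: sample this worker's slice,
                                            # return its local count changes (V x K)
        total = np.empty_like(delta)
        comm.Allreduce(delta, total, op=MPI.SUM)   # sum changes from all processors
        n_wt += total                       # everyone now holds the same new parameters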
Slide73
Slide74
Slide75
Slide76
Slide77
Slide78

Gory details of VW Hadoop-AllReduce

Spanning-tree server:
a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server

Worker nodes ("fake" mappers):
input for each worker is locally cached
workers all connect to the spanning-tree server
workers all execute the same code, which might contain AllReduce calls
workers synchronize whenever they reach an all-reduce

Slide79

Hadoop AllReduce

don't wait for duplicate jobs

Slide80

Second-order method - like Newton's method

Slide81

2^24 features
~100 non-zeros/example
2.3B examples
an example is user/page/ad and conjunctions of these, positive if there was a click-thru on the ad

Slide82

50M examples
explicitly constructed kernel → 11.7M features
3,300 nonzeros/example
old method: SVM, 3 days (reporting the time to get to a fixed test error)

Slide83