
Presentation Transcript

Slide1

Scaling SGD to Big Data & Huge Models

Alex Beutel
Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing

Slide2

Big Learning Challenges

Collaborative Filtering
Predict movie preferences

Topic Modeling

What are the topics of webpages, tweets, or status updates

Dictionary Learning

Remove noise or missing pixels from images

Tensor Decomposition

Find communities in temporal graphs

300 million photos uploaded to Facebook per day!

1 Billion users on Facebook

400 million tweets per day

Slide3

Big Data & Huge Model Challenge

2 billion tweets covering 300,000 words
Break into 1000 topics
More than 2 trillion parameters to learn
Over 7 terabytes of model
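A rough sanity check on those numbers (my arithmetic, not stated on the slide): the per-tweet topic loadings dominate the parameter count,

$$2\times10^{9}\ \text{tweets}\times 10^{3}\ \text{topics}\approx 2\times10^{12}\ \text{parameters},\qquad 2\times10^{12}\times 4\ \text{bytes}\approx 8\ \text{TB},$$

which is in the ballpark of the 2 trillion parameters and 7+ terabytes quoted above (the exact size depends on bytes per parameter).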

Topic Modeling

What are the topics of webpages, tweets, or status updates

400 million tweets per day

Slide4

Outline

Background
Optimization
  Partitioning
  Constraints & Projections
System Design
  General algorithm
  How to use Hadoop
  Distributed normalization
  "Always-On SGD" – Dealing with stragglers
Experiments
Future questions

Slide5

Background

Slide6

Stochastic Gradient Descent (SGD)

Slide7

Stochastic Gradient Descent (SGD)
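The update rule on these slides did not survive the export, so here is a minimal sketch of the generic SGD step they refer to, for a loss of the form L(θ) = Σᵢ fᵢ(θ): pick one data point at random and step against its gradient (names and signature are illustrative, not from the talk).

```python
import numpy as np

def sgd(grad_fi, theta, n_points, lr=0.01, epochs=10):
    """Minimal SGD sketch: one randomly chosen point per update.

    grad_fi(theta, i) returns the gradient of the loss on point i.
    """
    for _ in range(epochs):
        for i in np.random.permutation(n_points):    # one pass over shuffled points
            theta = theta - lr * grad_fi(theta, i)   # step against that point's gradient
    return theta
```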

Slide8

SGD for Matrix Factorization

[Figure: ratings matrix X (Users × Movies) factored into U and V, with the latent dimension labeled Genres]

Slide9

SGD for Matrix Factorization

[Figure: X, U, V — two ratings that share no row or column touch disjoint parts of U and V, so their SGD updates are independent]
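As a concrete illustration (a minimal sketch, not the talk's code), the SGD update for a single observed rating x_ij of X ≈ U Vᵀ touches only row i of U and row j of V, which is exactly the independence the slide highlights.

```python
import numpy as np

def sgd_mf_step(U, V, i, j, x_ij, lr=0.01, reg=0.1):
    """One SGD step for matrix factorization X ~= U @ V.T on entry (i, j).

    Only U[i] and V[j] are read and written, so updates for entries
    that share no row or column never conflict.
    """
    err = x_ij - U[i] @ V[j]                     # prediction error on this entry
    U_i_old = U[i].copy()
    U[i] += lr * (err * V[j] - reg * U[i])       # gradient step on row i of U
    V[j] += lr * (err * U_i_old - reg * V[j])    # gradient step on row j of V
```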

Slide10

The Rise of SGD

Hogwild! (Niu et al, 2011)
Noticed independence
If matrix is sparse, there will be little contention
Ignore locks

DSGD (Gemulla et al, 2011)
Noticed independence
Broke matrix into blocks

Slide11

DSGD for Matrix Factorization (Gemulla, 2011)

Independent Blocks

Slide12

DSGD for Matrix Factorization

(Gemulla, 2011)

Partition your data & model into

d × d blocks

Results in d=3 strata

Process strata sequentially, process blocks in each stratum in parallel
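A minimal sketch of that blocking scheme (illustrative code, not from the talk): with d blocks per side, stratum s consists of the d blocks (b, (b + s) mod d), which share no rows or columns and can therefore be processed in parallel.

```python
def strata(d=3):
    """Enumerate the d strata of a d x d blocking of the matrix.

    Each stratum is a list of d (row_block, col_block) pairs that share
    no row block and no column block, so SGD on them touches disjoint
    parts of U and V.
    """
    return [[(b, (b + s) % d) for b in range(d)] for s in range(d)]

# strata(3) -> [[(0, 0), (1, 1), (2, 2)],
#               [(0, 1), (1, 2), (2, 0)],
#               [(0, 2), (1, 0), (2, 1)]]
```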

Slide13

Other Big Learning Platforms

GraphLab (Low et al, 2010) – Find independence in graphs
PSGD (Zinkevich et al, 2010) – Average independent runs on convex problems
Parameter Servers (Li et al, 2014; Ho et al, 2014)
Distributed cache of parameters
Allow a little “staleness”

Slide14

Tensor Decomposition

Slide15

What is a tensor?

Tensors are used for structured data with more than 2 dimensions
Think of it as a 3D matrix, e.g. with modes Subject × Verb × Object

For example: "Derek Jeter plays baseball"

Slide16

Tensor Decomposition

[Figure: 3-mode tensor X (Subject × Verb × Object) decomposed into factor matrices U, V, W; the triple "Derek Jeter plays baseball" is one entry of X]

Slide17

Tensor Decomposition

[Figure: the tensor X and its factor matrices U, V, W]

Slide18

Tensor Decomposition

[Figure: X, U, V, W — entries that share no index along any mode give independent updates; entries that share an index are not independent]

Slide19

Tensor Decomposition

Slide20

For d = 3 blocks per stratum, we require d² = 9 strata
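One way to see where the d² comes from (a sketch, using the same cyclic pattern as the matrix case above; the talk may enumerate the strata differently): for every pair of offsets (s₁, s₂) we can pick d blocks whose mode indices never collide.

```python
def tensor_strata(d=3):
    """Enumerate the d*d strata of a d x d x d blocking of a 3-mode tensor.

    In stratum (s1, s2), block b uses row-block b of U, block (b + s1) % d
    of V, and block (b + s2) % d of W, so the d blocks of a stratum share
    no slice along any mode and can run in parallel.
    """
    return [[(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
            for s1 in range(d) for s2 in range(d)]

# len(tensor_strata(3)) == 9 strata, each containing 3 independent blocks
```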

Slide21

Coupled Matrix + Tensor Decomposition

[Figure: tensor X (Subject × Verb × Object) coupled with a matrix Y whose other mode is Documents]

Slide22

Coupled Matrix + Tensor Decomposition

[Figure: the coupled decomposition — factor matrices U, V, W for the tensor X and an additional factor A for the coupled matrix Y]
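For reference, a common way to write the coupled objective (my reconstruction of the standard formulation; which tensor mode Y shares is an assumption here, and the talk's exact loss may differ):

$$\min_{U,V,W,A}\;\Big\|X-\sum_{r} u_r\otimes v_r\otimes w_r\Big\|_F^2\;+\;\big\|Y-UA^{\top}\big\|_F^2 .$$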

Slide23

Coupled Matrix + Tensor Decomposition

Slide24

Constraints & Projections

Slide25

Example: Topic Modeling

[Figure: documents × words matrix factored into documents × topics and topics × words factors]

Slide26

Constraints

Sometimes we want to restrict the response:
Non-negative
Sparsity
Simplex (so vectors become probabilities)
Keep inside unit ball

Slide27

How to enforce? Projections

Example: Non-negative

Slide28

More projections

Sparsity (soft thresholding)
Simplex
Unit ball
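The formulas for these projections were lost in the export; here is a minimal sketch of the standard operators they refer to (textbook definitions, not copied from the slides).

```python
import numpy as np

def project_nonnegative(v):
    """Clip to the non-negative orthant."""
    return np.maximum(v, 0.0)

def soft_threshold(v, lam):
    """Soft thresholding: the proximal operator of the L1 (sparsity) penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_unit_ball(v):
    """Scale into the L2 unit ball if the vector lies outside it."""
    norm = np.linalg.norm(v)
    return v if norm <= 1.0 else v / norm

def project_simplex(v):
    """Euclidean projection onto the probability simplex
    (sort-based algorithm of Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv - 1.0)[0][-1]
    theta = (cssv[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```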

Slide29

Sparse Non-Negative Tensor Factorization

Sparse encoding

Non-negativity:

More interpretable results

Slide30

Dictionary Learning

Learn a dictionary of concepts and a sparse reconstruction
Useful for fixing noise and missing pixels of images

Sparse encoding

Within unit ball

Slide31

Mixed Membership Network Decomp.

Used for modeling communities in graphs (e.g. a social network)

Simplex

Non-negative

Slide32

Proof Sketch of Convergence

Regenerative process – each point is used once per epoch
Projections are not too big and don’t “wander off” (Lipschitz continuous)
Step sizes are bounded
[Details]

[Annotated equation: the update decomposes into the normal gradient descent update, noise from SGD, the projection, and the SGD constraint error]
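The annotated equation itself was lost; a generic way to write the decomposition those labels describe (my reconstruction, not the paper's exact notation) is

$$\theta_{t+1}\;=\;\underbrace{\theta_t-\eta_t\nabla F(\theta_t)}_{\text{normal gradient descent update}}\;+\;\underbrace{\eta_t\big(\nabla F(\theta_t)-\nabla f_{i_t}(\theta_t)\big)}_{\text{noise from SGD}}\;+\;\underbrace{\eta_t z_t}_{\text{projection / constraint error}},$$

where $f_{i_t}$ is the loss on the sampled point and $\eta_t z_t$ is the correction the projection applies to keep the iterate inside the constraint set.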

Slide33

System design

Slide34

High level algorithm

for epoch e = 1 … T do
  for subepoch s = 1 … d² do
    for each block b = 1 … d in stratum s, in parallel, do
      Run SGD on all points in block b
    end
  end
end
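A runnable sketch of this driver loop (illustrative only; the names and block bookkeeping are mine, and in the real system the inner loop runs as parallel reducers rather than a Python for loop):

```python
def run_epochs(points_in_block, sgd_on_block, d=3, num_epochs=10):
    """Sketch of the high-level driver.

    points_in_block((i, j, k)) returns the data points of that block;
    sgd_on_block(points) runs SGD on them.  Strata are processed
    sequentially; the d blocks inside a stratum touch disjoint parts
    of U, V, and W, so they could run in parallel.
    """
    strata = [[(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
              for s1 in range(d) for s2 in range(d)]          # d^2 strata
    for epoch in range(num_epochs):                           # T epochs
        for stratum in strata:                                # d^2 subepochs
            for block in stratum:                             # d parallel blocks
                sgd_on_block(points_in_block(block))
```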

[Figure: the blocks of three example strata — Stratum 1, Stratum 2, Stratum 3]

Slide35

Bad Hadoop Algorithm: Subepoch 1

[Figure: mappers send points to reducers; each reducer runs SGD on one block and updates its factor blocks — (U2, V1, W3), (U3, V2, W1), (U1, V3, W2)]

Slide36

Bad Hadoop Algorithm: Subepoch 2

[Figure: in subepoch 2 each reducer runs SGD on a new block and updates — (U2, V1, W2), (U3, V2, W3), (U1, V3, W1)]

Slide37

Hadoop Challenges

MapReduce is typically very bad for iterative algorithms
T × d² iterations
Sizable overhead per Hadoop job
Little flexibility

Slide38

High Level Algorithm

[Figure: the factor matrices partitioned into blocks U1–U3, V1–V3, W1–W3; in this stratum the blocks are grouped as (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)]

Slide39

High Level Algorithm

[Figure: the next stratum groups the blocks as (U1, V1, W3), (U2, V2, W1), (U3, V3, W2)]

Slide40

High Level Algorithm

[Figure: a later stratum groups the blocks as (U1, V1, W2), (U2, V2, W3), (U3, V3, W1)]

Slide41

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: mappers read points from HDFS and tag them with their block; after the partition & sort phase each reducer runs SGD on one block — (U1, V1, W1), (U2, V2, W2), (U3, V3, W3) — updates it, and writes the updated factors back to HDFS]
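A minimal sketch of the map and reduce steps (illustrative Hadoop-streaming-style Python, not the talk's implementation; the block assignment and stratum bookkeeping are simplified, and the factors are assumed to be NumPy arrays of shape (dim, rank)):

```python
def map_point(i, j, k, value, d=3):
    """Mapper: tag a tensor entry with a block key.

    The key lets Hadoop's partition & sort phase group all points of one
    block on one reducer, in a well-defined order.  (Real blocking is by
    index range and depends on the current stratum; i % d is a stand-in.)
    """
    block = (i % d, j % d, k % d)
    yield block, (i, j, k, value)

def reduce_block(block, points, U, V, W, lr=0.01):
    """Reducer: run SGD over all points of one block, then emit the updated
    factor blocks (written back to HDFS in the real system)."""
    for i, j, k, x in points:
        err = x - (U[i] * V[j] * W[k]).sum()        # rank-R CP prediction error
        U[i], V[j], W[k] = (U[i] + lr * err * V[j] * W[k],
                            V[j] + lr * err * U[i] * W[k],
                            W[k] + lr * err * U[i] * V[j])
    yield block, (U, V, W)
```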

Slide42

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: animation step of the same pipeline — mappers, partition & sort, reducers]

Slide43

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: animation step of the same pipeline]

Slide44

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: animation step — each reducer runs SGD on its block (U1, V1, W1), (U2, V2, W2), (U3, V3, W3) and updates it]

Slide45

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: animation step of the same pipeline, showing the reducers' SGD and update phases]

Slide46

Hadoop Algorithm

Process points: map each point to its block, with the necessary info to order

[Figure: animation step — reducers hold (U1, V1), (U2, V2), (U3, V3) while the W blocks (W2, W1, W3) move through HDFS]

Slide47

System Summary

Limit storage and transfer of data and model
Stock Hadoop can be used with HDFS for communication
Hadoop makes the implementation highly portable
Alternatively, could also implement on top of MPI or even a parameter server

Slide48

Distributed Normalization

[Figure: the documents × words matrix and its documents × topics (π) and topics × words (β) factors, partitioned into blocks (π1, β1), (π2, β2), (π3, β3)]

Slide49

Distributed Normalization

σ(b) is a k-dimensional vector, summing the terms of βb

Transfer σ(b) to all machines
Each machine calculates σ
Normalize

[Figure: blocks (π1, β1), (π2, β2), (π3, β3) sit on different machines; each machine computes its local σ(b) and exchanges it with the others]
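A sketch of those three steps (illustrative Python; `all_gather` is a hypothetical exchange primitive standing in for the HDFS-based transfer the talk uses):

```python
import numpy as np

def normalize_topics(beta_block, all_gather):
    """Distributed normalization sketch for the topic model factors.

    beta_block: this machine's (words_in_block x k) slice of beta.
    all_gather(x): returns every machine's x (via HDFS, MPI, ...).
    Each machine sums its own block, exchanges the k-dimensional sums,
    and normalizes so every topic's word distribution sums to 1 globally.
    """
    sigma_b = beta_block.sum(axis=0)              # local k-dimensional sum
    sigma = np.sum(all_gather(sigma_b), axis=0)   # global per-topic totals
    return beta_block / sigma                     # normalized local block
```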

Slide50

Barriers & Stragglers

Process points: map each point to its block, with the necessary info to order

[Figure: the same Hadoop pipeline; reducers that finish their block early sit idle — wasting time waiting for stragglers before parameters can be synced]

Slide51

Solution: “Always-On SGD”

For each reducer:
Run SGD on all points in current block Z
Shuffle points in Z and decrease step size
Check if other reducers are ready to sync
If not ready to sync: rather than wait, run SGD on the points in Z again
If ready to sync: sync parameters and get new block Z
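A sketch of that per-reducer loop (illustrative; `sgd_pass`, `ready_to_sync`, and `sync_and_get_next_block` are hypothetical stand-ins for the real system's primitives):

```python
import random

def always_on_sgd(Z, sgd_pass, ready_to_sync, sync_and_get_next_block,
                  lr=0.01, decay=0.9):
    """Always-On SGD reducer loop (sketch).

    Instead of idling at the barrier, the reducer keeps taking extra SGD
    passes over its current block Z, shuffling the points and decaying
    the step size, until all reducers are ready to sync.
    """
    while Z is not None:
        sgd_pass(Z, lr)                       # first pass over the block
        while not ready_to_sync():
            random.shuffle(Z)                 # shuffle points in Z
            lr *= decay                       # decrease step size
            sgd_pass(Z, lr)                   # extra updates on old data
        Z = sync_and_get_next_block()         # sync parameters, get new block
```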

Slide52

“Always-On SGD”

[Figure: the same pipeline; instead of waiting for stragglers, each reducer runs SGD on its old points again until everyone is ready to sync]

Slide53

Proof Sketch

Martingale Difference Sequence: at the beginning of each epoch, the expected number of times each point will be processed is equal

[Details]

Slide54

Proof Sketch

Martingale Difference Sequence: at the beginning of each epoch, the expected number of times each point will be processed is equal
Can use properties of SGD and MDS to show variance decreases with more points used
Extra updates are valuable
[Details]

Slide55

“Always-On SGD”

[Figure: timeline for Reducers 1–4 — reading parameters from HDFS, the first SGD pass of block Z, extra SGD updates while waiting, and writing parameters to HDFS]

Slide56

Experiments

Slide57

FlexiFaCT (Tensor Decomposition)

Convergence

Slide58

FlexiFaCT (Tensor Decomposition)

Scalability in Data Size

Slide59

FlexiFaCT (Tensor Decomposition)

Scalability in Tensor Dimension
Handles up to 2 billion parameters!

Slide60

FlexiFaCT (Tensor Decomposition)

Scalability in Rank of Decomposition
Handles up to 4 billion parameters!

Slide61

Fugue (Using “Always-On SGD”)

Dictionary Learning: Convergence

Slide62

Fugue (Using “Always-On SGD”)

Community Detection: Convergence

Slide63

Fugue (Using “Always-On SGD”)

Topic Modeling: Convergence

Slide64

Fugue (Using “Always-On SGD”)

Topic Modeling: Scalability in Data Size
GraphLab cannot spill to disk

Slide65

Fugue (Using “Always-On SGD”)

Topic Modeling: Scalability in Rank

Slide66

Fugue (Using “Always-On SGD”)

Topic Modeling: Scalability over Machines

Slide67

Fugue (Using “Always-On SGD”)

Topic Modeling: Number of Machines

Slide68

Fugue (Using “Always-On SGD”)

Slide69

Looking forward

Slide70

Future Questions

Do “extra updates” work on other techniques, e.g. Gibbs sampling? Other iterative algorithms?
What other problems can be partitioned well? (Model & Data)
Can we better choose certain data for extra updates?
How can we store large models on disk for I/O efficient updates?

Slide71

Key Points

Flexible method for tensors & ML models
Partition both data and model together for efficiency and scalability
When waiting for slower machines, run extra updates on old data again
Algorithmic & systems challenges in scaling ML can be addressed through statistical innovation

Slide72

Questions?

Alex Beutel
abeutel@cs.cmu.edu
http://alexbeutel.com
Source code available at http://beu.tl/flexifact