Scaling SGD to Big Data & Huge Models
Alex Beutel
Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing
Big Learning Challenges
Collaborative Filtering
Predict movie preferences
Topic Modeling
What are the topics of webpages, tweets, or status updates?
Dictionary Learning
Remove noise or missing pixels from images
Tensor Decomposition
Find communities in temporal graphs
300 Million Photos uploaded to Facebook per day!
1 Billion users on Facebook
400 million tweets per day
Big Data & Huge Model Challenge
2 Billion tweets covering 300,000 words
Break into 1000 topics
More than 2 Trillion parameters to learn
Over 7 Terabytes of model
Outline
- Background
- Optimization: partitioning; constraints & projections
- System design: general algorithm; how to use Hadoop; distributed normalization; “Always-On SGD” – dealing with stragglers
- Experiments
- Future questions
Background
Stochastic Gradient Descent (SGD)
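The update equation on these slides was an image lost in transcription; the standard SGD step it presumably showed, for an objective f(θ) = (1/N) Σᵢ fᵢ(θ), is:

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \,\nabla f_{i_t}(\theta_t),
\qquad i_t \sim \mathrm{Uniform}\{1,\dots,N\}
```

i.e. each step follows the gradient of a single randomly sampled term rather than of the full sum.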
SGD for Matrix Factorization
[Figure: X ≈ U V, a Users × Movies ratings matrix X approximated by the product of U (Users × Genres) and V (Genres × Movies)]
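As a concrete sketch (not the authors' code; the learning rate and regularizer below are illustrative choices), one SGD step on a single observed entry of X ≈ U V looks like:

```python
import numpy as np

def sgd_mf_step(U, V, i, j, x_ij, eta=0.1, lam=0.01):
    """One SGD step on one observed rating x_ij, for X ~= U @ V.

    U: (users, k) factors; V: (k, movies) factors.  The step descends the
    squared error on this rating plus L2 regularization on the touched factors.
    """
    err = x_ij - U[i] @ V[:, j]            # prediction error for this entry
    u_old = U[i].copy()                    # keep the pre-update row for V's step
    U[i]    += eta * (err * V[:, j] - lam * U[i])
    V[:, j] += eta * (err * u_old   - lam * V[:, j])
    return err

# Repeatedly updating on one rating drives the prediction toward it
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 3))
V = rng.normal(scale=0.1, size=(3, 4))
for _ in range(500):
    sgd_mf_step(U, V, 0, 0, 1.0)
```

Note that the step touches only row U[i] and column V[:, j], which is exactly the independence the next slides exploit.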
SGD for Matrix Factorization
[Figure: the same X ≈ U V picture; ratings in disjoint row and column blocks touch disjoint pieces of U and V, so their updates are Independent!]
The Rise of SGD
- Hogwild! (Niu et al, 2011): noticed independence; if the matrix is sparse, there will be little contention, so ignore locks
- DSGD (Gemulla et al, 2011): noticed independence; broke the matrix into blocks
DSGD for Matrix Factorization (Gemulla, 2011)
[Figure: the matrix cut into blocks; the highlighted diagonal blocks are independent]
DSGD for Matrix Factorization (Gemulla, 2011)
Partition your data & model into d × d blocks
This yields d strata (here d = 3)
Process strata sequentially; process the blocks within each stratum in parallel
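The stratum schedule can be sketched as follows (an illustrative helper, not the paper's implementation): for a matrix cut into d × d blocks, stratum s pairs row-block i with column-block (i + s) mod d, so no two blocks in a stratum share rows of U or columns of V:

```python
def matrix_strata(d):
    """Enumerate DSGD strata for a matrix cut into d x d blocks.

    Stratum s contains the d blocks (i, (i + s) % d).  Within a stratum the
    row indices are all distinct and the column indices are all distinct,
    so the d blocks can be processed in parallel without conflicts.
    """
    for s in range(d):
        yield [(i, (i + s) % d) for i in range(d)]

# d = 3 gives 3 strata; the first stratum is the diagonal of blocks
strata = list(matrix_strata(3))
```

Iterating over all d strata visits every one of the d² blocks exactly once.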
Other Big Learning Platforms
- GraphLab (Low et al, 2010) – Find independence in graphs
- PSGD (Zinkevich et al, 2010) – Average independent runs on convex problems
- Parameter Servers (Li et al, 2014; Ho et al, 2014) – Distributed cache of parameters; allow a little “staleness”
Tensor Decomposition
What is a tensor?
Tensors are used for structured data with more than 2 dimensions; think of a 3D matrix with modes such as Subject × Verb × Object.
For example: Derek Jeter plays baseball
Tensor Decomposition
[Figure: X ≈ U ∘ V ∘ W, where X is the Subject × Verb × Object tensor and U, V, W are factor matrices; the entry for “Derek Jeter plays baseball” sits at the intersection of those three modes]
Tensor Decomposition
[Figure: in the blocked tensor, some sets of blocks are independent and others are not, depending on whether they share slices of U, V, or W]
Tensor Decomposition
For d = 3 blocks per stratum, we require d² = 9 strata
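The same rotation trick extends to three modes (a hypothetical helper matching the d² = 9 count above): a stratum is a pair of offsets (s1, s2), and block i of that stratum touches slice i of U, slice (i + s1) mod d of V, and slice (i + s2) mod d of W:

```python
def tensor_strata(d):
    """Enumerate strata for a 3-mode tensor cut into d x d x d blocks.

    A stratum is indexed by two rotations (s1, s2); its block i works on
    coordinates (i, (i + s1) % d, (i + s2) % d).  Within a stratum no two
    blocks share a slice of U, V, or W, so its d blocks run in parallel;
    covering every block of the tensor takes d^2 strata.
    """
    for s1 in range(d):
        for s2 in range(d):
            yield [(i, (i + s1) % d, (i + s2) % d) for i in range(d)]

strata3 = list(tensor_strata(3))
```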
Coupled Matrix + Tensor Decomposition
[Figure: the Subject × Verb × Object tensor X coupled with a matrix Y that shares one mode with X and adds a Document mode]
Coupled Matrix + Tensor Decomposition
[Figure: X ≈ U ∘ V ∘ W, and Y is factored with an additional matrix A while sharing a factor with X]
Coupled Matrix + Tensor Decomposition
Constraints & Projections
Example: Topic Modeling
[Figure: a Words × Documents matrix factored through a Topics dimension]
Constraints
Sometimes we want to restrict the response:
- Non-negative
- Sparsity
- Simplex (so vectors become probabilities)
- Keep inside the unit ball
How to enforce? Projections
Example: Non-negative
More projections
- Sparsity (soft thresholding)
- Simplex
- Unit ball
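The projection formulas on these slides were images; the following are standard forms (my sketches, not code from the talk; the simplex projection follows the usual sort-based algorithm):

```python
import numpy as np

def project_nonnegative(v):
    return np.maximum(v, 0.0)                     # clip negatives to zero

def soft_threshold(v, lam):
    # sparsity: shrink every entry toward zero, zeroing the small ones
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_unit_ball(v):
    n = np.linalg.norm(v)
    return v if n <= 1.0 else v / n               # rescale only if outside the ball

def project_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1}
    u = np.sort(v)[::-1]                          # sorted descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```

Each projection is applied to the freshly updated factor row after an SGD step, keeping the iterate inside the constraint set.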
Sparse Non-Negative Tensor Factorization
- Sparse encoding
- Non-negativity
- More interpretable results
Dictionary Learning
Learn a dictionary of concepts and a sparse reconstruction; useful for fixing noise and missing pixels in images
- Sparse encoding
- Within the unit ball
Mixed Membership Network Decomposition
Used for modeling communities in graphs (e.g. a social network)
- Simplex
- Non-negative
Proof Sketch of Convergence
- Regenerative process – each point is used once per epoch
- Projections are not too big and don’t “wander off” (Lipschitz continuous)
- Step sizes are bounded
[Details]
[Equation: the update decomposes into the normal gradient descent update, the noise from SGD, and the constraint error from the projection]
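The annotated equation on this slide did not survive transcription; a standard way to write the projected-SGD step with these labels (my reconstruction, not necessarily the exact form used) is:

```latex
\theta_{t+1}
= \underbrace{\theta_t - \eta_t \nabla f(\theta_t)}_{\text{normal gradient descent update}}
\;+\; \underbrace{\eta_t\bigl(\nabla f(\theta_t) - \nabla f_{i_t}(\theta_t)\bigr)}_{\text{noise from SGD}}
\;+\; \underbrace{\eta_t z_t}_{\text{constraint error from the projection}}
```

where $\theta_{t+1} = \Pi_C\!\left(\theta_t - \eta_t \nabla f_{i_t}(\theta_t)\right)$ and $\eta_t z_t$ is the correction introduced by the projection $\Pi_C$ onto the constraint set.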
System design
High-Level Algorithm

for epoch e = 1 … T do
  for subepoch s = 1 … d² do
    for each of the d blocks in stratum s, in parallel do
      Run SGD on all points in the block
    end
  end
end

[Figure: Stratum 1, Stratum 2, Stratum 3, …]
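The loop above can be sketched in runnable form (the block-data lookup and the per-block SGD routine are hypothetical callables, and threads stand in for machines):

```python
from concurrent.futures import ThreadPoolExecutor

def run_epochs(points_in_block, sgd_on_block, d, T):
    """Outer loop: T epochs, d^2 subepochs (strata), and d independent
    blocks per stratum processed in parallel.

    points_in_block(i, j, k) -> the data for that block (hypothetical helper);
    sgd_on_block(data) runs SGD over one block's points (hypothetical helper).
    """
    for epoch in range(T):
        for s1 in range(d):                # a stratum is a pair of rotations
            for s2 in range(d):
                blocks = [(i, (i + s1) % d, (i + s2) % d) for i in range(d)]
                with ThreadPoolExecutor(max_workers=d) as pool:
                    list(pool.map(lambda b: sgd_on_block(points_in_block(*b)),
                                  blocks))

# One epoch with d = 2 visits each of the 2^3 = 8 blocks exactly once
seen = []
run_epochs(lambda i, j, k: (i, j, k), seen.append, d=2, T=1)
```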
Bad Hadoop Algorithm: Subepoch 1
[Figure: mappers feed reducers; each reducer runs SGD on its block and then updates its factor slices, e.g. (U2, V1, W3), (U3, V2, W1), (U1, V3, W2)]
Bad Hadoop Algorithm: Subepoch 2
[Figure: the same pipeline for the next stratum, with rotated assignments (U2, V1, W2), (U3, V2, W3), (U1, V3, W1)]
Hadoop Challenges
- MapReduce is typically very bad for iterative algorithms
- T × d² iterations
- Sizable overhead per Hadoop job
- Little flexibility
High-Level Algorithm
[Figure: the slices V1 V2 V3, U1 U2 U3, W1 W2 W3 are grouped per machine for the first stratum: (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)]
High-Level Algorithm
[Figure: for the next stratum the W slices rotate: (U1, V1, W3), (U2, V2, W1), (U3, V3, W2)]
High-Level Algorithm
[Figure: another stratum, rotating W again: (U1, V1, W2), (U2, V2, W3), (U3, V3, W1)]
Hadoop Algorithm
Process points: map each point to its block, with the necessary info to order it
[Figure: mappers emit points; partition & sort routes each block to a reducer; reducer b runs SGD on its block and updates (Ub, Vb, Wb); parameters are read from and written to HDFS]
Hadoop Algorithm
[Figure: the same pipeline shown step by step: map, partition & sort, reduce with SGD updates, and writes to HDFS; across subepochs the W slices rotate between reducers]
System Summary
- Limit storage and transfer of data and model
- Stock Hadoop can be used, with HDFS for communication
- Hadoop makes the implementation highly portable
- Alternatively, could also implement on top of MPI or even a parameter server
Distributed Normalization
[Figure: the topic model factors (Documents × Topics π and Topics × Words β) split across machines: machine b holds πb and βb]
Distributed Normalization
- σ(b) is a k-dimensional vector summing the terms of βb
- Transfer σ(b) to all machines
- Each machine calculates the total σ from the σ(b)
- Normalize βb by σ
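A single-process sketch of the scheme (the local summation below stands in for the actual network transfer of the σ(b) vectors):

```python
import numpy as np

def distributed_normalize(beta_blocks):
    """Normalize a topics x words matrix that is split column-wise across
    machines, so each topic's row sums to 1 overall.

    beta_blocks: list of (k x words_b) arrays, one per machine.
    Each machine computes its k-dimensional partial row-sum sigma^(b);
    the sigmas are exchanged between machines (simulated here by a plain
    sum) and every machine divides its slice by the global total.
    """
    sigmas = [b.sum(axis=1) for b in beta_blocks]   # local sigma^(b)
    sigma = np.sum(sigmas, axis=0)                  # stand-in for the transfer
    return [b / sigma[:, None] for b in beta_blocks]

# Two "machines", each holding half of the word columns for k = 2 topics
out = distributed_normalize([
    np.array([[1.0, 2.0], [3.0, 1.0]]),
    np.array([[2.0, 0.0], [1.0, 0.0]]),
])
```

Only the k-dimensional σ(b) vectors cross the network, never the full β slices.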
Barriers & Stragglers
[Figure: the same map / partition & sort / reduce pipeline; reducers that finish their block early sit at the barrier, wasting time waiting for stragglers!]
Solution: “Always-On SGD”
For each reducer:
1. Run SGD on all points in the current block Z
2. Shuffle the points in Z and decrease the step size
3. Check if the other reducers are ready to sync
4. If not ready to sync: run SGD on the points in Z again (rather than waiting), returning to step 2
5. If ready to sync: sync parameters and get a new block Z
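One reducer's inner loop can be sketched as follows (sgd_pass and ready_to_sync are hypothetical callables supplied by the surrounding system; eta and decay are illustrative values):

```python
import random

def always_on_reducer(block, sgd_pass, ready_to_sync, eta=0.1, decay=0.9):
    """Sketch of one reducer's loop for "Always-On SGD".

    block: list of points; sgd_pass(points, eta) runs one SGD sweep over
    them; ready_to_sync() reports whether the other reducers have finished.
    Instead of idling at the barrier, the reducer reshuffles its current
    block and keeps running extra SGD passes on it until sync is possible.
    """
    sgd_pass(block, eta)                 # first pass over the block
    while not ready_to_sync():           # straggler time: don't just wait...
        random.shuffle(block)            # ...reshuffle the old points,
        eta *= decay                     # shrink the step size,
        sgd_pass(block, eta)             # and run SGD on them again
    return eta                          # then sync parameters & fetch new block
```

With two stragglers' worth of waiting, the reducer gets two extra passes over its block instead of idle time.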
“Always-On SGD”
[Figure: the same pipeline, but a reducer waiting at the barrier runs SGD on its old points again instead of idling]
Proof Sketch
- Martingale Difference Sequence: at the beginning of each epoch, the expected number of times each point will be processed is equal
- Properties of SGD and MDS show that variance decreases as more points are used
- Extra updates are valuable
[Details]
“Always-On SGD”
[Figure: timeline for Reducers 1–4 showing the first SGD pass of block Z, extra SGD updates while waiting, reads of parameters from HDFS, and writes of parameters to HDFS]
Experiments
FlexiFaCT (Tensor Decomposition)
Convergence
FlexiFaCT (Tensor Decomposition)
Scalability in Data Size
FlexiFaCT (Tensor Decomposition)
Scalability in Tensor Dimension: handles up to 2 billion parameters!
FlexiFaCT (Tensor Decomposition)
Scalability in Rank of Decomposition: handles up to 4 billion parameters!
Fugue (Using “Always-On SGD”)
Dictionary Learning: Convergence
Fugue (Using “Always-On SGD”)
Community Detection: Convergence
Fugue (Using “Always-On SGD”)
Topic Modeling: Convergence
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Data Size (GraphLab cannot spill to disk)
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability in Rank
Fugue (Using “Always-On SGD”)
Topic Modeling: Scalability over Machines
Fugue (Using “Always-On SGD”)
Topic Modeling: Number of Machines
Looking forward
Future Questions
- Do “extra updates” work for other techniques, e.g. Gibbs sampling? Other iterative algorithms?
- What other problems can be partitioned well? (Model & Data)
- Can we better choose certain data for extra updates?
- How can we store large models on disk for I/O-efficient updates?
Key Points
- Flexible method for tensors & ML models
- Partition both data and model together for efficiency and scalability
- When waiting for slower machines, run extra updates on old data again
- Algorithmic & systems challenges in scaling ML can be addressed through statistical innovation
Questions?
Alex Beutel
abeutel@cs.cmu.edu
http://alexbeutel.com
Source code available at http://beu.tl/flexifact