SGD on Hadoop for Big Data & Huge Models
Alex Beutel
Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing
Outline
- When to use SGD for distributed learning
- Optimization
  - Review of DSGD
  - SGD for tensors
  - SGD for ML models: topic modeling, dictionary learning, MMSB
- Hadoop
  - General algorithm
  - Setting up the MapReduce body
  - Reducer communication
  - Distributed normalization
- "Always-On SGD": how to deal with the straggler problem
- Experiments
When distributed SGD is useful
- Collaborative filtering: predict movie preferences
- Topic modeling: what are the topics of webpages, tweets, or status updates?
- Dictionary learning: remove noise or missing pixels from images
- Tensor decomposition: find communities in temporal graphs
The scale: 300 million photos uploaded to Facebook per day, 1 billion users on Facebook, 400 million tweets per day.
Gradient Descent
Stochastic Gradient Descent (SGD)
SGD Background
DSGD for Matrices (Gemulla, 2011)
[Figure: X ≈ U V, where X is the users × movies ratings matrix, U maps users to genres, and V maps genres to movies]
Blocks of X that share no rows of U and no columns of V are independent, so they can be processed in parallel.
DSGD for Matrices (Gemulla, 2011)
- Partition your data & model into d × d blocks
- This yields d strata (here d = 3)
- Process strata sequentially; process the blocks within each stratum in parallel
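As a concrete sketch (illustrative code, not from the slides), the d strata of a d × d blocking can be generated with a cyclic shift — within a stratum, no two blocks share a row-block of U or a column-block of V:

```python
def dsgd_strata(d):
    """Yield the d strata of a d x d blocking.  Stratum s holds the blocks
    (i, (i + s) % d); no two blocks in a stratum share a row-block of U
    or a column-block of V, so they can be updated in parallel."""
    for s in range(d):
        yield [(i, (i + s) % d) for i in range(d)]

# For d = 3: stratum 0 is the diagonal [(0, 0), (1, 1), (2, 2)],
# stratum 1 shifts the column index by one, and so on.
```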
Tensors
What is a tensor? Tensors represent structured data with more than 2 dimensions; think of a 3-mode tensor as a 3D matrix.
For example, (subject, verb, object) triples such as "Derek Jeter plays baseball".
Tensor Decomposition
[Figure: X ≈ the factor matrices U, V, W, shown built up one factor at a time]
[Figure: some pairs of blocks are independent, others are not — blocks are independent only if they share no rows of U, V, or W]
Tensor Decomposition
For d = 3 blocks per stratum, we require d² = 9 strata.
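Extending the same cyclic-shift idea to three modes (again an illustrative sketch, not code from the talk): with d blocks per mode there are d² strata of d independent blocks each:

```python
def tensor_strata(d):
    """Yield the d**2 strata of a d x d x d blocking.  Stratum (a, b)
    holds the d blocks (i, (i + a) % d, (i + b) % d); within a stratum
    no two blocks share a block of U, V, or W."""
    for a in range(d):
        for b in range(d):
            yield [(i, (i + a) % d, (i + b) % d) for i in range(d)]
```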
Coupled Matrix + Tensor Decomposition
[Figure: a subject × verb × object tensor X coupled with a subject × document matrix Y]
[Figure: X ≈ U, V, W and Y ≈ U, A, with the subject factor U shared between the two decompositions]
Constraints & Projections
Example: topic modeling
[Figure: a documents × words matrix factored through topics]
Constraints
Sometimes we want to restrict the parameters:
- Non-negative
- Sparsity
- Simplex (so vectors become probabilities)
- Keep inside the unit ball
How to enforce them? Projections.
Example: non-negativity (after each SGD update, project the parameters back onto the non-negative orthant)
More projections:
- Sparsity (soft thresholding)
- Simplex
- Unit ball
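These four projections can be sketched as follows (standard formulas, not taken from the slides; the simplex step follows the well-known sort-based Euclidean projection):

```python
import numpy as np

def project_nonnegative(x):
    # Clip negative entries to zero
    return np.maximum(x, 0.0)

def soft_threshold(x, lam):
    # Sparsity: shrink toward zero, exactly zeroing entries with |x_i| <= lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def project_simplex(x):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1}
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(x) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(x - theta, 0.0)

def project_unit_ball(x):
    # Scale down only if the vector lies outside the unit ball
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n
```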
Dictionary Learning
Learn a dictionary of concepts and a sparse reconstruction; useful for fixing noise and missing pixels in images.
Constraints: sparse encoding; dictionary elements within the unit ball.
Mixed Membership Network Decomposition
Used for modeling communities in graphs (e.g., a social network).
Constraints: simplex; non-negative.
Implementing on Hadoop
High level algorithm
for epoch e = 1 … T do
    for subepoch s = 1 … d² do
        Let B_s be the set of blocks in stratum s
        for block b = 1 … d in parallel do
            Run SGD on all points in block b
        end
    end
end
[Figure: strata 1, 2, 3, … processed in sequence]
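To make "Run SGD on all points in block b" concrete, here is one possible inner update for a single tensor entry under squared loss (an illustrative sketch; the function name, step size, and loss are assumptions, not taken from the talk):

```python
import numpy as np

def sgd_step(U, V, W, i, j, k, x_ijk, eta):
    """One SGD update on a single tensor entry x_ijk, approximated by
    sum_r U[i, r] * V[j, r] * W[k, r].  Updates the rows of U, V, W
    that this entry touches, and returns the pre-update residual."""
    err = x_ijk - np.sum(U[i] * V[j] * W[k])   # residual at this entry
    U[i] += eta * 2.0 * err * (V[j] * W[k])    # gradient steps on the
    V[j] += eta * 2.0 * err * (U[i] * W[k])    # three touched factor rows
    W[k] += eta * 2.0 * err * (U[i] * V[j])
    return err
```

Only the rows (i, j, k) of the factors change, which is exactly why points in different blocks of a stratum can be updated in parallel.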
Bad Hadoop Algorithm
[Figure, subepoch 1: mappers send blocks to reducers; each reducer runs SGD on its block and updates its assigned blocks of U, V, and W]
[Figure, subepoch 2: a fresh MapReduce job redistributes the data and factor blocks so each reducer gets the next stratum's blocks]
Every subepoch becomes its own MapReduce job.
Hadoop Challenges
- MapReduce is typically very bad for iterative algorithms
- T × d² iterations
- Sizable overhead per Hadoop job
- Little flexibility
High Level Algorithm
[Figure: U, V, and W are each split into blocks 1…3; each reducer starts with one aligned block of each factor — (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)]
[Figure: across subepochs the W blocks rotate among reducers, e.g. (U1, V1, W3), (U2, V2, W1), (U3, V3, W2), then (U1, V1, W2), (U2, V2, W3), (U3, V3, W1), so every stratum is eventually processed]
Hadoop Algorithm
Process points: map each point to its block, with the necessary info to order points within each reducer.
[Figure: mappers → partition & sort → reducers; each reducer reads its factor blocks (Ur, Vr, Wr) from HDFS, runs SGD on its block, updates the factors, and writes them back to HDFS]
To route and order the points, use Hadoop's:
- Partitioner
- KeyComparator
- GroupingComparator
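A toy simulation (plain Python, not Hadoop API code) of what these three hooks accomplish together: each key carries (block, order-within-block); the Partitioner routes on the block only, the KeyComparator sorts on the full key, and the GroupingComparator then lets one reduce call see all of a block's points in order:

```python
def partitioner(key, num_reducers):
    block, _order = key
    return block % num_reducers          # route on the block id only

def shuffle(records, num_reducers):
    """Group (key, value) pairs by partition, sorted within each partition,
    mimicking Hadoop's partition & sort phase."""
    bins = [[] for _ in range(num_reducers)]
    for key, value in records:
        bins[partitioner(key, num_reducers)].append((key, value))
    for b in bins:
        b.sort(key=lambda kv: kv[0])     # KeyComparator: block, then order
    return bins

records = [((2, 1), "p4"), ((0, 0), "p1"), ((2, 0), "p3"), ((0, 1), "p2")]
bins = shuffle(records, 3)               # reducer 0 gets p1, p2; reducer 2 gets p3, p4
```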
[Figure: the same pipeline over successive subepochs — between subepochs, each reducer writes its updated W block to HDFS and reads the W block it needs next, so the factors rotate without starting a new MapReduce job]
Hadoop Summary
- Use mappers to send data points to the correct reducers, in order
- Use reducers as machines in a normal cluster
- Use HDFS as the communication channel between reducers
Distributed Normalization
[Figure: a documents × words topic model whose parameters are split across machines as (π1, β1), (π2, β2), (π3, β3)]
Distributed Normalization
- Each machine b computes σ(b), a k-dimensional vector summing the terms of its βb
- Transfer σ(b) to all machines
- Each machine calculates σ = Σb σ(b)
- Normalize: βb ← βb / σ, so each topic's parameters sum to 1 across machines
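A single-process sketch of this exchange (the per-machine blocks are simulated as a list, and the k × words orientation of each βb is an assumption):

```python
import numpy as np

def distributed_normalize(beta_blocks):
    """beta_blocks[b] is machine b's slice of beta: k topics x (its words).
    Each machine computes sigma^(b) = the row sums of its slice, all
    sigma^(b) are exchanged, sigma = sum_b sigma^(b), and each machine
    divides its slice by sigma so every topic sums to 1 globally."""
    local = [blk.sum(axis=1) for blk in beta_blocks]   # sigma^(b), k-dimensional
    sigma = np.sum(local, axis=0)                      # sigma = sum_b sigma^(b)
    return [blk / sigma[:, None] for blk in beta_blocks]
```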
Barriers & Stragglers
[Figure: the same pipeline; reducers that finish their block early sit idle at the synchronization barrier — wasting time waiting for stragglers]
Solution: "Always-On SGD"
For each reducer:
- Run SGD on all points in the current block Z
- Shuffle the points in Z and decrease the step size
- Check if the other reducers are ready to sync
  - If not ready to sync: run SGD on the points in Z again
  - If not ready to sync: wait
- Sync parameters and get a new block Z
"Always-On SGD"
[Figure: the same pipeline, but while waiting for stragglers each reducer runs SGD on its old points again instead of idling]
"Always-On SGD"
[Figure: per-reducer timeline for reducers 1–4, showing the first SGD pass of block Z, extra SGD updates, and reads/writes of parameters to HDFS]
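The per-reducer loop above can be sketched as follows (a simplified single-threaded sketch; `others_ready` and `sync` stand in for the HDFS-based signalling between reducers, and the step sizes are illustrative assumptions):

```python
import random

def always_on_sgd(points, run_sgd_pass, others_ready, sync):
    """One reducer's loop: instead of idling at the barrier, keep
    re-running SGD over the current block Z until all reducers are
    ready, then sync.  Returns the number of extra passes taken."""
    step = 0.1                      # initial step size (illustrative)
    extra_passes = 0
    run_sgd_pass(points, step)      # first SGD pass over block Z
    while not others_ready():       # other reducers not ready to sync?
        random.shuffle(points)      # shuffle Z ...
        step *= 0.9                 # ... and decrease the step size
        run_sgd_pass(points, step)  # extra pass instead of waiting
        extra_passes += 1
    sync()                          # exchange parameters, get next block
    return extra_passes
```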
Experiments
FlexiFaCT (Tensor Decomposition)
- Convergence
- Scalability in data size
- Scalability in tensor dimension: handles up to 2 billion parameters!
- Scalability in rank of the decomposition: handles up to 4 billion parameters!
- Scalability in number of machines
[Figures: experimental plots omitted]
Fugue (Using "Always-On SGD")
- Dictionary learning: convergence
- Community detection: convergence
- Topic modeling: convergence
- Topic modeling: scalability in data size
- Topic modeling: scalability in rank
- Topic modeling: scalability in number of machines
[Figures: experimental plots omitted]
Key Points
- Flexible method for tensors & ML models
- Can use stock Hadoop by using HDFS for communication
- When waiting for slower machines, run updates on old data again
Questions?
Alex Beutel
abeutel@cs.cmu.edu
http://alexbeutel.com