Parameter Servers
(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)

RECAP: Regret analysis for online optimization

RECAP: [screenshot of the 2009 paper being recapped]

RECAP: Take a gradient step: x' = x_t - η_t g_t
If you've restricted the parameters to a subspace X (e.g., must be positive, ...), take the closest point in X to x': x_{t+1} = argmin_{x in X} dist(x, x')
But... you might be using a "stale" gradient g (from τ steps ago).
Here f is the loss function and x is the parameter vector. (A small code sketch of this update follows.)
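Below is a minimal sketch of the delayed, projected SGD update described above, assuming the constraint set X is the non-negative orthant (so the projection is just clipping at zero); the function name and the choice of X are illustrative, not part of the original slides.

import numpy as np

def delayed_projected_sgd_step(x_t, stale_grad, eta_t):
    """One update of delayed SGD: step along a gradient that may be tau
    steps old, then project back onto the feasible set X."""
    x_prime = x_t - eta_t * stale_grad      # gradient step with a (possibly stale) gradient
    return np.maximum(x_prime, 0.0)         # projection onto X = {x : x >= 0}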
RECAP: Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x:
  Regret = Σ_{t=1..T} f_t(x_t) - min_x Σ_{t=1..T} f_t(x)
Special case: f_t is 1 if a mistake was made and 0 otherwise, and f_t(x*) = 0 for the optimal x*, so Regret = # mistakes made during learning.
RECAP: Theorem: you can find a learning rate so that the regret of delayed SGD is bounded in terms of T and τ, where T = # timesteps and τ = staleness > 0.

RECAP: Theorem 8: you can do better if you assume (1) the examples are i.i.d. and (2) the gradients are smooth, analogous to the assumption about L. Then you can show a bound on expected regret whose dominant term is the no-delay loss.
RECAP: Experiments [plots omitted]
RECAP: Summary of "Slow Learners are Fast"
A generalization of iterative parameter mixing:
- run multiple learners in parallel
- conceptually they share the same weight/parameter vector, BUT...
- the learners share weights imperfectly
- the learners are almost synchronized: there's a bound τ on how stale the shared weights can get
BACKGROUND: Distributed Coordination Services
Example: Apache ZooKeeper. Distributed processes coordinate through shared "data registers" (aka znodes), which look a bit like a shared in-memory filesystem.

BACKGROUND: Distributed Coordination Services
Client:
  create /w_foo
  set /w_foo "bar"
  get /w_foo        -> "bar"
Better with more reads than writes. (A small client sketch follows.)
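For concreteness, here is a sketch of the same znode interaction using kazoo, a Python ZooKeeper client; the host address is an assumption for the example.

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (assumed here to be on localhost:2181).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# znodes behave like paths in a shared in-memory filesystem.
zk.create("/w_foo")                # create /w_foo
zk.set("/w_foo", b"bar")           # set /w_foo "bar"
value, stat = zk.get("/w_foo")     # get /w_foo -> b"bar"

zk.stop()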
Parameter Servers
(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)

[diagram relating ML Systems to Scalable Systems, Abstractions, and Scalable Machine Learning Algorithms]
ML Systems Landscape (this table is built up incrementally over several slides)

                        Graph Systems         Dataflow Systems       Shared Memory Systems
  Model (algorithms)    Graph Algorithms,     Naïve Bayes,           SGD, Sampling
                        Graphical Models      Rocchio
  Abstractions          Vertex-Programs       PIG, GuineaPig, ...    Parameter Server
                        [UAI'10]                                     [VLDB'10] [NIPS'09, NIPS'13]
  Systems               GraphLab,             Hadoop & Spark         Bosen, DMTK,
                        Tensorflow                                   ParameterServer.org
ML Systems Landscape: Parameter Server [VLDB'10] [NIPS'09, NIPS'13]
Simple case: the parameters of the ML system are stored in a distributed hash table that is accessible through the network.
Parameter servers are used in Google, Yahoo, ...
Academic work by Smola, Xing, ...
Parameter Servers Are Flexible
[figure: a range of applications, all implemented with a Parameter Server]
Parameter Server (PS)
Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory).
Extensions: multiple keys (for a matrix); multiple "channels" (for multiple sparse vectors, multiple clients for the same servers, ...).
[figure: server machines and worker machines]
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Parameter Server (PS)
Extensions: a push/pull interface to send/receive the most recent copy of (a subset of) the parameters; blocking is optional.
Extension: a worker can block until pushes/pulls with clock < (t - τ) complete.
[figure: server machines and worker machines]
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Data parallel learning with PS
[figure: the model w1..w9 is sharded across three parameter servers (w1-w3, w4-w6, w7-w9); each worker machine holds a shard of the data plus a local view of w1..w9]
Split the data across machines.

Data parallel learning with PS
[figure: same setup as above]
Different parts of the model live on different servers. Workers retrieve the parts they need, as they need them.
Key-Value API for workers:
  get(key) -> value
  add(key, delta)
This is the abstraction used for data parallel learning with a PS. (A toy sketch of this API follows.)
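A minimal sketch of this abstraction: a toy in-process key-value store exposing just get(key) and add(key, delta), used by a worker to take one sparse SGD step. The class and function names, and the logistic-regression example, are illustrative and not the API of any real parameter server.

import math

class ToyParamServer:
    """In-memory stand-in for one parameter server shard: a key-value store
    with the two calls workers need, get(key) and add(key, delta)."""
    def __init__(self):
        self.table = {}

    def get(self, key):
        return self.table.get(key, 0.0)

    def add(self, key, delta):
        self.table[key] = self.table.get(key, 0.0) + delta

def sgd_step(ps, example, lr=0.1):
    """One data-parallel SGD step for logistic regression on a sparse example.
    example = (features: dict key -> value, label in {0, 1})."""
    feats, label = example
    score = sum(ps.get(k) * v for k, v in feats.items())   # get only the keys this example touches
    pred = 1.0 / (1.0 + math.exp(-score))
    for k, v in feats.items():
        ps.add(k, -lr * (pred - label) * v)                 # push back sparse updates

ps = ToyParamServer()
sgd_step(ps, ({"w_foo": 1.0, "w_bar": 2.0}, 1))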
PS vs. Hadoop (Map-Reduce)
Iteration in Map-Reduce (IPM)
[figure: Training Data -> Map -> Reduce -> Learned Model; the initial model w(0) is refined to w(1), w(2), w(3) across iterations]
Cost of Iteration in Map-Reduce
[figure: the same iteration loop]
Problem 1: the same training data is repeatedly loaded on every iteration (Read 1, Read 2, Read 3, ...).

Cost of Iteration in Map-Reduce
[figure: the same iteration loop]
Problem 2: output is redundantly saved between stages.
Parameter Servers: the Stale Synchronous Parallel Model
(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)
Parameter Server (PS)
Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory).
[figure: server machines and worker machines]
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Iterative ML Algorithms
Topic models, matrix factorization, SVMs, deep neural networks, ...
[figure: each worker iterates over its data, repeatedly reading and updating the model parameters]
Map-Reduce vs. Parameter Server

                            Map-Reduce                        Parameter Server
  Data Model                Independent Records               Independent Data
  Programming Abstraction   Map & Reduce                      Key-Value Store (Distributed Shared Memory)
  Execution Semantics       Bulk Synchronous Parallel (BSP)   ?
The Problem: Networks Are Slow!
The network is slow compared to local memory access. We want to explore options for handling this.
[figure: worker machines issue get(key) and add(key, delta) calls to the server machines]
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Solution 1: Cache Synchronization
[figure: a central server and several data-holding workers, each with a cached copy of the parameters]

Parameter Cache Synchronization
[figure: workers send sparse changes to the model up to the server]

Parameter Cache Synchronization (aka IPM)
[figure: the updated model is synchronized back out to the worker caches]
Solution 2: Asynchronous Execution
[figure: with barriers, machines 1-3 alternate compute and communicate phases and waste time waiting at every barrier sync; removing the barriers removes the waste]
Enables more frequent coordination on parameter values.
Asynchronous Execution
[figure: machines 1-3 run iteration after iteration without barriers, reading and writing the parameters w1..w9 on a (logical) parameter server]
[Smola et al 2010]
Asynchronous Execution
Problem: fully asynchronous execution lacks a theoretical guarantee, since a distributed environment can have arbitrary delays from the network and from stragglers. But....

RECAP: the delayed-SGD analysis from earlier: take a gradient step x' = x_t - η_t g_t, project back onto the feasible set X, and note that the gradient g may be "stale" (from τ steps ago); the regret is still bounded in terms of the staleness τ.
Map-Reduce vs. Parameter Server (revisited)

                            Map-Reduce                        Parameter Server
  Data Model                Independent Records               Independent Data
  Programming Abstraction   Map & Reduce                      Key-Value Store (Distributed Shared Memory)
  Execution Semantics       Bulk Synchronous Parallel (BSP)   Bounded Asynchronous
Parameter Server: Bounded Asynchronous
Stale synchronous parallel (SSP):
- there is a global clock time t
- the parameters workers "get" can be out of date, but can't be older than t - τ
- τ controls the "staleness" (hence "stale synchronous parallel")
Stale Synchronous Parallel (SSP)
Interpolates between BSP and Async, and subsumes both.
- Workers are allowed to usually run at their own pace.
- The fastest and slowest threads are not allowed to drift more than s clocks apart.
- Efficiently implemented: cache parameters. [Ho et al 2013]
(A toy sketch of the staleness rule follows.)
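A toy sketch of the SSP staleness rule, assuming each worker advances an integer clock per iteration; a fast worker blocks only when it is more than `staleness` clocks ahead of the slowest one. Real implementations (e.g. Bosen) also maintain per-worker parameter caches, which are omitted here.

import threading

class SSPClock:
    """Toy stale-synchronous-parallel clock (illustrative only)."""
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def clock(self, worker_id):
        # The worker announces that it finished one more iteration.
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()

    def wait(self, worker_id):
        # Block while this worker is more than `staleness` clocks ahead
        # of the slowest worker; otherwise let it run at its own pace.
        with self.cond:
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()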
Consistency Matters
A suitable delay (SSP) gives a big speed-up.
[figure: convergence under strong consistency vs. relaxed consistency] [Ho et al 2013]

Stale Synchronous Parallel (SSP)
[figure: SSP convergence results] [Ho et al 2013]
Beyond the PS/SSP Abstraction...
Managed Communications
BSP stalls during communication. SSP is able to overlap communication and computation... but the network can be underused.
[figure: compute timelines; under BSP, computation stalls at every barrier sync; under SSP, a worker stalls only if a certain previous sync did not complete]
[Wei et al 2015]
Network loads for PS/SSP [Wei et al 2015]
[figure: existing systems show a burst of traffic after each compute phase, then an idle network]
How can we use network capacity better? Maybe tell the system a little more about the problem we're solving, so it can manage communication better.
Bosen: choosing the model partition [Wei et al 2015]
[figure: each worker runs the ML app plus a client library over its own data partition; the model is partitioned across server 1 and server 2]
Parameter Server [Power'10] [Ahmed'12] [Ho'13] [Li'14]: a coherent shared-memory abstraction for the application. Let the library worry about consistency, communication, etc.
Ways To Manage Communication [Wei et al 2015]
- Model parameters are not equally important; e.g., the majority of the parameters may converge within a few iterations.
- Communicate the more important parameter values or updates first.
- The magnitude of a change indicates its importance: magnitude-based prioritization strategies, for example relative-magnitude prioritization (sketched below).
We saw many of these ideas in the signal/collect paper.
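A small sketch of relative-magnitude prioritization under a bandwidth budget; the function name, the budget parameter, and the exact scoring rule are illustrative rather than the precise Bosen policy.

def prioritize_updates(pending, values, budget):
    """Pick which pending updates to communicate this round:
    rank keys by |delta| relative to the current parameter value,
    and send only the top `budget` of them."""
    def priority(key):
        delta = pending[key]
        denom = abs(values.get(key, 0.0)) + 1e-8      # avoid division by zero
        return abs(delta) / denom

    return sorted(pending, key=priority, reverse=True)[:budget]

# Example: three pending updates, but room to send only two this round.
chosen = prioritize_updates({"a": 0.5, "b": 0.01, "c": 2.0},
                            {"a": 1.0, "b": 10.0, "c": 1.0}, budget=2)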
Or more radically....
Iterative ML Algorithms
Many ML algorithms are iterative-convergent. Examples: optimization and sampling methods; topic models, matrix factorization, SVMs, deep neural networks, ...
Generic update: A(t) = F( A(t-1), Δ( A(t-1), D ) ), where A holds the parameters at time t, L is the loss, Δ is the gradient/update computed from the data D, and F applies the update.
Iterative ML with a Parameter Server: (1) Data Parallel
- Each worker is assigned a data partition.
- The model parameters are shared by the workers.
- Workers read and update the model parameters; here Δ is the gradient of L and D_p is data shard p.
- The per-shard updates are usually combined by adding them, which assumes the data are i.i.d.
- A good fit for the PS/SSP abstraction. Updates are often added locally first (~ a combiner). (A sketch of such a round follows.)
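A sketch of one data-parallel round with local accumulation before pushing, reusing the toy get()/add() store from the earlier sketch; grad_fn is an assumed callback that returns a sparse {key: gradient} dict for one example.

def data_parallel_round(ps, worker_shards, grad_fn, lr=0.1):
    """One round of data-parallel learning (sketch): each worker computes
    updates on its own shard, accumulates them locally (like a combiner),
    and only then pushes a single add() per key to the parameter server."""
    for shard in worker_shards:              # in a real system, these loops run in parallel
        local = {}                           # locally accumulated deltas
        for example in shard:
            for key, g in grad_fn(ps, example).items():
                local[key] = local.get(key, 0.0) - lr * g
        for key, delta in local.items():
            ps.add(key, delta)               # one network update per key per round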
Iterative ML with a Parameter Server: (2) Model Parallel
Here S_p is a scheduler for processor p that selects which parameters p will update (for clarity we ignore D as well as L).
It is not clear how this fits with the PS/SSP abstraction...
Parameter Server Scheduling
Optional scheduling interface:
  schedule(key) -> param keys svars           # which params a worker will access: largest updates, a partition of the graph, ...
  push(p=workerId, svars) -> changed keys     # ~ signal: broadcast changes to the PS
  pull(svars, updates=(push1, ..., pushn))    # ~ collect: aggregate changes from the PS
[figure: scheduler machines coordinating worker machines]
(A toy sketch of this loop follows.)
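One way a schedule/push/pull round could look, in toy form; the function signature, the magnitude-based choice of keys, and the aggregate-by-overwrite step are assumptions made for illustration, not the actual interface of any particular system.

def scheduler_round(params, pending_magnitude, workers, k=2):
    """Toy model-parallel round: for each worker the scheduler picks the k
    keys with the largest pending updates (schedule), the worker computes
    new values for just those keys (push), and the scheduler folds the
    results back into the model (pull)."""
    for worker in workers:
        svars = sorted(pending_magnitude, key=pending_magnitude.get,
                       reverse=True)[:k]           # schedule(): keys this worker will touch
        pushed = worker(params, svars)             # push(): worker returns {key: new value}
        for key, value in pushed.items():          # pull(): aggregate changes into the model
            params[key] = value
            pending_magnitude[key] = 0.0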
Support for model-parallel programs
Similar to signal-collect: schedule() defines the graph and is centrally executed; the workers' pushes of params to the scheduler are distributed; the scheduler's pull aggregates centrally and makes the params available via get() and inc().
A Data Parallel Example

[paper screenshot, ICMR 2017]

Training on matching vs. non-matching pairs, with a margin-based matching loss.
About: Distance metric learning
- Instance: a pair (x1, x2). Label: similar or dissimilar.
- Model: scale x1 and x2 with a matrix L; try to minimize the distance ||L x1 - L x2||^2 for similar pairs and max(0, 1 - ||L x1 - L x2||^2) for dissimilar pairs.
(The following slides use x, y instead of x1, x2. A sketch of this loss follows.)
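A small numpy sketch of the per-pair loss above and its gradient with respect to L, which is what each worker would compute before add()-ing an update; the function name is illustrative, and the gradient uses d/dL ||L d||^2 = 2 (L d) d^T.

import numpy as np

def pair_loss_and_grad(L, x, y, similar):
    """Margin-based metric-learning loss for one pair (sketch).
    Similar pairs:    loss = ||L x - L y||^2
    Dissimilar pairs: loss = max(0, 1 - ||L x - L y||^2)
    Returns (loss, gradient of the loss with respect to L)."""
    d = x - y
    Ld = L @ d
    dist2 = float(Ld @ Ld)
    outer = np.outer(Ld, d)                  # d/dL ||L d||^2 = 2 (L d) d^T
    if similar:
        return dist2, 2.0 * outer
    if dist2 < 1.0:                          # inside the margin: the hinge is active
        return 1.0 - dist2, -2.0 * outer
    return 0.0, np.zeros_like(L)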
Example: Data parallel SGD
[pseudocode slides: each worker loops over its shard, get()s the parameters, computes its gradient, and add()s the update; a worker could also get only the keys it needs]
A Model Parallel Example: Lasso

Regularized logistic regression
Replace the log conditional likelihood LCL with LCL plus a penalty for large weights, e.g. a squared (L2) penalty on the weights; an alternative is the L1 penalty, the sum of the absolute values of the weights.
Regularized logistic regression
[figure: the squared penalty has a shallow gradient near 0, while the L1 penalty has a steep gradient near 0]
L1-regularization pushes parameters to exactly zero: sparse solutions.
SGD
Repeat for t = 1, ..., T:
  For each example:
    Compute the gradient of the regularized loss (for that example).
    Move all parameters in that direction (a little).

Coordinate descent
Repeat for t = 1, ..., T:
  For each parameter j:
    Compute the gradient of the regularized loss (for that parameter j).
    Move that parameter j (a good ways, sometimes to its minimal value relative to the others).

Stochastic coordinate descent
Repeat for t = 1, ..., T:
  Pick a random parameter j.
  Compute the gradient of the regularized loss (for that parameter j).
  Move that parameter j (a good ways, sometimes to its minimal value relative to the others).

Parallel stochastic coordinate descent (shotgun)
Repeat for t = 1, ..., T:
  Pick several coordinates j1, ..., jp in parallel.
  Compute the gradient of the regularized loss (for each parameter jk).
  Move each parameter jk.
Parallel coordinate descent (shotgun)
[figure: shotgun speed-up results]
Shotgun works best when you select uncorrelated parameters to process in parallel. (A sketch follows.)
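A compact sketch of shotgun-style stochastic coordinate descent for the Lasso (squared loss plus an L1 penalty): each round updates p randomly chosen coordinates from the same snapshot of w, mimicking p parallel workers. The function names and the closed-form soft-thresholding update are standard for Lasso but are not taken from the slides.

import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def shotgun_lasso(X, y, lam=0.1, rounds=100, p=4, seed=0):
    """Shotgun-style stochastic coordinate descent for
    (1/2)||y - Xw||^2 + lam*||w||_1 (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                 # ||X[:, j]||^2 for each coordinate
    for _ in range(rounds):
        residual = y - X @ w                      # shared snapshot, as with parallel updates
        for j in rng.choice(d, size=min(p, d), replace=False):
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w[j] = soft_threshold(rho, lam) / (col_sq[j] + 1e-12)
    return w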
Example: Model parallel SGD
Basic ideas:
- Pick parameters stochastically.
- Prefer large parameter values (i.e., ones that haven't converged).
- Prefer nearly-independent parameters.

[pseudocode slides for the model-parallel example]
Case Study: Topic Modeling with LDA

Example: Topic Modeling with LDA
- Word-topic distributions: maintained by the parameter server.
- Local variables (documents and their tokens): maintained by the worker nodes.
Gibbs Sampling for LDA
[figure: an example document, "Oh, The Places You'll Go!" ("You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose."), with one topic assignment z1..z7 per content word; the doc-topic distribution θ_d is local to the document, while the word-topic distributions (Brains, Choose, Direction, Feet, Head, Shoes, Steer) belong to the shared model]
Ex: Collapsed Gibbs Sampler for LDA: partitioning the model and data
[figure: three parameter servers hold the word-topic counts for the vocabulary ranges W 1:10K, W 10K:20K, and W 20K:30K; the documents are partitioned across the workers]
Ex: Collapsed Gibbs Sampler for LDA: get model parameters and compute update
[figure: workers issue get("car"), get("cat"), get("tire"), get("mouse") against the parameter servers holding W 1:10K, W 10K:20K, W 20K:30K]
Ex: Collapsed Gibbs Sampler for LDA: send changes back to the parameter server
[figure: workers issue add("car", δ), add("cat", δ), add("tire", δ), add("mouse", δ) against the corresponding parameter servers]
(A sketch of one such worker update follows.)
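A sketch of a single collapsed-Gibbs token update expressed with the get()/add() interface, reusing the toy key-value store from earlier; the key scheme (word, topic) / ("total", topic), the hyperparameter values, and the function name are assumptions made for illustration.

import numpy as np

def resample_token(ps, doc_topic, word, old_topic, num_topics,
                   alpha=0.1, beta=0.01, vocab_size=30000, rng=None):
    """One collapsed-Gibbs update for a single token (sketch).
    Word-topic counts live on the parameter server (get/add);
    doc-topic counts are local to the worker."""
    rng = rng or np.random.default_rng(0)

    doc_topic[old_topic] -= 1
    ps.add((word, old_topic), -1)            # retract the old assignment
    ps.add(("total", old_topic), -1)

    weights = np.array([
        (doc_topic[k] + alpha)
        * (ps.get((word, k)) + beta)
        / (ps.get(("total", k)) + vocab_size * beta)
        for k in range(num_topics)
    ])
    new_topic = rng.choice(num_topics, p=weights / weights.sum())

    doc_topic[new_topic] += 1
    ps.add((word, new_topic), 1)             # record the new assignment
    ps.add(("total", new_topic), 1)
    return new_topic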
Ex: Collapsed Gibbs Sampler for LDA: adding a caching layer to collect updates
[figure: each worker keeps a parameter cache of just the words it touches (e.g. Car, Dog, Pig, Bat; Cat, Gas, Zoo, VW; Car, Rim, bmw, $$; Mac, iOS, iPod, Cat) and batches its updates to the parameter servers holding W 1:10K, W 10K:20K, W 20K:30K]
Experiment: Topic Model (LDA) [Dai et al 2015]
- Dataset: NYTimes (100M tokens, 100K-word vocabulary, 100 topics); collapsed Gibbs sampling.
- Compute cluster: 8 nodes, each with 64 cores (512 cores total) and 128GB memory.
- [plot: convergence, higher is better] ESSP converges faster and is robust to the staleness s.
LDA Samplers Comparison [Yuan et al 2015]

Big LDA on Parameter Server [Li et al 2014]
Collapsed Gibbs sampler. Size: 50B tokens, 2000 topics, 5M-word vocabulary. 1k-6k nodes.
LDA Scale Comparison

                        YahooLDA         Parameter Server   Tencent Peacock     AliasLDA [4]   PetuumLDA
                        (SparseLDA) [1]  (SparseLDA) [2]    (SparseLDA) [3]                    (LightLDA) [5]
  # of words
  (dataset size)        20M documents    50B                4.5B                100M           200B
  # of topics           1000             2000               100K                1024           1M
  # of vocabularies     est. 100K [2]    5M                 210K                100K           1M
  Time to converge      N/A              20 hrs             6.6 hrs/iteration   2 hrs          60 hrs
  # of machines         400              6000 (60k cores)   500 cores           1 (1 core)     24 (480 cores)
  Machine specs         N/A              10 cores,          N/A                 4 cores,       20 cores,
                                         128GB RAM                              12GB RAM       256GB RAM

[1] Ahmed, Amr, et al. "Scalable inference in latent variable models." WSDM (2012).
[2] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." OSDI (2014).
[3] Wang, Yi, et al. "Towards Topic Modeling for Big Data." arXiv:1405.4402 (2014).
[4] Li, Aaron Q., et al. "Reducing the sampling complexity of topic models." KDD (2014).
[5] Yuan, Jinhui, et al. "LightLDA: Big Topic Models on Modest Compute Clusters." arXiv:1412.1576 (2014).