
Presentation Transcript

Slide 1

(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)

Parameter Servers

Slide 2

Regret analysis for on-line optimization

RECAP

Slide 3

2009 (the "Slow Learners are Fast" paper)

RECAP

Slide 4

Take a gradient step:

x' = x_t − η_t g_t

If you've restricted the parameters to a subspace X (e.g., must be positive, …), find the closest thing in X to x': x_{t+1} = argmin_{x ∈ X} dist(x − x')

But… you might be using a "stale" g (from τ steps ago).

f is the loss function, x is the parameters.

RECAP
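As a concrete illustration of the update above, here is a minimal sketch of one delayed-SGD step with an optional projection onto X (function and variable names are ours, not from the slides):

```python
import numpy as np

def delayed_sgd_step(x, stale_grad, lr, project=None):
    """One step of delayed SGD: x' = x_t - eta_t * g, where g may be tau steps stale."""
    x_new = x - lr * stale_grad
    if project is not None:
        # x_{t+1} = argmin over X of dist(x - x'): project back onto the feasible set X
        x_new = project(x_new)
    return x_new

# Example: X = non-negative parameters (the "must be positive" case)
x = np.array([0.5, -0.2, 1.0])
g = np.array([0.1, -0.3, 0.2])   # gradient, possibly tau steps old
x = delayed_sgd_step(x, g, lr=0.1, project=lambda v: np.maximum(v, 0.0))
```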

Slide 5

Regret: how much loss was incurred during learning, over and above the loss incurred with an optimal choice of x.

Special case: f_t is 1 if a mistake was made, 0 otherwise; f_t(x*) = 0 for the optimal x*; regret = # mistakes made in learning.

RECAP
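For reference, the standard definition of regret over T rounds (a well-known formula, not spelled out in the transcript) is:

```latex
R(T) = \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x^{*} \in X} \sum_{t=1}^{T} f_t(x^{*})
```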

Slide 6

Theorem: you can find a learning rate so that the regret of delayed SGD is bounded in terms of T = # timesteps and τ = staleness > 0.

RECAP

Slide 7

Theorem 8: you can do better if you assume (1) the examples are i.i.d. and (2) the gradients are smooth, analogous to the assumption about L. Then you can show a bound on expected regret in which the no-delay loss is the dominant term.

RECAP

Slide 8

Experiments

RECAP

Slide 9

Summary of "Slow Learners are Fast": a generalization of iterative parameter mixing.

- Run multiple learners in parallel; conceptually they share the same weight/parameter vector, BUT…
- Learners share weights imperfectly: the learners are almost synchronized, and there's a bound τ on how stale the shared weights get.

RECAP

Slide 10

Background: Distributed Coordination Services

Example: Apache ZooKeeper. Distributed processes coordinate through shared "data registers" (aka znodes), which look a bit like a shared in-memory filesystem.

BACKGROUND

Slide 11

Background: Distributed Coordination Services

Client:
  create /w_foo
  set /w_foo "bar"
  get /w_foo  →  "bar"

Better with more reads than writes.
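To make the znode operations concrete, here is a minimal sketch using the kazoo Python client for ZooKeeper (the host address and znode path are illustrative):

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is illustrative)
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# create /w_foo, set it to "bar", then read it back
zk.ensure_path("/w_foo")
zk.set("/w_foo", b"bar")
value, stat = zk.get("/w_foo")   # value == b"bar"

zk.stop()
```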

BACKGROUND

Slide 12

(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)

Parameter Servers

Slide 13

ML Systems: Scalable Machine Learning Algorithms, Abstractions, Scalable Systems.

Slides 14-18: ML Systems Landscape (one diagram, built up over several slides): a "Model" box on top of three families of systems, with rows for Algorithms, Abstractions, and Systems.

Columns:       Graph Systems                      | Dataflow Systems     | Shared Memory Systems
Algorithms:    Graph Algorithms, Graphical Models | Naïve Bayes, Rocchio | SGD, Sampling [NIPS'09, NIPS'13]
Abstractions:  Vertex-Programs [UAI'10]           | PIG, GuineaPig, …    | Parameter Server [VLDB'10]
Systems:       GraphLab, Tensorflow               | Hadoop & Spark       | Bosen, DMTK, ParameterServer.org

Slide 19

ML Systems Landscape, highlighting the shared-memory column: Parameter Server [VLDB'10]; SGD, Sampling [NIPS'09, NIPS'13].

Simple case: the parameters of the ML system are stored in a distributed hash table that is accessible through the network.

Parameter servers are used at Google, Yahoo, …; academic work by Smola, Xing, ….

Slide 20

Parameter Servers Are Flexible

Implemented with Parameter Server

Slide 21

Parameter Server (PS)

Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory).

Extensions: multiple keys (for a matrix); multiple "channels" (for multiple sparse vectors, multiple clients for the same servers, …).

Server Machines / Worker Machines

[Smola et al 2010, Ho et al 2013, Li et al 2014]

Slide 22

Parameter Server (PS)

Extensions:
- push/pull interface to send/receive the most recent copy of (a subset of) the parameters; blocking is optional
- can block until all push/pulls with clock < (t − τ) complete

Server Machines / Worker Machines

[Smola et al 2010, Ho et al 2013, Li et al 2014]

Slide 23

Data parallel learning with PS

Figure: three parameter servers hold disjoint shards of the model (w1-w3, w4-w6, w7-w9); the training data is split across the worker machines ("Split Data Across Machines").

Slide 24

Data parallel learning with PS

Figure: the same setup as the previous slide. Different parts of the model live on different servers; workers retrieve the parts they need, as needed.

Slide 25

Key-Value API for workers:

get(key) → value
add(key, delta)

This is the abstraction used for data-parallel learning with a PS.
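A minimal sketch of this key-value abstraction, with illustrative names (a real PS shards keys across server machines and is accessed over the network):

```python
from collections import defaultdict

class ParamServer:
    """Toy single-process stand-in for the PS key-value interface."""
    def __init__(self):
        self.params = defaultdict(float)

    def get(self, key):
        # get(key) -> value: return the current value of one parameter
        return self.params[key]

    def add(self, key, delta):
        # add(key, delta): apply an additive update (e.g., a gradient step)
        self.params[key] += delta

# Worker-side usage: read a parameter, compute an update, push the delta back
ps = ParamServer()
w = ps.get("w_foo")
ps.add("w_foo", -0.1 * 2.0)   # e.g., -learning_rate * gradient
```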

Slide 26

Map-Reduce

PS vs Hadoop

Slide 27

Iteration in Map-Reduce (IPM)

Figure: the training data flows through Map and Reduce to produce a learned model; starting from an initial model w(0), each iteration produces w(1), w(2), w(3), ….

Slide 28

Cost of Iteration in Map-Reduce

Figure: the same iterative Map-Reduce pipeline. Each iteration re-reads the training data (Read 1, Read 2, Read 3): you repeatedly load the same data.

Slide 29

Cost of Iteration in Map-Reduce

Figure: the same pipeline again. You also redundantly save output between stages.

Slide 30

(slides courtesy of Aurick Qiao, Joseph Gonzalez, Wei Dai, and Jinliang Wei)

Parameter Servers: the Stale Synchronous Parallel Model

Slide 31

Parameter Server (PS)

Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory).

Server Machines / Worker Machines

[Smola et al 2010, Ho et al 2013, Li et al 2014]

Slide 32

Iterative ML Algorithms

Topic Model, matrix factorization, SVM, Deep Neural Network…

Figure: workers read data and update the shared model parameters.

Slide 33

Map-Reduce vs. Parameter Server

                          Map-Reduce                       | Parameter Server
Data Model                Independent Records              | Independent Data
Programming Abstraction   Map & Reduce                     | Key-Value Store (Distributed Shared Memory)
Execution Semantics       Bulk Synchronous Parallel (BSP)  | ?

Slide 34

The Problem: Networks Are Slow!

The network is slow compared to local memory access. We want to explore options for handling this….

Server Machines / Worker Machines; workers call get(key) and add(key, delta).

[Smola et al 2010, Ho et al 2013, Li et al 2014]

Slide 35

Solution 1: Cache Synchronization

Figure: a central server and several workers, each holding a shard of the data and a locally cached copy of the parameters.

Slide 36

Parameter Cache Synchronization

Figure: the workers send sparse changes to the model up to the server.

Slide 37

Parameter Cache Synchronization

Figure: the server sends the synchronized model back to the workers (aka IPM).

Slide 38

Solution 2: Asynchronous Execution

Figure: with barrier-synchronized execution, machines 1-3 compute an iteration, communicate, and wait at a barrier; time spent waiting for stragglers is wasted. Asynchronous execution would enable more frequent coordination on parameter values.

Slide 39

Asynchronous Execution

Figure: machines 1-3 run iteration after iteration against a (logical) parameter server holding w1-w9, with no barriers. [Smola et al 2010]

Slide 40

Asynchronous Execution

Problem: fully asynchronous execution lacks a theoretical guarantee, since a distributed environment can have arbitrary delays from the network and stragglers. But….

Slide 41

Take a gradient step:

x' = x_t − η_t g_t

If you've restricted the parameters to a subspace X (e.g., must be positive, …), find the closest thing in X to x': x_{t+1} = argmin_{x ∈ X} dist(x − x')

But… you might be using a "stale" g (from τ steps ago).

f is the loss function, x is the parameters.

RECAP

Slide 42

Map-Reduce vs. Parameter Server (revisited)

                          Map-Reduce                       | Parameter Server
Data Model                Independent Records              | Independent Data
Programming Abstraction   Map & Reduce                     | Key-Value Store (Distributed Shared Memory)
Execution Semantics       Bulk Synchronous Parallel (BSP)  | Bounded Asynchronous

Slide 43

Parameter Server: Bounded Asynchronous

Stale synchronous parallel (SSP):
- global clock time t
- the parameters workers "get" can be out of date, but can't be older than t − τ (sketched below)
- τ controls the "staleness"
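A minimal sketch of how a server might enforce the bounded-staleness rule; the class, names, and locking scheme are ours, for illustration only:

```python
import threading

class SSPServer:
    """Toy SSP bookkeeping: a get() by a worker at clock t may return stale
    values, but only once every worker has reached clock >= t - tau."""
    def __init__(self, num_workers, tau):
        self.tau = tau
        self.worker_clock = [0] * num_workers
        self.cond = threading.Condition()
        self.params = {}

    def clock(self, worker_id):
        # Worker finished an iteration: advance its clock and wake any waiters.
        with self.cond:
            self.worker_clock[worker_id] += 1
            self.cond.notify_all()

    def get(self, worker_id, key):
        # Block until the slowest worker is within tau clocks of this worker.
        with self.cond:
            t = self.worker_clock[worker_id]
            while min(self.worker_clock) < t - self.tau:
                self.cond.wait()
            return self.params.get(key, 0.0)

    def add(self, key, delta):
        with self.cond:
            self.params[key] = self.params.get(key, 0.0) + delta
```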

Slide 44

Stale Synchronous Parallel (SSP)

SSP interpolates between BSP and fully asynchronous execution, and subsumes both:
- allow workers to usually run at their own pace
- the fastest and slowest threads are not allowed to drift more than s clocks apart
- efficiently implemented: cache parameters

[Ho et al 2013]

Slide 45

Consistency Matters

A suitable delay (SSP) gives a big speed-up. Figure: strong consistency vs. relaxed consistency. [Ho et al 2013]

Slide 46

Stale Synchronous Parallel (SSP)

[Ho et al 2013]

Slide 47

Beyond the PS/SSP Abstraction…

Slide 48

Managed Communications

BSP stalls during communication. SSP is able to overlap communication and computation… but the network can be underused.

Figure: compute timelines for BSP (computation stalls at every barrier sync) and SSP (a worker stalls only if a certain previous sync did not complete). [Wei et al 2015]

Slide 49

Network loads for PS/SSP [Wei et al 2015]

Figure: with the existing approach, each compute phase is followed by a burst of traffic and then an idle network. How can we use network capacity better? Maybe tell the system a little more about the problem we're solving, so it can manage communication better.

Slide 50

Bosen: choosing model partition

[Wei et al 2015]

Figure: two workers, each running an ML app on a client library over its own data partition, and two servers, each holding a model partition.

Parameter Server [Power'10] [Ahmed'12] [Ho'13] [Li'14]: a coherent shared-memory abstraction for the application; let the library worry about consistency, communication, etc.

Slide 51

Ways To Manage Communication [Wei et al 2015]

- Model parameters are not equally important; e.g., the majority of the parameters may converge in a few iterations.
- Communicate the more important parameter values or updates: the magnitude of a change indicates its importance.
- Magnitude-based prioritization strategies; example: relative-magnitude prioritization (see the sketch below).
- We saw many of these ideas in the signal/collect paper.
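A minimal sketch of magnitude-based (relative-magnitude) prioritization of pending updates; the names and the exact priority rule are our reading of the idea, not Bosen's actual implementation:

```python
def prioritize_updates(pending_deltas, params, bandwidth_budget):
    """Send only the highest-priority updates this round.

    pending_deltas:   dict key -> accumulated (unsent) update delta
    params:           dict key -> current local parameter value
    bandwidth_budget: max number of key/delta pairs to send this round
    """
    def priority(key):
        delta, value = pending_deltas[key], params.get(key, 0.0)
        # Relative magnitude: |delta| / |value| (fall back to |delta| near zero)
        return abs(delta) / abs(value) if abs(value) > 1e-12 else abs(delta)

    ranked = sorted(pending_deltas, key=priority, reverse=True)
    return {k: pending_deltas[k] for k in ranked[:bandwidth_budget]}

# Example: only the two most "important" updates are sent this round
pending = {"w1": 0.5, "w2": 0.001, "w3": -0.2}
current = {"w1": 10.0, "w2": 0.002, "w3": 0.1}
print(prioritize_updates(pending, current, bandwidth_budget=2))   # w3 and w2
```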

Slide 52

Or more radically….

Slide 53

Iterative ML Algorithms

Many ML algorithms are iterative-convergent. Examples: optimization and sampling methods; topic models, matrix factorization, SVMs, deep neural networks, ….

Notation: A = params at time t, L = loss, Δ = grad, D = data, F = update.
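In this notation, the generic iterative-convergent update pictured on the slide is usually written as follows (our reconstruction, assuming the standard formulation):

```latex
A^{(t)} = F\left( A^{(t-1)},\ \Delta_L\left( A^{(t-1)}, D \right) \right)
```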

Slide 54

Iterative ML with a Parameter Server: (1) Data Parallel

- Each worker is assigned a data partition.
- Model parameters are shared by the workers.
- Workers read and update the model parameters.

Notation: Δ = grad of L; D_p = data shard p. The per-shard updates are usually added together at this point (this assumes the data are i.i.d.). A good fit for the PS/SSP abstraction; workers often add updates locally first (~ a combiner), as sketched below.
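A minimal sketch of a data-parallel worker loop with local accumulation before pushing to the PS, reusing the toy get/add interface sketched earlier (function and variable names are illustrative; compute_grad stands in for the application-specific gradient):

```python
def run_worker(ps, shard, num_iters, lr, compute_grad):
    """Data-parallel worker loop: read parameters via get(), compute gradients
    on the local data shard, accumulate the updates locally (~ a combiner),
    then push the summed deltas via add()."""
    for _ in range(num_iters):
        local_delta = {}
        for example in shard:                          # example: dict of feature -> value
            params = {k: ps.get(k) for k in example}   # read only the parameters we need
            grad = compute_grad(params, example)       # dict keyed like params
            for k, g in grad.items():
                local_delta[k] = local_delta.get(k, 0.0) - lr * g
        for k, d in local_delta.items():               # one add() per touched key per iteration
            ps.add(k, d)
```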

Slide 55

(2) Model Parallel

S_p is a scheduler for processor p that selects the params for p (ignoring D as well as L). It is not clear how this fits with the PS/SSP abstraction…

Slide 56

Optional scheduling interface (Parameter Server Scheduling: scheduler machines plus worker machines):

- schedule(key) → param keys svars: which params the worker will access (largest updates, a partition of the graph, …)
- push(p=workerId, svars) → changed keys (~ signal: broadcast changes to the PS)
- pull(svars, updates=(push1, …, pushn)) (~ collect: aggregate changes from the PS)

A sketch of one scheduler/worker round follows.
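A minimal sketch of how a scheduler/worker round might use this interface; the class and the parameter-selection rule (pick the keys with the largest recent updates) are illustrative only:

```python
class Scheduler:
    """Toy central scheduler for model-parallel execution over a PS
    exposing get()/add()."""
    def __init__(self, ps, all_keys, per_worker):
        self.ps = ps
        self.all_keys = list(all_keys)
        self.per_worker = per_worker
        self.last_update = {k: float("inf") for k in self.all_keys}  # force an initial visit

    def schedule(self, worker_id):
        # Select the params this worker will work on: here, the largest recent updates.
        ranked = sorted(self.all_keys, key=lambda k: self.last_update[k], reverse=True)
        return ranked[:self.per_worker]

    def pull(self, pushes):
        # ~ collect: aggregate the changes reported by workers and apply them to the PS.
        for worker_id, deltas in pushes:
            for k, d in deltas.items():
                self.ps.add(k, d)
                self.last_update[k] = abs(d)

def worker_push(worker_id, svars, ps, compute_delta):
    # ~ signal: compute changes for the assigned keys and report them to the scheduler.
    return worker_id, {k: compute_delta(k, ps.get(k)) for k in svars}
```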

Slide 57

Support for model-parallel programs

Slide 58

Similar to signal-collect: schedule() defines the graph (centrally executed); workers push params to the scheduler (distributed); the scheduler pulls to aggregate (centrally executed) and makes the params available via get() and inc().

Slide 59

A Data Parallel Example

Slide 60

ICMR 2017

Slide 61

Training on matching vs non-matching pairs

margin-based matching loss

Slide 62

About: Distance metric learning

Instance: pairs (x1, x2). Label: similar or dissimilar. Model: scale x1 and x2 with a matrix L; try to minimize the distance ||L x1 − L x2||^2 for similar pairs and max(0, 1 − ||L x1 − L x2||^2) for dissimilar pairs. (Later slides use x, y instead of x1, x2.)
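A minimal sketch of this margin-based loss for a single pair, using numpy (the names are ours):

```python
import numpy as np

def pair_loss(L, x1, x2, similar):
    """Distance-metric-learning loss for one pair:
    ||L x1 - L x2||^2 if the pair is similar,
    max(0, 1 - ||L x1 - L x2||^2) if it is dissimilar."""
    d2 = np.sum((L @ x1 - L @ x2) ** 2)
    return d2 if similar else max(0.0, 1.0 - d2)

# Example: a 2x3 projection matrix L and a pair of 3-d points
L = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, -0.5]])
x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([1.5, 1.0, 0.2])
print(pair_loss(L, x1, x2, similar=True), pair_loss(L, x1, x2, similar=False))
```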

Slide 63

Example: Data parallel SGD

(Could also get only the keys I need.)

Slide 64

Slide 65

A Model Parallel Example: Lasso

Slide 66

Regularized logistic regression

Replace the log conditional likelihood LCL with LCL plus a penalty for large weights; an alternative penalty is also shown on the slide (see the reconstruction below).
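The penalty formulas themselves did not survive extraction; based on the next slide's description (shallow vs. steep gradient near 0), they are presumably the standard squared (L2) penalty and the absolute-value (L1) alternative, e.g. for a regularization weight μ:

```latex
LCL - \mu \sum_j w_j^2 \quad \text{(L2 penalty)}
\qquad\qquad
LCL - \mu \sum_j |w_j| \quad \text{(L1 penalty, the alternative)}
```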

Slide 67

Regularized logistic regression

Figure: the first penalty has a shallow gradient near 0, while the alternative has a steep gradient near 0; L1-regularization pushes parameters to zero, giving sparse solutions.

Slide 68

SGD

Repeat for t = 1, …, T:
  For each example:
    Compute the gradient of the regularized loss (for that example).
    Move all parameters in that direction (a little).

Slide 69

Coordinate descent

Repeat for t = 1, …, T:
  For each parameter j:
    Compute the gradient of the regularized loss (for that parameter j).
    Move that parameter j (a good ways, sometimes to its minimal value relative to the others).

Slide 70

Stochastic coordinate descent

Repeat for t = 1, …, T:
  Pick a random parameter j.
  Compute the gradient of the regularized loss (for that parameter j).
  Move that parameter j (a good ways, sometimes to its minimal value relative to the others).

Slide 71

Parallel stochastic coordinate descent (shotgun)

Repeat for t = 1, …, T:
  Pick several coordinates j1, …, jp in parallel.
  Compute the gradient of the regularized loss (for each parameter jk).
  Move each parameter jk.

(A sketch of the parallel structure follows.)
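A minimal sketch of shotgun-style parallel stochastic coordinate descent. For brevity it uses a smooth least-squares objective (omitting the L1 penalty of the Lasso example) and a fixed step size; the names and constants are illustrative:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_grad(X, y, w, j):
    # d/dw_j of ||Xw - y||^2  =  2 * x_j^T (Xw - y)
    return 2.0 * X[:, j] @ (X @ w - y)

def shotgun(X, y, T=100, p=4, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    with ThreadPoolExecutor(max_workers=p) as pool:
        for _ in range(T):
            coords = rng.choice(X.shape[1], size=p, replace=False)   # pick p coordinates
            grads = list(pool.map(lambda j: partial_grad(X, y, w, j), coords))
            for j, g in zip(coords, grads):                          # move each chosen coordinate
                w[j] -= lr * g
    return w

X = np.random.default_rng(1).normal(size=(50, 10))
y = X @ np.arange(10) + 0.1
print(shotgun(X, y)[:3])
```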

Slide 72

Parallel coordinate descent (shotgun)

Slide 73

Parallel coordinate descent (shotgun)

Shotgun works best when you select uncorrelated parameters to process in parallel.

Slide 74

Example: Model parallel SGD

Basic ideas:
- Pick parameters stochastically.
- Prefer large parameter values (i.e., ones that haven't converged).
- Prefer nearly-independent parameters.

Slide 75

Slide 76

Slide 77

Case Study: Topic Modeling with LDA

Slide 78

Example: Topic Modeling with LDA

Figure: the word-topic distributions are maintained by the parameter server; the local variables (the documents and their tokens) are maintained by the worker nodes.

Slide 79

Gibbs Sampling for LDA

Figure: the example document, titled "Oh, The Places You'll Go!" ("You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose."), with topic assignments z1, …, z7 for its content words; a doc-topic distribution θ_d for the document; and word-topic distributions for the words Brains, Choose, Direction, Feet, Head, Shoes, Steer.

Slide 80

Ex: Collapsed Gibbs Sampler for LDA (partitioning the model and data)

Figure: three parameter servers hold the word-topic counts for disjoint vocabulary ranges (words 1:10K, 10K:20K, 20K:30K).

Slide 81

Ex: Collapsed Gibbs Sampler for LDA (get model parameters and compute the update)

Figure: the workers call get("car"), get("cat"), get("tire"), get("mouse") to fetch the word-topic counts they need; the servers again hold words 1:10K, 10K:20K, 20K:30K.

Slide 82

Ex: Collapsed Gibbs Sampler for LDA (send changes back to the parameter server)

Figure: the workers call add("car", δ), add("cat", δ), add("tire", δ), add("mouse", δ) to push their count deltas back to the servers.
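A minimal sketch of a worker's inner loop for the collapsed Gibbs sampler, phrased against the toy get/add interface sketched earlier; the key layout and hyperparameters are illustrative:

```python
import numpy as np

def resample_token(ps, doc_topic, word, old_z, alpha=0.1, beta=0.01, K=100, V=30000):
    """Resample one token's topic with a collapsed-Gibbs step.
    Word-topic counts live on the PS under keys ("wt", word, k) and ("t", k);
    doc_topic is the local doc-topic count vector (numpy array) for this document."""
    # Remove the token's current assignment
    doc_topic[old_z] -= 1
    ps.add(("wt", word, old_z), -1)
    ps.add(("t", old_z), -1)

    # Collapsed-Gibbs conditional: p(z = k) ∝ (n_dk + α)(n_wk + β) / (n_k + Vβ)
    n_wk = np.array([ps.get(("wt", word, k)) for k in range(K)])
    n_k = np.array([ps.get(("t", k)) for k in range(K)])
    p = (doc_topic + alpha) * (n_wk + beta) / (n_k + V * beta)
    new_z = np.random.choice(K, p=p / p.sum())

    # Add the new assignment back, pushing the deltas to the PS
    doc_topic[new_z] += 1
    ps.add(("wt", word, new_z), 1)
    ps.add(("t", new_z), 1)
    return new_z
```

In a real system these add() calls would be batched through a local parameter cache, which is exactly what the next slide adds.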

Slide 83

Ex: Collapsed Gibbs Sampler for LDA (adding a caching layer to collect updates)

Figure: each worker keeps a parameter cache (e.g., {Car, Dog, Pig, Bat}, {Cat, Gas, Zoo, VW}, {Car, Rim, bmw, $$}, {Mac, iOS, iPod, Cat}) that collects updates before they are sent to the servers holding words 1:10K, 10K:20K, 20K:30K.

Slide 84

Experiment: Topic Model (LDA)

Dataset: NYTimes (100M tokens, 100K-word vocabulary, 100 topics), collapsed Gibbs sampling.
Compute cluster: 8 nodes, each with 64 cores (512 cores total) and 128 GB memory.
ESSP converges faster and is robust to the staleness s. (Higher is better in the plot.) [Dai et al 2015]

Slide 85

LDA Samplers Comparison

[Yuan et al 2015]

Slide 86

Big LDA on Parameter Server

[Li et al 2014]

Collapsed Gibbs sampler. Size: 50B tokens, 2000 topics, 5M-word vocabulary; 1k~6k nodes.

Slide 87

LDA Scale Comparison

                       | YahooLDA (SparseLDA) [1] | Parameter Server (SparseLDA) [2] | Tencent Peacock (SparseLDA) [3] | AliasLDA [4]      | PetuumLDA (LightLDA) [5]
# of words (dataset)   | 20M documents            | 50B                              | 4.5B                            | 100M              | 200B
# of topics            | 1000                     | 2000                             | 100K                            | 1024              | 1M
# of vocabularies      | est. 100K [2]            | 5M                               | 210K                            | 100K              | 1M
Time to converge       | N/A                      | 20 hrs                           | 6.6 hrs/iteration               | 2 hrs             | 60 hrs
# of machines          | 400                      | 6000 (60k cores)                 | 500 cores                       | 1 (1 core)        | 24 (480 cores)
Machine specs          | N/A                      | 10 cores, 128GB RAM              | N/A                             | 4 cores, 12GB RAM | 20 cores, 256GB RAM

[1] Ahmed, Amr, et al. "Scalable inference in latent variable models." WSDM (2012).
[2] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." OSDI (2014).
[3] Wang, Yi, et al. "Towards Topic Modeling for Big Data." arXiv:1405.4402 (2014).
[4] Li, Aaron Q., et al. "Reducing the sampling complexity of topic models." KDD (2014).
[5] Yuan, Jinhui, et al. "LightLDA: Big Topic Models on Modest Compute Clusters." arXiv:1412.1576 (2014).