Presentation Transcript

Slide 1

A New Parallel Framework for Machine Learning

Joseph Gonzalez

Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Alex Smola, Guy Blelloch, Joe Hellerstein, and David O’Hallaron.

Slide 2

In ML we face BIG problems:

48 hours of video uploaded to YouTube every minute
24 million Wikipedia pages
750 million Facebook users
6 billion Flickr photos

Slide 3

Parallelism: Hope for the Future

Wide array of different parallel architectures: GPUs, multicore, clusters, mini clouds, clouds.

New challenges for designing machine learning algorithms:
Race conditions and deadlocks
Managing distributed model state

New challenges for implementing machine learning algorithms:
Parallel debugging and profiling
Hardware-specific APIs

Slide 4

How will we design and implement parallel learning systems?

Slide 5

We could use ... threads, locks, and messages: the "low-level parallel primitives."

Slide 6

Threads, Locks, and Messages

ML experts (graduate students) repeatedly solve the same parallel design challenges:
Implement and debug a complex parallel system
Tune it for a specific parallel platform
Two months later the conference paper contains: "We implemented ______ in parallel."

The resulting code:
is difficult to maintain
is difficult to extend
couples the learning model to the parallel implementation

Slide 7

... a better answer: Map-Reduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions.

Slide 8

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: data records partitioned across CPU 1 through CPU 4, each processed independently.]

Slide 9

MapReduce – Map Phase

Each CPU independently computes image features for its share of the data.

[Figure: CPU 1 through CPU 4 each emitting image features for their records.]

Slide 10

MapReduce – Map Phase

Embarrassingly parallel, independent computation. No communication needed.

[Figure: CPU 1 through CPU 4 continue processing records independently.]

Slide 11

MapReduce – Reduce Phase

[Figure: CPU 1 and CPU 2 aggregate the per-image features into "attractive face" statistics and "ugly face" statistics.]
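A minimal sketch of the data-parallel pattern these slides illustrate, under assumptions not in the original (the Image and Stats types and the extract_feature and combine helpers are hypothetical placeholders): each worker maps over its own chunk with no communication, and a final reduce merges the per-worker statistics.

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

struct Image { /* pixel data omitted */ };
struct Stats { double sum = 0; long count = 0; };

// Hypothetical map step: compute one feature value per image.
double extract_feature(const Image&) { return 1.0; /* placeholder */ }

// Reduce step: merge two partial statistics.
Stats combine(Stats a, const Stats& b) { a.sum += b.sum; a.count += b.count; return a; }

// Map phase: each worker processes its own chunk, no communication.
// Reduce phase: the partial statistics are merged at the end.
Stats map_reduce(const std::vector<Image>& images, int num_workers) {
    std::vector<std::future<Stats>> parts;
    const std::size_t chunk = (images.size() + num_workers - 1) / num_workers;
    for (int w = 0; w < num_workers; ++w) {
        parts.push_back(std::async(std::launch::async, [&images, chunk, w] {
            Stats s;
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(images.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                s.sum += extract_feature(images[i]);
                ++s.count;
            }
            return s;
        }));
    }
    Stats total;
    for (auto& f : parts) total = combine(total, f.get());
    return total;
}

int main() {
    std::vector<Image> images(1000);
    Stats s = map_reduce(images, 4);
    return s.count == 1000 ? 0 : 1;
}

Because the per-record work is independent, this scales trivially; the next slides ask what happens when it is not.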

Slide 12

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel: Feature Extraction, Cross Validation, Computing Sufficient Statistics (Map Reduce)

Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso

Is there more to Machine Learning?

Slide 13

Concrete Example: Label Propagation

Slide 14

Label Propagation Algorithm

Social arithmetic (estimating what I like from my profile and my friends):

50% what I list on my profile: 50% Cameras, 50% Biking
40% what Sue Ann likes:        80% Cameras, 20% Biking
10% what Carlos likes:         30% Cameras, 70% Biking

I like: 60% Cameras, 40% Biking

Recurrence algorithm: iterate until convergence.

Parallelism: compute all Likes[i] in parallel.
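A minimal sketch of this social-arithmetic step, assuming a hypothetical Interests struct; it simply reproduces the weighted average shown on the slide.

#include <cstdio>

struct Interests { double cameras, biking; };

// Weighted combination of my own profile and my neighbors' interests.
Interests blend(double w_me, Interests me,
                double w_sue, Interests sue_ann,
                double w_carlos, Interests carlos) {
    return { w_me * me.cameras + w_sue * sue_ann.cameras + w_carlos * carlos.cameras,
             w_me * me.biking  + w_sue * sue_ann.biking  + w_carlos * carlos.biking };
}

int main() {
    Interests me      {0.50, 0.50};   // what I list on my profile
    Interests sue_ann {0.80, 0.20};
    Interests carlos  {0.30, 0.70};
    Interests likes = blend(0.50, me, 0.40, sue_ann, 0.10, carlos);
    std::printf("%.0f%% Cameras, %.0f%% Biking\n",
                100 * likes.cameras, 100 * likes.biking);   // prints 60% Cameras, 40% Biking
    return 0;
}

The check: 0.5*0.50 + 0.4*0.80 + 0.1*0.30 = 0.60 for Cameras, and 0.5*0.50 + 0.4*0.20 + 0.1*0.70 = 0.40 for Biking.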

Slide 15

Properties of Graph-Parallel Algorithms

Dependency graph: what I like depends on what my friends like.
Factored computation over the graph.
Iterative computation, repeated until convergence.

Slide 16

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel: Feature Extraction, Cross Validation, Computing Sufficient Statistics (Map Reduce)

Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso (Map Reduce?)

Slide 17

Why not use Map-Reduce for Graph-Parallel Algorithms?

Slide 18

Data Dependencies

Map-Reduce does not efficiently express dependent data:
The user must code substantial data transformations.
Costly data replication.

[Figure: Map-Reduce assumes independent data rows.]

Slide 19

Iterative Algorithms

Map-Reduce does not efficiently express iterative algorithms.

[Figure: repeated Map-Reduce passes over the data on CPU 1-3, separated by a barrier after every iteration; a single slow processor stalls each barrier.]

Slide 20

MapAbuse: Iterative MapReduce

Only a subset of the data needs computation in each iteration.

[Figure: the same iterative barrier structure, even though most of the data is unchanged between iterations.]

Slide 21

MapAbuse: Iterative MapReduce

The system is not optimized for iteration.

[Figure: each iteration pays a startup penalty and a disk penalty before the barrier.]

Slide 22

Map-Reduce for Data-Parallel ML

Excellent for large data-parallel tasks!

Data-Parallel: Feature Extraction, Cross Validation, Computing Sufficient Statistics (Map Reduce)

Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso (Map Reduce? Pregel/Giraph?)

Slide 23

Pregel (Giraph)

Bulk Synchronous Parallel model: compute, communicate, barrier.

Slide 24

Problem with Bulk Synchronous

Example algorithm: if a neighbor is red, then turn red.

Bulk synchronous computation: evaluate the condition on all vertices in every phase. 4 phases, each with 9 computations: 36 computations.

Asynchronous (wave-front) computation: evaluate the condition only when a neighbor changes. 4 phases, each with 2 computations: 8 computations.

[Figure: the red wave-front advancing across the graph from Time 0 to Time 4.]

Slide 25

Problem: Bulk synchronous computation can be highly inefficient.

Example: Loopy Belief Propagation

Slide 26

Loopy Belief Propagation (Loopy BP)

Iteratively estimate the "beliefs" about vertices:
Read in messages
Update the marginal estimate (belief)
Send updated out messages

Repeat for all variables until convergence.
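For reference, a worked-equation sketch of these steps in standard sum-product form on a pairwise model; the slide gives no formulas, so the notation (node potentials phi_i, edge potentials psi_ij, messages m, beliefs b) is assumed. The belief combines the local factor with all incoming messages, and each outgoing message marginalizes out the sender's variable:

$$ b_i(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i) $$

$$ m_{i \to j}(x_j) \propto \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i) $$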

Slide 27

Bulk Synchronous Loopy BP

Often considered embarrassingly parallel:
Associate a processor with each vertex
Receive all messages
Update all beliefs
Send all messages

Proposed by: Brunton et al. CRV'06; Mendiburu et al. GECC'07; Kang et al. LDMTA'10; ...

Slide 28

Sequential Computational Structure

Slide 29

Hidden Sequential Structure

Slide 30

Hidden Sequential Structure

[Figure: a chain with evidence at both ends; information must propagate across the chain.]

Running time = (time for a single parallel iteration) x (number of iterations).

Slide 31

Optimal Sequential Algorithm

Running time on a chain of n vertices with p processors:

Forward-Backward (sequential, p = 1):  2n
Bulk Synchronous (p <= 2n):            2n^2 / p
Optimal Parallel (p = 2):              n

[Figure: running-time curves showing the gap between bulk synchronous and the optimal schedule.]
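A back-of-envelope reading of those numbers, using the slide's notation (n chain vertices, p processors): information needs roughly 2n synchronous iterations to cross the chain in both directions, and each iteration updates n vertices at a cost of about n/p, hence

$$ T_{\text{bulk sync}} \approx 2n \cdot \frac{n}{p} = \frac{2n^2}{p}, \qquad p \le 2n, $$

compared with 2n for the sequential forward-backward pass and n for the optimal two-processor schedule.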

Slide 32

The Splash Operation

Generalize the optimal chain algorithm to arbitrary cyclic graphs:
Grow a BFS spanning tree of fixed size
Forward pass computing all messages at each vertex
Backward pass computing all messages at each vertex
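A minimal sketch of the splash ordering under stated assumptions: the graph is a plain adjacency list, update_vertex is a hypothetical stand-in for recomputing a vertex's outgoing messages, and since the slide does not pin down the orientation, "forward" here is root-outward BFS order and "backward" is its reverse.

#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

using VertexId = std::size_t;
using Graph = std::vector<std::vector<VertexId>>;   // adjacency list

// Hypothetical stand-in for recomputing a vertex's outgoing BP messages.
void update_vertex(const Graph&, VertexId) { /* message update omitted */ }

// Grow a BFS tree of at most max_size vertices rooted at root, then run a
// forward pass and a backward pass over the visitation order.
void splash(const Graph& g, VertexId root, std::size_t max_size) {
    std::vector<VertexId> order;
    std::unordered_set<VertexId> visited{root};
    std::deque<VertexId> frontier{root};
    while (!frontier.empty() && order.size() < max_size) {
        VertexId v = frontier.front();
        frontier.pop_front();
        order.push_back(v);
        for (VertexId u : g[v])
            if (visited.insert(u).second) frontier.push_back(u);
    }
    for (VertexId v : order) update_vertex(g, v);               // forward pass
    for (auto it = order.rbegin(); it != order.rend(); ++it)    // backward pass
        update_vertex(g, *it);
}

int main() {
    Graph chain = {{1}, {0, 2}, {1, 3}, {2}};   // a 4-vertex chain
    splash(chain, 0, 3);                        // a splash covering at most 3 vertices
    return 0;
}

On a chain this degenerates to the forward-backward sweep of Slide 31; on a cyclic graph it applies the same two-pass structure to a bounded local tree.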

Slide 33

Data-Parallel Algorithms Can Be Inefficient

The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.

[Figure: runtime comparison of optimized in-memory bulk synchronous BP with asynchronous Splash BP.]

Slide 34

The Need for a New Abstraction

Map-Reduce is not well suited for graph-parallelism.

Data-Parallel: Feature Extraction, Cross Validation, Computing Sufficient Statistics (Map Reduce)

Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso (Pregel/Giraph)

Slide 35

What is GraphLab?

Slide 36

The GraphLab Framework

Graph-based data representation
Update functions (user computation)
Scheduler
Consistency model

Slide 37

Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge.

Graph: social network
Vertex data: user profile text, current interest estimates
Edge data: similarity weights

Slide 38

Implementing the Data Graph

Multicore setting (in memory): relatively straightforward.
vertex_data(vid) -> data
edge_data(vid, vid) -> data
neighbors(vid) -> vid_list
Challenge: fast lookup, low overhead.
Solution: dense data structures, fixed Vdata and Edata types, immutable graph structure.

Cluster setting (in memory): partition the graph with ParMETIS or random cuts, with cached ghosting of boundary vertices.

[Figure: a graph cut across Node 1 and Node 2, with ghost copies of the boundary vertices A-D on each node.]
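A minimal sketch of the multicore layout described above, under the stated assumptions (dense arrays, fixed vertex and edge data types, immutable structure). The accessor names mirror the slide, but the types and CSR-style storage are illustrative, not the actual GraphLab API.

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using VertexId = std::size_t;

struct VertexData { std::string profile_text; double interest_estimate = 0; };
struct EdgeData   { double similarity = 0; };

// Immutable graph with dense (CSR-style) storage: fast lookup, low overhead.
class DataGraph {
public:
    DataGraph(std::vector<VertexData> vdata,
              const std::vector<std::vector<std::pair<VertexId, EdgeData>>>& adj)
        : vdata_(std::move(vdata)) {
        offsets_.push_back(0);
        for (const auto& nbrs : adj) {
            for (const auto& [v, e] : nbrs) { targets_.push_back(v); edata_.push_back(e); }
            offsets_.push_back(targets_.size());
        }
    }
    // vertex_data(vid) -> data
    VertexData& vertex_data(VertexId v) { return vdata_[v]; }
    // neighbors(vid) -> contiguous slice [first, last) of neighbor ids
    std::pair<const VertexId*, const VertexId*> neighbors(VertexId v) const {
        const VertexId* base = targets_.data();
        return { base + offsets_[v], base + offsets_[v + 1] };
    }
    // edge_data(u, v): linear scan over u's (typically short) neighbor list
    EdgeData* edge_data(VertexId u, VertexId v) {
        for (std::size_t i = offsets_[u]; i < offsets_[u + 1]; ++i)
            if (targets_[i] == v) return &edata_[i];
        return nullptr;
    }
private:
    std::vector<VertexData>  vdata_;
    std::vector<std::size_t> offsets_;   // CSR row offsets
    std::vector<VertexId>    targets_;   // CSR column indices
    std::vector<EdgeData>    edata_;     // per-edge data, aligned with targets_
};

Keeping the structure immutable is what makes the dense layout safe to share across threads without locking the topology itself.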

Slide 39

The GraphLab Framework

Graph-based data representation
Update functions (user computation)
Scheduler
Consistency model

Slide 40

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], Wij, Likes[j]) <- scope;

  // Update the vertex data
  Likes[i] <- sum over neighbors j of (Wij * Likes[j]);

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}

Slide 41

The GraphLab Framework

Graph-based data representation
Update functions (user computation)
Scheduler
Consistency model

Slide 42

The Scheduler

The scheduler determines the order in which vertices are updated. CPUs pull vertices from the scheduler and apply update functions to them, and those updates may in turn schedule new vertices. The process repeats until the scheduler is empty.

[Figure: CPU 1 and CPU 2 pulling vertices (a, b, c, ...) from a shared scheduler queue over the data graph.]
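A minimal single-threaded sketch of that loop, under stated assumptions: a FIFO worklist of vertex ids and a hypothetical update callback that returns the vertices it wants rescheduled. The real schedulers are parallel and come in several orders (next slide); this only shows the pull / update / reschedule cycle.

#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

using VertexId = std::size_t;

// The update function returns the vertices it wants rescheduled.
using UpdateFn = std::function<std::vector<VertexId>(VertexId)>;

void run_scheduler(std::deque<VertexId> worklist, const UpdateFn& update) {
    while (!worklist.empty()) {            // repeat until the scheduler is empty
        VertexId v = worklist.front();     // scheduler picks the next vertex (FIFO here)
        worklist.pop_front();
        for (VertexId u : update(v))       // the update may reschedule neighbors
            worklist.push_back(u);
    }
}

Seeding the worklist with every vertex and having the update return only the neighbors whose data actually changed reproduces the dynamic, asynchronous behavior contrasted with bulk synchronous execution on Slide 24.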

Slide 43

Choosing a Schedule

GraphLab provides several different schedulers:
Round Robin: vertices are updated in a fixed order
FIFO: vertices are updated in the order they are added
Priority: vertices are updated in priority order

The choice of schedule affects the correctness and parallel performance of the algorithm.

Obtain different algorithms by simply changing a flag:
--scheduler=roundrobin
--scheduler=fifo
--scheduler=priority

Slide 44

Implementing the Schedulers

Multicore setting: challenging!
Fine-grained locking
Atomic operations
Approximate FIFO/priority ordering
Random placement
Work stealing

Cluster setting:
Multicore scheduler on each node
Schedules only "local" vertices
Exchange update functions between nodes

[Figure: per-CPU queues within each node; update functions f(v1), f(v2) for remote vertices v1, v2 are exchanged between Node 1 and Node 2.]

Slide 45

The GraphLab Framework

Graph-based data representation
Update functions (user computation)
Scheduler
Consistency model

Slide 46

Ensuring Race-Free Code

How much can computation overlap?

Slide 47

Importance of Consistency

Many algorithms require strict consistency, or perform significantly better under strict consistency.

Example: Alternating Least Squares.

Slide 48

Importance of Consistency

Machine learning algorithms require "model debugging": build, test, debug, tweak the model, and repeat.

Slide 49

GraphLab Ensures Sequential Consistency

For each parallel execution, there exists a sequential execution of update functions which produces the same result.

[Figure: a parallel execution on CPU 1 and CPU 2 and an equivalent sequential execution on a single CPU, laid out over time.]

Slide 50

Common Problem: Write-Write Race

Processors running adjacent update functions simultaneously modify shared data.

[Figure: CPU 1 and CPU 2 both write to the shared data; only one of the written values survives as the final value.]

Slide 51

Consistency Rules

Guaranteed sequential consistency for all update functions.

[Figure: the data an update function may touch within its scope.]

Slide 52

Full Consistency

[Figure: the full-consistency scope: exclusive access to the vertex, its adjacent edges, and its neighboring vertices.]

Slide 53

Obtaining More Parallelism

[Figure: relaxing from full consistency to edge consistency allows more updates to run concurrently.]

Slide 54

Edge Consistency

[Figure: under edge consistency, CPU 1 and CPU 2 can safely run updates on two vertices that only share a neighbor, since the shared neighbor's data is only read.]

Slide 55

Consistency Through R/W Locks

Read/write locks, acquired in a canonical lock ordering:

Full consistency: write-lock the center vertex and all of its neighbors.
Edge consistency: write-lock the center vertex, read-lock its neighbors.
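A minimal sketch of edge consistency via reader/writer locks, under stated assumptions: one std::shared_mutex per vertex, a neighbor list that excludes the center and has no duplicates, and a canonical (ascending vertex id) acquisition order so overlapping scopes cannot deadlock. Full consistency would simply take unique locks on the neighbors as well.

#include <algorithm>
#include <cstddef>
#include <shared_mutex>
#include <vector>

using VertexId = std::size_t;

// One reader/writer lock per vertex.
struct LockManager {
    std::vector<std::shared_mutex> locks;
    explicit LockManager(std::size_t n) : locks(n) {}
};

// Acquire the edge-consistency scope of `center`: a write lock on the center
// vertex and read locks on its neighbors, all taken in ascending id order.
template <typename Update>
void edge_consistent_update(LockManager& lm, VertexId center,
                            std::vector<VertexId> neighbors, Update&& update) {
    std::vector<VertexId> order = std::move(neighbors);
    order.push_back(center);
    std::sort(order.begin(), order.end());            // canonical lock ordering
    for (VertexId v : order) {
        if (v == center) lm.locks[v].lock();           // write lock on the center
        else             lm.locks[v].lock_shared();    // read lock on each neighbor
    }
    update(center);                                    // run the update function
    for (VertexId v : order) {
        if (v == center) lm.locks[v].unlock();
        else             lm.locks[v].unlock_shared();
    }
}

The sorted acquisition order is what makes the scheme deadlock-free: any two overlapping scopes contend for their shared vertices in the same sequence.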

Slide 56

Consistency Through R/W Locks

Multicore setting: pthread R/W locks.

Distributed setting: distributed locking with prefetching of locks and data, allowing computation to proceed while locks and data are still being requested.

[Figure: the data graph partitioned across Node 1 and Node 2, each with a lock pipeline.]

Slide 57

Consistency Through Scheduling

Edge consistency model: two vertices can be updated simultaneously if they do not share an edge.

Graph coloring: two vertices can be assigned the same color if they do not share an edge.

Execute the colors in phases: update all vertices of one color in parallel, then barrier, then move on to the next color (Phase 1, barrier, Phase 2, barrier, Phase 3, ...).
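A minimal sketch of that idea under stated assumptions: a greedy coloring over an adjacency list, then the color classes executed one phase at a time. The parallel-for inside each phase is left as a comment, and update_vertex is a hypothetical per-vertex update.

#include <algorithm>
#include <cstddef>
#include <vector>

using VertexId = std::size_t;
using Graph = std::vector<std::vector<VertexId>>;   // adjacency list

// Greedy coloring: give each vertex the smallest color not used by a neighbor.
std::vector<int> greedy_color(const Graph& g) {
    std::vector<int> color(g.size(), -1);
    for (VertexId v = 0; v < g.size(); ++v) {
        std::vector<bool> used(g.size(), false);
        for (VertexId u : g[v])
            if (color[u] >= 0) used[color[u]] = true;
        int c = 0;
        while (used[c]) ++c;
        color[v] = c;
    }
    return color;
}

void update_vertex(VertexId) { /* hypothetical per-vertex update */ }

// Vertices of the same color share no edge, so each phase may run its updates
// in parallel under the edge consistency model; a barrier separates phases.
void run_colored_phases(const Graph& g) {
    const std::vector<int> color = greedy_color(g);
    int num_colors = 0;
    for (int c : color) num_colors = std::max(num_colors, c + 1);
    for (int phase = 0; phase < num_colors; ++phase) {
        // parallel for (e.g., OpenMP or a thread pool) over this color class
        for (VertexId v = 0; v < g.size(); ++v)
            if (color[v] == phase) update_vertex(v);
        // barrier here before the next phase
    }
}

This trades the fine-grained locking of the previous slides for a coarser, barrier-separated schedule that needs no per-vertex locks at all.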

Slide 58

The GraphLab Framework

Graph-based data representation
Update functions (user computation)
Scheduler
Consistency model

Slide 59

Algorithms Implemented

PageRank
Loopy Belief Propagation
Gibbs Sampling
CoEM
Graphical Model Parameter Learning
Probabilistic Matrix/Tensor Factorization
Alternating Least Squares
Lasso with Sparse Features
Support Vector Machines with Sparse Features
Label Propagation
...

Slide 60

Shared Memory Experiments

Shared memory setting: 16-core workstation.

Slide 61

Loopy Belief Propagation

3D retinal image denoising.

Data graph: 1 million vertices, 3 million edges
Update function: loopy BP update equation
Scheduler: approximate priority
Consistency model: edge consistency

Slide 62

Loopy Belief Propagation

[Figure: speedup versus number of cores; SplashBP achieves a 15.5x speedup (closer to optimal is better).]

Slide 63

CoEM (Rosie Jones, 2005)

Named entity recognition task: is "Dog" an animal? Is "Catalina" a place?

[Figure: a bipartite graph between noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").]

Graph size: 2 million vertices, 200 million edges.

Hadoop: 95 cores, 7.5 hrs.

Slide 64

CoEM (Rosie Jones, 2005)

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min

15x faster on 6x fewer CPUs!

[Figure: GraphLab CoEM speedup versus number of cores, compared to optimal.]

Slide 65

Experiments

Amazon EC2 high-performance nodes.

Slide 66

Video Cosegmentation

Goal: identify segments that mean the same thing across video frames.

Model: 10.5 million nodes, 31 million edges.
Gaussian EM clustering + BP on a 3D grid.

Slide 67

Video Cosegmentation Speedups

Slide 68

Prefetching Data & Locks

Slide 69

Matrix Factorization

Netflix collaborative filtering: alternating least squares matrix factorization.

Model: 0.5 million nodes, 99 million edges.

[Figure: the Netflix ratings as a bipartite users-movies graph, factorized with latent dimension d.]

Slide 70

Netflix Speedup

[Figure: speedup with increasing size d of the matrix factorization.]

Slide 71

The Cost of Hadoop

Slide 72

Summary

An abstraction tailored to machine learning that targets graph-parallel algorithms.

Naturally expresses:
Data and computational dependencies
Dynamic, iterative computation

Simplifies parallel algorithm design.
Automatically ensures data consistency.
Achieves state-of-the-art parallel performance on a variety of problems.

Slide 73

Check out GraphLab: http://graphlab.org

Documentation... Code... Tutorials...

Questions & feedback: jegonzal@cs.cmu.edu

Slide 74

Current/Future Work

Out-of-core storage
Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
Sub-scope parallelism: address the challenge of very high degree nodes
Improved graph partitioning
Support for dynamic graph structure