Slide 1
A New Parallel Framework for Machine Learning
Joseph Gonzalez
Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, and Alex Smola
Slide 2
In ML we face BIG problems:
- 48 hours of video uploaded to YouTube every minute
- 24 million Wikipedia pages
- 750 million Facebook users
- 6 billion Flickr photos
Slide 3
Parallelism: Hope for the Future
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds
New challenges for designing machine learning algorithms: race conditions and deadlocks; managing distributed model state
New challenges for implementing machine learning algorithms: parallel debugging and profiling; hardware-specific APIs
Slide 4
How will we design and implement parallel learning systems?
Slide 5
We could use…
Threads, Locks, & Messages
"low level parallel primitives"
Slide 6
Threads, Locks, and Messages
ML experts repeatedly solve the same parallel design challenges:
- Implement and debug a complex parallel system
- Tune for a specific parallel platform
- Two months later the conference paper contains: "We implemented ______ in parallel."
The resulting code:
- is difficult to maintain
- is difficult to extend
- couples the learning model to the parallel implementation
[Figure: graduate students]
Slide 7
… a better answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions
Slide 8
MapReduce – Map Phase
Embarrassingly parallel independent computation; no communication needed.
[Figure: four CPUs, each independently computing a value over its own data partition]
Slide 9
MapReduce – Map Phase
Image features
[Figure: the four CPUs extract image features from separate data partitions in parallel]
Slide 10
MapReduce – Map Phase
Embarrassingly parallel independent computation; no communication needed.
[Figure: the four CPUs continue over the remaining data partitions]
Slide 11
MapReduce – Reduce Phase
Image features are aggregated into summary statistics.
[Figure: two CPUs reduce the mapped values into "attractive face" statistics and "ugly face" statistics]
Slide 12
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Is there more to Machine Learning?
Slide 13
Concrete Example: Label Propagation
Slide 14
Label Propagation Algorithm
Social Arithmetic:
- 50% What I list on my profile
- 40% What Sue Ann likes
- 10% What Carlos likes
Me (50%): 50% Cameras, 50% Biking
Sue Ann (40%): 80% Cameras, 20% Biking
Carlos (10%): 30% Cameras, 70% Biking
I Like: 60% Cameras + 40% Biking
Recurrence Algorithm: iterate until convergence:
Likes[i] = Σⱼ Wᵢⱼ × Likes[j]
Parallelism: compute all Likes[i] in parallel
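The social arithmetic above is just a weighted average of neighboring interest vectors. A minimal sketch; the function and variable names are illustrative, not part of any framework:

```python
# Weighted average of neighbors' interest vectors (illustrative names).
def propagate(weights, likes):
    """Combine neighbors' interest vectors according to edge weights."""
    result = {}
    for neighbor, w in weights.items():
        for topic, p in likes[neighbor].items():
            result[topic] = result.get(topic, 0.0) + w * p
    return result

weights = {"me": 0.5, "sue_ann": 0.4, "carlos": 0.1}
likes = {
    "me":      {"cameras": 0.5, "biking": 0.5},
    "sue_ann": {"cameras": 0.8, "biking": 0.2},
    "carlos":  {"cameras": 0.3, "biking": 0.7},
}
print(propagate(weights, likes))  # roughly 60% cameras, 40% biking
```

This reproduces the slide's arithmetic: 0.5×50% + 0.4×80% + 0.1×30% = 60% cameras.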
Slide 15
Properties of Graph Parallel Algorithms
- Dependency Graph
- Factored Computation (What I Like depends on What My Friends Like)
- Iterative Computation
Slide 16
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Map Reduce?): Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Slide 17
Why not use Map-Reduce for Graph Parallel Algorithms?
Slide 18
Data Dependencies
Map-Reduce does not efficiently express dependent data:
- User must code substantial data transformations
- Costly data replication
[Figure: table of independent data rows]
Slide 19
Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
[Figure: three iterations over the data, each separated by a barrier; a slow processor stalls every iteration]
Slide 20
MapAbuse: Iterative MapReduce
Only a subset of the data needs computation:
[Figure: three iterations separated by barriers; most of the data is recomputed unnecessarily in each iteration]
Slide 21
MapAbuse: Iterative MapReduce
The system is not optimized for iteration:
[Figure: each iteration pays a startup penalty and a disk penalty at the barrier]
Slide 22
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Map Reduce? Pregel (Giraph)?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Slide 23
Pregel (Giraph)
Bulk Synchronous Parallel model: a compute phase, then a communicate phase, separated by a barrier.
Slide 24
Problem with Bulk Synchronous
Example algorithm: if any neighbor is red, then turn red.
Bulk synchronous computation: evaluate the condition on all vertices in every phase. 4 phases, each with 9 computations: 36 computations.
Asynchronous (wave-front) computation: evaluate the condition only when a neighbor changes. 4 phases with 2 computations on average: 8 computations.
[Figure: a 3×3 grid of vertices at times 0 through 4 as red spreads]
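The 36-vs-8 comparison can be reproduced by simulation. A sketch under the assumption of a 3×3 grid with 4-neighbor adjacency and red starting in one corner (the slide's exact setup may differ):

```python
# Count condition evaluations for the rule "turn red if a neighbor is
# red", comparing bulk synchronous vs. asynchronous (wave-front).
GRID = {(r, c) for r in range(3) for c in range(3)}

def neighbors(v):
    r, c = v
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(r + dr, c + dc) for dr, dc in steps if (r + dr, c + dc) in GRID]

def bulk_synchronous(start):
    """Every phase evaluates the condition on every vertex."""
    red, evals = {start}, 0
    while red != GRID:
        evals += len(GRID)
        red |= {v for v in GRID if any(n in red for n in neighbors(v))}
    return evals

def asynchronous(start):
    """Evaluate the condition only on vertices whose neighbor changed."""
    red, changed, evals = {start}, {start}, 0
    while changed:
        frontier = {v for c in changed for v in neighbors(c) if v not in red}
        evals += len(frontier)
        red |= frontier
        changed = frontier
    return evals

print(bulk_synchronous((0, 0)), asynchronous((0, 0)))  # 36 8
```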
Slide 25
Bulk synchronous computation can be highly inefficient.
Example problem: Loopy Belief Propagation
Slide 26
Loopy Belief Propagation (Loopy BP)
Iteratively estimate the "beliefs" about vertices:
- Read in messages
- Update the marginal estimate (belief)
- Send updated out messages
Repeat for all variables until convergence.
Slide 27
Bulk Synchronous Loopy BP
Often considered embarrassingly parallel: associate a processor with each vertex; receive all messages, update all beliefs, send all messages.
Proposed by: Brunton et al. CRV'06; Mendiburu et al. GECC'07; Kang et al. LDMTA'10; …
Slide 28
Sequential Computational Structure
Slide 29
Hidden Sequential Structure
Slide 30
Hidden Sequential Structure
Evidence enters at both ends of the chain.
Running time = (time for a single parallel iteration) × (number of iterations)
Slide 31
Optimal Sequential Algorithm
Running time on a chain of length n:
- Forward-Backward (sequential, p = 1): 2n
- Bulk Synchronous (parallel): 2n²/p, for p ≤ 2n
- Optimal Parallel (p = 2): n
The gap between bulk synchronous and the optimal parallel algorithm grows with n.
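Plugging numbers into the running-time expressions above makes the gap concrete. A sketch of the slide's formulas, not a measurement:

```python
# The slide's running-time expressions for BP on a chain of length n.
def forward_backward(n):      # optimal sequential algorithm, p = 1
    return 2 * n

def bulk_synchronous(n, p):   # n iterations, 2n messages each, p processors
    return 2 * n * n / p

def optimal_parallel(n):      # forward and backward passes run concurrently
    return n                  # p = 2

n = 1000
print(forward_backward(n))     # 2000
print(bulk_synchronous(n, 2))  # 1000000.0 -- the gap grows with n
print(optimal_parallel(n))     # 1000
```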
Slide 32
The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1. Grow a BFS spanning tree of fixed size
2. Forward pass computing all messages at each vertex
3. Backward pass computing all messages at each vertex
Slide 33
Data-Parallel Algorithms can be Inefficient
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.
[Figure: runtime of optimized in-memory bulk synchronous BP vs. asynchronous Splash BP]
Slide 34
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallelism.
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Pregel (Giraph)): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Slide 35
What is GraphLab?
Slide 36
The GraphLab Framework
- Graph Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Slide 37
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Graph: social network
Vertex data: user profile text; current interest estimates
Edge data: similarity weights
Slide 38
Implementing the Data Graph
Multicore setting: in memory; relatively straightforward.
- vertex_data(vid) → data
- edge_data(vid, vid) → data
- neighbors(vid) → vid_list
Challenge: fast lookup, low overhead.
Solution: dense data structures; fixed Vdata & Edata types; immutable graph structure.
Cluster setting: in memory; partition the graph (ParMETIS or random cuts); cached ghosting.
[Figure: graph partitioned across two nodes, with ghost copies of boundary vertices A–D on each node]
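A toy version of this interface can be sketched in Python. GraphLab's real data graph is a dense, type-fixed C++ structure; the dictionary-based class below is only illustrative:

```python
# Minimal in-memory data graph exposing the access pattern the slide
# lists: vertex_data(vid), edge_data(u, v), neighbors(vid).
class DataGraph:
    def __init__(self):
        self._vdata = {}   # vid -> vertex data
        self._edata = {}   # (u, v) -> edge data
        self._adj = {}     # vid -> list of neighbor vids

    def add_vertex(self, vid, data):
        self._vdata[vid] = data
        self._adj.setdefault(vid, [])

    def add_edge(self, u, v, data):
        self._edata[(u, v)] = data
        self._adj[u].append(v)
        self._adj[v].append(u)

    def vertex_data(self, vid):
        return self._vdata[vid]

    def edge_data(self, u, v):
        # Undirected lookup: try both orientations.
        return self._edata.get((u, v), self._edata.get((v, u)))

    def neighbors(self, vid):
        return self._adj[vid]

g = DataGraph()
g.add_vertex("me", {"interests": {"cameras": 0.5}})
g.add_vertex("sue_ann", {"interests": {"cameras": 0.8}})
g.add_edge("me", "sue_ann", {"weight": 0.4})
print(g.neighbors("me"))             # ['sue_ann']
print(g.edge_data("sue_ann", "me"))  # {'weight': 0.4}
```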
Slide 39
The GraphLab Framework
- Graph Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Slide 40
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], Wij, Likes[j]) ← scope;
  // Update the vertex data
  Likes[i] ← Σⱼ Wᵢⱼ × Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
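The pseudocode above can be sketched as a runnable serial loop. The names `label_prop`, the FIFO `schedule`, and the convergence tolerance are illustrative stand-ins for the framework's parallel machinery:

```python
# Serial sketch of an update function driven by a scheduler queue.
from collections import deque

def label_prop(i, likes, weights, schedule):
    """Recompute Likes[i] from neighbors; reschedule them if it changed."""
    old = likes[i]
    likes[i] = sum(w * likes[j] for j, w in weights[i].items())
    if abs(likes[i] - old) > 1e-6:           # "if Likes[i] changes"
        for j in weights[i]:
            schedule.append(j)               # reschedule_neighbors_of(i)

# Tiny 2-vertex example: each vertex averages itself with its neighbor.
likes = {0: 1.0, 1: 0.0}
weights = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
schedule = deque([0, 1])
while schedule:                              # run until the scheduler is empty
    label_prop(schedule.popleft(), likes, weights, schedule)
print(likes)  # both values have converged to (nearly) the same number
```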
Slide 41
The GraphLab Framework
- Graph Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Slide 42
The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPUs pull vertices from the scheduler queue; executing an update function may add new vertices to the queue]
The process repeats until the scheduler is empty.
Slide 43
Choosing a Schedule
GraphLab provides several different schedulers:
- Round Robin: vertices are updated in a fixed order
- FIFO: vertices are updated in the order they are added
- Priority: vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm.
Obtain different algorithms by simply changing a flag!
--scheduler=roundrobin
--scheduler=fifo
--scheduler=priority
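The difference between the FIFO and priority schedulers comes down to the underlying queue. A sketch with illustrative class names:

```python
# FIFO vs. priority scheduling of vertex updates.
import heapq
from collections import deque

class FifoScheduler:
    def __init__(self):
        self.q = deque()
    def add(self, vid, priority=0.0):
        self.q.append(vid)          # priority is ignored
    def pop(self):
        return self.q.popleft()

class PriorityScheduler:
    def __init__(self):
        self.q = []
    def add(self, vid, priority=0.0):
        heapq.heappush(self.q, (-priority, vid))  # max-priority first
    def pop(self):
        return heapq.heappop(self.q)[1]

fifo, prio = FifoScheduler(), PriorityScheduler()
for vid, p in [("a", 0.1), ("b", 0.9), ("c", 0.5)]:
    fifo.add(vid, p)
    prio.add(vid, p)
print(fifo.pop(), fifo.pop(), fifo.pop())  # a b c (insertion order)
print(prio.pop(), prio.pop(), prio.pop())  # b c a (priority order)
```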
Slide 44
Implementing the Schedulers
Multicore setting: challenging!
- Fine-grained locking
- Atomic operations
- Approximate FIFO/priority: random placement, work stealing
Cluster setting:
- Multicore scheduler on each node
- Schedules only "local" vertices
- Exchange update functions between nodes
[Figure: per-CPU queues within each node; update functions f(v1), f(v2) exchanged across nodes]
Slide 45
The GraphLab Framework
- Graph Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Slide 46
Ensuring Race-Free Code
How much can computation overlap?
Slide 47
Importance of Consistency
Many algorithms require strict consistency, or perform significantly better under strict consistency.
Example: Alternating Least Squares
Slide 48
Importance of Consistency
Machine learning algorithms require "model debugging":
Build → Test → Debug → Tweak Model
Slide 49
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution on two CPUs and an equivalent sequential execution on a single CPU over time]
Slide 50
Common Problem: Write-Write Race
Processors running adjacent update functions simultaneously modify shared data:
[Figure: CPU 1 and CPU 2 both write to shared edge data; the final value depends on timing]
Slide 51
Consistency Rules
Guaranteed sequential consistency for all update functions.
[Figure: the full-consistency scope of an update function over the data graph]
Slide 52
Full Consistency
[Figure: full-consistency scopes of concurrently executing update functions may not overlap]
Slide 53
Obtaining More Parallelism
[Figure: edge consistency permits more concurrent updates than full consistency]
Slide 54
Edge Consistency
[Figure: CPU 1 and CPU 2 run updates whose scopes overlap only in a safe read]
Slide 55
Consistency Through R/W Locks
Read/write locks:
- Full consistency: write-lock the center vertex and all adjacent vertices
- Edge consistency: write-lock the center vertex, read-lock adjacent vertices
Acquire locks in a canonical ordering to avoid deadlock.
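A sketch of the canonical-ordering idea: sort the scope's lock requests by vertex id before acquiring, so two overlapping updates always contend in the same order and cannot deadlock. Python's `threading` locks have no reader/writer modes, so the mode is recorded but not enforced here; all names are illustrative:

```python
# Edge-consistency locking with a canonical acquisition order.
import threading

locks = {v: threading.RLock() for v in range(5)}

def lock_plan(center, neighbors):
    """Return (vertex, mode) pairs in canonical acquisition order."""
    plan = [(center, "write")] + [(n, "read") for n in neighbors]
    return sorted(plan)  # canonical order: sorted by vertex id

def edge_consistent_update(center, neighbors, fn):
    plan = lock_plan(center, neighbors)
    for v, _mode in plan:       # acquire in canonical order
        locks[v].acquire()
    try:
        fn(center)              # run the update function on the vertex
    finally:
        for v, _mode in reversed(plan):
            locks[v].release()

print(lock_plan(3, [1, 4]))  # [(1, 'read'), (3, 'write'), (4, 'read')]
seen = []
edge_consistent_update(3, [1, 4], seen.append)
print(seen)  # [3]
```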
Slide 56
Consistency Through R/W Locks
Multicore setting: pthread R/W locks.
Distributed setting: distributed locking; prefetch locks and data.
Allow computation to proceed while locks/data are requested (lock pipeline).
[Figure: node 1 requests locks and data from node 2's data-graph partition through a lock pipeline]
Slide 57
Consistency Through Scheduling
Edge consistency model: two vertices can be updated simultaneously if they do not share an edge.
Graph coloring: two vertices can be assigned the same color if they do not share an edge.
Execute all vertices of one color in parallel, then a barrier, then the next color.
[Figure: three color phases separated by barriers]
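The coloring step can be sketched with a greedy algorithm. Illustrative only; the slide does not mandate a particular coloring method:

```python
# Greedy graph coloring to schedule edge-consistent updates: vertices
# of one color share no edge, so each color class can run in parallel
# between barriers.
def greedy_color(adjacency):
    color = {}
    for v in sorted(adjacency):
        taken = {color[n] for n in adjacency[v] if n in color}
        color[v] = next(c for c in range(len(adjacency)) if c not in taken)
    return color

# 4-cycle: opposite corners can share a color.
adjacency = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
color = greedy_color(adjacency)
assert all(color[u] != color[v] for u in adjacency for v in adjacency[u])

phases = {}
for v, c in color.items():
    phases.setdefault(c, []).append(v)
print(phases)  # {0: [0, 2], 1: [1, 3]} -- two phases instead of four updates
```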
Slide 58
The GraphLab Framework
- Graph Based Data Representation
- Update Functions (User Computation)
- Scheduler
- Consistency Model
Slide 59
Algorithms Implemented
- PageRank
- Loopy Belief Propagation
- Gibbs Sampling
- CoEM
- Graphical Model Parameter Learning
- Probabilistic Matrix/Tensor Factorization
- Alternating Least Squares
- Lasso with Sparse Features
- Support Vector Machines with Sparse Features
- Label Propagation
- …
Slide 60
Shared Memory Experiments
Shared memory setting: 16-core workstation
Slide 61
Loopy Belief Propagation
Data graph: 3D retinal image denoising; 1 million vertices, 3 million edges
Update function: loopy BP update equation
Scheduler: approximate priority
Consistency model: edge consistency
Slide 62
Loopy Belief Propagation
SplashBP: 15.5x speedup
[Figure: speedup vs. number of cores, compared against optimal]
Slide 63
CoEM (Rosie Jones, 2005)
Named entity recognition task: Is "Dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph linking noun phrases ("the dog", "Australia", "Catalina Island") to contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]
Vertices: 2 million; edges: 200 million
Hadoop: 95 cores, 7.5 hrs
Slide 64
CoEM (Rosie Jones, 2005)
[Figure: GraphLab CoEM speedup vs. number of cores, approaching optimal]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
15x faster with 6x fewer CPUs!
Slide 65
Experiments
Amazon EC2 high-performance nodes
Slide 66
Video Cosegmentation
Segments that mean the same are linked.
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on a 3D grid
Slide 67
Video Cosegmentation Speedups
Slide 68
Prefetching Data & Locks
Slide 69
Matrix Factorization
Netflix collaborative filtering: alternating least squares matrix factorization
Model: 0.5 million nodes, 99 million edges
[Figure: bipartite graph of Netflix users and movies, factored with rank d]
Slide 70
Netflix Speedup
Speedup as the size of the matrix factorization increases.
Slide 71
The Cost of Hadoop
Slide 72
Summary
An abstraction tailored to machine learning: targets graph-parallel algorithms.
Naturally expresses data/computational dependencies and dynamic iterative computation.
Simplifies parallel algorithm design; automatically ensures data consistency.
Achieves state-of-the-art parallel performance on a variety of problems.
Slide 73
Check out GraphLab
http://graphlab.org
Documentation… Code… Tutorials…
Questions & Feedback: jegonzal@cs.cmu.edu
Slide 74
Current/Future Work
- Out-of-core storage
- Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
- Sub-scope parallelism: address the challenge of very high degree nodes
- Improved graph partitioning
- Support for dynamic graph structure