Slide 1
Graph-Based Parallel Computing
William Cohen
Slide 2
Announcements
Next Tuesday 12/8:
Presentations for 10-805 projects.
15 minutes per project.
Final written reports due Tues 12/15.

For the exam:
Spectral clustering will not be covered.
It's ok to bring in two pages of notes.
We'll give out a solution sheet for HW7 on Wednesday at noon, but you get no credit on HW7 questions 5-7 if you turn in answers after that point.
Slide 3
Outline
Motivation/where it fits
Sample systems (c. 2010):
  Pregel (and some sample programs): bulk synchronous processing
  Signal/Collect and GraphLab: asynchronous processing
GraphLab descendants:
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 4
Many Graph-Parallel Algorithms
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
Semi-supervised ML: Graph SSL, CoEM
Community Detection: Triangle Counting, K-core Decomposition, K-Truss
Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
Classification: Neural Networks
Slide 5
Signal/collect model
Signals are made available in a list and a map.
The next state for a vertex is the output of the collect() operation.
We'll relax "num_iterations" soon.
(A minimal sketch of the loop follows.)
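To make the model concrete, here is a minimal single-machine Python sketch of a synchronous signal/collect loop. The Vertex class and run() driver are illustrative scaffolding of my own, not the real framework's API (the actual system is Scala and supports asynchronous scheduling).

    # Toy synchronous signal/collect scheduler (illustrative only).
    class Vertex:
        def __init__(self, vid, state, out_edges=()):
            self.vid, self.state = vid, state
            self.out_edges = list(out_edges)   # ids of signal targets

        def signal(self, target):
            # The message sent along an edge; override per algorithm.
            return self.state

        def collect(self, signals):
            # Next state from the incoming signals; override per algorithm.
            return self.state

    def run(vertices, num_iterations):
        # vertices: dict vid -> Vertex
        for _ in range(num_iterations):        # the "num_iterations" we relax later
            # Signals are made available per destination vertex (the "map").
            inbox = {vid: [] for vid in vertices}
            for v in vertices.values():
                for tgt in v.out_edges:
                    inbox[tgt].append(v.signal(tgt))
            # The next state of each vertex is the output of collect().
            for v in vertices.values():
                v.state = v.collect(inbox[v.vid])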
Slide 6
Slide 7
CoEM (Rosie Jones, 2005)
[Figure: CoEM runtime from small to large problem sizes; closer to "Optimal" is better.]
GraphLab: 16 cores, 30 min. 15x faster, with 6x fewer CPUs!
Hadoop: 95 cores, 7.5 hrs.

Slide 8
Graph Abstractions: GraphLab, continued…
Slide 9
Outline
Motivation/where it fits
Sample systems (c. 2010):
  Pregel (and some sample programs): bulk synchronous processing
  Signal/Collect and GraphLab: asynchronous processing
GraphLab descendants:
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 10
GraphLab’s descendants
PowerGraph
GraphChi
GraphX
On a multicore architecture: shared memory for workers.
On a cluster architecture (like Pregel): different memory spaces.
What are the challenges moving away from shared-memory?
Slide 11
Natural Graphs: Power Law
Top 1% of vertices is adjacent to 53% of the edges!
AltaVista Web Graph: 1.4B vertices, 6.7B edges.
"Power law": slope α ≈ 2.

GraphLab group / Aapo
Slide 12
Problem: High-Degree Vertices Limit Parallelism

A high-degree vertex:
touches a large fraction of the graph (GraphLab 1)
produces many messages (Pregel, Signal/Collect)
has edge information too large for a single machine
Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).

GraphLab group / Aapo

Slide 13
PowerGraph
Problem: GraphLab's localities can be large. "All neighbors of a node" can be large for hubs (high-indegree nodes).
Approach:
a new graph-partitioning algorithm that can replicate data
a gather-apply-scatter (GAS) API for finer-grained parallelism:
  gather ~ combiner
  apply ~ vertex UDF (run for all replicas)
  scatter ~ messages from vertex to edges
Slide 14
Signal/collect examples
Single-source shortest path
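The slide's figure is a worked example; as a stand-in, here is a hedged, self-contained Python sketch of SSSP in the signal/collect style: the signal is my current distance plus the edge weight, and collect takes the minimum over the old state and incoming signals. The graph encoding is an assumption of mine.

    import math

    def sssp(out_edges, source, num_iterations):
        # out_edges: dict vid -> list of (neighbor, weight)
        dist = {v: math.inf for v in out_edges}
        dist[source] = 0.0
        for _ in range(num_iterations):
            inbox = {v: [] for v in out_edges}
            for v, nbrs in out_edges.items():
                for u, w in nbrs:
                    inbox[u].append(dist[v] + w)   # signal along each out-edge
            # collect: keep the smallest distance seen so far
            dist = {v: min([dist[v]] + inbox[v]) for v in out_edges}
        return dist

    # e.g. sssp({1: [(2, 1.0)], 2: [(3, 2.0)], 3: []}, source=1, num_iterations=3)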
Slide 15
Signal/collect examples
PageRank
Life
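Again as a stand-in for the figure, a sketch of PageRank in the same style: the signal is my rank split over my out-edges, and collect is the damped sum of incoming signals, in the R = β + (1 − β)·Σ form used later in the lecture. (The Game-of-Life example is analogous: signal = my liveness, collect = the update rule over neighbor counts.) Initialization to 1.0 is an assumption (the unnormalized formulation).

    def pagerank(out_links, beta=0.15, num_iterations=20):
        # out_links: dict vid -> list of successor vids
        rank = {v: 1.0 for v in out_links}
        for _ in range(num_iterations):
            inbox = {v: [] for v in out_links}
            for v, succs in out_links.items():
                for u in succs:
                    inbox[u].append(rank[v] / len(succs))   # signal: my share of rank
            # collect: damped sum of the incoming signals
            rank = {v: beta + (1 - beta) * sum(inbox[v]) for v in out_links}
        return rank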
Slide 16
Signal/collect examples
Co-EM / wvRN / harmonic fields
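As a stand-in for the figure: in wvRN/harmonic-fields label propagation, labeled nodes are clamped and every other node's collect() is the weighted average of its neighbors' scores. A binary-label Python sketch; the seed handling and default score of 0.5 are illustrative assumptions.

    def harmonic_fields(nbrs, seeds, num_iterations=20):
        # nbrs: dict vid -> list of (neighbor, weight); seeds: vid -> score in [0, 1]
        score = {v: seeds.get(v, 0.5) for v in nbrs}
        for _ in range(num_iterations):
            new = {}
            for v, edges in nbrs.items():
                if v in seeds:
                    new[v] = seeds[v]            # labeled nodes stay clamped
                else:
                    total = sum(w for _, w in edges)
                    # collect: weighted average of neighbors' current scores
                    new[v] = (sum(w * score[u] for u, w in edges) / total
                              if total else score[v])
            score = new
        return score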
Slide 17
PageRank in PowerGraph
PageRankProgram(i):
  Gather(j -> i):  return w[j,i] * R[j]
  sum(a, b):       return a + b
  Apply(i, Σ):     R[i] = β + (1 - β) * Σ
  Scatter(i -> j): if R[i] changed then activate(j)
GraphLab group / Aapo
Here j -> i ranges over the in-edges of vertex i. gather/sum is like a collect, or a group-by … reduce (with a combiner); scatter is like a signal. (A runnable sketch follows.)
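A runnable single-machine transcription of the program above, with a toy activation set standing in for PowerGraph's distributed scheduler. It assumes the weights w[j, i] are normalized (e.g. 1/outdegree(j)) so the iteration converges; the tolerance-based "changed" test is my own stand-in for the slide's condition.

    def gas_pagerank(in_nbrs, out_nbrs, w, beta=0.15, tol=1e-4):
        # in_nbrs/out_nbrs: dict vid -> list of vids; w: dict (j, i) -> weight
        R = {v: 1.0 for v in in_nbrs}
        active = set(in_nbrs)
        while active:
            i = active.pop()
            # Gather(j -> i): w[j,i] * R[j], combined with sum(a, b) = a + b
            acc = sum(w[(j, i)] * R[j] for j in in_nbrs[i])
            # Apply(i, acc): R[i] = beta + (1 - beta) * acc
            new = beta + (1 - beta) * acc
            changed = abs(new - R[i]) > tol
            R[i] = new
            # Scatter(i -> j): if R[i] changed, activate the out-neighbors
            if changed:
                active.update(out_nbrs[i])
        return R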
Slide 18

Distributed Execution of a PowerGraph Vertex-Program

[Figure: a vertex Y is split across Machines 1-4. Gather computes partial sums Σ1…Σ4 on each machine; the partials are summed; Apply produces the new value Y'; Y' is copied back to each machine; Scatter then runs locally.]

GraphLab group / Aapo

Slide 19
Minimizing Communication in PowerGraph

A vertex-cut minimizes the number of machines each vertex spans.
Communication is linear in the number of machines each vertex spans.
Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al. 2000]
(A sketch of greedy edge placement follows.)

GraphLab group / Aapo
Slide 20

Partitioning Performance
Twitter Graph: 41M vertices, 1.4B edges.
[Figure: two panels, partition cost and construction time, comparing Random, Oblivious, and Greedy placement; lower is better on both.]
Oblivious balances partition quality and partitioning time.

GraphLab group / Aapo

Slide 21
Partitioning matters…
GraphLab group / Aapo

Slide 22
Outline
Motivation/where it fits
Sample systems (c. 2010):
  Pregel (and some sample programs): bulk synchronous processing
  Signal/Collect and GraphLab: asynchronous processing
GraphLab descendants:
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 23
GraphLab’s descendants
PowerGraph
GraphChi
GraphX
Slide 24
GraphLab, cont’d
PowerGraph
GraphChi
Goal: use graph abstraction on-disk, not in-memory, on a conventional workstation
[Figure: the GraphLab stack: a general-purpose API over PThreads, MPI/TCP-IP, Hadoop/HDFS, and Linux cluster services (Amazon AWS), supporting graph analytics, graphical models, computer vision, clustering, topic modeling, and collaborative filtering.]
Slide 25
GraphLab, cont’d
GraphChi
Key insight: some algorithms on graphs are streamable (e.g., PageRank-Nibble).
In general we can't easily stream the graph, because a vertex's neighbors are scattered across the file.
But maybe we can limit the degree to which they're scattered … enough to make streaming possible?
"Almost-streaming": keep P cursors in a file, instead of one.
Slide 26
PSW: Shards and Intervals

Vertices are numbered from 1 to n.
There are P intervals, each associated with a shard on disk.
A sub-graph = an interval of vertices.
[Figure: the vertex range 1…n is split into interval(1) … interval(P), each backed by shard(1) … shard(P) on disk.]
(PSW phases: 1. Load, 2. Compute, 3. Write.)

Slide 27
PSW: Layout
A shard holds the in-edges for an interval of vertices, sorted by source-id. Shards are small enough to fit in memory; balance the sizes of the shards.
[Figure: vertices 1..100, 101..700, 701..1000, and 1001..10000 map to Shards 1-4; Shard 1 holds the in-edges for vertices 1..100, sorted by source_id.]
(A sketch of building this layout follows.)
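A minimal Python sketch of building this layout in memory. Equal-width intervals are an assumption of the sketch; GraphChi actually balances shard sizes, not interval widths.

    def build_shards(edges, n, P):
        # edges: list of (src, dst), vertex ids in 1..n
        step = -(-n // P)                      # ceil(n / P): interval width
        bounds = [(k * step + 1, min((k + 1) * step, n)) for k in range(P)]
        shards = [[] for _ in range(P)]
        for src, dst in edges:
            shards[(dst - 1) // step].append((src, dst))   # shard by dst interval
        for s in shards:
            s.sort()                           # in-edges sorted by source id
        return bounds, shards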
Slide 28
PSW: Loading Sub-graph

To load the sub-graph for vertices 1..100, load all of their in-edges into memory: Shard 1, sorted by source_id.
What about out-edges? They are arranged in sequence in the other shards (Shards 2-4).
Slide 29
PSW: Loading Sub-graph (continued)

To load the sub-graph for vertices 101..700, load all of their in-edges into memory (Shard 2); the matching out-edge blocks of the other shards are in memory as well.
Slide 30
PSW Load-Phase

Only P large reads for each interval, so P² reads on one full pass (P intervals × P reads each).
Slide 31
PSW: Execute Updates

The update-function is executed on the interval's vertices. Edges have pointers to the loaded data blocks, so changes take effect immediately: asynchronous.
[Figure: edge records holding pointers (&Data) into the in-memory blocks X and Y.]

Slide 32
PSW: Commit to Disk

In the write phase, the blocks are written back to disk; the next load-phase sees the preceding writes: asynchronous.
[Figure: the in-memory blocks X and Y flushed back to their shards.]
In total: P² reads and writes per full pass over the graph. Performs well on both SSD and hard drive.
To make this work, the size of a vertex's state can't change when it's updated (at least, as stored on disk). (A toy rendition of a full pass follows.)
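Putting the three phases together: a toy in-memory rendition of one parallel-sliding-windows pass over the shards built earlier. Each disk read or write is simulated by a list scan, which is an assumption of the sketch; the access pattern (P sequential reads per interval, P² per pass) is the point.

    def psw_pass(bounds, shards, update):
        # bounds/shards as returned by build_shards; update(v, in_edges)
        # is the user's vertex program.
        P = len(shards)
        for k, (lo, hi) in enumerate(bounds):            # execution interval k
            # 1. Load: the memory shard k in full (in-edges of interval k),
            window = list(shards[k])
            # plus one sliding window from every other shard: the edges whose
            # source lies in interval k are contiguous there (sorted by source),
            # so this is P sequential reads in total.
            for j in range(P):
                if j != k:
                    window += [(s, d) for (s, d) in shards[j] if lo <= s <= hi]
            # 2. Compute: run the update-function on the interval's vertices.
            for v in range(lo, hi + 1):
                update(v, [(s, d) for (s, d) in window if d == v])
            # 3. Write: modified blocks would now be flushed back to disk.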
Slide 33

Experiment Setting
Mac Mini (Apple Inc.): Intel Core i5, 2.5 GHz; 8 GB RAM; 256 GB SSD; 1 TB hard drive.
Experiment graphs:
Graph         Vertices   Edges   P (shards)   Preprocessing
live-journal  4.8M       69M     3            0.5 min
netflix       0.5M       99M     20           1 min
twitter-2010  42M        1.5B    20           2 min
uk-2007-05    106M       3.7B    40           31 min
uk-union      133M       5.4B    50           33 min
yahoo-web     1.4B       6.6B    50           37 min

Slide 34
Comparison to Existing Systems
Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all of the others except GraphLab compute synchronously.
[Figure: benchmark panels for PageRank (WebGraph), Belief Propagation (U Kang et al.), Matrix Factorization (Alt. Least Sqr.), and Triangle Counting. See the paper for more comparisons.]
On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems can, with comparable performance.
Slide 35
Outline
Motivation/where it fits
Sample systems (c. 2010):
  Pregel (and some sample programs): bulk synchronous processing
  Signal/Collect and GraphLab: asynchronous processing
GraphLab "descendants":
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark (Gonzalez)
Slide 36
GraphLab’s descendants
PowerGraph
GraphChi
GraphX
GraphX is an implementation of GraphLab's API on top of Spark.
Motivations: avoid transfers between subsystems; leverage a larger community for common infrastructure.
What's different: graphs are now immutable, and operations transform one graph into another (RDD → RDG, "resilient distributed graph").
Slide 37
The GraphX Stack (Lines of Code)

GraphX (3,575), on Spark
Pregel (28) + GraphLab (50), implemented on GraphX
Algorithms on top: PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40)

Slide 38
Idea: Graph as Tables
Property Graph (vertices R, J, F, I):

Vertex Property Table:
  Id        Property (V)
  rxin      (Stu., Berk.)
  jegonzal  (PstDoc, Berk.)
  franklin  (Prof., Berk.)
  istoica   (Prof., Berk.)

Edge Property Table:
  SrcId     DstId     Property (E)
  rxin      jegonzal  Friend
  franklin  rxin      Advisor
  istoica   franklin  Coworker
  franklin  jegonzal  PI

Under the hood, things can be split even more finely: e.g., a vertex map table + a vertex data table. Operators maximize structure sharing and minimize communication. (Not shown: partition ids, carefully assigned….)
In signal/collect terms, the vertex property (V) plays the role of the "vertex state", and messages play the role of the "signal".

Slide 39
Like signal/collect:
Join the vertex and edge tables.
Map with mapFunc over the edges.
Reduce by destination vertex, using reduceFunc.
(A sketch follows.)
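Real GraphX is Scala on Spark (this dataflow is its mapReduceTriplets/aggregateMessages operator); the Python sketch below only mimics the join-map-reduce pattern on plain dicts, with function and variable names of my own choosing.

    def aggregate_messages(vertices, edges, map_func, reduce_func):
        # vertices: dict vid -> vertex property; edges: list of (src, dst, edge prop)
        msgs = {}
        for src, dst, ep in edges:
            # join: build the edge triplet with both vertex properties
            triplet = (src, vertices[src], dst, vertices[dst], ep)
            # map: map_func emits (target vertex, message) pairs from the triplet
            for tgt, m in map_func(triplet):
                # reduce: combine messages by destination vertex
                msgs[tgt] = m if tgt not in msgs else reduce_func(msgs[tgt], m)
        return msgs    # the new "signal" table

    # e.g. one PageRank step: map_func emits (dst, rank[src] / outdeg[src]),
    # reduce_func is addition; an apply step then recomputes each rank.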
Slide 40

Distributed Execution of a PowerGraph Vertex-Program (repeated)

[Figure: the same diagram as Slide 18: a vertex Y split across Machines 1-4; Gather computes partials Σ1…Σ4, the partials are summed, Apply produces Y', the new value is replicated, and Scatter runs locally.]

GraphLab group / Aapo

Slide 41
Slide 42
Slide 43
Performance Comparisons
GraphX is roughly 3x slower than GraphLab (Live-Journal: 69 million edges), but it is integrated with Spark, open-source, and resilient.

Slide 44
Summary
Large immutable data structures on (distributed) disk, processed by sweeping through them and creating new data structures: stream-and-sort, Hadoop, PIG, Hive, …
Large immutable data structures in distributed memory: Spark (distributed tables).
Large mutable data structures in distributed memory:
  parameter server: the structure is a hashtable
  Pregel, GraphLab, GraphChi, GraphX: the structure is a graph
Slide 45
Summary
APIs for the various systems vary in detail but have a similar flavor
Typical algorithms iteratively update vertex state
Changes in state are communicated with messages which need to be aggregated from neighbors
Biggest wins are on problems where the graph is fixed in each iteration but the vertex data changes, and on graphs small enough to fit in (distributed) memory.

Slide 46
Some things to take away
Platforms for iterative operations on graphs
GraphX: if you want to integrate with Spark
GraphChi: if you don't have a cluster
GraphLab/Dato: if you don't need free software and performance is crucial
Pregel: if you work at Google
Giraph, Signal/Collect, … ??

Important differences:
Intended architecture: shared memory and threads, distributed cluster memory, or graph on disk
How graphs are partitioned for clusters
If processing is synchronous or asynchronous