Efficient Graph Processing with Distributed Immutable View
Rong Chen, Xin Ding, Peng Wang, Haibo Chen, Binyu Zang and Haibing Guan
Institute of Parallel and Distributed Systems
Slide 1: Efficient Graph Processing with Distributed Immutable View
Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
+Institute of Parallel and Distributed Systems, Department of Computer Science
*Shanghai Jiao Tong University
HPDC 2014
Slide 2: Big Data Everywhere
- 100 hrs of video uploaded every minute; 1.11 billion users
- 6 billion photos; 400 million tweets/day
How do we understand and use Big Data?
Slide 3: Big Data → Big Learning
- 100 hrs of video every minute; 1.11 billion users; 6 billion photos; 400 million tweets/day
- Machine Learning and Data Mining (e.g., NLP) turn Big Data into Big Learning
Slide 4: It's about the graphs ...
Slide 5: Example: PageRank
A centrality analysis algorithm to measure the relative rank of each element in a linked set.
Characteristics:
- Linked set: data dependence (rank depends on who links to it) → local accesses
- Convergence → iterative computation
[Figure: example graph with vertices 1-5]
Slide 6: Existing Graph-parallel Systems
"Think as a vertex" philosophy:
- aggregate values of neighbors
- update its own value
- activate neighbors

compute(v)  // PageRank
  double sum = 0
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges
  double value = 0.15 + 0.85 * sum
  v.set(value)
  activate(v.out_nbrs)

[Figure: example graph with vertices 1-5]
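The compute(v) pseudocode above can be run end-to-end as a small Java sketch; the Vertex class and graph wiring below are illustrative assumptions, not the actual API of any of these systems.

```java
import java.util.ArrayList;
import java.util.List;

// Vertex-centric PageRank sketch. The Vertex class and graph wiring are
// illustrative assumptions, not the API of Pregel/GraphLab/Cyclops.
public class PageRankSketch {
    static class Vertex {
        double value = 1.0;                       // current rank
        List<Vertex> inNbrs = new ArrayList<>();  // vertices linking to me
        List<Vertex> outNbrs = new ArrayList<>(); // vertices I link to

        // "Think as a vertex": aggregate in-neighbors, then update self.
        void compute() {
            double sum = 0;
            for (Vertex n : inNbrs)
                sum += n.value / n.outNbrs.size();
            value = 0.15 + 0.85 * sum;            // damping factor 0.85
            // a real engine would now call activate(outNbrs)
        }
    }

    static void link(Vertex src, Vertex dst) {    // add edge src -> dst
        src.outNbrs.add(dst);
        dst.inNbrs.add(src);
    }

    // Run a tiny 3-vertex cycle a -> b -> c -> a and return final ranks.
    static double[] run(int iterations) {
        Vertex a = new Vertex(), b = new Vertex(), c = new Vertex();
        link(a, b); link(b, c); link(c, a);
        for (int i = 0; i < iterations; i++) {
            a.compute(); b.compute(); c.compute();
        }
        return new double[]{a.value, b.value, c.value};
    }

    public static void main(String[] args) {
        // symmetric cycle: every rank converges to 1.0
        for (double v : run(30)) System.out.printf("%.3f%n", v);
    }
}
```

On this symmetric cycle every vertex has one in-neighbor of out-degree 1, so the fixed point is v = 0.15 + 0.85 * v = 1.0.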
Slide 7: Existing Graph-parallel Systems
"Think as a vertex" philosophy: aggregate values of neighbors, update its own value, activate neighbors
Execution engine:
- sync: BSP-like model
- async: distributed scheduling queues
Communication:
- message passing: push value
- distributed shared memory: sync & pull value
[Figure: comp./comm. phases; push vs. sync-and-pull; sync barrier]
Slide 8: Issues of Existing Systems
Pregel [SIGMOD'10]: sync engine, edge-cut + message passing
- w/o dynamic computation; high contention
(also examined: GraphLab [VLDB'12], PowerGraph [OSDI'12])
[Figure: master/replica placement; x1 message per replica; keep-alive messages]
Slide 9: Issues of Existing Systems
Pregel [SIGMOD'10]: sync engine, edge-cut + message passing
- w/o dynamic computation; high contention
GraphLab [VLDB'12]: async engine, edge-cut + DSM (replicas)
- high contention; hard to program; duplicated edges; heavy comm. cost
[Figure: master/replica placement; x2 messages per replica; duplicated edges]
Slide 10: Issues of Existing Systems
Pregel [SIGMOD'10]: sync engine, edge-cut + message passing
- w/o dynamic computation; high contention
GraphLab [VLDB'12]: async engine, edge-cut + DSM (replicas)
- high contention; hard to program; duplicated edges; heavy comm. cost
PowerGraph [OSDI'12]: (a)sync engine, vertex-cut + GAS (replicas)
- high contention; heavy comm. cost
[Figure: master/replica placement; up to x5 messages per vertex]
Slide 11: Contributions
Distributed Immutable View:
- Easy to program/debug
- Supports dynamic computation
- Minimized communication cost (x1 msg per replica)
- Contention (comp. & comm.) immunity
Multicore-based Cluster Support:
- Hierarchical sync. & deterministic execution
- Improved parallelism and locality
Slide 12: Outline
Distributed Immutable View:
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow
Multicore-based Cluster Support:
- Hierarchical model
- Parallelism improvement
Evaluation
Slide 13: General Idea
Observation: for most graph algorithms, a vertex only aggregates neighbors' data in one direction and activates neighbors in the other direction (e.g., PageRank, SSSP, Community Detection, ...)
→ Local aggregation/update & distributed activation
- Partitioning: avoid duplicated edges
- Computation: one-way local semantics
- Communication: merge update & activate messages
Slide 14: Graph Organization
Partition the graph and build local sub-graphs:
- Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
- Create edges in only one direction (e.g., in-edges) → avoids duplicated edges
- Create read-only replicas for edges spanning machines
[Figure: vertices 1-5 partitioned across machines M1-M3, with masters and replicas]
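A minimal sketch of this organization, assuming a hash-based edge-cut (vertex id mod #machines) and hypothetical helper names: each machine keeps only the in-edges of its own vertices and hosts a read-only replica for every remote in-neighbor.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of Cyclops-style graph organization (names are illustrative):
// hash-partition vertices, keep only in-edges locally, and create a
// read-only replica for every remote in-neighbor.
public class PartitionSketch {
    static int owner(int v, int machines) { return v % machines; } // hash edge-cut

    // For one machine m, return the set of replica vertex ids it must host.
    static Set<Integer> replicasOn(int m, int machines, int[][] edges) {
        Set<Integer> replicas = new TreeSet<>();
        for (int[] e : edges) {             // e = {src, dst}, i.e., dst's in-edge
            int src = e[0], dst = e[1];
            if (owner(dst, machines) == m && owner(src, machines) != m)
                replicas.add(src);          // remote in-neighbor -> local read-only copy
        }
        return replicas;
    }

    public static void main(String[] args) {
        // edges as {src, dst}; 3 machines, vertex v owned by machine v % 3
        int[][] edges = {{1, 2}, {2, 3}, {3, 1}, {4, 1}, {5, 4}};
        // machine 1 masters vertices {1, 4}; it needs replicas of remote sources
        System.out.println(replicasOn(1, 3, edges));
    }
}
```

Because each in-edge is stored exactly once (on its destination's machine), no edge is duplicated, and each remote source appears at most once as a replica.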
Slide 15: Vertex Computation
Local aggregation/update:
- Supports dynamic computation
- One-way local semantics
- Immutable view: read-only access to neighbors
- Eliminates contention on vertices
[Figure: read-only replicas of vertices 1-5 across machines M1-M3]
Slide 16: Communication
Sync. & distributed activation:
- Merge update & activate messages
- Update the value of replicas
- Invite replicas to activate neighbors
Message format: v|m|s (value | msg | status), e.g., 8|4|0
[Figure: machines M1-M3; master with rlist:W1, l-act:1, value:8, msg:4; replica with l-act:3, value:6, msg:3; active bit s=0]
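The merged message can be sketched as one record carrying all three fields; the field names and the replica-side handling below are assumptions for illustration, not the system's actual wire format.

```java
// One merged "v|m|s" message: a master's new value (v), the payload for
// activated neighbors (m), and a status bit (s) that tells the replica
// whether to activate its local out-neighbors. Field names and the
// replica-side handling are illustrative assumptions.
public class MergedMsg {
    final double value;   // v: new vertex value, synced to the replica
    final double msg;     // m: payload forwarded on activation
    final boolean active; // s: activate local out-neighbors?

    MergedMsg(double value, double msg, boolean active) {
        this.value = value;
        this.msg = msg;
        this.active = active;
    }

    // Replica side: a single message both updates the value and drives
    // activation, so update and activate traffic travel together.
    static String applyOnReplica(MergedMsg m) {
        String r = "value=" + m.value;
        if (m.active) r += ", activate out-neighbors with msg=" + m.msg;
        return r;
    }

    public static void main(String[] args) {
        // the slide's example "8 4 0": value 8, msg 4, status bit 0
        System.out.println(applyOnReplica(new MergedMsg(8, 4, false)));
        System.out.println(applyOnReplica(new MergedMsg(6, 3, true)));
    }
}
```

Merging the two message kinds is what keeps the cost at one message per replica per iteration.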
Slide 17: Communication
Distributed activation:
- Unidirectional message passing
- A replica will never be activated
- Activation always flows from master to replicas → contention immunity
[Figure: vertices 1-5 across machines M1-M3]
Slide 18: Change of Execution Flow
Original execution flow (e.g., Pregel): receiving → parsing (in-queues) → computation → sending (out-queues)
- high overhead; high contention
[Figure: threads, vertices, and messages flowing across machines M1-M3]
Slide 19: Change of Execution Flow
Execution flow on distributed immutable view: receiving → computation (threads update masters/replicas directly, lock-free) → sending (out-queues)
- low overhead; no contention
[Figure: threads, masters, and replicas across machines M1-M3]
Slide 20: Outline
Distributed Immutable View:
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow
Multicore-based Cluster Support:
- Hierarchical model
- Parallelism improvement
Evaluation
Slide 21: Multicore Support
Two challenges:
1. Two-level hierarchical organization: preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize: high contention to buffer and parse messages; poor locality in message parsing
Slide 22: Hierarchical Model
Design principle:
- Three levels: iteration (level-0) → worker (level-1) → thread (level-2)
- Only the last-level participants perform actual tasks
- Parents (higher-level participants) just wait until all children finish their tasks
[Figure: loop of tasks; global barrier between workers, local barrier between threads]
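The three-level scheme can be sketched with standard Java barriers; only the bottom-level threads do work, and each worker reports upward once its local barrier fires. Class and variable names are illustrative.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Three-level hierarchy sketch: threads (level-2) do the actual tasks and
// meet at a per-worker local barrier; the worker (level-1) reports to the
// iteration (level-0) only after all of its threads finish.
public class HierarchicalSketch {
    // Run one iteration; returns {tasks completed, workers completed}.
    static int[] runIteration(int workers, int threadsPerWorker) {
        AtomicInteger tasksDone = new AtomicInteger();
        AtomicInteger workersDone = new AtomicInteger();
        Thread[] all = new Thread[workers * threadsPerWorker];
        for (int w = 0; w < workers; w++) {
            // local barrier: its action fires once when all sibling threads arrive
            CyclicBarrier local = new CyclicBarrier(threadsPerWorker,
                    () -> workersDone.incrementAndGet());   // worker reports upward
            for (int t = 0; t < threadsPerWorker; t++) {
                int idx = w * threadsPerWorker + t;
                all[idx] = new Thread(() -> {
                    tasksDone.incrementAndGet();            // only leaves do real work
                    try {
                        local.await();                      // wait at the local barrier
                    } catch (InterruptedException | BrokenBarrierException e) {
                        throw new RuntimeException(e);
                    }
                });
                all[idx].start();
            }
        }
        try {
            for (Thread th : all) th.join();                // global (iteration) barrier
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return new int[]{tasksDone.get(), workersDone.get()};
    }

    public static void main(String[] args) {
        int[] r = runIteration(2, 3);                       // 2 workers x 3 threads
        System.out.println(r[0] + " tasks, " + r[1] + " workers");
    }
}
```

The design keeps execution deterministic at the iteration boundary while letting threads within a worker run in parallel between barriers.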
Slide 23: Parallelism Improvement
The original BSP-like model is hard to parallelize.
[Figure: single flow of receiving → parsing (in-queues) → computation → sending (out-queues) across M1-M3]
Slide 24: Parallelism Improvement
The original BSP-like model is hard to parallelize:
- high contention on shared in-queues; poor locality in message parsing
[Figure: multi-threaded flow with private out-queues and contended in-queues]
Slide 25: Parallelism Improvement
Distributed immutable view opens an opportunity: threads update masters and replicas directly, with no parsing phase.
[Figure: receiving → computation → sending; threads, masters, replicas across M1-M3]
Slide 26: Parallelism Improvement
Distributed immutable view opens an opportunity:
- lock-free updates, but still poor locality
[Figure: private out-queues; threads, masters, replicas across M1-M3]
Slide 27: Parallelism Improvement
Distributed immutable view opens an opportunity:
- lock-free updates; no interference among threads
[Figure: private out-queues; threads, masters, replicas across M1-M3]
Slide 28: Parallelism Improvement
Distributed immutable view opens an opportunity:
- lock-free updates; sorted messages → good locality
[Figure: private out-queues; threads, masters, replicas across M1-M3]
Slide 29: Outline
Distributed Immutable View:
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow
Multicore-based Cluster Support:
- Hierarchical model
- Parallelism improvement
Implementation & Experiment
Slide 30: Implementation
Cyclops(MT):
- Based on Hama (Java & Hadoop), ~2,800 SLOC
- Provides a mostly compatible user interface
Graph ingress and partitioning:
- Compatible I/O interface
- An additional phase to build replicas
Fault tolerance:
- Incremental checkpointing
- Replication-based FT [DSN'14]
Slide 31: Experiment Settings
Platform: 6 x 12-core AMD Opteron (64GB RAM, 1GigE NIC)
Graph algorithms: PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single-Source Shortest Path (SSSP)
Workload: 7 real-world datasets from SNAP[1], 1 synthetic dataset from GraphLab[2]

Dataset    |V|     |E|
Amazon     0.4M    3.4M
GWeb       0.9M    5.1M
LJournal   4.8M    69M
Wiki       5.7M    130M
SYN-GL     0.1M    2.7M
DBLP       0.3M    1.0M
RoadCA     1.9M    5.5M

[1] http://snap.stanford.edu/data/
[2] http://graphlab.org
Slide 32: Overall Performance Improvement
[Figure: speedup over Hama for PageRank, ALS, CD, SSSP on 48 workers / 6 workers(8); up to 8.69X, push-mode 2.06X]
Slide 33: Performance Scalability
[Figure: scalability with #workers and #threads on Amazon, GWeb, LJournal, Wiki (50.2), SYN-GL, DBLP, RoadCA]
Slide 34: Performance Breakdown
[Figure: breakdown of Hama, Cyclops, and CyclopsMT for PageRank, ALS, CD, SSSP]
Slide 35: Comparison with PowerGraph
Platform: Cyclops-like engine on GraphLab[1] (C++ & Boost RPC lib)
Preliminary results[2]:

Dataset    COMP%
Amazon     11%
GWeb       15%
LJournal   25%
Wiki       39%

[1] http://graphlab.org
[2] synthetic 10-million-vertex regular (even edge) and power-law (α=2.0) graphs
Slide 36: Conclusion
Cyclops: a new synchronous vertex-oriented graph processing system
- Preserves the synchronous and deterministic computation nature (easy to program/debug)
- Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
- Further supports multicore-based clusters with a hierarchical processing model and high parallelism
Source code: http://ipads.se.sjtu.edu.cn/projects/cyclops
Slide 37: Questions
Thanks!
Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems
Slide 38: What's Next?
PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
- Hybrid engine and partitioning algorithms
- Outperforms PowerGraph by up to 3.26X for natural graphs
Power-law: "most vertices have relatively few neighbors while a few have many neighbors"
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
[Figure: preliminary results; PL vs. PG vs. Cyclops across low/high skew]
Slide 39: Generality
Algorithms that aggregate/activate all neighbors, e.g., Community Detection (CD):
- Transfer to an undirected graph, duplicating edges
[Figure: undirected graph of vertices 1-5 partitioned across M1-M3]
Slide 40: Generality
Algorithms that aggregate/activate all neighbors, e.g., Community Detection (CD):
- Transfer to an undirected graph, duplicating edges
- Still aggregate in one direction (e.g., in-edges) and activate in the other direction (e.g., out-edges)
- Preserves all benefits of Cyclops: x1 msg per replica, contention immunity, good locality
[Figure: undirected graph of vertices 1-5 partitioned across M1-M3]
Slide 41: Generality
Difference between Cyclops and GraphLab:
- How to construct the local sub-graph
- How to aggregate/activate neighbors
[Figure: local sub-graph construction across M1-M3]
Slide 42: Improvement of CyclopsMT
Configuration M x W x T / R: #[M]achines, #[W]orkers, #[T]hreads, #[R]eceivers
[Figure: Cyclops vs. CyclopsMT]
Slide 43: Communication Efficiency
message: (id, data)
- Hama: Hadoop RPC lib (Java); send + buffer + parse (contention)
- PowerGraph: Boost RPC lib (C++); send + update (contention)
- Cyclops: Hadoop RPC lib (Java); send + update
[Figure: per-worker message counts (W0-W5) at 50M/25M/5M scale; 25.6X, 16.2X, 12.6X fewer messages; 55.6%, 31.5%, 25.0%]
Slide 44: Using Heuristic Edge-cut (i.e., Metis)
[Figure: speedup for PageRank, ALS, CD, SSSP on 48 workers / 6 workers(8); up to 23.04X, 5.95X]
Slide 45: Memory Consumption
Memory behavior[1] per worker (PageRank with Wiki dataset):

Configuration    Max Cap (GB)  Max Usage (GB)  Young GC[2] (#)  Full GC[2] (#)
Hama/48          1.7           1.5             132              69
Cyclops/48       4.0           3.0             45               15
CyclopsMT/6x8    12.6/8        11.0/8          268/8            32/8

[1] measured with jStat
[2] GC: Concurrent Mark-Sweep
Slide 46: Ingress Time
H = Hama, C = Cyclops; LD (load), REP (build replicas), INIT (init), TOT (total):

Dataset    LD(H)  LD(C)  REP(H)  REP(C)  INIT(H)  INIT(C)  TOT(H)  TOT(C)
Amazon     6.2    5.9    0.0     2.5     1.7      1.5      7.9     9.9
GWeb       7.1    6.8    0.0     2.8     2.6      1.9      9.7     11.4
LJournal   27.1   31.0   0.0     44.7    17.9     9.2      45.0    84.9
Wiki       46.7   46.7   0.0     62.2    33.4     20.4     80.0    129.3
SYN-GL     4.2    4.0    0.0     2.6     2.4      1.8      6.6     8.4
DBLP       4.1    4.1    0.0     1.5     1.3      0.9      5.4     6.5
RoadCA     6.4    6.2    0.0     3.9     0.9      0.6      7.3     10.7
Slide 47: Selective Activation
Sync. & distributed activation:
- Merge update & activate messages
- Update the value of replicas
- Invite replicas to activate neighbors
Message format: v|m|s (e.g., 8|4|0); with selective activation, v|m|s|l, where l is an optional Activation_List (e.g., for ALS)
[Figure: machines M1-M3; master with rlist:W1, l-act:1, value:8, msg:4; replica with l-act:3, value:6, msg:3]
Slide 48: Parallelism Improvement
Distributed immutable view opens an opportunity:
- lock-free updates; sorted messages → good locality
- comp. threads vs. comm. threads: separate configuration
[Figure: private out-queues; threads, masters, replicas across M1-M3]
Slide 49: Summary
Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph):
- w/o dynamic comp.; high contention; hard to program; duplicated edges; heavy comm. cost
Cyclops(MT) with Distributed Immutable View:
- w/ dynamic comp.; no contention; easy to program; no duplicated edges; low comm. cost (x1 msg per replica)
Slide 50: What's Next?
BiGraph: bipartite-oriented distributed graph partitioning for big learning
- A set of online distributed graph partitioning algorithms designed for bipartite graphs and applications
- Partitions graphs in a differentiated way and loads data according to data affinity
- Outperforms PowerGraph with the default partitioning by up to 17.75X, and saves up to 96% network traffic
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html
Multicore Support
Two challenges:
1. Two-level hierarchical organization: preserve the synchronous and deterministic computation nature (easy to program/debug)
2. The original BSP-like model is hard to parallelize: high contention to buffer and parse messages; poor locality in message parsing; asymmetric degree of parallelism for CPU and NIC