Efficient Graph Processing - PowerPoint Presentation


Presentation Transcript

Slide1

Efficient Graph Processing with Distributed Immutable View

Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*
+Institute of Parallel and Distributed Systems, Department of Computer Science
*Shanghai Jiao Tong University

HPDC 2014

Slide2

Big Data Everywhere
- 100 Hrs of Video every minute
- 1.11 Billion Users
- 6 Billion Photos
- 400 Million Tweets/day

How do we understand and use Big Data?

Slide3

Big Data Everywhere
- 100 Hrs of Video every minute, 1.11 Billion Users, 6 Billion Photos, 400 Million Tweets/day

Big Data → Machine Learning and Data Mining (e.g. NLP) → Big Learning

Slide4

It’s about the graphs ...

Slide5

Example: PageRank

A centrality analysis algorithm to measure the relative rank of each element of a linked set.

Characteristics:
- Linked set → data dependence
- Rank of who links it → local accesses
- Convergence → iterative computation

Update rule (as in the compute() code on the next slide): R(v) = 0.15 + 0.85 * Σ R(n)/nedges(n), summed over the in-neighbors n of v.

[Figure: example graph with vertices 1-5 and their ranks]

Slide6

Existing Graph-parallel Systems

“Think as a vertex” philosophy
- aggregate values of neighbors
- update its own value
- activate neighbors

compute(v):  // PageRank
  double sum = 0
  double value, last = v.get()
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges
  value = 0.15 + 0.85 * sum
  v.set(value)
  activate(v.out_nbrs)

[Figure: example graph with vertices 1-5]
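As a concrete illustration, the per-vertex program above can be run as a tiny synchronous (BSP-style) loop on a single machine. This is only a sketch: the graph layout and helper names (`in_nbrs`, `out_degree`) are assumptions for the example, not the Pregel/Cyclops API.

```python
def pagerank_step(in_nbrs, out_degree, rank):
    """One superstep: every vertex aggregates its in-neighbors' ranks."""
    new_rank = {}
    for v in rank:
        s = sum(rank[n] / out_degree[n] for n in in_nbrs[v])
        new_rank[v] = 0.15 + 0.85 * s          # damping as in the slide's code
    return new_rank

def pagerank(edges, iters=30):
    """Run synchronous PageRank over a list of directed (src, dst) edges."""
    vertices = {u for e in edges for u in e}
    in_nbrs = {v: [u for (u, w) in edges if w == v] for v in vertices}
    out_degree = {v: sum(1 for (u, w) in edges if u == v) for v in vertices}
    rank = {v: 1.0 for v in vertices}
    for _ in range(iters):                     # iterative computation until convergence
        rank = pagerank_step(in_nbrs, out_degree, rank)
    return rank
```

For example, on the two-edge graph {1→3, 2→3}, vertex 3 ends up with a higher rank than its link sources, matching the "rank of who links it" intuition.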

Slide7

Existing Graph-parallel Systems

“Think as a vertex” philosophy
- aggregate values of neighbors
- update its own value
- activate neighbors

Execution Engine
- sync: BSP-like model
- async: distributed sched_queues

Communication
- message passing: push value
- distributed shared memory: sync & pull value

[Figure: comp./comm. phases showing push vs. pull of values across machines, with a sync barrier]

Slide8

Issues of Existing Systems

Pregel [SIGMOD’09]
- Sync engine
- Edge-cut + Message Passing
- Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]

PowerGraph [OSDI’12]

[Figure: edge-cut with master/replica vertices exchanging messages (x1 per edge)]

Slide9

Issues of Existing Systems

Pregel [SIGMOD’09]
- Sync engine, Edge-cut + Message Passing
- Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]
- Async engine, Edge-cut + DSM (replicas)
- Issues: high contention, hard to program, duplicated edges, heavy comm. cost

PowerGraph [OSDI’12]

[Figure: edge-cut with DSM replicas and duplicated edges (x2 per spanning edge)]

Slide10

Issues of Existing Systems

Pregel [SIGMOD’09]
- Sync engine, Edge-cut + Message Passing
- Issues: w/o dynamic comp., high contention

GraphLab [VLDB’12]
- Async engine, Edge-cut + DSM (replicas)
- Issues: high contention, hard to program, duplicated edges, heavy comm. cost

PowerGraph [OSDI’12]
- (A)Sync engine, Vertex-cut + GAS (replicas)
- Issues: high contention, heavy comm. cost (up to x5 per replica)

[Figure: vertex-cut with masters and replicas spread across machines]

Slide11

Contributions

Distributed Immutable View
- Easy to program/debug
- Supports dynamic computation
- Minimized communication cost (x1 msg/replica)
- Contention (comp. & comm.) immunity

Multicore-based Cluster Support
- Hierarchical sync. & deterministic execution
- Improves parallelism and locality

Slide12

Outline

Distributed Immutable View
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow

Multicore-based Cluster Support
- Hierarchical model
- Parallelism improvement

Evaluation

Slide13

General Idea

Observation: for most graph algorithms, a vertex only aggregates neighbors’ data in one direction and activates in the other direction, e.g. PageRank, SSSP, Community Detection, ...

Local aggregation/update & distributed activation
- Partitioning: avoid duplicated edges
- Computation: one-way local semantics
- Communication: merge update & activate messages

Slide14

Graph Organization

Partition the graph and build local sub-graphs
- Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
- Only create edges in one direction (e.g., in-edges)
- Avoid duplicated edges
- Create read-only replicas for edges spanning machines

[Figure: graph partitioned across machines M1-M3 with masters and read-only replicas]
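The partitioning step above can be sketched as follows; the `owner` hash rule and the per-machine dictionary layout are illustrative assumptions, not the actual Cyclops data structures.

```python
def partition(edges, num_machines):
    """Hash-based edge-cut keeping only in-edges, with read-only replicas
    for edges that span machines.

    Returns one sub-graph per machine: {'in_nbrs': {dst: [srcs]}, 'replicas': set}.
    """
    owner = lambda v: hash(v) % num_machines           # master placement rule
    parts = [{'in_nbrs': {}, 'replicas': set()} for _ in range(num_machines)]
    for (src, dst) in edges:
        m = owner(dst)                                 # an in-edge lives with its target
        parts[m]['in_nbrs'].setdefault(dst, []).append(src)
        if owner(src) != m:                            # source vertex is remote:
            parts[m]['replicas'].add(src)              # keep a read-only replica of it
    return parts
```

Each edge is stored exactly once (with its target's machine), so no edges are duplicated, and every remote source appears only as a read-only replica.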

Slide15

Vertex Computation

Local aggregation/update
- Supports dynamic computation with one-way local semantics
- Immutable view: read-only access to neighbors
- Eliminates contention on vertices

[Figure: computation on machines M1-M3, where neighbors (including replicas) are read-only]

Slide16

Communication

Sync. & Distributed Activation
- Merge update & activate messages
- Update the value of replicas
- Invite replicas to activate neighbors

Message format: v|m|s (value, message, state), e.g. "8 4 0"

[Figure: a master (rlist:W1, l-act:1, value:8, msg:4) sending one merged message per replica across M1-M3]
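The merged message above can be sketched as a small fixed-size record. The v|m|s field names come from the slide; the binary encoding here is an illustrative assumption, not the Cyclops wire format.

```python
import struct

# One merged message per replica: the master's new value (v), the aggregated
# message payload (m), and an activation state flag (s), e.g. "8 4 0".
_FMT = '!ddB'   # network byte order: two doubles + one byte

def pack_msg(value, msg, active):
    """Encode a merged update+activate message."""
    return struct.pack(_FMT, value, msg, 1 if active else 0)

def unpack_msg(buf):
    """Decode a merged message back into (value, msg, active)."""
    value, msg, s = struct.unpack(_FMT, buf)
    return value, msg, bool(s)
```

Merging the update and the activation into one record is what keeps the communication cost at one message per replica.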

Slide17

Communication

Distributed Activation
- Unidirectional message passing
- A replica will never be activated
- Always master → replicas
- Contention immunity

[Figure: activation always flows from masters to replicas across M1-M3]

Slide18

Change of Execution Flow

Original Execution Flow (e.g. Pregel)

receiving → parsing (in-queues) → computation → sending (out-queues)
- high overhead, high contention

[Figure: threads, vertices and messages flowing through shared in-queues and out-queues on M1-M3]

Slide19

Change of Execution Flow

Execution Flow on Distributed Immutable View

receiving → computation → sending (no parsing phase)
- lock-free reception
- low overhead, no contention

[Figure: threads update masters and replicas directly, writing to out-queues on M1-M3]

Slide20

Outline

Distributed Immutable View
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow

Multicore-based Cluster Support
- Hierarchical model
- Parallelism improvement

Evaluation

Slide21

Multicore Support

Two Challenges
- Two-level hierarchical organization: preserve the synchronous and deterministic computation nature (easy to program/debug)
- Original BSP-like model is hard to parallelize: high contention to buffer and parse messages, poor locality in message parsing

Slide22

Hierarchical Model

Design Principle
- Three levels: iteration → worker → thread
- Only the last-level participants perform actual tasks
- Parents (i.e. higher-level participants) just wait until all children finish their tasks

Level-0: iteration (global barrier)
Level-1: worker (local barrier)
Level-2: thread (task loop)
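The worker/thread levels of this hierarchy can be sketched within one process: leaf threads do the actual tasks and meet the worker at a local barrier (the global barrier across machines is only indicated by a comment). Names and structure are illustrative assumptions.

```python
import threading

def run_superstep(tasks, num_threads=4):
    """Run one iteration: leaf threads execute tasks, worker waits at a local barrier."""
    results = []
    lock = threading.Lock()
    local_barrier = threading.Barrier(num_threads + 1)   # threads + the worker itself

    def worker_thread(chunk):
        out = [t() for t in chunk]        # level-2: only threads do actual work
        with lock:
            results.extend(out)
        local_barrier.wait()              # child reports completion to the worker

    for i in range(num_threads):
        threading.Thread(target=worker_thread,
                         args=(tasks[i::num_threads],)).start()
    local_barrier.wait()                  # level-1: worker waits for all children
    # ... level-0: here each worker would enter the global barrier across machines
    return results
```

The parent does no task work of its own; it only synchronizes, which is what keeps the execution deterministic per superstep.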

Slide23

Parallelism Improvement

Original BSP-like model is hard to parallelize

[Figure: receiving → parsing (shared in-queues) → computation → sending (out-queues) with threads, vertices and messages on M1-M3]

Slide24

Parallelism Improvement

Original BSP-like model is hard to parallelize
- high contention on shared queues
- poor locality in message parsing

[Figure: same flow with private out-queues, highlighting the contention and locality problems]

Slide25

Parallelism Improvement

Distributed immutable view opens an opportunity

receiving → computation → sending (parsing phase eliminated)

[Figure: threads update masters and replicas directly via out-queues on M1-M3]

Slide26

Parallelism Improvement

Distributed immutable view opens an opportunity
- lock-free message reception
- but still poor locality

[Figure: private out-queues per thread; masters and replicas on M1-M3]

Slide27

Parallelism Improvement

Distributed immutable view opens an opportunity
- lock-free reception
- no interference between computation and communication

[Figure: private out-queues per thread on M1-M3]

Slide28

Parallelism Improvement

Distributed immutable view opens an opportunity
- lock-free and sorted message reception → good locality

[Figure: sorted private out-queues; masters and replicas on M1-M3]
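The lock-free pattern above can be sketched as two phases: each computation thread appends only to its own private out-queue (so no shared-queue locks), and the send phase merges and sorts updates by destination vertex for locality. Function names and the (vertex, value) update shape are illustrative assumptions.

```python
def compute_phase(vertices, num_threads, make_update):
    """Each thread fills its own private out-queue: no locks while computing."""
    priv_queues = [[] for _ in range(num_threads)]
    for tid in range(num_threads):
        for v in vertices[tid::num_threads]:      # static round-robin split of vertices
            priv_queues[tid].append(make_update(v))
    return priv_queues

def send_phase(priv_queues):
    """Merge private queues and sort by destination vertex id for good locality."""
    merged = [u for q in priv_queues for u in q]
    merged.sort(key=lambda upd: upd[0])           # upd = (dest_vertex, value)
    return merged
```

Because queues are thread-private during computation, contention disappears; sorting on send means the receiver applies replica updates in vertex order rather than in arrival order.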

Slide29

Outline

Distributed Immutable View
- Graph organization
- Vertex computation
- Message passing
- Change of execution flow

Multicore-based Cluster Support
- Hierarchical model
- Parallelism improvement

Implementation & Experiment

Slide30

Implementation

Cyclops(MT)
- Based on Hama (Java & Hadoop), ~2,800 SLOC
- Provides a mostly compatible user interface

Graph ingress and partitioning
- Compatible I/O interface
- Adds an additional phase to build replicas

Fault tolerance
- Incremental checkpoint
- Replication-based FT [DSN’14]

Slide31

Experiment Settings

Platform
- 6 x 12-core AMD Opteron (64G RAM, 1GigE NIC)

Graph Algorithms
- PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single Source Shortest Path (SSSP)

Workload
- 7 real-world datasets from SNAP1
- 1 synthetic dataset from GraphLab2

Dataset  | |V|  | |E|
Amazon   | 0.4M | 3.4M
GWeb     | 0.9M | 5.1M
LJournal | 4.8M | 69M
Wiki     | 5.7M | 130M
SYN-GL   | 0.1M | 2.7M
DBLP     | 0.3M | 1.0M
RoadCA   | 1.9M | 5.5M

1 http://snap.stanford.edu/data/
2 http://graphlab.org

Slide32

Overall Performance Improvement

PageRank, ALS, CD, SSSP (push-mode)

[Figure: overall speedup over Hama; highlighted results include 8.69X and 2.06X, under 48 workers and 6 workers(8) configurations]

Slide33

Performance Scalability

[Figure: scalability with increasing workers/threads on Amazon, GWeb, LJournal, Wiki (up to 50.2), SYN-GL, DBLP and RoadCA]

Slide34

Performance Breakdown

[Figure: performance breakdown of Hama, Cyclops and CyclopsMT for PageRank, ALS, CD and SSSP]

Slide35

Comparison with PowerGraph

Preliminary Results1: a Cyclops-like engine on the GraphLab2 platform (C++ & Boost RPC lib.)

Dataset  | COMP%
Amazon   | 11%
GWeb     | 15%
LJournal | 25%
Wiki     | 39%

1 synthetic 10-million vertex regular (even edge) and power-law (α=2.0) graphs
2 http://graphlab.org

Slide36

Conclusion

Cyclops: a new synchronous vertex-oriented graph processing system
- Preserves the synchronous and deterministic computation nature (easy to program/debug)
- Provides efficient vertex computation with significantly fewer messages and contention immunity via the distributed immutable view
- Further supports multicore-based clusters with a hierarchical processing model and high parallelism

Source Code: http://ipads.se.sjtu.edu.cn/projects/cyclops

Slide37

Questions

Thanks

Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html
IPADS: Institute of Parallel and Distributed Systems

Slide38

What’s Next?

PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
- Hybrid engine and partitioning algorithms
- Outperforms PowerGraph by up to 3.26X for natural graphs
- http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

Power-law: “most vertices have relatively few neighbors while a few have many neighbors”

[Figure: preliminary results comparing PL, PG and Cyclops on low- and high-skew graphs]

Slide39

Generality

Algorithms that aggregate/activate all neighbors, e.g. Community Detection (CD)
- Transfer to an undirected graph and duplicate edges

[Figure: directed graph on M1-M3 transformed into an undirected one]

Slide40

Generality

Algorithms that aggregate/activate all neighbors, e.g. Community Detection (CD)
- Transfer to an undirected graph and duplicate edges
- Still aggregate in one direction (e.g. in-edges) and activate in the other direction (e.g. out-edges)
- Preserves all benefits of Cyclops: x1 msg/replica, contention immunity, good locality

[Figure: undirected sub-graphs on M1-M3]
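The transformation above can be sketched in a few lines: every directed edge is duplicated in the reverse direction, after which the one-way semantics (aggregate over in-edges, activate over out-edges) still apply unchanged. This is an illustrative sketch of the idea, not the Cyclops ingress code.

```python
def to_undirected(edges):
    """Make a directed edge list undirected by duplicating each edge in
    the reverse direction (deduplicated, returned in sorted order)."""
    return sorted({(u, v)
                   for (a, b) in edges
                   for (u, v) in [(a, b), (b, a)]})
```

After this transform, a vertex's in-neighbor set equals its full neighbor set, so an all-neighbor algorithm like CD can still aggregate purely over in-edges.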

Slide41

Generality

Difference between Cyclops and GraphLab
- How to construct the local sub-graph
- How to aggregate/activate neighbors

[Figure: Cyclops vs. GraphLab local sub-graphs on M1-M3]

Slide42

Improvement of CyclopsMT

Configuration MxWxT/R: #[M]achines x #[W]orkers x #[T]hreads / #[R]eceivers

[Figure: improvement of CyclopsMT over Cyclops under various MxWxT/R configurations]

Slide43

Communication Efficiency

message: (id, data)

- Hama: send + buffer + parse (contention), Hadoop RPC lib (Java)
- PowerGraph: send + update (contention), Boost RPC lib (C++)
- Cyclops: send + update, Hadoop RPC lib (Java)

[Figure: message counts (50M/25M/5M) per worker W0-W5; highlighted results include 25.6X, 16.2X and 12.6X, and 55.6%, 31.5% and 25.0%]

Slide44

Using Heuristic Edge-cut (i.e. Metis)

[Figure: speedup for PageRank, ALS, CD and SSSP; highlighted results include 23.04X and 5.95X, under 48 workers and 6 workers(8) configurations]

Slide45

Memory Consumption

Memory Behavior1 per Worker (PageRank with Wiki dataset)

Configuration  | Max Cap (GB) | Max Usage (GB) | Young GC2 (#) | Full GC2 (#)
Hama/48        | 1.7          | 1.5            | 132           | 69
Cyclops/48     | 4.0          | 3.0            | 45            | 15
CyclopsMT/6x8  | 12.6/8       | 11.0/8         | 268/8         | 32/8

1 measured with jStat
2 GC: Concurrent Mark-Sweep

Slide46

Ingress Time

H = Hama, C = Cyclops; times for loading (LD), replica building (REP), initialization (INIT) and total (TOT)

Dataset  | LD (H/C)  | REP (H/C) | INIT (H/C) | TOT (H/C)
Amazon   | 6.2/5.9   | 0.0/2.5   | 1.7/1.5    | 7.9/9.9
GWeb     | 7.1/6.8   | 0.0/2.8   | 2.6/1.9    | 9.7/11.4
LJournal | 27.1/31.0 | 0.0/44.7  | 17.9/9.2   | 45.0/84.9
Wiki     | 46.7/46.7 | 0.0/62.2  | 33.4/20.4  | 80.0/129.3
SYN-GL   | 4.2/4.0   | 0.0/2.6   | 2.4/1.8    | 6.6/8.4
DBLP     | 4.1/4.1   | 0.0/1.5   | 1.3/0.9    | 5.4/6.5
RoadCA   | 6.4/6.2   | 0.0/3.9   | 0.9/0.6    | 7.3/10.7

Slide47

Selective Activation

Sync. & Distributed Activation
- Merge update & activate messages
- Update the value of replicas
- Invite replicas to activate neighbors

msg: v|m|s, e.g. "8 4 0"
*Selective Activation (e.g. ALS): msg: v|m|s|l, option: Activation_List

[Figure: the merged-message diagram from earlier, extended with an activation list field]

Slide48

Parallelism Improvement

Distributed immutable view opens an opportunity
- lock-free and sorted reception → good locality
- comp. threads vs. comm. threads: separate configuration

[Figure: private out-queues with separately configured computation and communication threads on M1-M3]

Slide49

Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
- w/o dynamic comp., high contention, hard to program, duplicated edges, heavy comm. cost

Cyclops(MT) with Distributed Immutable View
- w/ dynamic comp., no contention, easy to program, no duplicated edges, low comm. cost (x1 msg/replica)

[Figure: side-by-side comparison with masters and replicas]

Slide50

What’s Next?

BiGraph: bipartite-oriented distributed graph partitioning for big learning
- A set of online distributed graph partition algorithms designed for bipartite graphs and applications
- Partitions graphs in a differentiated way and loads data according to data affinity
- Outperforms PowerGraph with the default partition by up to 17.75X, and saves up to 96% network traffic

http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

Slide51

Slide52

Multicore Support

Two Challenges
- Two-level hierarchical organization: preserve the synchronous and deterministic computation nature (easy to program/debug)
- Original BSP-like model is hard to parallelize: high contention to buffer and parse messages, poor locality in message parsing, asymmetric degree of parallelism for CPU and NIC