Uploaded On 2017-05-16


Slide1

STINGER: High Performance Data Structure for Streaming Graphs

David Ediger, Rob McColl, Jason Riedy, David A. Bader

Slide2

Main Contributions

High performance, scalable, and portable graph data structure for streaming graphs
Several approaches to manufacture parallelism in a stream of edges
2-3 orders of magnitude better performance for edge insertions and deletions
  3 million updates per second
  Scale-free graph with 537 million edges
Target platforms: Cray XMT, Intel/AMD x86, IBM Power7
In use by: DARPA UHPC CP2, CMU, Sandia MTGL

Slide3

Outline

Motivation
Data Structures
Target Architectures
Finding Parallelism in Updates
Performance Results
Conclusions

Slide4

Motivation: Social Media

Immense volume of data
  Facebook: ~1 billion users
    Average 130 friends
    30 billion pieces of content shared / month
  Twitter: 500 million active users
    340 million tweets / day
Goal: Use the interaction network to understand and characterize information flow

Sources: Facebook, Twitter
Figure: Twitter social network using Large Graph Layout

Slide5

Current Example Data Rates

Financial:
  NYSE processes 1.5 TB daily, maintains 8 PB
Social:
  Facebook: 37,000 Likes per second
  Twitter: 13,684 Tweets per second (Barcelona v. Chelsea)
Google:
  “Several dozen” 1 PB data sets
  Knowledge Graph: 500M entities, 3.5B relationships
Business:
  eBay: 17 trillion records, 1.5B new records per day
Shared features: All data is rich, semi-structured, and may contain missing or inconsistent information.

Slide6

Characteristics of Social Networks

Very, very large
Power-law distribution
  Number of neighbors
Large clustering coefficient
Small diameter
  “Small world” property
Implication: 3-hop neighborhood is most of the graph
  Reflected in algorithm design, machine architecture

Image Source: Nexus (Facebook application)

Slide7

STINGER

Semi-dense edge list blocks with free space
Compactly stores timestamps, types, weights
Maps from application IDs to storage IDs
Deletion by negating IDs, separate compaction

D. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, S. Poulos, "STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation"

Slide8

Cray XMT Operation

Tolerates latency by extreme multithreading
  Each processor supports 128 hardware threads
  Context switch in a single tick
  No cache or local memory
  Context switch on memory request
  Multiple outstanding loads
Remote memory requests do not stall processors
  Other streams work while the request gets fulfilled
Light-weight, word-level synchronization
  Minimizes access conflicts
Hashed global shared memory
  64-byte granularity, minimizes hotspots
Cray XMT2 supports up to 64 TiB main memory

Slide9

Intel E7-8870-based Platforms

Architecture supports up to 2 TiB, commercially available
Tolerates some latency by hyperthreading
Hardware: 2 threads / core, 10 cores / socket, 4 sockets
Fast cores (2.4 GHz), fast memory (1,066 MHz)
Out-of-order execution, branch prediction
Limited outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket)
Coherence-based atomics
Good system support
  4 MiB pages reduce TLB costs
  Fast, user-level locking
  Tuned libraries
Single motherboard solution

Credit: Intel Press Kit

Slide10

Insertion Performance

Inserting edges one by one yields
  Greatest time resolution
  Lowest performance
Test Input Graph: RMAT with 16 M vertices & 270 M edges
  128-processor Cray XMT: 950 updates / second
  40-core Intel Xeon: 12,000 updates / second
Conclusion: Not enough work / parallelism in a single edge insertion to fully utilize the memory

Slide11

STINGER Batch Update Algorithm

Sort edge insertions/deletions by source vertex, destination vertex
Partition edge updates by source vertex
For each partition (in parallel)
  Process deletions serially
  Process insertions serially
  Update block metadata
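
The batch steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the STINGER implementation: the graph is a plain dict of adjacency dicts instead of edge blocks, the names `Update`, `apply_partition`, and `apply_batch` are hypothetical, a thread pool stands in for hardware parallelism across partitions, the returned degree stands in for the block metadata update, and repeated insertions of one edge are assumed to accumulate its weight.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import NamedTuple


class Update(NamedTuple):
    src: int
    dst: int
    weight: int           # ignored for deletions
    delete: bool = False


def apply_partition(adj, src, updates):
    """Apply one source vertex's updates serially: deletions first, then insertions."""
    edges = adj[src]
    for upd in updates:
        if upd.delete:
            edges.pop(upd.dst, None)
    for upd in updates:
        if not upd.delete:
            edges[upd.dst] = edges.get(upd.dst, 0) + upd.weight  # assumption: weights accumulate
    return src, len(edges)  # the degree stands in for the metadata update


def apply_batch(adj, batch, workers=4):
    # Step 1: sort by (source, destination)
    ordered = sorted(batch, key=lambda upd: (upd.src, upd.dst))
    # Step 2: partition by source vertex
    partitions = [(src, list(grp)) for src, grp in groupby(ordered, key=lambda upd: upd.src)]
    for src, _ in partitions:
        adj.setdefault(src, {})  # create rows up front so workers never resize adj
    # Step 3: partitions touch disjoint rows, so they can run in parallel
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda part: apply_partition(adj, *part), partitions))


graph = {6: {3: 15}}
batch = [Update(6, 13, 4), Update(6, 3, 2), Update(6, 9, 0, delete=True), Update(2, 6, 1)]
degrees = apply_batch(graph, batch)
```

Because every partition owns exactly one source vertex, no synchronization is needed inside a partition. The cost is that a vertex with many updates in the batch is processed by a single worker.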

Update block-level metadata safely
Multiple parallel algorithms

Figure: Cray XMT updates / sec (|V| = 16 M, |E| = 270 M, batch size = 100 K)

Slide12

Observations

After much tuning and measurement…
Choice of sorting algorithm had little effect on performance
  Actual data structure manipulation dominated
Increasing batch size yielded additional performance
  Increasing batch size = increasing parallelism
Hypothesis: Cray XMT can support additional parallelism
Corollary: Some vertices have many incident edges to be inserted, which creates a long serial bottleneck
Solution: Parallelize all edge updates

Slide13

Parallelizing Edge Insertions/Deletions

Solution: Parallelize all edge updates
Benefits
  No sorting
  No partitioning
  All actions in batch can be done in parallel
Requirements
  Safe parallel updates of edges and edge block metadata
  Accomplished using full-empty bit synchronization & atomic fetch-and-add
  On x86 systems, emulate with atomic compare-and-swap
Deletions handled similarly, but separately from insertions

Slide14

Insertion Protocol

Flowchart (box labels): Insert edge; Read all edge blocks; If existing edge found; Update edge; Read all edge blocks again; If existing edge found; If empty space found; No space in existing edge blocks; Acquire new empty edge block; Done!

Slide15

Edge Block Example

Figure: an edge block belonging to vertex 6. Each slot stores Weight, Dest., Created, and Modified fields. The legend distinguishes Meta Data, Edge, and Empty Edge cells.
Meta Data: Num. Edges 4, Small Time 0, Large Time 2, High 6, Vertex 6, Next NULL

Slide16

Edge Insertion Example

Figure: animated insertion of updates into vertex 6’s edge block (fields: Weight, Dest., Created, Modified; initial metadata: Num. Edges 4, Small Time 0, Large Time 2, High 6, Vertex 6, Next NULL).
Updates shown: 6,3 2; 6,13 4; 6,13 8
Values appearing during the animation: weights 17 and 12; a new entry 13, 4, 3; Large Time 3; Num. Edges 5
Vertex 6’s Edges

Slide17

Edge Insertion Protocol

For an insertion from u to v
3rd Pass: If u’s list has no empty space
  Lock the ‘next’ pointer of the last block
  If the pointer is NULL
    Atomically acquire an empty edge block
    Insert an edge into it
    Link it into the list and unlock. Return 1
  Otherwise, search the new block as in 2nd step
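
The third pass can be sketched as below. This is an illustration only: `EdgeBlock`, `BLOCK_SIZE`, `third_pass`, and the `second_pass` callback are hypothetical names, a `threading.Lock` stands in for locking the 'next' pointer, and releasing the lock before searching the newly linked block is an assumption the slide does not spell out.

```python
import threading

BLOCK_SIZE = 4  # hypothetical number of slots per block


class EdgeBlock:
    def __init__(self):
        # Each slot holds (weight, dest, created, modified), or None when empty
        self.slots = [None] * BLOCK_SIZE
        self.num_edges = 0
        self.next = None
        self.next_lock = threading.Lock()  # stands in for locking the 'next' pointer


def third_pass(tail, dest, weight, now, second_pass):
    """Called when the search reached `tail`, the last block, without finding space."""
    with tail.next_lock:
        if tail.next is None:
            block = EdgeBlock()                         # acquire an empty edge block
            block.slots[0] = (weight, dest, now, now)   # insert the edge into it
            block.num_edges = 1
            tail.next = block                           # link it in; the lock is released on exit
            return 1
        newer = tail.next
    # Another thread linked a block first: search it as in the 2nd pass
    return second_pass(newer, dest, weight, now)


tail = EdgeBlock()
tail.slots = [(1, d, 0, 0) for d in (9, 5, 3, 2)]  # a full last block
tail.num_edges = BLOCK_SIZE
result = third_pass(tail, dest=13, weight=4, now=3, second_pass=None)
```

Holding the lock across the NULL check and the link step is what prevents two threads from each allocating a block for the same list.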

Slide18

Edge Block Insertion Example

Figure: animated insertion of the update 6,13 4 when vertex 6’s edge block is full (Num. Edges 6, Small Time 0, Large Time 2, High 6, Vertex 6, Next NULL). A second, empty edge block is shown (Num. Edges 0, Small Time 0, Large Time 0, High 0, Vertex 0, Next NULL). The first block’s Next field is shown as NULL, then XXXX, then 0xF3. The second block then holds one edge (Weight 4, Dest. 13, Created 3, Modified 3) with metadata Num. Edges 1, Small Time 3, Large Time 3, High 1, Vertex 6.
Vertex 6’s Edges

Slide19

Edge Insertion Protocol

For an insertion from u to v
1st Pass: Search u’s edge blocks for an existing edge to v.
  If found, update the weight and modified timestamp.
  If the search completes, go back to u’s first edge block.
2nd Pass: Search u’s edge blocks for an existing edge to v or empty space.
  If an existing edge is found, update the weight and modified timestamp.
  If an empty space is found, lock its first timestamp.
    If it has been filled with an edge to v, update the edge and unlock.
    If it’s still empty, fill it, update the block metadata and the degrees, and unlock.
    If the space has been filled with another edge, unlock it and keep searching.
3rd Pass: If u’s list has no empty space, lock the next pointer of the last block.
  If the pointer is null, atomically acquire an empty edge block and link it in. Otherwise, search the new block for empty space.
  Insert the edge into the newly created block.

Slide20

Fine-grained Synchronization

Locking first timestamp enables
  Safety between threads, correctness guaranteed
  Threads in 1st pass read past lock without waiting
  If edge already exists, no locking required
  Metadata kept consistent
Locking next pointer ensures no extra edge block allocations
Simultaneously insert any edge
  Even the same edge
Enables massive parallelism
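
One way to picture locking the first timestamp is to treat that field as a lock word that is swapped to a sentinel with compare-and-swap. The sketch below makes several assumptions: Python has no user-level compare-and-swap, so `compare_and_swap` is emulated with a guard lock; the `LOCKED` sentinel, the `Slot` layout, and all function names are hypothetical; weights are assumed to accumulate. It shows eight threads inserting the same edge at once, with exactly one of them creating it.

```python
import threading

LOCKED = -1                      # hypothetical sentinel stored in the timestamp while locked
_cas_guard = threading.Lock()    # emulation only: makes compare_and_swap atomic in Python


class Slot:
    def __init__(self):
        self.dest = None         # None marks an empty slot
        self.weight = 0
        self.created = 0         # the "first timestamp", doubling as the lock word


def compare_and_swap(slot, expected, new):
    """Set slot.created to `new` only if it still equals `expected`; report success."""
    with _cas_guard:
        if slot.created == expected:
            slot.created = new
            return True
        return False


def lock_timestamp(slot):
    """Spin until the timestamp has been swapped to LOCKED; return the value it held."""
    while True:
        seen = slot.created
        if seen != LOCKED and compare_and_swap(slot, seen, LOCKED):
            return seen


def fill_if_empty(slot, dest, weight, now):
    """Return 1 if this call created the edge, 0 if it updated it, -1 if the slot holds another edge."""
    old = lock_timestamp(slot)
    if slot.dest is None:                # still empty: create the edge
        slot.dest, slot.weight = dest, weight
        slot.created = now               # writing a real timestamp releases the lock
        return 1
    same = slot.dest == dest
    if same:                             # another thread inserted this edge first
        slot.weight += weight
    slot.created = old                   # unlock by restoring the timestamp
    return 0 if same else -1


slot = Slot()
results = []
threads = [threading.Thread(target=lambda: results.append(fill_if_empty(slot, 13, 4, 3)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A reader that only compares `slot.dest` never touches the lock word, which corresponds to threads in the 1st pass reading past the lock without waiting.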

Slide21

Edge Deletion Protocol

For a deletion from u to v
Search u’s edge blocks for an existing edge to v
  If found
    Atomically set the edge to empty
    Atomically update the vertex degrees
    Return 1
  If the search completes, return 0
Thread-safe & correctness guaranteed
All deletions can be processed simultaneously
Massively parallel
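
The deletion steps can be sketched as follows. The `Graph` class, its dict-of-blocks layout, and `delete_edge` are hypothetical; a single lock emulates the atomic operations; and a deleted slot is marked with the bitwise complement `~v`, an assumed encoding of "deletion by negating IDs" chosen so that vertex 0 also maps to a negative value.

```python
import threading


class Graph:
    """Hypothetical stand-in for STINGER: per-vertex lists of blocks of destination IDs."""

    def __init__(self, blocks):
        self.blocks = blocks                     # {u: [[dest, ...], ...]}, negative = empty
        self.out_degree = {u: sum(d >= 0 for b in bs for d in b) for u, bs in blocks.items()}
        self.in_degree = {}
        for bs in blocks.values():
            for b in bs:
                for d in b:
                    if d >= 0:
                        self.in_degree[d] = self.in_degree.get(d, 0) + 1
        self.atomic = threading.Lock()           # emulates atomic hardware operations


def delete_edge(g, u, v):
    """Return 1 if edge u->v was removed by this call, 0 otherwise."""
    for block in g.blocks.get(u, []):
        for i, dest in enumerate(block):
            if dest != v:
                continue
            with g.atomic:
                if block[i] != v:                # another thread deleted it first
                    return 0
                block[i] = ~v                    # negated ID marks the slot empty
                g.out_degree[u] -= 1             # update both vertex degrees
                g.in_degree[v] -= 1
            return 1
    return 0                                     # search completed without a match


g = Graph({6: [[9, 5, 3, ~7], [13]]})
first = delete_edge(g, 6, 13)
second = delete_edge(g, 6, 13)
```

The slot is not reclaimed here. The emptied slot stays in place for a later insertion to reuse, and compaction is a separate step.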

Slide22

Update performance – Intel & Cray XMT/2

Figure: update rates for |V| = 16 M, |E| = 270 M, batch size = 100 K
10x faster on Intel
5x faster on Cray XMT

Slide23

Larger graphs – Cray XMT/2

|V| = 67 M
|E| = 537 M
Batch size = 1 M
Cray XMT: 3.1 million updates per second

Slide24

Increasing Batch Size

Batch size increases scalability, with diminishing returns
|V| = 67 M, |E| = 537 M

Slide25

Optimization Summary

Optimization      Cray XMT Updates / Sec   Speed-up   Intel E7-8870 Updates / Sec   Speed-up
Original                             950                                   12,000
Batch (100K)                     225,000     237.0x                       168,000      14.0x
Atomics                        1,140,000       5.0x                     1,600,000       9.5x
10x Batch Size                 2,610,000       2.3x                     1,800,000       1.1x
2x Graph Size                  3,100,000       1.2x
Overall                                    3,263.0x                                   150.0x

Fastest known parallel data structure for dynamic semantic graphs at this scale
Available at: http://www.cc.gatech.edu/stinger
Note: Our experimental results limited by available resources.

Slide26

STINGER Website

http://www.cc.gatech.edu/stinger

Slide27

Conclusions

Several approaches to finding parallelism in an edge update stream
  100x to 3000x improvement over single edge updates
Scalable, parallel data structure for dynamic, semantic graphs at over 3 million updates / sec
Portable code base runs on Cray XMT, Intel/AMD x86 systems, & IBM Power7

Slide28

Acknowledgment of Support

Slide29

Slide30

Edge Insertion Protocol

For an insertion from u to v
1st Pass: Search u’s edge blocks for an existing edge to v
  If found
    Update the weight and modified timestamp
    Return 0
  If the search completes
    Go back to u’s first edge block
    Continue
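
A single-threaded sketch of this first pass is shown below. The `Edge` and `EdgeBlock` classes and the `first_pass` function are hypothetical, returning `None` is an assumed way to signal "continue with the 2nd pass", and the weight is assumed to accumulate. In a concurrent setting the weight update would need to be an atomic add.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edge:
    weight: int
    dest: int
    created: int
    modified: int


@dataclass
class EdgeBlock:
    edges: List[Optional[Edge]]          # None marks an empty slot
    next: Optional["EdgeBlock"] = None


def first_pass(head, v, weight, now):
    """Return 0 if an existing edge to v was updated, or None to continue with the 2nd pass."""
    block = head
    while block is not None:
        for edge in block.edges:
            if edge is not None and edge.dest == v:
                edge.weight += weight    # assumption: weights accumulate
                edge.modified = now
                return 0
        block = block.next
    return None                          # search completed: restart from the first block


# Timestamps and the edge to vertex 9 are illustrative values
head = EdgeBlock([Edge(1, 9, 0, 2), Edge(15, 3, 0, 1), None])
found = first_pass(head, v=3, weight=2, now=3)
missing = first_pass(head, v=13, weight=4, now=3)
```

This pass takes no locks, so updates to edges that already exist never wait on threads that are filling empty slots.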

Slide31

Edge Insertion Protocol

For an insertion from u to v
2nd Pass: Search u’s edge blocks for an edge to v or empty space
  If an existing edge is found:
    Update the weight and modified timestamp. Return 0
  If an empty space is found, lock its first timestamp
    If it’s been filled with an edge to v, unlock and see above.
    If it’s still empty
      Fill it, update the block metadata and vertex degrees
      Unlock the edge and return 1
    If the space has been filled with another edge
      Unlock the edge
      Keep searching until the end of the last block
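
The second pass can be sketched in the same style. Everything named here is an assumption made for illustration: `Slot`, `EdgeBlock`, `second_pass`, the `EMPTY` marker, a per-slot `threading.Lock` standing in for locking the first timestamp, accumulating weights, and returning `None` when no space exists so that a 3rd pass can take over. The metadata and degree updates are plain Python operations here, where a real implementation would use atomic operations.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional

EMPTY = -1  # hypothetical marker for an empty slot's destination


@dataclass
class Slot:
    dest: int = EMPTY
    weight: int = 0
    created: int = 0
    modified: int = 0
    # Stands in for locking the slot's first timestamp
    lock: threading.Lock = field(default_factory=threading.Lock)


@dataclass
class EdgeBlock:
    slots: List[Slot]
    num_edges: int = 0
    large_time: int = 0
    next: Optional["EdgeBlock"] = None


def second_pass(head, u, v, weight, now, degrees):
    """Return 0 if an edge was updated, 1 if it was created, None if no space was found."""
    block = head
    while block is not None:
        for slot in block.slots:
            if slot.dest == v:                     # existing edge: update it
                slot.weight += weight
                slot.modified = now
                return 0
            if slot.dest == EMPTY:
                with slot.lock:
                    if slot.dest == v:             # filled with u->v in the meantime
                        slot.weight += weight
                        slot.modified = now
                        return 0
                    if slot.dest == EMPTY:         # still empty: fill it
                        slot.weight = weight
                        slot.created = slot.modified = now
                        slot.dest = v              # publish the destination last
                        block.num_edges += 1
                        block.large_time = max(block.large_time, now)
                        degrees[u] = degrees.get(u, 0) + 1
                        return 1
                # Filled with another edge: the lock is released, keep searching
        block = block.next
    return None


# Destinations 9, 5, and 2 are illustrative; the other values follow the slides' example
slots = [Slot(9, 1), Slot(5, 1), Slot(3, 15), Slot(2, 1), Slot(), Slot()]
block = EdgeBlock(slots, num_edges=4, large_time=2)
degrees = {6: 4}
created = second_pass(block, 6, 13, 4, 3, degrees)
updated = second_pass(block, 6, 13, 8, 3, degrees)
```

Rechecking the slot after taking the lock is the essential step: between the unlocked read and the lock, another thread may have filled the slot with the same edge or with a different one.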