Thesis Defense: Large-Scale Graph Computation on Just a PC

Presentation Transcript

1. Thesis Defense: Large-Scale Graph Computation on Just a PC. Aapo Kyrölä (akyrola@cs.cmu.edu). Thesis Committee: Carlos Guestrin (University of Washington & CMU), Guy Blelloch (CMU), Alex Smola (CMU), Dave Andersen (CMU), Jure Leskovec (Stanford).

2. Outline: research fields; motivation and applications; research contributions.

3. Why Graphs? (Large-Scale Graph Computation on Just a PC)

4. Big Data with Structure: Big Graphs. Examples: social graph, follow-graph, consumer-products graph, user-movie ratings graph, DNA interaction graph, WWW link graph, communication networks (but "only 3 hops").

5. Why on a single machine? Can't we just use the Cloud? (Large-Scale Graph Computation on Just a PC)

6. Why use a cluster? Two reasons: (1) one computer cannot handle my graph problem in a reasonable time; (2) I need to solve the problem very fast.

7. Why use a cluster? Two reasons: (1) one computer cannot handle my graph problem in a reasonable time; (2) I need to solve the problem very fast. Our work expands the space of feasible problems on one machine (PC): our experiments use the same graphs as, or bigger ones than, previous papers on distributed graph computation (and we can do the Twitter graph on a laptop). Our work also raises the bar on the performance required of a "complicated" system.

8. Benefits of single-machine systems (assuming one can handle your big problems): programmer productivity (global state, debuggers, ...); inexpensive to install and administer, less power. Scalability: use a cluster of single-machine systems to solve many tasks in parallel. Idea: trade latency for throughput (< 32K bits/sec).

9. Computing on Big Graphs (Large-Scale Graph Computation on Just a PC)

10. Big Graphs != Big Data. Data size: 140 billion connections ≈ 1 TB; not a problem! Computation: hard to scale. (Figure: Twitter network visualization, by Akshay Java, 2009.)

11. Research Goal: compute on graphs with billions of edges, in a reasonable time, on a single PC. Reasonable = close to numbers previously reported for distributed systems in the literature. Experiment PC: Mac Mini (2012).

12. Terminology. (Analytical) graph computation: the whole graph is processed, typically for several iterations, with vertex-centric computation. Examples: Belief Propagation, PageRank, community detection, triangle counting, matrix factorization, machine learning. Graph queries (database): selective graph queries (compare to SQL queries) and traversals: shortest path, friends-of-friends, ...

13. Overview of the thesis work. GraphChi (Parallel Sliding Windows): batch computation and evolving graphs; applications include Triangle Counting, Item-Item Similarity, PageRank, SALSA, HITS, Graph Contraction, Minimum Spanning Forest, k-Core, Weakly and Strongly Connected Components, Community Detection, Label Propagation, Multi-BFS, Matrix Factorization, Loopy Belief Propagation, and Co-EM. GraphChi-DB (Partitioned Adjacency Lists): online graph updates, incremental computation, and graph queries such as Shortest Path, Friends-of-Friends, Induced Subgraphs, Neighborhood queries, edge and vertex properties, link prediction, graph sampling, and graph traversal. DrunkardMob: parallel random walk simulation. GraphChi^2. Thesis statement: The Parallel Sliding Windows algorithm and the Partitioned Adjacency Lists data structure enable computation on very large graphs in external memory, on just a personal computer.

14. Disk-based graph computation

15. (Overview diagram repeated as a section transition.)

16. Computational Model. Graph G = (V, E) with directed edges e = (source, destination). Each edge and vertex is associated with a value of a user-defined type; vertex and edge values can be modified (structure modification is also supported). Terms: e is an out-edge of A, and an in-edge of B. (Figure: vertices A and B connected by edge e, with data attached to every vertex and edge.)

17. Vertex-centric Programming. "Think like a vertex." Popularized by the Pregel and GraphLab projects; historically, systolic computation and the Connection Machine. The programmer writes MyFunc(vertex) { /* modify neighborhood */ }. Example:

function Pagerank(vertex):
    insum = sum(edge.value for edge in vertex.inedges)
    vertex.value = 0.15 + 0.85 * insum
    foreach edge in vertex.outedges:
        edge.value = vertex.value / vertex.num_outedges
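To make the update function above concrete, here is a minimal, self-contained C++ sketch of a vertex-centric PageRank update; the types and the function name pagerank_update are illustrative and are not GraphChi's actual API.

```cpp
#include <vector>

// Illustrative vertex-centric types; not GraphChi's actual API.
struct Edge {
    float value;  // value stored on the edge (here: the rank share sent along it)
};

struct Vertex {
    float value;                   // vertex value (PageRank score)
    std::vector<Edge*> in_edges;
    std::vector<Edge*> out_edges;
};

// "Think like a vertex": read in-edges, update own value, write out-edges.
void pagerank_update(Vertex& v) {
    float insum = 0.0f;
    for (Edge* e : v.in_edges) insum += e->value;
    v.value = 0.15f + 0.85f * insum;                    // damping factor 0.85
    if (!v.out_edges.empty()) {
        float share = v.value / static_cast<float>(v.out_edges.size());
        for (Edge* e : v.out_edges) e->value = share;   // "broadcast" via out-edges
    }
}
```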

18. Computational Setting. Constraints: not enough memory to store the whole graph in memory, nor all the vertex values; but enough memory to store one vertex and its edges with their associated values.

19. The Main Challenge of Disk-based Graph Computation: Random Access. Disks deliver roughly 100K random reads/sec (commodity) to 1M reads/sec (high-end arrays), far below the 5-10 M random edge accesses per second needed to achieve "reasonable performance".

20. Random Access Problem. (Figure: a file of edge values, with A's and B's in-edges and out-edges scattered through it; processing it sequentially forces random reads and random writes.) Moral: you can access either the in-edges or the out-edges sequentially, but not both!

21. Our Solution: Parallel Sliding Windows (PSW)

22. Parallel Sliding Windows: Phases. PSW processes the graph one sub-graph at a time; in one iteration, the whole graph is processed, and typically the next iteration is then started. Phases: 1. Load, 2. Compute, 3. Write.

23. PSW: Shards and Intervals. Vertices are numbered from 1 to n and split into P intervals; a sub-graph is the set of vertices in one interval. Each interval has a shard holding the in-edges of its vertices, i.e. edges e = (source, destination) are partitioned by destination; within a shard, edges are sorted by source.

24. Example: Layout. A shard stores the in-edges for an interval of vertices, sorted by source id. Shards are made small enough to fit in memory, and interval boundaries are chosen to balance shard sizes. Example: Shard 1 holds in-edges for vertices 1..100, Shard 2 for 101..700, Shard 3 for 701..1000, Shard 4 for 1001..10000.

25. PSW: Loading the Sub-graph (vertices 1..100). Load all of Shard 1 (the interval's in-edges, sorted by source id) into memory. What about the out-edges? They are arranged in sequence in the other shards, since every shard is sorted by source.

26. PSW: Loading the Sub-graph (vertices 101..700). Load all of Shard 2 into memory; the interval's out-edge blocks are read from a sliding window over each of the other shards.

27. Parallel Sliding Windows: only P large reads and P large writes per interval, i.e. P^2 non-sequential accesses for one full pass over the graph. Works well on both SSDs and magnetic hard disks!
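The following schematic C++ sketch shows the access pattern PSW makes in one iteration: load the interval's own shard fully, slide a window over every other shard, compute, and write the windows back. The helper functions are illustrative stubs, not GraphChi code.

```cpp
#include <cstdio>

// Schematic of the Parallel Sliding Windows access pattern with P shards,
// each sorted by source vertex. Helper names are illustrative stubs.
const int P = 4;  // number of intervals / shards

void load_full_shard(int shard) {                     // in-edges of the interval
    std::printf("  sequential read : all of shard %d\n", shard);
}
void load_sliding_window(int shard, int interval) {   // out-edges of the interval
    std::printf("  sequential read : window of shard %d for interval %d\n", shard, interval);
}
void update_vertices(int interval) {
    std::printf("  compute         : update vertices of interval %d\n", interval);
}
void write_back(int shard, int interval) {
    std::printf("  sequential write: window of shard %d for interval %d\n", shard, interval);
}

int main() {
    // One full iteration: at most P large reads and P large writes per
    // interval, i.e. O(P^2) non-sequential seeks per pass over the graph.
    for (int interval = 0; interval < P; ++interval) {
        std::printf("interval %d\n", interval);
        load_full_shard(interval);                                    // 1. load
        for (int s = 0; s < P; ++s)
            if (s != interval) load_sliding_window(s, interval);
        update_vertices(interval);                                    // 2. compute
        for (int s = 0; s < P; ++s) write_back(s, interval);          // 3. write
    }
    return 0;
}
```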

28. How PSW Computes: "Gauss-Seidel" / Asynchronous. (Chapter 6 + Appendix; joint work with Julian Shun.)

29. Synchronous vs. Gauss-Seidel. Bulk-Synchronous Parallel (Jacobi iterations): updates see neighbors' values from the previous iteration; most systems are synchronous. Asynchronous (Gauss-Seidel iterations): updates see the most recent values; GraphLab is asynchronous.

30. PSW Runs Gauss-Seidel. While processing interval 101..700 (Shard 2 loaded fully, out-edge blocks from sliding windows over Shards 1, 3, 4), the windows already contain fresh values written in the previous phase, so updates see the most recent neighbor values.

31. Synchronous (Jacobi). Example: each vertex chooses the minimum label of its neighbors. Bulk-synchronous execution requires on the order of the graph diameter many iterations to propagate the minimum label. (Figure: labels on a small chain shrinking one hop per iteration.)

32. PSW is Asynchronous (Gauss-Seidel). Same example: each vertex chooses the minimum label of its neighbors. With Gauss-Seidel and a random schedule, the expected number of iterations on a chain graph is (N - 1) / (e - 1), roughly 60% of the synchronous count.
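To illustrate the claim, here is a small, self-contained simulation of minimum-label propagation on a chain, comparing bulk-synchronous sweeps against Gauss-Seidel sweeps with a random update order per sweep. It is only an illustration of the effect, not the thesis's analysis.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Minimum-label propagation on a chain of N vertices: every vertex repeatedly
// takes the minimum label among itself and its neighbors.
int main() {
    const int N = 1000;
    std::vector<int> sync_label(N), gs_label(N);
    for (int i = 0; i < N; ++i) sync_label[i] = gs_label[i] = i + 1;

    auto neighbor_min = [N](const std::vector<int>& l, int i) {
        int m = l[i];
        if (i > 0)     m = std::min(m, l[i - 1]);
        if (i + 1 < N) m = std::min(m, l[i + 1]);
        return m;
    };

    // Bulk-synchronous (Jacobi): every update reads last iteration's values.
    int sync_sweeps = 0;
    for (bool changed = true; changed; ++sync_sweeps) {
        std::vector<int> next(N);
        for (int i = 0; i < N; ++i) next[i] = neighbor_min(sync_label, i);
        changed = (next != sync_label);
        sync_label.swap(next);
    }

    // Gauss-Seidel with a random update order per sweep: updates see the
    // freshest values, so a label can travel several hops within one sweep.
    std::mt19937 rng(42);
    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);
    int gs_sweeps = 0;
    for (bool changed = true; changed; ++gs_sweeps) {
        changed = false;
        std::shuffle(order.begin(), order.end(), rng);
        for (int i : order) {
            int m = neighbor_min(gs_label, i);
            if (m != gs_label[i]) { gs_label[i] = m; changed = true; }
        }
    }
    // Typically the Gauss-Seidel count is around 60% of the synchronous one.
    std::printf("synchronous sweeps: %d, Gauss-Seidel sweeps: %d\n",
                sync_sweeps, gs_sweeps);
    return 0;
}
```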

33. Label Propagation (open theoretical question!). Number of iterations, grid with side length 100: Synchronous 199 vs. PSW Gauss-Seidel (average over random schedules) ~57; 100 vs. ~29; 298 vs. ~52. Natural graphs (web, social): synchronous needs about graph diameter - 1 iterations, PSW about 0.6 x diameter. (Joint work with Julian Shun; Chapter 6.)

34. PSW & External-Memory Algorithms Research. PSW is a new technique for implementing many fundamental graph algorithms. It is especially simple (compared to previous work) for directed graph problems, because PSW handles both in- and out-edges. We propose a new graph contraction algorithm based on PSW, for Minimum Spanning Forest and Connected Components, utilizing the Gauss-Seidel "acceleration". (Joint work with Julian Shun; SEA 2014, Chapter 6 + Appendix.)

35. GraphChi: System Evaluation (sneak peek). Consult the paper for a comprehensive evaluation: HD vs. SSD, striping data across multiple hard drives, comparison to an in-memory version, bottleneck analysis, the effect of the number of shards, and block size vs. performance.

36. (Overview diagram repeated as a section transition.)

37. GraphChi. C++ implementation: 8,000 lines of code; a Java implementation is also available. Several optimizations to PSW (see paper). Source code and examples: http://github.com/graphchi

38. Experiment Setting. Mac Mini (Apple Inc.): 8 GB RAM, 256 GB SSD, 1 TB hard drive, Intel Core i5 at 2.5 GHz. Experiment graphs (vertices, edges, P shards, preprocessing time): live-journal: 4.8M, 69M, 3, 0.5 min; netflix: 0.5M, 99M, 20, 1 min; twitter-2010: 42M, 1.5B, 20, 2 min; uk-2007-05: 106M, 3.7B, 40, 31 min; uk-union: 133M, 5.4B, 50, 33 min; yahoo-web: 1.4B, 6.6B, 50, 37 min.

39. Comparison to Existing Systems: PageRank, WebGraph Belief Propagation (U. Kang et al.), Matrix Factorization (Alternating Least Squares), Triangle Counting. On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems, with comparable performance. Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk; GraphChi computes asynchronously, while all systems but GraphLab compute synchronously. See the paper for more comparisons.

40. PowerGraph Comparison (OSDI'12). PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. Using 64 machines (512 CPUs in total), it is about 40x faster than GraphChi on PageRank and about 30x faster on triangle counting. GraphChi has good performance per CPU.

41. In-memory Comparison. Benchmark: 5 iterations of PageRank on the Twitter graph (1.5B edges); total runtime with initial load and output write taken into account. GraphChi (Mac Mini, SSD): 790 s. Ligra (J. Shun, Blelloch), 40-core Intel E7-8870: 15 s. Ligra (J. Shun, Blelloch), 8-core Xeon 5550: 230 s + 144 s preprocessing. PSW in-memory version, 700 shards (see Appendix), 8-core Xeon 5550: 100 s + 210 s preprocessing. However, sometimes a better algorithm is available for in-memory execution than for the external-memory or distributed setting.

42. Scalability / Input Size [SSD]. Throughput: number of edges processed per second. Conclusion: the throughput remains roughly constant as the graph size increases. GraphChi on a hard drive is ~2x slower than on an SSD (when the computational cost is low). See the paper for the scalability of other applications.

43. GraphChi-DB (new work)

44. (Overview diagram repeated as a section transition.)

45. Research Questions. What if there is a lot of metadata associated with edges and vertices? How do we do graph queries efficiently while retaining computational capabilities? How do we add edges to the graph efficiently? Can we design a graph database based on GraphChi?

46. Existing Graph Database Solutions. 1) Specialized single-machine graph databases. Problems: poor performance when data >> memory; no or weak support for analytical computation. 2) Relational / key-value databases as graph storage. Problems: large indices; the in-edge / out-edge dilemma; no or weak support for analytical computation.

47. Our Solution: the Partitioned Adjacency Lists (PAL) data structure.

48. Review: Edges in Shards. Edge = (src, dst); edges are partitioned into shards by dst (Shard 1 covers destinations up to 100, Shard 2 up to 700, ..., Shard P up to N), and each shard is sorted by source id.

49. Shard Structure (Basic). Each shard stores its edges as (source, destination) pairs, sorted by source. (Figure: an example edge list with several edges per source vertex.)

50. Shard Structure (Basic), continued. The shard is stored as two arrays: a pointer-array with one entry per source vertex (source id and the file offset of its first edge) and an edge-array holding the destinations, i.e. Compressed Sparse Row (CSR). Problem 1: how to find the in-edges of a vertex quickly? We know which shard they are in, but within the shard the edges are in random order with respect to destination.
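A minimal in-memory sketch of the pointer-array / edge-array (CSR) layout described above; in GraphChi-DB these arrays live in shard files on disk, and the names used here are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Compressed Sparse Row-style shard: a pointer-array with one entry per source
// vertex present in the shard, and an edge-array of destinations grouped by source.
struct ShardCSR {
    std::vector<std::pair<uint32_t, uint64_t>> ptr;  // (source id, offset of first edge)
    std::vector<uint32_t> dst;                       // destination of every edge

    // Out-edges of `src` stored in this shard form one contiguous slice.
    std::pair<uint64_t, uint64_t> out_edge_range(uint32_t src) const {
        auto it = std::lower_bound(
            ptr.begin(), ptr.end(), src,
            [](const std::pair<uint32_t, uint64_t>& p, uint32_t s) { return p.first < s; });
        if (it == ptr.end() || it->first != src) return {0, 0};   // none in this shard
        uint64_t begin = it->second;
        uint64_t end = (it + 1 != ptr.end()) ? (it + 1)->second : dst.size();
        return {begin, end};
    }
};
```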

51. PAL: In-edge Linkage. (Figure: the same pointer-array / edge-array layout, about to be augmented with in-edge links.)

52. PAL: In-edge Linkage, continued. Each edge-array entry additionally stores a link to the next in-edge of the same destination within the shard, forming an augmented linked list of in-edges, plus an index to the first in-edge of each vertex in the interval. Problem 2: how to find out-edges quickly? They are sorted inside a shard, but partitioned across all shards.

53. PAL: Out-edge Queries. Out-edge lookups need the pointer-array, but it can be large, O(V), and binary search on disk is slow. Option 1: keep a sparse index in memory. Option 2: delta-code the pointer-array with a unary code (Elias-gamma) so it fits completely in memory.
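A sketch of Option 2: delta-compressing the sorted offsets of the pointer-array with Elias-gamma codes so the whole array can stay in memory. The coding itself is standard; how GraphChi-DB actually packs the bits is not shown here, so the structure below is only illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Elias-gamma bit stream: value n >= 1 is written as floor(log2 n) zeros
// followed by the binary representation of n.
struct GammaBits {
    std::vector<bool> bits;

    void encode(uint64_t n) {                 // n >= 1
        int len = 0;
        for (uint64_t t = n; t > 1; t >>= 1) ++len;        // len = floor(log2 n)
        for (int i = 0; i < len; ++i) bits.push_back(false);
        for (int i = len; i >= 0; --i) bits.push_back((n >> i) & 1);
    }
    uint64_t decode(size_t& pos) const {      // reads one value starting at pos
        int len = 0;
        while (!bits[pos]) { ++len; ++pos; }
        uint64_t n = 0;
        for (int i = 0; i <= len; ++i) n = (n << 1) | (bits[pos++] ? 1 : 0);
        return n;
    }
};

// Delta-compress a sorted offset array: encode gaps (+1 so zero gaps are
// representable); decoding reverses this by prefix-summing the gaps.
inline GammaBits compress_offsets(const std::vector<uint64_t>& offsets) {
    GammaBits g;
    uint64_t prev = 0;
    for (uint64_t off : offsets) { g.encode(off - prev + 1); prev = off; }
    return g;
}
```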

54. Experiment: Indices. Median latency on the Twitter graph, 1.5B edges.

55. Queries: I/O Costs. An in-edge query touches only one shard; an out-edge query touches each shard that has edges of the vertex. Trade-off: more shards give better locality for in-edge queries, but worse out-edge queries.

56. Edge Data & Searches. Edge properties are stored column by column next to the shard's adjacency data, e.g. 'weight': [float], 'timestamp': [long], 'belief': (factor). Because the columns are fixed-size, the index i of an edge directly locates its data, so no foreign key is required to find edge data. Note: vertex values are stored similarly.
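In other words, the edge's position in the shard is itself the key into every property column. A hedged sketch of that lookup, assuming one fixed-width record per edge in each column file:

```cpp
#include <cstdint>
#include <cstdio>

// Columnar edge data: the property of edge i lives at byte offset i * sizeof(T)
// in that property's own column file (one fixed-width record per edge).
// Illustrative sketch; real shards would use buffered or blocked I/O.
template <typename T>
bool read_edge_property(std::FILE* column_file, uint64_t edge_index, T* out) {
    long offset = static_cast<long>(edge_index * sizeof(T));
    if (std::fseek(column_file, offset, SEEK_SET) != 0) return false;
    return std::fread(out, sizeof(T), 1, column_file) == 1;
}

// Usage (illustrative): float w; read_edge_property(weight_column, i, &w);
```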

57. Efficient Ingest? New edges are collected into buffers in RAM and periodically written into the shards on disk.

58. Merging Buffers to Disk. When a buffer fills up, it is merged into the corresponding shards on disk.

59. Merging Buffers to Disk (2). A merge requires loading the existing shard from disk, so every edge is rewritten on every merge. This does not scale: the number of rewrites is data size / buffer size = O(E).

60. Log-Structured Merge-tree (LSM). Ref: O'Neil, Cheng et al. (1996). Level 1 (youngest): top shards with in-memory buffers, one per interval. Level 2: larger on-disk shards covering several intervals (e.g. intervals 1-4, ..., (P-4)-P). Level 3 (oldest): still larger shards (e.g. intervals 1-16). New edges enter at level 1 and are merged downstream. In-edge query: one shard on each level. Out-edge query: all shards.
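A compact sketch of the log-structured merge idea for ingest: edges accumulate in an in-memory buffer and, when it fills, are sorted and merged into geometrically larger levels, so each edge is rewritten only a logarithmic number of times rather than O(E) times. Level growth factors, names, and the in-memory stand-in for on-disk shards are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EdgeRec { uint32_t src, dst; };

struct LsmLevel {
    std::vector<EdgeRec> edges;          // stands in for an on-disk shard level
};

struct EdgeLog {
    std::vector<EdgeRec> buffer;         // in-memory buffer (level 0)
    std::vector<LsmLevel> levels;        // on-disk levels, geometrically larger
    size_t buffer_capacity = 1 << 16;

    void add_edge(uint32_t src, uint32_t dst) {
        buffer.push_back({src, dst});
        if (buffer.size() >= buffer_capacity) flush();
    }

    void flush() {
        auto by_src = [](const EdgeRec& a, const EdgeRec& b) { return a.src < b.src; };
        std::sort(buffer.begin(), buffer.end(), by_src);
        std::vector<EdgeRec> carry = std::move(buffer);
        buffer.clear();
        // Cascade downstream until the merged run fits the level's capacity.
        for (size_t lvl = 0; ; ++lvl) {
            if (lvl == levels.size()) levels.emplace_back();
            LsmLevel& L = levels[lvl];
            std::vector<EdgeRec> merged(L.edges.size() + carry.size());
            std::merge(L.edges.begin(), L.edges.end(), carry.begin(), carry.end(),
                       merged.begin(), by_src);
            size_t level_capacity = buffer_capacity << (2 * (lvl + 1)); // 4x per level
            if (merged.size() <= level_capacity) { L.edges = std::move(merged); break; }
            carry = std::move(merged);
            L.edges.clear();
        }
    }
};
```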

61. Experiment: Ingest.

62. Advantages of PAL. Only sparse and implicit indices: the pointer-array usually fits in RAM with Elias-gamma coding, and the database size is small. Columnar data model: load only the data you need; the graph structure is separate from the data; property-graph model. Great insertion throughput with the LSM tree, which can be adjusted to match the workload.

63. Experimental comparisons

64. GraphChi-DB: Implementation. Written in Scala; supports queries and computation as an online database. All experiments shown in this talk were done on a Mac Mini (8 GB RAM, SSD). Source code and examples: http://github.com/graphchi

65. Comparison: Database Size. Baseline: 4 + 4 bytes per edge.

66. Comparison: Ingest. Time to ingest 1.5B edges: GraphChi-DB (online): 1 hour 45 minutes; Neo4j (batch): 45 hours; MySQL (batch): 3 hours 30 minutes (including index creation). If PageRank is running simultaneously, GraphChi-DB takes 3 hours 45 minutes.

67. Comparison: Friends-of-Friends Query. Latency percentiles over 100K random queries, on graphs of 68M and 1.5B edges. See the thesis for a shortest-path comparison.

68. LinkBench: Online Graph Database Benchmark by Facebook. Concurrent read/write workload, but only single-hop ("friends") queries; 8 different operations in a mixed workload; best performance with 64 parallel threads. Each edge and vertex has a version, timestamp, type, and random string payload. Results, GraphChi-DB (Mac Mini) vs. MySQL + FB patch (server, SSD array, 144 GB RAM): edge-update (95th percentile): 22 ms vs. 25 ms; edge-get-neighbors (95th percentile): 18 ms vs. 9 ms; average throughput: 2,487 req/s vs. 11,029 req/s; database size: 350 GB vs. 1.4 TB. See full results in the thesis.

69. Summary of Experiments. Efficient for mixed read/write workloads (see the Facebook LinkBench experiments in the thesis); the LSM tree trades off some read performance, but this can be adjusted. State-of-the-art performance for graphs that are much larger than RAM; Neo4j's linked-list data structure is good when the data fits in RAM; DEX performs poorly in practice. More experiments in the thesis!

70. Discussion: greater impact, hindsight, future research questions.

71. Greater impact

72. Impact: "Big Data" Research. GraphChi's OSDI 2012 paper has received over 85 citations in just 18 months (Google Scholar). Two major direct descendant papers in top conferences: X-Stream (SOSP 2013) and TurboGraph (KDD 2013). Challenging the mainstream: you can do a lot on just a PC if you focus on the right data structures and computational models.

73. Impact: Users. GraphChi's implementations (C++, Java, -DB) have gained a lot of users, currently ~50 unique visitors per day. It enables 'everyone' to tackle big graph problems; especially the recommender toolkit (by Danny Bickson) has been very popular. Typical users: students, non-systems researchers, small companies...

74. Impact: Users (cont.). "I work in a [EU country] public university. I can't use a distributed computing cluster for my research ... it is too expensive. Using GraphChi I was able to perform my experiments on my laptop. I thus have to admit that GraphChi saved my research. (...)"

75. Evaluation: hindsight

76. What is GraphChi Optimized for? The original target algorithm was Belief Propagation on probabilistic graphical models: changing the value of an edge (both in- and out-edges!); computation that processes the whole graph, or most of it, on each iteration; random access to all of a vertex's edges; vertex-centric rather than edge-centric; async / Gauss-Seidel execution.

77. GraphChi is Not Good For: very large vertex state; traversals and two-hop dependencies, or dynamic scheduling (such as Splash BP); high-diameter graphs such as planar graphs, unless the computation itself has short-range interactions; very large numbers of iterations (neural networks, LDA with collapsed Gibbs sampling); and there is no support for implicit graph structure. Also, single-PC performance is limited.

78. (Overview diagram repeated.) Versatility of PSW and PAL.

79. Future Research Directions. Distributed setting: distributed PSW (one shard per node)? PSW is inherently sequential and bandwidth in the Cloud is low; co-operating GraphChi(-DB) instances connected with a parameter server. New graph programming models and tools: vertex-centric programming is sometimes too local, e.g. two-hop interactions and many traversals are cumbersome; abstractions for learning graph structure; implicit graphs; asynchronous execution is hard to debug, so better tools are needed. Graph-aware optimizations to GraphChi-DB: buffer management, smart caching, learning the configuration.

80. What if we have plenty of memory? The semi-external memory setting.

81. Observations. The I/O performance of PSW is only weakly affected by the amount of RAM. Good: it works with very little memory. Bad: it does not benefit from more memory. Simple trick: cache some data. Many graph algorithms have O(V) state, and the update function accesses neighbor vertex state. Standard PSW 'broadcasts' vertex values via edges; semi-external PSW stores the vertex values in memory instead.
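A minimal sketch of the semi-external trick: the O(V) vertex-value array is pinned in RAM, so an update streamed from disk only needs the (src, dst) pair of each edge instead of 'broadcasting' values through edge data. Types and the update rule are illustrative.

```cpp
#include <cstdint>
#include <vector>

// Semi-external setting: edges stream from disk shard by shard (PSW),
// but the O(V) per-vertex state stays pinned in RAM.
struct SemiExternalState {
    std::vector<float> value;            // one entry per vertex, held in memory

    explicit SemiExternalState(uint64_t num_vertices)
        : value(num_vertices, 0.0f) {}

    // The update reads neighbor values directly from RAM; the streamed edge
    // only supplies the (src, dst) pair, no edge data needs to be rewritten.
    void apply_edge(uint32_t src, uint32_t dst) {
        value[dst] += 0.5f * value[src];  // illustrative update rule
    }
};
```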

82. Using RAM Efficiently. Assume there is enough RAM to store many O(V) algorithm states in memory, but not enough to store the whole graph: keep the graph working memory (PSW) on disk and the computational state (computation 1 state, computation 2 state, ...) in RAM.

83. Parallel Computation Examples. DrunkardMob algorithm (Chapter 5): store billions of random-walk states in RAM. Multiple breadth-first searches: analyze neighborhood sizes by starting hundreds of random BFSes. Compute many different recommender algorithms (or different parameterizations) in parallel; see Mayank Mohta and Shu-Hao Yu's Master's project.

84. Conclusion

85. Summary of Published Work (first-author papers were shown in bold on the slide; thesis papers marked).
GraphLab: Parallel Framework for Machine Learning (with J. Gonzalez, Y. Low, D. Bickson, C. Guestrin), UAI 2010.
Distributed GraphLab: Framework for Machine Learning and Data Mining in the Cloud (same authors), VLDB 2012.
GraphChi: Large-scale Graph Computation on Just a PC (with C. Guestrin, G. Blelloch), OSDI 2012.
DrunkardMob: Billions of Random Walks on Just a PC, ACM RecSys 2013.
Beyond Synchronous: New Techniques for External Memory Graph Connectivity and Minimum Spanning Forest (with Julian Shun, G. Blelloch), SEA 2014.
GraphChi-DB: Simple Design for a Scalable Graph Database - on Just a PC (with C. Guestrin), submitted / arXiv.
Parallel Coordinate Descent for L1-regularized Loss Minimization (Shotgun) (with J. Bradley, D. Bickson, C. Guestrin), ICML 2011.

86. Summary of Main Contributions. Proposed Parallel Sliding Windows, a new algorithm for external-memory graph computation (GraphChi). Extended PSW to design Partitioned Adjacency Lists for building a scalable graph database (GraphChi-DB). Proposed DrunkardMob for simulating billions of random walks in parallel. Analyzed PSW and its Gauss-Seidel properties for fundamental graph algorithms, giving a new approach for external-memory graph algorithms research. Thank You!

87. Additional slides

88. Economics. GraphChi (40 Mac Minis) vs. PowerGraph (64 EC2 cc1.4xlarge instances). Investment: $67,320 vs. none. Operating costs: per node-hour $0.03 vs. $1.30; per cluster-hour $1.19 vs. $52.00; daily $28.56 vs. $1,248.00. It takes about 56 days to recoup the Mac Mini investment. Assumptions: a Mac Mini draws 85 W (typical servers 500-1000 W); most expensive US energy at 35 c/kWh; equal-throughput configurations (based on OSDI'12).
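As a quick sanity check on the 56-day figure, the break-even time implied by the daily operating costs above is roughly:

\[
\frac{\$67{,}320}{\$1{,}248.00/\text{day} - \$28.56/\text{day}} \approx 55 \text{ days}.
\]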

89. PSW for In-memory Computation. External-memory setting: slow memory = hard disk / SSD, fast memory = RAM. In-memory setting: slow = RAM, fast = CPU caches. (Model: a small fast memory of size M, a large slow memory of 'unlimited' size, and block transfers of size B to the CPU.) Does PSW help in the in-memory setting?

90. PSW for In-memory (Appendix). PSW slightly outperforms Ligra*, the fastest in-memory graph computation system, on PageRank with the Twitter graph, including the preprocessing steps of both systems. (*: Julian Shun, Guy Blelloch, PPoPP 2013.)

91. Remarks (sync vs. async). Bulk-synchronous is embarrassingly parallel, but needs twice the amount of space. Async / Gauss-Seidel helps with high-diameter graphs, and some algorithms converge much better asynchronously: Loopy BP, see Gonzalez et al. (2009); also Bertsekas & Tsitsiklis, Parallel and Distributed Computation (1989). Asynchronous execution is sometimes difficult to reason about and debug. Asynchronous execution can be used to implement BSP.

92. I/O Complexity. See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model; the worst case is only 2x the best case. Intuition: an edge within one interval is loaded from disk only once per iteration, while an edge spanning two intervals is loaded twice per iteration.

93. Impact of Graph Structure. Algorithms with long-range information propagation need a relatively small diameter, otherwise they require too many iterations; the per-iteration cost is not much affected. Can we optimize partitioning? It could help thanks to Gauss-Seidel (faster convergence inside "groups", e.g. via a topological sort), but it is likely too expensive to do on a single PC.

94. Graph Compression: Would it Help? Graph compression methods (e.g., Blelloch et al., the WebGraph framework) can compress edges to 3-4 bits/edge (web) or ~10 bits/edge (social), but they require graph partitioning, which needs a lot of memory; compressing large graphs can take days (personal communication). Compression is also problematic for evolving graphs and for the associated data. Could GraphChi itself be used to compress graphs? Layered label propagation (Boldi et al. 2011).

95. Previous Research on (Single-Computer) Graph Databases. The 1990s and 2000s saw interest in object-oriented and graph databases (GOOD, GraphDB, HyperGraphDB, ...); the focus was on modeling, with graph storage on top of a relational DB or key-value store. RDF databases: most do not use graph storage but store triples as relations plus indexing. Modern solutions have proposed graph-specific storage: Neo4j (doubly linked lists), TurboGraph (adjacency lists chopped into pages), DEX (compressed bitmaps; details not clear).

96. LinkBench

97. Comparison to FB (cont.). GraphChi-DB load time is 9 hours vs. Facebook's 12 hours. The GraphChi-DB database is about 250 GB vs. FB's > 1.4 terabytes; however, about 100 GB of the difference is explained by different variable data (payload) sizes. Facebook/MySQL is accessed via JDBC while GraphChi-DB is embedded; but MySQL is native code and GraphChi-DB is Scala (JVM). An important CPU-bound bottleneck is sorting the results for high-degree vertices.

98. LinkBench: GraphChi-DB Performance / Size of DB. Why not even faster when everything is in RAM? Because the requests are computationally bound.

99. Possible Solutions (and why they fall short). 1. Use the SSD as a memory extension? [SSDAlloc, NSDI'11] Too many small objects; millions of accesses per second are needed. 2. Compress the graph structure to fit into RAM? [WebGraph framework] The associated values do not compress well, and they are mutated. 3. Cluster the graph and handle each cluster separately in RAM? Expensive, and the number of inter-cluster edges is big. 4. Caching of hot nodes? Unpredictable performance.

100. Number of Shards. If P is in the "dozens", it has little effect on performance.

101. Multiple Hard Drives (RAID-like). GraphChi supports striping shards across multiple disks for parallel I/O. Experiment on a 16-core AMD server (from year 2007).

102. Bottlenecks. Connected Components on Mac Mini / SSD: the cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD. Graph construction requires a lot of random access in RAM, so memory bandwidth becomes a bottleneck.

103. Bottlenecks / Multicore. Experiment on a MacBook Pro with 4 cores and an SSD. Computationally intensive applications benefit substantially from parallel execution; GraphChi saturates the SSD I/O with 2 threads.

104. In-memory vs. Disk

105. Experiment: Query Latency. See the thesis for an I/O cost analysis of in/out queries.

106. Example: Induced Subgraph Queries. The induced subgraph for a vertex set S contains all edges in the graph that have both endpoints in S. This is very fast in GraphChi-DB: it is sufficient to query for out-edges, and it parallelizes well (multi-out-edge query). It can be used for statistical graph analysis, e.g. sampling induced neighborhoods and induced friends-of-friends neighborhoods from the graph.

107. Vertices / Nodes. Vertices are partitioned similarly to edges, with similar "data shards" for the columns. Lookup/update of vertex data is O(1). There is no merge tree here: the vertex files are "dense", indexed directly from 1 up to the maximum id; a sparse structure could be supported.

108. ID-mapping. Vertex IDs are mapped to internal IDs to balance the shards; the interval length is a constant a, and consecutive original IDs are spread across different intervals (in the figure, original IDs 0, 1, 2, ..., 255 land in different intervals, and 256, 257, ..., 511 repeat the pattern).
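One way to realize this balancing is to stripe consecutive original IDs across the intervals with a cheap bijection. The exact formula below is an assumption consistent with the figure, not necessarily GraphChi-DB's actual mapping.

```cpp
#include <cstdint>

// Illustrative ID translation: consecutive original IDs are striped across the
// intervals (each of fixed length a) so every interval gets a similar mix of
// vertices. Valid for orig in [0, stripe * a); to_original inverts to_internal.
struct IdMap {
    uint64_t a;        // interval length (constant)
    uint64_t stripe;   // number of consecutive original IDs placed in distinct intervals

    uint64_t to_internal(uint64_t orig) const {
        return (orig % stripe) * a + orig / stripe;
    }
    uint64_t to_original(uint64_t internal) const {
        return (internal % a) * stripe + internal / a;
    }
};
```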

109. What if we have a Cluster? Keep the graph working memory (PSW) on disk and the computational state in RAM on each node, and trade latency for throughput!

110. Graph Computation: Research Challenges. There is a lack of truly challenging (benchmark) applications, which is caused by a lack of good data available to academics: big graphs with metadata. Industry co-operation helps, but creates problems with reproducibility. Also, it is hard to ask good questions about graphs (especially with just the structure). There is too much focus on performance; it is more important to enable "extracting value".

111. Random walk in an in-memory graph. Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i=1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

(DrunkardMob - RecSys '13)

112. Problem: What if the Graph does not fit in memory? Distributed graph systems: each hop across a partition boundary is costly. Disk-based "single-machine" graph systems: "paging" from disk is costly. (This talk.) (Figure: Twitter network visualization, by Akshay Java, 2009.) (DrunkardMob - RecSys '13)

113. Random walks in GraphChi: the DrunkardMob algorithm. Reverse thinking:

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: only the current position of each walk needs to be stored! (DrunkardMob - RecSys '13)
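A schematic C++ sketch of the same idea: sweep the graph interval by interval and advance every walk currently parked in that interval by one random hop, storing only each walk's current vertex. The adjacency stub and names are illustrative; the real system also buckets walks by interval so it never has to scan all walks.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Schematic of the DrunkardMob idea: walks are advanced interval by interval
// (as PSW loads each sub-graph), and only the current vertex of each walk is kept.
struct WalkManager {
    std::vector<uint32_t> walk_position;     // walk id -> current vertex
    std::vector<std::vector<uint32_t>> out;  // adjacency (stub for shard data)
    std::mt19937 rng{1234};

    // Advance by one hop every walk whose current vertex lies in this interval.
    void advance_walks_in_interval(uint32_t first_vertex, uint32_t last_vertex) {
        for (size_t w = 0; w < walk_position.size(); ++w) {
            uint32_t v = walk_position[w];
            if (v < first_vertex || v > last_vertex || out[v].empty()) continue;
            std::uniform_int_distribution<size_t> pick(0, out[v].size() - 1);
            walk_position[w] = out[v][pick(rng)];   // one random hop
        }
    }
};
```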