Big Data Infrastructure Week 5: Analyzing Graphs (2/2)







Presentation Transcript

Big Data Infrastructure Week 5: Analyzing Graphs (2/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 489/698 Big Data Infrastructure (Winter 2017)
Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo
February 2, 2017
These slides are available at http://lintool.github.io/bigdata-2017w/

Parallel BFS in MapReduce
Data representation:
Key: node n
Value: d (distance from start), adjacency list
Initialization: for all nodes except the start node, d = ∞
Mapper: for each m in the adjacency list, emit (m, d + 1)
Remember to also emit the distance to yourself
Sort/Shuffle: groups distances by reachable nodes
Reducer: selects the minimum-distance path for each reachable node
Additional bookkeeping needed to keep track of the actual path
Remember to pass along the graph structure!
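The mapper/reducer logic above can be sketched in plain Python as a single-process stand-in for MapReduce (the function names and "DIST"/"GRAPH" tags are illustrative, not from the course code):

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node, dist, adj):
    """Mapper: emit the node's own distance and graph structure,
    plus a tentative distance d + 1 for every neighbor m."""
    yield node, ("DIST", dist)
    yield node, ("GRAPH", adj)
    if dist != INF:
        for m in adj:
            yield m, ("DIST", dist + 1)

def bfs_reduce(node, values):
    """Reducer: keep the minimum distance seen, and recover the
    adjacency list so the graph structure survives the iteration."""
    best, adj = INF, []
    for tag, v in values:
        if tag == "GRAPH":
            adj = v
        elif v < best:
            best = v
    return node, best, adj

def bfs_iteration(graph):
    """One simulated MapReduce round over {node: (dist, adj_list)}."""
    shuffled = defaultdict(list)
    for node, (dist, adj) in graph.items():
        for k, v in bfs_map(node, dist, adj):
            shuffled[k].append(v)
    return {n: (d, a) for n, d, a in
            (bfs_reduce(k, vs) for k, vs in shuffled.items())}

# Toy graph: n0 -> n1, n0 -> n2, n1 -> n2
g = {"n0": (0, ["n1", "n2"]), "n1": (INF, ["n2"]), "n2": (INF, [])}
g = bfs_iteration(g)  # after one round, n1 and n2 are at distance 1
```

Note how the mapper re-emits the adjacency list alongside the distances: without that, the graph structure would be lost after the first iteration.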

Implementation Practicalities
[Diagram: each iteration is a map and reduce pass, reading from and writing back to HDFS]
Convergence?

Visualizing Parallel BFS
[Diagram: example graph with nodes n0 through n9, showing the BFS frontier expanding outward from the start node]

Non-toy?

Application: Social Search
Source: Wikipedia (Crowd)

Social Search
When searching, how to rank friends named “John”?
Assume undirected graphs
Rank matches by distance to user
Naïve implementations:
Precompute all-pairs distances
Compute distances at query time
Can we do better?

All Pairs?
Floyd-Warshall algorithm: difficult to MapReduce-ify…
Multiple-source shortest paths in MapReduce: run multiple parallel BFS simultaneously
Assume source nodes { s0, s1, …, sn }
Instead of emitting a single distance, emit an array of distances, one per source
Reducer selects the minimum for each element in the array
Does this scale?
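The only change from single-source BFS is in the reducer, which now takes an elementwise minimum across the candidate distance arrays for a node. A minimal sketch (the helper name is illustrative):

```python
def multi_source_reduce(distance_arrays):
    """Reducer for multi-source BFS: elementwise minimum across the
    candidate distance arrays for one node (one slot per source)."""
    return [min(col) for col in zip(*distance_arrays)]

# Two candidate arrays for the same node, w.r.t. sources {s0, s1, s2}:
best = multi_source_reduce([[3, 1, 4], [2, 5, 4]])  # -> [2, 1, 4]
```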

Landmark Approach (aka sketches)
Select n seeds { s0, s1, …, sn }
Compute distances from seeds to every node: run multi-source parallel BFS in MapReduce!
Distances to seeds, per node:
A = [2, 1, 1]
B = [1, 1, 2]
C = [4, 3, 1]
D = [1, 2, 4]
What can we conclude about distances?
Insight: landmarks bound the maximum path length
Lots of details:
How to more tightly bound distances
How to select landmarks (random isn’t the best…)
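Given the per-landmark distance arrays above, the triangle inequality bounds the true distance between any two nodes: |du[i] − dv[i]| ≤ d(u, v) ≤ du[i] + dv[i] for every landmark i. A sketch using the sample values from this slide (the function name is illustrative):

```python
def bound_distance(du, dv):
    """Bound d(u, v) from per-landmark distance arrays via the
    triangle inequality: take the tightest bound over all landmarks."""
    upper = min(a + b for a, b in zip(du, dv))
    lower = max(abs(a - b) for a, b in zip(du, dv))
    return lower, upper

# A = [2, 1, 1], B = [1, 1, 2]: d(A, B) is somewhere in [1, 2]
ab = bound_distance([2, 1, 1], [1, 1, 2])
# C = [4, 3, 1], D = [1, 2, 4]: d(C, D) is somewhere in [3, 5]
cd = bound_distance([4, 3, 1], [1, 2, 4])
```

Ranking by these bounds is what makes the precomputed sketches usable at query time without storing all-pairs distances.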

Graphs and MapReduce (and Spark)
A large class of graph algorithms involve:
Local computations at each node
Propagating results: “traversing” the graph
Generic recipe:
Represent graphs as adjacency lists
Perform local computations in the mapper
Pass along partial results via outlinks, keyed by destination node
Perform aggregation in the reducer on the inlinks to a node
Iterate until convergence: controlled by an external “driver”
Don’t forget to pass the graph structure between iterations

PageRank (The original “secret sauce” for evaluating the importance of web pages) (What’s the “Page” in PageRank?)

Random Walks Over the Web
Random surfer model:
User starts at a random Web page
User randomly clicks on links, surfing from page to page
PageRank:
Characterizes the amount of time spent on any given page
Mathematically, a probability distribution over pages
Use in web ranking:
Correspondence to human intuition?
One of thousands of features used in web search

PageRank: Defined
Given page x with inlinks t1…tn:

PR(x) = α(1/N) + (1 − α) Σi PR(ti)/C(ti)

where
C(t) is the out-degree of t
α is the probability of a random jump
N is the total number of nodes in the graph

Computing PageRank
Remember this? A large class of graph algorithms involve:
Local computations at each node
Propagating results: “traversing” the graph
Sketch of algorithm:
Start with seed PRi values
Each page distributes its PRi mass to all pages it links to
Each target page adds up the mass from its in-bound links to compute PRi+1
Iterate until values converge

Simplified PageRank
First, tackle the simple case:
No random jump factor
No dangling nodes
Then, factor in these complexities…
Why do we need the random jump?
Where do dangling nodes come from?

Sample PageRank Iteration (1)
[Diagram: five-node graph]
Before: n1 = 0.2, n2 = 0.2, n3 = 0.2, n4 = 0.2, n5 = 0.2
Mass passed along each edge: n1 sends 0.1 to each of n2 and n4; n2 sends 0.1 to each of n3 and n5; n3 sends 0.2 to n4; n4 sends 0.2 to n5; n5 sends 0.066 to each of n1, n2, and n3
After: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3

Sample PageRank Iteration (2)
[Diagram: same five-node graph]
Before: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3
Mass passed along each edge: n1 sends 0.033 to each of n2 and n4; n2 sends 0.083 to each of n3 and n5; n3 sends 0.166 to n4; n4 sends 0.3 to n5; n5 sends 0.1 to each of n1, n2, and n3
After: n1 = 0.1, n2 = 0.133, n3 = 0.183, n4 = 0.2, n5 = 0.383

PageRank in MapReduce
[Diagram: map phase takes each node with its adjacency list (n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]) and emits PageRank mass keyed by destination node; shuffle groups by node; reduce sums the incoming mass and re-emits each node together with its adjacency list]

PageRank Pseudo-Code
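The slide's pseudo-code is not reproduced in this transcript; here is a plain-Python sketch of the simplified case (no random jump, no dangling nodes), simulating one MapReduce round on the five-node sample graph from the earlier iteration slides. The names and structure are illustrative, not the course's actual code:

```python
from collections import defaultdict

def pagerank_map(node, pr, adj):
    """Mapper: pass the graph structure along, and divide this
    node's PageRank mass evenly over its out-links."""
    yield node, ("GRAPH", adj)
    for m in adj:
        yield m, ("MASS", pr / len(adj))

def pagerank_reduce(node, values):
    """Reducer: sum the incoming mass; recover the adjacency list."""
    pr, adj = 0.0, []
    for tag, v in values:
        if tag == "GRAPH":
            adj = v
        else:
            pr += v
    return node, pr, adj

def pagerank_iteration(graph):
    """One simulated MapReduce round over {node: (pr, adj_list)}."""
    shuffled = defaultdict(list)
    for node, (pr, adj) in graph.items():
        for k, v in pagerank_map(node, pr, adj):
            shuffled[k].append(v)
    return {n: (pr, adj) for n, pr, adj in
            (pagerank_reduce(k, vs) for k, vs in shuffled.items())}

# The sample graph, all nodes starting at PR = 0.2:
g = {"n1": (0.2, ["n2", "n4"]), "n2": (0.2, ["n3", "n5"]),
     "n3": (0.2, ["n4"]),       "n4": (0.2, ["n5"]),
     "n5": (0.2, ["n1", "n2", "n3"])}
g = pagerank_iteration(g)
# Matches Iteration 1 on the earlier slide:
# n1 ≈ 0.066, n2 ≈ 0.166, n3 ≈ 0.166, n4 = 0.3, n5 = 0.3
```

As with parallel BFS, the mapper must re-emit the adjacency lists or the graph structure is lost between iterations.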

PageRank vs. BFS
A large class of graph algorithms involve:
Local computations at each node
Propagating results: “traversing” the graph

            PageRank    BFS
map:        PR/N        d+1
reduce:     sum         min

Complete PageRank
Two additional complexities:
What is the proper treatment of dangling nodes?
How do we factor in the random jump factor?
Solution: second pass to redistribute the “missing PageRank mass” and account for random jumps:

p' = α(1/N) + (1 − α)(m/N + p)

where
p is the PageRank value from before, p' is the updated PageRank value
N is the number of nodes in the graph
m is the missing PageRank mass
One final optimization: fold into a single MR job
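The second pass is a one-line update per node. A sketch (alpha = 0.15 is an assumed, conventional jump probability; the slides do not fix a value):

```python
def finish_pagerank(p, m, n, alpha=0.15):
    """Second pass: fold in the random jump (probability alpha) and
    an even 1/n share of the missing mass m for each node's value p."""
    return alpha * (1.0 / n) + (1.0 - alpha) * (m / n + p)

# If dangling nodes swallowed mass m = 0.2, the first pass left
# the remaining values summing to 0.8; this update restores a
# proper distribution (summing p' over all nodes gives 1).
updated = [finish_pagerank(p, 0.2, 3) for p in [0.3, 0.4, 0.1]]
```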

Implementation Practicalities
[Diagram: map and reduce for PageRank proper, then a second map-only pass to redistribute the missing mass, all staged through HDFS]
What’s the optimization?
Convergence?

PageRank Convergence
Alternative convergence criteria:
Iterate until PageRank values don’t change
Iterate until PageRank rankings don’t change
Fixed number of iterations
Convergence for web graphs? Not a straightforward question
Watch out for link spam and the perils of SEO:
Link farms
Spider traps
…

Log Probs
PageRank values are really small…
Product of probabilities = addition of log probs
Addition of probabilities? Solution?
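One standard answer (an assumption here; the slide leaves it as a question) is the log-sum-exp trick: stay in log space, and add by pulling the larger term out front so the exponential never overflows or underflows:

```python
import math

def log_add(log_a, log_b):
    """Compute log(a + b) given log_a = log(a) and log_b = log(b),
    without leaving log space: log_a + log1p(exp(log_b - log_a))."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a   # keep the larger term outside
    return log_a + math.log1p(math.exp(log_b - log_a))

# Safe even where the raw probabilities would underflow to 0.0:
tiny = log_add(-1000.0, -1000.0)  # == -1000.0 + log(2)
```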

More Implementation Practicalities
How do you even extract the webgraph?
Lots of details…

Beyond PageRank
Variations of PageRank:
Weighted edges
Personalized PageRank
Variants on graph random walks:
Hubs and authorities (HITS)
SALSA

Applications
Static prior for web ranking
Identification of “special nodes” in a network
Link recommendation
Additional feature in any machine learning problem

Implementation Practicalities
[Diagram, recap: map and reduce passes staged through HDFS each iteration]
Convergence?

MapReduce Sucks
Java verbosity
Hadoop task startup time
Stragglers
Needless graph shuffling
Checkpointing at each iteration
Spark to the rescue?

Let’s Spark!
[Diagram: the same map → reduce iterations, with HDFS reads and writes between every stage]

[Diagram: same iterations, but with the intermediate HDFS checkpoints between iterations dropped]

[Diagram: same pipeline, annotated with what flows between stages at each iteration: adjacency lists and PageRank mass]

[Diagram: reduce replaced by join: each iteration joins the adjacency lists with the PageRank mass]

[Diagram: each iteration joins the adjacency lists with the PageRank vector, then flatMap + reduceByKey produce the next PageRank vector; HDFS only at the start and the end]

[Diagram: same join + flatMap + reduceByKey iterations, with the adjacency lists cached in memory across iterations]
Cache!
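The cached-adjacency-lists loop maps naturally onto Spark’s flatMap/reduceByKey. A single-process Python stand-in (reduce_by_key is a toy substitute for the Spark operation, and the graph is the five-node sample from earlier):

```python
def reduce_by_key(pairs, fn):
    """Minimal stand-in for Spark's reduceByKey over (key, value) pairs."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

# links stays fixed (and, in Spark, cached); only ranks is recomputed.
links = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
ranks = {n: 0.2 for n in links}

for _ in range(2):
    # "join" links with ranks, then flatMap the mass contributions:
    contribs = [(m, ranks[n] / len(adj))
                for n, adj in links.items() for m in adj]
    ranks = reduce_by_key(contribs, lambda a, b: a + b)
# After two rounds this matches Sample PageRank Iteration (2):
# n1 = 0.1, n2 ≈ 0.133, n3 ≈ 0.183, n4 ≈ 0.2, n5 ≈ 0.383
```

Because the join re-pairs ranks with links each round, the adjacency lists never have to be shuffled along with the mass, which is exactly the graph-shuffling waste the previous slide complains about.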

MapReduce vs. Spark
Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf

Spark to the rescue? What have we fixed?
Java verbosity
Hadoop task startup time
Stragglers
Needless graph shuffling
Checkpointing at each iteration

[Diagram, recap: join + flatMap + reduceByKey iterations with the adjacency lists cached]
Still not particularly satisfying?

Stay Tuned!
Source: https://www.flickr.com/photos/smuzz/4350039327/

Questions?
Source: Wikipedia (Japanese rock garden)