Graph Algorithms (adapted from UMD Jimmy Lin’s slides)

pamella-moone
Uploaded On 2019-10-31

Presentation Transcript

Graph Algorithms
Adapted from UMD Jimmy Lin’s slides, which are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Revised based on the slides by Ruoming Jin @ Kent State.

Outline
- Graph problems and representations
- Parallel breadth-first search
- PageRank

What’s a graph?
- G = (V, E), where
  - V represents the set of vertices (nodes)
  - E represents the set of edges (links)
  - Both vertices and edges may contain additional information
- Different types of graphs:
  - Directed vs. undirected edges
  - Presence or absence of cycles
- Graphs are everywhere:
  - Hyperlink structure of the Web
  - Physical structure of computers on the Internet
  - Interstate highway system
  - Social networks

[Figure: Königsberg. Source: Wikipedia]

Some Graph Problems
- Finding shortest paths
  - Routing Internet traffic and UPS trucks
- Finding minimum spanning trees
  - Telco laying down fiber
- Finding max flow
  - Airline scheduling
- Identifying “special” nodes and communities
  - Breaking up terrorist cells, spread of avian flu
- Bipartite matching
  - Monster.com, Match.com
- And of course... PageRank

Graphs and MapReduce
- Graph algorithms typically involve:
  - Performing computations at each node: based on node features, edge features, and local link structure
  - Propagating computations: “traversing” the graph
- Key questions:
  - How do you represent graph data in MapReduce?
  - How do you traverse a graph in MapReduce?

Representing Graphs
- G = (V, E)
- Two common representations:
  - Adjacency matrix
  - Adjacency list

Adjacency Matrices
- Represent a graph as an n x n square matrix M
  - n = |V|
  - Mij = 1 means a link from node i to j

      n1  n2  n3  n4
  n1   0   1   0   1
  n2   1   0   1   1
  n3   1   0   0   0
  n4   1   0   1   0

Adjacency Matrices: Critique
- Advantages:
  - Amenable to mathematical manipulation
  - Iteration over rows and columns corresponds to computations on outlinks and inlinks
- Disadvantages:
  - Lots of zeros for sparse matrices
  - Lots of wasted space

Adjacency Lists
- Take adjacency matrices… and throw away all the zeros

      n1  n2  n3  n4
  n1   0   1   0   1
  n2   1   0   1   1
  n3   1   0   0   0
  n4   1   0   1   0

  1: 2, 4
  2: 1, 3, 4
  3: 1
  4: 1, 3

Adjacency Lists: Critique
- Advantages:
  - Much more compact representation
  - Easy to compute over outlinks
- Disadvantages:
  - Much more difficult to compute over inlinks
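The trade-off between the two representations can be sketched in a few lines of Python. This is only an illustration using the 4-node matrix from the slides; the function names `matrix_to_adjacency_list` and `invert` are hypothetical. It shows that outlinks fall out of each row directly, while inlinks require a pass over every outlink list.

```python
# Adjacency matrix from the slides: M[i][j] == 1 means a link from node i+1 to node j+1.
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def matrix_to_adjacency_list(matrix):
    """Keep only the nonzero entries of each row (the outlinks)."""
    return {
        i + 1: [j + 1 for j, cell in enumerate(row) if cell]
        for i, row in enumerate(matrix)
    }


def invert(adjacency):
    """Build inlink lists; this has to scan every outlink list."""
    inlinks = {node: [] for node in adjacency}
    for src, targets in adjacency.items():
        for dst in targets:
            inlinks[dst].append(src)
    return inlinks


outlinks = matrix_to_adjacency_list(M)
inlinks = invert(outlinks)
print(outlinks)
print(inlinks)
```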

Single Source Shortest Path
- Problem: find shortest path from a source node to one or more target nodes
  - Shortest might also mean lowest weight or cost
- First, a refresher: Dijkstra’s Algorithm

Dijkstra’s Algorithm
1. Assign to every node a tentative distance value: set it to zero for the initial (source) node and to infinity for all other nodes.
2. Mark all nodes unvisited. Create a set of all the unvisited nodes called the unvisited set. Set the initial (source) node as the current node.
3. For the current node, consider all its unvisited neighbors and calculate their tentative distances. Compare the newly calculated tentative distance with the current assigned value, and assign the smaller one.
4. When we are done considering all the neighbors of the current node, mark the current node as visited and remove it from the unvisited set. A visited node will never be checked again.
5. If all destination nodes have been marked visited, or if the smallest tentative distance among the nodes in the unvisited set is infinity, then stop. The algorithm has finished.
6. Otherwise, select the unvisited node that is marked with the smallest tentative distance, set it as the new "current node", and go back to step 3.
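The steps above can be sketched in Python as follows. This is a minimal sketch, not the slides' own code: it swaps the linear scan for "the unvisited node with the smallest tentative distance" with a binary heap (`heapq`), and it computes distances to all reachable nodes instead of stopping at specific destinations. The edge layout of `graph` is an assumption, chosen to be consistent with the edge weights and distance labels in the deck's example.

```python
import heapq
import math


def dijkstra(graph, source):
    """Return the shortest distance from source to every node in graph."""
    dist = {node: math.inf for node in graph}   # step 1: tentative distances
    dist[source] = 0
    visited = set()                             # step 2: nothing visited yet
    heap = [(0, source)]                        # candidates for "current node"
    while heap:
        d, current = heapq.heappop(heap)        # step 6: smallest tentative distance
        if current in visited:
            continue                            # stale heap entry, already finalized
        visited.add(current)                    # step 4: never checked again
        for neighbor, weight in graph[current].items():
            if neighbor in visited:
                continue
            candidate = d + weight              # step 3: tentative distance via current
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist                                 # step 5: heap empty, nothing reachable left


# Assumed edge layout (hypothetical); weights are those listed in the example.
graph = {
    "n1": {"n2": 10, "n3": 5},
    "n2": {"n3": 2, "n4": 1},
    "n3": {"n2": 3, "n4": 9, "n5": 2},
    "n4": {"n5": 4},
    "n5": {"n1": 7, "n4": 6},
}
print(dijkstra(graph, "n1"))
```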

Dijkstra’s Algorithm Example
[Figure sequence, six frames, example from CLR: a five-node graph (n1 to n5) with edge weights 10, 5, 2, 3, 2, 1, 9, 7, 4, 6; nodes are marked current, unvisited, or visited. Tentative distance labels per frame: (0, ∞, ∞, ∞, ∞), (0, 10, 5, ∞, ∞), (0, 8, 5, 14, 7), (0, 8, 5, 13, 7), (0, 8, 5, 9, 7), (0, 8, 5, 9, 7).]

Dijkstra’s Algorithm d[s]←0

Single Source Shortest Path
- Problem: find shortest path from a source node to one or more target nodes
  - Shortest might also mean lowest weight or cost
- Sequential algorithm: Dijkstra’s Algorithm
- MapReduce: parallel Breadth-First Search (BFS)

Finding the Shortest Path
- Consider simple case of equal edge weights, e.g., weight = 1
- Solution to the problem can be defined inductively
- Here’s the intuition:
  - Define: b is reachable from a if b is on adjacency list of a
  - DistanceTo(s) = 0
  - For all nodes p reachable from s, DistanceTo(p) = 1
  - For all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ M)
[Figure: source s reaches nodes m1, m2, m3 at distances d1, d2, d3; each of them links to node n.]

Visualizing Parallel BFS
[Figure: graph with nodes n0 to n9.]

From Intuition to Algorithm
- Data representation:
  - Key: node n
  - Value consists of two components: d (distance from source node) and adjacency list (list of nodes reachable from n)
  - Initialization: for all nodes except for source node, d = ∞
- Mapper:
  - ∀ m ∈ adjacency list: emit (m, d + 1)
- Sort/Shuffle
  - Groups distances by reachable nodes
- Reducer:
  - Selects minimum distance path for each node
  - Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed
- Each MapReduce iteration advances the “known frontier” by one hop
  - Subsequent iterations include more and more reachable nodes as frontier expands
  - Multiple iterations are needed to explore entire graph
- Preserving graph structure:
  - Problem: Where did the adjacency list go?
  - Solution: mapper emits (n, adjacency list) as well

BFS Pseudo-Code
[Pseudo-code figure not captured in the transcript.]
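Since the pseudo-code itself is missing here, the following is a minimal single-process Python sketch of the parallel BFS described on the preceding slides, assuming unit edge weights. It simulates map, shuffle, and reduce with ordinary dictionaries rather than a real MapReduce framework. The names (`bfs_map`, `bfs_reduce`, `run_iteration`, `parallel_bfs`), the tagged-tuple encoding, the stop-when-nothing-changes driver, and the sample graph are all assumptions for illustration.

```python
import math
from collections import defaultdict


def bfs_map(node, value):
    """Mapper: pass the graph structure along and propose d + 1 to each outlink."""
    distance, adjacency = value
    yield node, ("graph", adjacency)      # preserve the adjacency list
    yield node, ("dist", distance)        # assumption: also re-emit the node's own distance
    if distance < math.inf:
        for m in adjacency:
            yield m, ("dist", distance + 1)


def bfs_reduce(node, values):
    """Reducer: recover the adjacency list and keep the minimum distance."""
    adjacency, best = [], math.inf
    for kind, payload in values:
        if kind == "graph":
            adjacency = payload
        else:
            best = min(best, payload)
    return node, (best, adjacency)


def run_iteration(state):
    """One map, shuffle (group by key), and reduce round."""
    groups = defaultdict(list)
    for node, value in state.items():
        for key, out in bfs_map(node, value):
            groups[key].append(out)
    return dict(bfs_reduce(node, values) for node, values in groups.items())


def parallel_bfs(adjacency, source):
    """Driver: iterate until no distance changes; returns (distances, iterations)."""
    state = {n: (0 if n == source else math.inf, adj) for n, adj in adjacency.items()}
    iterations = 0
    while True:
        new_state = run_iteration(state)
        iterations += 1
        if new_state == state:
            return {n: d for n, (d, _) in new_state.items()}, iterations
        state = new_state


# Hypothetical sample graph.
adjacency = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
distances, iterations = parallel_bfs(adjacency, "n1")
print(distances, iterations)
```

The last iteration only confirms that nothing changed, so the driver runs one more round than the number of hops to the farthest node.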

Stopping Criterion
- How many iterations are needed in parallel BFS (equal edge weight case)?
- Six degrees of separation?
- Practicalities of implementation in MapReduce

Comparison with Dijkstra
- Dijkstra’s algorithm is more efficient
  - At any step it only pursues edges from the minimum-cost path inside the frontier
- MapReduce explores all paths in parallel
  - Lots of “waste”
  - Useful work is only done at the “frontier”
  - Non-useful work can be avoided

Weighted Edges
- Now add positive weights to the edges
- Simple change: adjacency list now includes a weight w for each edge
  - In mapper, emit (m, d + w_p) instead of (m, d + 1) for each node m

Stopping Criterion
- How many iterations are needed in parallel BFS (positive edge weight case)?
- Convince yourself: when a node is first “discovered”, we’ve found the shortest path
  - A node becomes “discovered” when the cost of the node becomes non-infinite
  - Not true! With unequal weights, a path with more hops that is found in a later iteration can be cheaper

Additional Complexities
[Figure: source s and a search frontier with nodes p, q, r; one edge of weight 10 and edges of weight 1 through nodes n1 to n9.]
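A small worked example makes the complication concrete. The four-node graph below is hypothetical (it is not the graph from the figure): `r` is discovered in the first iteration through a direct edge of weight 10, but a three-hop path of total weight 3 only arrives two iterations later. The sketch assumes the driver stops when no distance changes.

```python
import math


def relax_once(dist, edges):
    """One MapReduce-style iteration: every reached node proposes d + w to each
    outlink (map), and each node keeps the minimum proposal (reduce)."""
    proposals = {node: [d] for node, d in dist.items()}
    for node, d in dist.items():
        if d < math.inf:
            for m, w in edges.get(node, {}).items():
                proposals[m].append(d + w)
    return {node: min(values) for node, values in proposals.items()}


# Hypothetical graph: a heavy direct edge s->r and a light path s->p->q->r.
edges = {"s": {"p": 1, "r": 10}, "p": {"q": 1}, "q": {"r": 1}, "r": {}}
dist = {node: math.inf for node in edges}
dist["s"] = 0

history = []  # distance of r after each iteration that changed something
while True:
    new_dist = relax_once(dist, edges)
    if new_dist == dist:
        break
    dist = new_dist
    history.append(dist["r"])

print(history)  # r is "discovered" at 10 and improves to 3 later
print(dist)
```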

Graphs and MapReduce
- Graph algorithms typically involve:
  - Performing computations at each node: based on node features, edge features, and local link structure
  - Propagating computations: “traversing” the graph
- Generic recipe:
  - Represent graphs as adjacency lists
  - Perform local computations in mapper
  - Pass along partial results via outlinks, keyed by destination node
  - Perform aggregation in reducer on inlinks to a node
  - Iterate until convergence: controlled by external “driver”
  - Don’t forget to pass the graph structure between iterations

A practical implementation
- Referenced from the following link: http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm
- A node is represented by a string as follows:
  ID    EDGES|WEIGHTS|DISTANCE_FROM_SOURCE|COLOR

Three statuses of a node
- Unvisited: color white
- Being visited: color gray
- Visited: color black
[Figure: node states White, Gray, Black.]

The mappers
- All white nodes and black nodes only reproduce themselves
- For each gray node (i.e., an exploding node):
  - For each neighbor node n in the adjacency list, emit a gray node:
    n null|null|distance of exploding node + weight|gray
  - Turn its own color to black and emit itself:
    ID edges|weights|distance from source|black

The reducers
- Receive the data for all “copies” of each node
- Construct a new node for each node, with:
  - The non-null list of edges and weights
  - The minimum distance from the source
  - The proper color

Choose the proper color
- If only receiving a copy of white node, color is white
- If only receiving a copy of black node, color is black
- If receiving copies consisting of white node and gray nodes, color is gray
- If receiving copies consisting of gray nodes and black node:
  - If minimum distance comes from black node, color is black
  - Otherwise, color is gray
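The mapper and reducer rules above can be sketched in Python as follows. This is an illustration under assumptions, not the referenced implementation: node copies are modeled as tuples `(edges, weights, distance, color)` instead of the pipe-delimited strings, unvisited nodes carry a distance of `math.inf`, ties between a black copy and a gray copy resolve to black, and the names `map_node` and `reduce_node` are hypothetical.

```python
import math


def map_node(node_id, edges, weights, distance, color):
    """Mapper: white and black nodes reproduce themselves; gray nodes explode."""
    if color != "gray":
        yield node_id, (edges, weights, distance, color)
        return
    for n, w in zip(edges, weights):
        # Gray copy for each neighbor; its edge list is unknown here, hence None.
        yield n, (None, None, distance + w, "gray")
    # The exploding node turns black and emits itself.
    yield node_id, (edges, weights, distance, "black")


def reduce_node(copies):
    """Reducer: merge all copies of one node into a single new node."""
    edges, weights = None, None
    min_distance = math.inf
    black_distance = None
    has_gray = False
    for e, w, d, color in copies:
        if e is not None:
            edges, weights = e, w          # the non-null list of edges and weights
        min_distance = min(min_distance, d)
        if color == "black":
            black_distance = d
        elif color == "gray":
            has_gray = True
    if black_distance is not None:
        # Black stays black unless a gray copy brought a strictly shorter distance.
        color = "black" if black_distance <= min_distance else "gray"
    elif has_gray:
        color = "gray"
    else:
        color = "white"
    return edges, weights, min_distance, color


print(list(map_node("a", ["b", "c"], [2, 5], 1, "gray")))
print(reduce_node([(["b"], [1], math.inf, "white"), (None, None, 4, "gray")]))
```

Turning a black node gray again when a shorter distance arrives lets it re-explode in the next iteration, which is what the weighted case needs.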

Random Walks Over the Web
- Random surfer model:
  - User starts at a random Web page
  - User randomly clicks on links, surfing from page to page
  - Or, sometimes, user jumps to a random page
- PageRank
  - Characterizes the amount of time spent on any given page
  - Mathematically, a probability distribution over pages
- PageRank captures notions of page importance
  - One of thousands of features used in web search

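PageRank: Defined
- Given page x with inlinks t_1…t_n, where
  - C(t_i) is the out-degree of t_i, i.e., the number of outgoing links from t_i
  - α is the probability of a random jump
  - N is the total number of nodes in the graph

  $PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$

[Figure: page X with inlinks from t_1, t_2, …, t_n.]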

Computing PageRank
- Properties of PageRank
  - Can be computed iteratively
  - Effects at each iteration are local
- Sketch of algorithm (ignoring random jump):
  - Start with seed PR_i values
  - Each page distributes its PR_i mass evenly to all pages it links to
  - Each target page adds up mass from multiple in-bound links to compute PR_{i+1}
  - Iterate until values converge

Sample PageRank Iteration (1)
- Before: n1 (0.2), n2 (0.2), n3 (0.2), n4 (0.2), n5 (0.2)
- Mass sent along the edges: 0.1, 0.1, 0.2, 0.2, 0.1, 0.1, 0.066, 0.066, 0.066
- After iteration 1: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3)

Sample PageRank Iteration (2)
- Before: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3)
- Mass sent along the edges: 0.033, 0.033, 0.3, 0.166, 0.083, 0.083, 0.1, 0.1, 0.1
- After iteration 2: n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.2), n5 (0.383)

PageRank in MapReduce
[Figure: data flow of one iteration. Input adjacency lists: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]. Map emits values keyed by destination node; Reduce groups them by node n1 to n5 and outputs each node with its adjacency list.]

PageRank Pseudo-Code
[Pseudo-code figure not captured in the transcript.]
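As a stand-in for the missing pseudo-code, here is a minimal Python sketch of one simplified PageRank iteration (no random jump, no dangling nodes), simulated in a single process. The function names and the tagged-tuple encoding are assumptions; the graph is the five-node adjacency list from the "PageRank in MapReduce" slide, and the uniform seed of 0.2 matches the sample iterations.

```python
from collections import defaultdict


def pagerank_map(node, rank, adjacency):
    """Mapper: pass the graph structure along and split the mass over outlinks."""
    yield node, ("graph", adjacency)
    for target in adjacency:
        yield target, ("mass", rank / len(adjacency))


def pagerank_reduce(node, values):
    """Reducer: recover the adjacency list and sum the incoming mass."""
    adjacency, total = [], 0.0
    for kind, payload in values:
        if kind == "graph":
            adjacency = payload
        else:
            total += payload
    return node, total, adjacency


def pagerank_iteration(ranks, graph):
    """One map, shuffle (group by destination), and reduce round."""
    groups = defaultdict(list)
    for node, adjacency in graph.items():
        for key, value in pagerank_map(node, ranks[node], adjacency):
            groups[key].append(value)
    new_ranks = {}
    for node, values in groups.items():
        _, total, _ = pagerank_reduce(node, values)
        new_ranks[node] = total
    return new_ranks


graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
ranks = {node: 0.2 for node in graph}
after_1 = pagerank_iteration(ranks, graph)
after_2 = pagerank_iteration(after_1, graph)
print(after_1)
print(after_2)
```

Up to rounding, the two results agree with the values shown in the two sample iterations.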

Complete PageRank
- Two additional complexities:
  - What is the proper treatment of dangling nodes?
  - How do we factor in the random jump factor?

Dangling nodes
- A dangling node is a node that has no outgoing edges
  - The adjacency list is empty
- The PageRank mass of a dangling node will get lost during the mapper stage due to the lack of outgoing edges
- Solution:
  - Reserve a special key (i.e., a special node id) for storing PageRank mass from dangling nodes
  - Mapper: dangling nodes emit the mass with the special key
  - Reducer: sum up all the missing mass with the special key
[Figure: page X with inlinks from t_1, t_2, …, t_n and no outgoing edges.]

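Second pass
- Second pass to redistribute “missing PageRank mass” and account for random jumps:

  $p' = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \left( \frac{m}{|G|} + p \right)$

  - p is the PageRank value from the first pass, p' is the updated PageRank value
  - α is the probability of a random jump
  - |G| is the number of nodes in the graph
  - m is the combined missing PageRank mass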

Complete PageRank
- One iteration of PageRank requires two passes (i.e., two MapReduce jobs)
  - The first to distribute PageRank mass along graph edges
    - Also take care of the missing mass due to dangling nodes
  - The second to redistribute the missing mass and take into account the random jump factor
    - This job requires no reducers
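The two passes can be sketched in Python as follows. This is a single-process illustration, not a MapReduce job. The second pass uses the update p' = α(1/|G|) + (1 − α)(m/|G| + p), with α the random jump probability, p the first-pass value, m the missing mass, and |G| the number of nodes; with α = 0.1 and m = 0.2 this update reproduces the sample values on the following slides. The graph, in which n3 is a dangling node, is an assumption inferred from those sample values; the names `pass_one` and `pass_two` are hypothetical.

```python
def pass_one(ranks, graph):
    """First pass: distribute mass along edges and collect the mass of
    dangling nodes separately (the role of the "special key")."""
    new_ranks = {node: 0.0 for node in graph}
    missing = 0.0
    for node, adjacency in graph.items():
        if not adjacency:
            missing += ranks[node]       # dangling node: its mass would be lost
            continue
        share = ranks[node] / len(adjacency)
        for target in adjacency:
            new_ranks[target] += share
    return new_ranks, missing


def pass_two(ranks, missing, alpha):
    """Second pass: spread the missing mass evenly and mix in the random jump.
    Each node is updated independently, so no reducer is needed."""
    size = len(ranks)
    return {
        node: alpha * (1 / size) + (1 - alpha) * (missing / size + p)
        for node, p in ranks.items()
    }


# Assumed graph: same as the earlier example except that n3 has no outlinks.
graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": [],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
ranks = {node: 0.2 for node in graph}
p, m = pass_one(ranks, graph)
p_prime = pass_two(p, m, alpha=0.1)
print(p, m)
print(p_prime)
```

With exact arithmetic n1 comes out as 0.116; the slide's 0.1154 results from using the truncated first-pass value 0.066. The updated values sum to 1, so no mass is lost.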

Sample PageRank Iteration (1), Pass 1
- Before: n1 (0.2), n2 (0.2), n3 (0.2), n4 (0.2), n5 (0.2)
- Mass sent along the edges: 0.1, 0.1, 0.2, 0.1, 0.1, 0.066, 0.066, 0.066
- After pass 1: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.1), n5 (0.3)
- Missing PR mass = 0.2

Sample PageRank Iteration (1), Pass 2
- Before: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.1), n5 (0.3); missing PR mass = 0.2
- α = 0.1, m = 0.2
- After pass 2: n1 (0.1154), n2 (0.2054), n3 (0.2054), n4 (0.146), n5 (0.326)

PageRank Convergence
- Convergence criteria:
  - Iterate until PageRank values don’t change
  - Fixed number of iterations
- Convergence for web graphs?
  - 52 iterations for a graph with 322 million edges