Graph Algorithms
Adapted from UMD Jimmy Lin's slides, which are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Revised based on the slides by Ruoming Jin @ Kent State
Outline
- Graph problems and representations
- Parallel breadth-first search
- PageRank
Outline
- Graph problems and representations
- Parallel breadth-first search
- PageRank
What's a graph?
G = (V, E), where
- V represents the set of vertices (nodes)
- E represents the set of edges (links)
Both vertices and edges may contain additional information.
Different types of graphs:
- Directed vs. undirected edges
- Presence or absence of cycles
Graphs are everywhere:
- Hyperlink structure of the Web
- Physical structure of computers on the Internet
- Interstate highway system
- Social networks
Source: Wikipedia (Königsberg)
Some Graph Problems
- Finding shortest paths: routing Internet traffic and UPS trucks
- Finding minimum spanning trees: telcos laying down fiber
- Finding max flow: airline scheduling
- Identifying "special" nodes and communities: breaking up terrorist cells, spread of avian flu
- Bipartite matching: Monster.com, Match.com
- And of course... PageRank
Graphs and MapReduce
Graph algorithms typically involve:
- Performing computations at each node: based on node features, edge features, and local link structure
- Propagating computations: "traversing" the graph
Key questions:
- How do you represent graph data in MapReduce?
- How do you traverse a graph in MapReduce?
Representing Graphs
G = (V, E). Two common representations:
- Adjacency matrix
- Adjacency list
Adjacency Matrices
Represent a graph as an n x n square matrix M, where n = |V| and Mij = 1 means there is a link from node i to node j.

        n1  n2  n3  n4
    n1   0   1   0   1
    n2   1   0   1   1
    n3   1   0   0   0
    n4   1   0   1   0
Adjacency Matrices: Critique
Advantages:
- Amenable to mathematical manipulation
- Iteration over rows and columns corresponds to computations on outlinks and inlinks
Disadvantages:
- Lots of zeros for sparse matrices
- Lots of wasted space
Adjacency Lists
Take adjacency matrices... and throw away all the zeros:
    1: 2, 4
    2: 1, 3, 4
    3: 1
    4: 1, 3
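The throw-away-the-zeros step can be sketched in Python. This is a minimal illustration using the 4-node matrix from the previous slide; the dictionary-of-lists shape is one common in-memory encoding, not the only one:

```python
# The adjacency matrix from the slide, one row per node (n1..n4).
matrix = [
    [0, 1, 0, 1],  # n1 -> n2, n4
    [1, 0, 1, 1],  # n2 -> n1, n3, n4
    [1, 0, 0, 0],  # n3 -> n1
    [1, 0, 1, 0],  # n4 -> n1, n3
]

# Keep only the column indices where the bit is 1 (1-indexed node ids).
adj_list = {
    i + 1: [j + 1 for j, bit in enumerate(row) if bit]
    for i, row in enumerate(matrix)
}
print(adj_list)  # {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
```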
Adjacency Lists: Critique
Advantages:
- Much more compact representation
- Easy to compute over outlinks
Disadvantages:
- Much more difficult to compute over inlinks
Outline
- Graph problems and representations
- Parallel breadth-first search
- PageRank
Single-Source Shortest Path
Problem: find the shortest path from a source node to one or more target nodes. "Shortest" might also mean lowest weight or cost.
First, a refresher: Dijkstra's Algorithm
Dijkstra's Algorithm
1. Assign every node a tentative distance: zero for the initial (source) node and infinity for all other nodes.
2. Mark all nodes unvisited. Create a set of all the unvisited nodes, called the unvisited set. Set the initial node as the current node.
3. For the current node, consider all of its unvisited neighbors and calculate their tentative distances. Compare each newly calculated tentative distance with the currently assigned value, and keep the smaller one.
4. When we are done considering all the neighbors of the current node, mark the current node as visited and remove it from the unvisited set. A visited node will never be checked again.
5. If all destination nodes have been marked visited, or if the smallest tentative distance among the nodes in the unvisited set is infinity, stop: the algorithm has finished.
6. Otherwise, select the unvisited node with the smallest tentative distance, set it as the new current node, and go back to step 3.
Dijkstra's Algorithm Example
(Animation figure omitted; example from CLR.) Five nodes n1-n5 with weighted directed edges. Starting from n1, the tentative distances for (n1, n2, n3, n4, n5) evolve step by step as the frontier expands: (0, 10, 5, ∞, ∞) → (0, 8, 5, 14, 7) → (0, 8, 5, 13, 7) → (0, 8, 5, 9, 7), the final shortest-path distances.
Dijkstra's Algorithm (pseudocode)
    d[s] ← 0
    d[v] ← ∞ for all v ≠ s
    Q ← V
    while Q ≠ ∅:
        u ← ExtractMin(Q)            // unvisited node with smallest tentative distance
        for each neighbor v of u:
            d[v] ← min(d[v], d[u] + w(u, v))
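As a concrete sketch, here is a standard priority-queue implementation of the steps above in Python. The node names and edge weights follow the CLR example from the earlier animation; treat this as an illustration, not the deck's own code:

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths.
    adj maps each node to a list of (neighbor, edge_weight) pairs."""
    dist = {v: float("inf") for v in adj}   # tentative distances
    dist[source] = 0
    pq = [(0, source)]                      # (distance, node) min-heap
    visited = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue                        # stale queue entry; u is final
        visited.add(u)
        for v, w in adj[u]:
            if d + w < dist[v]:             # relax edge (u, v)
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

# The graph from the CLR example (weights as in the animation):
graph = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n3", 2), ("n4", 1)],
    "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
    "n4": [("n5", 4)],
    "n5": [("n1", 7), ("n4", 6)],
}
print(dijkstra(graph, "n1"))
# {'n1': 0, 'n2': 8, 'n3': 5, 'n4': 9, 'n5': 7}
```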
Single-Source Shortest Path
Problem: find the shortest path from a source node to one or more target nodes. "Shortest" might also mean lowest weight or cost.
- Sequential algorithm: Dijkstra's Algorithm
- MapReduce: parallel breadth-first search (BFS)
Finding the Shortest Path
Consider the simple case of equal edge weights (e.g., weight = 1). The solution to the problem can be defined inductively. Here's the intuition:
- Define: b is reachable from a if b is on the adjacency list of a
- DistanceTo(s) = 0
- For all nodes p reachable from s, DistanceTo(p) = 1
- For all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min(DistanceTo(m)), m ∈ M
Visualizing Parallel BFS
(Figure omitted: a ten-node graph n0-n9 in which the search frontier expands outward from the source, one hop at a time.)
From Intuition to Algorithm
Data representation:
- Key: node n
- Value consists of two components: d (distance from source node) and the adjacency list (list of nodes reachable from n)
- Initialization: for all nodes except the source node, d = ∞
Mapper:
- ∀m ∈ adjacency list: emit (m, d + 1)
Sort/Shuffle:
- Groups distances by reachable nodes
Reducer:
- Selects the minimum-distance path for each node
- Additional bookkeeping is needed to keep track of the actual path
Multiple Iterations Needed
- Each MapReduce iteration advances the "known frontier" by one hop
- Subsequent iterations include more and more reachable nodes as the frontier expands
- Multiple iterations are needed to explore the entire graph
Preserving graph structure:
- Problem: Where did the adjacency list go?
- Solution: the mapper emits (n, adjacency list) as well
BFS Pseudo-Code
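The pseudo-code on this slide is an image in the original deck. As a hedged substitute, here is a minimal in-memory Python simulation of one parallel-BFS iteration, assuming equal edge weights and the structure-preserving emits described on the previous slides (the function names are mine, not from the deck):

```python
from collections import defaultdict

INF = float("inf")

def bfs_map(node_id, node):
    """node = (distance, adjacency_list). Emit the graph structure so it
    survives the iteration, the node's own distance, and d + 1 to neighbors."""
    d, adj = node
    yield node_id, ("graph", adj)
    yield node_id, ("dist", d)
    if d != INF:
        for m in adj:
            yield m, ("dist", d + 1)

def bfs_reduce(node_id, values):
    """Recover the adjacency list and keep the minimum distance seen."""
    adj, best = [], INF
    for kind, v in values:
        if kind == "graph":
            adj = v
        else:
            best = min(best, v)
    return best, adj

def bfs_iteration(nodes):
    groups = defaultdict(list)               # simulates sort/shuffle
    for nid, node in nodes.items():
        for k, v in bfs_map(nid, node):
            groups[k].append(v)
    return {nid: bfs_reduce(nid, vals) for nid, vals in groups.items()}

# Source n1; the frontier advances one hop per iteration:
nodes = {"n1": (0, ["n2", "n3"]), "n2": (INF, ["n4"]),
         "n3": (INF, ["n4"]), "n4": (INF, [])}
nodes = bfs_iteration(nodes)                 # n2, n3 discovered at distance 1
nodes = bfs_iteration(nodes)                 # n4 discovered at distance 2
print({n: d for n, (d, _) in nodes.items()})
# {'n1': 0, 'n2': 1, 'n3': 1, 'n4': 2}
```

An external driver would rerun `bfs_iteration` until no distance changes, mirroring the iterate-until-convergence recipe later in the deck.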
Stopping Criterion
- How many iterations are needed in parallel BFS (equal edge weight case)?
- Six degrees of separation?
- Practicalities of implementation in MapReduce
Comparison with Dijkstra
- Dijkstra's algorithm is more efficient: at any step it only pursues edges from the minimum-cost path inside the frontier
- MapReduce explores all paths in parallel
- Lots of "waste": useful work is done only at the "frontier"
- Non-useful work can be avoided
Weighted Edges
- Now add positive weights to the edges
- Simple change: the adjacency list now includes a weight w for each edge
- In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m, where w is the weight of the edge to m
Stopping Criterion
- How many iterations are needed in parallel BFS (positive edge weight case)?
- Convince yourself: when a node is first "discovered", we've found the shortest path to it
- A node becomes "discovered" when its cost becomes non-infinite
- Not true!
Additional Complexities
(Figure omitted: a graph with source s, nodes p, q, r, n1-n9, and a search frontier. One direct edge of weight 10 reaches a node immediately, but the true shortest path runs through a longer chain of weight-1 edges, so the first "discovery" of the node is not its shortest path.)
Graphs and MapReduce
Graph algorithms typically involve:
- Performing computations at each node: based on node features, edge features, and local link structure
- Propagating computations: "traversing" the graph
Generic recipe:
- Represent graphs as adjacency lists
- Perform local computations in the mapper
- Pass along partial results via outlinks, keyed by destination node
- Perform aggregation in the reducer on the inlinks to a node
- Iterate until convergence: controlled by an external "driver"
- Don't forget to pass the graph structure between iterations
A Practical Implementation
Based on the approach described at:
http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm
A node is represented by a string as follows:
ID EDGES|WEIGHTS|DISTANCE_FROM_SOURCE|COLOR
Three Statuses of a Node
- Unvisited: color white
- Being visited: color gray
- Visited: color black
White → Gray → Black
The Mappers
- All white nodes and black nodes simply reproduce themselves
- For each gray node (i.e., an "exploding" node):
  - For each neighbor node n in the adjacency list, emit a new gray node:
    n null|null|(distance of exploding node + weight)|gray
  - Turn its own color to black and emit itself:
    ID edges|weights|distance from source|black
The Reducers
Receive the data for all "copies" of each node and construct a new node for each, with:
- The non-null list of edges and weights
- The minimum distance from the source
- The proper color
Choose the Proper Color
- If only receiving a copy of a white node, the color is white
- If only receiving a copy of a black node, the color is black
- If receiving copies consisting of a white node and gray nodes, the color is gray
- If receiving copies consisting of gray nodes and a black node:
  - If the minimum distance comes from the black node, the color is black
  - Otherwise, the color is gray
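The color rules above can be sketched as a small helper. This is a hypothetical function of my own, assuming each reducer input copy is a (color, distance) pair:

```python
def choose_color(copies):
    """copies: list of (color, distance) pairs received by the reducer
    for one node. Returns the proper color per the rules above."""
    colors = {c for c, _ in copies}
    if colors == {"white"}:
        return "white"
    if colors == {"black"}:
        return "black"
    if "black" not in colors:
        return "gray"                       # white and/or gray copies
    # Gray and black copies together: black wins only when the
    # minimum distance comes from the black copy.
    best_color, _ = min(copies, key=lambda cd: cd[1])
    return "black" if best_color == "black" else "gray"

print(choose_color([("gray", 5), ("black", 3)]))  # black: already final
print(choose_color([("gray", 2), ("black", 3)]))  # gray: must re-expand
```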
Outline
- Graph problems and representations
- Parallel breadth-first search
- PageRank
Random Walks Over the Web
Random surfer model:
- User starts at a random Web page
- User randomly clicks on links, surfing from page to page
- Or, sometimes, the user jumps to a random page
PageRank:
- Characterizes the amount of time spent on any given page
- Mathematically, a probability distribution over pages
- PageRank captures notions of page importance
- One of thousands of features used in web search
PageRank: Defined
Given page x with inlinks t1...tn:

    PR(x) = α(1/N) + (1 − α) Σ_{i=1..n} PR(t_i)/C(t_i)

where:
- C(t_i) is the out-degree of t_i, i.e., the number of outgoing links from t_i
- α is the probability of a random jump
- N is the total number of nodes in the graph
Computing PageRank
Properties of PageRank:
- Can be computed iteratively
- Effects at each iteration are local
Sketch of algorithm (ignoring the random jump):
- Start with seed PR_i values
- Each page distributes its PR_i mass to all pages it links to
- Each target page adds up the mass from multiple in-bound links to compute PR_{i+1}
- Iterate until values converge
Sample PageRank Iteration (1)
(Figure omitted.) Each of the five nodes starts with PageRank 0.2 and distributes it evenly along its outlinks. After Iteration 1: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.3, n5 = 0.3.
Sample PageRank Iteration (2)
(Figure omitted.) After Iteration 2: n1 = 0.1, n2 = 0.133, n3 = 0.183, n4 = 0.2, n5 = 0.383.
PageRank in MapReduce
(Figure omitted.) The adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3] flow through the Map phase (each node emits mass to its outlink destinations, plus its own graph structure), the shuffle (grouping by destination node), and the Reduce phase (summing the incoming mass for each node).
PageRank Pseudo-Code
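The pseudo-code here is also an image in the original deck. Below is a hedged in-memory sketch of the first-pass job, including the special dangling-mass key described on the next slides; names such as `DANGLING` and `pr_iteration` are my own:

```python
from collections import defaultdict

DANGLING = "__dangling__"          # special key for mass from dangling nodes

def pr_map(node_id, rank, adj):
    yield node_id, ("graph", adj)          # preserve graph structure
    if adj:
        share = rank / len(adj)            # distribute mass along outlinks
        for m in adj:
            yield m, ("mass", share)
    else:
        yield DANGLING, ("mass", rank)     # dangling: mass would be lost

def pr_iteration(ranks, graph):
    groups = defaultdict(list)             # simulates sort/shuffle
    for nid, adj in graph.items():
        for k, v in pr_map(nid, ranks[nid], adj):
            groups[k].append(v)
    new_ranks, missing = {}, 0.0
    for nid, vals in groups.items():
        total = sum(v for kind, v in vals if kind == "mass")
        if nid == DANGLING:
            missing = total                # saved for the second pass
        else:
            new_ranks[nid] = total
    return new_ranks, missing

# The 5-node example graph from the sample-iteration slides:
graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
ranks = {n: 0.2 for n in graph}            # uniform seed values
ranks, missing = pr_iteration(ranks, graph)
# ranks ≈ {n1: 0.066, n2: 0.166, n3: 0.166, n4: 0.3, n5: 0.3}; missing = 0
```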
Complete PageRank
Two additional complexities:
- What is the proper treatment of dangling nodes?
- How do we factor in the random jump factor?
Dangling Nodes
A dangling node is a node that has no outgoing edges: its adjacency list is empty. The PageRank mass of a dangling node would be lost during the map stage due to the lack of outgoing edges.
Solution: reserve a special key (i.e., a special node id) for storing PageRank mass from dangling nodes.
- Mapper: dangling nodes emit their mass with the special key
- Reducer: sum up all the missing mass under the special key
Second Pass
A second pass redistributes the "missing PageRank mass" and accounts for random jumps:

    p' = α(1/|G|) + (1 − α)(m/|G| + p)

where:
- p is the PageRank value from the first pass, and p' is the updated PageRank value
- |G| is the number of nodes in the graph
- m is the combined missing PageRank mass
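Plugging in the sample values from the surrounding slides (α = 0.1, m = 0.2, |G| = 5), a one-line function reproduces the Pass 2 numbers shown in the worked example:

```python
# Sample values from the slides: jump probability, missing mass, graph size.
alpha, m, G = 0.1, 0.2, 5

def second_pass(p):
    """p' = alpha(1/|G|) + (1 - alpha)(m/|G| + p)"""
    return alpha * (1 / G) + (1 - alpha) * (m / G + p)

print(round(second_pass(0.066), 4))   # 0.1154 (n1 on the sample Pass 2 slide)
print(round(second_pass(0.1), 4))     # 0.146  (n4 on the sample Pass 2 slide)
```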
Complete PageRank
One iteration of PageRank requires two passes (i.e., two MapReduce jobs):
- The first distributes PageRank mass along graph edges, and also takes care of the missing mass due to dangling nodes
- The second redistributes the missing mass and takes the random jump factor into account; this job requires no reducers
Sample PageRank Iteration (1), Pass 1
(Figure omitted.) With a dangling node in the graph, some mass goes missing during Pass 1. After Pass 1: n1 = 0.066, n2 = 0.166, n3 = 0.166, n4 = 0.1, n5 = 0.3, with missing PageRank mass m = 0.2.
Sample PageRank Iteration (1), Pass 2
(Figure omitted.) With α = 0.1 and m = 0.2, applying the second-pass formula to the Pass 1 values gives: n1 = 0.1154, n2 = 0.2054, n3 = 0.2054, n4 = 0.146, n5 = 0.326.
PageRank Convergence
Convergence criteria:
- Iterate until PageRank values don't change
- Fixed number of iterations
Convergence for web graphs?
- 52 iterations for a graph with 322 million edges