A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Umit Catalyurek

Lawrence Livermore National Laboratory, Livermore, CA 94551
Sandia National Laboratories, Albuquerque, NM 87185
D. E. Shaw Research and Development, New York, NY 10036
Ohio State University, Columbus, OH 43210

Abstract

Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.

1 Introduction

Data science has gained much attention in recent years owing to growth in demand for techniques to explore large-scale data in important areas such as genomics, astrophysics, and national security. Graph search plays an important role in analyzing large data sets, since the relationships between data objects are often represented in the form of graphs, such as semantic graphs [4, 14, 17]. Breadth-first search (BFS) is of particular importance among different graph search methods and is widely used in numerous applications.

An application of particular interest to us is the analysis of very large semantic graphs. A common query that arises in analyzing a semantic graph, for example, is to determine the nature of the relationship between two vertices in the graph, and such a query can be answered by finding the shortest path between those vertices using BFS. Further, BFS can be used to find a set of paths between vertices whose lengths are in a certain range. Another key area of interest is community analysis in semantic graphs [5, 20, 21, 22]. A community detection algorithm by Newman and Girvan [22], for example, iteratively invokes BFS for all pairs of vertices until it finds all the community structures in the graph.

Searching very large graphs with billions of vertices and edges, however, poses challenges, mainly due to the vast search space imposed by the large graphs. In particular, it is often impossible to store such large graphs in the main memory of a single computer. This makes the traditional PRAM-based parallel BFS algorithms [6, 10, 11, 13] unusable and calls for distributed parallel BFS algorithms, where the computation moves to the processor owning the data. Obviously, the scalability of the distributed BFS algorithm for very large graphs becomes a critical issue, since the demand for local memory and inter-processor communication increases as the graph size increases.

In this paper, we propose a scalable and efficient distributed BFS scheme that is capable of handling graphs with billions of vertices and edges. In this research, we consider Poisson random graphs, where the probability of any two vertices being connected by an edge is equal. We use Poisson random graphs mainly because there are no publicly available large real graphs with which we can test the scalability of the proposed BFS algorithm. A social network graph derived from the World Wide Web, for example, contains 15 million vertices [8], and the largest citation network available has a few million vertices [19]. In the absence of large real graphs, synthetic random graphs, which are the simplest graphs that have a small diameter, the characteristic feature of real-world networks, provide us with an easy means to construct very large graphs with billions of vertices. Furthermore, the random graphs have almost no clustering and thus have large edge-cuts when partitioned, allowing us to understand the worst-case performance of our algorithm. We use arbitrary partitionings with the constraint that the partitions are balanced in terms of the number of vertices and edges.

We achieve high scalability through a set of clever memory and communication optimizations. First, two-dimensional (2D) graph partitioning [3, 15, 16] is used instead of the more conventional one-dimensional (1D) partitioning. With the 2D partitioning, the number of processors involved in collective communications is O(sqrt(P)), in contrast to O(P) for 1D partitioning, where P is the total number of processors. Next, we derive bounds on the length of messages for Poisson random graphs. We show that, given a random graph with n vertices, the expected message length is O(n/P). This allows us to manage the local memory more efficiently and improve the scalability. Finally, we have developed scalable collectives based on point-to-point communications for BlueGene/L [1]. Here, we attempt to reduce the number of point-to-point communications, taking advantage of the high-bandwidth torus network of BlueGene/L. In the implementation of the collectives, we explore the use of reduce-scatter (where the reduction operation is set-union) rather than a straightforward use of all-to-all. It is shown that the reduce-scatter implementation significantly reduces message volume.

Our BFS scheme exhibits good scalability as it scales to a graph with 3.2 billion vertices and 32 billion edges on a BlueGene/L system with 32,768 nodes. To the best of our knowledge, this is the largest explicitly formed graph ever explored by a distributed algorithm. The performance characteristics of the proposed BFS algorithm are also analyzed and reported.

This paper is organized as follows. Section 2 describes the proposed distributed BFS algorithm. The optimization of the BFS scheme is discussed in Section 3. The experimental results are presented in Section 4, followed by concluding remarks and directions for future work in Section 5.

2 Proposed Distributed BFS Algorithm

In this section, we present the distributed BFS algorithm with 1D and 2D partitionings. The proposed algorithm is a level-synchronized BFS algorithm that proceeds level by level, starting with a source vertex, where the level of a vertex is defined as its graph distance from the source. In the following, we use P to denote the number of processors, n to denote the number of vertices in a Poisson random graph, and k to denote the average degree. The processors are mapped to a two-dimensional logical processor array, and we use R and C to denote the row and column stripes of the processor array, respectively. We consider only undirected graphs in this paper.

2.1 Distributed BFS with 1D Partitioning

A 1D partitioning of a graph is a partitioning of its vertices such that each vertex and the edges emanating from it are owned by one processor. The set of vertices owned by a processor is also called its local vertices. The following illustrates a 1D P-way partitioning using the adjacency matrix A of the graph, symmetrically reordered so that vertices owned by the same processor are contiguous:

    A = [ A_1 ]
        [ A_2 ]
        [ ... ]
        [ A_P ]

The subscripts indicate the index of the processor owning the data. The edges emanating from vertex v form its edge list, which is the list of vertex indices in row v of the adjacency matrix. For the partitioning to be balanced, each processor should be assigned approximately the same number of vertices and emanating edges.

A distributed BFS with 1D partitioning proceeds as follows. At each level, each processor has a set F, which is the set of frontier vertices owned by that processor. The edge lists of the vertices in F are merged to form a set N of neighboring vertices. Some of these vertices will be owned by the same processor, and some will be owned by other processors. For vertices in the latter case, messages are sent to the other processors (neighbor vertices are sent to their owners) to potentially add these vertices to their frontier sets for the next level. Each processor receives these sets of neighbor vertices and merges them to form a set N', which the processor owns. The processor may have marked some vertices in N' in a previous iteration. In that case, the processor will ignore this message and all subsequent messages regarding those vertices.

Algorithm 1 describes the distributed breadth-first expansion using the 1D partitioning, starting with vertex v_s. In the algorithm, every vertex v becomes labeled with its level, L_vs(v), which denotes its graph distance from v_s. The L_vs data structure is also distributed so that a processor only stores L_vs(v) for its local vertices. We assume that only one process is assigned to a processor and use process and processor interchangeably.
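For concreteness, the following minimal Python sketch simulates the level-synchronized 1D scheme of Algorithm 1 serially on a single machine. The P simulated "processors", the owner() helper, and the per-destination outboxes are our own illustrative names, standing in for the MPI processes and messages used in the actual implementation.

    # Minimal serial simulation of the 1D-partitioned, level-synchronized BFS
    # of Algorithm 1. The "processors" and message buffers are simulated; a real
    # implementation would exchange the N_q sets with MPI messages instead.
    import collections

    def bfs_1d(n, edges, source, P):
        """n vertices (0..n-1) split into P contiguous blocks; returns the level of every vertex."""
        adj = collections.defaultdict(list)
        for u, v in edges:                       # undirected graph
            adj[u].append(v)
            adj[v].append(u)

        block = (n + P - 1) // P
        owner = lambda v: v // block             # 1D (vertex) partitioning

        INF = float("inf")
        levels = [INF] * n                       # L_vs(v); stored only by v's owner in practice
        levels[source] = 0
        frontier = [set() for _ in range(P)]     # F on each processor
        frontier[owner(source)].add(source)

        level = 0
        while any(frontier):                     # stop when F is empty on all processors
            outbox = [[set() for _ in range(P)] for _ in range(P)]  # N_q per (sender, owner)
            for p in range(P):
                for v in frontier[p]:
                    for w in adj[v]:
                        outbox[p][owner(w)].add(w)   # neighbors are routed to their owners
            frontier = [set() for _ in range(P)]
            for q in range(P):
                merged = set().union(*(outbox[p][q] for p in range(P)))  # union of received N'_q
                for w in merged:
                    if levels[w] == INF:         # ignore vertices marked at earlier levels
                        levels[w] = level + 1
                        frontier[q].add(w)
            level += 1
        return levels

    if __name__ == "__main__":
        edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 4)]
        print(bfs_1d(5, edges, source=0, P=2))   # [0, 1, 2, 3, 2]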
2.2 Distributed BFS with 2D Partitioning

A 2D partitioning of a graph is a partitioning of its edges such that each edge is owned by one processor. In addition, the vertices are also partitioned such that each vertex is owned by one processor. A process stores some edges incident on its owned vertices, and some edges that are not. This partitioning can be illustrated using the adjacency matrix A of the graph, symmetrically reordered so that vertices owned by the same processor are contiguous:

    A = [ A^(1)_{1,1}  ...  A^(1)_{1,C} ]
        [     ...               ...     ]
        [ A^(1)_{R,1}  ...  A^(1)_{R,C} ]
        [     ...               ...     ]
        [ A^(C)_{1,1}  ...  A^(C)_{1,C} ]
        [     ...               ...     ]
        [ A^(C)_{R,1}  ...  A^(C)_{R,C} ]

Here, the partitioning is for P = R*C processors, logically arranged in an R x C processor mesh. We will use the terms processor-row and processor-column with respect to this processor mesh. In the 2D partitioning above, the adjacency matrix is divided into R*C block rows and C block columns. The notation A^(m)_{i,j} denotes a block owned by processor (i,j). Each processor owns C blocks. To partition the vertices, processor (i,j) owns the vertices corresponding to block row (j-1)*R + i. For the partitioning to be balanced, each processor should be assigned approximately the same number of vertices and edges. The conventional 1D partitioning is equivalent to the 2D partitioning with R = 1 or C = 1. For the 2D partitioning, we assume that the edge list for a given vertex is a column of the adjacency matrix. Thus each block in the 2D partitioning contains partial edge lists.

In BFS using this partitioning, each processor has a set F, which is the set of frontier vertices owned by that processor. Consider a vertex v in F. The owner of v sends messages to the other processors in its processor-column to tell them that v is on the frontier, since any of these processors may contain partial edge lists for v. We call this communication step the expand operation. The partial edge lists on each processor are then merged to form the set N, which contains the potential vertices on the next frontier. The vertices in N are then sent to their owners to be potentially added to the new frontier set on those processors. With 2D partitioning, these owner processors are in the same processor-row. This communication step is referred to as the fold operation.

The communication step in the 1D partitioning (steps 8-13 in Algorithm 1) is the same as the fold operation in the 2D partitioning. The advantage of 2D partitioning over 1D partitioning is that the processor-column and processor-row communications involve R and C processors, respectively; for 1D partitioning, all P processors are involved in the communication operation. Algorithm 2 describes the proposed distributed BFS algorithm using 2D partitioning. Steps 7-11 and 13-18 correspond to the expand and fold operations, respectively.
Algorithm 1 Distributed Breadth-First Expansion with 1D Partitioning

     1: Initialize L_vs(v) <- 0 if v = v_s, infinity otherwise
     2: for l = 0 to infinity do
     3:     F <- {set of local vertices with level l}
     4:     if F = {} for all processors then
     5:         Terminate main loop
     6:     end if
     7:     N <- {neighbors of vertices in F (not necessarily local)}
     8:     for all processors q do
     9:         N_q <- {vertices in N owned by processor q}
    10:         Send N_q to processor q
    11:         Receive N'_q from processor q
    12:     end for
    13:     N' <- union of the N'_q (the N'_q may overlap)
    14:     for v in N' with L_vs(v) = infinity do
    15:         L_vs(v) <- l + 1
    16:     end for
    17: end for

Algorithm 2 Distributed Breadth-First Expansion with 2D Partitioning

     1: Initialize L_vs(v) <- 0 if v = v_s, infinity otherwise
     2: for l = 0 to infinity do
     3:     F <- {set of local vertices with level l}
     4:     if F = {} for all processors then
     5:         Terminate main loop
     6:     end if
     7:     for all processors q in this processor-column do
     8:         Send F to processor q
     9:         Receive F'_q from processor q (the F'_q are disjoint)
    10:     end for
    11:     F' <- union of the F'_q
    12:     N <- {neighbors of vertices in F' using edge lists on this processor}
    13:     for all processors q in this processor-row do
    14:         N_q <- {vertices in N owned by processor q}
    15:         Send N_q to processor q
    16:         Receive N'_q from processor q
    17:     end for
    18:     N' <- union of the N'_q (the N'_q may overlap)
    19:     for v in N' with L_vs(v) = infinity do
    20:         L_vs(v) <- l + 1
    21:     end for
    22: end for
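The index bookkeeping implied by this 2D layout can be summarized in a few lines of Python. The sketch below assumes an R x C mesh with n divisible by R*C and uses the block-row ownership rule stated above; the helper names are ours and are for illustration only.

    # Sketch of the 2D-partitioning bookkeeping described above, assuming an
    # R x C processor mesh and n divisible by R*C. Helper names are ours.

    def vertex_owner(v, n, R, C):
        """Processor (i, j) owns the vertices of block row (j-1)*R + i (1-based)."""
        rows_per_block = n // (R * C)
        b = v // rows_per_block              # 0-based block-row index of vertex v
        return (b % R + 1, b // R + 1)       # 1-based mesh coordinates (i, j)

    def expand_targets(v, n, R, C):
        """Processors that may hold partial edge lists of v: the owner's processor-column."""
        _, j = vertex_owner(v, n, R, C)
        return [(i, j) for i in range(1, R + 1)]

    def fold_targets(i, C):
        """Owners of the candidate next-frontier vertices found on processor (i, j):
        they all lie in processor-row i."""
        return [(i, j) for j in range(1, C + 1)]

    if __name__ == "__main__":
        n, R, C = 24, 2, 3                   # 24 vertices on a 2 x 3 mesh (P = 6)
        print(vertex_owner(7, n, R, C))      # vertex 7 is in the 2nd block row -> processor (2, 1)
        print(expand_targets(7, n, R, C))    # the expand stays inside processor-column 1
        print(fold_targets(2, C))            # the fold stays inside processor-row 2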
In the expand operation, processors send the indices of the frontier vertices that they own to other processors. For dense matrices [9] (and even in some cases for sparse matrices [12]), this operation is traditionally implemented with an all-gather collective communication, since all indices owned by a processor need to be sent. For BFS, this is equivalent to the case where all vertices are on the frontier. This communication is not scalable as the number of processors increases. For sparse graphs, however, it is advantageous to send only the vertices on the frontier, and to send only to processors that have non-empty partial edge lists corresponding to these frontier vertices. This operation can now be implemented by an all-to-all collective communication. In the 2D case, each processor needs to store information about the edge lists of the other processors in its processor-column. The storage for this information is proportional to the number of vertices owned by a processor, and therefore it is scalable. We will show in Section 3 that for Poisson random graphs, the message lengths are scalable when communication is performed this way.

The fold operation is traditionally implemented for dense matrix computations as an all-to-all communication. An alternative is to implement the fold operation as a reduce-scatter operation. In this case, each processor receives N' directly and line 18 of Algorithm 2 is not necessary. The reduction operation, which occurs within the reduction stage of the operation, is set-union and eliminates all the duplicate vertices.

2.3 Bi-directional BFS

The BFS algorithm described above is uni-directional in that the search starts from the source and continues until it reaches the destination or all the vertices in the graph are visited. The BFS algorithm can be implemented in a bi-directional fashion as well. In bi-directional search, the search starts from both the source and destination vertices and continues until a path connecting the source and destination is found.

An obvious advantage of the bi-directional search is that the frontier of the search remains small compared to the uni-directional case. This reduces the communication volume as well as the number of memory accesses, significantly improving the performance of the search. The 1D or 2D partitioning can be used in conjunction with the bi-directional BFS. For additional details, see [23].
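As a sketch of the idea only (not of the distributed implementation), the following Python routine runs the two searches level by level, always expanding the smaller frontier, and reports the length of a shortest connecting path; the function and variable names are ours.

    # Serial sketch of bi-directional BFS: grow one search from the source and one
    # from the destination, expand the smaller frontier a whole level at a time,
    # and stop once the two settled sets share a vertex.
    import collections

    def bidirectional_bfs_distance(adj, s, t):
        if s == t:
            return 0
        dist_s, dist_t = {s: 0}, {t: 0}
        frontier_s, frontier_t = {s}, {t}
        while frontier_s and frontier_t:
            if len(frontier_s) >= len(frontier_t):       # always expand the smaller side
                dist_s, dist_t = dist_t, dist_s
                frontier_s, frontier_t = frontier_t, frontier_s
            nxt = set()
            for v in frontier_s:
                for w in adj[v]:
                    if w not in dist_s:
                        dist_s[w] = dist_s[v] + 1
                        nxt.add(w)
            frontier_s = nxt
            common = dist_s.keys() & dist_t.keys()       # the two searches have met
            if common:
                return min(dist_s[w] + dist_t[w] for w in common)
        return None                                      # source and destination not connected

    if __name__ == "__main__":
        adj = collections.defaultdict(set)
        for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
            adj[u].add(v)
            adj[v].add(u)
        print(bidirectional_bfs_distance(adj, 0, 5))     # 5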

3 Optimizations for Scalability

It was shown in the previous section that the 2D partitioning reduces the number of processors involved in collective communications. In this section we show how the BFS algorithm can be further optimized to enhance its scalability.

3.1 Bounds on message buffer length and memory optimization

A major factor limiting the scalability of our distributed BFS algorithm is that the length of the message buffers used in all-to-all collective communications grows as the number of processors increases. A key to overcoming this limitation is to use message buffers of fixed length. In the following, we derive upper bounds on the length of messages in our BFS algorithm for Poisson random graphs. Recall that we define n as the number of vertices in the graph, k as the average degree, and P as the number of processors. We assume P can be factored as R*C, the dimensions of the processor mesh in the 2D case. For simplicity, we further assume that n is a multiple of P and that each processor owns n/P vertices.
Let A_m be the matrix formed by any m rows of the adjacency matrix of the random graph. We define the useful quantity f(m) = 1 - (1 - 1/n)^(mk), which is the probability that a given column of A_m is nonzero. The quantity mk is the expected number of edges (nonzeros) in A_m. The function f(m) approaches mk/n when mk is small relative to n and approaches 1 when mk is large.

For distributed BFS with 1D partitioning, processor i owns the part A_i of the adjacency matrix. In the communication operation, the processor sends the indices of the neighbors of its frontier vertices to their owner processors. If all vertices owned by the processor are on the frontier, the expected number of neighbor vertices is n * f(n/P) * (P-1)/P. This communication length is nk/P in the worst case, which is O(n/P). The worst case, viewed another way, is equal to the actual number of nonzeros in A_i: every edge causes communication. This worst-case result is independent of the graph.

In the 2D expand communication, the indices of the vertices in the frontier set are sent to the other processors in the processor-column. In the worst case, if all n/P vertices owned by a processor are on the frontier (or if all-gather communication is used and all n/P indices are sent), the number of indices sent by the processor is (R-1) * n/P, which increases with R, and thus the message size is not controlled when the number of processors increases. The maximum expected message size is bounded as P increases, however, if a processor sends only the indices needed by another processor (all-to-all communication, but this requires knowing which indices to send): a processor only sends indices to processors that have partial edge lists corresponding to vertices owned by it. The expected number of indices is then (n/P) * f(n/R) * (R-1). The result for the 2D fold communication is similar: (n/P) * f(n/C) * (C-1). These quantities are also O(n/P) in the worst case. Thus, for both 1D and 2D partitionings, the length of the communication from a single processor is O(n/P), proportional to the number of vertices owned by a processor. Once an upper bound on the message size is determined, we can use message buffers of fixed length, independent of the number of processors used.
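A quick numerical check of these expressions (using the formulas as reconstructed above; the example sizes below are ours) illustrates that the expected per-processor message lengths shrink roughly in proportion to n/P as P grows:

    # Numerical check of the message-length expressions above; example sizes are ours.
    # f(m) = 1 - (1 - 1/n)**(m*k) is the probability that a given column of an
    # m-row slice of the adjacency matrix is nonzero.

    def f(m, n, k):
        return 1.0 - (1.0 - 1.0 / n) ** (m * k)

    def expected_lengths(n, k, P, R, C):
        one_d = n * f(n / P, n, k) * (P - 1) / P         # 1D: neighbors sent to other owners
        expand_2d = (n / P) * f(n / R, n, k) * (R - 1)   # 2D expand (processor-column)
        fold_2d = (n / P) * f(n / C, n, k) * (C - 1)     # 2D fold (processor-row)
        return one_d, expand_2d, fold_2d

    if __name__ == "__main__":
        n, k = 3.2e9, 10
        for P, R, C in [(1024, 32, 32), (4096, 64, 64), (32768, 128, 256)]:
            one_d, ex2d, fo2d = expected_lengths(n, k, P, R, C)
            # all three quantities stay O(n/P): they shrink as P grows
            print(P, int(n / P), int(one_d), int(ex2d), int(fo2d))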

3.2 Optimization of collectives for BlueGene/L

It can be deduced from the equations presented in Section 3.1 that the expected message size approaches its worst-case bound for large average degree k. This implies that all-to-all communication may not be used for very large graphs with high average degree, due to the memory constraint. To limit the size of the message buffers to a fixed length, independent of P, the collectives must be implemented based on point-to-point communication. We have developed scalable collectives using point-to-point communications, specifically designed for BlueGene/L.

Figure 1: Mapping of the logical processor array to a torus. (a) A logical processor array; (b) mapping onto the torus, with the expand and fold groups indicated.

Here, we attempt to reduce the number of point-to-point communications, taking advantage of the high-bandwidth torus interconnect of BlueGene/L, and to reduce the volume of messages transmitted.

3.2.1 Task mapping

Our BFS scheme assumes that a given graph is distributed over a two-dimensional logical processor array. This logical processor array is then mapped to the three-dimensional torus of BlueGene/L. Figure 1 illustrates this mapping: a logical processor array is mapped onto a torus. The given logical processor array is first divided into a set of planes, and then each plane is mapped to the torus in such a way that the planes in the same column are mapped to adjacent physical planes, as shown in Figure 1.b. With this mapping, the expand operation is performed by those processors in the same column of adjacent physical planes. On the other hand, the processors performing a fold operation are not in adjacent planes. These processors form processor grids on multiple planes on which the expand and fold operations are performed, as indicated by the blue (dashed) and red lines in Figure 1.b. We concentrate on improving the performance of the collectives on these grids.
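One way to realize such a plane-by-plane mapping is sketched below. The exact layout used on BlueGene/L is not fully specified here, so the tiling and the formula are illustrative assumptions only; they merely reproduce the stated property that a logical column lands on adjacent physical planes while a logical row does not.

    # Illustrative (assumed) plane-by-plane mapping of an R x C logical array onto
    # an X x Y x Z torus with Z = (R // X) * (C // Y): the array is tiled into
    # X x Y "planes", and planes belonging to the same tile-column are stacked on
    # consecutive Z levels.

    def logical_to_torus(r, c, R, C, X, Y):
        assert R % X == 0 and C % Y == 0
        br, bc = r // X, c // Y              # which X x Y tile (plane) holds (r, c)
        z = bc * (R // X) + br               # tiles of one tile-column get consecutive z
        return r % X, c % Y, z

    if __name__ == "__main__":
        R = C = 8
        X = Y = 4                            # 8 x 8 logical array onto a 4 x 4 x 4 torus
        print(logical_to_torus(1, 2, R, C, X, Y))   # (1, 2, 0)
        print(logical_to_torus(5, 2, R, C, X, Y))   # (1, 2, 1): same column, adjacent plane
        print(logical_to_torus(5, 6, R, C, X, Y))   # (1, 2, 3): same row, non-adjacent plane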

3.2.2 Optimization of the collectives

Basically, the optimized collectives for BlueGene/L are implemented using ring communication, a point-to-point communication pattern that naturally works very well on a torus interconnect, to make them scalable. We improve the performance of these collectives by shortening the diameter of the ring. In this scheme, the collective communications are performed in two phases. The idea is to divide the processors in the ring into several groups and to perform the ring communication within each group in parallel. To ensure that processors in a group can receive and process messages from the processors in all other groups, processors in each group initially send messages targeted to the other processor groups. A processor sends messages to only one processor in each group in this stage (phase 1). These messages will eventually be received by all the processors in the targeted group during the ring communication (phase 2). The processes are mapped to processors in such a way that the processors in each group form a physical ring (with wraparound edges).

Figure 2: A fold operation on a processor grid. (a) Processor grid; (b) messages received after phase 1; (c) messages received after phase 2. (The notation [S][R] denotes a set of messages sent by processors in group S to processors in group R. The sending and receiving groups are represented as a range or a comma-separated list of processors.)

The fold operation is implemented as a reduce-scatter in our optimization, where the reduction operation is set-union. That is, all the messages are scanned while being transmitted to ensure that the messages do not contain duplicate vertices.

This union operation reduces the total message volume and therefore improves the communication performance. In addition, the decrease in message volume reduces the memory accesses needed to process the received vertices. The proposed communication scheme is similar to the all-to-all personalized communication technique proposed in [24], but differs in that our scheme performs the set-union operation on the transmitted messages.

In this scheme, the processors in the same rows and columns of the processor grid are grouped together. In phase 1, all the processors in the same row group exchange messages in a ring fashion. In this row-wise communication, a processor combines the messages for all the processors in each column group and sends them to the processors in the same row. When a process adds its vertices to a received message, it only adds those that are not already in the message. After phase 1, each process has a set of vertices from all the processes in its row group (including itself) destined to the processes in its column group. These vertices are then distributed to the appropriate processes in its column group in phase 2, using point-to-point communication, to complete the fold operation.

This is illustrated in Figure 2. In this example, the fold operation is performed on a small processor grid whose processors are grouped into two row groups and three column groups, as shown in Figure 2.a. After phase 1, each processor in a row group contains the messages from all the processes in that row group to the processes in the column group that the processor belongs to (Figure 2.b). After these messages are exchanged among the column processors in phase 2, each processor has received all the messages destined to it (Figure 2.c).

The expand operation is a simpler variation of the fold operation. The difference is that each processor sends the same message to all the other processors on the processor grid.
Figure 3: An expand operation on a processor grid. (a) Messages received after phase 1; (b) messages received after phase 2. (The notation [S] denotes a set of messages sent by processors in group S to the receiving processor. The receiving processor is not specified in the notation for clarity.)

We describe this using an example in which an expand operation is performed on the processor grid depicted in Figure 2.a. In the first phase, the processors in the same column group send messages to each other. Therefore, all the messages to be sent to a row processor group have been received by the processors in that row group after phase 1, as shown in Figure 3.a. These messages are then circulated in row-wise ring communications in phase 2. After phase 2, each processor has received messages from all the other processors (Figure 3.b). The time complexity of both the fold and expand operations is linear in the dimensions of the processor grid.
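The net effect of the two phases and of the set-union reduction can be reproduced with the short serial simulation below. The grouping follows the description above (row-wise combining in phase 1, column-wise delivery in phase 2), but the ring mechanics and the task mapping are omitted, and all names are ours.

    # Serial simulation of the two-phase union-fold on an R x C processor grid.
    # msgs[(src, dst)] is the set of vertex ids that processor src wants delivered
    # to processor dst; duplicates are merged by the set-unions taken in phase 1.

    def union_fold(msgs, R, C):
        procs = [(i, j) for i in range(R) for j in range(C)]

        # Phase 1 (row-wise): (i, j) accumulates, from every sender in row i, the
        # messages destined to any processor in column j, taking set-unions.
        staged = {p: set() for p in procs}
        for (i, j) in procs:
            for jp in range(C):                       # senders (i, jp) in the same row
                for ip in range(R):                   # destinations (ip, j) in column j
                    staged[(i, j)] |= msgs.get(((i, jp), (ip, j)), set())

        # Phase 2 (column-wise): forward each staged vertex to its final destination
        # inside column j; destination membership is recovered from the original msgs.
        received = {p: set() for p in procs}
        for (i, j) in procs:
            for ip in range(R):
                wanted = set().union(*(msgs.get(((i, jp), (ip, j)), set()) for jp in range(C)))
                received[(ip, j)] |= staged[(i, j)] & wanted
        return received

    if __name__ == "__main__":
        R, C = 2, 3
        msgs = {
            ((0, 0), (1, 0)): {7, 8},
            ((0, 1), (1, 0)): {8, 9},      # the duplicate vertex 8 is merged in phase 1
            ((1, 2), (0, 2)): {4},
        }
        out = union_fold(msgs, R, C)
        print(sorted(out[(1, 0)]))          # [7, 8, 9]
        print(sorted(out[(0, 2)]))          # [4]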

4 Performance Evaluation

This section presents experimental results for the distributed BFS. We have conducted most of the experiments on IBM BlueGene/L [1]. We have also conducted some experiments on MCR [18], a large Linux cluster, for a comparative study of the performance of the proposed BFS algorithm on a more conventional computing platform.

4.1 Overview of the BlueGene/L system

BlueGene/L is a massively parallel system developed by IBM jointly with Lawrence Livermore National Laboratory [1]. BlueGene/L comprises 65,536 compute nodes (CNs) interconnected as a 64 x 32 x 32 3D torus.

Each CN contains two 32-bit PowerPC 440 processors, each with dual floating-point units. The peak performance of each CN is 5.6 GFlops running at 700 MHz, allowing the BlueGene/L system to achieve a total peak performance of 360 TFlops. BlueGene/L is also equipped with 512 MB of main memory per CN (and 32 TB of total memory). Each CN has six bi-directional torus links directly connected to its nearest neighbors in each of the three dimensions. At 1.4 Gbits/s per direction per link, the BlueGene/L system achieves a bisection bandwidth of 360 GB/s per direction.
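As a back-of-the-envelope check of these aggregate figures (the link-counting convention below, in particular the factor of two for the torus wraparound, is our assumption):

    # Rough check of the quoted aggregate numbers; the link-counting convention
    # (cutting the 64-long dimension severs 2 * 32 * 32 links because of the torus
    # wraparound) is an assumption, not taken from the text.
    cut_links = 2 * 32 * 32
    print(cut_links * 1.4 / 8)        # ~358 GB/s per direction, consistent with ~360 GB/s
    print(65536 * 5.6 / 1000)         # 65,536 nodes x 5.6 GFlops ~= 367 TFlops (~360 quoted)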

The CNs are also connected by a separate tree network in which any CN can be the root. The torus network is used mainly for communications in user applications and supports point-to-point as well as collective communications. The tree network is also used by the CNs to communicate with the I/O nodes. It can also be used for some collectives such as broadcast and reduce. A CN runs a simple run-time system called the compute node kernel (CNK) that has a very small memory footprint. The main task of the CNK is to load and execute user applications. The CNK does not provide virtual memory or multi-threading support and provides a fixed-size address space for a single user process. Many conventional system calls, including I/O requests, are function-shipped to a separate I/O node, which runs a conventional Linux operating system.

4.2 Performance Study Results

First, we have measured the scalability of the proposed BFS algorithm in weak scaling experiments on a 32,768-node BlueGene/L system and present the results in Figure 4. In a weak scaling study, we increase the global problem size as the number of processors increases. Therefore, the size of the local problem (i.e., the number of local vertices) remains constant.

The local problem size used in these experiments is 100,000 vertices, and the average degree of the graphs varies from 10 to 200. The scalability of our BFS scheme is clearly demonstrated in Figure 4.a. The largest graph used in this study has 3.2 billion vertices and 32 billion edges. To the best of our knowledge, this is the largest graph ever explored by a distributed graph search algorithm. Such high scalability of our scheme can be attributed to the fact that the length of the message buffers used in our algorithm does not increase as the size of the graphs grows.

Figure 4.a also reveals that the communication time is very small compared to the computation time (in the case with a local problem size of 100,000 vertices and an average degree of 10). This indicates that our algorithm is highly memory-intensive, as it involves very little computation. Profiling the code has confirmed that it spends most of its time in a hashing function that is invoked to process the received vertices. The communication time for the other graphs with different degrees is also very small and is omitted from the figure for clarity.

It is shown in Figure 4.a that the execution time curves increase in proportion to log P, where P is the number of processors, and this is confirmed by regression analysis. Part of the reason for the logarithmic scaling factor is that the search time for a graph depends on the length of the path between the source and destination vertices, and the path length is bounded by the diameter of the graph, which is O(log n) for a random graph with n vertices [2]. That is, n increases proportionally as P increases in a weak scaling study, and therefore the diameter of the graph (and the search time) increases in proportion to log P. The performance of the BFS algorithm improves as the average degree increases. This is expected, because as the degree of the vertices increases, the length of the path being searched decreases, and hence the search time decreases. Note, however, that for larger average degrees, the execution time increases faster than log P.

Figure 4.b shows the total volume of messages received by our BFS algorithm as a function of the number of levels used in the search. These results are for a small graph with 12 million vertices and 120 million edges. It can clearly be seen in the figure that the message volume increases quickly as the path length increases, until the path length reaches the diameter of the graph.

The scalability of the bi-directional BFS algorithm is compared with that of the uni-directional BFS for the case with an average degree of 10, as shown in Figure 4.c. Similar to the uni-directional search, the scaling factor is log P. As expected, the bi-directional search outperforms the uni-directional search. The search time of the bi-directional BFS in the worst case is only 33% of that of the uni-directional BFS. This is mainly because the bi-directional search walks a shorter distance than the uni-directional search and significantly reduces the volume of overall messages to be processed. We have verified that the total volume of messages received by each processor in the bi-directional search is orders of magnitude smaller than that in the uni-directional search.
Figure 4: Weak scaling results of the distributed BFS on a 32,768-node BlueGene/L system. (a) Mean search time; (b) message volume per level; (c) bi-directional search. |V| and k denote the number of vertices assigned to each processor and the average degree, respectively.
Figure 5: Strong scaling results of the distributed BFS on the BlueGene/L system. |V| denotes the number of vertices per processor and k denotes the average vertex degree.

We have conducted strong-scaling experiments and present the results in Figure 5.

In contrast to weak scaling, we fix the size of the graph while increasing the number of processors in the strong scaling experiments. In Figure 5, the speedup curves grow with P for small P, where P is the number of processors. For larger P, the speedup tapers off as the local problem size becomes very small and the communication overhead becomes dominant.

To understand the performance characteristics of the distributed BFS algorithm on a more conventional computing platform, we have measured its weak scaling performance on MCR [18], a large Linux cluster located at Lawrence Livermore National Laboratory. MCR has 1,152 nodes, each with two 2.4 GHz Intel Xeon processors and 4 GB of memory, interconnected with a Quadrics switch. The results are compared with those obtained on BlueGene/L and presented in Figure 6. In these experiments, 20,000 local vertices are assigned to each processor, and graphs with average degrees of 5, 10, and 50 are considered.

Figure 6.a plots the relative performance of the proposed BFS algorithm on MCR and BlueGene/L. Here, the ratio of the execution time on BlueGene/L to that on MCR is used as the performance metric. Figure 6.a reveals that the BFS algorithm runs faster on MCR than on BlueGene/L for varying average degrees, especially for small graphs. For the small graphs, the execution time of the distributed BFS algorithm is dominated by its computation time rather than its communication time. The computation time is in large part governed by the computing power of the compute nodes, and therefore running BFS on MCR, which has faster processors and memory subsystems and runs at a higher clock rate, results in a faster search time. The execution-time ratio curves in the graph, however, decrease as the size of the graphs increases. In fact, both MCR and BlueGene/L show similar performance for the graphs with 20 million vertices. This is because the increased communication overhead on MCR nullifies the performance gain obtained from its faster computing capability. This is more evident in Figure 6.b, which shows the communication overhead of the BFS algorithm running on MCR and BlueGene/L, in terms of the ratio of the communication time to the total execution time.
Figure 6: Performance comparison of the BFS algorithm on BlueGene/L and MCR. A series of weak scaling experiments was conducted for the comparison, with 20,000 local vertices per processor. (a) Relative execution time ratio; (b) communication ratio.

The communication ratio for BlueGene/L remains almost flat as the number of processors increases. This is expected because, for BlueGene/L, a 3D torus machine, the aggregate bandwidth increases proportionally as the number of processors used increases. On the other hand, the communication ratio for MCR increases at a much more rapid rate compared to BlueGene/L as the size of the graphs (and hence the communication overhead) increases.

Figures 6.a and 6.b suggest that MCR would be outperformed by BlueGene/L for very large graphs due to the high communication overhead. Unfortunately, the limited size of the MCR cluster prohibits us from performing such an analysis for larger graphs.

The performance of the 2D and 1D partitionings is compared in Table 1 for different processor topologies. We have used two graphs, which have 3.2 billion and 0.32 billion vertices, respectively, in the experiments. It can clearly be seen in the table that the communication time of the 1D partitioning is much higher than that of the 2D partitioning. The average length of the messages received by each processor per level is measured for the expand and fold operations, in addition to the total execution and communication time. The higher communication time of the 1D partitioning is due to the larger number of processors involved in collective communications. In the worst case, the communication takes about 40% of the total execution time. These results show that the 2D partitioning can reduce communication time. It is interesting to note that in some cases with lower degree, where a row-wise partition is used, the 1D partitioning outperforms the 2D partitioning with the same problem size, despite the increased communication cost. The average length of the fold messages in the 1D partitioning is comparable to that of the 2D partitioning. On the other hand, much shorter messages are exchanged during an expand operation.
Table 1: Performance results for various processor topologies on BlueGene/L. |V| denotes the number of vertices per processor and k denotes the average vertex degree. Execution and communication times are in seconds. The larger communication timings for the 1D partitioning are due to the larger number of processors involved in the collective communications.

    Graph                Topology     Execution Time   Comm. Time   Avg. Msg. Length/Level
                                                                    Expand       Fold
    |V|=100000, k=10     128 x 256    4.800            0.318        64016.70     65371.19
                         256 x 128    4.843            0.324        65315.12     64124.96
                         32768 (1D)   5.649            2.147        66640.10     9032.11
                         32768 (1D)   4.180            2.246        6379.10      66640.50
    |V|=10000, k=100     128 x 256    2.283            0.157        95573.54     115960.29
                         256 x 128    2.385            0.164        114285.92    98418.21
                         32768 (1D)   3.172            1.391        138265.36    1760.00
                         32768 (1D)   2.681            1.363        1361.99      138280.39

Not only are those expand messages transmitted locally, but the shorter messages also result in a reduction in memory accesses and a performance improvement.

In other words, with the 1D partitioning there is a trade-off between higher communication cost and lower memory-access time. The 2D partitioning should outperform the 1D partitioning for graphs with higher degree, and this was verified for a graph with fewer vertices (0.32 billion) but higher degree (100) in the table.

The effect of the average degree of a graph on the performance of the partitioning schemes is analyzed further in Figure 7, which plots the volume of messages received by a processor at each level-expansion of a search as a function of the level in the search. Graphs with 40 million vertices and varying average degrees, partitioned over a 20 x 20 processor mesh, are analyzed in this study. We have used an unreachable target vertex in the search to capture the worst-case behavior of the partitioning schemes. Figure 7.a, where graphs with average degrees of 10 and 50 are analyzed, shows that the message volume increases more slowly with 1D partitioning than with 2D partitioning for the low-degree graph as the search progresses. For the high-degree graph, 2D partitioning generates fewer messages than 1D partitioning. Further, we can determine the average degree of a Poisson random graph for which the 1D and 2D partitionings exhibit identical performance. That is, assuming R = C, we can calculate the value of k by solving the equation whose left- and right-hand sides are the per-level message lengths of the 1D and 2D partitionings derived in Section 3.1, for given n and P. We have computed such a value of k for P = 400 and n = 40,000,000 and compared the performance of the 1D and 2D partitionings for that graph in Figure 7.b. As expected, both 1D and 2D partitionings show nearly identical performance.

We demonstrate the effectiveness of our union-fold operation for BlueGene/L in Figure 8. We have used the redundancy ratio as the performance metric in this experiment. The redundancy ratio is defined as the ratio of the duplicate vertices eliminated by the union-fold operation to the total number of vertices received by a processor. Obviously, more redundant vertices can be eliminated by the union-fold operation for the graph with the higher degree (100). It is shown that the union-fold operation can save as much as 80% of the vertices received by each processor. Although the proposed union operation requires copying of the received messages, incurring additional overhead, it reduces the total number of vertices to be processed by each processor and ultimately improves overall performance by reducing the memory-access time of the processor.
Figure 7: Message volume as a function of level in a search on BlueGene/L. Graphs with 40 million vertices are used. (a) k = 10 and 50; (b) k = 34, where the value of k is derived from the equation equating the 1D and 2D per-level message lengths with P = 400 and n = 40,000,000.

Figure 8: Performance of the proposed union-fold operation for BlueGene/L (redundancy ratio versus number of processors). |V| denotes the number of vertices per processor and k denotes the average vertex degree.
The redundancy ratio declines for both graphs, however, as the number of processors increases. It has been shown in Figure 4.b that the message length increases exponentially as the search expands its frontiers, until the path length approaches the diameter of the graph, after which the message length remains constant. This means that the total number of vertices (or the total message length) received by each processor should be almost constant, independent of the number of processors, in a weak scaling run, since the diameter of the graph increases very slowly, especially for large graphs. What this implies is that the number of duplicate vertices in the received messages should be constant as well. However, in our union-fold operation each processor receives more messages as the number of processors increases, because it passes the messages using ring communications. This is why the redundancy ratio declines as more processors are used.

5 Conclusions

We have proposed a scalable parallel distributed BFS algorithm and have demonstrated its scalability on BlueGene/L with 32,768 processors in this paper.

The proposed algorithm uses 2D edge partitioning. We have shown that for Poisson random graphs the length of the messages from a single processor is proportional to the number of vertices assigned to the processor. We use this information to confine the length of the message buffers for better scalability. We have also developed efficient collective communication operations, based on point-to-point communication and designed for BlueGene/L, which utilize the high-bandwidth torus network of the machine. Using this algorithm, we have searched very large graphs with more than three billion vertices and 30 billion edges. To the best of our knowledge, this is the largest graph searched by a distributed algorithm. Furthermore, this work provides insight on how to design scalable algorithms for data- and communication-intensive applications on very large parallel computers like BlueGene/L.

Future work should address graphs besides Poisson random graphs, e.g., graphs with a large clustering coefficient and scale-free graphs, which are graphs with a few vertices of very large degree. The optimized collectives for BlueGene/L are currently implemented at the application level using MPI and thus require memory copies between buffers in the MPI library and the application. To avoid this overhead, we need to implement these collectives using the BlueGene/L low-level communication APIs. In addition, using the low-level communication APIs will allow us to deliver messages via the tree network of BlueGene/L and may enhance the performance of the collectives.

Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

References

[1] Blue Gene/L. http://cmg-rr.llnl.gov/asci/platforms/bluegenel.
[2] B. Bollobas. The diameter of random graphs. Transactions of the American Mathematical Society, 267:41-52, 1981.
[3] U. V. Catalyurek and C. Aykanat. A hypergraph-partitioning approach for coarse-grain decomposition. In ACM/IEEE SC2001, Denver, CO, November 2001.
[4] M. Chein and M.-L. Mugnier. Conceptual graphs: fundamental notions. Revue d'intelligence artificielle, 6(4):365-406, 1992.
[5] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Phys. Rev. E, 70(6):066111, Dec. 2004.
[6] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders. A parallelization of Dijkstra's shortest path algorithm. Lecture Notes in Computer Science, 1450:722-731, 1998.
[7] J. Duch and A. Arenas. Community detection in complex networks using extremal optimization. arXiv:cond-mat/0501368, Jan. 2005.
[8] C. Faloutsos, K. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 118-127, Seattle, WA, USA, 2004. ACM Press.
[9] G. Fox, M. Johnson, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice-Hall, Inc., 1988.
[10] A. Y. Grama and V. Kumar. A survey of parallel search algorithms for discrete optimization problems, 1993.
[11] Y. Han, V. Y. Pan, and J. H. Reif. Efficient parallel algorithms for computing all pair shortest paths in directed graphs. In ACM Symposium on Parallel Algorithms and Architectures, pages 353-362, 1992.
[12] B. Hendrickson, R. Leland, and S. Plimpton. An efficient parallel algorithm for partitioning irregular graphs. Int. Journal of High Speed Computing, 7(1):73-88, 1995.
[13] P. N. Klein and S. Subramanian. A randomized parallel algorithm for single-source shortest paths. J. Algorithms, 25(2):205-220, 1997.
[14] R. Levinson. Towards domain-independent machine intelligence. In G. Mineau, B. Moulin, and J. Sowa, editors, Proc. 1st Int. Conf. on Conceptual Structures, volume 699, pages 254-273, Quebec City, Canada, 1993. Springer-Verlag, Berlin.
[15] J. G. Lewis, D. G. Payne, and R. A. van de Geijn. Matrix-vector multiplication and conjugate gradient algorithms on distributed memory computers. In Proceedings of the Scalable High Performance Computing Conference, pages 542-550, 1994.
[16] J. G. Lewis and R. A. van de Geijn. Distributed memory matrix-vector multiplication and conjugate gradient algorithms. In Proceedings of Supercomputing '93, pages 484-492, Portland, OR, November 1993.
[17] K. Macherey, F. Och, and H. Ney. Natural language understanding using statistical machine translation, 2001.
[18] Multiprogrammatic Capability Cluster (MCR). http://www.llnl.gov/linux/mcr.
[19] M. E. J. Newman. From the cover: The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98:404-409, 2001.
[20] M. E. J. Newman. Detecting community structure in networks. European Physical Journal B, 38:321-330, May 2004.
[21] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 69(6):066133, June 2004.
[22] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, Feb. 2004.
[23] I. Pohl. Bi-directional search. Machine Intelligence, 6:127-140, 1971. Eds. Meltzer and Michie, Edinburgh University Press.
[24] Y.-J. Suh and K. G. Shin. All-to-all personalized communication in multidimensional torus and mesh networks. IEEE Trans. on Parallel and Distributed Systems, 12:38-59, 2001.