# AHighP erformanceSparseCholeskyF actorizationAlgorithmF or ScalableP arallelComputers GeorgeKarypisandVipinKumar Departmen tofComputerScience Univ ersit yofMinnesota MinneapolisMN ec hnicalReport Abs PDF document - DocSlides

2014-12-12 274K 274 0 0

##### Description

This algorithm uses subforesttosub cub e mapping instead of the subtreetosub cub e mapping of another recen tly in tro duced sc heme b y Gupta and Kumar 13 Asymptotically b oth form ulations are equally scalable on a wide range of arc hitectures an ID: 22871

**Direct Link:**

**Embed code:**

## Download this pdf

DownloadNote - The PPT/PDF document "AHighP erformanceSparseCholeskyF actoriz..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

## Presentations text content in AHighP erformanceSparseCholeskyF actorizationAlgorithmF or ScalableP arallelComputers GeorgeKarypisandVipinKumar Departmen tofComputerScience Univ ersit yofMinnesota MinneapolisMN ec hnicalReport Abs

Page 1

AHighP erformanceSparseCholeskyF actorizationAlgorithmF or ScalableP arallelComputers GeorgeKarypisandVipinKumar Departmen tofComputerScience Univ ersit yofMinnesota Minneapolis,MN55455 ec hnicalReport94-41 Abstract This pap er presen ts a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-sub cub e mapping instead of the subtree-to-sub cub e mapping of another recen tly in tro duced sc heme b y Gupta and Kumar [13]. Asymptotically , b oth form ulations are equally scalable on a wide range of arc hitectures and a wide v ariet y of problems. But the subtree-to-sub cub e mapping of the earlier form ulation causes signi can t load im balance among pro cessors, limiting o erall eciency and sp eedup. The new mapping largely eliminates the load im balance among pro cessors. F urthermore, the algorithm hasan um b er of enhancemen ts to impro e the o erall p erformance substan tially . This new algorithm ac hiev es up to 6GFlops on a 256-pro cessor Cra y T3D for mo derately large problems. T o our kno wledge, this is the highest p erformance ev er obtained on an MPP for sparse Cholesky factorization. In troduction Direct metho ds for solving sparse linear systems are imp ortan t b ecause of their generalit y and robustness. F or linear systems arising in certain applications, suc h as linear programming and some structural engineering applications, they are the only feasible metho ds for n umerical factorization. It is w ell kno wn that dense matrix factorization can b e implemen ted ecien tly on distributed-memory parallel computers [4 , 27, 7 , 22]. Ho ev er, despite inheren t parallelism in sparse sparse direct metho ds, not m uc h success has b een ac hiev ed to date in dev eloping their scalable parallel form ulations [15 , 38], and for sev eral y ears, it has b een a c hallenge to implemen t ecien t sparse linear system solv ers using direct metho ds on ev en mo derately parallel computers. In [38 ], Sc hreib er concludes that it is not y et clear whether sparse direct solv ers can b e made comp etitiv eat all for highly ( 256) and massiv ely ( 4096) parallel computers. A parallel form ulation for sparse matrix factorization can b e easily obtained b y simply distributing ro ws to di eren t pro cessors [8 ]. Due to the sparsit y of the matrix, comm unicatio no erhead is a large fraction of the computation for this metho d, resulting in p o or scalabilit . In particular, for sparse matrices arising out of planar nite elemen t graphs, the iso eciency of suc h a form ulation is log ), that is the problem size (in terms of total n um b er of computation) should gro was log ) to main tain a xed eciency In a smarter parallel form ulation [11 ], the ro ws of the matrix are allo cated to pro cessors using the subtree- to-sub cub e mapping. This lo calizes the comm unication among groups of pro cessors, and th us impro es the iso eciency of the sc heme to ). Roth b erg and Gupta [36 ,35 ] used a di eren t metho d to reduce the comm uni cation o erhead. In their metho d, the en tire sparse matrix is partitioned among pro cessors using a t o-dimensional blo c k cyclic mapping. This reduces the comm unicatio no erhead and impro es the iso eciency to log ). Gupta and Kumar [13] recen tly dev elop ed a parallel form ulation of sparse Cholesky factorization based on the m ultifron tal metho d. The multifr ontal metho d [2 ,23 ] is a form of submatrix Cholesky , in whic single elimination steps are p erformed on a sequence of small, dense fr ontal matric es . One of the adv an tages of m ultifron tal metho ds is that the fron tal matrices are dense, and therefore the elimination steps can b e

Page 2

implemen ted ecien tly using lev el three BLAS primitiv es. This algorithm has t ok ey features. It uses the subtree-to-sub cub e mapping to lo calize comm unicatio n among pro cessors, and it uses the highly scalable o-dimensional grid partitioning for dense matrix factorization for eac h sup erno dal computation in the ultifron tal algorithm. As a result, the comm unicati on o erhead of this sc heme is the lo est of all other kno wn parallel form ulations for sparse matrix factorization [24 , 25, 1, 31, 32 , 39, 8, 38 , 17, 33 ,37 ,3,6,18 15, 40 ,26 ,12 ,36 ,35 ]. In fact, asymptotically , the iso eciency of this sc heme is ) for sparse matrices arising out of t o- and three-dimensional nite elemen t problems on a wide v ariet yofarc hitectures suc as h yp ercub e, mesh, fat tree, and three-dimensional torus. Note that the iso eciency of the b est kno wn parallel form ulation of dense matrix factorization is also ) [22 ]. On a v ariet y of problems, Gupta and Kumar rep ort sp eedup of up to 364 on a 1024-pro cessor nCUBE 2, whic h is a ma jor impro emen to er the previously existing algorithms. Ho ev er, the subtree-to-sub cub e mapping results in gross im balance of load among di eren t pro cessors, as elimination trees for most practical problems tend to b e un balanced. This load im balance is resp onsible for a ma jor p ortion of the eciency loss of their sc heme. F urthermore, the o erall computation rate of their single pro cessor m ultifron tal co de on nCUBE 2 w as only 0.7MFlops and the maxim um o erall p erformance on a 1028-pro cessor nCUBE 2 w as only 300MFlops. This w as partly due to the slo w pro cessors of nCUBE 2 (3.5 MFlops p eak), and partly due to inadequacies in the implemen tation. This pap er presen ts a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-sub cub e mapping instead of the subtree-to-sub cub e mapping of the old sc heme. The new map- ping largely eliminates the load im balance among pro cessors. F urthermore, the algorithm has a n um ber of enhancemen ts to impro e the o erall p erformance substan tially . This new algorithm ac hiev es up to 6GFlops on a 256-pro cessor Cra y T3D for mo derately large problems (ev en the biggest problem w e tried to ok less than t o seconds on a 256-no de T3D. F or larger problems, ev en higher p erformance can b e ac hiev ed). T our kno wledge, this is the highest p erformance ev er obtained on an MPP for sparse Cholesky factorization. Our new sc heme, lik e the sc heme of Gupta and Kumar [13 ], has an asymptotic iso eciency of ) for matrices arising out of t o- and three-dimensional nite elemen t problems on a wide v ariet yofarc hitectures suc hash yp ercub e, mesh, fat tree, and three-dimensional torus. The rest of the pap er is organized as follo ws. Section 2 presen ts a general o erview of the Cholesky factorization pro cess and m ultifron tal metho ds. Section 3 pro vides a brief description of the algorithm in [13 ]. Section 4 describ es our new algorithm. Section 5 describ es some further enhancemen ts of the algorithm that signi can tly impro e the p erformance. Section 6 pro vides the exp erimen tal ev aluation of our new algorithms on a Cra y T3D. Section 7 con tains concluding remarks. Due to space limitati ons, man y imp ortan t topics, including the theoretical p erformance analysis of our algorithm ha e b een mo ed to the app endices. CholeskyF actorization Consider a system of linear equations Ax where is an symmetric p ositiv e de nite matrix, is a kno wn v ector, and is the unkno wn solution ector to b e computed. One w y to solv e the linear system is rst to compute the Cholesky factorization LL where the Cholesky factor isalo er triangular matrix. The solution v ector can b e computed b successiv e forw ard and bac k substitutions to solv e the triangular systems Ly b; y: If is sparse, then during the course of the factorization, some en tries that are initially zero in the upp er triangle of ma y b ecome nonzero en tries in . These newly created nonzero en tries of are kno wn as l l-in . The amoun t of ll-in generated can b e decreased b y carefully reordering the ro ws and columns of prior to factorization. More precisely ,w e can c ho ose a p erm utation matrix suc h that the Cholesky factors

Page 3

of PAP ha e minim al ll-in. The problem of nding the b est ordering for that minim izes the amoun of ll-in is NP-complete [41 ], therefore a n um b er of heuristic algorithms for ordering ha e b een dev elop ed. In particular, minim um degree ordering [9 ,14 , 10] is found to ha elo w ll-in. or a giv en ordering of a matrix, there exists a corresp onding elimination tree. Eac h no de in this tree is a column of the matrix. No de is the paren tofnode j>i )if i;j is the rst nonzero en try in column . Elimination of ro ws in di eren t subtrees can pro ceed concurren tly .F or a giv en matrix, elimination trees of smaller heigh t usually ha e greater concurrency than trees of larger heigh t. A desirable ordering for parallel computers m ust increase the amoun t of concurrency without increasing ll-in substan tially . Sp ectral nested dissection [29 ,30 , 19] has b een found to generate orderings that ha e b oth lo w ll-in and go o d parallelism. F or the exp erimen ts presen ted in this pap er w e used sp ectral nested dissection. F or a more extensiv e discussion on the e ect of orderings to the p erformance of our algorithm refer to [21 ]. In the m ultifron tal metho d for Cholesky factorization, a fron tal matrix and an up date matrix is asso ciated with eac hnode of the elimination tree. The ro ws and columns of corresp onds to + 1 indices of in increasing order. In the b eginning is initialized to an ( +1) + 1) matrix, where + 1 is the um b er of nonzeros in the lo er triangular part of column of . The rst ro w and column of this initial is simply the upp er triangular part of ro and the lo er triangular part of column of . The remainder of is initialized to all zeros. The tree is tra ersed in a p ostorder sequence. When the subtree ro oted at a no de has b een tra ersed, then b ecomes a dense ( +1) + 1) matrix, where is the n um ber of o diagonal nonzeros in If is a leaf in the elimination tree of , then the nal is the same as the initial . Otherwise, the nal for eliminating no de is obtained b y merging the initial with the up date matrices obtained from all the subtrees ro oted at via an extend-add op eration. The extend-add is an asso ciativ e and comm utativ op erator on t o up date matrices suc h the index set of the result is the union of the index sets of the original up date matrices. Eac hen try in the original up date matrix is mapp ed on to some lo cation in the accum ulated matrix. If en tries from b oth matrices o erlap on a lo cation, they are added. Empt yen tries are assigned a alue of zero. After has b een assem bled, a single step of the standard dense Cholesky factorization is p erformed with no de as the piv ot. A t the end of the elimination step, the column with index is remo ed from and forms the column of . The remaining matrix is called the up date matrix and is passed on to the paren tof in the eliminatio n tree. Since matrices are symmetric, only the upp er triangular part is stored. F or further details on the m ultifron tal metho d, the reader should refer to App endix A, and to the excellen t tutorial b y Liu [23 ]. If some consecutiv ely n um b ered no des form a c hain in the eliminatio n tree, and the corresp onding ro ws of ha e iden tical nonzero structure, then this c hain is called a sup erno de . The sup erno dal elimination tr is similar to the elimination tree, but no des forming a sup erno de are collapsed together. In the rest of this pap er w e use the sup erno dal m ultifron tal algorithm. An y reference to the eliminatio n tree or a no de of the elimination tree actually refers to a sup erno de and the sup erno dal elimination tree. EarlierW orkonP arallelMultifron talCholeskyF actorization In this section w e pro vide a brief description of the algorithm b y Gupta and Kumar. F or a more detailed description the reader should refer to [13 ]. Consider a -pro cessors h yp ercub e-connected computer. Let b e the matrix to b e factored, and let b e its sup erno dal elimination tree. The algorithm requires the elimination tree to b e binary for the rst log lev els. An y elimination tree of arbitrary shap e can b e con erted to a binary tree using a simple tree restructuring algorithm describ ed in [19 ]. In this sc heme, p ortions of the elimination tree are assigned to pro cessors using the standard subtree- to-sub cub e assignmen t strategy [11 ,14 ] illustrated in Figure 1. With subtree-to-sub cub e assignmen t, all pro cessors in the system co op erate to factor the fron tal matrix asso ciated with the ro ot no de of the elimination tree. The t o subtrees of the ro ot no de are assigned to sub cub es of p= 2 pro cessors eac h. Eac subtree is further partitioned recursiv ely using the same strategy .Th us, the subtrees at a depth of log lev els are eac h assigned to individual pro cessors. Eac h pro cessor can pro cess this part of the tree completely indep enden tly without an y comm uni cation o erhead.

Page 4

10 11 12 13 14 15 17 18 16 0123456789101112131415161718 XX XX XX X XX X XX X XX XXX X XX XX XXX XX X XXX XXX XX X XXX XX X XX XXXX XXX XXX X XXX XXX XXX 1349101213 2 5 11 14 15 16 17 18 Level 3 Level 0 Level 1 Level 2 PPPPPPPP 1234567 PPPP 0,1 2,3 4,5 6,7 0,1,2,3 4,5,6,7 0,1,2,3,4,5,6,7 Figure 1: The elimination tree asso ciated with a sparse matrix, and the subtree-to-sub cub e mapping of the tree on to eigh t pro cessors. Assume that the lev els of the binary sup erno dal elimination tree are lab eled from top starting with 0. In general, at lev el of the eliminatio n tree, 2 log pro cessors w ork on a single fron tal or up date matrix. These pro cessors form a logical 2 (log (log grid. All up date and fron tal matrices at this lev el are distributed on this grid of pro cessors. T o ensure load balance during factorization, the ro ws and columns of these matrices are distributed in a cyclic fashion. Bet een t o successiv e extend-add op erations, the parallel m ultifron tal algorithm p erforms a dense Cholesky factorization of the fron tal matrix corresp onding to the ro ot of the subtree. Since the tree is sup erno dal, this step usually requires the factorization of sev eral no des. The comm unication taking place in this phase is the standard comm unication in grid-based dense Cholesky factorization. Eac h pro cessor participates in log distributed extend-add op erations, in whic h the up date matrices from the factorization at lev el are redistributed to p erform the extend-add op eration at lev el 1 prior to factoring the fron tal matrix. In the algorithm prop osed in [13 ], eac h pro cessor exc hanges data with only one other pro cessor during eac h one of these log distributed extend-adds. The ab o eisac hiev ed b y a careful em b edding of the pro cessor grids on the h yp ercub e, and b y carefully mapping ro ws and columns of eac fron tal matrix on to this grid. This mapping is describ ed in [21 ], and is also giv en in App endix B. TheNewAlgorithm As men tioned in the in tro duction, the subtree-to-sub cub e mapping sc heme used in [13 ] do es not distribute the w ork equally among the pro cessors. This load im balance puts an upp er b ound on the ac hiev able eciency or example, consider the sup erno dal elimination tree sho wn in Figure 2. This eliminatio n tree is partitioned among 8 pro cessors using the subtree-to-sub cub e allo cation sc heme. All eigh t pro cessors factor the top no de, pro cessors zero through three are resp onsible for the subtree ro oted at 24{27, and pro cessors four through sev en are resp onsible for the subtree ro oted at 52{55. The subtree-to-sub cub e allo cation pro ceeds recursiv ely in eac h sub cub e resulting in the mapping sho wn in the gure. Note that the subtrees of the ro ot no de do not ha e the same amoun tof w ork. Th us, during the parallel m ultifron tal algorithm, pro cessors zero through three will ha etow ait for pro cessors four through sev en to nish their w ork, b efore they can p erform an extend-add op eration and pro ceed to factor the top no de. This idling puts an upp er b ound on the eciency of this algorithm. W e can compute this upp er b ound on the ac hiev able eciency due to load im balance in the follo wing w . The time required to factor a subtree of the elimination tree is equal to the time to factor the ro ot plus the maxim um of the time required to factor eac h of the t o subtrees ro oted at this

Page 5

ro ot. By applying the ab o e rule recursiv ely w e can compute the time required to p erform the Cholesky factorization. Assume that the comm unication o erhead is zero, and that eac h pro cessor can p erform an op eration in a time unit, the time to factor eac h subtree of the elimination tree in Figure 2 is sho wn on the righ tofeac h no de. F or instance, no de 9{11 requires 773 254 217 = 302 op erations, and since the computation is distributed o er pro cessors zero and one, it tak es 151 time units. No w its subtree ro oted at no de 4{5 requires 254 time units, while its subtree ro oted at no de 8 requires 217 time units. Th us, this particular subtree is factored in 151 + max 254 217 = 405 time units. The o erall eciency ac hiev able b the ab o e subtree-to-sub cub e mapping is 4302 812 =0 66 whic h is signi can tly less than one. F urthermore, the nal eciency is ev en lo er due to comm unication erheads. 88 254 773 217 1912 254 773 217 552 942 2250 88 552 942 48 44-45 36 32-33 20 16-17 4-5 6 7 13 15 18 19 29 31 34 36 41 43 46 47 42 40 30 28 14 12 {0} {1} {2} {3} {4} {5} {6} {7} 254 405 217 254 217 405 496.5 552 88 703 703 552 88 794.5 812 49-51 37-39 52-55 21-23 24-27 9-11 56-62 4302 {6-7} {4-5} {2-3} {4-7} {0-3} {0-7} {0-1} Figure 2: The sup erno dal elimination tree of a factorization problem and its mapping to eigh t pro cessors via subtree-to-sub cub e mapping. Eac hnode( i.e. , sup erno de) is lab eled b y the range of no des b elonging to it. The n um b er on the left of eac hnodeisthe n um b er of op erations required to factor the tree ro oted at this no de, the n um b ers ab o e eac h no de denotes the set of pro cessors that this subtree is assigned to using subtree-to-sub cub e allo cation, and the n um b er on the righ t of eac h no de is the time-units required to factor the subtree in parallel. This example illustrates another dicult y asso ciated with direct factorization. Ev en though b oth subtrees ro oted at no de 56{62 ha e 28 no des, they require di eren t amoun t of computation. Th us, balancing the computation cannot b e done during the ordering phase b y simply carefully selecting separators that split the graph in to t o roughly equal parts. The amoun t of load im balance among di eren t parts of the elimination tree can b e signi can tly w orse for general sparse matrices, for whic h it is not ev en p ossible to nd go o d separators that can split the graph in to t o roughly equal parts. T able 1 sho ws the load im balance at the top lev el of the elimination tree for some matrices from the Bo eing-Harw ell matrix set. These matrices w ere ordered using the sp ectral nested dissection [29 , 30, 19 ]. Note that for all matrices the load im balance in terms of op eration coun t is substan tially higher than the relativ e di erence in the n um b er of no des in the left and righ t subtrees. Also, the upp er b ound on the eciency sho wn in this table is due only to the the top lev el subtrees. Since subtree-to-sub cub e mapping is recursiv ely applied in eac h sub cub e, the o erall load im balance will b e higher, b ecause it adds up as w egodo wn in the tree. or elimination trees of general sparse matrices, the load im balance can b e usually decreased b y p erform- ing some simple elimination tree reorderings describ ed in [19 ]. Ho ev er, these tec hniques ha et o serious limitations. First, they increase the ll-in as they try to balance the elimination tree b y adding extra dep en- dencies. Th us, the total time required to p erform the factorization increases. Second, these tec hniques are lo cal heuristics that try to minim ize the load im balance at a giv en lev el of the tree. Ho ev er, v ery often suc

Page 6

Left Subtree Righ t Subtree Name Separator Size No des Remaining W ork No des Remaining W ork Eciency Bound BCSSTK29 180 6912 45% 6695 55% 0.90 BCSSTK30 222 14946 59% 13745 41% 0.85 BCSSTK31 492 16728 40% 18332 60% 0.83 BCSSTK32 513 21713 45% 22364 55% 0.90 able 1: Ordering and load im balance statistics for some matrices from the Bo eing-Harw ell set. The matrices ha e b een reordered using sp ectral nested dissection. F or eac h matrix, the size of the top separator is sho wn, and for eac h subtree the n um b er of no des, and the p ercen t of the remaining w ork is sho wn. Also, the last column sho ws the maxim um ac hiev able eciency ,ifan y subsequen t lev els of the elimination tree w ere p erfectly balanced, or if only t o pro cessors w ere used for the factorization. lo cal impro emen ts do not result in impro ving the o erall load im balance. F or example, for a wide v ariet of problems from the Bo eing-Harw ell matrix set and linear programmi ng (LP) matrices from NETLIB [5 ], ev en after applying the tree balancing heuristics, the eciency b ound due to load im balance is still around 80% to 60% [13 , 20, 19]. If the increased ll-in is tak en in to accoun t, then the maxim um ac hiev able eciency is ev en lo er than that. In the rest of this section w e presen t a mo di cation to the algorithm presen ted in Section 3 that uses a di eren tsc heme for mapping the elimination tree on to the pro cessors. This mo di ed mapping sc heme signi can tly reduces the load im balance. 4.1 Subforest-T o-Sub cub e Mapping Sc heme In our new elimination tree mapping sc heme, w e assign man y subtrees (subforest) of the elimination tree to eac h pro cessor sub cub e. These trees are c hosen in suc haw y that the total amoun tof w ork assigned to eac h sub cub e is as equal as p ossible. The b est w y to describ e this partitioning sc heme is via an example. Consider the elimination tree sho wn in Figure 3. Assume that it tak es a total of 100 time-units to factor the en tire sparse matrix. Eac h no de in the tree is mark ed with the n um b er of time-units required to factor the subtree ro oted at this particular no de (including the time required to factor the no de itself ). F or instance, the subtree ro oted at no de requires 65 units of time, while the subtree ro oted at no de requires only 18. As sho wn in Figure 3(b), the subtree-to-sub cub e mapping sc heme will assign the computation asso ciated with the top sup erno de to all the pro cessors, the subtree ro oted at to half the pro cessors, and the subtree ro oted at to the remaining half of the pro cessors. Since, these subtrees require di eren t amoun tof computation, this particular partition will lead to load im balances. Since 7 time-units of w ork (corresp onding to the no de ) is distributed among all the pro cessors, this factorization tak es at least 7 =p units of time. No eac h sub cub e of p= 2 pro cessors indep enden tly w orks on eac h subtree. The time required for these sub cub es to nish is lo er b ounded b y the time to p erform the computation for the larger subtree (the one ro oted at no de ). Ev en if w e assume that all subtrees of are p erfectly balanced, computation of the subtree ro oted at p= 2 pro cessors will tak e at least 65 p= 2) time-units. Th us an upp er b ound on the eciency of this mapping is only 100 (7 =p +65 p= 2))) 73. No w consider the follo wing mapping sc heme: The computation asso ciated with sup erno des and is assigned to all the pro cessors. The subtrees ro oted at and are assigned to half of the pro cessors, while the subtree ro oted at is assigned to the remaining pro cessors. In this mapping sc heme, the rst half of the pro cessors are assigned 43 time-units of w ork, while the other half is assigned 45 time-units. The upp er b ound on the eciency due to load im balance of this new assignmen t is 100 (12 =p +45 p= 2)))) 98, whic h is a signi can t impro emen to er the earlier b ound of 73. The ab o e example illustrates the basic ideas b ehind the new mapping sc heme. Since it assigns subforests of the elimination tree to pro cessor sub cub es, w e will refer to it as subfor est-to-sub cub mapping sc heme. The general mapping algorithm is outlined in Program 4.1. The tree partitioning algorithm uses a set that con tains the unassigned no des of the elimination tree. The algorithm inserts the ro ot of the elimination tree in to , and then it calls the routine Elp art that

Page 7

(a) Top 2 levels of a partial elimination tree 15 100 65 28 18 15 18 65 28 45 15 18 45 100 65 28 45 100 BC DF EG EF BC Distributed to the other half of processors Distributed to one half of processors Distributed to all the processors Elimination tree of (a) partitioned using subtree-to-subcube (b) Elimination tree of (b) partitioned using subforest-to-subcube (c) Figure 3: The top t o lev els of an elimination tree is sho wn in (a). The subtree-to-sub cub e mapping is sho wn in (b), the subforest-to-sub cub e mapping is sho wn in (c). recursiv ely partitions the elimination tree. Elp art partitions in to t o parts, and and c hec ks if this partitioning is acceptable. If y es, then it assigns to half of the pro cessors, and to the remaining half, and recursiv ely calls Elp art to p erform the partitioning in eac h of these halv es. If the partitioning is not acceptable, then one no de of i.e. no de = sele ct( ) is assigned to all the pro cessors, no de is deleted from , and the c hildren of no de are inserted in to the . The algorithm then con tin ues b y rep eating the whole pro cess. The ab o e description pro vides a high lev el o erview of the subforest-to-sub cub e partitioning sc heme. Ho ev er,an um b er of details need to b e clari ed. In particular, w e need to sp ecify ho w the sele ct halfsplit , and ac eptable pro cedures w ork. Selectionofanodefrom There are t o di eren tw ys of de ning the pro cedure sele ct( One w y is to select a no de whose subtree requires the largest n um b er of op erations to b e factored. The second w y is to select a no de that requires the largest n um b er of op erations to factor it. The rst metho d fa ors no des whose subtrees require signi can t amoun t of computation. Th us, b selecting suc h a no de and inserting its c hildren in ema y get a go o d partitioning of in to t o halv es. Ho ev er, this approac h can assign no des with relativ ely small computation to all the pro cessors, causing p o or eciency in the factorization of these no des. The second metho d guaran tees that the selected no de has more w ork, and th us its factorization can ac hiev e higher eciency when it is factored b y all pro cessors. Note,thattheinformationrequiredb ythesemethods(theamoun tofcomputatio ntoeliminateanode,orthetotalamoun ofcomputationassociatedwithasubtree),canbeeasilyobtainedduringthesym bolicfactorizatio nphase.

Page 8

1. artition( )/* P artition the elimination tree , among pro cessors */ 2. fg 3. Add ro ot( )in to 4. Elpart( 5. End P artition 6. Elpart( 7. if ( == 1) return 8. done = false 9. while ( done == false) 10. halfsplit( 11. if (acceptable( )) 12. Elpart( p= 2) 13. Elpart( p= 2) 14. done = true 15. else 16. no de = select( 17. delete( no de 18. no de >p /* Assign no de to all pro cessors */ 19. Insert in to the c hildren of no de in 20. end while 21. End Elpart Program 4.1: The subforest-to-sub cub e partitioning algorithm. Ho ev er, if the subtrees attac hed to this no de are not large, then this ma y not lead to a go o d partitioning of in later steps. In particular, if the ro ot of the subtree ha ving most of the remaining w ork, requires little computation ( e.g. , single no de sup erno de), then the ro ot of this subtree will not b e selected for expansion un til v ery late, leading to to o man y no des b eing assigned at all the pro cessors. Another p ossibilit yistocom bine the ab o et osc hemes and apply eac h one in alternate steps. This com bined approac h eliminates most of the limitati ons of the ab o esc hemes while retaining their adv an tages. This is the sc heme w e used in the exp erimen ts describ ed in Section 6. So far w e considered only the oating p oin t op erations when w ew ere referring to the n um b er of op erations required to factor a subtree. On systems where the cost of eac h memory access relativ e to a oating p oin op eration is relativ ely high, a more accurate cost mo del will also tak e the cost of eac h extend-add op eration in to accoun t. The total n um b er of memory accesses required for extend-add can b e easily computed from the sym b olic factorization of the matrix. Splitti gTheSet In eac h step, the partitioning algorithm c hec ks to see if it can split the set in to t o roughly equal halv es. The abilit y of the halfsplit pro cedure to nd a partition of the no des (and consequen tly create t o subforests) is crucial to the o erall abilit y of this partitioning algorithm to balance the computation. F ortunately , this is a t ypical bin-pac king problem, and ev en though, bin-pac king is NP complete, a n um b er of go o d appro ximate algorithms exist [28]. The use of bin-pac king mak es it p ossible to balance the computation and to signi can tly reduce the load im balance. AcceptableP artiti ons A partition is acceptable if the p ercen tage di erence in the amoun tof w ork in the o parts is less that a small constan .If is c hosen to b e high ( e.g. 2), then the subforest-to-sub cub e mapping b ecomes similar to the subtree-to-sub cub e mapping sc heme. If is c hosen to b e to o small, then most of the no des of the elimination tree will b e pro cessed b y all the pro cessors, and the comm uni cation o erhead during the dense Cholesky factorization will b ecome to o high. F or example, consider the task of factoring

Page 9

matrices and on -pro cessor square mesh or a h yp ercub e using a standard algorithm that uses t o-dimensional partitioning and pip elining. If eac h of the matrices is factored b y all the pro cessors, then the total comm unication time for factoring the t o matrices is [22 ]. If and are factored concurren tly b p= 2 pro cessors eac h, then the comm unication time is (2 p= ) whic h is smaller. Th us the alue of has to b e c hosen to strik e a go o d balance b et een these t o con icting goals of minim izing load im balance and the comm unicati on o erhead in individual factorization steps. F or the exp erimen ts rep orted in Section 6, w e used =0 05. 4.2 Other Mo di cations In this section w e describ e the necessary mo di cations of the algorithm presen ted in Section 3 to accommo- date this new mapping sc heme. Initially eac h pro cessor factors the subtrees of the elimination tree assigned to itself. This represen ts lo cal computation and requires no comm unicati on just lik e the earlier algorithm. Ho ev er, since eac h pro cessor is assigned more than one subtree of the elimination tree, at the end of this lo cal computation, its stac k will con tain one up date matrix for eac h tree assigned to it. A t this p oin t, it needs to p erform a distributed extend-add op eration with its neigh b or pro cessor at the rst lev el of the virtual binary tree. During this step, eac h pro cessor splits the up date matrices, and sends the part that is not lo cal, to the other pro cessor. This is similar to the parallel extend-add op eration required b y the algorithm describ ed in Section 3, except that more than one up date matrix is split and sen t. Note, that these pieces from di eren t up date matrices can all b e pac ed in a single message, as the comm unication happ ens with only one other pro cessor. F urthermore, as sho wn in App endix D, the amoun t of data b eing transmitted in eac h parallel extend-add step is no more than it is in the earlier algorithm [13 ]. The reason is that ev en though more up date matrices are b eing transmitted, these up date matrices corresp ond to no des that are deep er in the elimination tree and the size of these matrices is m uc h smaller. No w, after pairs of pro cessors ha e p erformed the extend-add op eration, they co op erate to factor the no des of the elimination tree assigned to them. The no des are eliminated in a p ostorder order. Next, groups of four pro cessors exc hange the up date matrices that are stored in their stac k to p erform the extend-add op eration for the next lev el. This pro cess con tin ues un til all the no des ha e b een factored. The new parallel ultifron tal algorithm is outlined in App endix C. Impro vingP erformance eha e added a n um b er of mo di cations to the algorithm describ ed in Section 4 that greatly impro its p erformance. In the follo wing sections w e brie y describ e these mo di cations. or a more detailed description of these enhancemen ts the reader should refer to [21 ]. 5.1 Blo c k Cyclic Mapping As discussed in App endix D, for the factorization of a sup erno de, w e use the pip elined v arian t of the grid- based dense Cholesky algorithm [22 ]. In this algorithm, successiv ero ws of the fron tal matrix are factored one after the other, and the comm unicatio n and computation pro ceeds in a pip elined fashion. Ev en though this sc heme is simple, it has t o ma jor limitati ons. Since the ro ws and columns of a fron tal matrix among the pro cessor grid in a cyclic fashion, information for only one ro w is transmitted at an y giv en time. Hence, on arc hitectures in whic h the message startup time is relativ ely high compared to the transfer time, the comm unicati on o erhead is dominated b y the startup time. F or example, consider a pro cessor grid, and a -no de sup erno de that has a fron tal matrix of size . While p erforming elimination steps on an fron tal matrix, on a erage, a message of size (2 (2 )) needs to b e send in eac h step along eac h direction of the grid. If the message startup time is 100 times higher than the per w ord transfer time, then for = 256, as long as 2 k< 3200 the startup time will dominate the data transfer time. Note, that the ab o e translates to m> 1600. F or most sparse matrices, the size of the fron tal matrices tends to b e m uc h less than 1600.

Page 10

The second limitation of the cyclic mapping has to do with the implemen tation eciency of the compu- tation phase of the factorization. Since, at eac h step, only one ro w is eliminated, the factorization algorithm ust p erform a rank-one up date. On systems with BLAS lev el routines, this can b e done using either lev el one BLAS (D AXPY), or lev el t o BLAS (DGER, DGEMV). On most micropro cessors, including high p er- formance RISC pro cessors suc h as the Dec Alpha AXP , the p eak p erformance ac hiev able b y these primitiv es is usually signi can tly less than that ac hiev ed b y lev el three BLAS primitiv es, suc h as matrix-m atrix m ulti- ply (DGEMM). The reason is that for lev el one and lev el t o BLAS routines, the amoun t of computation is of the same order as the amoun t of data mo emen tbet een CPU and memory . In con trast, for lev el three BLAS op erations, the amoun t of computation is m uc h higher than the amoun t of data required from memory . Hence, lev el three BLAS op erations can b etter exploit the m ultiple functional units, and deep pip elines a ailable in these pro cessors. Ho ev er, b y distributing the fron tal matrices using a blo c k cyclic mapping [22 ], w e are able to eliminate b oth of the ab o e limitatio ns and greatly impro e the p erformance of our algorithm. In the blo c k cyclic mapping, the ro ws and columns of the matrix are divided in to groups, eac h of size , and these groups are assigned to the pro cessors in a cyclic fashion. As a result, diagonal pro cessors no w store blo c ks of consecutiv e piv ots. Instead of p erforming a single elimination step, they no w p erform elimination steps, and send data corresp onding to ro ws in a single message. Note that the o erall v olume of data transferred remains the same. F or sucien tly large v alues of , the startup time b ecomes a small fraction of the data transmission time. This result in a signi can t impro emen ts on arc hitectures with high startup time. In eac h phase no w, eac h pro cessor receiv es ro ws and columns and has to p erform a rank- up date on the not et factored part of its fron tal matrix. The rank- up date can no w b e implem en ted using matrix-m atrix ultiply , leading to higher computational rate. There are a n um b er of design issues in selecting the prop er v alue for . Clearly , the blo c k size should b e large enough so that the rank- up date ac hiev es high p erformance and the startup time b ecomes a small fraction of the data transfer time. On the other hand a v ery large v alue of leads to a n um b er of problems. First, pro cessors storing the curren t set of ro ws to b e eliminated, ha e to construct the rank- up date b p erforming rank-1 up dates. If is large, p erforming these rank-1 up dates tak es considerable amoun tof time, and stalls the pip eline. Also, a large v alue of leads to load im balances on the n um b er of elemen ts of the fron tal matrix assigned to eac h pro cessor, b ecause there are few er blo c ks to distribute in a cyclic fashion. Note that this load im balance within the dense factorization phase is di eren t from the load im balance asso ciated with distributing the elimination tree among the pro cessors describ ed in Section 4. An um b er of other design issues in olv ed in using blo c k cyclic mapping and w ys to further impro e the p erformance are describ ed in [21 ]. 5.2 Pip elined Cholesky F actorization In the parallel p ortion of our m ultifron tal algorithm, eac h fron tal matrix is factored using a grid based pip elined Cholesky factorization algorithm. This pip elined algorithms w orks as follo ws [22 ]. Assume that the pro cessor grid stores the upp er triangular part of the fron tal matrix, and that the pro cessor grid is square. The diagonal pro cessor that stores the curren t piv ot, divides the elemen ts of the piv ot ro w it stores y the piv ot and sends the piv ot to its neigh b or on the righ t, and the scaled piv ot ro w to its neigh bor do wn. Eac h pro cessor up on receiving the piv ot, scales its part of the piv ot ro w and sends the piv ot to the righ t and its scaled piv ot ro wdo wn. When a diagonal pro cessor receiv es a scaled piv ot ro w from its up pro cessor, it forw ards this do wn along its column, and also to its righ t neigh b or. Ev ery other pro cessor, up on receiving a scaled piv ot ro w either from the left or from the top, stores it lo cally and then forw ards it to the pro cessor at the opp osite end. F or simplicit , assume that data is tak en out from the pip eline b y the pro cessor who initiated the transmission. Eac h pro cessor p erforms a rank-1 up date of its lo cal part of the fron tal matrix as so on as it receiv es the necessary elemen ts from the top and the left. The pro cessor storing the next piv ot elemen t starts eliminating the next ro w as so on as it has nished computation for the previous iteration. The pro cess con tin ues un til all the ro ws ha e b een eliminated. Ev en though this algorithm is correct, and its asymptotic p erformance is as describ ed in App endix D, it requires bu ers for storing messages that ha e arriv ed and cannot y et b e pro cessed. This is b ecause certain pro cessors receiv e the t o sets of data they need to p erform a rank-1 up date at di eren t times. Consider for 10

Page 11

example a 4 4 pro cessor grid, and assume that pro cessor (0 0) has the rst piv ot. Ev en though pro cessor (1 0) receiv es data from the top almost righ ta , the data from the left m ust come from pro cessor (0 1) via (1 1) (1 2) and (1 3). No w if the pro cessor (0 0), also had the second piv ot (due to a greater than one blo c k size), then the message bu er on pro cessor (1 0) migh t con tain the message from pro cessor (0 0) corresp onding to the second piv ot, b efore the message from (0 1) corresp onding to the rst piv ot had arriv ed. The source of this problem is that the pro cessors along the ro w act as the sources for b oth t yp e of messages (those circulating along the ro ws and those circulating along the columns). When a similar algorithm is used for Gaussian elimination, the problem do esn't arise b ecause data start from a column and a ro wof pro cessors, and messages from these ro ws and columns arriv eateac h pro cessor at roughly the same time [22 ]. On mac hines with v ery high bandwidth, the o erhead in olv ed in managing bu ers signi can tly reduces the p ercen tage of the obtainable bandwidth. This e ect is ev en more pronounced for small messages. F or this reason, w e decided to implemen t our algorithm with only a single message bu er p er neigh b or. As men tioned in Section 6, this comm unicatio n proto col enable us to utilize most of the theoretical bandwidth on a Cra y T3D. Ho ev er, under the restrictions of limited message bu er space, the ab o e Cholesky algorithm sp ends a signi can t amoun t of time idling. This is due to the follo wing requiremen t imp osed b y the single comm unica- tion bu er requiremen t. Consider pro cessors k;k , and +1 ;k . Before pro cessor k;k can start the ( + 1)th iteration, it m ust w ait un til pro cessor +1 ;k has started p erforming the rank-1 up date for the th iteration (so that pro cessor k;k can go ahead and reuse the bu ers for iteration ( + 1)). Ho ev er, since pro cessor +1 ;k receiv es data from the left m uc h later than it do es from the top, pro cessor k;k ust w ait un til this latter data transmission has tak en place. Essen tially during this time pro cessor k;k sits idle. Because, it sits idle, the + 1 iteration will start late, and data will arriv e at pro cessor +1 ;k ev en later. Th us, at eac h step, certain pro cessors sp end certain amoun t of time b eing idle. This time is prop ortional to the time it tak es for a message to tra el along an en tire ro w of pro cessors, whic h increases substan tially with the n um ber of pro cessors. o solv e this problem, w e sligh tly mo di ed the comm unication pattern of the pip elined Cholesky algo- rithm as follo ws. As so on as a pro cessor that con tains elemen ts of the piv ot ro w has nished scaling it, it sends it b oth do wn and also to the transp osed pro cessor along the main diagonal. This latter pro cessor up on receiving the scaled ro w, starts mo ving it along the ro ws. No w the diagonal pro cessors do not an ymore forw ard the data they receiv e from the top to the righ t. The reason for this mo di cation is to mimi c the b e- ha vior of the algorithm that p erforms Gaussian elimination. On arc hitectures with cut-through routing, the erhead of this comm unicatio n step is comparable to that of a nearest neigh b or transmission for sucien tly large messages. F urthermore, b ecause these tr ansp ose messages are initiated at di eren t times cause little or no con ten tion. As a result, eac h diagonal pro cessor no w will ha e to sit idle for a v ery small amoun tof time [21 ]. Experimen talResults e implemen ted our new parallel sparse m ultifron tal algorithm on a 256-pro cessors Cra y T3D parallel computer. Eac h pro cessor on the T3D is a 150Mhz Dec Alpha c hip, with p eak p erformance of 150MFlops for 64-bit op erations (double precision). The pro cessors are in terconnected via a three dimensional torus net ork that has a p eak unidirectional bandwidth of 150Bytes p er second, and a v ery small latency .Ev en though the memory on T3D is ph ysically distributed, it can b e addressed globally . That is, pro cessors can directly access (read and/or write) other pro cessor's memory . T3D pro vides a library in terface to this capabilit y called SHMEM. W e used SHMEM to dev elop a ligh eigh t message passing system. Using this system w ew ere able to ac hiev e unidirectional data transfer rates up to 70Mb ytes p er second. This is signi can tly higher than the 35MBytes c hannel bandwidth usually obtained when using T3D's PVM. or the computation p erformed during the dense Cholesky factorization, w e used single-pro cessor im- plemen tation of BLAS primitiv es. These routines are part of the standard scien ti c library on T3D, and they ha e b een ne tuned for the Alpha c hip. The new algorithm w as tested on matrices from a v ariet of sources. F our matrices (BCSSTK40, BCSSTK31, BCSSTK32, and BCSSTK33) come from the Bo eing- Harw ell matrix set. MAR OS-R7 is from a linear programming problem tak en from NETLIB. COPTER2 11

Page 12

comes from a mo del of a helicopter rotor. CUBE35 is a 35 35 35 regular three-dimensional grid. In all of our exp erimen ts, w e used sp ectral nested dissection [29 , 30] to order the matrices. The p erformance obtained b y our m ultifron tal algorithm in some of these matrices is sho wn in T able 2. The op eration coun t sho ws only the n um b er of op erations required to factor the no des of the elimination tree (it do es not include the op erations in olv ed in extend-add). Some of these problems could not b e run on 32 pro cessors due to memory constrain ts (in our T3D, eac h pro cessor had only 2Mw ords of memory). Figure 4 graphically represen ts the data sho wn in T able 2. Figure 4(a) sho ws the o erall p erformance obtained v ersus the n um b er of pro cessors, and is similar in nature to a sp eedup curv e. Figure 4(b) sho ws the p er pro cessor p erformance v ersus the n um b er of pro cessors, and re ects reduction in eciency as increases. Since all these problems run out of memory on one pro cessor, the standard sp eedup and eciency could not b e computed exp erimen tally Num b er of Pro cessors Problem Op eration Coun 32 64 128 256 MAR OS-R7 3136 330472 1345241 720M 0.83 1.41 2.18 3.08 BCSSTK30 28924 1007284 5796797 2400M 1.48 2.45 3.59 BCSSTK31 35588 572914 6415883 3100M 1.47 2.42 3.87 BCSSTK32 44609 985046 8582414 4200M 1.51 2.63 4.12 BCSSTK33 8738 291583 2295377 1000M 0.78 1.23 1.92 2.86 COPTER2 55476 352238 12681357 9200M 1.92 3.17 5.51 CUBE35 42875 124950 11427033 10300M 2.23 3.75 6.06 able 2: The p erformance of sparse direct factorization on Cra y T3D. F or eac h problem the table con tains the n um b er of equations of the matrix , the original n um b er of nonzeros in , the nonzeros in the Cholesky factor ,n um b er of op erations required to factor the no des, and the p erformance in giga ops for di eren tn um b er of pro cessors. 32 64 128 256 32 64 128 256 Processors GigaFlop CUBE35 COPTER2 BCSSTK32 BCSSTK31 BCSSTK30 Processors (a) (b) 10 20 30 40 MFlops/Processor BCSSTK30 BCSSTK31 BCSSTK32 COPTER2 CUBE35 MAROS-R7 BCSSTK33 MAROS-R7 BCSSTK33 Figure 4: Plot of the p erformance of the parallel sparse m ultifron tal algorithm for v arious problems on Cra T3D. (a) T otal Giga ops obtained; (b) Mega ops p er pro cessor. The highest p erformance of 6GFlops w as obtained for CUBE35, whic h is a regular three-dimensional problem. Nearly as high p erformance (5.51GFlops) w as also obtained for COPTER2 whic h is irregular. Since b oth problems ha e similar op eration coun t, this sho ws that our algorithm p erforms equally w ell in factoring matrices arising in irregular problems. F o cusing our atten tion to the other problems sho wn in able 2, w e see that ev en on smaller problems, our algorithm p erforms quite w ell. F or BCSSTK33, it w as able to ac hiev e 2.86GFlops on 256 pro cessors, while for BCSSTK30, it ac hiev ed 3.59GFlops. 12

Page 13

o further illustrate ho wv arious comp onen ts of our algorithm w ork, w eha e included a breakdo wn of the v arious phases for BCSSTK31 and CUBE35 in T able 3. This table sho ws the a erage time sp en tb all the pro cessors in the lo cal computation and in the distributed computation. F urthermore, w e break do wn the time tak en b y distributed computation in to t o ma jor phases, (a) dense Cholesky factorization, (b) extend-add o erhead. The latter includes the cost of p erforming the extend-add op eration, splitting the stac ks, transferring the stac ks, and idling due to load im balances in the subforest-to-sub cub e partitioning. Note that the gures in this table are a erages o er all pro cessors, and they should b e used only as an appro ximate indication of the time required for eac h phase. An um ber of in teresting observ ations can b e made from this table. First, as the n um b er of pro cessors increases, the time sp en t pro cessing the lo cal tree in eac h pro cessor decreases substan tially b ecause the subforest assigned to eac h pro cessor b ecomes smaller. This trend is more pronounced for three-dimensional problems, b ecause they tend to ha e fairly shallo w trees. The cost of the distributed extend-add phase decreases almost linearly as the n um b er of pro cessors increases. This is consisten t with the analysis presen ted in App endix D, since the o erhead of distributed extend-add is (( log =p ). Since the gure for the time sp en t during the extend-add steps also includes the idling due to load im balance, the almost linear decrease also sho ws that the load im balance is quite small. The time sp en t in distributed dense Cholesky factorization decreases as the n um b er of pro cessors in- creases. This reduction is not linear with resp ect to the n um b er of pro cessors for t o reasons: (a) the ratio of comm unicati on to computation during the dense Cholesky factorization steps increases, and (b) for a xed size problem load im balances due to the blo c k cyclic mapping b ecomes w orse as increases. or reasons discussed in Section 5.1, w e distributed the fron tal matrices in a blo c k-cyclic fashion. T o get go o d p erformance on Cra y T3D out of lev el three BLAS routines, w e used a blo c k size of sixteen (blo c k sizes of less than sixteen result in degradation of lev el 3 BLAS p erformance on Cra y T3D) Ho ev er, suc h a large blo c k size results in a signi can t load im balance within the dense factorization phase. This load im balance b ecomes w orse as the n um b er of pro cessors increases. Ho ev er, as the size of the problem increases, b oth the comm unication o erhead during dense Cholesky and the load im balance due to the blo c k cyclic mapping b ecomes less signi can t. The reason is that larger problems usually ha e larger fron tal matrices at the top lev els of the elimination tree, so ev en large pro cessor grids can b e e ectiv ely utilized to factor them. This is illustrated b y comparing ho w the v arious o erheads decrease for BCSSTK31 and CUBE35. F or example, for BCSSTK31, the factorization on 128 pro cessors is only 48% faster compared to 64 pro cessors, while for CUBE35, the factorization on 128 pro cessors is 66% faster compared to 64 pro cessors. BCSSTK31 CUBE35 DistributedComputation DistributedComputation LocalComputation actorizatio Extend-Add LocalComputation actorization Extend-Add 64 0.17 1.34 0.58 0.15 3.74 0.71 128 0.06 0.90 0.32 0.06 2.25 0.43 256 0.02 0.61 0.18 0.01 1.44 0.24 able 3: A break-do wn of the v arious phases of the sparse m ultifron tal algorithm for BCSSTK31 and CUBE35. Eac hn um b er represen ts time in seconds. o see the e ect of the c hoice of in the o erall p erformance of the sparse factorization algorithm w factored BCSSTK31 on 128 pro cessors using =0 4 and =0 0001. Using these v alues for e obtained a p erformance of 1.18GFlops when =0 4, and 1.37GFlops when =0 0001. In either case, the p erformance is w orse than the 2.42GFlops obtained for =0 05. When =0 4, the mapping of the elimination tree to the pro cessors resem bles that of the subtree-to-sub cub e allo cation. Th us, the p erformance degradation is due to the elimination tree load im balance. When =0 0001, the elimination tree mapping assigns a large um b er of no des to all the pro cessors, leading to p o or p erformance during the dense Cholesky factorization. 13

Page 14

Conclusion Exp erimen tal results clearly sho w that our new sc heme is capable of using a large n um b er of pro cessor ecien tly On a single pro cessor of a state of the art v ector sup ercomputer suc hasCra y C90, sparse Cholesky factorization can b e done at the rate of roughly 500MFlops for the larger problems studied in Section 6. Ev en a 32-pro cessor Cra y T3D clearly outp erforms a single no de C-90 for these problems. Our algorithm as presen ted (and implemen ted) w orks for Cholesky factorization of symmetric p ositiv de nite matrices. With little mo di cations, it is also applicable to LU factorization of other sparse matrices, as long as no piv oting is required ( e.g. , sparse matrices arising out of structural engineering problems). With highly parallel form ulatio na ailable, the factorization step is no longer the most time consuming step in the solution of sparse systems of equations. Another step that is quite time consuming, and has not b een parallelized e ectiv ely is that of ordering. In our curren t researc hw e are in estigating ordering algorithms that can b e implemen ted fast on parallel computers [16 , 34]. Ac kno wledgmen ew ould lik e to thank the Minnesota Sup ercomputing Cen ter for pro viding access to a 64-pro cessor Cra T3D, and Cra y Researc h Inc. for pro viding access to a 256-pro cessor Cra y T3D. Finally w e wish to thank Dr. Alex P othen for his guidance with sp ectral nested dissection ordering. References [1] Clev e Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. Ac omp arison of thr ec olumn b ase d distribute sp arse factorization schemes .T ec hnical Rep ort Y ALEU/DCS/RR-810, Y ale Univ ersit , New Ha en, CT, 1990. Also app ears in Pr dings of the Fifth SIAM Confer enc e on Par al lel Pr essing for Scienti c Computing , 1991. [2] I. S. Du and J. K. Reid. The multifr ontal solution of inde nite sp arse symmetric line ar e quations CM ansactions on Mathematic al Softwar , (9):302{325, 1983. [3] Kalluri Esw ar, P onn usw am y Sada appan, and V. Visv anathan. Sup erno dal Sp arse Cholesky factorization on distribute d-memory multipr essors .In Internation al Confer enc e on Par al lel Pr essing , pages 18{22 (v ol. 3), 1993. [4] K. A. Galliv an, R. J. Plemmons, and A. H. Sameh. Par al lel algorithms for dense line ar algebr ac omputations SIAM R eview , 32(1):54{135, Marc h 1990. Also app ears in K. A. Galliv an et al. Par al lel A lgorithms for Matrix Computations . SIAM, Philadelph ia, P A, 1990. [5] D. M. Ga Ele ctr onic Mail Distribution of Line ar Pr gr amming T est Pr oblems al Pr gr amming So ciety CO AL Newsletter , Decem b er 1985. [6] G. A. Geist and E. G.-Y. Ng. ask sche duling for p ar al lel sp arse Cholesky factorization Internationa l Journal of Par al lel Pr gr amming , 18(4):291{314, 1989. [7] G. A. Geist and C. H. Romine. LU factorization algorithms on distribute d-memory multipr essor ar chite ctur es SIAM Journal on Scienti c and Statistic al Computing , 9(4):639{649, 1988. Also a ailable as T ec hnical Rep ort ORNL/TM-10383, Oak Ridge National Lab oratory , Oak Ridge, TN, 1987. [8] A. George, M. T. Heath, J. W.-H. Liu, and E. G.-Y. Ng. Sp arse Cholesky F actorization on a lo al memory multipr essor SIAM Journal on Scienti c and Statistic al Computing , 9:327{340, 1988. [9] A. George and J. W.-H. Liu. Computer Solution of L ar ge Sp arse Positive De nite Systems . Pren tice-Hall, Englew o o d Cli s, NJ, 1981. [10] A. George and J. W.-H. Liu. The evolution of the minimum de gr eor dering algorithm SIAM R eview , 31(1):1{19, Marc h 1989. [11] A. George, J. W.-H. Liu, and E. G.-Y. Ng. Communic ation R esults for Par al lel Sp arse Cholesky F actorization on a Hyp er cub Par al lel Computing , 10(3):287{298, Ma y 1989. [12] John R. Gilb ert and Rob ert Sc hreib er. Highly Par al lel Sp arse Cholesky F actorization SIAM Journal on Scienti c and Statistic al Computing , 13:1151{1172, 1992. 14

Page 15

[13] Ansh ul Gupta and Vipin Kumar. Asc alable p ar al lel algorithm for sp arse matrix factorization .T ec hnical Re- p ort 94-19, Departmen t of Computer Science, Univ ersit y of Minnesota, Minneap olis, MN, 1994. A shorter ersion app ears in Sup ercomputing '94. TR a ailable in users/kumar/sp arse -ch ole sky .ps at anon ymous FTP site ftp.cs.umn.e du [14] M. T. Heath, E. Ng, and B. W. P yton. Par al lel A lgorithms for Sp arse Line ar Systems SIAM R eview , 33(3):420{ 460, 1991. [15] M. T. Heath, E. G.-Y. Ng, and Barry W. P eyton. Par al lel A lgorithms for Sp arse Line ar Systems SIAM R eview 33:420{460, 1991. Also app ears in K. A. Galliv an et al. Par al lel A lgorithms for Matrix Computations . SIAM, Philadel phi a, P A, 1990. [16] M. T. Heath and P . Ragha an. A Cartesian neste d disse ction algorithm .T ec hnical Rep ort UIUCDCS-R-92-1772, Departmen t of Computer Science, Univ ersit y of Illinois, Urbana, IL 61801, Octob er 1992. to app ear in SIMAX. [17] M. T. Heath and P . Ragha an. Distribute d solution of sp arse line ar systems .T ec hnical Rep ort 93-1793, Depart- men t of Computer Science, Univ ersit y of Illinois , Urbana, IL, 1993. [18] Laurie Hulb ert and Earl Zmijewski. Limiting c ommunic ation in p ar al lel sp arse Cholesky factorization SIAM Journal on Scienti c and Statistic al Computing , 12(5):1184{1197, Septem b er 1991. [19] George Karypis, Ansh ul Gupta, and Vipin Kumar. Or dering and lo ad b alancing for p ar al lel factorization of sp arse matric es .T ec hnical Rep ort (in preparation), Departmen t of Computer Science, Univ ersit y of Minnesota, Minneap ol is, MN, 1994. [20] George Karypis, Ansh ul Gupta, and Vipin Kumar. A Par al lel F ormulation of Interior Point A lgorithms .In Sup er omputing 94 , 1994. Also a ailable as T ec hnical Rep ort TR 94-20, Departmen t of Computer Science, Univ ersit y of Minnesota, Minneap olis, MN. [21] George Karypis and Vipin Kumar. A High Performanc eSp arse Cholesky F actorization A lgorithm F or Sc ala- ale Par al lel Computers .T ec hnical Rep ort 94-41, Departmen t of Computer Science, Univ ersit y of Minnesota, Minneap ol is, MN, 1994. [22] Vipin Kumar, Anan th Grama, Ansh ul Gupta, and George Karypis. Intr ductio n to Par al lel Computing: Design and A nalysis of A lgorithms . Benjamin/Cummings Publishing Compan , Redw ood Cit , CA, 1994. [23] Joseph W. H. Liu. The Multifr ontal Metho d for Sp arse Matrix Solution: The ory and Pr actic SIAM R eview 34(1):82{109, 1992. [24] Rob ert F. Lucas. Solving planar systems of e quations on distribute d- memory multipr essors . PhD thesis, Departmen t of Electrical Engineering, Stanford Univ ersit ,P alo Alto, CA, 1987. Also see IEEE T ansactions on Computer A ide d Design , 6:981{991, 1987. [25] Rob ert F. Lucas, T om Blank, and Jerome J. Tiemann. Ap ar al lel solution metho d for lar ge sp arse systems of quations IEEE T ansactions on Computer A ide d Design , CAD-6(6):981{991, No em b er 1987. [26] Mo Mu and John R. Rice. A grid-b ase d subtr e-sub cu e assignment str ate gy for solving p artial di er ential quations on hyp er cub es SIAM Journal on Scienti c and Statistic al Computing , 13(3):826{839, Ma y 1992. [27] Dianne P . O'Leary and G. W. Stew art. Assignment and Sche duling in Par al lel Matrix F actorization Line ar lgebr a and its Applic atio ns , 77:275{299, 1986. [28] Christos H. P apadimitriou and Kenneth Steiglitz. Combinatorial Optimization, A lgorithms and Complexity Pren tice Hall, 1982. [29] A. P othen and C-J. F an. Computing the blo ck triangular form of a sp arse matrix CM T ansactions on Mathematic al Softwar , 1990. [30] Alex P othen, Horst D. Simon, and Kang-Pu Liou. Partitioning Sp arse Matric es With Eigenve ctors of Gr aphs SIAM J. on Matrix A nalysis and Applic ations , 11(3):430{452, 1990. [31] Alex P othen and Ch unguang Sun. Distribute d multifr ontal factorization using clique tr es .In Pr dings of the Fifth SIAM Confer enc eonPar al lel Pr essing for Scienti c Computing , pages 34{40, 1991. [32] Roland P ozo and Sharon L. Smith. Performanc e evaluation of the p ar al lel multifr ontal metho d in a distribute d- memory envir onment .In Pr dings of the Sixth SIAM Confer enc e on Par al lel Pr essing for Scienti c Com- puting , pages 453{456, 1993. [33] P . Ragha an. Distribute dsp arse Gaussian elimination and ortho gonal factorization .T ec hnical Rep ort 93-1818, Departmen t of Computer Science, Univ ersit y of Illinois, Urbana, IL, 1993. [34] P . Ragha an. Line and plane sep ar ators .T ec hnical Rep ort UIUCDCS-R-93-1794, Departmen t of Computer Science, Univ ersit y of Illinois, Urbana, IL 61801, F ebruary 1993. 15

Page 16

[35] Edw ard Roth b erg. Performanc e of Panel and Blo ck Appr aches to Sp arse Cholesky F actorization on the iPSC/860 and Par agon Multic omputers .In Pr dings of the 1994 Sc alable High Performanc e Computing Confer enc ,Ma y 1994. [36] Edw ard Roth b erg and Ano op Gupta. n ecient blo ck-oriente d appr ach to p ar al lel sp arse Cholesky factoriza- tion .In Sup er omputin g '93 Pr dings , 1993. [37] P . Sada appan and Sailesh K. Rao. Communic ation R duction for Distribute dSp arse Matrix F actorization on a Pr essors Mesh .In Sup er omputing '89 Pr dings , pages 371{379, 1989. [38] Rob ert Sc hreib er. Sc alability of sp arse dir ct solvers .T ec hnical Rep ort RIA CS TR 92.13, NASA Ames Researc Cen ter, Mo et Field, CA, Ma y 1992. Also app ears in A. George, John R. Gilb ert, and J. W.-H. Liu, editors, Sp arse Matrix Computations: Gr aph The ory Issues and A lgorithms (An IMA W orkshop V olume). Springer- erlag, New Y ork, NY, 1993. [39] Ch unguang Sun. Ecient p ar al lel solutions of lar ge sp arse SPD systems on distribute d- memory multipr essors ec hnical rep ort, Departmen t of Computer Science, Cornell Univ ersit , Ithaca, NY, 1993. [40] Sesh V en ugopal and Vija y K. Naik. E e cts of p artitioning and sche duling sp arse matrix factorization on c om- munic ation and lo ad b alanc .In Sup er omputin g '91 Pr dings , pages 866{875, 1991. [41] M. Y annak akis. Computing the minimum l l-in is NP-c omplete SIAM J. A lgebr aic Discr ete Metho ds , 2:77{79, 1981. AppendixA Multifron talMethod Let be an symmetric p ositiv e de nite matrix and b e its Cholesky factor. Let b e its elimination tree and de ne ] to represen t the set of descendan ts of the no de in the elimination tree . Consider the th column of . Let ;i ;:::;i b e the ro w subscripts of the nonzeros in ;j with i.e. , column has o -diagonal nonzeros). The subtr eup date matrix at column for is de ned as f j;k ;k ;k j;k ;l ;k ;:::;l ;k (1) Note that con tains outer pro duct con tributions from those previously eliminated columns that are de- scendan ts of in the elimination tree. The th fr ontal matrix is de ned to b e j;j j;i j;i ;j ;j (2) Th us, the rst ro w/column of is formed from ;j and the subtree up date matrix at column .Ha ving formed the fron tal matrix , the algorithm pro ceeds to p erform one step of elimination on that giv es the nonzero en tries of the factor of ;j . In particular, this elimination can b e written in matrix notation as j;j ;j ;j j;j ;j ;j (3) where ;j are the nonzero elemen ts of the Cholesky factor of column . The matrix is called up date matrix for column and is formed as part of the elimination step. 16

Page 17

In practice, is nev er computed using Equation 1, but is constructed from the up date matrices as follo ws. Let ;:::;c b e the c hildren of in the elimination tree, then j;j j;i j;i ;j ;j l$ (4) where is called the extend-add op er ator , and is a generalized matrix addition. The extend-add op eration is illustrated in the follo wing example. Consider the follo wing t o up date matrices of ;S where is the index set of i.e. , the rst ro w/column of corresp onds to the second ro w/column of , and the second ro w/column of corresp onds to the fth ro w/column of ), and is the index set of . Then Note that the submatrix $ ma yha e few er ro ws/columns than , but if it is prop erly extended b y the index set of , it b ecomes the subtree up date matrix The pro cess of forming from the nonzero structure elemen ts of column of and the up dates matrices is called fr ontal matrix assembly op er ation .Th us, in the m ultifron tal metho d, the elimination of eac h column of in olv es the assem bly of a fron tal matrix and one step of elimination. The sup erno dal m ultifron tal algorithm pro ceeds similarly to the m ultifron tal algorithm, but when a sup erno de of the elimination tree gets eliminated, this corresp onds to m ultiple elimination steps on the same fron tal matrix. The n um b er of elimination steps is equal to the n um b er of no des in the sup erno de. AppendixB DataDistribution Figure 5(a) sho wsa4 4 grid of pro cessors em b edded in the h yp ercub e. This em b edding uses the sh ue mapping. In particular grid p osition with co ordinate ( y; x ) is mapp ed on to pro cessor whose address is made yin terlea ving the bits in the binary represen tation of and .F or instance, the grid p osition with binary co ordinates ( ab; cd ) is mapp ed on to pro cessor whose address in binary is cadb , for example grid (10 11) is mapp ed on to 1110. Note that when w e split this 4 4 pro cessor grid in to t o4 2 subgrids, these subgrids corresp ond to distinct sub cub es of the original h yp ercub e. This prop ert y is main tained in subsequen t splits, for instance eac h2 2 grid of a 4 2 grid is again a sub cub e. This is imp ortan t, b ecause the algorithm uses subtree-to-sub cub e partitioning sc heme, and b y using sh uing mapping, eac h sub cub e is simply half of the pro cessor grid. Consider next, the elimination tree sho wn in Figure 5(b). In this gure only the top t o lev els are sho wn, and at eac hnode , the nonzero elemen ts of for this particular no de is also sho wn. Using the subtree-to- sub cub e partitioning sc heme, the elimination tree is partitioned among the 16-pro cessors as follo ws. No de is assigned to all the 16 pro cessors, no de is assigned to half the pro cessors, while no de is assigned to the other half of the pro cessors. In Figure 5(c), w e sho who w the fron tal matrices corresp onding to no des , and will b e distributed among the pro cessors. Consider no de . The fron tal matrix is a 7 7 triangular matrix, that corresp onds to . The distribution of the ro ws and columns of this fron tal matrix on the 4 2 pro cessor grid is sho wn in part (c). Ro of is mapp ed on ro of the pro cessor grid, while column of is mapp ed on column of the pro cessor grid. In this pap er is used to denote bit wise exclusiv e-or. 17

Page 18

{4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 19} {2, 5, 7, 8, 12, 14, 15} {3, 4, 7, 10, 13, 14, 19} 4, 10, 14 1000 3, 7, 13, 19 1100 1001 1011 1110 1111 1101 10, 14 3, 7, 19 Factoring node C Factoring node B 7, 15 2, 14 1010 4 0000 0001 0010 0011 0100 0101 0110 0111 13 8, 12 5, 7, 15 1011 2, 8, 12, 14 1101 1110 1111 4, 8, 16 10, 14 5, 9, 13 7, 15, 19 4, 8, 16 5, 9, 13 10, 14 7, 15, 19 Factoring node A 0111 0110 0011 0010 0101 0100 0000 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 00 01 00 01 10 11 10 11 0001 1010 1000 1001 1100 BC (a) Processor Grid (b) An Elimination Tree (c) Distributed Factorization and Extend-Add Figure 5: Mapping fron tal matrices to the pro cessor grid. An alternate w y of describing this mapping is to consider the co ordinates of the pro cessors in eac pro cessor grid. Eac h pro cessor in the 4 2 grid has an ( y; x ) co ordinate suc h that 0 y< 4, and x< 2. Note that this co ordinate is lo cal with resp ect to the curren t grid, and is not the same with the grid co ordinates of the pro cessor in the en tire 4 4 grid. F or the rest of the pap er, when w e discuss grid co ordinates w e will alw ys assume that they are lo cal, unless sp eci ed otherwise. Ro of will b e mapp ed on the pro cessors ( y; ) suc h that the t o least signi can t bits of is equal to . Similarly , column of will b e mapp ed on the pro cessors ( ;x ), suc h that the one least signi can t bit of is equal to In general, in a 2 pro cessor grid, i;j is mapp ed on to the pro cessor with grid co ordinates ( ;j ). Or lo oking it from the pro cessor's p oin t of view, pro cessor ( y; x ) will get all elemen ts i;j suc h that the least signi can t bits of are equal to , and the least signi can t bits of are equal to .( , and , where is used to denote bit wise logical and). The fron tal matrix for no de is mapp ed on the other 4 2 grid in a similar fashion. Note that the ro ws of b oth and are mapp ed on pro cessors along the same ro w of the 4 4 grid. This is b ecause, b oth grids ha e the same n um ber of ro ws. Also, b ecause the nal grid also has 4 ro ws, the ro ws of that are similar to either ro ws of and , are mapp ed on the same ro w of pro cessors. F or example ro w7ofboth , and is mapp ed on the fourth ro w of the three grids in question. This is imp ortan t, b ecause when the pro cessors need to p erform the extend-add op eration , no data mo emen tbet een ro ws of the grid is required. Ho ev er, data needs to b e comm unicated along the columns of the grid. This is b ecause, the grids that are storing the up date matrices ha e few er columns than the grid of pro cessors that is going to factor 18

Page 19

Data needs to b e comm unicated so that the columns of the up date matrices and matc hes that of the fron tal matrix . Eac h pro cessor can easily determine whic h columns of the up date matrix it already stores needs to k eep and whic h needs to send a .Itk eeps the columns whose 2 least signi can t bits matc its co ordinate in the 4 4 grid, and it sends the other a . Let i;j b e a rejected elemen t. This elemen needs to b e send to pro cessor along the same ro w in the pro cessor grid, but whose co ordinate is Ho ev er, since elemen i;j resides in this pro cessor during the factorization in olving the 4 2 grid, then ust b e the co ordinate of this pro cessor. Therefore, the rejected columns need to b e send to the pro cessor along the same ro w whose grid co ordinate di ers in the most signi can t bit. Th us, during this extend-add op eration, pro cessors that are neigh b ors in the h yp ercub e need to exc hange data. AppendixC NewP arallelMultifron talAlgorithm p order[1..k] /* The no des of the elimination tree at eac h pro cessor n um b ered in a lev el-p ostorder */ 1. lvl = log 2. for (j=1; j k; j++) 3. i = p order[j] 4. if (lev el[i] != lvl) 5. Split Stac ks(lvl, lev el[i]) 6. lvl = lev el[i] 7. Let b e the fron tal matrix for no de Let ;c ;:::;c b e the c hildren of in the elimination tree 8. l$ 9. factor( ) /* Dense factorization using 2 log lvl pro cessors */ 10. push( 11. The parallel m ultifron tal algorithm using the subforest-to-sub cub e partitioning sc heme. The function Split Stacks , p erforms lvl level ] parallel extend-add op erations. In eac h of these extend-adds eac h pro- cessor splits its lo cal stac k and sends data to the corresp onding pro cessors of the other group of pro cessors. Note that when lvl = log , the function factor p erforms the factorization of lo cally AppendixD Analysis In this section w e analyze the p erformance of the parallel m ultifron tal algorithm describ ed in Section 4. In our new parallel m ultifron tal algorithm and that of Gupta and Kumar [13 ], there are t ot yp es of o erheads due to the parallelization: (a) load im balances due to the w ork partitioning; (b) comm unication o erhead due to the parallel extend-add op eration and due to the factorization of the fron tal matrices. As our exp erimen ts sho w in Section 6, the subforest-to-sub cub e mapping used in our new sc heme es- sen tially eliminates the load im balance. In con trast, for the subtree-to-sub cub e sc heme used in [13 ], the erhead due to load im balance is quite high. No w the question is whether the subforest-to-sub cub e mapping used in our new sc heme results in higher comm unication o erhead during the extend-add and the dense factorization phase. This analysis is dicult due to the heuristic nature of our new mapping sc heme. T ok eep the analysis simple, w e presen t analysis for regular t o-dimensional grids, in whic h the n um b er of subtrees mapp ed on to eac h sub cub e is four. The analysis also holds for an y small constan tn um b er of subtrees. Our exp erimen ts ha e sho wn that the n um ber of subtrees mapp ed on to eac h sub cub e is indeed small. D.1 Comm unication Ov erhead Consider a regular nite di erence grid, and a -pro cessor h yp ercub e-connecte d parallel computer. o simplify the analysis, w e assume that the grid has b een ordered using a nested dissection algorithm, that selects cross-shap ed separators [11 ]. F or an no de square grid, this sc heme selects a separator of size 19

Page 20

that partitions the grid in to four subgrids eac h of size ( 1) 1) n= n= 2. Eac h of these subgrids is recursiv ely dissected in a similar fashion. The resulting sup erno dal elimination tree, is a quadtree. If the ro ot of the elimination tree is at lev el zero, then a no de at lev el corresp onds to a grid of size n= n= , and the separator of suc h a grid is of size 2 n= . Also, it is pro en in [11 , 13], that the size of the up date matrix of a no de at lev el is b ounded b y2 kn= , where = 341 2. e assume that the subforest-to-sub cub e partitioning sc heme partitions the elimination tree in the fol- lo wing w . It assigns all the no des in the rst t o lev els (the zeroth and rst lev el) to all the pro cessors. It then, splits the no des at the second lev el in to four equal parts and assigns eac h a quarter of the pro cessors. The c hildren of the no des of eac h of these four groups are split in to four equal parts and eac h is assigned to a quarter of the pro cessors of eac h quarter. This pro cesses con tin ues, un til the no des at lev el log ha b een assigned to individual pro cessors. Figure 6 sho ws this t yp e of subforest-to-sub cub e partitioning sc heme. This sc heme assigns to eac h sub cub e of pro cessors, four no des of the same lev el of the elimination tree. In particular, for i> 0, a sub cub e of size p= is assigned four no des of lev el + 1 of the tree. Th us, eac h sub cub e is assigned a forest consisting of four di eren t trees. (a) Top 4 levels of a quad tree (b) The subforest assigned to a fourth of the processors (c) The subforest assigned to a sixteenth of the processors Figure 6: A quadtree and the subforest-to-sub cub e allo cation sc heme. (a) Sho ws the top four lev els of a quadtree. (b) Sho ws the subforest assigned to a quarter of the pro cessors. (c) Sho ws the subforest assigned to a quarter of a pro cessors of the quarter of part (b). Consider a sub cub e of size p= . After it has nished factoring the no des at lev el + 1 of the 20

Page 21

elimination tree assigned to it, it needs to p erform an extend-add with its corresp onding sub cub e, so the new formed sub cub e of size p= p= , can go ahead and factor the no des of the th lev el of the elimination tree. During this extend-add op eration, eac h sub cub e needs to split the ro ots of eac h one of the subtrees b eing assigned to it. Since, eac h sub cub e is assigned four subtrees, eac h sub cub e needs to split and send elemen ts from four up date matrices. Since eac hnode atlev el + 1 has an up date matrix of size 2 kn= +1 distributed o er p= p= pro cessors, eac h pro cessor needs to exc hange with the corresp onding pro cessor of the other sub cub e kn= elemen ts. Since, eac h pro cessor has data from four up date matrices, the total amoun t of data b eing exc hanged is k n=p . The are a total of log extend-add phases; th us, the total n um ber of data that need to b e exc hanged is log p=p ). Note that this comm unication o erhead is iden tical to that required b y the subtree-to-sub cub e partitioning sc heme [13 ]. or the dense Cholesky factorization w e use the pip eline implemen tation of the algorithm on a t o- dimensional pro cessor grid using c hec erb oard partitioning. It can b e sho wn [27 , 22] that the comm unication erhead to p erform factorization steps of an matrix, in a pip elined implemen tatio non a mesh of pro cessors, is dm=q ). Since, eac hnodeoflev el + 1 of the elimination tree is assigned to a grid of , and the fron tal matrix of eac h no de is b ounded b kn= +1 kn= +1 , and w e p erform n= +1 factorization steps, the comm unication o erhead is +1 kn +1 kn 42 Since, eac h suc h grid of pro cessor has four suc h no des, the comm unication o erhead of eac h lev el is 2 kn= (2 ). Th us, the comm unication o erhead o er the log lev els is log =0 Therefore, the comm unication o erhead summed o er all the pro cessor due to the parallel extend-add and the dense Cholesky factorization is ), whic h is of the same order as the sc heme presen ted in [13 ]. Since the o erall comm unication o erhead of our new subforest-to-sub cub e mapping sc heme is of the same order as that for the subtree-to-sub cub e, the iso eciency function for b oth sc hemes is the same. The analysis just presen ted can b e extended similarly to three-dimensional grid problems and to arc hitectures other than h yp ercub e. In particular, the analysis applies directly to arc hitectures suc h as CM-5, and Cra T3D. 21