
AHighPerformanceSparseCholeskyFactorizationAlgorithmFScalableParallelComputersGeorgeKarypisandVipinKumartofComputerScienceyofMinnesotaMinneapolis,MN55455hnicalReport94-41Thispaperpresentsanewparallelalgorithmforsparsematrixfactorization.Thisalgorithmusessubforest-to-subcubemappinginsteadofthesubtree-to-subcubemappingofanotherrecentlyintroducedhemebyGuptaandKumar[13].Asymptotically,bothformulationsareequallyscalableonawiderangeofarchitecturesandawidevyofproblems.Butthesubtree-to-subcubemappingoftheearlierulationcausessigni cantloadimbalanceamongprocessors,limitingoeralleciencyandspeedup.Thenewmappinglargelyeliminatestheloadimbalanceamongprocessors.Furthermore,thealgorithmhasanberofenhancementstoimproetheoerallperformancesubstan.Thisnewalgorithmesupto6GFlopsona256-processorCrayT3Dformoderatelylargeproblems.ToourknothisisthehighestperformanceeverobtainedonanMPPforsparseCholeskyfactorization.troductionDirectmethodsforsolvingsparselinearsystemsareimportantbecauseoftheirgeneralityandrobustness.Flinearsystemsarisingincertainapplications,suchaslinearprogrammingandsomestructuralengineeringapplications,theyaretheonlyfeasiblemethodsfornumericalfactorization.Itiswellknownthatdensematrixfactorizationcanbeimplementedecientlyondistributed-memoryparallelcomputers[4,27,7,22].er,despiteinherentparallelisminsparsesparsedirectmethods,notmhsuccesshasbeenacedtodateindevelopingtheirscalableparallelformulations[15,38],andforseveralyears,ithasbeenachallengetotecientsparselinearsystemsolversusingdirectmethodsonevenmoderatelyparallelcomputers.In[38],Schreiberconcludesthatitisnotyetclearwhethersparsedirectsolverscanbemadecompetitiveatallforhighly(256)andmassively(4096)parallelcomputers.Aparallelformulationforsparsematrixfactorizationcanbeeasilyobtainedbysimplydistributingrotodi erentprocessors[8].Duetothesparsityofthematrix,commerheadisalargefractionofthecomputationforthismethod,resultinginpoorscalabilit.Inparticular,forsparsematricesarisingoutofplanar niteelementgraphs,theisoeciencyofsuchaformulationis),thatistheproblemsize(intermsoftotalnberofcomputation)shouldgrowas)tomaintaina xedeciencyInasmarterparallelformulation[11],therowsofthematrixareallocatedtoprocessorsusingthesubtree-to-subcubemapping.Thislocalizesthecommunicationamonggroupsofprocessors,andthusimprotheisoeciencyoftheschemeto).RothbergandGupta[36,35]usedadi erentmethodtoreducethecommcationoerhead.Intheirmethod,theentiresparsematrixispartitionedamongprocessorsusingato-dimensionalblockcyclicmapping.ThisreducesthecommerheadandimproestheisoeciencytoGuptaandKumar[13]recentlydevelopedaparallelformulationofsparseCholeskyfactorizationbasedonthemtalmethod.Themethod[2,23]isaformofsubmatrixCholesky,inwhicsingleeliminationstepsareperformedonasequenceofsmall,denseontalmatric.Oneoftheadvofmtalmethodsisthatthefrontalmatricesaredense,andthereforetheeliminationstepscanbe tedecientlyusinglevelthreeBLASprimitives.Thisalgorithmhasteyfeatures.Itusesthesubtree-to-subcubemappingtolocalizecommnamongprocessors,anditusesthehighlyscalableo-dimensionalgridpartitioningfordensematrixfactorizationforeachsupernodalcomputationinthetalalgorithm.Asaresult,thecommonoerheadofthisschemeistheloestofallotherwnparallelformulationsforsparsematrixfactorization[24,25,1,31,32,39,8,38,17,33,37,3,6,1815,40,26,12,36,35].Infact,asymptotically,theisoeciencyofthisschemeis)forsparsematricesarisingoutofto-andthree-dimensional 
Note that the isoefficiency of the best known parallel formulation of dense matrix factorization is also O(p^1.5) [22]. On a variety of problems, Gupta and Kumar report speedups of up to 364 on a 1024-processor nCUBE2, which is a major improvement over the previously existing algorithms. However, the subtree-to-subcube mapping results in gross imbalance of load among different processors, as elimination trees for most practical problems tend to be unbalanced. This load imbalance is responsible for a major portion of the efficiency loss of their scheme. Furthermore, the overall computation rate of their single-processor multifrontal code on the nCUBE2 was only 0.7 MFlops, and the maximum overall performance on a 1024-processor nCUBE2 was only 300 MFlops. This was partly due to the slow processors of the nCUBE2 (3.5 MFlops peak), and partly due to inadequacies in the implementation.

This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of the old scheme. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6 GFlops on a 256-processor Cray T3D for moderately large problems (even the biggest problem we tried took less than two seconds on a 256-node T3D; for larger problems, even higher performance can be achieved). To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization. Our new scheme, like the scheme of Gupta and Kumar [13], has an asymptotic isoefficiency of O(p^1.5) for matrices arising out of two- and three-dimensional finite element problems on a wide variety of architectures such as the hypercube, mesh, fat tree, and three-dimensional torus.

The rest of the paper is organized as follows. Section 2 presents a general overview of the Cholesky factorization process and multifrontal methods. Section 3 provides a brief description of the algorithm in [13]. Section 4 describes our new algorithm. Section 5 describes some further enhancements of the algorithm that significantly improve the performance. Section 6 provides the experimental evaluation of our new algorithm on a Cray T3D. Section 7 contains concluding remarks. Due to space limitations, many important topics, including the theoretical performance analysis of our algorithm, have been moved to the appendices.

2 Cholesky Factorization

Consider a system of linear equations Ax = b, where A is an n x n symmetric positive definite matrix, b is a known vector, and x is the unknown solution vector to be computed. One way to solve the linear system is first to compute the Cholesky factorization A = LL^T, where the Cholesky factor L is a lower triangular matrix. The solution vector x can then be computed by successive forward and back substitutions to solve the triangular systems Ly = b and L^T x = y.

If A is sparse, then during the course of the factorization, some entries that are initially zero in the upper triangle of A may become nonzero entries in L. These newly created nonzero entries of L are known as fill-in. The amount of fill-in generated can be decreased by carefully reordering the rows and columns of A prior to factorization. More precisely, we can choose a permutation matrix P such that the Cholesky factors of PAP^T have minimal fill-in. The problem of finding the best ordering for A that minimizes the amount of fill-in is NP-complete [41]; therefore, a number of heuristic algorithms for ordering have been developed. In particular, minimum degree ordering [9, 14, 10] is found to have low fill-in.

For a given ordering of a matrix, there exists a corresponding elimination tree. Each node in this tree is a column of the matrix. Node j is the parent of node i (j > i) if l_ji is the first nonzero entry in column i. Elimination of rows in different subtrees can proceed concurrently. For a given matrix, elimination trees of smaller height usually have greater concurrency than trees of larger height. A desirable ordering for parallel computers must increase the amount of concurrency without increasing fill-in substantially. Spectral nested dissection [29, 30, 19] has been found to generate orderings that have both low fill-in and good parallelism. For the experiments presented in this paper we used spectral nested dissection. For a more detailed discussion of the effect of orderings on the performance of our algorithm, refer to [21].
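To make the parent relation concrete, the following is a minimal sketch (not from the paper; the column structure and function name are illustrative) of how an elimination tree can be derived from the nonzero structure of L: the parent of column i is simply the smallest row index j > i with l_ji nonzero.

def elimination_tree(lower_structure):
    """parent[i] = smallest j > i with l_ji nonzero, or None if column i is a root.
    lower_structure[i] is the list of row indices of the nonzeros in column i of L."""
    parent = {}
    for i, rows in lower_structure.items():
        below = [j for j in rows if j > i]
        parent[i] = min(below) if below else None
    return parent

# Hypothetical 5-column structure of L (column -> nonzero row indices):
cols = {0: [0, 2, 4], 1: [1, 2], 2: [2, 4], 3: [3, 4], 4: [4]}
print(elimination_tree(cols))   # {0: 2, 1: 2, 2: 4, 3: 4, 4: None}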
In the multifrontal method for Cholesky factorization, a frontal matrix F_k and an update matrix U_k are associated with each node k of the elimination tree. The rows and columns of F_k correspond to the indices of the nonzeros of column k of L, in increasing order. In the beginning, F_k is initialized to an (s+1) x (s+1) matrix, where s+1 is the number of nonzeros in the lower triangular part of column k of A. The first row and column of this initial F_k is simply the upper triangular part of row k and the lower triangular part of column k of A. The remainder of F_k is initialized to all zeros. The tree is traversed in a postorder sequence. When the subtree rooted at a node k has been traversed, F_k becomes a dense (t+1) x (t+1) matrix, where t is the number of off-diagonal nonzeros in column k of L. If k is a leaf in the elimination tree of A, then the final F_k is the same as the initial F_k. Otherwise, the final F_k for eliminating node k is obtained by merging the initial F_k with the update matrices obtained from all the subtrees rooted at k via an extend-add operation. The extend-add is an associative and commutative operator on two update matrices such that the index set of the result is the union of the index sets of the original update matrices. Each entry in the original update matrices is mapped onto some location in the accumulated matrix. If entries from both matrices overlap on a location, they are added. Empty entries are assigned a value of zero. After F_k has been assembled, a single step of the standard dense Cholesky factorization is performed with node k as the pivot. At the end of the elimination step, the column with index k is removed from F_k and forms column k of L. The remaining t x t matrix is called the update matrix U_k and is passed on to the parent of k in the elimination tree. Since the matrices are symmetric, only the upper triangular part is stored. For further details on the multifrontal method, the reader should refer to Appendix A and to the excellent tutorial by Liu [23].

If some consecutively numbered nodes form a chain in the elimination tree, and the corresponding rows of L have identical nonzero structure, then this chain is called a supernode. The supernodal elimination tree is similar to the elimination tree, but nodes forming a supernode are collapsed together. In the rest of this paper we use the supernodal multifrontal algorithm. Any reference to the elimination tree or to a node of the elimination tree actually refers to a supernode and the supernodal elimination tree.

3 Earlier Work on Parallel Multifrontal Cholesky Factorization

In this section we provide a brief description of the algorithm by Gupta and Kumar. For a more detailed description the reader should refer to [13].

Consider a p-processor hypercube-connected computer. Let A be the matrix to be factored, and let T be its supernodal elimination tree. The algorithm requires the elimination tree to be binary for the first log p levels. Any elimination tree of arbitrary shape can be converted to a binary tree using a simple tree restructuring algorithm described in [19].

In this scheme, portions of the elimination tree are assigned to processors using the standard subtree-to-subcube assignment strategy [11, 14] illustrated in Figure 1. With subtree-to-subcube assignment, all p processors in the system cooperate to factor the frontal matrix associated with the root node of the elimination tree. The two subtrees of the root node are assigned to subcubes of p/2 processors each. Each subtree is further partitioned recursively using the same strategy. Thus, the subtrees at a depth of log p levels are each assigned to individual processors. Each processor can process this part of the tree completely independently without any communication overhead.
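As a concrete illustration of the assignment just described, here is a minimal sketch of recursive subtree-to-subcube mapping for a binary elimination tree; the Node type and the processor lists are illustrative assumptions, not the paper's data structures.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    id: int
    children: List["Node"] = field(default_factory=list)

def subtree_to_subcube(node: Node, procs: List[int], assign: Dict[int, List[int]]) -> None:
    # All processors of the current subcube cooperate on this node's frontal matrix.
    assign[node.id] = list(procs)
    if len(procs) > 1 and len(node.children) == 2:
        half = len(procs) // 2                      # split the subcube into two halves
        subtree_to_subcube(node.children[0], procs[:half], assign)
        subtree_to_subcube(node.children[1], procs[half:], assign)
    else:
        # a single processor handles its whole remaining subtree locally
        for c in node.children:
            subtree_to_subcube(c, procs, assign)

# Hypothetical 3-level binary tree mapped onto 4 processors:
tree = Node(0, [Node(1, [Node(3), Node(4)]), Node(2, [Node(5), Node(6)])])
assign = {}
subtree_to_subcube(tree, [0, 1, 2, 3], assign)
# assign[0] == [0, 1, 2, 3], assign[1] == [0, 1], assign[5] == [2], ...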
Figure 1: The elimination tree associated with a sparse matrix, and the subtree-to-subcube mapping of the tree onto eight processors.

Assume that the levels of the binary supernodal elimination tree are labeled from the top, starting with 0. In general, at level l of the elimination tree, 2^(log p - l) processors work on a single frontal or update matrix. These processors form a logical two-dimensional grid. All update and frontal matrices at this level are distributed on this grid of processors. To ensure load balance during factorization, the rows and columns of these matrices are distributed in a cyclic fashion.

Between two successive extend-add operations, the parallel multifrontal algorithm performs a dense Cholesky factorization of the frontal matrix corresponding to the root of the subtree. Since the tree is supernodal, this step usually requires the factorization of several nodes. The communication taking place in this phase is the standard communication in grid-based dense Cholesky factorization. Each processor participates in log p distributed extend-add operations, in which the update matrices from the factorization at level l are redistributed to perform the extend-add operation at level l-1 prior to factoring the frontal matrix. In the algorithm proposed in [13], each processor exchanges data with only one other processor during each one of these log p distributed extend-adds. This is achieved by a careful embedding of the processor grids on the hypercube, and by carefully mapping the rows and columns of each frontal matrix onto this grid. This mapping is described in [21], and is also given in Appendix B.

4 The New Algorithm

As mentioned in the introduction, the subtree-to-subcube mapping scheme used in [13] does not distribute the work equally among the processors. This load imbalance puts an upper bound on the achievable efficiency. For example, consider the supernodal elimination tree shown in Figure 2. This elimination tree is partitioned among 8 processors using the subtree-to-subcube allocation scheme. All eight processors factor the top node, processors zero through three are responsible for the subtree rooted at 24-27, and processors four through seven are responsible for the subtree rooted at 52-55. The subtree-to-subcube allocation proceeds recursively in each subcube, resulting in the mapping shown in the figure. Note that the subtrees of the root node do not have the same amount of work. Thus, during the parallel multifrontal algorithm, processors zero through three will have to wait for processors four through seven to finish their work before they can perform an extend-add operation and proceed to factor the top node. This idling puts an upper bound on the efficiency of this algorithm.

We can compute this upper bound on the achievable efficiency due to load imbalance in the following way. The time required to factor a subtree of the elimination tree is equal to the time to factor the root plus the maximum of the times required to factor each of the two subtrees rooted at this root. By applying the above rule recursively we can compute the time required to perform the Cholesky factorization. Assume that the communication overhead is zero and that each processor can perform an operation in a time unit; the time to factor each subtree of the elimination tree in Figure 2 is then shown on the right of each node. For instance, factoring node 9-11 itself requires 302 operations, and since this computation is distributed over processors zero and one, it takes 151 time units. Its subtree rooted at node 4-5 requires 254 time units, while its subtree rooted at node 8 requires 217 time units. Thus, this particular subtree is factored in 151 + max(254, 217) = 405 time units. The overall efficiency achievable by the above subtree-to-subcube mapping is therefore significantly less than one. Furthermore, the final efficiency is even lower due to communication overhead.
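The recursive rule above is easy to turn into a small estimator. The following sketch is illustrative only: the Subtree type, the equal split of processors between the children, and the example numbers are assumptions, not the actual tree of Figure 2.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Subtree:
    root_ops: float                       # operations to factor the root supernode itself
    children: List["Subtree"] = field(default_factory=list)

def total_ops(t: Subtree) -> float:
    return t.root_ops + sum(total_ops(c) for c in t.children)

def parallel_time(t: Subtree, procs: int) -> float:
    # Time for the root on `procs` processors plus the slowest child subtree,
    # each child subcube getting an equal share of the processors.
    time = t.root_ops / procs
    if t.children:
        share = max(procs // len(t.children), 1)
        time += max(parallel_time(c, share) for c in t.children)
    return time

def efficiency_bound(root: Subtree, p: int) -> float:
    return total_ops(root) / (p * parallel_time(root, p))

# Hypothetical tree: the root supernode costs 7 units, its two subtrees 65 and 28 units in total.
print(round(efficiency_bound(Subtree(7, [Subtree(65), Subtree(28)]), 8), 2))   # 0.73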
Figure 2: The supernodal elimination tree of a factorization problem and its mapping to eight processors via subtree-to-subcube mapping. Each node (i.e., supernode) is labeled by the range of nodes belonging to it. The number on the left of each node is the number of operations required to factor the tree rooted at this node, the numbers above each node denote the set of processors that this subtree is assigned to using subtree-to-subcube allocation, and the number on the right of each node is the number of time-units required to factor the subtree in parallel.

This example illustrates another difficulty associated with direct factorization. Even though both subtrees rooted at node 56-62 have 28 nodes, they require different amounts of computation. Thus, balancing the computation cannot be done during the ordering phase by simply carefully selecting separators that split the graph into two roughly equal parts. The amount of load imbalance among different parts of the elimination tree can be significantly worse for general sparse matrices, for which it is not even possible to find good separators that can split the graph into two roughly equal parts. Table 1 shows the load imbalance at the top level of the elimination tree for some matrices from the Boeing-Harwell matrix set. These matrices were ordered using spectral nested dissection [29, 30, 19]. Note that for all matrices the load imbalance in terms of operation count is substantially higher than the relative difference in the number of nodes in the left and right subtrees. Also, the upper bound on the efficiency shown in this table is due only to the top level subtrees. Since subtree-to-subcube mapping is recursively applied in each subcube, the overall load imbalance will be higher, because it adds up as we go down the tree.

                          Left subtree          Right subtree
  Name      Separator     Nodes    Remaining    Nodes    Remaining    Efficiency
            size                   work                  work         bound
  BCSSTK29  180            6912    45%           6695    55%          0.90
  BCSSTK30  222           14946    59%          13745    41%          0.85
  BCSSTK31  492           16728    40%          18332    60%          0.83
  BCSSTK32  513           21713    45%          22364    55%          0.90

Table 1: Ordering and load imbalance statistics for some matrices from the Boeing-Harwell set. The matrices have been reordered using spectral nested dissection. For each matrix, the size of the top separator is shown, and for each subtree the number of nodes and the percentage of the remaining work are shown. The last column shows the maximum achievable efficiency if all subsequent levels of the elimination tree were perfectly balanced, or if only two processors were used for the factorization.

For elimination trees of general sparse matrices, the load imbalance can usually be decreased by performing some simple elimination tree reorderings described in [19]. However, these techniques have two serious limitations. First, they increase the fill-in as they try to balance the elimination tree by adding extra dependencies. Thus, the total time required to perform the factorization increases. Second, these techniques are local heuristics that try to minimize the load imbalance at a given level of the tree. However, very often such local improvements do not result in improving the overall load imbalance. For example, for a wide variety of problems from the Boeing-Harwell matrix set and linear programming (LP) matrices from NETLIB [5], even after applying the tree balancing heuristics, the efficiency bound due to load imbalance is still only around 60% to 80% [13, 20, 19]. If the increased fill-in is taken into account, then the maximum achievable efficiency is even lower than that.
In the rest of this section we present a modification to the algorithm presented in Section 3 that uses a different scheme for mapping the elimination tree onto the processors. This modified mapping significantly reduces the load imbalance.

4.1 Subforest-to-Subcube Mapping Scheme

In our new elimination tree mapping scheme, we assign many subtrees (a subforest) of the elimination tree to each processor subcube. These trees are chosen in such a way that the total amount of work assigned to each subcube is as equal as possible. The best way to describe this partitioning scheme is via an example. Consider the elimination tree shown in Figure 3. Assume that it takes a total of 100 time-units to factor the entire sparse matrix. Each node in the tree is marked with the number of time-units required to factor the subtree rooted at this particular node (including the time required to factor the node itself). For instance, one subtree requires 65 units of time, while another requires only 18. As shown in Figure 3(b), the subtree-to-subcube mapping scheme assigns the computation associated with the top supernode to all the processors, the subtree requiring 65 units to half the processors, and the remaining subtree (28 units) to the other half of the processors. Since these subtrees require different amounts of computation, this particular partition will lead to load imbalances. Since 7 time-units of work (corresponding to the top node) is distributed among all p processors, this part of the factorization takes at least 7/p units of time. Next, each subcube of p/2 processors independently works on its subtree. The time required for these subcubes to finish is lower bounded by the time to perform the computation for the larger subtree. Even if we assume that all subtrees of that node are perfectly balanced, its computation on p/2 processors will take at least 65/(p/2) time-units. Thus an upper bound on the efficiency of this mapping is only 100/(7 + 2 x 65), which is about 0.73.

Now consider the following mapping scheme. The computation associated with the top supernodes (12 time-units in all) is assigned to all the processors, and the subtrees below them are divided between the two halves of the processors so that the first half is assigned 43 time-units of work while the other half is assigned 45 time-units. The upper bound on the efficiency due to load imbalance of this new assignment is 100/(12 + 2 x 45), or about 0.98, which is a significant improvement over the earlier bound of 0.73.

The above example illustrates the basic ideas behind the new mapping scheme. Since it assigns subforests of the elimination tree to processor subcubes, we will refer to it as the subforest-to-subcube mapping scheme. The general mapping algorithm is outlined in Program 4.1. The tree partitioning algorithm uses a set C that contains the unassigned nodes of the elimination tree. The algorithm inserts the root of the elimination tree in C, and then it calls the routine Elpart, which recursively partitions the elimination tree.

Figure 3: The top two levels of an elimination tree are shown in (a). The subtree-to-subcube mapping is shown in (b), and the subforest-to-subcube mapping is shown in (c).

Elpart splits C into two parts and checks if this partitioning is acceptable. If yes, then it assigns one part to half of the processors and the other part to the remaining half, and recursively calls Elpart to perform the partitioning in each of these halves. If the partitioning is not acceptable, then one node of C (node = select(C)) is assigned to all the processors, this node is deleted from C, and its children are inserted into C. The algorithm then continues by repeating the whole process.

The above description provides a high-level overview of the subforest-to-subcube partitioning scheme. However, a number of details need to be clarified. In particular, we need to specify how the select, split, and acceptable procedures work.

Selection of a node from C. There are two different ways of defining the procedure select. One way is to select the node whose subtree requires the largest number of operations to be factored. The second way is to select the node that requires the largest number of operations to factor it.
The first method favors nodes whose subtrees require a significant amount of computation. Thus, by selecting such a node and inserting its children in C we may get a good partitioning of C into two halves. However, this approach can assign nodes with relatively small computation to all the processors, causing poor efficiency in the factorization of these nodes. The second method guarantees that the selected node has more work, and thus its factorization can achieve higher efficiency when it is factored by all the processors. Note that the information required by these methods (the amount of computation to eliminate a node, or the total amount of computation associated with a subtree) can be easily obtained during the symbolic factorization phase.

Partition(T, p)                /* Partition the elimination tree T among p processors */
    C = {}
    Add root(T) in C
    Elpart(C, p)
End Partition

Elpart(C, p)
    if (p == 1) return
    done = false
    while (done == false)
        (C1, C2) = split(C)
        if (acceptable(C1, C2))
            done = true
        else
            node = select(C)
            /* Assign node to all p processors */
            Delete node from C
            Insert in C the children of node
    endwhile
    /* Assign C1 to one half of the processors and C2 to the other half */
    Elpart(C1, p/2)
    Elpart(C2, p/2)
End Elpart

Program 4.1: The subforest-to-subcube partitioning algorithm.

However, if the subtrees attached to the selected node are not large, then this may not lead to a good partitioning in later steps. In particular, if the root of the subtree having most of the remaining work requires little computation (e.g., a single-node supernode), then the root of this subtree will not be selected for expansion until very late, leading to too many nodes being assigned to all the processors. Another possibility is to combine the above two schemes and apply each one in alternate steps. This combined approach eliminates most of the limitations of the above schemes while retaining their advantages. This is the scheme we used in the experiments described in Section 6.

So far we considered only the floating point operations when referring to the number of operations required to factor a subtree. On systems where the cost of each memory access relative to a floating point operation is relatively high, a more accurate cost model will also take the cost of each extend-add operation into account. The total number of memory accesses required for extend-add can be easily computed from the symbolic factorization of the matrix.

Splitting the set C. In each step, the partitioning algorithm checks to see if it can split the set C into two roughly equal halves. The ability of the split procedure to find such a partition of the nodes (and consequently create two subforests) is crucial to the overall ability of this partitioning algorithm to balance the computation. Fortunately, this is a typical bin-packing problem, and even though bin-packing is NP-complete, a number of good approximate algorithms exist [28]. The use of bin-packing makes it possible to balance the computation and to significantly reduce the load imbalance.

Acceptable partitions. A partition is acceptable if the percentage difference in the amount of work in the two parts is less than a small constant e. If e is chosen to be high (e.g., 2), then the subforest-to-subcube mapping becomes similar to the subtree-to-subcube mapping scheme. If e is chosen to be too small, then most of the nodes of the elimination tree will be processed by all the processors, and the communication overhead during the dense Cholesky factorization will become too high. For example, consider the task of factoring two frontal matrices on a p-processor square mesh or a hypercube using a standard algorithm that uses two-dimensional partitioning and pipelining. If each of the matrices is factored by all the processors, the total communication time for factoring the two matrices is larger than if the two matrices are factored concurrently by p/2 processors each. Thus the value of e has to be chosen to strike a good balance between the two conflicting goals of minimizing the load imbalance and minimizing the communication overhead in the individual factorization steps. For the experiments reported in Section 6, we used e = 0.05.
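The following is a minimal sketch of how split and acceptable could be realized with a greedy bin-packing heuristic. The data structures, the greedy strategy, and the helper names are illustrative assumptions, not the paper's implementation; only the 5% threshold and the overall control flow of Program 4.1 come from the text above, and the example numbers mirror the Figure 3 discussion but are otherwise hypothetical.

def split(C, work):
    """Greedy bin-packing: deal the subtrees in C (largest first) into two bins."""
    bins = ([], [])
    loads = [0.0, 0.0]
    for node in sorted(C, key=lambda v: work[v], reverse=True):
        i = 0 if loads[0] <= loads[1] else 1      # put the subtree in the lighter bin
        bins[i].append(node)
        loads[i] += work[node]
    return bins, loads

def acceptable(loads, eps=0.05):
    """Accept the split if the two halves differ by at most eps, relatively."""
    big, small = max(loads), min(loads)
    return big > 0 and (big - small) / big <= eps

def subforest_partition(root, work, children, eps=0.05):
    """Return (nodes assigned to all processors, subforest C1, subforest C2)."""
    C, shared = {root}, []
    while True:
        if not C:                                  # every node ended up shared
            return shared, [], []
        (C1, C2), loads = split(C, work)
        if acceptable(loads, eps):
            return shared, C1, C2
        node = max(C, key=lambda v: work[v])       # select(): here, largest subtree work
        shared.append(node)                        # this node is factored by all processors
        C.remove(node)
        C.update(children[node])

# Hypothetical tree: root worth 100 units (7 of its own), children worth 65 and 28.
work = {"A": 100, "B": 65, "C": 28, "D": 45, "E": 15}
children = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}
print(subforest_partition("A", work, children))
# -> (['A', 'B'], ['D'], ['C', 'E'])   (12 units shared, halves of 45 and 43 units)

Here work[v] is the total number of operations in the subtree rooted at v and children[v] lists its children, both of which the paper obtains from the symbolic factorization.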
4.2 Other Modifications

In this section we describe the necessary modifications of the algorithm presented in Section 3 to accommodate the new mapping scheme.

Initially each processor factors the subtrees of the elimination tree assigned to itself. This represents local computation and requires no communication, just as in the earlier algorithm. However, since each processor is assigned more than one subtree of the elimination tree, at the end of this local computation its stack will contain one update matrix for each tree assigned to it. At this point it needs to perform a distributed extend-add operation with its neighbor processor at the first level of the virtual binary tree. During this step, each processor splits the update matrices and sends the parts that are not local to the other processor. This is similar to the parallel extend-add operation required by the algorithm described in Section 3, except that more than one update matrix is split and sent. Note that the pieces from different update matrices can all be packed into a single message, as the communication happens with only one other processor. Furthermore, as shown in Appendix D, the amount of data being transmitted in each parallel extend-add step is no more than it is in the earlier algorithm [13]. The reason is that even though more update matrices are being transmitted, these update matrices correspond to nodes that are deeper in the elimination tree, and the sizes of these matrices are much smaller.

Now, after pairs of processors have performed the extend-add operation, they cooperate to factor the nodes of the elimination tree assigned to them. The nodes are eliminated in postorder order. Next, groups of four processors exchange the update matrices that are stored in their stacks to perform the extend-add operation for the next level. This process continues until all the nodes have been factored. The new parallel multifrontal algorithm is outlined in Appendix C.

5 Improving Performance

We have added a number of modifications to the algorithm described in Section 4 that greatly improve its performance. In the following sections we briefly describe these modifications. For a more detailed description of these enhancements the reader should refer to [21].

5.1 Block Cyclic Mapping

As discussed in Appendix D, for the factorization of a supernode we use the pipelined variant of the grid-based dense Cholesky algorithm [22]. In this algorithm, successive rows of the frontal matrix are factored one after the other, and the communication and computation proceed in a pipelined fashion. Even though this scheme is simple, it has two major limitations.

First, since the rows and columns of a frontal matrix are distributed among the processor grid in a cyclic fashion, information for only one row is transmitted at any given time. Hence, on architectures in which the message startup time is relatively high compared to the transfer time, the communication overhead is dominated by the startup time. For example, consider a sqrt(p) x sqrt(p) processor grid and a k-node supernode that has a frontal matrix of size m x m. While performing the k elimination steps on this frontal matrix, on average a message of size O(m/(2 sqrt(p))) needs to be sent in each step along each direction of the grid. If the message startup time is 100 times higher than the per-word transfer time, then for p = 256 the startup time will dominate the data transfer time as long as m <= 3200; for most sparse matrices, the size of the frontal matrices tends to be much less than 1600, so in practice the startup time dominates.
The second limitation of the cyclic mapping has to do with the implementation efficiency of the computation phase of the factorization. Since at each step only one row is eliminated, the factorization algorithm must perform a rank-one update. On systems with BLAS routines, this can be done using either level one BLAS (DAXPY) or level two BLAS (DGER, DGEMV). On most microprocessors, including high performance RISC processors such as the DEC Alpha AXP, the peak performance achievable by these primitives is usually significantly less than that achieved by level three BLAS primitives, such as matrix-matrix multiply (DGEMM). The reason is that for level one and level two BLAS routines, the amount of computation is of the same order as the amount of data movement between CPU and memory. In contrast, for level three BLAS operations, the amount of computation is much higher than the amount of data required from memory. Hence, level three BLAS operations can better exploit the multiple functional units and deep pipelines available in these processors.

However, by distributing the frontal matrices using a block cyclic mapping [22], we are able to eliminate both of the above limitations and greatly improve the performance of our algorithm. In the block cyclic mapping, the rows and columns of the matrix are divided into groups, each of size b, and these groups are assigned to the processors in a cyclic fashion. As a result, diagonal processors now store blocks of b consecutive pivots. Instead of performing a single elimination step, they now perform b elimination steps and send data corresponding to b rows in a single message. Note that the overall volume of data transferred remains the same. For sufficiently large values of b, the startup time becomes a small fraction of the data transmission time. This results in a significant improvement on architectures with high startup time. In each phase, each processor now receives b rows and columns and has to perform a rank-b update on the not yet factored part of its frontal matrix. The rank-b update can be implemented using matrix-matrix multiplication, leading to a higher computational rate.

There are a number of design issues in selecting the proper value for b. Clearly, the block size should be large enough so that the rank-b update achieves high performance and the startup time becomes a small fraction of the data transfer time. On the other hand, a very large value of b leads to a number of problems. First, the processors storing the current set of b rows to be eliminated have to construct the rank-b update by performing b rank-1 updates. If b is large, performing these rank-1 updates takes a considerable amount of time and stalls the pipeline. Also, a large value of b leads to load imbalance in the number of elements of the frontal matrix assigned to each processor, because there are fewer blocks to distribute in a cyclic fashion.
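A minimal sketch of the two ideas just described follows; the function names and the use of NumPy in place of the tuned BLAS routines on the T3D are illustrative assumptions.

import numpy as np

def block_cyclic_owner(i, j, b, pr, pc):
    """Grid coordinates of the processor owning element (i, j) when rows and
    columns are dealt out in blocks of size b over a pr x pc processor grid."""
    return (i // b) % pr, (j // b) % pc

def rank_b_update(trailing, W):
    """Apply the contribution of b eliminated rows W (b x m) to the trailing
    m x m block with one matrix-matrix product (level-3 BLAS style)."""
    return trailing - W.T @ W

# The single product is equivalent to b successive rank-1 updates (level-1/2 style):
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 6))
A = 10.0 * np.eye(6)
A_loop = A.copy()
for r in range(2):
    A_loop -= np.outer(W[r], W[r])
assert np.allclose(rank_b_update(A, W), A_loop)
print(block_cyclic_owner(i=21, j=5, b=16, pr=2, pc=2))   # (1, 0)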
Note that this load imbalance within the dense factorization phase is different from the load imbalance associated with distributing the elimination tree among the processors described in Section 4. A number of other design issues involved in using the block cyclic mapping, and ways to further improve the performance, are described in [21].

5.2 Pipelined Cholesky Factorization

In the parallel portion of our multifrontal algorithm, each frontal matrix is factored using a grid-based pipelined Cholesky factorization algorithm. This pipelined algorithm works as follows [22]. Assume that the processor grid stores the upper triangular part of the frontal matrix, and that the processor grid is square. The diagonal processor that stores the current pivot divides the elements of the pivot row it stores by the pivot, sends the pivot to its neighbor on the right, and sends the scaled pivot row to its neighbor below. Each processor, upon receiving the pivot, scales its part of the pivot row and sends the pivot to the right and its scaled pivot row down. When a diagonal processor receives a scaled pivot row from the processor above it, it forwards this row down along its column and also to its right neighbor. Every other processor, upon receiving a scaled pivot row either from the left or from the top, stores it locally and then forwards it to the processor at the opposite end. For simplicity, assume that data is taken out of the pipeline by the processor that initiated the transmission. Each processor performs a rank-1 update of its local part of the frontal matrix as soon as it receives the necessary elements from the top and the left. The processor storing the next pivot starts eliminating the next row as soon as it has finished the computation for the previous iteration. The process continues until all the rows have been eliminated.

Even though this algorithm is correct, and its asymptotic performance is as described in Appendix D, it requires buffers for storing messages that have arrived but cannot yet be processed. This is because certain processors receive the two sets of data they need to perform a rank-1 update at different times. Consider for example a 4 x 4 processor grid, and assume that processor (0,0) has the first pivot. Even though processor (1,0) receives data from the top almost right away, the data from the left must come from processor (0,1) by way of the other processors of grid row 1, ending with (1,3). Now if processor (0,0) also had the second pivot (due to a block size greater than one), then the message buffer on processor (1,0) might contain the message from processor (0,0) corresponding to the second pivot before the message from (0,1) corresponding to the first pivot had arrived. The source of this problem is that the processors along the row act as the sources for both types of messages (those circulating along the rows and those circulating along the columns). When a similar algorithm is used for Gaussian elimination, the problem does not arise because data start from a column and a row of processors, and messages from these rows and columns arrive at each processor at roughly the same time.

On machines with very high bandwidth, the overhead involved in managing buffers significantly reduces the percentage of the obtainable bandwidth. This effect is even more pronounced for small messages. For this reason, we decided to implement our algorithm with only a single message buffer per neighbor. As mentioned in Section 6, this communication protocol enables us to utilize most of the theoretical bandwidth of the Cray T3D. However, under the restriction of limited message buffer space, the above Cholesky algorithm spends a significant amount of time idling. This is due to the following requirement imposed by the single communication buffer. Consider processors (k,k) and (k+1,k). Before processor (k,k) can start the (i+1)th iteration, it must wait until processor (k+1,k) has started performing the rank-1 update for the ith iteration (so that processor (k,k) can go ahead and reuse the buffers for iteration i+1). However, since processor (k+1,k) receives data from the left much later than it does from the top, processor (k,k) must wait until this latter data transmission has taken place. Essentially, during this time processor (k,k) sits idle. Because it sits idle, the (i+1)th iteration starts late, and data will arrive at processor (k+1,k) even later. Thus, at each step, certain processors spend a certain amount of time being idle. This time is proportional to the time it takes for a message to travel along an entire row of processors, which increases substantially with the number of processors.

To solve this problem, we slightly modified the communication pattern of the pipelined Cholesky algorithm as follows. As soon as a processor that contains elements of the pivot row has finished scaling them, it sends them both down and also to the transposed processor across the main diagonal. This latter processor, upon receiving the scaled row, starts moving it along the rows. The diagonal processors no longer forward the data they receive from the top to the right. The reason for this modification is to mimic the behavior of the algorithm that performs Gaussian elimination. On architectures with cut-through routing, the overhead of this communication step is comparable to that of a nearest-neighbor transmission for sufficiently large messages. Furthermore, because these messages are initiated at different times, they cause little or no contention. As a result, each diagonal processor now has to sit idle for only a very small amount of time [21].

6 Experimental Results

We implemented our new parallel sparse multifrontal algorithm on a 256-processor Cray T3D parallel computer. Each processor on the T3D is a 150 MHz DEC Alpha chip with a peak performance of 150 MFlops for 64-bit operations (double precision). The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 150 MBytes per second and a very small latency. Even though the memory on the T3D is physically distributed, it can be addressed globally; that is, processors can directly access (read and/or write) other processors' memory. The T3D provides a library interface to this capability called SHMEM. We used SHMEM to develop a lightweight message passing system. Using this system we were able to achieve unidirectional data transfer rates of up to 70 MBytes per second. This is significantly higher than the 35 MBytes per second channel bandwidth usually obtained when using the T3D's PVM.

For the computation performed during the dense Cholesky factorization, we used single-processor implementations of BLAS primitives. These routines are part of the standard scientific library on the T3D, and they have been fine-tuned for the Alpha chip.

The new algorithm was tested on matrices from a variety of sources. Four matrices (BCSSTK30, BCSSTK31, BCSSTK32, and BCSSTK33) come from the Boeing-Harwell matrix set. MAROS-R7 is from a linear programming problem taken from NETLIB. COPTER2 comes from a model of a helicopter rotor.
ersforiteration(+1)).Hoer,sinceprocessoresdatafromtheleftmhlaterthanitdoesfromthetop,processork;kustwaituntilthislatterdatatransmissionhastakenplace.Essentiallyduringthistimeprocessork;ksitsidle.Because,itsitsidle,the+1iterationwillstartlate,anddatawillarriveatprocessorenlater.Thus,ateachstep,certainprocessorsspendcertainamountoftimebeingidle.Thistimeisproportionaltothetimeittakforamessagetotraelalonganentirerowofprocessors,whichincreasessubstantiallywiththenberofprocessors.osolvethisproblem,weslightlymodi edthecommunicationpatternofthepipelinedCholeskyalgo-rithmasfollows.Assoonasaprocessorthatcontainselementsofthepivotrowhas nishedscalingit,itsendsitbothdownandalsotothetransposedprocessoralongthemaindiagonal.Thislatterprocessoruponreceivingthescaledrow,startsmovingitalongtherows.Nowthediagonalprocessorsdonotanardthedatatheyreceivefromthetoptotheright.Thereasonforthismodi cationistomimicthebe-viorofthealgorithmthatperformsGaussianelimination.Onarchitectureswithcut-throughrouting,theerheadofthiscommnstepiscomparabletothatofanearestneighbortransmissionforsucienlargemessages.Furthermore,becausethesemessagesareinitiatedatdi erenttimescauselittleornocontion.Asaresult,eachdiagonalprocessornowwillhaetositidleforaverysmallamountoftime[21ExperimentalResultseimplementedournewparallelsparsemtalalgorithmona256-processorsCrayT3Dparallelcomputer.EachprocessorontheT3Disa150MhzDecAlphachip,withpeakperformanceof150MFlopsfor64-bitoperations(doubleprecision).Theprocessorsareinterconnectedviaathreedimensionaltorusorkthathasapeakunidirectionalbandwidthof150Bytespersecond,andaverysmalllatency.EvthoughthememoryonT3Disphysicallydistributed,itcanbeaddressedglobally.Thatis,processorscandirectlyaccess(readand/orwrite)otherprocessor'smemory.T3DprovidesalibraryinterfacetothisycalledSHMEM.WeusedSHMEMtodevelopalightmessagepassingsystem.Usingthissystemwereabletoaceunidirectionaldatatransferratesupto70Mbytespersecond.Thisistlyhigherthanthe35MByteschannelbandwidthusuallyobtainedwhenusingT3D'sPVM.orthecomputationperformedduringthedenseCholeskyfactorization,weusedsingle-processorim-tationofBLASprimitives.Theseroutinesarepartofthestandardscienti clibraryonT3D,andtheyhaebeen netunedfortheAlphachip.Thenewalgorithmwastestedonmatricesfromavofsources.Fourmatrices(BCSSTK40,BCSSTK31,BCSSTK32,andBCSSTK33)comefromtheBoeing-ellmatrixset.MAROS-R7isfromalinearprogrammingproblemtakenfromNETLIB.COPTER2 comesfromamodelofahelicopterrotor.CUBE35isa3535regularthree-dimensionalgrid.Inallofourexperiments,weusedspectralnesteddissection[29,30]toorderthematrices.TheperformanceobtainedbyourmtalalgorithminsomeofthesematricesisshowninTable2.Theoperationcountshowsonlythenberofoperationsrequiredtofactorthenodesoftheeliminationtree(itdoesnotincludetheoperationsinedinextend-add).Someoftheseproblemscouldnotberunon32processorsduetomemoryconstraints(inourT3D,eachprocessorhadonly2Mwordsofmemory).Figure4graphicallyrepresentsthedatashowninTable2.Figure4(a)showstheoerallperformanceobtainedversusthenberofprocessors,andissimilarinnaturetoaspeedupcurve.Figure4(b)showstheperprocessorperformanceversusthenberofprocessors,andre ectsreductionineciencyasSincealltheseproblemsrunoutofmemoryononeprocessor,thestandardspeedupandeciencycouldnotbecomputedexperimen berofProcessors Problem n jAj jLj OperationCoun 32 64 128 256 MAROS-R7 3136 330472 1345241 720M 0.83 1.41 2.18 3.08 BCSSTK30 28924 1007284 5796797 2400M 1.48 2.45 3.59 BCSSTK31 35588 572914 6415883 3100M 1.47 2.42 3.87 BCSSTK32 44609 985046 8582414 4200M 1.51 2.63 4.12 BCSSTK33 8738 291583 2295377 1000M 0.78 1.23 1.92 
                                                             GFlops on p processors
  Problem     n       |A|       |L|        Op. count     p=32    p=64    p=128   p=256
  MAROS-R7     3136    330472    1345241      720M       0.83    1.41    2.18    3.08
  BCSSTK30    28924   1007284    5796797     2400M        -      1.48    2.45    3.59
  BCSSTK31    35588    572914    6415883     3100M        -      1.47    2.42    3.87
  BCSSTK32    44609    985046    8582414     4200M        -      1.51    2.63    4.12
  BCSSTK33     8738    291583    2295377     1000M       0.78    1.23    1.92    2.86
  COPTER2     55476    352238   12681357     9200M        -      1.92    3.17    5.51
  CUBE35      42875    124950   11427033    10300M        -      2.23    3.75    6.06

Table 2: The performance of sparse direct factorization on the Cray T3D. For each problem the table contains the number of equations n of the matrix A, the original number of nonzeros in A, the number of nonzeros in the Cholesky factor L, the number of operations required to factor the nodes, and the performance in gigaflops for different numbers of processors.

Figure 4: Plot of the performance of the parallel sparse multifrontal algorithm for various problems on the Cray T3D: (a) total gigaflops obtained; (b) megaflops per processor.

The highest performance of 6 GFlops was obtained for CUBE35, which is a regular three-dimensional problem. Nearly as high a performance (5.51 GFlops) was also obtained for COPTER2, which is irregular. Since both problems have similar operation counts, this shows that our algorithm performs equally well in factoring matrices arising in irregular problems. Focusing our attention on the other problems shown in Table 2, we see that even on smaller problems our algorithm performs quite well. For BCSSTK33, it was able to achieve 2.86 GFlops on 256 processors, while for BCSSTK30 it achieved 3.59 GFlops.

To further illustrate how the various components of our algorithm work, we have included a breakdown of the various phases for BCSSTK31 and CUBE35 in Table 3. This table shows the average time spent by all the processors in the local computation and in the distributed computation. Furthermore, we break down the time taken by the distributed computation into two major phases: (a) dense Cholesky factorization, and (b) extend-add overhead. The latter includes the cost of performing the extend-add operation, splitting the stacks, transferring the stacks, and idling due to load imbalances in the subforest-to-subcube partitioning. Note that the figures in this table are averages over all processors, and they should be used only as an approximate indication of the time required for each phase.

A number of interesting observations can be made from this table. First, as the number of processors increases, the time spent processing the local tree in each processor decreases substantially because the subforest assigned to each processor becomes smaller. This trend is more pronounced for three-dimensional problems, because they tend to have fairly shallow trees. The cost of the distributed extend-add phase decreases almost linearly as the number of processors increases. This is consistent with the analysis presented in Appendix D, since the per-processor overhead of the distributed extend-adds decreases as the number of processors increases. Since the figure for the time spent during the extend-add steps also includes the idling due to load imbalance, the almost linear decrease also shows that the load imbalance is quite small.

The time spent in the distributed dense Cholesky factorization decreases as the number of processors increases. This reduction is not linear with respect to the number of processors for two reasons: (a) the ratio of communication to computation during the dense Cholesky factorization steps increases, and (b) for a fixed-size problem the load imbalance due to the block cyclic mapping becomes worse as the number of processors increases. For the reasons discussed in Section 5.1, we distributed the frontal matrices in a block-cyclic fashion. To get good performance out of the level three BLAS routines on the Cray T3D, we used a block size of sixteen (block sizes of less than sixteen result in degradation of level 3 BLAS performance on the Cray T3D). However, such a large block size results in a significant load imbalance within the dense factorization phase. This load imbalance becomes worse as the number of processors increases. However, as the size of the problem increases, both the communication overhead during the dense Cholesky factorization and the load imbalance due to the block cyclic mapping become less significant. The reason is that larger problems usually have larger frontal matrices at the top levels of the elimination tree, so even large processor grids can be effectively utilized to factor them.
This is illustrated by comparing how the various overheads decrease for BCSSTK31 and CUBE35. For example, for BCSSTK31 the factorization on 128 processors is only 48% faster than on 64 processors, while for CUBE35 the factorization on 128 processors is 66% faster than on 64 processors.

                         BCSSTK31                                CUBE35
          Local         Distributed computation    Local         Distributed computation
  p       computation   Factorization  Extend-add  computation   Factorization  Extend-add
  64      0.17          1.34           0.58        0.15          3.74           0.71
  128     0.06          0.90           0.32        0.06          2.25           0.43
  256     0.02          0.61           0.18        0.01          1.44           0.24

Table 3: A breakdown of the various phases of the sparse multifrontal algorithm for BCSSTK31 and CUBE35. Each number represents time in seconds.

To see the effect of the choice of e on the overall performance of the sparse factorization algorithm, we factored BCSSTK31 on 128 processors using e = 0.4 and e = 0.0001. Using these values for e we obtained a performance of 1.18 GFlops when e = 0.4, and 1.37 GFlops when e = 0.0001. In either case, the performance is worse than the 2.42 GFlops obtained for e = 0.05. When e = 0.4, the mapping of the elimination tree to the processors resembles that of the subtree-to-subcube allocation; thus, the performance degradation is due to the elimination tree load imbalance. When e = 0.0001, the elimination tree mapping assigns a large number of nodes to all the processors, leading to poor performance during the dense Cholesky factorization.

7 Concluding Remarks

The experimental results clearly show that our new scheme is capable of using a large number of processors. On a single processor of a state-of-the-art vector supercomputer such as the Cray C90, sparse Cholesky factorization can be done at a rate of roughly 500 MFlops for the larger problems studied in Section 6. Even a 32-processor Cray T3D clearly outperforms a single-node C90 for these problems.

Our algorithm as presented (and implemented) works for Cholesky factorization of symmetric positive definite matrices. With small modifications, it is also applicable to the factorization of other sparse matrices, as long as no pivoting is required (e.g., sparse matrices arising out of structural engineering problems).

With highly parallel formulations available, the factorization step is no longer the most time consuming step in the solution of sparse systems of equations. Another step that is quite time consuming, and has not been parallelized effectively, is that of ordering. In our current research we are investigating ordering algorithms that can be implemented efficiently on parallel computers [16, 34].

Acknowledgements

We would like to thank the Minnesota Supercomputing Center for providing access to a 64-processor Cray T3D, and Cray Research Inc. for providing access to a 256-processor Cray T3D. Finally, we wish to thank Dr. Alex Pothen for his guidance with spectral nested dissection ordering.

References

[1] Cleve Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. A comparison of three column-based distributed sparse factorization schemes. Technical Report YALEU/DCS/RR-810, Yale University, New Haven, CT, 1990. Also appears in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.
[2] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software, 9:302-325, 1983.
[3] Kalluri Eswar, P. Sadayappan, and V. Visvanathan. Supernodal sparse Cholesky factorization on distributed-memory multiprocessors. In International Conference on Parallel Processing, pages 18-22 (vol. 3).
[4] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, March 1990. Also appears in K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.
[5] D. M. Gay. Electronic mail distribution of linear programming test problems. Mathematical Programming Society COAL Newsletter, December 1985.
[6] G. A. Geist and E. G.-Y. Ng. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming, 18(4):291-314, 1989.
[7] G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory multiprocessor architectures. SIAM Journal on Scientific and Statistical Computing, 9(4):639-649, 1988. Also available as Technical Report ORNL/TM-10383, Oak Ridge National Laboratory, Oak Ridge, TN, 1987.
[8] A. George, M. T. Heath, J. W.-H. Liu, and E. G.-Y. Ng. Sparse Cholesky factorization on a local-memory multiprocessor. SIAM Journal on Scientific and Statistical Computing, 9:327-340, 1988.
[9] A. George and J. W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[10] A. George and J. W.-H. Liu. The evolution of the minimum degree ordering algorithm. SIAM Review, 31(1):1-19, March 1989.
[11] A. George, J. W.-H. Liu, and E. G.-Y. Ng. Communication results for parallel sparse Cholesky factorization on a hypercube. Parallel Computing, 10(3):287-298, May 1989.
[12] John R. Gilbert and Robert Schreiber. Highly parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 13:1151-1172, 1992.
[13] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factorization. Technical Report 94-19, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. A shorter version appears in Supercomputing '94.
[14] M. T. Heath, E. Ng, and B. W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33(3):420-460, 1991.
[15] M. T. Heath, E. G.-Y. Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33:420-460, 1991. Also appears in K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.
[16] M. T. Heath and P. Raghavan. A Cartesian nested dissection algorithm. Technical Report UIUCDCS-R-92-1772, Department of Computer Science, University of Illinois, Urbana, IL 61801, October 1992. To appear in SIAM Journal on Matrix Analysis and Applications.
[17] M. T. Heath and P. Raghavan. Distributed solution of sparse linear systems. Technical Report 93-1793, Department of Computer Science, University of Illinois, Urbana, IL, 1993.
[18] Laurie Hulbert and Earl Zmijewski. Limiting communication in parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 12(5):1184-1197, September 1991.
[19] George Karypis, Anshul Gupta, and Vipin Kumar. Ordering and load balancing for parallel factorization of sparse matrices. Technical Report (in preparation), Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.
[20] George Karypis, Anshul Gupta, and Vipin Kumar. A parallel formulation of interior point algorithms. In Supercomputing '94, 1994. Also available as Technical Report TR 94-20, Department of Computer Science, University of Minnesota, Minneapolis, MN.
[21] George Karypis and Vipin Kumar. A high performance sparse Cholesky factorization algorithm for scalable parallel computers. Technical Report 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.
[22] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings Publishing Company, Redwood City, CA, 1994.
[23] Joseph W. H. Liu. The multifrontal method for sparse matrix solution: theory and practice. SIAM Review, 34(1):82-109, 1992.
[24] Robert F. Lucas. Solving planar systems of equations on distributed-memory multiprocessors. PhD thesis, Department of Electrical Engineering, Stanford University, Palo Alto, CA, 1987. Also see IEEE Transactions on Computer-Aided Design, 6:981-991, 1987.
[25] Robert F. Lucas, Tom Blank, and Jerome J. Tiemann. A parallel solution method for large sparse systems of equations. IEEE Transactions on Computer-Aided Design, CAD-6(6):981-991, November 1987.
[26] Mo Mu and John R. Rice. A grid-based subtree-subcube assignment strategy for solving partial differential equations on hypercubes. SIAM Journal on Scientific and Statistical Computing, 13(3):826-839, May 1992.
[27] Dianne P. O'Leary and G. W. Stewart. Assignment and scheduling in parallel matrix factorization. Linear Algebra and its Applications, 77:275-299, 1986.
[28] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.
[29] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 1990.
[30] Alex Pothen, Horst D. Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430-452, 1990.
[31] Alex Pothen and Chunguang Sun. Distributed multifrontal factorization using clique trees. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 34-40, 1991.
[32] Roland Pozo and Sharon L. Smith. Performance evaluation of the parallel multifrontal method in a distributed-memory environment. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 453-456, 1993.
[33] P. Raghavan. Distributed sparse Gaussian elimination and orthogonal factorization. Technical Report 93-1818, Department of Computer Science, University of Illinois, Urbana, IL, 1993.
[34] P. Raghavan. Line and plane separators. Technical Report UIUCDCS-R-93-1794, Department of Computer Science, University of Illinois, Urbana, IL 61801, February 1993.
[35] Edward Rothberg. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multicomputers. In Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.
[36] Edward Rothberg and Anoop Gupta. An efficient block-oriented approach to parallel sparse Cholesky factorization. In Supercomputing '93 Proceedings, 1993.
[37] P. Sadayappan and Sailesh K. Rao. Communication reduction for distributed sparse matrix factorization on a processor mesh. In Supercomputing '89 Proceedings, pages 371-379, 1989.
[38] Robert Schreiber. Scalability of sparse direct solvers. Technical Report RIACS TR 92.13, NASA Ames Research Center, Moffett Field, CA, May 1992. Also appears in A. George, John R. Gilbert, and J. W.-H. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms (An IMA Workshop Volume), Springer-Verlag, New York, NY, 1993.
[39] Chunguang Sun. Efficient parallel solutions of large sparse SPD systems on distributed-memory multiprocessors. Technical report, Department of Computer Science, Cornell University, Ithaca, NY, 1993.
[40] Sesh Venugopal and Vijay K. Naik. Effects of partitioning and scheduling sparse matrix factorization on communication and load balance. In Supercomputing '91 Proceedings, pages 866-875, 1991.
[41] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 2:77-79, 1981.

Appendix A: The Multifrontal Method

Let A be an n x n symmetric positive definite matrix and L be its Cholesky factor. Let T be its elimination tree, and define T[i] to be the set of descendants of node i in the elimination tree. Consider the jth column of L, and let i_1, ..., i_r be the row subscripts of the off-diagonal nonzeros in column j of L. The subtree update matrix at column j is defined as

    \bar{U}_j = - \sum_{k \in T[j] \setminus \{j\}}
                \begin{pmatrix} l_{j,k} \\ l_{i_1,k} \\ \vdots \\ l_{i_r,k} \end{pmatrix}
                \left( l_{j,k}, l_{i_1,k}, \ldots, l_{i_r,k} \right).

Note that \bar{U}_j contains outer-product contributions from those previously eliminated columns that are descendants of j in the elimination tree. The frontal matrix F_j is defined to be

    F_j = \begin{pmatrix} a_{j,j}   & a_{j,i_1} & \cdots & a_{j,i_r} \\
                          a_{i_1,j} &           &        &           \\
                          \vdots    &           & 0      &           \\
                          a_{i_r,j} &           &        &           \end{pmatrix} + \bar{U}_j.        (1)

Thus, the first row/column of F_j is formed from A and the subtree update matrix at column j. Having formed the frontal matrix F_j, the algorithm proceeds to perform one step of elimination on F_j that gives the nonzero entries of column j of the factor L. In matrix notation, this elimination step can be written as

    F_j = \begin{pmatrix} l_{j,j}     & 0 \\ \bar{l}_j & I \end{pmatrix}
          \begin{pmatrix} 1 & 0 \\ 0 & U_j \end{pmatrix}
          \begin{pmatrix} l_{j,j} & \bar{l}_j^T \\ 0 & I \end{pmatrix},

where l_{j,j} and \bar{l}_j = (l_{i_1,j}, \ldots, l_{i_r,j})^T are the nonzero elements of column j of the Cholesky factor. The r x r matrix U_j is called the update matrix for column j and is formed as part of the elimination step.
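A minimal numerical sketch of this single elimination step follows; it is illustrative only, stores the full symmetric frontal matrix rather than just the upper triangle the paper keeps, and uses NumPy in place of the BLAS.

import numpy as np

def eliminate_one_step(F):
    """One multifrontal elimination step on an assembled (t+1) x (t+1) frontal
    matrix F: returns column j of L (restricted to its nonzero rows) and the
    t x t update matrix passed to the parent."""
    l = F[:, 0] / np.sqrt(F[0, 0])              # scale the pivot column
    U = F[1:, 1:] - np.outer(l[1:], l[1:])      # rank-1 update of the remainder
    return l, U

# Hypothetical 3 x 3 frontal matrix:
F = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 1.0],
              [2.0, 1.0, 6.0]])
l, U = eliminate_one_step(F)
# l == [2, 1, 1]; U == [[4, 0], [0, 5]]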
In practice, F_j is never computed using Equation 1, but is constructed from the update matrices as follows. Let c_1, ..., c_s be the children of j in the elimination tree; then

    F_j = \hat{F}_j \leftrightarrow U_{c_1} \leftrightarrow \cdots \leftrightarrow U_{c_s},

where \hat{F}_j is the initial frontal matrix (whose first row and column come from A and whose remaining entries are zero), and the symbol written here as a double arrow is the extend-add operator, a generalized matrix addition. The extend-add operation works as follows: the index set of the result is the union of the index sets of the two operands; each entry of each operand is mapped onto the location of the result given by its row and column indices; entries from both operands that overlap on a location are added, and empty locations are set to zero. For example, an update matrix whose index set is {2, 5} (its first row/column corresponds to the second row/column of the frontal matrix and its second row/column to the fifth) can be extend-added with an update matrix whose index set only partially overlaps it. Note that the extend-add of the children's update matrices may have fewer rows/columns than \bar{U}_j, but if it is properly extended by the index set of F_j, it becomes the subtree update matrix \bar{U}_j. The process of forming F_j from the nonzero structure elements of column j of A and the update matrices is called the frontal matrix assembly operation. Thus, in the multifrontal method, the elimination of each column involves the assembly of a frontal matrix and one step of elimination. The supernodal multifrontal algorithm proceeds similarly to the multifrontal algorithm, but when a supernode of the elimination tree gets eliminated, this corresponds to multiple elimination steps on the same frontal matrix. The number of elimination steps is equal to the number of nodes in the supernode.

Appendix B: Data Distribution

Figure 5(a) shows a 4 x 4 grid of processors embedded in the hypercube. This embedding uses the shuffle mapping: the grid position with coordinates (y, x) is mapped onto the processor whose address is obtained by interleaving the bits of the binary representations of x and y. For instance, the grid position with binary coordinates (ab, cd) is mapped onto the processor whose address in binary is cadb; for example, grid position (10, 11) is mapped onto processor 1110. Note that when we split this 4 x 4 processor grid into two 4 x 2 subgrids, these subgrids correspond to distinct subcubes of the original hypercube. This property is maintained in subsequent splits; for instance, each 2 x 2 grid of a 4 x 2 grid is again a subcube. This is important because the algorithm uses the subtree-to-subcube partitioning scheme, and by using the shuffle mapping, each subcube is simply half of the processor grid.
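A minimal sketch of the shuffle mapping just described; the function name is illustrative, and the interleaving order is an assumption implied by the (10, 11) to 1110 example above.

def shuffle_address(y, x, bits):
    """Hypercube address of grid position (y, x): interleave the bits of x and y,
    most significant first, taking x's bit before y's at each position."""
    addr = 0
    for k in reversed(range(bits)):
        addr = (addr << 1) | ((x >> k) & 1)
        addr = (addr << 1) | ((y >> k) & 1)
    return addr

# A 4 x 4 grid needs 2 bits per coordinate: grid (0b10, 0b11) -> processor 0b1110.
print(bin(shuffle_address(0b10, 0b11, bits=2)))   # 0b1110

Because the high-order address bit comes from x, fixing that bit splits the 4 x 4 grid into two 4 x 2 subgrids that are also subcubes, which is exactly the property used above.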
Consider next the elimination tree shown in Figure 5(b). In this figure only the top two levels are shown, and at each node the nonzero structure of L for that node is also shown. Using the subtree-to-subcube partitioning scheme, the elimination tree is partitioned among the 16 processors as follows: the root node is assigned to all 16 processors, one of its subtrees is assigned to half the processors, and the other subtree is assigned to the remaining half. In Figure 5(c) we show how the frontal matrices corresponding to these nodes are distributed among the processors. Consider node C. Its frontal matrix is a 7 x 7 triangular matrix that corresponds to the index set {3, 4, 7, 10, 13, 14, 19}. The distribution of the rows and columns of this frontal matrix on a 4 x 2 processor grid is shown in part (c): row i is mapped onto the grid row given by the two least significant bits of i, while column j is mapped onto the grid column given by the least significant bit of j. (In this paper, a circled plus denotes bitwise exclusive-or and & denotes bitwise logical and.)

Figure 5: Mapping frontal matrices to the processor grid.

An alternate way of describing this mapping is to consider the coordinates of the processors in each processor grid. Each processor in the 4 x 2 grid has a coordinate (y, x) such that 0 <= y < 4 and 0 <= x < 2. Note that this coordinate is local with respect to the current grid, and is not the same as the grid coordinates of the processor in the entire 4 x 4 grid. For the rest of the paper, when we discuss grid coordinates we will always assume that they are local, unless specified otherwise. Row i is mapped onto the processors whose grid row coordinate y equals the two least significant bits of i; similarly, column j is mapped onto the processors whose grid column coordinate x equals the least significant bit of j. In general, in a 2^q x 2^r processor grid, element (i, j) is mapped onto the processor whose grid coordinates are given by the q least significant bits of i and the r least significant bits of j. Or, looking at it from the processor's point of view, processor (y, x) gets all elements (i, j) such that i & (2^q - 1) = y and j & (2^r - 1) = x.

The frontal matrix for node B is mapped onto the other 4 x 2 grid in a similar fashion. Note that the rows of both frontal matrices are mapped onto processors along the same rows of the 4 x 4 grid; this is because both grids have the same number of rows. Also, because the final 4 x 4 grid also has 4 rows, the rows of the root's frontal matrix that also appear in either child's frontal matrix are mapped onto the same row of processors. For example, row 7, which appears in all three frontal matrices, is mapped onto the fourth row of each of the three grids in question. This is important because, when the processors need to perform the extend-add operation, no data movement between rows of the grid is required. However, data does need to be communicated along the columns of the grid. This is because the grids storing the update matrices have fewer columns than the grid of processors that is going to factor the parent's frontal matrix. Data needs to be communicated so that the column mapping of the update matrices matches that of the new frontal matrix. Each processor can easily determine which columns of the update matrix it already stores need to be kept and which need to be sent away: it keeps the columns whose two least significant bits match its column coordinate in the 4 x 4 grid, and it sends the others away. A rejected column needs to be sent to a processor along the same row of the processor grid; since that column resided on this processor during the factorization involving the 4 x 2 grid, the destination's column coordinate can differ from the current one only in the most significant bit. Therefore, the rejected columns need to be sent to the processor along the same row whose grid coordinate differs in the most significant bit. Thus, during this extend-add operation, processors that are neighbors in the hypercube need to exchange data.

Appendix C: New Parallel Multifrontal Algorithm

1.  porder[1..k]           /* the nodes of the elimination tree assigned to this processor,
                              numbered in a level-postorder */
2.  lvl = log p
3.  for (j = 1; j <= k; j++)
4.      i = porder[j]
5.      if (level[i] != lvl)
6.          SplitStacks(lvl, level[i])
7.          lvl = level[i]
8.      Let F_i be the frontal matrix for node i, and c_1, ..., c_s be the children of i
        in the elimination tree
9.      F_i = \hat{F}_i \leftrightarrow U_{c_1} \leftrightarrow ... \leftrightarrow U_{c_s}
10.     Factor(F_i)         /* dense factorization using 2^(log p - lvl) processors */

The parallel multifrontal algorithm using the subforest-to-subcube partitioning scheme. The function SplitStacks(lvl, level[i]) performs lvl - level[i] parallel extend-add operations. In each of these extend-adds, each processor splits its local stack and sends data to the corresponding processors of the other group of processors. Note that when lvl = log p, Factor performs the factorization of F_i locally.

Appendix D

In this section we analyze the performance of the parallel multifrontal algorithm described in Section 4. In our new parallel multifrontal algorithm, and in that of Gupta and Kumar [13], there are two types of overhead due to the parallelization: (a) load imbalance due to the work partitioning, and (b) communication overhead due to the parallel extend-add operations and the factorization of the frontal matrices. As our experiments in Section 6 show, the subforest-to-subcube mapping used in our new scheme essentially eliminates the load imbalance. In contrast, for the subtree-to-subcube scheme used in [13], the overhead due to load imbalance is quite high.

The question now is whether the subforest-to-subcube mapping used in our new scheme results in higher communication overhead during the extend-add and dense factorization phases. This analysis is difficult due to the heuristic nature of our new mapping scheme. To keep the analysis simple, we present the analysis for regular two-dimensional grids, in which the number of subtrees mapped onto each subcube is four. The analysis also holds for any small constant number of subtrees. Our experiments have shown that the number of subtrees mapped onto each subcube is indeed small.

Communication Overhead

Consider a sqrt(n) x sqrt(n) regular finite difference grid and a p-processor hypercube-connected parallel computer. To simplify the analysis, we assume that the grid has been ordered using a nested dissection algorithm that selects cross-shaped separators [11]. For an n-node square grid, this scheme selects a separator of size 2 sqrt(n) - 1 that partitions the grid into four subgrids, each of size (sqrt(n) - 1)/2 x (sqrt(n) - 1)/2, or roughly sqrt(n)/2 x sqrt(n)/2.
Each of these subgrids is recursively dissected in a similar fashion. The resulting supernodal elimination tree is a quadtree. If the root of the elimination tree is at level zero, then a node at level l corresponds to a subgrid of size roughly sqrt(n)/2^l x sqrt(n)/2^l, and the separator of such a subgrid has size roughly 2 sqrt(n)/2^l. Also, it is proven in [11, 13] that the size of the update matrix of a node at level l is bounded by kn/4^l, where k is a small constant.

We assume that the subforest-to-subcube partitioning scheme partitions the elimination tree in the following way. It assigns all the nodes in the first two levels (the zeroth and first levels) to all the processors. It then splits the nodes at the second level into four equal parts and assigns to each part a quarter of the processors. The children of the nodes of each of these four groups are split into four equal parts, and each part is assigned to a quarter of the processors of each quarter. This process continues until subforests have been assigned to individual processors. Figure 6 shows this type of subforest-to-subcube partitioning. This scheme assigns to each subcube four nodes of the same level of the elimination tree. In particular, for i > 0, a subcube of sqrt(p)/2^i x sqrt(p)/2^i processors (that is, p/4^i processors) is assigned four nodes of level i+1 of the tree. Thus each subcube is assigned a forest consisting of four different trees.

Figure 6: A quadtree and the subforest-to-subcube allocation scheme. (a) The top four levels of a quadtree. (b) The subforest assigned to a quarter of the processors. (c) The subforest assigned to a quarter of the processors of the quarter of part (b).

Consider a subcube of sqrt(p)/2^i x sqrt(p)/2^i processors. After it has finished factoring the nodes at level i+1 of the elimination tree assigned to it, it needs to perform an extend-add with its corresponding subcube, so that the newly formed subcube of sqrt(p)/2^(i-1) x sqrt(p)/2^(i-1) processors can go ahead and factor the nodes of the ith level of the elimination tree. During this extend-add operation, each subcube needs to split the roots of each one of the subtrees assigned to it. Since each subcube is assigned four subtrees, it needs to split and send elements from four update matrices. Each node at level i+1 has an update matrix of size O(kn/4^(i+1)) distributed over p/4^i processors, so each processor needs to exchange O(kn/p) elements with the corresponding processor of the other subcube for each update matrix; since each processor has data from four update matrices, the total amount of data exchanged per processor in one distributed extend-add is still O(kn/p). There are O(log p) extend-add phases; thus, the total volume of data that each processor needs to exchange is O((n log p)/p). Note that this communication overhead is identical to that required by the subtree-to-subcube partitioning scheme [13].

For the dense Cholesky factorization we use the pipelined implementation of the algorithm on a two-dimensional processor grid using checkerboard partitioning. It can be shown [27, 22] that the communication overhead of performing k factorization steps of an m x m matrix in a pipelined implementation on a sqrt(q) x sqrt(q) grid of processors is O(km/sqrt(q)). Each node of level i+1 of the elimination tree is assigned to a grid of sqrt(p)/2^i x sqrt(p)/2^i processors, its frontal matrix has dimension O(sqrt(kn)/2^i), and the number of factorization steps performed on it is proportional to the separator size at that level; the resulting communication overhead per node is O((n/sqrt(p)) (1/2^i)), up to constants involving k. Since each such grid of processors handles four such nodes, the communication overhead at each level is also O((n/sqrt(p)) (1/2^i)), and the communication overhead summed over all levels is

    O( (n/sqrt(p)) \sum_{i >= 0} 1/2^i ) = O(n/sqrt(p)).

Therefore, the communication overhead summed over all the processors due to the parallel extend-adds and the dense Cholesky factorizations is O(n sqrt(p)), which is of the same order as for the scheme presented in [13]. Since the overall communication overhead of our new subforest-to-subcube mapping scheme is of the same order as that of the subtree-to-subcube mapping, the isoefficiency functions of the two schemes are the same. The analysis just presented can be extended to three-dimensional grid problems and to architectures other than the hypercube. In particular, the analysis applies directly to architectures such as the CM-5 and the Cray T3D.
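For completeness, here is a short derivation of the O(p^1.5) isoefficiency quoted in the introduction from the total overhead just obtained. It is not part of the original appendix, and it assumes the standard fact that factoring a matrix arising from a two-dimensional grid problem requires W = Theta(n^{3/2}) operations.

    T_o(n, p) = O(n \sqrt{p}), \qquad W = \Theta(n^{3/2}) \;\Longrightarrow\; n = \Theta(W^{2/3}).

    Keeping the efficiency E = \frac{W}{W + T_o} fixed requires W \propto T_o, that is

    W = \Omega\!\left(W^{2/3} \sqrt{p}\right) \;\Longrightarrow\; W^{1/3} = \Omega(\sqrt{p}) \;\Longrightarrow\; W = \Omega\!\left(p^{3/2}\right),

which is the O(p^1.5) isoefficiency function claimed for both mapping schemes.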