A High Performance Sparse Cholesky Factorization Algorithm For Scalable Parallel Computers

George Karypis and Vipin Kumar
Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
Technical Report 94-41

Abstract

This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtree-to-subcube mapping of the earlier formulation causes significant load imbalance among processors, which limits overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6 GFlops on a 256-processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.

1 Introduction

Direct methods for solving sparse linear systems are important because of their generality and robustness. For linear systems arising in certain applications, such as linear programming and some structural engineering applications, they are the only feasible methods for numerical factorization. It is well known that dense matrix factorization can be implemented efficiently on distributed-memory parallel computers [4, 27, 7, 22]. However, despite the inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [15, 38], and for several years it has been a challenge to implement efficient sparse linear system solvers using direct methods on even moderately parallel computers. In [38], Schreiber concludes that it is not yet clear whether sparse direct solvers can be made competitive at all for highly (256 processors) and massively (4096 processors) parallel computers.

A parallel formulation for sparse matrix factorization can be easily obtained by simply distributing rows to different processors [8]. Due to the sparsity of the matrix, communication overhead is a large fraction of the computation for this method, resulting in poor scalability. In particular, for sparse matrices arising out of planar finite element graphs, the isoefficiency of such a formulation is O(p^3 log^3 p); that is, the problem size (in terms of the total number of computations) should grow as O(p^3 log^3 p) to maintain a fixed efficiency. In a smarter parallel formulation [11], the rows of the matrix are allocated to processors using the subtree-to-subcube mapping. This localizes the communication among groups of processors, and thus improves the isoefficiency of the scheme to O(p^3). Rothberg and Gupta [36, 35] used a different method to reduce the communication overhead. In their method, the entire sparse matrix is partitioned among processors using a two-dimensional block cyclic mapping. This reduces the communication overhead and improves the isoefficiency to O(p^{1.5} log^3 p).
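To see where figures of this kind come from, the following worked derivation sketches the reasoning for the row-wise mapping (an illustrative reconstruction assuming the communication-overhead expressions that yield the quoted figures, not a verbatim excerpt from [13]). For a two-dimensional finite element problem ordered by nested dissection, the total factorization work for an n-vertex graph is W = Theta(n^{3/2}), and the overhead of the row-wise scheme grows as T_o = Theta(n p log p). A fixed efficiency requires W to grow proportionally to T_o:

\[
n^{3/2} \,\propto\, n\,p\log p \;\Rightarrow\; n \,\propto\, p^{2}\log^{2} p \;\Rightarrow\; W = \Theta\!\left(p^{3}\log^{3} p\right).
\]

Repeating the computation with T_o = Theta(n p) (localized subtree-to-subcube communication) gives O(p^3), and with T_o = Theta(n sqrt(p) log p) (two-dimensional block cyclic mapping) gives O(p^{1.5} log^3 p).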
Gupta and Kumar [13] recently developed a parallel formulation of sparse Cholesky factorization based on the multifrontal method. The multifrontal method [2, 23] is a form of submatrix Cholesky, in which single elimination steps are performed on a sequence of small, dense frontal matrices. One of the advantages of multifrontal methods is that the frontal matrices are dense, and therefore the elimination steps can be executed efficiently using level-three BLAS primitives.

This algorithm has two key features. It uses the subtree-to-subcube mapping to localize communication among processors, and it uses the highly scalable two-dimensional grid partitioning for the dense matrix factorization of each supernodal computation in the multifrontal algorithm. As a result, the communication overhead of this scheme is the lowest of all known parallel formulations for sparse matrix factorization [24, 25, 1, 31, 32, 39, 8, 38, 17, 33, 37, 3, 6, 18, 15, 40, 26, 12, 36, 35]. In fact, asymptotically, the isoefficiency of this scheme is O(p^{1.5}) for sparse matrices arising out of two- and three-dimensional finite element problems on a wide variety of architectures such as the hypercube, mesh, fat tree, and three-dimensional torus. Note that the isoefficiency of the best known parallel formulation of dense matrix factorization is also O(p^{1.5}) [22]. On a variety of problems, Gupta and Kumar report speedups of up to 364 on a 1024-processor nCUBE2, which is a major improvement over the previously existing algorithms. However, the subtree-to-subcube mapping results in gross imbalance of load among different processors, as elimination trees for most practical problems tend to be unbalanced. This load imbalance is responsible for a major portion of the efficiency loss of their scheme. Furthermore, the overall computation rate of their single-processor multifrontal code on the nCUBE2 was only 0.7 MFlops, and the maximum overall performance on a 1024-processor nCUBE2 was only 300 MFlops. This was partly due to the slow processors of the nCUBE2 (3.5 MFlops peak), and partly due to inadequacies in the implementation.

This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of the old scheme. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6 GFlops on a 256-processor Cray T3D for moderately large problems (even the biggest problem we tried took less than two seconds on a 256-node T3D; for larger problems, even higher performance can be achieved). To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization. Our new scheme, like the scheme of Gupta and Kumar [13], has an asymptotic isoefficiency of O(p^{1.5}) for matrices arising out of two- and three-dimensional finite element problems on a wide variety of architectures such as the hypercube, mesh, fat tree, and three-dimensional torus.

The rest of the paper is organized as follows. Section 2 presents a general overview of the Cholesky factorization process and multifrontal methods. Section 3 provides a brief description of the algorithm in [13]. Section 4 describes our new algorithm. Section 5 describes some further enhancements of the algorithm that significantly improve the performance. Section 6 provides the experimental evaluation of our new algorithms on a Cray T3D. Section 7 contains concluding remarks. Due to space limitations, many important topics, including the theoretical performance analysis of our algorithm, have been moved to the appendices.

2 Cholesky Factorization

Consider a system of linear equations Ax = b, where A is an n x n symmetric positive definite matrix, b is a known vector, and x is the unknown solution vector to be computed. One way to solve the linear system is first to compute the Cholesky factorization A = LL^T, where the Cholesky factor L is a lower triangular matrix. The solution vector x can then be computed by successive forward and backward substitutions to solve the triangular systems Ly = b and L^T x = y. If A is sparse, then during the course of the factorization, some entries that are initially zero in the upper triangle of A may become nonzero entries in L. These newly created nonzero entries of L are known as fill-in. The amount of fill-in generated can be decreased by carefully reordering the rows and columns of A prior to factorization. More precisely, we can choose a permutation matrix P such that the Cholesky factors of PAP^T have minimal fill-in.
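As a concrete illustration of this solution process, the following sketch factors a permuted matrix and recovers x by forward and backward substitution. It is a dense, illustrative fragment only: the permutation vector perm stands in for a fill-reducing ordering of the kind discussed below, and a real sparse solver would operate on compressed sparse structures rather than dense arrays.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def cholesky_solve(A, b, perm):
        """Solve A x = b via the Cholesky factorization of P A P^T."""
        PAPt = A[np.ix_(perm, perm)]                  # form P A P^T
        L = cholesky(PAPt, lower=True)                # P A P^T = L L^T
        y = solve_triangular(L, b[perm], lower=True)  # forward solve:  L y = P b
        z = solve_triangular(L.T, y, lower=False)     # backward solve: L^T z = y
        x = np.empty_like(z)
        x[perm] = z                                   # undo the permutation
        return x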
The problem of finding the best ordering for A that minimizes the amount of fill-in is NP-complete [41]; therefore, a number of heuristic algorithms for ordering have been developed. In particular, minimum degree ordering [9, 14, 10] is found to have low fill-in.

For a given ordering of a matrix, there exists a corresponding elimination tree. Each node in this tree is a column of the matrix. Node i is the parent of node j (j < i) if l_{i,j} is the first nonzero entry below the diagonal in column j of L. Elimination of rows in different subtrees can proceed concurrently. For a given matrix, elimination trees of smaller height usually have greater concurrency than trees of larger height. A desirable ordering for parallel computers must increase the amount of concurrency without increasing fill-in substantially. Spectral nested dissection [29, 30, 19] has been found to generate orderings that have both low fill-in and good parallelism. For the experiments presented in this paper we used spectral nested dissection. For a more detailed discussion of the effect of orderings on the performance of our algorithm, refer to [21].

In the multifrontal method for Cholesky factorization, a frontal matrix F_j and an update matrix U_j are associated with each node j of the elimination tree. The rows and columns of F_j correspond to the indices of the nonzeros of column j of L, in increasing order. In the beginning, F_j is initialized to an (s+1) x (s+1) matrix, where s+1 is the number of nonzeros in the lower triangular part of column j of A. The first row and column of this initial F_j are simply the upper triangular part of row j and the lower triangular part of column j of A. The remainder of F_j is initialized to all zeros. The tree is traversed in a postorder sequence. When the subtree rooted at a node j has been traversed, F_j has become a dense (t+1) x (t+1) matrix, where t is the number of off-diagonal nonzeros in column j of L. If j is a leaf in the elimination tree of A, then the final F_j is the same as the initial F_j. Otherwise, the final F_j for eliminating node j is obtained by merging the initial F_j with the update matrices obtained from all the subtrees rooted at j via an extend-add operation. The extend-add is an associative and commutative operator on two update matrices such that the index set of the result is the union of the index sets of the original update matrices. Each entry in the original update matrices is mapped onto some location in the accumulated matrix. If entries from both matrices overlap on a location, they are added; empty entries are assigned a value of zero. After F_j has been assembled, a single step of the standard dense Cholesky factorization is performed with node j as the pivot. At the end of the elimination step, the column with index j is removed from F_j and forms column j of L. The remaining t x t matrix is called the update matrix U_j and is passed on to the parent of j in the elimination tree. Since the matrices are symmetric, only the upper triangular part is stored. For further details on the multifrontal method, the reader should refer to Appendix A, and to the excellent tutorial by Liu [23].

If some consecutively numbered nodes form a chain in the elimination tree, and the corresponding rows of L have identical nonzero structure, then this chain is called a supernode. The supernodal elimination tree is similar to the elimination tree, but nodes forming a supernode are collapsed together. In the rest of this paper we use the supernodal multifrontal algorithm; any reference to the elimination tree or a node of the elimination tree actually refers to a supernode and the supernodal elimination tree.
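The parent relation that defines the elimination tree can be computed directly from the nonzero structure of L. A minimal sketch follows (using a dense NumPy array for clarity; production codes derive the tree during symbolic factorization, and the function name is ours):

    import numpy as np

    def elimination_tree(L):
        """parent[j] = row index of the first nonzero of column j of L
        below the diagonal, or -1 if column j has none (j is a root)."""
        n = L.shape[0]
        parent = np.full(n, -1)
        for j in range(n):
            below = np.flatnonzero(L[j + 1:, j])
            if below.size > 0:
                parent[j] = j + 1 + below[0]
        return parent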
3 Earlier Work on Parallel Multifrontal Cholesky Factorization

In this section we provide a brief description of the algorithm by Gupta and Kumar. For a more detailed description the reader should refer to [13].

Consider a p-processor hypercube-connected computer. Let A be the matrix to be factored, and let T be its supernodal elimination tree. The algorithm requires the elimination tree to be binary for the first log p levels. Any elimination tree of arbitrary shape can be converted to a binary tree using a simple tree restructuring algorithm described in [19].

In this scheme, portions of the elimination tree are assigned to processors using the standard subtree-to-subcube assignment strategy [11, 14] illustrated in Figure 1. With subtree-to-subcube assignment, all processors in the system cooperate to factor the frontal matrix associated with the root node of the elimination tree. The two subtrees of the root node are assigned to subcubes of p/2 processors each. Each subtree is further partitioned recursively using the same strategy. Thus, the subtrees at a depth of log p levels are each assigned to individual processors. Each processor can process this part of the tree completely independently without any communication overhead.

Figure 1: The elimination tree associated with a sparse matrix, and the subtree-to-subcube mapping of the tree onto eight processors.

Assume that the levels of the binary supernodal elimination tree are labeled starting from the top with 0. In general, at level l of the elimination tree, p/2^l processors work on a single frontal or update matrix. These processors form a logical two-dimensional grid, and all update and frontal matrices at this level are distributed on this grid of processors. To ensure load balance during factorization, the rows and columns of these matrices are distributed in a cyclic fashion.

Between two successive extend-add operations, the parallel multifrontal algorithm performs a dense Cholesky factorization of the frontal matrix corresponding to the root of the subtree. Since the tree is supernodal, this step usually requires the factorization of several nodes. The communication taking place in this phase is the standard communication of grid-based dense Cholesky factorization. Each processor participates in log p distributed extend-add operations, in which the update matrices from the factorization at level l are redistributed to perform the extend-add operation at level l-1 prior to factoring the frontal matrix. In the algorithm proposed in [13], each processor exchanges data with only one other processor during each one of these log p distributed extend-adds. This is achieved by a careful embedding of the processor grids on the hypercube, and by carefully mapping the rows and columns of each frontal matrix onto this grid. The mapping is described in [21], and is also given in Appendix B.

4 The New Algorithm

As mentioned in the introduction, the subtree-to-subcube mapping scheme used in [13] does not distribute the work equally among the processors. This load imbalance puts an upper bound on the achievable efficiency. For example, consider the supernodal elimination tree shown in Figure 2. This elimination tree is partitioned among eight processors using the subtree-to-subcube allocation scheme. All eight processors factor the top node, processors zero through three are responsible for the subtree rooted at 24-27, and processors four through seven are responsible for the subtree rooted at 52-55. The subtree-to-subcube allocation proceeds recursively in each subcube, resulting in the mapping shown in the figure. Note that the subtrees of the root node do not have the same amount of work. Thus, during the parallel multifrontal algorithm, processors zero through three will have to wait for processors four through seven to finish their work before they can perform an extend-add operation and proceed to factor the top node. This idling puts an upper bound on the efficiency of this algorithm.

We can compute this upper bound on the achievable efficiency due to load imbalance in the following way. The time required to factor a subtree of the elimination tree is equal to the time to factor the root plus the maximum of the times required to factor each of the two subtrees rooted at this root. By applying this rule recursively, we can compute the time required to perform the Cholesky factorization. Assuming that the communication overhead is zero and that each processor can perform an operation in a time unit, the time to factor each subtree of the elimination tree in Figure 2 is shown on the right of each node. For instance, node 9-11 requires 302 operations, and since this computation is distributed over processors zero and one, it takes 151 time units. Its subtree rooted at node 4-5 requires 254 time units, while its subtree rooted at node 8 requires 217 time units. Thus, this particular subtree is factored in 151 + max(254, 217) = 405 time units. The overall efficiency achievable by the above subtree-to-subcube mapping is significantly less than one, and the final efficiency is even lower due to communication overheads.
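This recursive rule is straightforward to state in code. The sketch below assumes hypothetical node objects with a work field (operations in the node's own supernode) and a children list; it illustrates the bound and is not the paper's implementation:

    def parallel_time(node, p):
        """Time units to factor the subtree rooted at node on p processors,
        assuming zero communication cost, one operation per time unit, and
        subtree-to-subcube mapping (the children split the p processors)."""
        t = node.work / p
        if not node.children:
            return t
        half = max(p // 2, 1)
        return t + max(parallel_time(c, half) for c in node.children)

    def total_work(node):
        return node.work + sum(total_work(c) for c in node.children)

    def efficiency_bound(root, p):
        """Serial time divided by p times the parallel time."""
        return total_work(root) / (p * parallel_time(root, p))

For the subtree rooted at node 9-11 in Figure 2, parallel_time gives 302/2 + max(254, 217) = 405 time units, matching the calculation above.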
Figure 2: The supernodal elimination tree of a factorization problem and its mapping to eight processors via subtree-to-subcube mapping. Each node (i.e., supernode) is labeled by the range of nodes belonging to it. The number on the left of each node is the number of operations required to factor the tree rooted at this node, the numbers above each node denote the set of processors that this subtree is assigned to using subtree-to-subcube allocation, and the number on the right of each node is the number of time-units required to factor the subtree in parallel.

This example illustrates another difficulty associated with direct factorization. Even though both subtrees rooted at node 56-62 have 28 nodes, they require different amounts of computation. Thus, balancing the computation cannot be done during the ordering phase simply by carefully selecting separators that split the graph into two roughly equal parts. The amount of load imbalance among different parts of the elimination tree can be significantly worse for general sparse matrices, for which it is not even possible to find good separators that split the graph into two roughly equal parts. Table 1 shows the load imbalance at the top level of the elimination tree for some matrices from the Boeing-Harwell matrix set. These matrices were ordered using spectral nested dissection [29, 30, 19]. Note that for all matrices the load imbalance in terms of operation count is substantially higher than the relative difference in the number of nodes in the left and right subtrees. Also, the upper bound on the efficiency shown in this table is due only to the top-level subtrees. Since subtree-to-subcube mapping is recursively applied in each subcube, the overall load imbalance will be higher, because it accumulates as we go down the tree.

                             Left Subtree              Right Subtree
Name       Separator Size    Nodes    Remaining Work   Nodes    Remaining Work   Efficiency Bound
BCSSTK29   180               6912     45%              6695     55%              0.90
BCSSTK30   222               14946    59%              13745    41%              0.85
BCSSTK31   492               16728    40%              18332    60%              0.83
BCSSTK32   513               21713    45%              22364    55%              0.90

Table 1: Ordering and load imbalance statistics for some matrices from the Boeing-Harwell set. The matrices have been reordered using spectral nested dissection. For each matrix, the size of the top separator is shown, and for each subtree the number of nodes and the percentage of the remaining work are shown. The last column shows the maximum achievable efficiency if all subsequent levels of the elimination tree were perfectly balanced, or if only two processors were used for the factorization.

For elimination trees of general sparse matrices, the load imbalance can usually be decreased by performing some simple elimination tree reorderings described in [19]. However, these techniques have two serious limitations. First, they increase the fill-in as they try to balance the elimination tree by adding extra dependencies; thus, the total time required to perform the factorization increases. Second, these techniques are local heuristics that try to minimize the load imbalance at a given level of the tree. However, very often such
local improvements do not result in improving the overall load imbalance. For example, for a wide variety of problems from the Boeing-Harwell matrix set and linear programming (LP) matrices from NETLIB [5], even after applying the tree balancing heuristics, the efficiency bound due to load imbalance is still around 80% to 60% [13, 20, 19]. If the increased fill-in is taken into account, then the maximum achievable efficiency is even lower than that.

In the rest of this section we present a modification to the algorithm presented in Section 3 that uses a different scheme for mapping the elimination tree onto the processors. This modified mapping significantly reduces the load imbalance.

4.1 Subforest-to-Subcube Mapping Scheme

In our new elimination tree mapping scheme, we assign many subtrees (a subforest) of the elimination tree to each processor subcube. These subtrees are chosen in such a way that the total amount of work assigned to each subcube is as equal as possible. The best way to describe this partitioning scheme is via an example. Consider the elimination tree shown in Figure 3, and assume that it takes a total of 100 time-units to factor the entire sparse matrix. Each node in the tree is marked with the number of time-units required to factor the subtree rooted at this particular node (including the time required to factor the node itself). For instance, the subtree rooted at node B requires 65 units of time, while the subtree rooted at node D requires only 18.

As shown in Figure 3(b), the subtree-to-subcube mapping scheme assigns the computation associated with the top supernode A to all the processors, the subtree rooted at B to half the processors, and the subtree rooted at C to the remaining half. Since these subtrees require different amounts of computation, this particular partition leads to load imbalance. Since 7 time-units of work (corresponding to node A) are distributed among all the processors, this part of the factorization takes at least 7/p units of time. Each subcube of p/2 processors then works independently on its subtree. The time required for these subcubes to finish is bounded from below by the time to perform the computation of the larger subtree (the one rooted at node B). Even if we assume that all subtrees of B are perfectly balanced, the computation of the subtree rooted at B on p/2 processors will take at least 65/(p/2) time-units. Thus an upper bound on the efficiency of this mapping is only 100/(7 + 2 * 65), which is approximately 0.73.

Now consider the following mapping scheme: the computation associated with supernodes A and B is assigned to all the processors, some of the subtrees rooted at the children of A and B are assigned to half of the processors, and the remaining subtrees are assigned to the other half. In this mapping scheme, 12 time-units of work are distributed among all the processors, the first half of the processors is assigned 43 time-units of work, and the other half is assigned 45 time-units. The upper bound on the efficiency due to load imbalance of this new assignment is 100/(12 + 2 * 45), which is approximately 0.98, a significant improvement over the earlier bound.

The above example illustrates the basic ideas behind the new mapping scheme. Since it assigns subforests of the elimination tree to processor subcubes, we refer to it as subforest-to-subcube mapping. The general mapping algorithm is outlined in Program 4.1. The tree partitioning algorithm uses a set Q that contains the unassigned nodes of the elimination tree. The algorithm inserts the root of the elimination tree in Q, and then it calls the routine that recursively partitions the elimination tree.
Figure 3: The top two levels of an elimination tree are shown in (a). The subtree-to-subcube mapping is shown in (b), and the subforest-to-subcube mapping is shown in (c).

This routine splits Q into two parts and checks whether the partitioning is acceptable. If it is, the routine assigns one part to half of the processors and the other part to the remaining half, and recursively calls itself to perform the partitioning in each of these halves. If the partitioning is not acceptable, then one node x = select(Q) of the elimination tree is assigned to all the processors, x is deleted from Q, and the children of x are inserted into Q. The algorithm then continues by repeating the whole process.

The above description provides a high-level overview of the subforest-to-subcube partitioning scheme. However, a number of details need to be clarified. In particular, we need to specify how the select, split, and acceptable procedures work.

Elpart(T, p)                    /* Partition the elimination tree T among p processors */
    Q = {}
    Add root(T) in Q
    ElpartCore(Q, p)
End Elpart

ElpartCore(Q, p)
    if (p == 1) return
    done = false
    while (done == false)
        if (acceptable(Q))
            (Q1, Q2) = split(Q)
            ElpartCore(Q1, p/2)
            ElpartCore(Q2, p/2)
            done = true
        else
            x = select(Q)       /* Assign x to all p processors */
            Delete x from Q
            Insert in Q the children of x
    endwhile
End ElpartCore

Program 4.1: The subforest-to-subcube partitioning algorithm.

Selection of a node from Q. There are two different ways of defining the select procedure. One way is to select the node whose subtree requires the largest number of operations to be factored. The second way is to select the node that itself requires the largest number of operations to be factored. (Note that the information required by both methods, namely the amount of computation needed to eliminate a node and the total amount of computation associated with a subtree, can be easily obtained during the symbolic factorization phase.) The first method favors nodes whose subtrees require a significant amount of computation; by selecting such a node and inserting its children into Q, we may get a good partitioning of Q into two halves. However, this approach can assign nodes with relatively small computation to all the processors, causing poor efficiency in the factorization of these nodes. The second method guarantees that the selected node has more work, so its factorization can achieve higher efficiency when it is factored by all processors. However, if the subtrees attached to this node are not large, then this may not lead to a good partitioning in later steps. In particular, if the root of the subtree having most of the remaining work requires little computation (e.g., a single-node supernode), then this root will not be selected for expansion until very late, leading to too many nodes being assigned to all the processors. Another possibility is to combine the above two schemes and apply each one in alternate steps. This combined approach eliminates most of the limitations of the above schemes while retaining their advantages, and it is the scheme we used in the experiments described in Section 6.

So far we have considered only the floating point operations when referring to the number of operations required to factor a subtree. On systems where the cost of each memory access relative to a floating point operation is relatively high, a more accurate cost model will also take the cost of each extend-add operation into account. The total number of memory accesses required for extend-add can be easily computed from the symbolic factorization of the matrix.

Splitting the set Q. In each step, the partitioning algorithm checks to see if it can split the set Q into two roughly equal halves. The ability of the split procedure to find such a partition of the nodes (and consequently to create two subforests) is crucial to the overall ability of this partitioning algorithm to balance the computation. Fortunately, this is a typical bin-packing problem, and even though bin-packing is NP-complete, a number of good approximate algorithms exist [28]. The use of bin-packing makes it possible to balance the computation and to significantly reduce the load imbalance.
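A minimal sketch of such a split follows (greedy two-bin packing in decreasing order of work; the function and field names are illustrative, and the paper does not prescribe a specific bin-packing heuristic):

    def split(Q, subtree_work):
        """Split the node set Q into two subforests of roughly equal work.
        Greedy heuristic: place each node, heaviest first, into the
        currently lighter bin; also return the resulting work imbalance."""
        bins, loads = ([], []), [0.0, 0.0]
        for node in sorted(Q, key=subtree_work, reverse=True):
            i = 0 if loads[0] <= loads[1] else 1
            bins[i].append(node)
            loads[i] += subtree_work(node)
        imbalance = abs(loads[0] - loads[1]) / max(loads[0] + loads[1], 1e-30)
        return bins[0], bins[1], imbalance

A partition produced this way can then be tested against the acceptability threshold described next.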
-processorsquaremeshorahypercubeusingastandardalgorithmthatusesto-dimensionalpartitioningandpipelining.Ifeachofthematricesisfactoredbyalltheprocessors,thenthetotalcommunicationtimeforfactoringthetomatricesis ].Ifarefactoredtlyb2processorseach,thenthecommunicationtimeis )whichissmaller.Thusthealueofhastobechosentostrikeagoodbalancebeteenthesetocon ictinggoalsofminimizingloadbalanceandthecommonoerheadinindividualfactorizationsteps.FortheexperimentsreportedinSection6,weusedOtherModicationsInthissectionwedescribethenecessarymodicationsofthealgorithmpresentedinSection3toaccommo-datethisnewmappingscInitiallyeachprocessorfactorsthesubtreesoftheeliminationtreeassignedtoitself.Thisrepresenlocalcomputationandrequiresnocommonjustliketheearlieralgorithm.er,sinceeachprocessorisassignedmorethanonesubtreeoftheeliminationtree,attheendofthislocalcomputation,itsstackwillcontainoneupdatematrixforeachtreeassignedtoit.Atthispoinitneedstoperformadistributedextend-addoperationwithitsneighborprocessorattherstlevelofthevirtualbinarytree.Duringthisstep,eachprocessorsplitstheupdatematrices,andsendsthepartthatisnotlocal,totheotherprocessor.Thisissimilartotheparallelextend-addoperationrequiredbythealgorithmdescribedinSection3,exceptthatmorethanoneupdatematrixissplitandsent.Note,thatthesepiecesfromdierentupdatematricescanallbepacedinasinglemessage,asthecommhappenswithonlyoneotherprocessor.Furthermore,asshowninAppendixD,theamountofdatabeingtransmittedineachparallelextend-addstepisnomorethanitisintheearlieralgorithm[13].Thereasonisthateventhoughmoreupdatematricesarebeingtransmitted,theseupdatematricescorrespondtonodesthataredeeperintheeliminationtreeandthesizeofthesematricesismhsmaller.w,afterpairsofprocessorshaeperformedtheextend-addoperation,theycooperatetofactorthenodesoftheeliminationtreeassignedtothem.Thenodesareeliminatedinapostorderorder.Next,groupsoffourprocessorsexchangetheupdatematricesthatarestoredintheirstacktoperformtheextend-addoperationforthenextlevel.Thisprocessconuesuntilallthenodeshaebeenfactored.ThenewparalleltalalgorithmisoutlinedinAppendixC.vingPehaeaddedanberofmodicationstothealgorithmdescribedinSection4thatgreatlyimproitsperformance.Inthefollowingsectionswebrie ydescribethesemodications.oramoredetaileddescriptionoftheseenhancementsthereadershouldreferto[21BlockCyclicMappingAsdiscussedinAppendixD,forthefactorizationofasupernode,weusethepipelinedvtofthegrid-baseddenseCholeskyalgorithm[22].Inthisalgorithm,successiverowsofthefrontalmatrixarefactoredoneaftertheother,andthecommnandcomputationproceedsinapipelinedfashion.enthoughthisschemeissimple,ithastomajorlimitatiSincetherowsandcolumnsofatalmatrixamongtheprocessorgridinacyclicfashion,informationforonlyonerowistransmittedatygiventime.Hence,onarchitecturesinwhichthemessagestartuptimeisrelativelyhighcomparedtothetransfertime,thecommonoerheadisdominatedbythestartuptime.Forexample,considera qp processorgrid,anda-nodesupernodethathasafrontalmatrixofsize.Whileperformingeliminationstepsonantalmatrix,onaerage,amessageofsize(2 ))needstobesendineachstepalongeachdirectionofthegrid.Ifthemessagestartuptimeis100timeshigherthantheperwordtransfertime,thenfor=256,aslongas23200thestartuptimewilldominatethedatatransfertime.Note,thattheaboetranslatesto1600.Formostsparsematrices,thesizeofthefronmatricestendstobemhlessthan1600. 
The second limitation of the cyclic mapping has to do with the implementation efficiency of the computation phase of the factorization. Since at each step only one row is eliminated, the factorization algorithm must perform a rank-one update. On systems with BLAS-level routines, this can be done using either level-one BLAS (DAXPY) or level-two BLAS (DGER, DGEMV). On most microprocessors, including high-performance RISC processors such as the DEC Alpha AXP, the peak performance achievable by these primitives is usually significantly less than that achieved by level-three BLAS primitives, such as matrix-matrix multiply (DGEMM). The reason is that for level-one and level-two BLAS routines, the amount of computation is of the same order as the amount of data moved between CPU and memory. In contrast, for level-three BLAS operations, the amount of computation is much higher than the amount of data required from memory. Hence, level-three BLAS operations can better exploit the multiple functional units and deep pipelines available in these processors.

However, by distributing the frontal matrices using a block cyclic mapping [22], we are able to eliminate both of the above limitations and greatly improve the performance of our algorithm. In the block cyclic mapping, the rows and columns of the matrix are divided into groups, each of size b, and these groups are assigned to the processors in a cyclic fashion. As a result, diagonal processors now store blocks of b pivots. Instead of performing a single elimination step, they now perform b elimination steps and send data corresponding to b rows in a single message. Note that the overall volume of data transferred remains the same. For sufficiently large values of b, the startup time becomes a small fraction of the data transmission time, which results in a significant improvement on architectures with high startup time. In each phase, each processor now receives b rows and b columns and has to perform a rank-b update on the not-yet-factored part of its frontal matrix. The rank-b update can be implemented using matrix-matrix multiplication, leading to a higher computational rate.

There are a number of design issues in selecting the proper value for b. Clearly, the block size should be large enough that the rank-b update achieves high performance and the startup time becomes a small fraction of the data transfer time. On the other hand, a very large value of b leads to a number of problems. First, the processors storing the current set of b rows to be eliminated have to construct the rank-b update by performing b rank-1 updates. If b is large, performing these rank-1 updates takes a considerable amount of time and stalls the pipeline. Also, a large value of b leads to load imbalance in the number of elements of the frontal matrix assigned to each processor, because there are fewer blocks to distribute in a cyclic fashion. Note that this load imbalance within the dense factorization phase is different from the load imbalance associated with distributing the elimination tree among the processors described in Section 4. A number of other design issues involved in using the block cyclic mapping, and ways to further improve the performance, are described in [21].
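The two ingredients of the block cyclic scheme are easy to sketch: which grid row (or column) owns a given block, and the rank-b update performed with a single matrix-matrix product (the level-three BLAS operation DGEMM underneath). The names below are ours:

    import numpy as np

    def block_cyclic_owner(i, b, q):
        """Grid row (or column) owning matrix row (or column) i under a
        block cyclic mapping with block size b on q grid rows (columns)."""
        return (i // b) % q

    def rank_b_update(F, W):
        """Update the not-yet-factored trailing part F of a frontal matrix
        after b elimination steps: F <- F - W W^T, where the columns of W
        are the b newly computed columns of L restricted to the trailing
        indices. One matrix-matrix product replaces b rank-1 updates."""
        F -= W @ W.T
        return F

With b = 1 this degenerates to the cyclic mapping and a rank-one (level-one/two BLAS) update; with larger b the update runs at level-three BLAS speed, at the price of the pipeline and balance effects discussed above.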
5.2 Pipelined Cholesky Factorization

In the parallel portion of our multifrontal algorithm, each frontal matrix is factored using a grid-based pipelined Cholesky factorization algorithm. This pipelined algorithm works as follows [22]. Assume that the processor grid stores the upper triangular part of the frontal matrix, and that the processor grid is square. The diagonal processor that stores the current pivot divides the elements of the pivot row it stores by the pivot, sends the pivot to its neighbor on the right, and sends the scaled pivot row to its neighbor below. Each processor, upon receiving the pivot, scales its part of the pivot row and sends the pivot to the right and its scaled pivot row down. When a diagonal processor receives a scaled pivot row from the processor above it, it forwards this row down along its column, and also to its right neighbor. Every other processor, upon receiving a scaled pivot row either from the left or from the top, stores it locally and then forwards it to the processor at the opposite end. For simplicity, assume that data is taken out of the pipeline by the processor that initiated the transmission. Each processor performs a rank-1 update of its local part of the frontal matrix as soon as it receives the necessary elements from the top and the left. The processor storing the next pivot starts eliminating the next row as soon as it has finished the computation for the previous iteration. The process continues until all the rows have been eliminated.

Even though this algorithm is correct, and its asymptotic performance is as described in Appendix D, it requires buffers for storing messages that have arrived but cannot yet be processed. This is because certain processors receive the two sets of data they need to perform a rank-1 update at different times. Consider, for example, a 4 x 4 processor grid, and assume that processor (0,0) has the first pivot. Even though processor (1,0) receives data from the top almost right away, the data from the left must come around the row, via processor (1,3). Now, if processor (0,0) also had the second pivot (due to a greater-than-one block size), then the message buffer on processor (1,0) might contain the message corresponding to the second pivot before the message corresponding to the first pivot, originating from processor (0,1), had arrived. The source of this problem is that the processors along the row act as the sources for both types of messages (those circulating along the rows and those circulating along the columns). When a similar algorithm is used for Gaussian elimination, the problem does not arise, because data start from a column and a row of processors, and messages from these rows and columns arrive at each processor at roughly the same time [22].

On machines with very high bandwidth, the overhead involved in managing buffers significantly reduces the percentage of the obtainable bandwidth. This effect is even more pronounced for small messages. For this reason, we decided to implement our algorithm with only a single message buffer per neighbor. As mentioned in Section 6, this communication protocol enables us to utilize most of the theoretical bandwidth of the Cray T3D. However, under the restriction of limited message buffer space, the above Cholesky algorithm spends a significant amount of time idling. This is due to the following requirement imposed by the single communication buffer. Consider a diagonal processor (k,k) and the processor (k+1,k) below it. Before processor (k,k) can start the (i+1)th iteration, it must wait until processor (k+1,k) has started performing the rank-1 update for the ith iteration (so that processor (k,k) can go ahead and reuse the buffers for iteration i+1). However, since processor (k+1,k) receives data from the left much later than it does from the top, processor (k,k) must wait until this latter data transmission has taken place. Essentially, during this time processor (k,k) sits idle. Because it sits idle, the (i+1)th iteration starts late, and data arrives at processor (k+1,k) even later. Thus, at each step, certain processors spend a certain amount of time being idle. This time is proportional to the time it takes for a message to travel along an entire row of processors, which increases substantially with the number of processors.

To solve this problem, we slightly modified the communication pattern of the pipelined Cholesky algorithm as follows. As soon as a processor that contains elements of the pivot row has finished scaling them, it sends them both down and also to the transposed processor across the main diagonal. This latter processor, upon receiving the scaled row segment, starts moving it along its row. The diagonal processors no longer forward the data they receive from the top to the right. The reason for this modification is to mimic the behavior of the algorithm that performs Gaussian elimination. On architectures with cut-through routing, the overhead of this communication step is comparable to that of a nearest-neighbor transmission for sufficiently large messages. Furthermore, because these messages are initiated at different times, they cause little or no contention. As a result, each diagonal processor now has to sit idle for only a very small amount of time [21].
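In terms of grid coordinates, the modified forwarding rule can be summarized by the following sketch (illustrative only; the names are ours, and the actual implementation is described in [21]):

    def forward_targets(k, x, q):
        """Processor (k, x) holds a segment of pivot row k (the pivot row
        lives on grid row k of a q x q grid). After scaling its segment,
        the processor sends it down its column and, unless it is the
        diagonal processor itself, to its transpose partner (x, k), which
        then circulates the segment along grid row x. Diagonal processors
        no longer forward data received from the top to the right."""
        targets = [((k + 1) % q, x)]      # down the column
        if x != k:
            targets.append((x, k))        # across the main diagonal
        return targets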
6 Experimental Results

We implemented our new parallel sparse multifrontal algorithm on a 256-processor Cray T3D parallel computer. Each processor on the T3D is a 150 MHz DEC Alpha chip with a peak performance of 150 MFlops for 64-bit operations (double precision). The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 150 MBytes per second and a very small latency. Even though the memory on the T3D is physically distributed, it can be addressed globally; that is, processors can directly access (read and/or write) other processors' memory. The T3D provides a library interface to this capability called SHMEM. We used SHMEM to develop a lightweight message passing system. Using this system, we were able to achieve unidirectional data transfer rates of up to 70 MBytes per second. This is significantly higher than the 35 MBytes per second channel bandwidth usually obtained when using the T3D's PVM.

For the computation performed during the dense Cholesky factorization, we used single-processor implementations of BLAS primitives. These routines are part of the standard scientific library on the T3D, and they have been fine-tuned for the Alpha chip.

The new algorithm was tested on matrices from a variety of sources. Four matrices (BCSSTK30, BCSSTK31, BCSSTK32, and BCSSTK33) come from the Boeing-Harwell matrix set. MAROS-R7 is from a linear programming problem taken from NETLIB. COPTER2 comes from a model of a helicopter rotor. CUBE35 is a 35 x 35 x 35 regular three-dimensional grid. In all of our experiments, we used spectral nested dissection [29, 30] to order the matrices.

The performance obtained by our multifrontal algorithm on some of these matrices is shown in Table 2. The operation count includes only the operations required to factor the nodes of the elimination tree (it does not include the operations involved in extend-add). Some of these problems could not be run on 32 processors due to memory constraints (in our T3D, each processor had only 2 Mwords of memory). Figure 4 graphically represents the data shown in Table 2. Figure 4(a) shows the overall performance obtained versus the number of processors, and is similar in nature to a speedup curve. Figure 4(b) shows the per-processor performance versus the number of processors, and reflects the reduction in efficiency as p increases. Since all these problems run out of memory on one processor, the standard speedup and efficiency could not be computed experimentally.

                                                           Number of Processors (GFlops)
Problem      n       |A|       |L|        Operation Count   32      64      128     256
MAROS-R7     3136    330472    1345241    720M              0.83    1.41    2.18    3.08
BCSSTK30     28924   1007284   5796797    2400M                     1.48    2.45    3.59
BCSSTK31     35588   572914    6415883    3100M                     1.47    2.42    3.87
BCSSTK32     44609   985046    8582414    4200M                     1.51    2.63    4.12
BCSSTK33     8738    291583    2295377    1000M             0.78    1.23    1.92    2.86
COPTER2      55476   352238    12681357   9200M                     1.92    3.17    5.51
CUBE35       42875   124950    11427033   10300M                    2.23    3.75    6.06

Table 2: The performance of sparse direct factorization on the Cray T3D. For each problem, the table shows the number of equations n of the matrix A, the original number of nonzeros in A, the number of nonzeros in the Cholesky factor L, the number of operations required to factor the nodes, and the performance in gigaflops for different numbers of processors.

Figure 4: Plot of the performance of the parallel sparse multifrontal algorithm for various problems on the Cray T3D. (a) Total gigaflops obtained; (b) megaflops per processor.

The highest performance of 6 GFlops was obtained for CUBE35, which is a regular three-dimensional problem. Nearly as high a performance (5.51 GFlops) was also obtained for COPTER2, which is irregular. Since both problems have similar operation counts, this shows that our algorithm performs equally well when factoring matrices arising in irregular problems. Turning our attention to the other problems shown in Table 2, we see that even on smaller problems our algorithm performs quite well. For BCSSTK33, it was able to achieve 2.86 GFlops on 256 processors, while for BCSSTK30, it achieved 3.59 GFlops.
To further illustrate how the various components of our algorithm work, we include a breakdown of the various phases for BCSSTK31 and CUBE35 in Table 3. This table shows the average time spent by all the processors in the local computation and in the distributed computation. Furthermore, we break down the time taken by the distributed computation into two major phases: (a) dense Cholesky factorization, and (b) extend-add overhead. The latter includes the cost of performing the extend-add operation, splitting the stacks, transferring the stacks, and idling due to load imbalances in the subforest-to-subcube partitioning. Note that the figures in this table are averages over all processors, and they should be used only as an approximate indication of the time required for each phase.

A number of interesting observations can be made from this table. First, as the number of processors increases, the time spent processing the local tree in each processor decreases substantially, because the subforest assigned to each processor becomes smaller. This trend is more pronounced for three-dimensional problems, because they tend to have fairly shallow trees. The cost of the distributed extend-add phase decreases almost linearly as the number of processors increases. This is consistent with the analysis presented in Appendix D, since the per-processor overhead of the distributed extend-add is O((n log p)/p). Since the figure for the time spent during the extend-add steps also includes the idling due to load imbalance, the almost linear decrease also shows that the load imbalance is quite small.

The time spent in the distributed dense Cholesky factorization decreases as the number of processors increases. This reduction is not linear in the number of processors, for two reasons: (a) the ratio of communication to computation during the dense Cholesky factorization steps increases, and (b) for a fixed-size problem, the load imbalance due to the block cyclic mapping becomes worse as p increases. For the reasons discussed in Section 5.1, we distributed the frontal matrices in a block cyclic fashion. To get good performance out of the level-three BLAS routines on the Cray T3D, we used a block size of sixteen (block sizes of less than sixteen result in degraded level-3 BLAS performance on the Cray T3D). However, such a large block size results in a significant load imbalance within the dense factorization phase, and this load imbalance becomes worse as the number of processors increases. On the other hand, as the size of the problem increases, both the communication overhead during dense Cholesky and the load imbalance due to the block cyclic mapping become less significant. The reason is that larger problems usually have larger frontal matrices at the top levels of the elimination tree, so even large processor grids can be utilized effectively to factor them. This is illustrated by comparing how the various overheads decrease for BCSSTK31 and CUBE35. For example, for BCSSTK31 the factorization on 128 processors is only 48% faster than on 64 processors, while for CUBE35 the factorization on 128 processors is 66% faster than on 64 processors.

             BCSSTK31                                       CUBE35
             Local         Distributed Computation          Local         Distributed Computation
p            Computation   Factorization   Extend-Add       Computation   Factorization   Extend-Add
64           0.17          1.34            0.58             0.15          3.74            0.71
128          0.06          0.90            0.32             0.06          2.25            0.43
256          0.02          0.61            0.18             0.01          1.44            0.24

Table 3: A breakdown of the various phases of the sparse multifrontal algorithm for BCSSTK31 and CUBE35. Each number represents time in seconds.

To see the effect of the choice of epsilon on the overall performance of the sparse factorization algorithm, we factored BCSSTK31 on 128 processors using epsilon = 0.4 and epsilon = 0.0001. We obtained a performance of 1.18 GFlops when epsilon = 0.4, and 1.37 GFlops when epsilon = 0.0001. In either case, the performance is worse than the 2.42 GFlops obtained for epsilon = 0.05. When epsilon = 0.4, the mapping of the elimination tree to the processors resembles that of the subtree-to-subcube allocation; the performance degradation is thus due to the elimination tree load imbalance. When epsilon = 0.0001, the elimination tree mapping assigns a large number of nodes to all the processors, leading to poor performance during the dense Cholesky factorization.
7 Concluding Remarks

Experimental results clearly show that our new scheme is capable of using a large number of processors effectively. On a single processor of a state-of-the-art vector supercomputer such as the Cray C90, sparse Cholesky factorization can be performed at a rate of roughly 500 MFlops for the larger problems studied in Section 6. Even a 32-processor Cray T3D clearly outperforms a single-node C90 for these problems.

Our algorithm as presented (and implemented) works for the Cholesky factorization of symmetric positive definite matrices. With small modifications, it is also applicable to the factorization of other sparse matrices, as long as no pivoting is required (e.g., sparse matrices arising out of structural engineering problems).

With highly parallel formulations available, the factorization step is no longer the most time-consuming step in the solution of sparse systems of equations. Another step that is quite time consuming, and that has not been parallelized effectively, is ordering. In our current research, we are investigating ordering algorithms that can be implemented efficiently on parallel computers [16, 34].

Acknowledgements

We would like to thank the Minnesota Supercomputing Center for providing access to a 64-processor Cray T3D, and Cray Research Inc. for providing access to a 256-processor Cray T3D. Finally, we wish to thank Dr. Alex Pothen for his guidance with spectral nested dissection ordering.

References

[1] Cleve Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. A comparison of three column-based distributed sparse factorization schemes. Technical Report YALEU/DCS/RR-810, Yale University, New Haven, CT, 1990. Also appears in Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.

[2] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software, 9:302-325, 1983.

[3] Kalluri Eswar, P. Sadayappan, and V. Visvanathan. Multifrontal sparse Cholesky factorization on distributed-memory multiprocessors. In International Conference on Parallel Processing, pages 18-22 (vol. 3).

[4] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, March 1990. Also appears in K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.

[5] D. M. Gay. Electronic mail distribution of linear programming test problems. Mathematical Programming Society COAL Newsletter, December 1985.

[6] G. A. Geist and E. G.-Y. Ng. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming, 18(4):291-314, 1989.

[7] G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory multiprocessor architectures. SIAM Journal on Scientific and Statistical Computing, 9(4):639-649, 1988. Also available as Technical Report ORNL/TM-10383, Oak Ridge National Laboratory, Oak Ridge, TN, 1987.

[8] A. George, M. T. Heath, J. W.-H. Liu, and E. G.-Y. Ng. Sparse Cholesky factorization on a local-memory multiprocessor. SIAM Journal on Scientific and Statistical Computing, 9:327-340, 1988.

[9] A. George and J. W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.

[10] A. George and J. W.-H. Liu. The evolution of the minimum degree ordering algorithm. SIAM Review, 31(1):1-19, March 1989.

[11] A. George, J. W.-H. Liu, and E. G.-Y. Ng. Communication results for parallel sparse Cholesky factorization on a hypercube. Parallel Computing, 10(3):287-298, May 1989.

[12] John R. Gilbert and Robert Schreiber. Highly parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 13:1151-1172, 1992.
[13] Anshul Gupta and Vipin Kumar. A scalable parallel algorithm for sparse matrix factorization. Technical Report 94-19, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. A shorter version appears in Supercomputing '94; the TR is available via anonymous FTP.

[14] M. T. Heath, E. Ng, and B. W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33(3):420-460, 1991.

[15] M. T. Heath, E. G.-Y. Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33:420-460, 1991. Also appears in K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.

[16] M. T. Heath and P. Raghavan. A Cartesian nested dissection algorithm. Technical Report UIUCDCS-R-92-1772, Department of Computer Science, University of Illinois, Urbana, IL 61801, October 1992. To appear in SIAM Journal on Matrix Analysis and Applications.

[17] M. T. Heath and P. Raghavan. Distributed solution of sparse linear systems. Technical Report 93-1793, Department of Computer Science, University of Illinois, Urbana, IL, 1993.

[18] Laurie Hulbert and Earl Zmijewski. Limiting communication in parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 12(5):1184-1197, September 1991.

[19] George Karypis, Anshul Gupta, and Vipin Kumar. Ordering and load balancing for parallel factorization of sparse matrices. Technical Report (in preparation), Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.

[20] George Karypis, Anshul Gupta, and Vipin Kumar. A parallel formulation of interior point algorithms. In Supercomputing '94, 1994. Also available as Technical Report TR 94-20, Department of Computer Science, University of Minnesota, Minneapolis, MN.

[21] George Karypis and Vipin Kumar. A high performance sparse Cholesky factorization algorithm for scalable parallel computers. Technical Report 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.

[22] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings Publishing Company, Redwood City, CA, 1994.

[23] Joseph W. H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review, 34(1):82-109, 1992.

[24] Robert F. Lucas. Solving planar systems of equations on distributed-memory multiprocessors. PhD thesis, Department of Electrical Engineering, Stanford University, Palo Alto, CA, 1987. Also see IEEE Transactions on Computer-Aided Design, 6:981-991, 1987.

[25] Robert F. Lucas, Tom Blank, and Jerome J. Tiemann. A parallel solution method for large sparse systems of equations. IEEE Transactions on Computer-Aided Design, CAD-6(6):981-991, November 1987.

[26] Mo Mu and John R. Rice. A grid-based subtree-subcube assignment strategy for solving partial differential equations on hypercubes. SIAM Journal on Scientific and Statistical Computing, 13(3):826-839, May 1992.

[27] Dianne P. O'Leary and G. W. Stewart. Assignment and scheduling in parallel matrix factorization. Linear Algebra and its Applications, 77:275-299, 1986.

[28] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.

[29] A. Pothen and C.-J. Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 1990.

[30] Alex Pothen, Horst D. Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430-452, 1990.

[31] Alex Pothen and Chunguang Sun. Distributed multifrontal factorization using clique trees. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 34-40, 1991.

[32] Roland Pozo and Sharon L. Smith. Performance evaluation of the parallel multifrontal method in a distributed-memory environment. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 453-456, 1993.

[33] P. Raghavan. Distributed sparse Gaussian elimination and orthogonal factorization. Technical Report 93-1818, Department of Computer Science, University of Illinois, Urbana, IL, 1993.

[34] P. Raghavan. Line and plane separators. Technical Report UIUCDCS-R-93-1794, Department of Computer Science, University of Illinois, Urbana, IL 61801, February 1993.
[35] Edward Rothberg. Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multicomputers. In Proceedings of the 1994 Scalable High Performance Computing Conference, May 1994.

[36] Edward Rothberg and Anoop Gupta. An efficient block-oriented approach to parallel sparse Cholesky factorization. In Supercomputing '93 Proceedings, 1993.

[37] P. Sadayappan and Sailesh K. Rao. Communication reduction for distributed sparse matrix factorization on a processor mesh. In Supercomputing '89 Proceedings, pages 371-379, 1989.

[38] Robert Schreiber. Scalability of sparse direct solvers. Technical Report RIACS TR 92.13, NASA Ames Research Center, Moffett Field, CA, May 1992. Also appears in A. George, John R. Gilbert, and J. W.-H. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms (An IMA Workshop Volume), Springer-Verlag, New York, NY, 1993.

[39] Chunguang Sun. Efficient parallel solutions of large sparse SPD systems on distributed-memory multiprocessors. Technical report, Department of Computer Science, Cornell University, Ithaca, NY, 1993.

[40] Sesh Venugopal and Vijay K. Naik. Effects of partitioning and scheduling sparse matrix factorization on communication and load balance. In Supercomputing '91 Proceedings, pages 866-875, 1991.

[41] M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 2:77-79, 1981.

Appendix A: The Multifrontal Method

Let A be an n x n symmetric positive definite matrix and let L be its Cholesky factor. Let T be its elimination tree, and let T[i] denote the set of descendants of node i in the elimination tree (including i itself). Consider the jth column of L, and let i_1, ..., i_t be the row subscripts of the off-diagonal nonzeros in column j of L. The subtree update matrix at column j is defined as

\[
\bar{U}_j \;=\; -\sum_{k \,\in\, T[j]\setminus\{j\}}
\begin{pmatrix} l_{j,k}\\ l_{i_1,k}\\ \vdots\\ l_{i_t,k} \end{pmatrix}
\bigl(\, l_{j,k} \;\; l_{i_1,k} \;\; \cdots \;\; l_{i_t,k} \,\bigr).
\tag{1}
\]

Note that \bar{U}_j contains outer-product contributions from those previously eliminated columns that are descendants of j in the elimination tree. The frontal matrix F_j is defined to be

\[
F_j \;=\;
\begin{pmatrix}
a_{j,j} & a_{i_1,j} & \cdots & a_{i_t,j}\\
a_{i_1,j} & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
a_{i_t,j} & 0 & \cdots & 0
\end{pmatrix}
\;+\; \bar{U}_j .
\]

Thus, the first row/column of F_j is formed from A and from the subtree update matrix at column j. Having formed the frontal matrix, the algorithm performs one step of elimination on F_j, which yields the nonzero entries of column j of the factor L. In particular, this elimination can be written in matrix notation as

\[
F_j \;=\;
\begin{pmatrix} l_{j,j} & 0\\ \mathbf{l}_j & I \end{pmatrix}
\begin{pmatrix} 1 & 0\\ 0 & U_j \end{pmatrix}
\begin{pmatrix} l_{j,j} & \mathbf{l}_j^{T}\\ 0 & I \end{pmatrix},
\qquad
\mathbf{l}_j = (l_{i_1,j}, \ldots, l_{i_t,j})^{T},
\]

where l_{j,j} and \mathbf{l}_j are the nonzero elements of column j of the Cholesky factor. The t x t matrix U_j is called the update matrix for column j and is formed as part of the elimination step.

In practice, \bar{U}_j is never computed using Equation 1, but is constructed from the update matrices of the children of j as follows. Let c_1, ..., c_s be the children of j in the elimination tree; then

\[
F_j \;=\;
\begin{pmatrix}
a_{j,j} & a_{i_1,j} & \cdots & a_{i_t,j}\\
a_{i_1,j} & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
a_{i_t,j} & 0 & \cdots & 0
\end{pmatrix}
\oplus\, U_{c_1} \oplus \cdots \oplus U_{c_s},
\]

where \oplus is called the extend-add operator and is a generalized matrix addition. The index set of the result of an extend-add is the union of the index sets of the operands; each entry of the operands is mapped onto the location given by its global indices, entries from both matrices that overlap on a location are added, and empty locations are assigned a value of zero. For example, if one update matrix has index set {2, 5} (its first row/column corresponds to the second row/column of the overall matrix, and its second row/column corresponds to the fifth), then its extend-add with an update matrix over a different index set produces a matrix indexed by the union of the two index sets.

Note that the matrix U_{c_1} \oplus \cdots \oplus U_{c_s} may have fewer rows/columns than \bar{U}_j, but if it is properly extended by the index set of column j, it becomes the subtree update matrix \bar{U}_j. The process of forming F_j from the nonzero elements of column j of A and the update matrices of the children of j is called the frontal matrix assembly operation. Thus, in the multifrontal method, the elimination of each column involves the assembly of a frontal matrix and one step of elimination. The supernodal multifrontal algorithm proceeds similarly, but when a supernode of the elimination tree is eliminated, this corresponds to multiple elimination steps on the same frontal matrix; the number of elimination steps is equal to the number of nodes in the supernode.
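The extend-add operator is easy to state in code. A minimal sketch follows (update matrices represented as a sorted index list plus a dense NumPy array; the names are ours):

    import numpy as np

    def extend_add(idx_a, Ua, idx_b, Ub):
        """Extend-add of two update matrices: the index set of the result
        is the union of the input index sets; entries that overlap add,
        and positions receiving no entry stay zero."""
        idx = sorted(set(idx_a) | set(idx_b))
        pos = {g: k for k, g in enumerate(idx)}
        R = np.zeros((len(idx), len(idx)))
        for ids, U in ((idx_a, Ua), (idx_b, Ub)):
            loc = [pos[g] for g in ids]
            R[np.ix_(loc, loc)] += U
        return idx, R

For example, extend-adding a 2 x 2 update matrix with index set {2, 5} and a 2 x 2 update matrix with index set {2, 4} yields a 3 x 3 matrix with index set {2, 4, 5}, in which the two entries with global position (2, 2) have been added together.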
Inpractice,isnevercomputedusingEquation1,butisconstructedfromtheupdatematricesasws.Let;:::;cbethechildrenofintheeliminationtree,thenl$iscalledtheextend-addop,andisageneralizedmatrixaddition.Theextend-addoperationisillustratedinthefollowingexample.Considerthefollowingtoupdatematricesofistheindexsetof,therstrow/columnofcorrespondstothesecondrow/columnof,andthesecondrow/columnofcorrespondstothefthrow/columnof),andistheindexsetof.ThenNotethatthesubmatrix$yhaefewerrows/columnsthan,butifitisproperlyextendedbytheindexsetof,itbecomesthesubtreeupdatematrixTheprocessofformingfromthenonzerostructureelementsofcolumnandtheupdatesmatricesiscalledontalmatrixassemblyop.Thus,inthemtalmethod,theeliminationofeachcolumnestheassemblyofafrontalmatrixandonestepofelimination.Thesupernodalmtalalgorithmproceedssimilarlytothemtalalgorithm,butwhenasupernodeoftheeliminationtreegetseliminated,thiscorrespondstomultipleeliminationstepsonthesametalmatrix.Thenberofeliminationstepsisequaltothenberofnodesinthesupernode.AppendixBDataDistributionFigure5(a)showsa44gridofprocessorsembeddedinthehypercube.Thisembeddingusestheshmapping.Inparticulargridpositionwithcoordinate(y;x)ismappedontoprocessorwhoseaddressismadeyinvingthebitsinthebinaryrepresentationoforinstance,thegridpositionwithbinarycoordinates(ab;cd)ismappedontoprocessorwhoseaddressinbinaryis,forexamplegrid(1011)ismappedonto1110.Notethatwhenwesplitthis44processorgridintot2subgrids,thesesubgridscorrespondtodistinctsubcubesoftheoriginalhypercube.Thispropertyismaintainedinsubsequentsplits,forinstanceeac2gridofa42gridisagainasubcube.Thisisimportant,becausethealgorithmusessubtree-to-subcubepartitioningscheme,andbyusingshuingmapping,eachsubcubeissimplyhalfoftheprocessorgrid.Considernext,theeliminationtreeshowninFigure5(b).Inthisgureonlythetoptolevelsareshoandateachnode,thenonzeroelementsofforthisparticularnodeisalsoshown.Usingthesubtree-to-subcubepartitioningscheme,theeliminationtreeispartitionedamongthe16-processorsasfollows.Nodeisassignedtoallthe16processors,nodeisassignedtohalftheprocessors,whilenodeisassignedtotheotherhalfoftheprocessors.InFigure5(c),weshowhowthefrontalmatricescorrespondingtonodes,andwillbedistributedamongtheprocessors.Considernode.Thefrontalmatrixisa77triangularmatrix,thatcorresponds.Thedistributionoftherowsandcolumnsofthisfrontalmatrixonthe42processorgridisshoinpart(c).Roismappedonrooftheprocessorgrid,whilecolumnismappedonoftheprocessorgrid.Inthispaperisusedtodenotebitwiseexclusiv {4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16, 19} {2, 5, 7, 8, 12, 14, 15}{3, 4, 7, 10, 13, 14, 19} 10003, 7, 13, 191100 10011011 110110, 143, 7, 19Factoring node C 400004, 8, 1610, 145, 9, 137, 15, 1900011011 1001 
Figure 5: Mapping frontal matrices to the processor grid.

An alternate way of describing this mapping is to consider the coordinates of the processors in each processor grid. Each processor in the 4 x 2 grid has (y, x) coordinates such that 0 <= y < 4 and 0 <= x < 2. Note that these coordinates are local with respect to the current grid, and are not the same as the grid coordinates of the processor in the entire 4 x 4 grid. For the rest of the paper, when we discuss grid coordinates we will always assume that they are local, unless specified otherwise. Row i is mapped onto the processors (y, x) such that the two least significant bits of i are equal to y. Similarly, column j is mapped onto the processors (y, x) such that the least significant bit of j is equal to x. In general, in a 2^r x 2^c processor grid, element (i, j) is mapped onto the processor with grid coordinates (i mod 2^r, j mod 2^c). Or, from the processor's point of view, processor (y, x) gets all elements (i, j) such that the r least significant bits of i are equal to y and the c least significant bits of j are equal to x.

The frontal matrix of the sibling node is mapped onto the other 4 x 2 grid in a similar fashion. Note that the rows of both frontal matrices are mapped onto processors along the same rows of the 4 x 4 grid, because both subgrids have the same number of rows. Also, because the final grid also has four rows, the rows of the parent's frontal matrix that are common with rows of the children's frontal matrices are mapped onto the same row of processors. For example, row 7 of all three frontal matrices is mapped onto the fourth row of the three grids in question. This is important because when the processors need to perform the extend-add operation, no data movement between rows of the grid is required. However, data needs to be communicated along the columns of the grid, because the grids that store the update matrices have fewer columns than the grid of processors that is going to factor the parent's frontal matrix. Data needs to be communicated so that the columns of the update matrices match those of the frontal matrix. Each processor can easily determine which columns of the update matrix it already stores need to be kept and which need to be sent away: it keeps the columns whose two least significant bits match its x coordinate in the 4 x 4 grid, and it sends the others away. Consider a rejected element. This element needs to be sent to the processor along the same row of the processor grid whose x coordinate matches the two least significant bits of the element's column index. However, since the element resides in this processor during the factorization involving the 4 x 2 grid, the least significant bit of its column index must match the x coordinate of this processor. Therefore, the rejected columns need to be sent to the processor along the same row whose grid coordinate differs in the most significant bit (i.e., to processor (y, x xor 2)). Thus, during this extend-add operation, processors that are neighbors in the hypercube exchange data.

Appendix C: The New Parallel Multifrontal Algorithm

1.  porder[1..k]            /* the nodes of the elimination tree assigned to this processor,
                               numbered in a level-postorder */
2.  lvl = log p
3.  for (j = 1; j <= k; j++)
4.      i = porder[j]
5.      if (level[i] != lvl)
6.          SplitStacks(lvl, level[i])
7.          lvl = level[i]
8.      Let F_i be the frontal matrix for node i
9.      Let c_1, ..., c_s be the children of i in the elimination tree
10.     F_i = (entries of A for node i) ⊕ U_{c_1} ⊕ ... ⊕ U_{c_s}    /* extend-add assembly */
11.     Factor(F_i)          /* dense factorization using p/2^lvl processors */

The parallel multifrontal algorithm using the subforest-to-subcube partitioning scheme.
The function SplitStacks(lvl, level[i]) performs lvl - level[i] parallel extend-add operations. In each of these extend-adds, each processor splits its local stack and sends data to the corresponding processor of the other group of processors. Note that when lvl = log p, the factorization of F_i is performed locally.

Appendix D: Performance Analysis

In this section we analyze the performance of the parallel multifrontal algorithm described in Section 4. In our new parallel multifrontal algorithm, and in that of Gupta and Kumar [13], there are two types of overhead due to parallelization: (a) load imbalance due to the work partitioning, and (b) communication overhead due to the parallel extend-add operations and the factorization of the frontal matrices.

As our experiments in Section 6 show, the subforest-to-subcube mapping used in our new scheme essentially eliminates the load imbalance. In contrast, for the subtree-to-subcube scheme used in [13], the overhead due to load imbalance is quite high.

The question, then, is whether the subforest-to-subcube mapping used in our new scheme results in higher communication overhead during the extend-add and dense factorization phases. This analysis is difficult due to the heuristic nature of our new mapping scheme. To keep the analysis simple, we present it for regular two-dimensional grids, in which the number of subtrees mapped onto each subcube is four. The analysis also holds for any small constant number of subtrees, and our experiments have shown that the number of subtrees mapped onto each subcube is indeed small.

Communication Overhead

Consider a sqrt(n) x sqrt(n) regular finite difference grid and a p-processor hypercube-connected parallel computer. To simplify the analysis, we assume that the grid has been ordered using a nested dissection algorithm that selects cross-shaped separators [11]. For an n-node square grid, this scheme selects a separator of size approximately 2 sqrt(n) that partitions the grid into four subgrids, each of size approximately sqrt(n)/2 x sqrt(n)/2. Each of these subgrids is recursively dissected in a similar fashion. The resulting supernodal elimination tree is a quadtree. If the root of the elimination tree is at level zero, then a node at level l corresponds to a subgrid of size sqrt(n)/2^l x sqrt(n)/2^l, and the separator of such a subgrid is of size approximately 2 sqrt(n)/2^l. Also, it is proven in [11, 13] that the size of the update matrix of a node at level l is bounded by kn/4^l for a small constant k.

We assume that the subforest-to-subcube partitioning scheme partitions the elimination tree in the following way. It assigns all the nodes in the first two levels (the zeroth and first level) to all the processors. It then splits the nodes at the second level into four equal parts and assigns each part to a quarter of the processors. The children of the nodes of each of these four groups are split into four equal parts, and each part is assigned to a quarter of the processors of each quarter. This process continues until the nodes at level log_4 p have been assigned to individual processors. Figure 6 shows this type of subforest-to-subcube partitioning. The scheme assigns to each subcube of processors four nodes of the same level of the elimination tree. In particular, for i >= 0, a subcube of p/4^i processors (a sqrt(p)/2^i x sqrt(p)/2^i grid) is assigned four nodes of level i+1 of the tree. Thus, each subcube is assigned a forest consisting of four different trees.
Figure 6: A quadtree and the subforest-to-subcube allocation scheme. (a) The top four levels of a quadtree. (b) The subforest assigned to a quarter of the processors. (c) The subforest assigned to a quarter of the processors of the quarter of part (b).

Consider a subcube of p/4^i processors. After it has finished factoring the nodes at level i+1 of the elimination tree assigned to it, it needs to perform an extend-add with its corresponding subcube, so that the newly formed subcube of p/4^{i-1} processors can go ahead and factor the nodes of the ith level of the elimination tree. During this extend-add operation, each subcube needs to split the roots of each of the subtrees assigned to it. Since each subcube is assigned four subtrees, each subcube needs to split and send elements from four update matrices. Since each node at level i+1 has an update matrix of size at most kn/4^{i+1} distributed over p/4^i processors, each processor needs to exchange kn/(4p) elements with the corresponding processor of the other subcube. Since each processor has data from four update matrices, the total amount of data being exchanged is kn/p. There are a total of log p extend-add phases; thus, the total amount of data that needs to be exchanged by each processor over the whole algorithm is O((n log p)/p). Note that this communication overhead is identical to that required by the subtree-to-subcube partitioning scheme [13].

For the dense Cholesky factorization we use the pipelined implementation of the algorithm on a two-dimensional processor grid using checkerboard partitioning. It can be shown [27, 22] that the communication overhead to perform m factorization steps of a t x t matrix in a pipelined implementation on a sqrt(q) x sqrt(q) mesh of processors is O(mt/sqrt(q)). Each node of level i+1 of the elimination tree is assigned to a grid of sqrt(p)/2^i x sqrt(p)/2^i processors, the frontal matrix of each such node is bounded by sqrt(kn)/2^{i+1} x sqrt(kn)/2^{i+1}, and we perform approximately 2 sqrt(n)/2^{i+1} factorization steps; hence the communication overhead per node is

\[
O\!\left( \frac{2\sqrt{n}}{2^{\,i+1}} \cdot \frac{\sqrt{kn}}{2^{\,i+1}} \Big/ \frac{\sqrt{p}}{2^{\,i}} \right)
\;=\; O\!\left( \frac{\sqrt{k}\,n}{2^{\,i+1}\sqrt{p}} \right).
\]

Since each such grid of processors has four such nodes, the communication overhead of each level is 2 sqrt(k) n/(2^i sqrt(p)). Thus, the communication overhead over all the levels is

\[
\frac{2\sqrt{k}\,n}{\sqrt{p}} \sum_{i \ge 0} \frac{1}{2^{\,i}} \;=\; O\!\left( \frac{n}{\sqrt{p}} \right)
\]

per processor. Therefore, the communication overhead summed over all the processors, due to the parallel extend-add and the dense Cholesky factorization, is O(n log p + n sqrt(p)) = O(n sqrt(p)). Since the overall communication overhead of our new subforest-to-subcube mapping scheme is of the same order as that of the subtree-to-subcube mapping, the isoefficiency function of both schemes is the same.

The analysis just presented can be extended similarly to three-dimensional grid problems and to architectures other than the hypercube. In particular, the analysis applies directly to architectures such as the CM-5 and the Cray T3D.
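Combining the two overhead terms yields the isoefficiency figure quoted in the introduction; the following short derivation restates this under the assumptions above, with W = Theta(n^{3/2}) the total factorization work for two-dimensional nested dissection:

\[
T_o(n,p) \;=\; O(n\log p) \;+\; O(n\sqrt{p}) \;=\; O(n\sqrt{p}),
\]
\[
W \,\propto\, T_o \;\Rightarrow\; n^{3/2} \,\propto\, n\sqrt{p} \;\Rightarrow\; n \,\propto\, p \;\Rightarrow\; W = \Theta\!\left(p^{3/2}\right),
\]

which is the O(p^{1.5}) isoefficiency claimed for both mapping schemes.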