van de Geijn Department of Computer Science Institute for Computational Engineering and Sciences The University of Texas at Austin Austin TX 78712 rvdgcsutexasedu March 11 2011 1 De64257nition and Existence The Cholesky factorization is only de6 ID: 22866
Download Pdf The PPT/PDF document "Notes on Cholesky Factorization Robert A" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
forj=1:nj;j:=p j;jfori=j+1:ni;j:=i;j=j;jendforfork=j+1:nfori=k:ni;k:=i;ki;jk;jendforendforendfor forj=1:nj;j:=p j;j9=;j+1:n;j:=j+1:n;j=j;j9=;j+1:n;j+1:n:=j+1:n;j+1:ntril(j+1:n;jTj+1:n;j)endforFigure1:FormulationsoftheCholeskyfactorizationthatexposeindicesusingMatlab-likenotation. partthatisthenoverwrittenwiththeresult.Inthisdiscussion,wewillassumethatthelowertriangularpartofAisstoredandoverwritten.2ApplicationTheCholeskyfactorizationisusedtosolvethelinearsystemAx=ywhenAisSPD:SubstitutingthefactorsintotheequationyieldsLLTx=y.Lettingz=LTx,Ax=L(LTx)| {z }z=Lz=y:Thus,zcanbecomputedbysolvingthetriangularsystemofequationsLz=yandsub-sequentlythedesiredsolutionxcanbecomputedbysolvingthetriangularlinearsystemLTx=z.3AnAlgorithm(Variant3)ThemostcommonalgorithmforcomputingA:=Chol(A)canbederivedasfollows:Con-siderA=LLT.PartitionA= 11 ? a21 A22!andL= 11 0 l21 L22!: (1) 2 Lemma8.LetA2RmmbeSPDandl21=a21=p 11.ThenA22l21lT21isSPD.Proof:SinceAissymmetricsoareA22andA22l21lT21.Letx16=0beanarbitraryvectoroflengthn1.Denex= 0 x1!where0=aT21x1=11.Then,sincex6=0,0xTAx= 0 x1!T 11 aT21 a21 A22! 0 x1!= 0 x1!T 110+aT21x1 a210+A22x1!=1120+0aT21x1+xT1a210+xT1A22x1=11aT21x1 11xT1a21 11xT1a21 11aT21x1xT1a21aT21x1 11+xT1A22x1=xT1(A22a21aT21 11)x1=xT1(A22l21lT21)x1:WeconcludethatA22l21lT21isSPD.Proof:ofCholeskyFactorizationTheoremProofbyinduction. Basecase:n=1.Clearlytheresultistruefora11matrixA=11:Inthiscase,thefactthatAisSPDmeansthat11isrealandpositiveandaCholeskyfactoristhengivenby11=p 11,withuniquenessifweinsistthat11ispositive. Inductivestep:AssumetheresultistrueforSPDmatrixA2R(n1)(n1).WewillshowthatitholdsforA2Rnn.LetA2RnnbeSPD.PartitionAandLasin(4)andlet11=p 11(whichiswell-denedbyLemma4),l21=a21=11,andL22=Chol(A22l21lT21)(whichexistsasaconsequenceoftheInductiveHypothesisandLemma4).ThenListhedesiredCholeskyfactorofA. Bytheprincipleofmathematicalinduction,thetheoremholds.5BlockedAlgorithm(Variant3)Inordertoattainhighperformance,thecomputationiscastintermsofmatrix-matrixmultiplicationbyso-calledblockedalgorithms.FortheCholeskyfactorizationablockedversionofthealgorithmcanbederivedbypartitioningA! A11 ? A21 A22!andL! L11 0 L21 L22!; 4 donedonedonepartiallyupdatedBeginningofiteration ATLABL?ABR## Repartition A00??aT1011?A20a21A22## UPD.UPD.UPD.Update p 11a21 11A22a21aT21## @@RdonedonedonepartiallyupdatedEndofiteration ATLABL?ABRFigure3:Left:ProgressionofpicturesthatexplainCholeskyfactorizationalgorithm.Right:Samepictures,annotatedwithlabelsandupdates. Beginningofiteration: Atsomestageofthealgorithm(Topoftheloop),thecomputationhasmovedthroughthematrixtothepointindicatedbythethicklines.Noticethatwehavenishedwiththepartsofthematrixthatareinthetop-left,top-right(whichisnottobetouched),andbottom-leftquadrants.Thebottom-rightquadranthasbeenupdatedtothepointwhereweonlyneedtoperformaCholeskyfactorizationofit. Repartition: Wenowrepartitionthebottom-rightsubmatrixtoexpose11,a21,andA22. 6 Remark11.ClearlyFig.4doesnotpresentthealgorithmasconciselyasthealgorithmsgiveninFigs.1and2.However,itdoescapturetoalargedegreetheverbaldescriptionofthealgorithmmentionedaboveandtherefore,inouropinion,reducesboththeeortrequiredtointerpretthealgorithmandtheneedforadditionalexplanations.ThenotationalsomirrorsthatusedfortheproofinSection4.Remark12.ThenotationinFigs.3and4allowsthecontentsofmatrixAatthebeginningoftheiterationtobeformallystated:A= ATL ? ABL ABR!= LTL ? LBL ^ABRtril(LBLLTBL)!;whereLTL=Chol(^ATL),LBL=^ABLLTTL,and^ATL;^ABLand^ABRdenotetheoriginalcontentsofthequadrantsATL;ABLandABR,respectively.Exercise13.ImplementtheCholeskyfactorizationwithM-script.7CostThecostoftheCholeskyfactorizationofA2Rmmcanbeanalyzedasfollows:InFig.4(left)duringthekthiteration(startingkatzero)A00iskk.Thus,theoperationsinthatiterationcost 11:=p 11:negligiblewhenkislarge. a21:=a21=11:approximately(mk1) ops. A22:=A22tril(a21aT21):approximately(mk1)2 ops.(Arank-1updateofallofA22wouldhavecost2(mk1)2 ops.ApproximatelyhalftheentriesofA22areupdated.)Thus,thetotalcostin opsisgivenbyCChol(m)m1Xk=0(mk1)2| {z }(DuetoupdateofA22)+m1Xk=0(mk1)| {z }(Duetoupdateofa21)=m1Xj=0j2+m1Xj=0j1 3m3+1 2m21 3m3whichallowsustostatethat(obvious)mostcomputationisintheupdateofA22. 8 Algorithm:A:=Chol unb(A) PartitionA! ATL ? ABL ABR!whereATLis00 whilem(ATL)m(A)do Repartition ATL ? ABL ABR!!0B@A00 ? ? aT10 11 ? A20 a21 A221CAwhere11is11 Variant1 (BorderedAlgorithm)aT10:=aT10tril(A00)T11:=11aT10a1011:=p 11 Variant2 (Left-lookingAlgorithm)11:=11aT10a1011:=p 11a21:=a21A20a10a21:=a21=11 Variant3 (Right-lookingAlgorithm)11:=p 11a21:=a21=11A22:=A22tril(a21aT21) Continuewith ATL ? ABL ABR! 0B@A00 ? ? aT10 11 ? A20 a21 A221CA endwhile Algorithm:A:=Chol blk(A) PartitionA! ATL ? ABL ABR!whereATLis00 whilem(ATL)m(A)do DetermineblocksizebRepartition ATL ? ABL ABR!!0B@A00 ? ? A10 A11 ? A20 A21 A221CAwhereA11isbb Variant1 (BorderedAlgorithm)A10:=A10tril(A00)TA11:=A11A10AT10A11:=Chol(A11) Variant2 (Left-lookingAlgorithm)A11:=A11A10AT10A11:=Chol(A11)A21:=A21A20AT10A21:=A21tril(A11)T Variant3 (Right-lookingAlgorithm)A11:=Chol(A11)A21:=A21tril(A11)TA22:=A22tril(A21AT21) Continuewith ATL ? ABL ABR! 0B@A00 ? ? A10 A11 ? A20 A21 A221CA endwhile Figure5:MultiplevariantsforcomputingtheCholeskyfactorization. 10 fromwhichweconcludethatA00=L00LT00 ? ? aT10=lT10LT00 11=lT10l10+211 ? A20=L20LT00 a21=L20l10+11l21 A22=L20LT20+l21lT21+L22LT22:Now,letusassumethatinpreviousiterationsoftheloopthecomputationhasproceededsothatAcontains0B@L00 0 0 lT10 11 0 L20 a21 A221CA:ThepurposeofthecurrentiterationistomakeitsothatAcontains0B@L00 0 0 lT10 11 0 L20 l21 A221CA:Thefollowingalgorithmaccomplishesthis: 1. PartitionA!0B@A00 ? ? aT10 11 ? A20 a21 A221CAandassumethatduetocomputationfrompreviousiterationsitcontains0B@L00 0 0 lT10 11 0 L20 a21 A221CA: 2. Overwrite11:=11=p 11lT10l10. 3. Overwritea21:=a21L20l10. 4. Overwritea21:=l21=a21=11.ThisjustiestheunblockedVariant2inFigure5.Ablockedalgorithmcanbesimilarlyderived.PartitionA=0B@A00 ? ? A10 A11 ? A20 A21 A221CAandL=0B@L00 0 0 L10 L11 0 L20 L21 L221CA (4) 12 3. OverwriteA21:=A21=A21L20LT10. 4. OverwriteA21:=L21=A21LT11.ThisjustiestheblockedVariant2inFigure5.9CodingthealgorithmsAspartoftheFLAMEprojectatTheUniversityofTexasatAustin,UniversidadJaumeI,Spain,andRWTHAachenUniversity,Germany,wehavedevelopedAPIsforcodingthesealgorithmsthatmakeitsothatthecodecloselyresemblesthealgorithms.Thedemonstratehowthesecodesarewritten,wecreatedavideotitledHigh-PerformanceImplementationofCholeskyFactorizationthatcanbefoundat http://www.cs.utexas.edu/users/flame/Movies.html#Chol.Exercise14.ImplementallthealgorithmsinFigure5inM-script(thescriptinglanguageofMatlabandOctave).Exercise15.ImplementallthealgorithmsinFigure5usingtheFLAME/CAPImentionedinthevideo.10PerformanceInFigure6weshowtheperformanceattainedonamulticorearchitecture.Allexperimentswereperformedusingdouble-precision oating-pointarithmeticonaDell/PowerEdge-1435poweredbyaIntelXeonX5355(2.666GHz)processor.Thisparticulararchitecturehasfourcores,eachofwhichhasaclockrateofabout2.7GHzandeachcorecanperformfour oatingpointoperationsperclockcycle.Thus,thepeakofthismachineis4cores4FLOPS cycle2:7109cycles core43109FLOPS=43GFLOPS:Whatthegraphshowsistheproblemsizeforwhichperformancewasmeasuredalongthex-axisandtherateatwhichthealgorithmcomputedalongthey-axis.Theratewascomputedby,foragiventime,measuringthetimerequiredtocompletethecomputation,t(n)(inseconds),andthendividingthenumberof oatingpointoperationsbythistime.Thisgivestherate,in oatingpointoperationspersecond(FLOPS),atwhichthecomputationexecuted.Sincethisisanumbermeasuredinbillions,wethendividethisnumberbyabilliontogettherateinGFLOPS(GigaFLOPS,orbillionsof oatingpointoperationspersecond):rateinGFLOPS=n3 t(n)109: 14 algorithmsrelatedtotheinversionofasymmetricpositivedenitematrix.ACMTransactionsonMathematicalSoftware,35(1). 2. Twopapersthatshowhowmatrix-matrixoperationscanachievenear-peakperfor-mance: KazushigeGotoandRobertA.vandeGeijn.Anatomyofhigh-performancematrixmultiplication.ACMTrans.Math.Soft.,34(3:Article12,25pages),May2008.and KazushigeGotoandRobertvandeGeijn.High-performanceimplementa-tionofthelevel-3BLAS.ACMTrans.Math.Softw.,35(1):1{14,2008. 3. Abookthatshowshowtosystematicallyderiveallloop-basedalgorithmsforabroadrangeoflinearalgebraoperations,includingCholeskyfactorization: RobertA.vandeGeijnandEnriqueS.Quintana-Ort.TheScienceofPro-grammingMatrixComputations.http://www.lulu.com/content/1911788,2008. 4. TheUser'sGuideforthelibflamelibrary,amodernreplacementfortheLAPACKlibrarythatiscodedusingtheAPIsmentionedinthemovie: FieldG.VanZee.libflame:TheCompleteReference.www.lulu.com,2009. 5. AnoverviewoftheFLAMEproject: FieldG.VanZee,ErnieChan,RobertvandeGeijn,EnriqueS.Quintana-Ort,andGregorioQuintana-Ort.Introducing:Thelib amelibraryfordensematrixcomputations.IEEEComputationinScience&Engineering,11(6):56{62,2009. 16