# Predictive low-rank decomposition for kernel methods

Francis R. Bach (francis.bach@mines.org)
Centre de Morphologie Mathematique, Ecole des Mines de Paris, 35 rue Saint-Honore, 77300 Fontainebleau, France

Michael I. Jordan (jordan@cs.berkeley.edu)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

*Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).*

### Abstract

Low-rank matrix decompositions are essential tools in the application of kernel methods to large-scale learning problems. These decompositions have generally been treated as black boxes: the decomposition of the kernel matrix that they deliver is independent of the specific learning task at hand, and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices. Our algorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition: it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results showing that our algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.

## 1. Introduction

Kernel methods provide a unifying framework for the design and analysis of machine learning algorithms (Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004). A key step in any kernel method is the reduction of the data to a *kernel matrix* $K$, also known as a *Gram matrix*. Given the kernel matrix, generic matrix-based algorithms are available for solving learning problems such as classification, prediction, anomaly detection, clustering and dimensionality reduction.
There are two principal advantages to this division of labor: (1) any reduction that yields a positive semidefinite kernel matrix is allowed, a fact that opens the door to specialized transformations that exploit domain-specific knowledge; and (2) expressed in terms of the kernel matrix, learning problems often take the form of convex optimization problems, and powerful algorithmic techniques from the convex optimization literature can be brought to bear in their solution.

An apparent drawback of kernel methods is the naive computational complexity associated with manipulating kernel matrices. Given a set of $n$ data points, the kernel matrix $K$ is of size $n \times n$. This suggests a computational complexity of at least $O(n^2)$; in fact most kernel methods have at their core operations such as matrix inversion or eigenvalue decomposition, which scale as $O(n^3)$. Moreover, some kernel algorithms make use of sophisticated tools such as semidefinite programming and have even higher-order polynomial complexities (Lanckriet et al., 2004).

These generic worst-case complexities can often be skirted, and this fact is one of the major reasons for the practical success of kernel methods. The underlying issue is that kernel matrices often have a rapidly decaying spectrum and are thus of small numerical rank (Williams and Seeger, 2000). Standard algorithms from numerical linear algebra can thus be exploited to compute an approximation of the form $K \approx GG^\top$, where $G \in \mathbb{R}^{n \times m}$, and where the rank $m$ is generally significantly smaller than $n$. Moreover, it is often possible to reformulate kernel-based learning algorithms to make use of $G$ instead of $K$. The resulting computational complexity generally scales as $O(m^2 n)$. This linear scaling in $n$ makes kernel-based methods viable for large-scale problems.

To achieve this desirable result, it is of course necessary that the underlying numerical linear algebra routines scale linearly in $n$, a desideratum that inter alia rules out routines that inspect all of the entries of $K$. Algorithms that meet this desideratum include the Nystrom approximation (Williams and Seeger, 2000), sparse greedy approximations (Smola and Scholkopf, 2000) and incomplete Cholesky decomposition (Fine and Scheinberg, 2001, Bach and Jordan, 2002).

One unappealing aspect of the current state of the art is that the decomposition of the kernel matrix is performed independently of the learning task. Thus, in the classification setting, the decomposition of $K$ is performed independently of the labels, and in the regression setting the decomposition is performed independently of the response variables. It seems unlikely that a single decomposition would be appropriate for all possible learning tasks, and unlikely that a decomposition computed independently of the predictions would be optimal for the particular task at hand. Similar issues arise in other areas of machine learning; for example, in classification problems, while principal component analysis can be used to reduce dimensionality in a label-independent manner, methods such as linear discriminant analysis that take the labels into account are generally viewed as preferable (Hastie et al., 2001). The point of view of the current paper is that there are likely to be advantages to being "discriminative" not only with respect to the parameters of a model, but with respect to the underlying matrix algorithms as well. Thus we pose the following two questions:

1. Can we exploit side information (labels, desired responses, etc.) in the computation of low-rank decompositions of kernel matrices?
2. Can we compute these decompositions with a computational complexity that is linear in $n$?

The current paper answers both of these questions in the affirmative.
Although some new ideas are needed, the end result is an algorithm closely related to incomplete Cholesky decomposition, whose complexity is a constant factor times the complexity of standard incomplete Cholesky decomposition. As we will show empirically, the new algorithm yields decompositions of significantly smaller rank than those of the standard approach.

The paper is organized as follows. In Section 2, we review classical incomplete Cholesky decomposition with pivoting. In Section 3, we present our new predictive low-rank decomposition framework, and in Section 4 we present the details of the computations performed at each iteration, as well as the exact cost reduction of such steps. In Section 5, we show how the cost reduction can be efficiently approximated via a look-ahead method. Empirical results are presented in Section 6 and we present our conclusions in Section 7.

We use the following notations: for a rectangular matrix $M$, $\|M\|_F$ denotes the Frobenius norm, defined as $\|M\|_F = (\operatorname{tr} MM^\top)^{1/2}$; $\|M\|_1$ denotes the sum of the singular values of $M$, which is equal to the sum of the eigenvalues of $M$ when $M$ is square and symmetric, and in turn equal to $\operatorname{tr} M$ when the matrix is in addition positive semidefinite. We also let $\|v\|$ denote the 2-norm of a vector $v$, equal to $\|v\| = (v^\top v)^{1/2}$. Given two sequences of distinct indices $I$ and $J$, $M(I, J)$ denotes the submatrix of $M$ composed of rows indexed by $I$ and columns indexed by $J$. Note that the sequences $I$ and $J$ are not necessarily increasing sequences. The notation $M(:, J)$ denotes the submatrix of the columns of $M$ indexed by the elements of $J$, and similarly for $M(I, :)$. Also, we refer to the sequence of integers from 1 to $n$ as $1{:}n$. Finally, we denote the concatenation of two sequences $I$ and $J$ as $[I\ J]$. We let $\mathrm{Id}$ denote the identity matrix and $1$ denote the vector of all ones.

## 2. Incomplete Cholesky decomposition

In this section, we review incomplete Cholesky decomposition with pivoting, as used by Fine and Scheinberg (2001) and Bach and Jordan (2002).

### 2.1. Decomposition algorithm

Incomplete Cholesky decomposition is an iterative algorithm that yields an approximation $K \approx GG^\top$, where $G \in \mathbb{R}^{n \times m}$, and where the rank $m$ is generally significantly smaller than $n$. The algorithm depends on a sequence of pivots $i_1, i_2, \ldots \in \{1, \ldots, n\}$. Assuming temporarily that the pivots are known, and initializing a diagonal matrix $D$ to the diagonal of $K$, the $k$-th iteration of the algorithm is as follows:

$$G(i_k, k) = D(i_k)^{1/2}$$
$$G(J_k, k) = \Big( K(J_k, i_k) - \sum_{j=1}^{k-1} G(J_k, j)\, G(i_k, j) \Big) \Big/ G(i_k, k)$$
$$D(j) = D(j) - G(j, k)^2, \quad j \notin \{i_1, \ldots, i_k\},$$

where $I_k = (i_1, \ldots, i_k)$ and $J_k$ denotes the sorted complement of $I_k$. The complexity of the $k$-th iteration is $O(kn)$, and thus the total complexity after $m$ steps is $O(m^2 n)$. After the $k$-th iteration, $G(I_k, 1{:}k)$ is a lower triangular matrix and the approximation of $K$ is $G_k G_k^\top$, where $G_k$ is the matrix composed of the first $k$ columns of $G$, i.e., $G_k = G(:, 1{:}k)$. We let $D_k$ denote the diagonal matrix after the $k$-th iteration. (In this paper, the matrices $G$ will always have full rank, i.e., the rank will always be the number of columns.)
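The iteration above can be sketched in a few lines of numpy (an illustrative transcription, not the authors' implementation; the function and variable names are ours). Here the pivot rule of the next subsection, greedy selection on the residual diagonal, is already built in:

```python
import numpy as np

def incomplete_cholesky(K, m, tol=1e-9):
    """Pivoted incomplete Cholesky: returns G (n x k, k <= m) with
    K ~= G @ G.T, plus the list of selected pivots. Pivots are chosen
    greedily on the residual diagonal D, with early stopping when no
    remaining pivot exceeds `tol`."""
    n = K.shape[0]
    G = np.zeros((n, m))
    D = np.diag(K).copy()            # residual diagonal D_k
    pivots = []
    for k in range(m):
        i = int(np.argmax(D))        # lower-bound pivot rule
        if D[i] <= tol:              # early stopping criterion
            break
        pivots.append(i)
        G[i, k] = np.sqrt(D[i])
        J = [j for j in range(n) if j not in pivots]
        G[J, k] = (K[J, i] - G[J, :k] @ G[i, :k]) / G[i, k]
        D -= G[:, k] ** 2            # update residual diagonal
        D[i] = 0.0
    return G[:, :len(pivots)], pivots
```

On a matrix of exact low rank, the residual diagonal reaches (numerical) zero after exactly that many pivots, so the loop stops early.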


### 2.2. Pivot selection and early stopping

The algorithm operates by greedily choosing the column such that the approximation of $K$ obtained by adding that column is best. In order to select the next pivot $i_k$, we thus have to rank the gains in approximation error for all remaining columns. Since all approximating matrices $G_k G_k^\top$ are such that $K - G_k G_k^\top \succcurlyeq 0$, the 1-norm $\|K - G_k G_k^\top\|_1$, which is defined as the sum of the singular values, is equal to $\operatorname{tr}(K - G_k G_k^\top)$. Computing the exact gain in approximation after adding a column is an $O(kn)$ operation. If this were to be done for all remaining columns at each iteration, we would obtain a prohibitive total complexity of $O(m^2 n^2)$. The algorithm avoids this cost by using a lower bound on the gain in approximation. Note in particular that at every step we have $\operatorname{tr}(K - G_k G_k^\top) = \operatorname{tr} K - \sum_{q=1}^{k} \|G(:, q)\|^2$; thus the gain of adding the $k$-th column is $\|G(:, k)\|^2$, which is lower bounded by $G(i_k, k)^2$. Even before the $k$-th iteration has begun, we know the final value of $G(i_k, k)^2$ if $i_k$ were chosen, since this is exactly $D_{k-1}(i_k)$. We thus choose the pivot that maximizes the lower bound $D_{k-1}(i)$ among the remaining indices. This strategy also provides a principled early stopping criterion: if no pivot is larger than a given precision $\eta$, the algorithm stops.

### 2.3. Low-rank approximation and partitioned matrices

Incomplete Cholesky decomposition yields a decomposition in which the column space of $G$ is spanned by a subset of the columns of $K$. As the following proposition shows, under additional constraints the subset of columns actually determines the approximation:

**Proposition 1** Let $K$ be an $n \times n$ symmetric positive semidefinite matrix. Let $I$ be a sequence of distinct elements of $\{1, \ldots, n\}$ and $J$ its ordered complement in $\{1, \ldots, n\}$. There is a unique matrix $L$ of size $n \times n$ such that: (i) $L$ is symmetric; (ii) the column space of $L$ is spanned by $K(:, I)$; (iii) $L(:, I) = K(:, I)$. This matrix is such that

$$L([I\ J], [I\ J]) = \begin{pmatrix} K(I, I) & K(I, J) \\ K(J, I) & K(J, I)\, K(I, I)^\dagger\, K(I, J) \end{pmatrix}.$$

In addition, the matrices $L$ and $K - L$ are positive semidefinite.
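The claimed form of $L$ is easy to check numerically (an illustrative sketch with numpy; the random matrix and the index choices are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
K = A @ A.T                         # a random symmetric PSD matrix
I = [2, 5, 0]                       # an arbitrary pivot sequence

# The unique symmetric L whose columns are spanned by K[:, I] and
# which agrees with K on those columns:
L = K[:, I] @ np.linalg.pinv(K[np.ix_(I, I)]) @ K[I, :]

assert np.allclose(L, L.T)                  # (i) symmetric
assert np.allclose(L[:, I], K[:, I])        # (iii) exact on selected columns
eig = np.linalg.eigvalsh(K - L)
assert eig.min() > -1e-8                    # K - L is positive semidefinite
```

The last check is the Schur-complement property mentioned after the proof: the error on the $(J, J)$ block is exactly the Schur complement of $K(I, I)$.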
**Proof** If $L$ satisfies the three conditions, then (i) and (iii) imply that $L(I, J) = L(J, I)^\top = K(J, I)^\top = K(I, J)$. Since the column space of $L$ is spanned by $K(:, I)$, we must have $L(:, J) = K(:, I) B$, where $B$ is an $|I| \times |J|$ matrix. By projecting onto the columns in $I$, we get $K(I, J) = K(I, I) B$, which implies that $L(J, J) = K(J, I)\, K(I, I)^\dagger\, K(I, J)$, where $K(I, I)^\dagger$ denotes the pseudo-inverse of $K(I, I)$ (Golub and Van Loan, 1996).

Note that the approximation error for the block $K(J, J)$ is equal to the Schur complement $K(J, J) - K(J, I) K(I, I)^\dagger K(I, J)$ of $K(I, I)$. The incomplete Cholesky decomposition with pivoting builds a set $I = (i_1, \ldots, i_m)$ iteratively and approximates the matrix $K$ by the matrix $L$ given in the previous proposition for the given $I$. To obtain $G$, a square root of $K(I, I)$ has to be computed that is easy to invert. The Cholesky decomposition provides such a square root, which is built efficiently as $I$ grows.

## 3. Predictive low-rank decomposition

We now assume that the kernel matrix $K$ is associated with side information of the form $Y \in \mathbb{R}^{n \times d}$. Supervised learning problems provide standard examples of problems in which such side information is present. For example, in the (multi-way) classification setting, $d$ is the number of classes and each row of $Y$ has $d$ elements such that $Y_{ij}$ is equal to one if the corresponding data point belongs to class $j$, and zero otherwise. In the (multiple) regression setting, $d$ is the number of response variables. In all of these cases, our objective is to find an approximation of $K$ which (1) leads to good predictive performance and (2) has small rank.

### 3.1. Prediction with kernels

In this section, we review the classical theory of reproducing kernel Hilbert spaces (RKHS) which is necessary to justify the error term that we use to characterize how well an approximation of $K$ is able to predict $Y$.

Let $x_i \in \mathcal{X}$ be an input data point and let $y_i$ denote the associated label or response variable, for $i = 1, \ldots, n$. Let $\mathcal{F}$ be an RKHS on $\mathcal{X}$, with kernel $k(\cdot, \cdot)$. Given a loss function $\ell: \mathcal{X} \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, the empirical risk is defined as $R(f) = \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i, f(x_i))$ for functions $f$ in $\mathcal{F}^d$.
By a simple multivariate extension of the representer theorem (Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004), minimizing the empirical risk subject to a constraint on the RKHS norm of $f$ leads to a solution of the form $f_j(x) = \sum_{i=1}^n \alpha_{ij}\, k(x, x_i)$, where $\alpha \in \mathbb{R}^{n \times d}$.


In this paper, we build our kernel approximations by considering the quadratic loss $\ell(x, y, \hat{y}) = \|y - \hat{y}\|^2$. The empirical risk is then equal to $\frac{1}{n}\|Y - K\alpha\|_F^2$, where $\alpha \in \mathbb{R}^{n \times d}$. When $K$ is approximated by $GG^\top$, for $G$ an $n \times m$ matrix, the optimal risk is equal to:

$$\min_{\alpha} \|Y - GG^\top \alpha\|_F^2 = \min_{\beta} \|Y - G\beta\|_F^2 \tag{1}$$

### 3.2. Global objective function

The global criterion that we consider is a linear combination of the approximation error of $K$ and the loss as defined in Eq. (1), i.e.:

$$F(G) = \lambda \|K - GG^\top\|_1 + \mu \min_{\beta} \|Y - G\beta\|_F^2$$

For convenience we use the following normalized values of $\lambda$ and $\mu$ (which correspond to the values of the corresponding terms in the objective for $G = 0$): $\lambda = \kappa / \operatorname{tr} K$ and $\mu = (1 - \kappa)/ \operatorname{tr} Y^\top Y$. The parameter $\kappa$ thus calibrates the tradeoff between approximation of $K$ and prediction of $Y$.

The matrix $\beta$ can be minimized out to obtain the following criterion:

$$F(G) = \lambda \|K - GG^\top\|_1 + \mu \operatorname{tr} Y^\top (\mathrm{Id} - \Pi_G) Y,$$

where $\Pi_G$ denotes the orthogonal projection onto the column space of $G$. Finally, if we incorporate the constraint $K \succcurlyeq GG^\top$, we obtain the final form of the criterion:

$$F(G) = \lambda \operatorname{tr}(K - GG^\top) + \mu \operatorname{tr} Y^\top (\mathrm{Id} - \Pi_G) Y \tag{2}$$

## 4. Cholesky with side information (CSI)

Our algorithm builds on incomplete Cholesky decomposition, restricting the matrices $G$ that it considers to those which are obtained as incomplete Cholesky factors of $K$. In order to select the pivot, we need to compute the gain in the cost function in Eq. (2) for each pivot at each iteration. Let us denote the two terms in the cost function as $\lambda J_1(G_k)$ and $\mu J_2(G_k)$. The first term has been studied in Section 2, where we found that

$$J_1(G_k) = \operatorname{tr} K - \sum_{q=1}^{k} \|G(:, q)\|^2 \tag{3}$$

In order to compute the second term,

$$J_2(G_k) = \operatorname{tr} Y^\top (\mathrm{Id} - \Pi_{G_k}) Y \tag{4}$$

efficiently, we need an efficient way of computing the projection $\Pi_{G_k}$ which is amenable to cheap updating as $k$ increases. This can be achieved by QR decomposition.

### 4.1. QR decomposition

Given a rectangular matrix $G \in \mathbb{R}^{n \times m}$, such that $n \geqslant m$, the QR decomposition of $G$ is of the form $G = QR$, where $Q$ is an $n \times m$ matrix with orthonormal columns, i.e., $Q^\top Q = \mathrm{Id}$, and $R$ is an $m \times m$ upper triangular matrix.
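The Gram-Schmidt construction of this factorization, described next, can be sketched as follows (an illustrative numpy transcription, not the authors' code; the function name is ours):

```python
import numpy as np

def gram_schmidt_qr(G):
    """Classical Gram-Schmidt QR of an n x m matrix G (n >= m):
    returns Q with orthonormal columns and upper-triangular R with
    G = Q @ R. Stops early and returns truncated factors if a column
    is linearly dependent (R[k, k] vanishes)."""
    n, m = G.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        v = G[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ G[:, k]   # coefficient on earlier basis vector
            v -= R[j, k] * Q[:, j]        # remove that component
        R[k, k] = np.linalg.norm(v)
        if R[k, k] < 1e-12:               # rank deficiency: stop
            return Q[:, :k], R[:k, :k]
        Q[:, k] = v / R[k, k]
    return Q, R
```

Each iteration only touches the new column, which is exactly the cheap-updating property exploited when the QR factors are grown alongside the Cholesky factors.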
The matrix $Q$ provides an orthonormal basis of the column space of $G$; if $G$ has full rank $m$, then $Q$ has $m$ columns, while if not, the number of columns of $Q$ is equal to the rank of $G$. The QR decomposition can be seen as the Gram-Schmidt orthonormalization of the column vectors of $G$ (Golub and Van Loan, 1996); moreover, the matrix $R$ is the Cholesky factor of the matrix $G^\top G$.

A simple iterative algorithm to compute the QR decomposition of $G$ follows the Gram-Schmidt orthonormalization procedure. The first columns of $Q$ and $R$ are defined as $Q(:, 1) = G(:, 1)/\|G(:, 1)\|$ and $R(1, 1) = \|G(:, 1)\|$. The $k$-th iteration, $k \leqslant m$, is the following:

$$R(j, k) = Q(:, j)^\top G(:, k), \quad j = 1, \ldots, k-1$$
$$R(k, k) = \Big\| G(:, k) - \sum_{i=1}^{k-1} R(i, k)\, Q(:, i) \Big\|$$
$$Q(:, k) = \frac{1}{R(k, k)} \Big( G(:, k) - \sum_{i=1}^{k-1} R(i, k)\, Q(:, i) \Big)$$

The algorithm stops whenever $k$ reaches $m$ or $R(k, k)$ vanishes. The complexity of each iteration is $O(kn)$, and thus the total complexity up to the $m$-th step is $O(m^2 n)$.

### 4.2. Parallel Cholesky and QR decompositions

While building the Cholesky decomposition iteratively as described in Section 2.1, we update its QR decomposition at each step. The complexity of each iteration is $O(kn)$ and thus, if the algorithm stops after $m$ steps, the total complexity is $O(m^2 n)$. We still need to describe the pivot selection strategy; as for the Cholesky decomposition we use a greedy strategy, i.e., we choose the pivot that most reduces the cost. In the following sections, we show how this choice can be performed efficiently.

### 4.3. Cost reduction

We use the following notation: $R_k = R(1{:}k, 1{:}k)$, $Q_k = Q(:, 1{:}k)$, $G_k = G(:, 1{:}k)$, $g_k = G(:, k)$ and $q_k = Q(:, k)$. After the $k$-th iteration the cost function is equal to

$$\lambda\Big(\operatorname{tr} K - \sum_{q=1}^{k} \|G(:, q)\|^2\Big) + \mu \operatorname{tr} Y^\top (\mathrm{Id} - Q_k Q_k^\top) Y$$

and the cost reduction at the $k$-th step is thus equal to

$$\Delta_k = \lambda\, \Delta^1_k + \mu\, \Delta^2_k \tag{5}$$


where

$$\Delta^1_k = \|g_k\|^2 \tag{6}$$
$$\Delta^2_k = \frac{\big\|Y^\top (\mathrm{Id} - Q_{k-1} Q_{k-1}^\top)\, g_k\big\|^2}{\big\|(\mathrm{Id} - Q_{k-1} Q_{k-1}^\top)\, g_k\big\|^2} \tag{7}$$

Following Section 2.1, we can express $g_k$ in terms of the pivot $i_k$ and the approximation after the $(k-1)$-th iteration, i.e.,

$$g_k = \frac{(K - G_{k-1} G_{k-1}^\top)(:, i_k)}{\big((K - G_{k-1} G_{k-1}^\top)(i_k, i_k)\big)^{1/2}} \tag{8}$$

Computing this reduction before the $k$-th iteration for all $n - k + 1$ available pivots is a prohibitive $O(kn^2)$ operation. As in the case of Cholesky decomposition, a lower bound on the reduction can be computed to avoid this costly operation. However, we have developed a different strategy, one based on a look-ahead algorithm that gives cheap additional information on the kernel matrix. This strategy is presented in the next section.

## 5. Look-ahead decompositions

At every step of the algorithm, we not only perform one step of Cholesky and QR, but we also perform several "look-ahead steps" to gather more information about the kernel matrix $K$. Throughout the procedure we maintain the following information: (1) decomposition matrices $G^{(k-1)}$, $D^{(k-1)}$, $Q^{(k-1)}$ and $R^{(k-1)}$, obtained from the sequence of indices $I_{k-1} = (i_1, \ldots, i_{k-1})$; (2) additional decomposition matrices obtained by $\delta$ additional runs of Cholesky and QR decomposition: $G^{\mathrm{adv}}$, $D^{\mathrm{adv}}$, $Q^{\mathrm{adv}}$ and $R^{\mathrm{adv}}$, with $k-1+\delta$ columns. The first $k-1$ columns of $G^{\mathrm{adv}}$ and $Q^{\mathrm{adv}}$ are the matrices $G^{(k-1)}$ and $Q^{(k-1)}$, and the additional columns that are added are indexed by $H = (h_1, \ldots, h_\delta)$. We now describe how this information is updated, and how it is used to approximate the cost reduction. A high-level description of the overall algorithm is given in Figure 1.

### 5.1. Approximation of the cost reduction

After the $(k-1)$-th iteration, we have the following approximations: $K \approx G^{(k-1)} G^{(k-1)\top}$ and $K \approx K^{\mathrm{adv}} = G^{\mathrm{adv}} G^{\mathrm{adv}\top}$. In order to approximate the cost reduction defined by Eqs. (5), (6), (7) and (8), we replace all currently unknown portions of the kernel matrix (i.e., the columns whose indices are not in $[I_{k-1}\ H]$) by the corresponding elements of $K^{\mathrm{adv}}$. This is equivalent to replacing in Eq.
(8) the matrix $K$ by $K^{\mathrm{adv}}$. In order to approximate $\Delta^1_k(i)$, we also make sure that the diagonal term $(K - G_{k-1}G_{k-1}^\top)(i, i)$ is not approximated, so that our error term reduces to the lower bound of the incomplete Cholesky decomposition when $\delta = 0$ (i.e., no look-ahead performed); this is obtained through a corrective term in the resulting update equations, which yield approximations $\hat{\Delta}^1_k(i)$ (Eq. 9) and $\hat{\Delta}^2_k(i)$ (Eq. 10), built from $K^{\mathrm{adv}} - G_{k-1}G_{k-1}^\top$ and the projection $\mathrm{Id} - Q_{k-1}Q_{k-1}^\top$. Note that when the index $i$ belongs to the set of indices that were considered in advance, the approximation is exact.

A naive computation of the approximation would lead to a prohibitive quadratic complexity in $n$. We now present a way of updating the quantities defined above, as well as a way of updating the look-ahead Cholesky and QR steps, at a cost of $O(\delta n + dn)$ per iteration.

### 5.2. Efficient implementation

**Updating the look-ahead decompositions.** After the pivot $i_k$ has been chosen, if it was not already included in the set of indices already treated in advance, we perform the additional step of Cholesky and QR decomposition with that pivot. If it was already chosen, we select a new pivot using the usual Cholesky lower bound defined in Section 2. Let $G^{\mathrm{bef}}$, $Q^{\mathrm{bef}}$ and $R^{\mathrm{bef}}$ be those decompositions with $k+\delta$ columns. In both cases, we obtain a Cholesky decomposition whose $k$-th pivot is not $i_k$ in general, since $i_k$ may not be among the first look-ahead pivots from the previous iteration. In general, $i_k$ is less than $\delta$ indices away from the $k$-th position. In order to compute $G^{\mathrm{adv}}$, $Q^{\mathrm{adv}}$ and $R^{\mathrm{adv}}$, we need to update the Cholesky and QR decompositions to advance pivot $i_k$ to the $k$-th position. In Appendix A, we show how this can be done with worst-case time complexity $O(\delta n)$, which is faster than naively redoing $\delta$ steps of Cholesky decomposition in $O(k\delta n)$.

**Figure 1. High-level description of the CSI algorithm.**

Input: kernel matrix $K$, target matrix $Y$, maximum rank $m$, tolerance $\eta$, tradeoff parameter $\kappa \in [0, 1]$, number of look-ahead steps $\delta$.

Algorithm:
1. Perform $\delta$ look-ahead steps of Cholesky (Section 2.1) and QR decomposition (Section 4.1), selecting pivots according to Section 2.2.
2. Initialization: $k = 1$.
3. While the gain is larger than $\eta$ and $k \leqslant m$:
   a. compute estimated gains for the remaining pivots (Section 5.1), and select the best pivot;
   b. if the new pivot is not in the set of look-ahead pivots, perform a Cholesky and a QR step; otherwise perform the steps with a pivot selected according to Section 2.2;
   c. permute indices in the Cholesky and QR decompositions to put the new pivot in position $k$, using the method in Appendix A;
   d. compute the exact gain; let $k \leftarrow k + 1$.

Output: $G$ and its QR decomposition.

**Updating the approximation costs.** In order to derive update equations for $\hat{\Delta}^1_k(i)$ and $\hat{\Delta}^2_k(i)$, the crucial point is to notice that each column $G^{\mathrm{adv}}(:, i)$ can be expressed in terms of $G^{\mathrm{bef}}(:, i)$ and $G^{\mathrm{bef}}(:, k)$. This makes most of the terms in the expansion of $\hat{\Delta}^1_k(i)$ and $\hat{\Delta}^2_k(i)$ identical to terms already available from the previous iteration. The total complexity of updating these quantities, for all $i$, is then $O(dn + \delta n)$ per iteration.

### 5.3. Computational complexity

The total complexity of the CSI algorithm after $m$ steps is the sum of (a) $m + \delta$ steps of Cholesky and QR decomposition, i.e., $O((m+\delta)^2 n)$; (b) updating the look-ahead decompositions by permuting indices as presented in Appendix A, i.e., $O(\delta m n)$; and (c) updating the approximation costs, i.e., $O(mdn + m\delta n)$. The total complexity is thus $O((m+\delta)^2 n + mdn)$. In the usual case in which $d$ and $\delta$ are at most of the order of $m$, this yields a total complexity equal to $O((m+\delta)^2 n)$, which is the same complexity as computing $m + \delta$ steps of Cholesky and QR decomposition. For large kernel matrices, the Cholesky and QR decompositions remain the most costly computations, and thus the CSI algorithm is only a few times slower than the standard incomplete Cholesky decomposition. We see that the CSI algorithm has the same favorable linear complexity in the number of data points $n$ as standard Cholesky decomposition.
In particular, we do not need to examine every entry of the kernel matrix in order to compute the CSI approximation. This is particularly important when the kernel is itself costly to compute, as in the case of string kernels or graph kernels (Shawe-Taylor and Cristianini, 2004).

### 5.4. Including an intercept

It is straightforward to include an intercept in the CSI algorithm. This is done by replacing $Y$ with $\Pi Y$, where $\Pi = \mathrm{Id} - \frac{1}{n} 1 1^\top$ is the centering projection matrix. The Cholesky decomposition is not changed, while the QR decomposition is now performed on $\Pi G$ instead of $G$. The rest of the algorithm is not changed.

## 6. Experiments

We have conducted a comparison of CSI and incomplete Cholesky decomposition for 37 UCI datasets, including both regression and (multi-way) classification problems. The kernel method that we used in these experiments is the least-squares SVM (Suykens and Vandewalle, 1999). The goal of the comparison was to investigate to what extent we can achieve a lower-rank decomposition with the CSI algorithm as compared to incomplete Cholesky, at equivalent levels of predictive performance.

### 6.1. Least-squares SVMs

The least-squares SVM (LS-SVM) algorithm is based on the minimization of the following cost function:

$$\frac{1}{n}\|Y - K\alpha\|_F^2 + \tau \operatorname{tr} \alpha^\top K \alpha,$$

where $K \in \mathbb{R}^{n \times n}$ and $\alpha \in \mathbb{R}^{n \times d}$. This is a classical penalized least-squares problem, whose estimating equations are obtained by setting the derivatives to zero:

$$K(K + n\tau\, \mathrm{Id})\alpha = KY.$$

### 6.2. Least-squares SVM with incomplete Cholesky decomposition

We now approximate $K$ by an incomplete Cholesky factorization obtained from $m$ columns of $K$, i.e., $K \approx GG^\top$. (A Matlab/C implementation can be downloaded from http://cmm.ensmp.fr/~bach/.) Expressed in terms of $G$, the estimating equations for the LS-SVM become:

$$GG^\top (GG^\top + n\tau\, \mathrm{Id})\alpha = GG^\top Y \tag{11}$$

The solutions of Eq. (11) are the vectors of the form

$$\alpha = (GG^\top + n\tau\, \mathrm{Id})^{-1} Y + v, \tag{12}$$

where $v$ is any vector orthogonal to the column space of $G$. Thus $\alpha$ is not uniquely defined; however, the quantity $K\alpha$ is uniquely defined, and equal to $K\alpha = GG^\top (GG^\top + n\tau\, \mathrm{Id})^{-1} Y = G(G^\top G + n\tau\, \mathrm{Id})^{-1} G^\top Y$; these are the predicted training responses.

In order to compute the responses for previously unseen data points $z_j$, for $j = 1, \ldots, n_{\mathrm{test}}$, we consider the rectangular testing kernel matrix $K^{\mathrm{test}} \in \mathbb{R}^{n_{\mathrm{test}} \times n}$, defined as $(K^{\mathrm{test}})_{ji} = k(z_j, x_i)$. We use the approximation of $K^{\mathrm{test}}$ based on the columns of $K^{\mathrm{test}}$ whose indices were already selected in the Cholesky decomposition of $K$. If we let $I$ denote those indices, the testing responses are then equal to $K^{\mathrm{test}}(:, I)\, G(I, :)^{-\top} (G^\top G + n\tau\, \mathrm{Id})^{-1} G^\top Y$, which is uniquely defined (while $\alpha$ is not). This also has the effect of not requiring the computation of the entire testing kernel matrix $K^{\mathrm{test}}$: a substantial gain for large datasets.

In order to compute the training and testing errors, we threshold the responses appropriately (by taking the sign for binary classification, or the closest basis vector for multi-class classification, where each class is mapped to a basis vector).

### 6.3. Experimental results - UCI datasets

We transformed all discrete variables to multivariate real random variables by mapping them to basis vectors; we also scaled each variable to unit variance. We performed 10 random "75/25" splits of the data. We used a Gaussian-RBF kernel, $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, with the parameters $\sigma$ and $\tau$ chosen so as to minimize error on the training split. The minimization was performed by grid search.

We trained and tested several LS-SVMs with decompositions of increasing rank, comparing incomplete Cholesky decomposition to the CSI method presented in this paper. The hyperparameters for the CSI algorithm were set to $\kappa = 0.99$ and $\delta = 40$.
The value of $\delta$ was chosen to be large enough so that in most cases the final rank was the same as if the entire kernel matrix were used, and small enough so that the complexity of the look-ahead was small compared to the rest of the Cholesky decomposition. For both algorithms, the stopping criterion (the minimal gain at each iteration) was set to a small fixed tolerance (a negative power of ten). We imposed no upper bound on the ranks of the decompositions. We report the minimal rank for which the cross-validation error is within a standard deviation of the average testing error obtained when no low-rank decomposition is used. As shown in Figure 2, the CSI algorithm generally yields a decomposition of significantly smaller rank than incomplete Cholesky decomposition; indeed, the difference in minimal ranks achieved by the two algorithms can be dramatic.

## 7. Conclusions

A major theme of machine learning research is the advantages that accrue to "discriminative" methods: methods that adjust all of the parameters of a model to minimize a task-specific loss function. In this paper we have extended this point of view to the matrix algorithms that underlie kernel-based learning methods. With the incomplete Cholesky decomposition as a starting point, we have developed a new low-rank decomposition algorithm for positive semidefinite matrices that can exploit side information (e.g., classification labels). We have shown that this algorithm yields decompositions of significantly lower rank than those obtained with current methods (which ignore the side information). Given that the computational requirements of the new algorithm are comparable to those of standard incomplete Cholesky decomposition, we feel that the new algorithm can and should replace incomplete Cholesky in a variety of applications.
There are several natural extensions of the research reported here that are worth pursuing, most notably the extension of these results to situations in which two or more related kernel matrices have to be approximated conjointly, such as in kernel canonical correlation analysis (Bach and Jordan, 2002) or multiple kernel learning (Lanckriet et al., 2004).

## Appendix A. Efficient pivot permutation

In this appendix we describe an efficient algorithm to advance the pivot with index $q$ to position $p < q$ in an incomplete Cholesky and QR decomposition. This can be achieved by $q - p$ transpositions between successive pivots. Permuting two successive pivots $(p, p+1)$ can be done in $O(n)$ as follows (we let $P$ denote the index pair $(p, p+1)$):

1. Permute rows $p$ and $p+1$ of $G$ and of $R$;
2. Perform a QR decomposition of the $2 \times 2$ block $R(P, P)$;
3. Apply the resulting $2 \times 2$ orthogonal transformation to the columns $G(:, P)$ and $Q(:, P)$;


| dataset | $d$ | $k$ | $n$ | Chol, CSI |
|---|---|---|---|---|
| ringnorm | 20 | 2 | 1000 | 14, ? |
| kin-32fh-c | 32 | 2 | 2000 | 25, ? |
| pumadyn-32nm | 32 | – | 4000 | 93, 23 |
| pumadyn-32fh | 32 | – | 4000 | 30, ? |
| kin-32fh | 32 | – | 4000 | 34, 10 |
| cmc | 12 | 3 | 1473 | 10, ? |
| bank-32fh | 32 | – | 4000 | 221, 72 |
| page-blocks | 8 | 2 | 5473 | 451, 155 |
| spambase | 49 | 2 | 4000 | 90, 31 |
| isolet | 617 | 8 | 1798 | 254, 89 |
| twonorm | 20 | 2 | 4000 | ?, ? |
| dermatology | 34 | 2 | 358 | 32, 14 |
| comp-activ | 21 | – | 4000 | 159, 73 |
| abalone | 10 | – | 4000 | 27, 13 |
| yeast | 7 | 3 | 673 | ?, ? |
| titanic | 8 | 2 | 2201 | ?, ? |
| kin-32nm-c | 32 | 2 | 4000 | 122, 68 |
| pendigits | 16 | 4 | 4485 | 111, 63 |
| adult | 3 | 2 | 4000 | ?, ? |
| ionosphere | 33 | 2 | 351 | 76, 45 |
| liver | 6 | 2 | 345 | 15, ? |
| pi-diabetes | 8 | 2 | 768 | 10, ? |
| segmentation | 15 | 3 | 660 | ?, ? |
| waveform | 21 | 3 | 2000 | ?, ? |
| splice | 240 | 3 | 3175 | 487, 305 |
| census-16h | 16 | – | 1000 | 42, 28 |
| kin-32nm | 32 | – | 2000 | 307, 211 |
| add10 | 10 | – | 2000 | 280, 204 |
| mushroom | 116 | 2 | 4000 | 60, 44 |
| bank-32-nm | 32 | – | 4000 | 413, 328 |
| kin-32nm | 32 | – | 4000 | 586, 479 |
| vehicle | 18 | 2 | 416 | 31, 27 |
| breast | 9 | 2 | 683 | ?, ? |
| thyroid | 7 | 4 | 1000 | ?, ? |
| satellite | 36 | 3 | 2000 | ?, ? |
| vowel | 10 | 4 | 360 | 70, 73 |
| optdigits | 58 | 6 | 2000 | 68, 72 |
| boston | 12 | – | 506 | 48, 61 |

Figure 2. Simulation results on UCI datasets, where $d$ is the number of features, $k$ the number of classes ('–' for regression problems), and $n$ the number of data points. For both classical incomplete Cholesky decomposition (Chol) and Cholesky decomposition with side information (CSI), we report the minimal rank for which the prediction performance with a decomposition of that rank is within one standard deviation of the performance with a full-rank kernel matrix (entries marked '?' were lost in extraction). Datasets are sorted by the values of the ratios between the last two columns.

4. Perform a QR decomposition of the updated block $R(P, P)$;
5. Apply the resulting orthogonal transformation to the rows $R(P, :)$ and to the columns $Q(:, P)$.

The total complexity of permuting pivots $p$ and $q$ is thus $O((q - p)\, n)$. Note that all columns of $G$ and $Q$ between $p$ and $q$ are changed, but the updates involve only shuffles between successive columns of $G$ and $Q$.

## Acknowledgements

We wish to acknowledge support from a grant from Intel Corporation, and a graduate fellowship to Francis Bach from Microsoft Research. We also wish to acknowledge Grant 0412995 from the National Science Foundation.

## References

F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243–264, 2001.

G. H. Golub and C. F. Van Loan. Matrix Computations. J. Hopkins Univ. Press, 1996.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004.

B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

A. J. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.

J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Proc. Let., 9(3):293–300, 1999.

C. K. I. Williams and M. Seeger. Effect of the input density distribution on kernel-based classifiers. In Proc. ICML, 2000.

Bach francisbachminesorg Centre de Morphologie Math57524ematique Ecole des Mines de Paris 35 rue SaintHonor57524e 77300 Fontainebleau France Michael I Jordan jordancsberkeleyedu Computer Science Division and Department of Statistics University of Ca ID: 23603

- Views :
**216**

**Direct Link:**- Link:https://www.docslides.com/pamella-moone/predictive-lowrank-decomposition
**Embed code:**

Download this pdf

DownloadNote - The PPT/PDF document "Predictive lowrank decomposition for ker..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Page 1

Predictive low-rank decomposition for kernel methods Francis R. Bach francis.bach@mines.org Centre de Morphologie Mathematique, Ecole des Mines de Paris 35 rue Saint-Honore, 77300 Fontainebleau, France Michael I. Jordan jordan@cs.berkeley.edu Computer Science Division and Department of Statistics University of California, Berkeley, CA 94720, USA Abstract Low-rank matrix decompositions are essen- tial tools in the application of kernel meth- ods to large-scale learning problems. These decompositions have generally been treated as black boxes—the decomposition of the kernel matrix that they deliver is indepen- dent of the speciﬁc learning task at hand and this is a potentially signiﬁcant source of ineﬃciency. In this paper, we present an algorithm that can exploit side informa- tion (e.g., classiﬁcation labels, regression re- sponses) in the computation of low-rank de- compositions for kernel matrices. Our al- gorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition—it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algo- rithm yields decompositions of signiﬁcantly smaller rank than those found by incomplete Cholesky decomposition. 1. Introduction Kernel methods provide a unifying framework for the design and analysis of machine learning algo- rithms (Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004). A key step in any kernel method is the reduction of the data to a kernel matrix , also known as a Gram matrix . Given the kernel matrix, generic matrix-based algorithms are available for solv- ing learning problems such as classiﬁcation, prediction, anomaly detection, clustering and dimensionality re- Appearing in Proceedings of the 22 nd International Confer- ence on Machine Learning , Bonn, Germany, 2005. Copy- right 2005 by the author(s)/owner(s). duction. 
There are two principal advantages to this division of labor: (1) any reduction that yields a positive semidefinite kernel matrix is allowed, a fact that opens the door to specialized transformations that exploit domain-specific knowledge; and (2) expressed in terms of the kernel matrix, learning problems often take the form of convex optimization problems, and powerful algorithmic techniques from the convex optimization literature can be brought to bear in their solution.

An apparent drawback of kernel methods is the naive computational complexity associated with manipulating kernel matrices. Given a set of n data points, the kernel matrix K is of size n × n. This suggests a computational complexity of at least O(n^2); in fact most kernel methods have at their core operations such as matrix inversion or eigenvalue decomposition which scale as O(n^3). Moreover, some kernel algorithms make use of sophisticated tools such as semidefinite programming and have even higher-order polynomial complexities (Lanckriet et al., 2004).

These generic worst-case complexities can often be skirted, and this fact is one of the major reasons for the practical success of kernel methods. The underlying issue is that kernel matrices often have a spectrum that decays rapidly and are thus of small numerical rank (Williams and Seeger, 2000). Standard algorithms from numerical linear algebra can thus be exploited to compute an approximation of the form K ≈ GG^T, where G is an n × m matrix and where the rank m is generally significantly smaller than n. Moreover, it is often possible to reformulate kernel-based learning algorithms to make use of G instead of K. The resulting computational complexity generally scales as O(m^2 n). This linear scaling in n makes kernel-based methods viable for large-scale problems.

To achieve this desirable result, it is of course necessary that the underlying numerical linear algebra routines scale linearly in n, a desideratum that inter alia rules out routines that inspect all of the n^2 entries of K. Algorithms that meet this desideratum include the Nystrom approximation (Williams and Seeger, 2000), sparse greedy approximations (Smola and Scholkopf, 2000) and incomplete Cholesky decomposition (Fine and Scheinberg, 2001; Bach and Jordan, 2002).

One unappealing aspect of the current state of the art is that the decomposition of the kernel matrix is performed independently of the learning task. Thus, in the classification setting, the decomposition of K is performed independently of the labels, and in the regression setting the decomposition is performed independently of the response variables. It seems unlikely that a single decomposition would be appropriate for all possible learning tasks, and unlikely that a decomposition that is independent of the predictions would be optimal for the particular task at hand. Similar issues arise in other areas of machine learning; for example, in classification problems, while principal component analysis can be used to reduce dimensionality in a label-independent manner, methods such as linear discriminant analysis that take the labels into account are generally viewed as preferable (Hastie et al., 2001). The point of view of the current paper is that there are likely to be advantages to being "discriminative" not only with respect to the parameters of a model, but with respect to the underlying matrix algorithms as well. Thus we pose the following two questions:

1. Can we exploit side information (labels, desired responses, etc.) in the computation of low-rank decompositions of kernel matrices?

2. Can we compute these decompositions with a computational complexity that is linear in n?

The current paper answers both of these questions in the affirmative.
Although some new ideas are needed, the end result is an algorithm closely related to incomplete Cholesky decomposition whose complexity is a constant factor times the complexity of standard incomplete Cholesky decomposition. As we will show empirically, the new algorithm yields decompositions of significantly smaller rank than those of the standard approach.

The paper is organized as follows. In Section 2, we review classical incomplete Cholesky decomposition with pivoting. In Section 3, we present our new predictive low-rank decomposition framework, and in Section 4 we present the details of the computations performed at each iteration, as well as the exact cost reduction of such steps. In Section 5, we show how the cost reduction can be efficiently approximated via a look-ahead method. Empirical results are presented in Section 6 and we present our conclusions in Section 7.

We use the following notation. For a rectangular matrix M, ||M||_F denotes the Frobenius norm, defined as ||M||_F = (tr MM^T)^{1/2}; ||M||_1 denotes the sum of the singular values of M, which is equal to the sum of the eigenvalues of M when M is square and symmetric, and in turn equal to tr M when the matrix is in addition positive semidefinite. We also let ||v|| denote the 2-norm of a vector v, equal to ||v|| = (v^T v)^{1/2}. Given two sequences of distinct indices I and J, K(I, J) denotes the submatrix of K composed of the rows indexed by I and the columns indexed by J. Note that the sequences I and J are not necessarily increasing sequences. The notation K(:, J) denotes the submatrix of the columns of K indexed by the elements of J, and similarly for K(I, :). Also, we refer to the sequence of integers from 1 to k as 1:k. Finally, we denote the concatenation of two sequences I and J as [I J]. We let Id denote the identity matrix and 1 denote the vector of all ones.

2. Incomplete Cholesky decomposition

In this section, we review incomplete Cholesky decomposition with pivoting, as used by Fine and Scheinberg (2001) and Bach and Jordan (2002).

2.1. Decomposition algorithm

Incomplete Cholesky decomposition is an iterative algorithm that yields an approximation K ≈ GG^T, where G is an n × m matrix and where the rank m is generally significantly smaller than n. The algorithm depends on a sequence of pivots i_1, i_2, ... Assuming temporarily that the pivots are known, and initializing a diagonal D to the diagonal of K, the k-th iteration of the algorithm is as follows:

  G(i_k, k) = D(i_k)^{1/2}
  G(J_k, k) = ( K(J_k, i_k) - sum_{j=1}^{k-1} G(J_k, j) G(i_k, j) ) / G(i_k, k)
  D(j) = D(j) - G(j, k)^2   for j not in {i_1, ..., i_k},

where I_k = (i_1, ..., i_k) and J_k denotes the sorted complement of I_k. The complexity of the k-th iteration is O(kn), and thus the total complexity after m steps is O(m^2 n). After the k-th iteration, G(I_k, 1:k) is a lower triangular matrix and the approximation of K is K_k = G_k G_k^T, where G_k is the matrix composed of the first k columns of G, i.e., G_k = G(:, 1:k). (In this paper, the matrices G_k will always have full rank, i.e., the rank of K_k will always be the number of columns of G_k.) We let D_k denote the diagonal D after the k-th iteration.

2.2. Pivot selection and early stopping

The algorithm operates by greedily choosing the column such that the approximation of K obtained by adding that column is best. In order to select the next pivot i_k, we thus have to rank the gains in approximation error for all remaining columns. Since all approximating matrices K_k are such that K_k ≼ K, the 1-norm ||K - K_k||_1, which is defined as the sum of the singular values, is equal to tr(K - K_k). To compute the exact gain of approximation after adding column k is an O(kn) operation. If this were to be done for all remaining columns at each iteration, we would obtain a prohibitive total complexity of O(m^2 n^2). The algorithm avoids this cost by using a lower bound on the gain in approximation. Note in particular that at every step we have tr K_k = sum_{q=1}^{k} ||G(:, q)||^2; thus the gain of adding the k-th column is ||G(:, k)||^2, which is lower bounded by G(i_k, k)^2. Even before the k-th iteration has begun, we know the final value of G(i_k, k)^2 if i_k were chosen, since this is exactly D_{k-1}(i_k). We thus choose the pivot that maximizes the lower bound D_{k-1}(i) among the remaining indices i. This strategy also provides a principled early stopping criterion: if no pivot value is larger than a given precision, the algorithm stops.

2.3. Low-rank approximation and partitioned matrices

Incomplete Cholesky decomposition yields a decomposition in which the column space of K_k is spanned by a subset of the columns of K. As the following proposition shows, under additional constraints the subset of columns actually determines the approximation:

Proposition 1  Let K be an n × n symmetric positive semidefinite matrix. Let I be a sequence of distinct elements of {1, ..., n} and J its ordered complement in {1, ..., n}. There is a unique matrix L such that: (i) L is symmetric, (ii) the column space of L is spanned by K(:, I), and (iii) L(:, I) = K(:, I). This matrix is such that

  L([I J], [I J]) = [ K(I, I)   K(I, J)
                      K(J, I)   K(J, I) K(I, I)^† K(I, J) ].

In addition, the matrices L and K - L are positive semidefinite.
Proof  If L satisfies the three conditions, then (i) and (iii) imply that L(I, J) = L(J, I)^T = K(J, I)^T = K(I, J). Since the column space of L is spanned by K(:, I), we must have L(:, J) = K(:, I) B, where B is a |I| × |J| matrix. By projecting onto the columns in I, we get L(I, J) = K(I, I) B, which implies that L(J, J) = K(J, I) K(I, I)^† K(I, J), where K(I, I)^† denotes the pseudo-inverse of K(I, I) (Golub and Van Loan, 1996).

Note that the approximation error for the block (J, J) is equal to the Schur complement K(J, J) - K(J, I) K(I, I)^† K(I, J) of K(I, I). The incomplete Cholesky decomposition with pivoting builds a set I = {i_1, ..., i_m} iteratively and approximates the matrix K by the matrix L given in the previous proposition for the given I. To obtain G, a square root of K(I, I) has to be computed that is easy to invert. The Cholesky decomposition provides such a square root, which is built efficiently as I grows.

3. Predictive low-rank decomposition

We now assume that the kernel matrix K is associated with side information in the form of an n × d matrix Y. Supervised learning problems provide standard examples of problems in which such side information is present. For example, in the (multi-way) classification setting, d is the number of classes and each row of Y has elements Y_{ij} equal to one if the corresponding data point belongs to class j, and zero otherwise. In the (multiple) regression setting, d is the number of response variables. In all of these cases, our objective is to find an approximation of K which (1) leads to good predictive performance and (2) has small rank.

3.1. Prediction with kernels

In this section, we review the classical theory of reproducing kernel Hilbert spaces (RKHS) which is necessary to justify the error term that we use to characterize how well an approximation of K is able to predict Y.

Let x_i ∈ X be an input data point and let y_i ∈ R^d denote the associated label or response variable, for i = 1, ..., n. Let F be an RKHS on X, with kernel k(., .). Given a loss function ℓ: X × R^d × R^d → R, the empirical risk is defined as R(f) = (1/n) sum_{i=1}^{n} ℓ(x_i, y_i, f(x_i)) for functions f = (f_1, ..., f_d) with each component f_j in F.
By a simple multivariate extension of the representer theorem (Scholkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004), minimizing the empirical risk subject to a constraint on the RKHS norms of the components f_j leads to a solution of the form f_j(x) = sum_{i=1}^{n} α_{ij} k(x, x_i), where α is an n × d matrix of coefficients.
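To make the representer theorem concrete, the following sketch (our illustration; the Gaussian kernel, the ridge regularizer τ, and all names are our choices for this example) fits such a finite kernel expansion under a quadratic loss by solving the regularized system (K + nτ Id)α = Y:

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Gaussian RBF kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_kernel_ridge(X, Y, tau=1e-2, sigma=1.0):
    # Representer theorem: the minimizer is f_j(x) = sum_i alpha[i, j] k(x, x_i);
    # for the quadratic loss, alpha solves (K + n tau Id) alpha = Y.
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + n * tau * np.eye(n), Y)

def predict(X_train, alpha, X_test, sigma=1.0):
    # Evaluate the kernel expansion at new points
    return rbf_kernel(X_test, X_train, sigma) @ alpha
```

With a small τ the expansion nearly interpolates the training responses; larger τ trades fit for a smaller RKHS norm.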


In this paper, we build our kernel approximations by considering the quadratic loss ℓ(x, y, f) = (1/2)||y - f||^2. The empirical risk is then equal to (1/2n)||Y - Kα||_F^2, where α ∈ R^{n×d}. When K is approximated by GG^T, for an n × m matrix G, the optimal risk is equal to:

  min_α (1/2n) ||Y - GG^T α||_F^2 = min_β (1/2n) ||Y - Gβ||_F^2.   (1)

3.2. Global objective function

The global criterion that we consider is a linear combination of the approximation error of K and the loss as defined in Eq. (1), i.e.:

  J(G, β) = λ_K ||K - GG^T||_1 + λ_Y ||Y - Gβ||_F^2.

For convenience we use the following normalized values of λ_K and λ_Y (which normalize the corresponding terms in the objective by their values for G = 0): λ_K = λ / tr K and λ_Y = (1 - λ) / tr(Y^T Y). The parameter λ ∈ [0, 1] thus calibrates the tradeoff between approximation of K and prediction of Y.

The matrix β can be minimized out to obtain the following criterion:

  J(G) = λ_K ||K - GG^T||_1 + λ_Y tr( Y^T Y - Y^T G (G^T G)^{-1} G^T Y ).

Finally, if we incorporate the constraint GG^T ≼ K, we obtain the final form of the criterion:

  J(G) = λ tr(K - GG^T) / tr K + (1 - λ) tr( Y^T Y - Y^T G (G^T G)^{-1} G^T Y ) / tr(Y^T Y).   (2)

4. Cholesky with side information (CSI)

Our algorithm builds on incomplete Cholesky decomposition, restricting the matrices G that it considers to those which are obtained as incomplete Cholesky factors of K. In order to select the pivots, we need to compute the gain in the cost function in Eq. (2) for each pivot at each iteration. Let us denote the two terms in the cost function as λ J_1(I_k) and (1 - λ) J_2(I_k). The first term has been studied in Section 2, where we found that

  J_1(I_k) = ( tr K - sum_{q=1}^{k} ||G(:, q)||^2 ) / tr K.   (3)

In order to compute the second term,

  J_2(I_k) = tr( Y^T Y - Y^T G (G^T G)^{-1} G^T Y ) / tr(Y^T Y),   (4)

efficiently, we need a way of computing the matrix G (G^T G)^{-1} G^T which is amenable to cheap updating as k increases. This can be achieved by QR decomposition.

4.1. QR decomposition

Given a rectangular matrix A ∈ R^{n×m} such that m ≤ n, the QR decomposition of A is of the form A = QR, where Q is an n × m matrix with orthonormal columns, i.e., Q^T Q = Id, and R is an m × m upper triangular matrix.
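As an illustration (ours, in NumPy), the following computes such a thin QR factorization with modified Gram-Schmidt, a numerically safer variant of the classical Gram-Schmidt iteration reviewed below, and can be checked directly against the definition:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Thin QR via modified Gram-Schmidt: A = Q @ R with Q^T Q = Id and
    R upper triangular. Step k costs O(k n), so the total is O(m^2 n)."""
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        v = A[:, k].copy()
        for j in range(k):
            # project the *running residual* on previous columns (modified GS)
            R[j, k] = Q[:, j] @ v
            v -= R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R
```

This sketch assumes A has full column rank; a production version would stop, as in the text, when a diagonal entry R(k, k) vanishes.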
The matrix Q provides an orthonormal basis of the column space of A; if A has full rank m, then Q has m columns, while if not, the number of columns of Q is equal to the rank of A. The QR decomposition can be seen as the Gram-Schmidt orthonormalization of the column vectors of A (Golub and Van Loan, 1996); moreover, R^T is the Cholesky factor of the matrix A^T A.

A simple iterative algorithm to compute the QR decomposition of A follows the Gram-Schmidt orthonormalization procedure. The first columns of Q and R are defined as Q(:, 1) = A(:, 1) / ||A(:, 1)|| and R(1, 1) = ||A(:, 1)||. The k-th iteration, k ≥ 2, is the following:

  R(j, k) = Q(:, j)^T A(:, k),   j = 1, ..., k-1
  R(k, k) = || A(:, k) - sum_{i=1}^{k-1} R(i, k) Q(:, i) ||
  Q(:, k) = ( A(:, k) - sum_{i=1}^{k-1} R(i, k) Q(:, i) ) / R(k, k).

The algorithm stops whenever k reaches m or R(k, k) vanishes. The complexity of each iteration is equal to O(kn) and thus the total complexity up to the m-th step is O(m^2 n).

4.2. Parallel Cholesky and QR decompositions

While building the Cholesky decomposition iteratively as described in Section 2.1, we update the QR decomposition of the factor G at each step. The complexity of each iteration is O(kn) and thus, if the algorithm stops after m steps, the total complexity is O(m^2 n). We still need to describe the pivot selection strategy; as for the Cholesky decomposition we use a greedy strategy, i.e., we choose the pivot that most reduces the cost. In the following sections, we show how this choice can be performed efficiently.

4.3. Cost reduction

We use the following notation: R_k = R(1:k, 1:k), G_k = G(:, 1:k), Q_k = Q(:, 1:k), g_k = G(:, k) and q_k = Q(:, k). After the k-th iteration the cost function is equal to

  λ ( tr K - sum_{q=1}^{k} ||g_q||^2 ) / tr K + (1 - λ) tr( Y^T Y - Y^T Q_k Q_k^T Y ) / tr(Y^T Y),

and the cost reduction at the k-th step is thus equal to

  Δ_k = λ Δ¹_k / tr K + (1 - λ) Δ²_k / tr(Y^T Y),   (5)

where

  Δ¹_k = ||g_k||^2,   (6)
  Δ²_k = ||Y^T q_k||^2,  with  q_k = (Id - Q_{k-1} Q_{k-1}^T) g_k / ||(Id - Q_{k-1} Q_{k-1}^T) g_k||.   (7)

Following Section 2.1, we can express g_k in terms of the pivot i_k and the approximation K_{k-1} after the (k-1)-th iteration, i.e.,

  g_k = (K - K_{k-1})(:, i_k) / ( (K - K_{k-1})(i_k, i_k) )^{1/2}.   (8)

Computing this reduction before the k-th iteration for all n - k + 1 available pivots is a prohibitive O(kn^2) operation. As in the case of Cholesky decomposition, a lower bound on the reduction could be computed to avoid this costly operation. However, we have developed a different strategy, one based on a look-ahead algorithm that gives cheap additional information on the kernel matrix. This strategy is presented in the next section.

5. Look-ahead decompositions

At every step of the algorithm, we not only perform one step of Cholesky and QR, but we also perform several "look-ahead steps" to gather more information about the kernel matrix K. Throughout the procedure we maintain the following information: (1) decomposition matrices G_{k-1}, Q_{k-1} and R_{k-1}, obtained from the sequence of indices I_{k-1} = (i_1, ..., i_{k-1}); (2) additional decomposition matrices G_adv, Q_adv and R_adv, with k-1+δ columns, obtained by δ additional runs of Cholesky and QR decomposition. The first k-1 columns of G_adv and Q_adv are the matrices G_{k-1} and Q_{k-1}, and the δ additional columns that are added are indexed by H_{k-1} = (h_1, ..., h_δ). We now describe how this information is updated, and how it is used to approximate the cost reduction. A high-level description of the overall algorithm is given in Figure 1.

----------------------------------------------------------------------
Input: kernel matrix K, target matrix Y, maximum rank m, tolerance ε, tradeoff parameter λ ∈ [0, 1], number of look-ahead steps δ.

Algorithm:
1. Perform δ look-ahead steps of Cholesky (Section 2.1) and QR decomposition (Section 4.1), selecting pivots according to Section 2.2.
2. Initialization: η = 2ε, k = 1.
3. While η > ε and k ≤ m:
   a. Compute estimated gains for the remaining pivots (Section 5.1), and select the best pivot,
   b. If the new pivot is not in the set of look-ahead pivots, perform a Cholesky and a QR step; otherwise perform the steps with a pivot selected according to Section 2.2,
   c. Permute indices in the Cholesky and QR decompositions to put the new pivot in position k, using the method in Appendix A,
   d. Compute the exact gain η = Δ_k; let k ← k + 1.

Output: G and its QR decomposition.

Figure 1. High-level description of the CSI algorithm.
----------------------------------------------------------------------

5.1. Approximation of the cost reduction

After the (k-1)-th iteration, we have the following approximations: K_{k-1} = G_{k-1} G_{k-1}^T and K_adv = G_adv G_adv^T. In order to approximate the cost reduction defined by Eqs. (5), (6), (7) and (8), we replace all currently unknown portions of the kernel matrix (i.e., the columns whose indices are not in [I_{k-1} H_{k-1}]) by the corresponding elements of K_adv. This is equivalent to replacing g_k in Eq. (8) by

  g̃_k(i) = (K_adv - K_{k-1})(:, i) / ( (K - K_{k-1})(i, i) )^{1/2}.

In order to approximate Δ¹_k, we also make sure that K(i, i) is not approximated, so that our error term reduces to the lower bound of the incomplete Cholesky decomposition when δ = 0 (i.e., no look-ahead performed); this is obtained through a corrective term in the following equations. We obtain the approximations

  Δ̂¹_k(i) = ||g̃_k(i)||^2 - g̃_k(i)_i^2 + D_{k-1}(i),   (9)
  Δ̂²_k(i) = ||Y^T q̃_k(i)||^2,   (10)

where q̃_k(i) = (Id - Q_{k-1} Q_{k-1}^T) g̃_k(i) / ||(Id - Q_{k-1} Q_{k-1}^T) g̃_k(i)||, and where the corrective term replaces the i-th element g̃_k(i)_i by its exact value D_{k-1}(i)^{1/2}. Note that when the index i belongs to the set of indices that were considered in advance, the approximation is exact. A naive computation of the approximation would lead to a prohibitive quadratic complexity in n. We now present a way of updating the quantities defined above, as well as a way of updating the look-ahead Cholesky and QR steps, at a cost of O(δn + dn) per iteration.

5.2. Efficient implementation

Updating the look-ahead decompositions.  After the pivot i_k has been chosen, if it was not already included in the set of indices treated in advance, we perform the additional step of Cholesky and QR decomposition with that pivot. If it was already chosen, we select a new look-ahead pivot using the usual Cholesky lower bound defined in Section 2. Let G_bef, Q_bef and R_bef be those decompositions with k + δ columns. In both cases, we obtain a Cholesky decomposition whose k-th pivot is not i_k in general, since i_k may not be among the first look-ahead pivots from the previous iteration. In general, i_k is less than δ indices away from the k-th position. In order to compute G_adv, Q_adv and R_adv, we need to update the Cholesky and QR decompositions to advance pivot i_k to the k-th position. In Appendix A, we show how this can be done with worst-case time complexity O(δn), which is faster than naively redoing δ steps of Cholesky decomposition in O(kδn).

Updating approximation costs.  In order to derive update equations for Δ̂¹_k(i), Δ̂²_k(i) and g̃_k(i), the crucial point is to notice that each column of G_adv differs from the corresponding column of G_bef only through the pivot permutation, i.e., through a combination with the column corresponding to the new pivot. This makes most of the terms in the expansions of Δ̂¹_k(i), Δ̂²_k(i) and g̃_k(i) identical to terms computed at the previous iteration. The total complexity of updating Δ̂¹_k(i) and Δ̂²_k(i) for all i is then O(dn + δn).

5.3. Computational complexity

The total complexity of the CSI algorithm after m steps is the sum of (a) m steps of Cholesky and QR decomposition, i.e., O((m + δ)^2 n); (b) updating the look-ahead decompositions by permuting indices as presented in Appendix A, i.e., O(δmn); and (c) updating the approximation costs, i.e., O(mdn + mδn). The total complexity is thus O((m + δ)^2 n + mdn). In the usual case in which max(d, δ) ≤ m, this yields a total complexity equal to O(m^2 n), which is the same complexity as computing m steps of Cholesky and QR decomposition. For large kernel matrices, the Cholesky and QR decompositions remain the most costly computations, and thus the CSI algorithm is only a few times slower than the standard incomplete Cholesky decomposition.

We see that the CSI algorithm has the same favorable linear complexity in the number of data points n as standard Cholesky decomposition.
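(As a small aside, on problems where K fits in memory, the objective of Eq. (2) can be evaluated directly to sanity-check a candidate factor G. The sketch below and its names are ours; it costs O(n^2 m), so it is only meant for small problems.)

```python
import numpy as np

def csi_objective(K, Y, G, lam):
    """Eq. (2): lam * tr(K - G G^T) / tr K
       + (1 - lam) * tr(Y^T Y - Y^T G (G^T G)^{-1} G^T Y) / tr(Y^T Y)."""
    approx = np.trace(K - G @ G.T) / np.trace(K)
    # projection of Y on the column space of G, via a least-squares solve
    P_Y = G @ np.linalg.lstsq(G, Y, rcond=None)[0]
    pred = (np.trace(Y.T @ Y) - np.trace(Y.T @ P_Y)) / np.trace(Y.T @ Y)
    return lam * approx + (1 - lam) * pred
```

Both terms are zero for a full-rank exact factor and grow as columns are removed, so the function can be used to compare pivot sequences of equal rank.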
In particular, we do not need to examine every entry of the kernel matrix in order to compute the CSI approximation. This is particularly important when the kernel is itself costly to compute, as in the case of string kernels or graph kernels (Shawe-Taylor and Cristianini, 2004).

5.4. Including an intercept

It is straightforward to include an intercept in the CSI algorithm. This is done by replacing Y with ΠY, where Π = (Id - (1/n) 1 1^T) is the centering projection matrix. The Cholesky decomposition is not changed, while the QR decomposition is now performed on ΠG instead of G. The rest of the algorithm is not changed.

6. Experiments

We have conducted a comparison of CSI and incomplete Cholesky decomposition for 37 UCI datasets, including both regression and (multi-way) classification problems. The kernel method that we used in these experiments is the least-squares SVM (Suykens and Vandewalle, 1999). (A Matlab/C implementation can be downloaded from http://cmm.ensmp.fr/~bach/.) The goal of the comparison was to investigate to what extent we can achieve a lower-rank decomposition with the CSI algorithm as compared to incomplete Cholesky, at equivalent levels of predictive performance.

6.1. Least-squares SVMs

The least-squares SVM (LS-SVM) algorithm is based on the minimization of the following cost function:

  (1/2n) ||ΠY - Kα||_F^2 + (τ/2) tr(α^T Kα),

where α ∈ R^{n×d} and Π = (Id - (1/n) 1 1^T). This is a classical penalized least-squares problem, whose estimating equations are obtained by setting the derivatives to zero:

  (K + nτ Id) Kα = K ΠY.

6.2. Least-squares SVM with incomplete Cholesky decomposition

We now approximate K by an incomplete Cholesky factorization obtained from m columns of K, i.e., K ≈ GG^T. Expressed in terms of G, the estimating equations for the LS-SVM become:

  (GG^T + nτ Id) GG^T α = GG^T ΠY.   (11)

The solutions of Eq. (11) are the vectors of the form

  α = G (G^T G)^{-1} (G^T G + nτ Id)^{-1} G^T ΠY + v,   (12)

where v is any vector orthogonal to the column space of G. Thus α is not uniquely defined; however, the quantity Kα is uniquely defined, and equal to Kα = G (G^T G + nτ Id)^{-1} G^T ΠY, and the predicted training responses are Kα + (1/n) 1 1^T Y.

In order to compute the responses for previously unseen data points z_j, for j = 1, ..., n_test, we consider the rectangular testing kernel matrix K_test ∈ R^{n_test × n}, defined as (K_test)_{ji} = k(x_i, z_j). We use the approximation of K_test based on the columns of K_test for which the corresponding columns of K were already selected in the Cholesky decomposition of K. If we let I denote those indices, the testing responses are then equal to K_test(:, I) G(I, :)^{-T} G^T α, which is uniquely defined (while α is not). This also has the effect of not requiring the computation of the entire testing kernel matrix K_test, a substantial gain for large datasets.

In order to compute the training and testing errors, we threshold the responses appropriately (by taking the sign for binary classification, or the closest basis vector for multi-class classification, where each class is mapped to a basis vector).

6.3. Experimental results on UCI datasets

We transformed all discrete variables to multivariate real random variables by mapping them to basis vectors; we also scaled each variable to unit variance. We performed 10 random "75/25" splits of the data. We used a Gaussian-RBF kernel, k(x, y) = exp(-||x - y||^2 / (2σ^2)), with the kernel width σ and regularization parameter τ chosen so as to minimize error on the training split. The minimization was performed by grid search. We trained and tested several LS-SVMs with decompositions of increasing rank, comparing incomplete Cholesky decomposition to the CSI method presented in this paper. The hyperparameters for the CSI algorithm were set to λ = 0.99 and δ = 40.
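The estimating equations of Section 6.2 reduce to an m × m linear solve in terms of G. The following sketch is our illustration of that reduction (with centering as in Section 5.4); the synthetic usage and all names are ours, not the paper's experimental code:

```python
import numpy as np

def lssvm_lowrank_fit(G, Y, tau):
    """Given K ~= G @ G.T with G of size n x m, return the uniquely defined
    training responses K alpha + intercept using only an m x m solve."""
    n, m = G.shape
    Yc = Y - Y.mean(axis=0)                  # centering, i.e. Pi @ Y
    # K alpha = G (G^T G + n tau Id)^{-1} G^T Pi Y
    beta = np.linalg.solve(G.T @ G + n * tau * np.eye(m), G.T @ Yc)
    return G @ beta + Y.mean(axis=0)         # add back the intercept
```

For example, on two well-separated Gaussian blobs with ±1 labels, thresholding these responses by their sign recovers the training labels.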
The value of δ was chosen to be large enough so that in most cases the final rank was the same as if the entire kernel matrix were used, and small enough so that the complexity of the look-ahead was small compared to the rest of the Cholesky decomposition. For both algorithms, the stopping criterion (the minimal gain at each iteration) was set to the same small threshold (a negative power of 10). We imposed no upper bound on the ranks of the decompositions. We report the minimal rank for which the cross-validation error is within a standard deviation of the average testing error obtained when no low-rank decomposition is used. As shown in Figure 2, the CSI algorithm generally yields a decomposition of significantly smaller rank than incomplete Cholesky decomposition; indeed, the difference in minimal ranks achieved by the two algorithms can be dramatic.

7. Conclusions

A major theme of machine learning research is the advantages that accrue to "discriminative" methods, that is, methods that adjust all of the parameters of a model to minimize a task-specific loss function. In this paper we have extended this point of view to the matrix algorithms that underlie kernel-based learning methods. With the incomplete Cholesky decomposition as a starting point, we have developed a new low-rank decomposition algorithm for positive semidefinite matrices that can exploit side information (e.g., classification labels). We have shown that this algorithm yields decompositions of significantly lower rank than those obtained with current methods (which ignore the side information). Given that the computational requirements of the new algorithm are comparable to those of standard incomplete Cholesky decomposition, we feel that the new algorithm can and should replace incomplete Cholesky in a variety of applications.
There are several natural extensions of the research reported here that are worth pursuing, most notably the extension of these results to situations in which two or more related kernel matrices have to be approximated conjointly, such as in kernel canonical correlation analysis (Bach and Jordan, 2002) or multiple kernel learning (Lanckriet et al., 2004).

Appendix A. Efficient pivot permutation

In this appendix we describe an efficient algorithm to advance the pivot with index q to position p < q in an incomplete Cholesky and QR decomposition. This can be achieved by q - p transpositions between successive pivots. Permuting two successive pivots (p, p+1) can be done in O(n) as follows (we let P denote the pair (p, p+1)):

1. Permute rows p and p+1 of G and of R.
2. Perform a QR decomposition of the 2 × 2 block R(P, P).
3. Update the columns G(:, P) and Q(:, P) accordingly (each update is a multiplication by a 2 × 2 orthogonal factor).
4. Perform a second QR decomposition of R(P, P) to restore the triangular structure.
5. Update R(P, :) and Q(:, P) accordingly.

The total complexity of permuting indices p and q is thus O((q - p) n). Note that all columns of G and Q between p and q are changed, but that the updates involve only shuffles between successive columns of G and Q.

  dataset         d    c    n     Chol / CSI
  ringnorm        20   2    1000   14
  kin-32fh-c      32   2    2000   25
  pumadyn-32nm    32   -    4000   93 / 23
  pumadyn-32fh    32   -    4000   30
  kin-32fh        32   -    4000   34 / 10
  cmc             12   3    1473   10
  bank-32fh       32   -    4000   221 / 72
  page-blocks     8    2    5473   451 / 155
  spambase        49   2    4000   90 / 31
  isolet          617  8    1798   254 / 89
  twonorm         20   2    4000
  dermatology     34   2    358    32 / 14
  comp-activ      21   -    4000   159 / 73
  abalone         10   -    4000   27 / 13
  yeast           7    3    673
  titanic         8    2    2201
  kin-32nm-c      32   2    4000   122 / 68
  pendigits       16   4    4485   111 / 63
  adult           3    2    4000
  ionosphere      33   2    351    76 / 45
  liver           6    2    345    15
  pi-diabetes     8    2    768    10
  segmentation    15   3    660
  waveform        21   3    2000
  splice          240  3    3175   487 / 305
  census-16h      16   -    1000   42 / 28
  kin-32nm        32   -    2000   307 / 211
  add10           10   -    2000   280 / 204
  mushroom        116  2    4000   60 / 44
  bank-32-nm      32   -    4000   413 / 328
  kin-32nm        32   -    4000   586 / 479
  vehicle         18   2    416    31 / 27
  breast          9    2    683
  thyroid         7    4    1000
  satellite       36   3    2000
  vowel           10   4    360    70 / 73
  optdigits       58   6    2000   68 / 72
  boston          12   -    506    48 / 61

Figure 2. Simulation results on UCI datasets, where d is the number of features, c the number of classes ('-' for regression problems), and n the number of data points. For both classical incomplete Cholesky decomposition (Chol) and Cholesky decomposition with side information (CSI), we report the minimal rank for which the prediction performance with a decomposition of that rank is within one standard deviation of the performance with a full-rank kernel matrix. Datasets are sorted by the ratio between the Chol and CSI ranks.

Acknowledgements

We wish to acknowledge support from a grant from Intel Corporation, and a graduate fellowship to Francis Bach from Microsoft Research. We also wish to acknowledge Grant 0412995 from the National Science Foundation.

References

F. R. Bach and M. I. Jordan. Kernel independent component analysis.
J. Mach. Learn. Res., 3:1-48, 2002.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243-264, 2001.

G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004.

B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

A. J. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.

J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.

C. K. I. Williams and M. Seeger. Effect of the input density distribution on kernel-based classifiers. In Proc. ICML, 2000.
