
Predictive low-rank decomposition for kernel methods

Francis R. Bach (francis.bach@mines.org)
Centre de Morphologie Mathématique, École des Mines de Paris, 35 rue Saint-Honoré, 77300 Fontainebleau, France

Michael I. Jordan (jordan@cs.berkeley.edu)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).

Abstract

Low-rank matrix decompositions are essential tools in the application of kernel methods to large-scale learning problems. These decompositions have generally been treated as black boxes: the decomposition of the kernel matrix that they deliver is independent of the specific learning task at hand, and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices. Our algorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition: it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.

1. Introduction

Kernel methods provide a unifying framework for the design and analysis of machine learning algorithms (Schölkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004). A key step in any kernel method is the reduction of the data to a kernel matrix, also known as a Gram matrix. Given the kernel matrix, generic matrix-based algorithms are available for solving learning problems such as classification, prediction, anomaly detection, clustering and dimensionality reduction. There are two principal advantages to this division of labor: (1) any reduction that yields a positive semidefinite kernel matrix is allowed, a fact that opens the door to specialized transformations that exploit domain-specific knowledge; and (2) expressed in terms of the kernel matrix, learning problems often take the form of convex optimization problems, and powerful algorithmic techniques from the convex optimization literature can be brought to bear in their solution.

An apparent drawback of kernel methods is the naive computational complexity associated with manipulating kernel matrices. Given a set of $n$ data points, the kernel matrix $K$ is of size $n \times n$. This suggests a computational complexity of at least $O(n^2)$; in fact most kernel methods have at their core operations such as matrix inversion or eigenvalue decomposition which scale as $O(n^3)$. Moreover, some kernel algorithms make use of sophisticated tools such as semidefinite programming and have even higher-order polynomial complexities (Lanckriet et al., 2004).

These generic worst-case complexities can often be skirted, and this fact is one of the major reasons for the practical success of kernel methods. The underlying issue is that the kernel matrices often have a spectrum that decays rapidly and are thus of small numerical rank (Williams and Seeger, 2000). Standard algorithms from numerical linear algebra can thus be exploited to compute an approximation of the form $L = GG^\top \approx K$, where $G \in \mathbb{R}^{n \times m}$, and where the rank $m$ is generally significantly smaller than $n$. Moreover, it is often possible to reformulate kernel-based learning algorithms to make use of $G$ instead of $K$. The resulting computational complexity generally scales as $O(m^3 + m^2 n)$. This linear scaling in $n$ makes kernel-based methods viable for large-scale problems.

To achieve this desirable result, it is of course necessary that the underlying numerical linear algebra routines scale linearly in $n$, a desideratum that inter alia rules out routines that inspect all of the entries of $K$. Algorithms that meet this desideratum include the Nyström approximation (Williams and Seeger, 2000), sparse greedy approximations (Smola and Schölkopf, 2000) and incomplete Cholesky decomposition (Fine and Scheinberg, 2001, Bach and Jordan, 2002).
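To make the gain from such a reformulation concrete, the following sketch (Python with NumPy; a hypothetical illustration, not code from the paper) fits a regularized least-squares model using only the factor $G$, at cost $O(m^2 n + m^3)$ instead of the $O(n^3)$ required when working with $K$ directly. The function name `lowrank_ridge` and the ridge parameter `lam` are ours.

```python
import numpy as np

def lowrank_ridge(G, Y, lam):
    """Solve min_beta ||Y - G beta||_F^2 + lam ||beta||_F^2
    using only the low-rank factor G (n x m), in O(m^2 n + m^3)."""
    n, m = G.shape
    A = G.T @ G + lam * np.eye(m)     # m x m normal equations
    beta = np.linalg.solve(A, G.T @ Y)
    return beta                       # predictions are G @ beta

# Example: G stands for any rank-m factor with K ~ G G^T.
rng = np.random.default_rng(0)
G = rng.standard_normal((1000, 20))
Y = rng.standard_normal((1000, 1))
beta = lowrank_ridge(G, Y, lam=1e-2)
print(np.linalg.norm(Y - G @ beta))
```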
One unappealing aspect of the current state-of-the-art is that the decomposition of the kernel matrix is performed independently of the learning task. Thus, in the classification setting, the decomposition of $K$ is performed independently of the labels, and in the regression setting the decomposition is performed independently of the response variables. It seems unlikely that a single decomposition would be appropriate for all possible learning tasks, and unlikely that a decomposition that is independent of the predictions should be optimal for the particular task at hand. Similar issues arise in other areas of machine learning; for example, in classification problems, while principal component analysis can be used to reduce dimensionality in a label-independent manner, methods such as linear discriminant analysis that take the labels into account are generally viewed as preferable (Hastie et al., 2001). The point of view of the current paper is that there are likely to be advantages to being "discriminative" not only with respect to the parameters of a model, but with respect to the underlying matrix algorithms as well. Thus we pose the following two questions:

1. Can we exploit side information (labels, desired responses, etc.) in the computation of low-rank decompositions of kernel matrices?

2. Can we compute these decompositions with a computational complexity that is linear in $n$?

The current paper answers both of these questions in the affirmative. Although some new ideas are needed, the end result is an algorithm closely related to incomplete Cholesky decomposition whose complexity is a constant factor times the complexity of standard incomplete Cholesky decomposition. As we will show empirically, the new algorithm yields decompositions of significantly smaller rank than those of the standard approach.

The paper is organized as follows. In Section 2, we review classical incomplete Cholesky decomposition with pivoting. In Section 3, we present our new predictive low-rank decomposition framework, and in Section 4 we present the details of the computations performed at each iteration, as well as the exact cost reduction of such steps. In Section 5, we show how the cost reduction can be efficiently approximated via a look-ahead method. Empirical results are presented in Section 6 and we present our conclusions in Section 7.

We use the following notations: for a rectangular matrix $M$, $\|M\|_F$ denotes the Frobenius norm, defined as $\|M\|_F = (\operatorname{tr} MM^\top)^{1/2}$; $\|M\|_1$ denotes the sum of the singular values of $M$, which is equal to the sum of the eigenvalues of $M$ when $M$ is square and symmetric, and in turn equal to $\operatorname{tr} M$ when the matrix is in addition positive semidefinite. We also let $\|x\|_2$ denote the 2-norm of a vector $x$, equal to $\|x\|_2 = (x^\top x)^{1/2} = \|x\|_F$. Given two sequences of distinct indices $I$ and $J$, $M(I,J)$ denotes the submatrix of $M$ composed of rows indexed by $I$ and columns indexed by $J$. Note that the sequences $I$ and $J$ are not necessarily increasing sequences. The notation $M(:,J)$ denotes the submatrix of the columns of $M$ indexed by the elements of $J$, and similarly for $M(I,:)$. Also, we refer to the sequence of integers from $p$ to $q$ as $p{:}q$. Finally, we denote the concatenation of two sequences $I$ and $J$ as $[I\,J]$. We let $\mathrm{Id}_q \in \mathbb{R}^{q \times q}$ denote the identity matrix and $1_q \in \mathbb{R}^q$ denote the vector of all ones.

2. Incomplete Cholesky decomposition

In this section, we review incomplete Cholesky decomposition with pivoting, as used by Fine and Scheinberg (2001) and Bach and Jordan (2002).

2.1. Decomposition algorithm

Incomplete Cholesky decomposition is an iterative algorithm that yields an approximation $L = GG^\top \approx K$, where $G \in \mathbb{R}^{n \times m}$, and where the rank $m$ is generally significantly smaller than $n$.¹ The algorithm depends on a sequence of pivots. Assuming temporarily that the pivots $i_1, i_2, \ldots$ are known, and initializing a diagonal matrix $D$ to the diagonal of $K$, the $k$-th iteration of the algorithm is as follows:

$$G(i_k, k) = D(i_k)^{1/2}$$
$$G(J_k, k) = \frac{1}{G(i_k,k)} \Big( K(J_k, i_k) - \sum_{j=1}^{k-1} G(J_k, j)\, G(i_k, j) \Big)$$
$$D(j) = D(j) - G(j,k)^2, \quad j \notin \{i_1, \ldots, i_k\},$$

where $I_k = (i_1, \ldots, i_k)$ and $J_k$ denotes the sorted complement of $I_k$. The complexity of the $k$-th iteration is $O(kn)$, and thus the total complexity after $m$ steps is $O(m^2 n)$. After the $k$-th iteration, $G(I_k, 1{:}k)$ is a lower triangular matrix and the approximation of $K$ is $L_k = G_k G_k^\top$, where $G_k$ is the matrix composed of the first $k$ columns of $G$, i.e., $G_k = G(:, 1{:}k)$. We let $D_k$ denote the diagonal matrix $D$ after the $k$-th iteration.

¹ In this paper, the matrices $G$ will always have full rank, i.e., the rank will always be the number of columns.
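A minimal NumPy transcription of this iteration (a hypothetical re-implementation for exposition, not the paper's released Matlab/C code), using the greedy pivot rule and early stopping described in Section 2.2 below:

```python
import numpy as np

def incomplete_cholesky(K, m, eps=1e-4):
    """Pivoted incomplete Cholesky: returns G (n x k) with G G^T ~ K
    and the pivot list I, stopping at rank m or precision eps."""
    n = K.shape[0]
    G = np.zeros((n, m))
    D = np.diag(K).astype(float).copy()
    I = []
    for k in range(m):
        i = int(np.argmax(D))          # greedy pivot: largest D(i)
        if D[i] < eps:                 # early stopping criterion
            return G[:, :k], I
        I.append(i)
        G[:, k] = (K[:, i] - G[:, :k] @ G[i, :k]) / np.sqrt(D[i])
        D -= G[:, k] ** 2
        D[I] = 0.0                     # exclude already-chosen pivots
    return G, I
```

Note that only the $m$ pivot columns $K(:, i_k)$ are ever read, which is what keeps the overall cost $O(m^2 n)$.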
2.2. Pivot selection and early stopping

The algorithm operates by greedily choosing the column such that the approximation of $K$ obtained by adding that column is best. In order to select the next pivot $i_k$, we thus have to rank the gains in approximation error for all remaining columns. Since all approximating matrices are such that $L_k = G_k G_k^\top \preccurlyeq K$, the 1-norm $\|K - L_k\|_1$, which is defined as the sum of the singular values, is equal to $\|K - L_k\|_1 = \operatorname{tr}(K - L_k)$.

To compute the exact gain of approximation after adding column $i_k$ is an $O(kn)$ operation. If this were to be done for all remaining columns at each iteration, we would obtain a prohibitive total complexity of $O(m^2 n^2)$. The algorithm avoids this cost by using a lower bound on the gain in approximation. Note in particular that at every step we have $\operatorname{tr} L_k = \sum_{q=1}^{k} \|G(:,q)\|_2^2$; thus the gain of adding the $k$-th column is $\|G(:,k)\|_2^2$, which is lower bounded by $G(i_k,k)^2$. Even before the $k$-th iteration has begun, we know the final value of $G(i_k,k)^2$ if $i_k$ were chosen, since this is exactly $D_{k-1}(i_k)$. We thus choose the pivot $i_k$ that maximizes the lower bound $D_{k-1}(i_k)$ among the remaining indices. This strategy also provides a principled early stopping criterion: if no pivot is larger than a given precision $\varepsilon$, the algorithm stops.

2.3. Low-rank approximation and partitioned matrices

Incomplete Cholesky decomposition yields a decomposition in which the column space of $L$ is spanned by a subset of the columns of $K$. As the following proposition shows, under additional constraints the subset of columns actually determines the approximation:

Proposition 1. Let $K$ be an $n \times n$ symmetric positive semidefinite matrix. Let $I$ be a sequence of distinct elements of $\{1,\ldots,n\}$ and $J$ its ordered complement in $\{1,\ldots,n\}$. There is a unique matrix $L$ of size $n$ such that: (i) $L$ is symmetric, (ii) the column space of $L$ is spanned by $K(:,I)$, (iii) $L(:,I) = K(:,I)$. This matrix $L$ is such that

$$L([I\,J],[I\,J]) = \begin{pmatrix} K(I,I) & K(J,I)^\top \\ K(J,I) & K(J,I)\, K(I,I)^\dagger\, K(J,I)^\top \end{pmatrix}.$$

In addition, the matrices $L$ and $K - L$ are positive semidefinite.

Proof. If $L$ satisfies the three conditions, then (i) and (iii) imply that $L(I,J) = L(J,I)^\top = K(J,I)^\top = K(I,J)$. Since the column space of $L$ is spanned by $K(:,I)$, we must have $L(:,J) = K(:,I)E$, where $E$ is a $|I| \times |J|$ matrix. By projecting onto the columns in $I$, we get $K(I,J) = K(I,I)E$, which implies that $L(J,J) = K(J,I)\,K(I,I)^\dagger\,K(I,J)$, where $M^\dagger$ denotes the pseudo-inverse of $M$ (Golub and Van Loan, 1996).

Note that the approximation error for the block $(J,J)$ is equal to the Schur complement $K(J,J) - K(J,I)K(I,I)^\dagger K(I,J)$ of $K(I,I)$. The incomplete Cholesky decomposition with pivoting builds a set $I = \{i_1,\ldots,i_m\}$ iteratively and approximates the matrix $K$ by the $L$ given in the previous proposition for the given $I$. To obtain $L$, a square root of $K(I,I)$ has to be computed that is easy to invert. The Cholesky decomposition provides such a square root, which is built efficiently as $I$ grows.
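Proposition 1 can be checked numerically. In this hypothetical NumPy sketch, $L$ is built from a pivot set $I$ via the equivalent pseudo-inverse formula $L = K(:,I)\,K(I,I)^\dagger\,K(I,:)$, and conditions (iii) and positive semidefiniteness of $K - L$ are verified:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
K = X @ X.T                           # a PSD kernel matrix
I = [0, 2, 5]                         # chosen pivot columns

# L = K(:,I) K(I,I)^+ K(I,:), the unique matrix of Proposition 1
L = K[:, I] @ np.linalg.pinv(K[np.ix_(I, I)]) @ K[I, :]

print(np.allclose(L[:, I], K[:, I]))                 # (iii) holds
print(np.min(np.linalg.eigvalsh(K - L)) >= -1e-10)   # K - L is PSD
```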
3. Predictive low-rank decomposition

We now assume that the $n \times n$ kernel matrix $K$ is associated with side information of the form $Y \in \mathbb{R}^{n \times d}$. Supervised learning problems provide standard examples of problems in which such side information is present. For example, in the (multi-way) classification setting, $d$ is the number of classes and each row $y$ of $Y$ has $d$ elements such that $y_i$ is equal to one if the corresponding data point belongs to class $i$, and zero otherwise. In the (multiple) regression setting, $d$ is the number of response variables. In all of these cases, our objective is to find an approximation of $K$ which (1) leads to good predictive performance and (2) has small rank.

3.1. Prediction with kernels

In this section, we review the classical theory of reproducing kernel Hilbert spaces (RKHS) which is necessary to justify the error term that we use to characterize how well an approximation of $K$ is able to predict $Y$.

Let $x_i \in \mathcal{X}$ be an input data point and let $y_i \in \mathbb{R}^d$ denote the associated label or response variable, for $i = 1,\ldots,n$. Let $\mathcal{H}$ be an RKHS on $\mathcal{X}$, with kernel $k(\cdot,\cdot)$. Given a loss function $\ell : \mathcal{X} \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$, the empirical risk is defined as $R(f) = \sum_{i=1}^n \ell(x_i, y_i, f(x_i))$ for functions $f$ in $\mathcal{H}^d$. By a simple multivariate extension of the representer theorem (Schölkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004), minimizing the empirical risk subject to a constraint on the RKHS norm of $f$ leads to a solution of the form $f(x) = \sum_{i=1}^n \alpha_i\, k(x, x_i)$, where $\alpha_i \in \mathbb{R}^d$.

In this paper, we build our kernel approximations by considering the quadratic loss $\ell(x, y, f) = \frac{1}{2}\|y - f\|_F^2$. The empirical risk is then equal to $\frac{1}{2}\sum_{i=1}^n \|y_i - (K\alpha)_i\|_F^2 = \frac{1}{2}\|Y - K\alpha\|_F^2$, where $\alpha \in \mathbb{R}^{n \times d}$. When $K$ is approximated by $GG^\top$, for $G$ an $n \times m$ matrix, the optimal risk is equal to

$$\min_{\alpha \in \mathbb{R}^{n \times d}} \tfrac{1}{2}\|Y - GG^\top\alpha\|_F^2 = \min_{\beta \in \mathbb{R}^{m \times d}} \tfrac{1}{2}\|Y - G\beta\|_F^2, \quad (1)$$

since $GG^\top\alpha$ ranges over the column space of $G$.

3.2. Global objective function

The global criterion that we consider is a linear combination of the approximation error of $K$ and the loss as defined in Eq. (1), i.e.,

$$J(G) = \kappa\, \|K - GG^\top\|_1 + \lambda \min_{\beta \in \mathbb{R}^{m \times d}} \|Y - G\beta\|_F^2.$$

For convenience we use the following normalized values of $\kappa$ and $\lambda$ (which correspond to values of the corresponding terms in the objective for $G = 0$): $\kappa = (1-\mu)/\operatorname{tr} K$ and $\lambda = \mu/\operatorname{tr} Y^\top Y$. The parameter $\mu$ thus calibrates the trade-off between approximation of $K$ and prediction of $Y$.

The matrix $\beta$ can be minimized out to obtain the following criterion:

$$J(G) = \kappa\, \|K - GG^\top\|_1 + \lambda \left( \operatorname{tr} Y^\top Y - \operatorname{tr} Y^\top G (G^\top G)^{-1} G^\top Y \right).$$

Finally, if we incorporate the constraint $GG^\top \preccurlyeq K$, we obtain the final form of the criterion:

$$J(G) = \kappa\, \operatorname{tr}(K - GG^\top) + \lambda\, \operatorname{tr}\!\left(Y^\top Y - Y^\top G (G^\top G)^{-1} G^\top Y\right). \quad (2)$$
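For reference, Eq. (2) can be evaluated directly for any candidate factor $G$. In this hypothetical sketch, the name `csi_objective` and the argument `mu` (the trade-off $\mu$) are ours; the projection $G(G^\top G)^{-1}G^\top$ is computed via a QR factorization, anticipating Section 4:

```python
import numpy as np

def csi_objective(K, G, Y, mu):
    """Evaluate Eq. (2): kappa*tr(K - G G^T) + lambda*tr(Y^T Y - Y^T P Y),
    where P is the orthogonal projection onto the column space of G."""
    kappa = (1.0 - mu) / np.trace(K)
    lam = mu / np.sum(Y ** 2)
    Q, _ = np.linalg.qr(G)                    # P = Q Q^T, numerically stable
    approx_err = np.trace(K) - np.sum(G ** 2)         # tr(K - G G^T)
    pred_err = np.sum(Y ** 2) - np.sum((Q.T @ Y) ** 2)
    return kappa * approx_err + lam * pred_err
```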
4. Cholesky with side information (CSI)

Our algorithm builds on incomplete Cholesky decomposition, restricting the matrices $G$ that it considers to those which are obtained as incomplete Cholesky factors of $K$. In order to select the pivot, we need to compute the gain in the cost function $J$ in Eq. (2) for each pivot at each iteration. Let us denote the two terms in the cost function $J(G)$ as $J_K(G)$ and $J_Y(G)$. The first term has been studied in Section 2, where we found that

$$J_K(G) = \operatorname{tr} K - \sum_{k=1}^m \|G(:,k)\|_2^2. \quad (3)$$

In order to compute the second term,

$$J_Y(G) = \operatorname{tr} Y^\top Y - \operatorname{tr} Y^\top G (G^\top G)^{-1} G^\top Y, \quad (4)$$

efficiently, we need an efficient way of computing the matrix $G(G^\top G)^{-1}G^\top$ which is amenable to cheap updating as $G$ increases. This can be achieved by QR decomposition.

4.1. QR decomposition

Given a rectangular $n \times m$ matrix $M$, such that $n \geq m$, the QR decomposition of $M$ is of the form $M = QR$, where $Q$ is an $n \times m'$ matrix with orthonormal columns, i.e., $Q^\top Q = \mathrm{Id}_{m'}$, $m' \leq m$, and $R$ is an $m' \times m$ upper triangular matrix. The matrix $Q$ provides an orthonormal basis of the column space of $M$; if $M$ has full rank $m$, then $m' = m$, while if not, $m'$ is equal to the rank of $M$. The QR decomposition can be seen as the Gram-Schmidt orthonormalization of the column vectors of $M$ (Golub and Van Loan, 1996); moreover, the matrix $R^\top$ is the Cholesky factor of the $m \times m$ matrix $M^\top M$.

A simple iterative algorithm to compute the QR decomposition of $M$ follows the Gram-Schmidt orthonormalization procedure. The first columns of $Q$ and $R$ are defined as $Q(:,1) = M(:,1)/\|M(:,1)\|_2$ and $R(1,1) = \|M(:,1)\|_2$. The $k$-th iteration, $k \leq m$, is the following:

$$R(j,k) = Q(:,j)^\top M(:,k), \quad j = 1,\ldots,k-1$$
$$R(k,k) = \Big\| M(:,k) - \sum_{i=1}^{k-1} R(i,k)\, Q(:,i) \Big\|_2$$
$$Q(:,k) = \frac{1}{R(k,k)} \Big( M(:,k) - \sum_{i=1}^{k-1} R(i,k)\, Q(:,i) \Big).$$

The algorithm stops whenever $k$ reaches $m$ or $R(k,k)$ vanishes. The complexity of each iteration is equal to $O(kn)$ and thus the total complexity up to the $m$-th step is $O(m^2 n)$.

4.2. Parallel Cholesky and QR decompositions

While building the Cholesky decomposition $G$ iteratively as described in Section 2.1, we update its QR decomposition at each step. The complexity of each iteration is $O(kn)$ and thus, if the algorithm stops after $m$ steps, the total complexity is $O(m^2 n)$. We still need to describe the pivot selection strategy; as for the Cholesky decomposition we use a greedy strategy, i.e., we choose the pivot that most reduces the cost. In the following sections, we show how this choice can be performed efficiently.

4.3. Cost reduction

We use the following notation: $R_k = R(1{:}k, 1{:}k)$, $G_k = G(:, 1{:}k)$, $Q_k = Q(:, 1{:}k)$, $\eta_k = G(:,k)$ and $q_k = Q(:,k)$. After the $k$-th iteration the cost function is equal to

$$J_k = \kappa \Big( \operatorname{tr} K - \sum_{q=1}^{k} \|\eta_q\|_2^2 \Big) + \lambda \left( \operatorname{tr} Y^\top Y - \operatorname{tr} Y^\top Q_k Q_k^\top Y \right),$$

and the cost reduction is thus equal to

$$J_{k-1} - J_k = \kappa\, \Delta J_K + \lambda\, \Delta J_Y, \quad (5)$$

where

$$\Delta J_K = \|\eta_k\|_2^2 \quad (6)$$
$$\Delta J_Y = \|Y^\top q_k\|_2^2 = \frac{\|Y^\top (\mathrm{Id}_n - Q_{k-1} Q_{k-1}^\top)\,\eta_k\|_F^2}{\|(\mathrm{Id}_n - Q_{k-1} Q_{k-1}^\top)\,\eta_k\|_2^2}. \quad (7)$$

Following Section 2.1, we can express $\eta_k$ in terms of the pivot $i_k$ and the approximation after the $(k-1)$-th iteration, i.e.,

$$\eta_k = \frac{(K - L_{k-1})(:, i_k)}{(K - L_{k-1})(i_k, i_k)^{1/2}} = \frac{(K - L_{k-1})(:, i_k)}{D_{k-1}(i_k)^{1/2}}. \quad (8)$$

Computing this reduction before the $k$-th iteration for all $n + 1 - k$ available pivots is a prohibitive $O(kn^2)$ operation. As in the case of Cholesky decomposition, a lower bound on the reduction could be computed to avoid this costly operation. However, we have developed a different strategy, one based on a look-ahead algorithm that gives cheap additional information on the kernel matrix. This strategy is presented in the next section.

5. Look-ahead decompositions

At every step of the algorithm, we not only perform one step of Cholesky and QR, but we also perform several "look-ahead steps" to gather more information about the kernel matrix $K$. Throughout the procedure we maintain the following information: (1) decomposition matrices $G_{k-1} \in \mathbb{R}^{n \times (k-1)}$, $Q_{k-1} \in \mathbb{R}^{n \times (k-1)}$, $D_{k-1} \in \mathbb{R}^n$, and $R_{k-1} \in \mathbb{R}^{(k-1) \times (k-1)}$, obtained from the sequence of indices $I_{k-1} = (i_1,\ldots,i_{k-1})$; (2) additional decomposition matrices obtained by $\delta$ additional steps of Cholesky and QR decomposition: $G^{\mathrm{adv}}_{k-1} \in \mathbb{R}^{n \times (k-1+\delta)}$, $Q^{\mathrm{adv}}_{k-1} \in \mathbb{R}^{n \times (k-1+\delta)}$, $R^{\mathrm{adv}}_{k-1} \in \mathbb{R}^{(k-1+\delta) \times (k-1+\delta)}$, $D^{\mathrm{adv}}_{k-1} \in \mathbb{R}^n$. The first $k-1$ columns of $G^{\mathrm{adv}}_{k-1}$ and $Q^{\mathrm{adv}}_{k-1}$ are the matrices $G_{k-1}$ and $Q_{k-1}$, and the $\delta$ additional columns that are added are indexed by $H_k = (h_k^1, \ldots, h_k^\delta)$. We now describe how this information is updated, and how it is used to approximate the cost reduction. A high-level description of the overall algorithm is given in Figure 1.

5.1. Approximation of the cost reduction

After the $(k-1)$-th iteration, we have the following approximations: $L_{k-1} = G_{k-1}G_{k-1}^\top$ and $L^{\mathrm{adv}}_{k-1} = G^{\mathrm{adv}}_{k-1}(G^{\mathrm{adv}}_{k-1})^\top$. In order to approximate the cost reduction defined by Eqs. (5), (6), (7) and (8), we replace all currently unknown portions of the kernel matrix (i.e., the columns whose indices are not in $I_{k-1} \cup H_k$) by the corresponding elements of $L^{\mathrm{adv}}_{k-1}$. This is equivalent to replacing $\eta_k$ in Eq. (8) by

$$\hat{\eta}_k = \frac{(L^{\mathrm{adv}}_{k-1} - L_{k-1})(:, i_k)}{\left(K(i_k,i_k) - L_{k-1}(i_k,i_k)\right)^{1/2}}.$$

In order to approximate $\Delta J_K$, we also make sure that $\eta_k(i_k)$ is not approximated, so that our error term reduces to the lower bound of the incomplete Cholesky decomposition when $\delta = 0$ (i.e., no look-ahead performed); this is obtained through the corrective term in the following equations. We obtain:

$$\Delta J_K(i_k) = \frac{A_{k-1}(i_k) + \nu_{k-1}(i_k)}{D_{k-1}(i_k)} \quad (9)$$
$$\Delta J_Y(i_k) = \frac{C_{k-1}(i_k)}{B_{k-1}(i_k)} \quad (10)$$

with

$$\nu_{k-1}(i) = D_{k-1}(i)^2 - \left(D_{k-1}(i) - D^{\mathrm{adv}}_{k-1}(i)\right)^2$$
$$A_{k-1}(i) = \|(L^{\mathrm{adv}}_{k-1} - L_{k-1})(:, i)\|_2^2$$
$$B_{k-1}(i) = \|(\mathrm{Id}_n - Q_{k-1}Q_{k-1}^\top)(L^{\mathrm{adv}}_{k-1} - L_{k-1})(:, i)\|_2^2$$
$$C_{k-1}(i) = \|Y^\top (\mathrm{Id}_n - Q_{k-1}Q_{k-1}^\top)(L^{\mathrm{adv}}_{k-1} - L_{k-1})(:, i)\|_F^2.$$

Note that when the index $i_k$ belongs to the set of indices that were considered in advance, then the approximation is exact.

A naive computation of the approximation would lead to a prohibitive quadratic complexity in $n$. We now present a way of updating the quantities defined above as well as a way of updating the look-ahead Cholesky and QR steps at a cost of $O(\delta n + dn)$ per iteration.
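The exact (non-approximated) gains of Eqs. (6)-(8) for a single candidate pivot look as follows in a hypothetical NumPy sketch (function name and signature are ours). Running it for every remaining pivot is exactly the $O(kn^2)$ computation that the look-ahead scheme is designed to avoid:

```python
import numpy as np

def exact_gains(K, G_prev, Q_prev, D_prev, Y, i):
    """Exact cost reductions (Eqs. 6-8) if pivot i were added at step k.
    G_prev, Q_prev are n x (k-1); D_prev is the current diagonal.
    O(kn) per candidate, hence too costly to evaluate for all i."""
    eta = (K[:, i] - G_prev @ G_prev[i, :]) / np.sqrt(D_prev[i])  # Eq. (8)
    dJ_K = eta @ eta                                              # Eq. (6)
    r = eta - Q_prev @ (Q_prev.T @ eta)      # residual (Id - Q Q^T) eta
    dJ_Y = np.sum((Y.T @ r) ** 2) / (r @ r)                       # Eq. (7)
    return dJ_K, dJ_Y
```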
5.2. Efficient implementation

Updating the look-ahead decompositions. After the pivot $i_k$ has been chosen, if it was not already included in the set of indices already treated in advance, we perform the additional step of Cholesky and QR decompositions with that pivot. If it was already chosen, we select a new pivot using the usual Cholesky lower bound defined in Section 2. Let $G^{\mathrm{bef}}_k$, $Q^{\mathrm{bef}}_k$ and $R^{\mathrm{bef}}_k$ be those decompositions with $k + \delta$ columns. In both cases, we obtain a Cholesky decomposition whose $k$-th pivot is not $i_k$ in general, since $i_k$ may not be among the first look-ahead pivots from the previous iteration. In general, $i_k$ is less than $\delta$ indices away from the $k$-th position. In order to compute $G^{\mathrm{adv}}_k$, $Q^{\mathrm{adv}}_k$ and $R^{\mathrm{adv}}_k$, we need to update the Cholesky and QR decompositions to advance pivot $i_k$ to the $k$-th position. In Appendix A, we show how this can be done with worst-case time complexity $O(\delta n)$, which is faster than naively redoing $\delta$ steps of Cholesky decomposition in $O(\delta k n)$.

Input: kernel matrix $K \in \mathbb{R}^{n \times n}$, target matrix $Y \in \mathbb{R}^{n \times d}$, maximum rank $m$, tolerance $\varepsilon$, trade-off parameter $\mu \in [0,1]$, number of look-ahead steps $\delta \in \mathbb{N}$.

Algorithm:
1. Perform $\delta$ look-ahead steps of Cholesky (Section 2.1) and QR decomposition (Section 4.1), selecting pivots according to Section 2.2.
2. Initialization: $\Delta = 2\varepsilon$, $k = 1$.
3. While $\Delta > \varepsilon$ and $k \leq m$:
   a. Compute estimated gains for the remaining pivots (Section 5.1), and select the best pivot.
   b. If the new pivot is not in the set of look-ahead pivots, perform a Cholesky and a QR step; otherwise perform the steps with a pivot selected according to Section 2.2.
   c. Permute indices in the Cholesky and QR decompositions to put the new pivot in position $k$, using the method in Appendix A.
   d. Compute the exact gain $\Delta$; let $k = k + 1$.

Output: $G$ and its QR decomposition.

Figure 1. High-level description of the CSI algorithm.

Updating approximation costs. In order to derive update equations for $A_k(i)$, $B_k(i)$ and $C_k(i)$, the crucial point is to notice that

$$L^{\mathrm{adv}}_k(:,i) = L^{\mathrm{bef}}_k(:,i) = L^{\mathrm{adv}}_{k-1}(:,i) + G^{\mathrm{bef}}_k(i,k)\, G^{\mathrm{bef}}_k(:,k).$$

This makes most of the terms in the expansion of $A_k(i)$, $B_k(i)$ and $C_k(i)$ identical to terms in $A_{k-1}(i)$, $B_{k-1}(i)$ and $C_{k-1}(i)$. The total complexity of updating $A_k(i)$, $B_k(i)$ and $C_k(i)$, for all $i$, is then $O(dn + \delta n)$.

5.3. Computational complexity

The total complexity of the CSI algorithm after $m$ steps is the sum of (a) $m + \delta$ steps of Cholesky and QR decomposition, i.e., $O((m+\delta)^2 n)$, (b) updating the look-ahead decompositions by permuting indices as presented in Appendix A, i.e., $O(m\delta n)$, and (c) updating the approximation costs, i.e., $O(mdn + m\delta n)$. The total complexity is thus $O((m+\delta)^2 n + mdn)$. In the usual case in which $d \leq \max\{m, \delta\}$, this yields a total complexity equal to $O((m+\delta)^2 n)$, which is the same complexity as computing $m + \delta$ steps of Cholesky and QR decomposition. For large kernel matrices, the Cholesky and QR decompositions remain the most costly computations, and thus the CSI algorithm is only a few times slower than the standard incomplete Cholesky decomposition.

We see that the CSI algorithm has the same favorable linear complexity in the number of data points $n$ as standard Cholesky decomposition. In particular, we do not need to examine every entry of the kernel matrix in order to compute the CSI approximation. This is particularly important when the kernel is itself costly to compute, as in the case of string kernels or graph kernels (Shawe-Taylor and Cristianini, 2004).

5.4. Including an intercept

It is straightforward to include an intercept in the CSI algorithm. This is done by replacing $Y$ with $\Pi_n Y$, where $\Pi_n = (\mathrm{Id}_n - \frac{1}{n} 1_n 1_n^\top)$ is the centering projection matrix. The Cholesky decomposition is not changed, while the QR decomposition is now performed on $\Pi_n G$ instead of $G$. The rest of the algorithm is not changed.

6. Experiments

We have conducted a comparison of CSI and incomplete Cholesky decomposition for 37 UCI datasets, including both regression and (multi-way) classification problems. The kernel method that we used in these experiments is the least-squares SVM (Suykens and Vandewalle, 1999). The goal of the comparison was to investigate to what extent we can achieve a lower-rank decomposition with the CSI algorithm as compared to incomplete Cholesky, at equivalent levels of predictive performance.²

² A Matlab/C implementation can be downloaded from

6.1. Least-squares SVMs

The least-squares SVM (LS-SVM) algorithm is based on the minimization of the following cost function:

$$\frac{1}{2n}\|Y - K\alpha - 1_n b\|_F^2 + \frac{\tau}{2} \operatorname{tr} \alpha^\top K \alpha,$$

where $\alpha \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^{1 \times d}$. This is a classical penalized least-squares problem, whose estimating equations are obtained by setting the derivatives to zero:

$$b = \frac{1}{n}\left( -1_n^\top K\alpha + 1_n^\top Y \right), \qquad (K \Pi_n K + n\tau K)\,\alpha = K \Pi_n Y,$$

where $\Pi_n = \mathrm{Id}_n - \frac{1}{n} 1_n 1_n^\top$.

6.2. Least-squares SVM with incomplete Cholesky decomposition

We now approximate $K$ by an incomplete Cholesky factorization obtained from columns in $I$, i.e., $L = GG^\top$. Expressed in terms of $G$, the estimating equations for the LS-SVM become:

$$G(G^\top \Pi_n G + n\tau\, \mathrm{Id}_m)\, G^\top \alpha = G G^\top \Pi_n Y \quad (11)$$
$$b = \frac{1}{n}\left( -1_n^\top G G^\top \alpha + 1_n^\top Y \right). \quad (12)$$

The solutions of Eq. (11) are the vectors of the form

$$\alpha = G(G^\top G)^{-1} (G^\top \Pi_n G + n\tau\, \mathrm{Id}_m)^{-1} G^\top \Pi_n Y + v,$$

where $v$ is any vector orthogonal to the column space of $G$. Thus $\alpha$ is not uniquely defined; however, the quantity $K\alpha$ (with $K$ replaced by $GG^\top$) is uniquely defined, and equal to $K\alpha = G(G^\top \Pi_n G + n\tau\,\mathrm{Id}_m)^{-1} G^\top \Pi_n Y$. We also have $b = \frac{1}{n}( -1_n^\top G(G^\top \Pi_n G + n\tau\,\mathrm{Id}_m)^{-1} G^\top \Pi_n Y + 1_n^\top Y )$ and the predicted training responses in $\mathbb{R}^{n \times d}$ are $K\alpha + 1_n b = \frac{1}{n} 1_n 1_n^\top Y + \Pi_n G (G^\top \Pi_n G + n\tau\,\mathrm{Id}_m)^{-1} G^\top \Pi_n Y$.

In order to compute the responses for previously unseen data $z_j$, for $j = 1,\ldots,n_{\mathrm{test}}$, we consider the rectangular testing kernel matrix in $\mathbb{R}^{n_{\mathrm{test}} \times n}$, defined as $(K_{\mathrm{test}})_{ji} = k(x_i, z_j)$. We use the approximation of $K_{\mathrm{test}}$ based on the columns of $K_{\mathrm{test}}$ for which the corresponding columns of $K$ were already selected in the Cholesky decomposition of $K$. If we let $I$ denote those columns, the testing responses are then equal to $K_{\mathrm{test}}(:,I)\, G(I,:)^{-\top} G^\top \alpha$, which is uniquely defined (while $\alpha$ is not). This also has the effect of not requiring the computation of the entire testing kernel matrix $K_{\mathrm{test}}$, a substantial gain for large datasets.

In order to compute the training error and testing errors, we threshold the responses appropriately (by taking the sign for binary classification, or the closest basis vector for multi-class classification, where each class is mapped to a basis vector).
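In a hypothetical NumPy sketch (names are ours, not the authors' released code), the uniquely defined quantities of Eqs. (11)-(12) can be computed from $G$ as follows; centering with $\Pi_n$ is done implicitly by subtracting column means:

```python
import numpy as np

def lssvm_lowrank(G, Y, tau):
    """Fit an LS-SVM with K approximated by G G^T (Eqs. 11-12).
    Returns (w, b) such that K alpha = G @ w, so training responses
    are G @ w + b, with w = (G^T Pi G + n tau I)^{-1} G^T Pi Y."""
    n, m = G.shape
    Gc = G - G.mean(axis=0)     # Pi_n G, without forming Pi_n
    Yc = Y - Y.mean(axis=0)     # Pi_n Y
    w = np.linalg.solve(Gc.T @ Gc + n * tau * np.eye(m), Gc.T @ Yc)
    b = Y.mean(axis=0) - (G @ w).mean(axis=0)     # from Eq. (12)
    return w, b

# Decision rule for multi-class data: yhat = G @ w + b, then take
# the closest basis vector (argmax over the d columns of yhat).
```

Here the identity $G^\top \Pi_n G = (\Pi_n G)^\top (\Pi_n G)$ (since $\Pi_n$ is a symmetric idempotent projection) is what allows the centered factor `Gc` to be reused on both sides of the normal equations.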
6.3. Experimental results: UCI datasets

We transformed all discrete variables to multivariate real random variables by mapping them to the basis vectors; we also scaled each variable to unit variance. We performed 10 random "75/25" splits of the data. We used a Gaussian-RBF kernel, $k(x,y) = \exp(-\gamma\|x-y\|_2^2)$, with the parameters $\gamma$ and $\tau$ chosen so as to minimize error on the training split. The minimization was performed by grid search.

We trained and tested several LS-SVMs with decompositions of increasing rank, comparing incomplete Cholesky decomposition to the CSI method presented in this paper. The hyperparameters for the CSI algorithm were set to $\mu = 0.99$ and $\delta = 40$. The value of $\delta$ was chosen to be large enough so that in most cases the final rank was the same as if the entire kernel matrix was used, and small enough so that the complexity of the look-ahead was small compared to the rest of the Cholesky decomposition.

For both algorithms, the stopping criterion (the minimal gain at each iteration) was set to $10^{-4}$. We imposed no upper bound on the ranks of the decompositions. We report the minimal rank for which the cross-validation error is within a standard deviation of the average testing error obtained when no low-rank decomposition is used. As shown in Figure 2, the CSI algorithm generally yields a decomposition of significantly smaller rank than incomplete Cholesky decomposition; indeed, the difference in minimal ranks achieved by the algorithms can be dramatic.

Figure 2. Simulation results on UCI datasets, where $n_f$ is the number of features, $n_c$ the number of classes ("-" for regression problems), and $n_p$ the number of data points. For both classical incomplete Cholesky decomposition (Chol) and Cholesky decomposition with side information (CSI), we report the minimal rank for which the prediction performance with a decomposition of that rank is within one standard deviation of the performance with a full-rank kernel matrix. Datasets are sorted by the values of the ratios between the last two columns. [Table with one row per dataset (columns: dataset, $n_f$, $n_c$, $n_p$, Chol, CSI) omitted.]
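For concreteness, a hypothetical sketch of the Gaussian-RBF kernel and the unit-variance scaling used above (function name and broadcasting layout are ours):

```python
import numpy as np

def rbf_kernel(X, Z=None, gamma=1.0):
    """Gaussian-RBF kernel k(x, z) = exp(-gamma * ||x - z||_2^2).
    Rows of X (and Z) are data points, scaled to unit variance."""
    Z = X if Z is None else Z
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

# unit-variance scaling of the features, as in the experiments
# X = X / X.std(axis=0, keepdims=True)
```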
7. Conclusions

A major theme of machine learning research is the advantages that accrue to "discriminative" methods, methods that adjust all of the parameters of a model to minimize a task-specific loss function. In this paper we have extended this point of view to the matrix algorithms that underlie kernel-based learning methods. With the incomplete Cholesky decomposition as a starting point, we have developed a new low-rank decomposition algorithm for positive semidefinite matrices that can exploit side information (e.g., classification labels). We have shown that this algorithm yields decompositions of significantly lower rank than those obtained with current methods (which ignore the side information). Given that the computational requirements of the new algorithm are comparable to those of standard incomplete Cholesky decomposition, we feel that the new algorithm can and should replace incomplete Cholesky in a variety of applications.

There are several natural extensions of the research reported here that are worth pursuing, most notably the extension of these results to situations in which two or more related kernel matrices have to be approximated conjointly, such as in kernel canonical correlation analysis (Bach and Jordan, 2002) or multiple kernel learning (Lanckriet et al., 2004).

Appendix A. Efficient pivot permutation

In this appendix we describe an efficient algorithm to advance the pivot with index $q$ to position $p \leq q$ in an incomplete Cholesky and QR decomposition. This can be achieved by $q - p$ transpositions between successive pivots. Permuting two successive pivots $p, p+1$ can be done in $O(n)$ as follows (we let $P$ denote $p{:}p{+}1$):

1. Permute rows $p$ and $p+1$ of $Q$ and $G$.
2. Perform the QR decomposition $G(P,P)^\top = Q_1 R_1$.
3. $R(:,P) \leftarrow R(:,P)\, Q_1$, $G(:,P) \leftarrow G(:,P)\, Q_1$.
4. Perform the QR decomposition $R(P,P) = Q_2 R_2$.
5. $R(P,:) \leftarrow Q_2^\top R(P,:)$, $Q(:,P) \leftarrow Q(:,P)\, Q_2$.

The total complexity of permuting indices $p$ and $q$ is thus $O(|p-q|\, n)$. Note that all columns of $G$ and $Q$ between $p$ and $q$ are changed, but that the updates involve $|p-q|$ shuffles between successive columns of $Q$ and $G$.
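A hypothetical NumPy transcription of one adjacent transposition (steps 1 to 5); for simplicity of indexing it assumes the data have been permuted so that pivot $k$ occupies row $k$ of $G$:

```python
import numpy as np

def swap_adjacent_pivots(G, Q, R, p):
    """Exchange pivots p and p+1 (0-based) in O(n): updates G, Q, R
    in place so that G = Q @ R still holds and the pivot-row block
    of G stays lower triangular."""
    P = [p, p + 1]
    # 1. permute rows p and p+1 of Q and G
    G[P] = G[[p + 1, p]]
    Q[P] = Q[[p + 1, p]]
    # 2-3. right-multiply columns P of G and R by Q1, where
    #      G(P,P)^T = Q1 R1, restoring the lower-triangular block of G
    Q1, _ = np.linalg.qr(G[np.ix_(P, P)].T)
    G[:, P] = G[:, P] @ Q1
    R[:, P] = R[:, P] @ Q1
    # 4-5. re-triangularize R with a second 2x2 QR; rotate Q to compensate
    Q2, _ = np.linalg.qr(R[np.ix_(P, P)])
    R[P, :] = Q2.T @ R[P, :]
    Q[:, P] = Q[:, P] @ Q2
```

Repeating this for successive positions advances a pivot from position $q$ to position $p$ at the stated $O(|p-q|\,n)$ cost.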
Acknowledgements

We wish to acknowledge support from a grant from Intel Corporation, and a graduate fellowship to Francis Bach from Microsoft Research. We also wish to acknowledge Grant 0412995 from the National Science Foundation.

References

F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243-264, 2001.

G. H. Golub and C. F. Van Loan. Matrix Computations. J. Hopkins Univ. Press, 1996.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.

J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Proc. Let., 9(3):293-300, 1999.

C. K. I. Williams and M. Seeger. Effect of the input density distribution on kernel-based classifiers. In Proc. ICML, 2000.