
Journal of Machine Learning Research 10 (2009) 341-376. Submitted 8/07; Revised 10/08; Published 2/09.

Low-Rank Kernel Learning with Bregman Matrix Divergences

Brian Kulis (KULIS@CS.UTEXAS.EDU), Mátyás A. Sustik (SUSTIK@CS.UTEXAS.EDU), Inderjit S. Dhillon (INDERJIT@CS.UTEXAS.EDU)
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA

Editor: Michael I. Jordan

Abstract

In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness; these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices.

Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness

1. Introduction

Underlying many machine learning algorithms is a measure of distance, or divergence, between data objects. A number of factors affect the choice of the particular divergence measure used: the data may be intrinsically suited for a specific divergence measure, an algorithm may be faster or easier to implement for some measure, or analysis may be simpler for a particular measure. For example, the KL-divergence is popular when the data is represented as discrete probability vectors (i.e., non-negative vectors whose sum is 1), and the $\ell_1$-norm distance is often used when sparsity is desired.

When measuring the divergence between matrices, similar issues must be considered. Typically, matrix norms such as the Frobenius norm are used, but these measures are not appropriate for all problems. Analogous to the KL-divergence for vectors, when the matrices under consideration are positive semidefinite (i.e., they have non-negative eigenvalues), then one may want to choose a divergence measure that is well-suited to such matrices; positive semidefinite (PSD) matrices arise frequently in machine learning, in particular with kernel methods. Existing learning techniques involving PSD matrices often employ matrix norms, but their use requires an additional constraint that the matrix stay positive semidefinite, which ultimately leads to algorithms involving expensive eigenvector computation. Kernel alignment, a measure of similarity between PSD matrices used in some kernel learning algorithms (Cristianini et al., 2002), also requires explicit enforcement of positive definiteness.
On the other hand, the two main divergences used in this paper are defined only over the cone of positive semidefinite matrices, and our algorithms lead to automatic enforcement of positive semidefiniteness.

This paper focuses on kernel learning using two divergence measures between PSD matrices: the LogDet divergence and the von Neumann divergence. Our kernel learning goal is to find a PSD matrix which is as close as possible (under the LogDet or von Neumann divergence) to some input PSD matrix, but which additionally satisfies linear equality or inequality constraints. We argue that these two divergences are natural for problems involving positive definite matrices. First, they have several properties which make optimization computationally efficient and useful for a variety of learning tasks. For example, the LogDet divergence has a scale-invariance property which makes it particularly well-suited to machine learning applications. Second, these divergences arise naturally in several problems; for example, the LogDet divergence arises in problems involving the differential relative entropy between Gaussian distributions while the von Neumann divergence arises in quantum computing and problems such as online PCA.

One of the key properties that we demonstrate and exploit is that, for low-rank matrices, the divergences enjoy a range-space preserving property. That is, the LogDet divergence between two matrices is finite if and only if their range spaces are identical, and a similar property holds for the von Neumann divergence. This property leads to simple projection-based algorithms for learning PSD (or kernel) matrices that preserve the rank of the input matrix: if the input matrix is low-rank then the resulting learned matrix will also be low-rank. These algorithms are efficient; they scale linearly with the number of data points $n$ and quadratically with the rank of the input matrix. The efficiency of the algorithms arises from new results developed in this paper which demonstrate that Bregman projections for these matrix divergences can be computed in time that scales quadratically with the rank of the input matrix. Our algorithms stand in contrast to previous work on learning kernel matrices, which scales as $O(n^3)$ or worse, relying on semidefinite programming or repeated eigenvector computation. We emphasize that our methods preserve rank, so they can learn low-rank kernel matrices when the input kernel matrix has low rank; however, our methods do not decrease rank, so they are not applicable to the non-convex problem of finding a low-rank solution given a full (or higher) rank input kernel matrix.

One special case of our method is the DefiniteBoost optimization problem from Tsuda et al. (2005); our analysis shows how to improve the running time of their algorithm by a factor of $n$, from $O(n^3)$ time per projection to $O(n^2)$. This projection is also used in online PCA (Warmuth and Kuzmin, 2006), and we obtain a factor of $n$ speedup for that problem as well. In terms of experimental results, a direct application of our techniques is in learning low-rank kernel matrices in the setting where we have background information for a subset of the data; we discuss and experiment with learning low-rank kernel matrices for classification and clustering in this setting, demonstrating that we can scale to large data sets. We also discuss the use of our divergences for learning Mahalanobis distance functions, which allows us to move beyond the transductive setting and generalize to new points. In this vein, we empirically compare our methods to existing metric learning algorithms; we are particularly interested in large-scale applications, and discuss some recent applications of the algorithms developed in this paper to computer vision tasks.

2. Background and Related Work

In this section, we briefly review relevant background material and related work.
2.1 Kernel Methods

Given a set of training points $a_1, \ldots, a_n$, a common step in kernel algorithms is to transform the data using a nonlinear function $\psi$. This mapping typically represents a transformation of the data to a higher-dimensional feature space. A kernel function $\kappa$ gives the inner product between two vectors in the feature space: $\kappa(a_i, a_j) = \psi(a_i) \cdot \psi(a_j)$. It is often possible to compute this inner product without explicitly computing the expensive mapping of the input points to the higher-dimensional feature space. Generally, given $n$ points $a_i$, we form an $n \times n$ matrix $K$, called the kernel matrix, whose $(i, j)$ entry corresponds to $\kappa(a_i, a_j)$. In kernel-based algorithms, the only information needed about the input data points is the inner products; hence, the kernel matrix provides all relevant information for learning in the feature space. A kernel matrix formed from any set of input data points is always positive semidefinite. See Shawe-Taylor and Cristianini (2004) for more details.

2.2 Low-Rank Kernel Representations and Kernel Learning

Despite the popularity of kernel methods in machine learning, many kernel-based algorithms scale poorly; low-rank kernel representations address this issue. Given an $n \times n$ kernel matrix $K$, if the matrix is of low rank, say $r \ll n$, we can represent the kernel matrix in terms of a factorization $K = GG^T$, with $G$ an $n \times r$ matrix. In addition to easing the burden of memory overhead from $O(n^2)$ storage to $O(nr)$, this low-rank decomposition can lead to improved efficiency. For example, Fine and Scheinberg (2001) show that SVM training reduces from $O(n^3)$ to $O(nr^2)$ when using a low-rank decomposition. Empirically, the algorithm in Fine and Scheinberg (2001) outperforms other SVM training algorithms in terms of training time by several factors. In clustering, the kernel k-means algorithm (Dhillon et al., 2004) has a running time of $O(n^2)$ per iteration, which can be improved to $O(nrc)$ time per iteration with a low-rank kernel representation, where $c$ is the number of desired clusters. Low-rank kernel representations are often obtained using incomplete Cholesky decompositions (Fine and Scheinberg, 2001). Recently, work has been done on using labeled data to improve the quality of low-rank decompositions (Bach and Jordan, 2005).

Low-rank decompositions have also been employed for solving a number of other machine learning problems. For example, in Kulis et al. (2007b), low-rank decompositions were employed for clustering and embedding problems. In contrast to work in this paper, their focus was on using low-rank decompositions to develop algorithms for such problems as k-means and maximum variance unfolding. Other examples of using low-rank decompositions to speed up machine learning algorithms include Weinberger et al. (2006) and Torresani and Lee (2006).

In this paper, our focus is on using distance and similarity constraints to learn a low-rank kernel matrix. In related work, Lanckriet et al. (2004) have studied transductive learning of the kernel matrix and multiple kernel learning using semidefinite programming. In Kwok and Tsang (2003), a formulation based on idealized kernels is presented to learn a kernel matrix when some labels are given. Another recent paper (Weinberger et al., 2004) considers learning a kernel matrix for nonlinear dimensionality reduction; like much of the research on learning a kernel matrix, they use semidefinite programming and the running time is at least cubic in the number of data points. Our work is closest to that of Tsuda et al. (2005), who learn a (full-rank) kernel matrix using the von Neumann divergence under linear constraints.
However, our framework is more general and our emphasis is on low-rank kernel learning. Our algorithms are more efficient than those of Tsuda et al.; we use exact instead of approximate projections to speed up convergence, and we consider algorithms for the LogDet divergence in addition to the von Neumann divergence. For the case of the von Neumann divergence, our algorithm also corrects a mistake in Tsuda et al. (2005); see Appendix B for details.

An earlier version of our work appeared in Kulis et al. (2006). In this paper, we substantially expand on the analysis of Bregman matrix divergences, giving a formal treatment of low-rank Bregman matrix divergences and proving several new properties. We also present further algorithmic analysis for the LogDet and von Neumann algorithms, provide connections to semidefinite programming and metric learning, and present additional experiments, including ones on much larger data sets.

3. Optimization Framework

We begin with an overview of the optimization framework applied to the problem studied in this paper. We introduce Bregman matrix divergences, the matrix divergence measures considered in this paper, and discuss their properties. Then we overview Bregman's method, the optimization algorithm that is used to learn low-rank kernel matrices.

3.1 Bregman Matrix Divergences

To measure the nearness between two matrices, we will use Bregman matrix divergences, which are generalizations of Bregman vector divergences. Let $\varphi$ be a real-valued strictly convex function defined over a convex set $S = \mathrm{dom}(\varphi) \subseteq \mathbb{R}^m$ such that $\varphi$ is differentiable on the relative interior of $S$. The Bregman vector divergence (Bregman, 1967) with respect to $\varphi$ is defined as
$$D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y).$$
For example, if $\varphi(x) = x^T x$, then the resulting Bregman divergence is $D_\varphi(x, y) = \|x - y\|_2^2$. Another example is $\varphi(x) = \sum_i (x_i \log x_i - x_i)$, where the resulting Bregman divergence is the (unnormalized) relative entropy $D_\varphi(x, y) = KL(x, y) = \sum_i (x_i \log\frac{x_i}{y_i} - x_i + y_i)$. Bregman divergences generalize many properties of squared loss and relative entropy. See Censor and Zenios (1997) for more details.

We can naturally extend this definition to real, symmetric $n \times n$ matrices, denoted by $S^n$. Given a strictly convex, differentiable function $f : S^n \to \mathbb{R}$, the Bregman matrix divergence is defined to be
$$D_f(X, Y) = f(X) - f(Y) - \mathrm{tr}\big((\nabla f(Y))^T (X - Y)\big),$$
where $\mathrm{tr}(A)$ denotes the trace of matrix $A$. Examples include $f(X) = \|X\|_F^2$, which leads to the well-known squared Frobenius norm $\|X - Y\|_F^2$. In this paper, we will extensively study two less well-known divergences. Let $f$ be the entropy of the eigenvalues of a positive definite matrix. Specifically, if $X$ has eigenvalues $\lambda_1, \ldots, \lambda_n$, let $f(X) = \sum_i (\lambda_i \log\lambda_i - \lambda_i)$, which may be expressed as $f(X) = \mathrm{tr}(X \log X - X)$, where $\log X$ is the matrix logarithm. (If $X = V\Lambda V^T$ is the eigendecomposition of the positive definite matrix $X$, the matrix logarithm can be written as $V \log\Lambda\, V^T$, where $\log\Lambda$ is the diagonal matrix whose entries contain the logarithm of the eigenvalues. The matrix exponential can be defined analogously.) The resulting Bregman divergence is
$$D_{vN}(X, Y) = \mathrm{tr}(X \log X - X \log Y - X + Y), \qquad (1)$$
and we call it the von Neumann divergence. This divergence is also called quantum relative entropy, and is used in quantum information theory (Nielsen and Chuang, 2000). Another important matrix divergence arises by taking the Burg entropy of the eigenvalues, that is, $f(X) = -\sum_i \log\lambda_i$, or equivalently $f(X) = -\log\det X$. The resulting Bregman divergence over positive definite matrices is
$$D_{\ell d}(X, Y) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - n, \qquad (2)$$
and is commonly called the LogDet divergence (we called it the Burg matrix divergence in Kulis et al. (2006)).
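To make the two definitions concrete, the following is a minimal NumPy/SciPy sketch (our own illustration, not from the paper; function names are ours) that evaluates the von Neumann and LogDet divergences directly from (1) and (2) for positive definite inputs, and also checks the scale-invariance property of the LogDet divergence discussed in Section 3.2.

```python
import numpy as np
from scipy.linalg import logm

def von_neumann_div(X, Y):
    # D_vN(X, Y) = tr(X log X - X log Y - X + Y) for positive definite X, Y; eq. (1).
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):
    # D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n; eq. (2).
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
    X = A @ A.T + np.eye(4)          # two positive definite matrices
    Y = B @ B.T + np.eye(4)
    print(von_neumann_div(X, Y), logdet_div(X, Y))
    # Scale-invariance of LogDet: D_ld(aX, aY) == D_ld(X, Y).
    print(np.isclose(logdet_div(2 * X, 2 * Y), logdet_div(X, Y)))
```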
For now, we assume that $X$ is positive definite for both divergences; later we will discuss extensions when $X$ is positive semidefinite, that is, low-rank.

3.2 Properties

It is important to justify the use of the divergences introduced above for kernel learning, so we now discuss some important properties of the divergences. The most obvious computational benefit of using the divergences for kernel learning arises from the fact that they are defined only over positive definite matrices. Because of this, our algorithms will not need to explicitly constrain our learned matrices to be positive definite, which in turn leads to efficient algorithms. Beyond automatic enforcement of positive definiteness, we will see later that the divergences have a range-space preserving property, which leads to further computational benefits.

Appendix A covers some important properties of the divergences. Examples include the scale-invariance of the LogDet divergence ($D_{\ell d}(X, Y) = D_{\ell d}(\alpha X, \alpha Y)$) and, more generally, transformation-invariance ($D_{\ell d}(X, Y) = D_{\ell d}(M^T X M, M^T Y M)$ for any square, non-singular $M$), which are useful properties for learning algorithms. For instance, if we have applied a linear kernel over a set of data vectors, scale-invariance implies that we can scale all the features of our data vectors, and our learned kernel will simply be scaled by the same factor. Similarly, if we scale each of the features in our data vectors differently (for example, if some features are measured in feet and others in meters as opposed to all features measured in feet), then the learned kernel will be scaled by an appropriate diagonal transformation. Note that this natural property does not hold for loss functions such as the Frobenius norm distance.

Another critical property concerns generalization to new points. Typically kernel learning algorithms work in the transductive setting, meaning that all of the data is given up-front, with some of it labeled and some unlabeled, and one learns a kernel matrix over all the data points. If some new data point is given, there is no way to compare it to existing data points, so a new kernel matrix must be learned. However, as has recently been shown in Davis et al. (2007), we can move beyond the transductive setting when using the LogDet divergence, and can evaluate the kernel function over new data points. A similar result can be shown for the von Neumann divergence. We overview this recent work in Section 6.1, though the main focus in this paper remains the transductive setting.

Furthermore, both the LogDet divergence and the von Neumann divergence have precedence in a number of areas. The von Neumann divergence is used in quantum information theory (Nielsen and Chuang, 2000), and has been employed in machine learning for online principal component analysis (Warmuth and Kuzmin, 2006). The LogDet divergence is called Stein's loss in the statistics literature, where it has been used as a measure of distance between covariance matrices (James and Stein, 1961). It has also been employed in the optimization community; the updates for the BFGS and DFP algorithms (Fletcher, 1991), both quasi-Newton algorithms, can be viewed as LogDet
optimization programs. In particular, the update to the approximation of the Hessian given in these algorithms is the result of a LogDet minimization problem with linear constraints (given by the secant equation).

For all three divergences introduced above, the generating convex function of the Bregman matrix divergence can be viewed as a composition $f(X) = (\varphi \circ \lambda)(X)$, where $\lambda(X)$ is the function that lists the eigenvalues in algebraically decreasing order, and $\varphi$ is a strictly convex function defined over vectors (Dhillon and Tropp, 2007). In general, every such $\varphi$ defines a Bregman matrix divergence over real, symmetric matrices via the eigenvalue mapping. For example, if $\varphi(x) = x^T x$, then the resulting composition $(\varphi \circ \lambda)(X)$ is the squared Frobenius norm. We call such divergences spectral Bregman matrix divergences.

Consider a spectral Bregman matrix divergence $D_f(X, Y)$, where $f = \varphi \circ \lambda$. We now show an alternate expression for $D_f(X, Y)$ based on the eigenvalues and eigenvectors of $X$ and $Y$, which will prove to be useful when motivating extensions to the low-rank case.

Lemma 1. Let the eigendecompositions of $X$ and $Y$ be $V\Lambda V^T$ and $U\Theta U^T$ respectively, and assume that $\varphi$ is separable, that is, $f(X) = (\varphi \circ \lambda)(X) = \sum_i \varphi_i(\lambda_i)$. Then
$$D_f(X, Y) = \sum_{i,j} (v_i^T u_j)^2 \big(\varphi_i(\lambda_i) - \varphi_j(\theta_j) - (\lambda_i - \theta_j)\nabla\varphi_j(\theta_j)\big).$$

Proof. We have
$$D_f(X, Y) = f(X) - f(Y) - \mathrm{tr}\big((X - Y)^T \nabla f(Y)\big) = \sum_i \varphi_i(\lambda_i) - \sum_j \varphi_j(\theta_j) - \mathrm{tr}\big((V\Lambda V^T - U\Theta U^T)^T \nabla f(Y)\big)$$
$$= \sum_{i,j}(v_i^T u_j)^2 \varphi_i(\lambda_i) - \sum_{i,j}(v_i^T u_j)^2 \varphi_j(\theta_j) - \mathrm{tr}\big((V\Lambda V^T - U\Theta U^T)^T \nabla f(Y)\big).$$
The second line above uses the separability of $\varphi$, while the third line uses the fact that $\sum_i (v_i^T u_j)^2 = \sum_j (v_i^T u_j)^2 = 1$. We can express $\nabla f(Y)$ as
$$\nabla f(Y) = U\,\mathrm{diag}\big(\nabla\varphi_1(\theta_1), \nabla\varphi_2(\theta_2), \ldots, \nabla\varphi_n(\theta_n)\big)\,U^T,$$
and so we have $\mathrm{tr}(U\Theta U^T \nabla f(Y)) = \sum_j \theta_j \nabla\varphi_j(\theta_j) = \sum_{i,j}(v_i^T u_j)^2 \theta_j \nabla\varphi_j(\theta_j)$. Finally, the term $\mathrm{tr}(V\Lambda V^T \nabla f(Y))$ can be expanded as $\sum_{i,j}(v_i^T u_j)^2 \lambda_i \nabla\varphi_j(\theta_j)$. Putting this all together, we have
$$D_f(X, Y) = \sum_{i,j}(v_i^T u_j)^2\big(\varphi_i(\lambda_i) - \varphi_j(\theta_j) - \nabla\varphi_j(\theta_j)(\lambda_i - \theta_j)\big).$$

Note that each of the divergences discussed earlier (the squared Frobenius divergence, the LogDet divergence, and the von Neumann divergence) arises from separable convex functions. Furthermore, in these three cases, the functions $\varphi_i$ do not depend on $i$ (so we denote $\varphi = \varphi_1 = \ldots = \varphi_n$) and the corollary below follows.

Corollary 2. Given $X = V\Lambda V^T$ and $Y = U\Theta U^T$, the squared Frobenius, von Neumann and LogDet divergences satisfy
$$D_f(X, Y) = \sum_{i,j} (v_i^T u_j)^2 D_\varphi(\lambda_i, \theta_j). \qquad (3)$$
This formula highlights the connection between a separable spectral matrix divergence and the corresponding vector divergence. It also illustrates how the divergence relates to the geometrical properties of the argument matrices as represented by the eigenvectors.
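As a numerical sanity check of (3), the following sketch (our own; NumPy assumed) compares the direct LogDet divergence with its spectral expansion using the scalar divergence $D_\varphi(\lambda, \theta) = \lambda/\theta - \log(\lambda/\theta) - 1$.

```python
import numpy as np

def logdet_div(X, Y):
    # Direct evaluation of D_ld(X, Y) = tr(XY^-1) - log det(XY^-1) - n.
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - n

def logdet_div_spectral(X, Y):
    # Corollary 2: sum_{i,j} (v_i^T u_j)^2 * D_phi(lambda_i, theta_j)
    # with D_phi(l, t) = l/t - log(l/t) - 1.
    lam, V = np.linalg.eigh(X)
    th, U = np.linalg.eigh(Y)
    overlap = (V.T @ U) ** 2              # (i, j) entry is (v_i^T u_j)^2
    ratio = lam[:, None] / th[None, :]
    return np.sum(overlap * (ratio - np.log(ratio) - 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
    X, Y = A @ A.T + np.eye(5), B @ B.T + np.eye(5)
    print(np.isclose(logdet_div(X, Y), logdet_div_spectral(X, Y)))  # True
```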
3.3 Problem Description

We now give a formal statement of the problem that we study in this paper. Given an input kernel matrix $X_0$, we attempt to solve the following for $X$:
$$\text{minimize } D_f(X, X_0) \quad \text{subject to } \mathrm{tr}(XA_i) \le b_i,\ 1 \le i \le c; \quad \mathrm{rank}(X) \le r; \quad X \succeq 0. \qquad (4)$$
Any of the linear inequality constraints $\mathrm{tr}(XA_i) \le b_i$ may be replaced with equalities. The above problem is clearly non-convex in general, due to the rank constraint. However, when the rank of $X_0$ does not exceed $r$, this problem turns out to be convex when we use the von Neumann and LogDet matrix divergences. This is because these divergences restrict the search for the optimal $X$ to the linear subspace of matrices that have the same range space as $X_0$. The details will emerge in Section 4. Another advantage of using the von Neumann and LogDet divergences is that the algorithms used to solve the minimization problem implicitly maintain the positive semidefiniteness constraint, so no extra work needs to be done to enforce positive semidefiniteness. This is in contrast to the squared Frobenius divergence, where the positive semidefiniteness constraint has to be explicitly enforced.

Though it is possible to handle general linear constraints of the form $\mathrm{tr}(XA_i) \le b_i$, we will focus on two specific types of constraints, which will be useful for our kernel learning applications. The first is a distance constraint. The squared Euclidean distance in feature space between the $j$th and the $k$th data points is given by $X_{jj} + X_{kk} - 2X_{jk}$. The constraint $X_{jj} + X_{kk} - 2X_{jk} \le b_i$ can be represented as $\mathrm{tr}(XA_i) \le b_i$, where $A_i = z_i z_i^T$, and $z_i(j) = 1$, $z_i(k) = -1$, and all other entries 0 (equivalently, $z_i = e_j - e_k$). The second type of constraint has the form $X_{jk} \le b_i$, which can be written as $\mathrm{tr}(XA_i) \le b_i$ using $A_i = \frac{1}{2}(e_j e_k^T + e_k e_j^T)$ (we maintain symmetry of $A_i$).

For the algorithms developed in Section 5, we will assume that the matrices $A_i$ in (4) are rank one (so $A_i = z_i z_i^T$) and that $r = \mathrm{rank}(X_0)$ (we briefly discuss extensions to higher-rank constraints in Section 6.2.2). In this case, we can write the optimization problem as:
$$\text{minimize } D_f(X, X_0) \quad \text{subject to } z_i^T X z_i \le b_i,\ 1 \le i \le c; \quad \mathrm{rank}(X) \le r; \quad X \succeq 0. \qquad (5)$$
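The constraint encodings used in (4) and (5) are straightforward to verify numerically. The sketch below (our own illustration; NumPy assumed) builds $A_i = z_i z_i^T$ for a distance constraint and checks that $\mathrm{tr}(XA_i) = z_i^T X z_i$ equals the squared feature-space distance $X_{jj} + X_{kk} - 2X_{jk}$.

```python
import numpy as np

def distance_constraint_matrix(n, j, k):
    # A_i = z z^T with z = e_j - e_k encodes the squared distance
    # between points j and k in feature space.
    z = np.zeros(n)
    z[j], z[k] = 1.0, -1.0
    return np.outer(z, z), z

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    G = rng.standard_normal((6, 3))
    X = G @ G.T                           # a rank-3 kernel matrix
    A, z = distance_constraint_matrix(6, j=1, k=4)
    lhs = np.trace(X @ A)                 # tr(X A_i), the constrained quantity
    rhs = X[1, 1] + X[4, 4] - 2 * X[1, 4]
    print(np.isclose(lhs, rhs), np.isclose(z @ X @ z, rhs))  # True True
```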
Furthermore, we assume that there exists a feasible solution to problem (5); we discuss an extension to the infeasible case involving slack variables in Section 6.2.1. Note that in this case, we assume $b_i \ge 0$, as otherwise there cannot be a feasible solution. The key application of the above optimization problem is in learning a kernel matrix in the setting where we have side information about some of the data points (e.g., labels or constraints), and we want to learn a kernel matrix over all the data points in order to perform classification, regression, or semi-supervised clustering. We will also discuss other applications throughout the paper.

3.4 Bregman Projections

Consider the convex optimization problem (4) presented above, but without the rank constraint (we will see how to handle the rank constraint later). To solve this problem, we use the method of Bregman projections (Bregman, 1967; Censor and Zenios, 1997). Suppose we wish to minimize $D_f(X, X_0)$ subject to linear equality and inequality constraints. The idea behind Bregman projections is to choose one constraint per iteration, and perform a projection so that the current solution satisfies the chosen constraint. Note that the projection is not an orthogonal projection, but rather a Bregman projection, which is tailored to the particular function that is being minimized. In the case of inequality constraints, an appropriate correction is also enforced. This process is then repeated by cycling through the constraints (or employing a more sophisticated control mechanism). This method may also be viewed as a dual coordinate ascent procedure that optimizes the dual with respect to a single dual variable per iteration (with all other dual variables remaining fixed). Under mild conditions, it can be shown that the method of cyclic Bregman projections (or a control mechanism that visits each constraint infinitely often) converges to the globally optimal solution; see Censor and Zenios (1997) and Dhillon and Tropp (2007) for more details.

Now for the details of each iteration. For an equality constraint of the form $\mathrm{tr}(XA_i) = b_i$, the Bregman projection of the current iterate $X_t$ may be computed by solving:
$$\text{minimize}_X\ D_f(X, X_t) \quad \text{subject to} \quad \mathrm{tr}(XA_i) = b_i. \qquad (6)$$
Introduce the dual variable $\alpha_i$, and form the Lagrangian $L(X, \alpha_i) = D_f(X, X_t) + \alpha_i(b_i - \mathrm{tr}(XA_i))$. By setting the gradient of the Lagrangian (with respect to $X$ and $\alpha_i$) to zero, we can obtain the Bregman projection by solving the resulting system of equations simultaneously for $\alpha_i$ and $X_{t+1}$:
$$\nabla f(X_{t+1}) = \nabla f(X_t) + \alpha_i A_i, \qquad \mathrm{tr}(X_{t+1} A_i) = b_i. \qquad (7)$$
For an inequality constraint of the form $\mathrm{tr}(XA_i) \le b_i$, let $\nu_i \ge 0$ be the corresponding dual variable. To maintain non-negativity of the dual variable (which is necessary for satisfying the KKT conditions), we can solve (7) for $\alpha_i$ and perform the following update:
$$\alpha_i' = \min(\nu_i, \alpha_i), \qquad \nu_i \leftarrow \nu_i - \alpha_i'. \qquad (8)$$
See Appendix B for a discussion on why the dual variable corrections are needed. Clearly the update guarantees that $\nu_i \ge 0$. Finally, update $X_{t+1}$ via
$$\nabla f(X_{t+1}) = \nabla f(X_t) + \alpha_i' A_i. \qquad (9)$$
Note that (8) may be viewed as a correction step that follows the projection given by (7). Both of our algorithms in Section 5 are based on this method, which iteratively chooses a constraint and performs a Bregman projection until convergence. The main difficulty lies in efficiently solving the nonlinear system of equations given in (7). In the case where $A_i = z_i z_i^T$, by calculating the gradient for the LogDet and von Neumann matrix divergences, respectively, (7) simplifies to:
$$X_{t+1} = \big(X_t^{-1} - \alpha_i z_i z_i^T\big)^{-1}, \qquad X_{t+1} = \exp\big(\log(X_t) + \alpha_i z_i z_i^T\big), \qquad (10)$$
subject to $z_i^T X_{t+1} z_i = b_i$. In (10), $\exp$ and $\log$ denote the matrix exponential and matrix logarithm, respectively. As we will see, for the von Neumann and LogDet divergences, these projections can be computed very efficiently (and are thus more desirable than other methods that involve multiple constraints at a time).

4. Bregman Divergences for Rank-Deficient Matrices

As given in (1) and (2), the von Neumann and LogDet divergences are undefined for low-rank matrices. For the LogDet divergence, the convex generating function $f(X) = -\log\det X$ is infinite when $X$ is singular, that is, its effective domain is the set of positive definite matrices. For the von Neumann divergence the situation is somewhat better, since one can define $f(X) = \mathrm{tr}(X\log X - X)$ via continuity for rank-deficient matrices. The key to using these divergences in the low-rank setting comes from restricting the convex function $f$ to the range spaces of the matrices. We motivate our approach using Corollary 2, before formalizing it in Section 4.2. Subsequently, we discuss the computation of Bregman projections for low-rank matrices in Section 4.3.

4.1 Motivation

If $X$ and $Y$ have eigendecompositions $X = V\Lambda V^T$ and $Y = U\Theta U^T$, respectively, then whenever $X$ or $Y$ is of low rank, some eigenvalues $\lambda_i$ of $X$ or $\theta_j$ of $Y$ are equal to zero. Consequently, if we could apply Corollary 2, the $D_\varphi(\lambda_i, \theta_j)$ terms that involve zero eigenvalues need careful treatment. More specifically, for the von Neumann divergence we have:
$$D_\varphi(\lambda_i, \theta_j) = \lambda_i\log\lambda_i - \lambda_i\log\theta_j - \lambda_i + \theta_j. \qquad (11)$$
Using the convention that $0\log 0 = 0$, $D_\varphi(\lambda_i, \theta_j)$ equals 0 when both $\lambda_i$ and $\theta_j$ are 0, but is infinite when $\theta_j$ is 0 but $\lambda_i$ is non-zero. Similarly, with the LogDet divergence, we have
$$D_\varphi(\lambda_i, \theta_j) = \frac{\lambda_i}{\theta_j} - \log\frac{\lambda_i}{\theta_j} - 1. \qquad (12)$$
In cases where $\lambda_i = 0$ and $\theta_j \ne 0$, or $\lambda_i \ne 0$ and $\theta_j = 0$, $D_\varphi(\lambda_i, \theta_j)$ is infinite. For finiteness of the corresponding matrix divergence we require that $v_i^T u_j = 0$ whenever $D_\varphi(\lambda_i, \theta_j)$ is infinite, so a cancellation will occur (via appropriate continuity arguments) and the divergence will be finite. This leads to properties of rank-deficient $X$ and $Y$ that are required for the matrix divergence to be finite. The following observations are discussed more formally in Section 4.2.
Observation 1. The von Neumann divergence $D_{vN}(X, Y)$ is finite iff $\mathrm{range}(X) \subseteq \mathrm{range}(Y)$.

The term $-\lambda_i\log\theta_j$ from (11) is $+\infty$ if $\theta_j = 0$ but $\lambda_i \ne 0$. By using Corollary 2 and distributing the $(v_i^T u_j)^2$ term into the scalar divergence, we obtain $-\lambda_i(v_i^T u_j)^2\log\theta_j$. Thus, if $v_i^T u_j = 0$ when $\theta_j = 0$, then $-\lambda_i(v_i^T u_j)^2\log\theta_j = 0$ (using $0\log 0 = 0$), and the divergence is finite. When $\theta_j = 0$, the corresponding eigenvector $u_j$ is in the null space of $Y$; therefore, for finiteness of the divergence, every vector $u_j$ in the null space of $Y$ is orthogonal to any vector $v_i$ in the range space of $X$. This implies that the null space of $Y$ contains the null space of $X$ or, equivalently, $\mathrm{range}(X) \subseteq \mathrm{range}(Y)$.

Observation 2. The LogDet divergence $D_{\ell d}(X, Y)$ is finite iff $\mathrm{range}(X) = \mathrm{range}(Y)$.

To show the observation, we adopt the conventions that $\log\frac{0}{0} = 0$ and $\frac{0}{0} = 1$ in (12), which follow by continuity assuming that the rate at which the numerator and denominator approach zero is the same. Then, distributing the $(v_i^T u_j)^2$ term into $D_\varphi$, we have that $v_i$ and $u_j$ must be orthogonal whenever $\lambda_i = 0, \theta_j \ne 0$ or $\lambda_i \ne 0, \theta_j = 0$. This in turn says that: a) every vector $u_j$ in the null space of $Y$ must be orthogonal to every vector $v_i$ in the range space of $X$ and, b) every vector $v_i$ in the null space of $X$ must be orthogonal to every vector $u_j$ in the range space of $Y$. It follows that the range spaces of $X$ and $Y$ are equal.

Assuming the eigenvalues of $X$ and $Y$ are listed in non-increasing order, we can write the low-rank equivalent of (3) simply as
$$D_f(X, Y) = \sum_{i,j \le r} (v_i^T u_j)^2 D_\varphi(\lambda_i, \theta_j),$$
where $r = \mathrm{rank}(X)$.

If we now revisit our optimization problem formulated in (4), where we aim to minimize $D_f(X, X_0)$, we see that the LogDet and von Neumann divergences naturally maintain rank constraints if the problem is feasible. For the LogDet divergence, the equality of the range spaces of $X$ and $X_0$ implies that when minimizing $D_f(X, X_0)$, we maintain $\mathrm{rank}(X) = \mathrm{rank}(X_0)$, assuming that there is a feasible $X$ with a finite $D_f(X, X_0)$. Similarly, for the von Neumann divergence, the property that $\mathrm{range}(X) \subseteq \mathrm{range}(X_0)$ implies $\mathrm{rank}(X) \le \mathrm{rank}(X_0)$.

4.2 Formalization Via Restrictions on the Range Space

The above section demonstrated informally that for $D_f(X, Y)$ to be finite the range spaces of $X$ and $Y$ must be equal for the LogDet divergence, and the range space of $X$ must be contained in the range space of $Y$ for the von Neumann divergence. We now formalize the generalization to low-rank matrices. Let $W$ be an orthogonal $n \times r$ matrix, such that its columns span the range space of $Y$. We will use the following simple and well known lemma later on.

Lemma 3. Let $Y$ be a symmetric $n \times n$ matrix with $\mathrm{rank}(Y) \le r$, and let $W$ be an $n \times r$ column orthogonal matrix ($W^T W = I$) with $\mathrm{range}(Y) \subseteq \mathrm{range}(W)$. Then $Y = WW^T Y WW^T$.

Proof. Extend $W$ to a full $n \times n$ orthogonal matrix and denote it by $W_f$. Notice that the last $(n - r)$ rows of $W_f^T Y$ and the last $(n - r)$ columns of $YW_f$ consist of only zeros. It follows that the matrix $Y_1 = W_f^T Y W_f$ contains only zeros except for the $r \times r$ sub-matrix in the top left corner, which coincides with $\hat{Y} = W^T Y W$. Observe that $Y = W_f Y_1 W_f^T = W\hat{Y}W^T = WW^T Y WW^T$ to finish the proof.
We are now ready to extend the domain of the von Neumann and LogDet divergences to low-rank matrices.

Definition 4. Consider the positive semidefinite $n \times n$ matrices $X$ and $Y$ that satisfy $\mathrm{range}(X) \subseteq \mathrm{range}(Y)$ when considering the von Neumann divergence, and $\mathrm{range}(X) = \mathrm{range}(Y)$ when considering the LogDet divergence. Let $W$ be an $n \times r$ column orthogonal matrix such that $\mathrm{range}(Y) \subseteq \mathrm{range}(W)$ and define
$$D_f(X, Y) = D_{f,W}(X, Y) = D_f(W^T X W,\ W^T Y W), \qquad (13)$$
where $D_f$ is either the von Neumann or the LogDet divergence.

The first equality in (13) implicitly assumes that the right hand side is not dependent on the choice of $W$.

Lemma 5. In Definition 4, $D_{f,W}(X, Y)$ is independent of the choice of $W$.

Proof. All the $n \times r$ orthogonal matrices with the same range space as $W$ can be expressed as a product $WQ$, where $Q$ is an $r \times r$ orthogonal matrix. Introduce $\hat{X} = W^T X W$ and $\hat{Y} = W^T Y W$ and substitute $WQ$ in place of $W$ in (13):
$$D_{f,WQ}(X, Y) = D_f\big((WQ)^T X (WQ),\ (WQ)^T Y (WQ)\big) = D_f(Q^T\hat{X}Q,\ Q^T\hat{Y}Q) = D_f(\hat{X}, \hat{Y}) = D_{f,W}(X, Y).$$
The first equality is by Definition 4, and the third follows from the fact that the von Neumann and the LogDet divergences are invariant under any orthogonal similarity transformation (see Proposition 11 in Appendix A; in fact, in the case of the LogDet divergence we have invariance under any invertible congruence transformation, see Proposition 12 in Appendix A).

We now show that Definition 4 is consistent with Corollary 2, demonstrating that our low-rank formulation agrees with the informal discussion given earlier.

Theorem 6. Let the positive semidefinite matrices $X$ and $Y$ have eigendecompositions $X = V\Lambda V^T$, $Y = U\Theta U^T$ and let $\mathrm{range}(X) \subseteq \mathrm{range}(Y)$. Let the rank of $Y$ equal $r$. Assuming that the eigenvalues of $X$ and $Y$ are sorted in non-increasing order, that is, $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r$ and $\theta_1 \ge \theta_2 \ge \ldots \ge \theta_r$, then
$$D_{vN}(X, Y) = \sum_{i,j \le r}(v_i^T u_j)^2\big(\lambda_i\log\lambda_i - \lambda_i\log\theta_j - \lambda_i + \theta_j\big).$$

Proof. Denote the upper left $r \times r$ submatrices of $\Lambda$ and $\Theta$ by $\Lambda_r$ and $\Theta_r$ respectively, and the corresponding reduced eigenvector matrices for $X$ and $Y$ by $V_r$ and $U_r$. Picking $W$ in (13) to equal $U_r$, we get
$$D_{vN}(X, Y) = D_{vN}(U_r^T X U_r,\ U_r^T Y U_r) = D_{vN}\big((U_r^T V_r)\Lambda_r(V_r^T U_r),\ \Theta_r\big).$$
The arguments on the right hand side are $r \times r$ matrices and $\Theta_r$ is not rank-deficient. We can now apply Corollary 2 to get the result.
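The following sketch (our own; NumPy/SciPy assumed, function names are ours) evaluates the low-rank von Neumann divergence both via Definition 4, using an orthonormal basis $W$ of $\mathrm{range}(Y)$, and via the spectral sum of Theorem 6, and checks that the two agree for matrices with a shared range space.

```python
import numpy as np
from scipy.linalg import logm, orth

def von_neumann_div_lowrank(X, Y):
    # Definition 4: D_vN(X, Y) = D_vN(W^T X W, W^T Y W), with W an orthonormal
    # basis of range(Y); by Lemma 5 the value does not depend on the choice of W.
    W = orth(Y)
    Xh, Yh = W.T @ X @ W, W.T @ Y @ W
    return np.trace(Xh @ logm(Xh) - Xh @ logm(Yh) - Xh + Yh).real

def von_neumann_div_spectral(X, Y):
    # Theorem 6: sum over the top r = rank(Y) eigenpairs (non-increasing order).
    r = np.linalg.matrix_rank(Y)
    lam, V = np.linalg.eigh(X)
    th, U = np.linalg.eigh(Y)
    lam, V = lam[::-1][:r], V[:, ::-1][:, :r]
    th, U = th[::-1][:r], U[:, ::-1][:, :r]
    ov = (V.T @ U) ** 2
    term = (lam[:, None] * np.log(lam[:, None]) - lam[:, None] * np.log(th[None, :])
            - lam[:, None] + th[None, :])
    return np.sum(ov * term)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    W0 = orth(rng.standard_normal((6, 3)))     # a shared 3-dimensional range space
    A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
    X = W0 @ (A @ A.T) @ W0.T                  # rank-3, range(X) == range(Y)
    Y = W0 @ (B @ B.T) @ W0.T
    print(np.isclose(von_neumann_div_lowrank(X, Y), von_neumann_div_spectral(X, Y)))
```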
Theorem 7. Let the positive semidefinite matrices $X$ and $Y$ have eigendecompositions $X = V\Lambda V^T$, $Y = U\Theta U^T$, and let $\mathrm{range}(X) = \mathrm{range}(Y)$ and assume that the eigenvalues of $X$ and $Y$ are sorted in decreasing order. Then
$$D_{\ell d}(X, Y) = \sum_{i,j \le r}(v_i^T u_j)^2\Big(\frac{\lambda_i}{\theta_j} - \log\frac{\lambda_i}{\theta_j} - 1\Big).$$

Proof. Similar to the proof of Theorem 6. Note that the range spaces must coincide in this case, because the determinant of $XY^{-1}$ should not vanish for the restricted transformations, which agrees with the $\mathrm{range}(X) = \mathrm{range}(Y)$ restriction.

We next show that the optimization problem and Bregman's projection algorithm for low-rank matrices can be cast as a full rank problem in a lower dimensional space, namely the range space. This equivalence implies that we do not have to rederive the convergence proofs and other properties of Bregman's algorithm in the low-rank setting. Consider the optimization problem (4) for low-rank $X_0$, and denote a suitable orthogonal matrix as required in Definition 4 by $W$. The matrix of the eigenvectors of the reduced eigendecomposition of $X_0$ is a suitable choice. Consider the following matrix mapping:
$$M \to \hat{M} = W^T M W.$$
By Lemma 3, the mapping is one-to-one on the set of symmetric matrices with range space contained in $\mathrm{range}(W)$. We now apply the mapping to all matrices in the optimization problem (4) to obtain:
$$\text{minimize } D_f(\hat{X}, \hat{X}_0) \quad \text{subject to } \mathrm{tr}(\hat{X}\hat{A}_i) \le b_i,\ 1 \le i \le c; \quad \mathrm{rank}(\hat{X}) \le r; \quad \hat{X} \succeq 0. \qquad (14)$$
The rank constraint is automatically satisfied when $\mathrm{rank}(X_0) = r$ and the problem is feasible. Clearly, $\hat{X} \succeq 0$ if and only if $X \succeq 0$. By Definition 4, $D_f(\hat{X}, \hat{X}_0) = D_f(X, X_0)$. Finally, the next lemma verifies that the constraints are equivalent as well.

Lemma 8. Given a column orthogonal $n \times r$ matrix $W$ satisfying $\mathrm{range}(X) \subseteq \mathrm{range}(W)$, it follows that $\mathrm{tr}(\hat{X}\hat{A}_i) = \mathrm{tr}(XA_i)$, where $\hat{X} = W^T X W$, $\hat{A}_i = W^T A_i W$.

Proof. Choose a reduced rank-$r$ eigendecomposition of $X$ to be $V\Lambda V^T$ such that the columns of $V$ form an orthogonal basis of $\mathrm{range}(W)$. Note that $\Lambda$ will be singular when $\mathrm{rank}(X)$ is less than $r$. There exists an $r \times r$ orthogonal $Q$ that satisfies $W = VQ$, and so
$$\mathrm{tr}(\hat{X}\hat{A}_i) = \mathrm{tr}\big((W^T X W)(W^T A_i W)\big) = \mathrm{tr}\big(Q^T V^T V\Lambda V^T VQ\, Q^T V^T A_i VQ\big) = \mathrm{tr}\big(VQQ^T\Lambda QQ^T V^T A_i\big) = \mathrm{tr}(V\Lambda V^T A_i) = \mathrm{tr}(XA_i).$$

If we assume that the optimization problem (4) has a (rank-deficient) solution with finite divergence measure, then the corresponding full-rank optimization problem (14) also has a solution. Conversely, by Lemma 3, for a solution $\hat{X}$ of (14), there is a unique solution of (4), namely $X = W\hat{X}W^T$ (with finite $D_f(X, X_0)$) that satisfies the range space restriction.
4.3 Bregman Projections for Rank-Deficient Matrices

Lastly, we derive the explicit updates for Bregman's algorithm in the low-rank setting. From now on we omit the constraint index for simplicity. Recall (7), which we used to calculate the projection update for Bregman's algorithm, and apply it to the mapped problem (14):
$$\nabla f(\hat{X}_{t+1}) = \nabla f(\hat{X}_t) + \alpha\hat{A}, \qquad \mathrm{tr}(\hat{X}_{t+1}\hat{A}) = b.$$
In case of the von Neumann divergence this leads to the update $\hat{X}_{t+1} = \exp(\log(\hat{X}_t) + \alpha\hat{A})$. The discussion in Section 4.2 and induction on $t$ implies that $X_{t+1} = W\hat{X}_{t+1}W^T$ (with $W$ as in Definition 4), or explicitly:
$$X_{t+1} = W\exp\big(\log(W^T X_t W) + \alpha(W^T A W)\big)W^T.$$
If we choose $W = V_t$ from the reduced eigendecomposition $X_t = V_t\Lambda_t V_t^T$, then the update is written as
$$X_{t+1} = V_t\exp\big(\log(\Lambda_t) + \alpha V_t^T A V_t\big)V_t^T, \qquad (15)$$
which we will use in the von Neumann kernel learning algorithm. Note that the limit of $\exp(\log(X_t + \varepsilon I) + \alpha A)$ as $\varepsilon$ approaches zero yields the same formula, which becomes clear if we apply a basis transformation that puts $X_t$ in diagonal form. The same argument applies to the Bregman projection for the LogDet divergence. In this case we arrive at the update
$$X_{t+1} = V_t\big((V_t^T X_t V_t)^{-1} - \alpha(V_t^T A V_t)\big)^{-1}V_t^T, \qquad (16)$$
using Lemma 3 and the fact that $\mathrm{range}(X_{t+1}) = \mathrm{range}(X_t)$. The right hand side of (16) can be calculated without any eigendecomposition as we will show in Section 5.1.1.

5. Algorithms

In this section, we derive efficient algorithms for solving the optimization problem (5) for low-rank matrices using the LogDet and von Neumann divergences.

5.1 LogDet Divergence

We first develop a cyclic projection algorithm to solve (5) when the matrix divergence is the LogDet divergence.

5.1.1 Matrix Updates

Consider minimizing $D_{\ell d}(X, X_0)$, the LogDet divergence between $X$ and $X_0$. As given in (16), the projection update rule for low-rank $X_0$ and a rank-one constraint matrix $A = zz^T$ is
$$X_{t+1} = V_t\big((V_t^T X_t V_t)^{-1} - \alpha(V_t^T zz^T V_t)\big)^{-1}V_t^T,$$
where the eigendecomposition of $X_t$ is $V_t\Lambda_t V_t^T$. Recall the Sherman-Morrison inverse formula (Sherman and Morrison, 1949; Golub and Van Loan, 1996):
$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}.$$
Applying this formula to the middle term of the projection update, we arrive at a simplified expression for $X_{t+1}$:
$$X_{t+1} = V_t\Big(V_t^T X_t V_t + \alpha\frac{V_t^T X_t V_t V_t^T zz^T V_t V_t^T X_t V_t}{1 - \alpha z^T V_t V_t^T X_t V_t V_t^T z}\Big)V_t^T = V_t\Big(\Lambda_t + \alpha\frac{\Lambda_t V_t^T zz^T V_t\Lambda_t}{1 - \alpha z^T V_t\Lambda_t V_t^T z}\Big)V_t^T = X_t + \alpha\frac{X_t zz^T X_t}{1 - \alpha z^T X_t z}. \qquad (17)$$
Since $X_{t+1}$ must satisfy the constraint, i.e., $\mathrm{tr}(X_{t+1}zz^T) = z^T X_{t+1}z = b$, we can solve the following equation for $\alpha$:
$$z^T\Big(X_t + \alpha\frac{X_t zz^T X_t}{1 - \alpha z^T X_t z}\Big)z = b. \qquad (18)$$
Let $p = z^T X_t z$. Note that in the case of distance constraints, $p$ is the squared distance between the two data points. When $p \ne 0$, elementary arguments reveal that there is exactly one solution for $\alpha$ provided that $b \ne 0$. The unique solution, in this case, can be expressed as
$$\alpha = \frac{1}{p} - \frac{1}{b}. \qquad (19)$$
If we let
$$\beta = \frac{\alpha}{1 - \alpha p}, \qquad (20)$$
then our matrix update is given by
$$X_{t+1} = X_t + \beta X_t zz^T X_t. \qquad (21)$$
This update is pleasant and surprising; usually the projection parameter for Bregman's algorithm does not admit a closed form solution (see Section 5.2.2 for the case of the von Neumann divergence). In the case where $p = 0$, (18) rewrites to $b = p/(1 - \alpha p)$, which has a solution if and only if $b = 0$ (while $\alpha$ is arbitrary). It follows that $z$ is in the null space of $X_t$, and by (17), $X_{t+1} = X_t$.
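A minimal sketch of this projection (our own, operating on the full $n \times n$ matrix rather than the factored form of Section 5.1.2) implements (19)-(21) and checks that the constraint is met exactly and the rank is preserved.

```python
import numpy as np

def logdet_projection_step(X, z, b):
    # Exact Bregman projection for the LogDet divergence onto {X : z^T X z = b}:
    #   p = z^T X z,  alpha = 1/p - 1/b,  beta = alpha / (1 - alpha * p),
    #   X_new = X + beta * X z z^T X   (equations (19)-(21)).
    p = z @ X @ z
    if p == 0:                          # z in the null space of X: feasible only if b = 0
        return X
    alpha = 1.0 / p - 1.0 / b
    beta = alpha / (1.0 - alpha * p)
    Xz = X @ z
    return X + beta * np.outer(Xz, Xz)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    G = rng.standard_normal((6, 3))
    X0 = G @ G.T                        # rank-3 input kernel
    z = np.zeros(6); z[0], z[3] = 1.0, -1.0
    X1 = logdet_projection_step(X0, z, b=0.5)
    print(np.isclose(z @ X1 @ z, 0.5))                            # constraint holds exactly
    print(np.linalg.matrix_rank(X0), np.linalg.matrix_rank(X1))   # rank preserved: 3 3
```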
The following lemma confirms the expectation that we remain in the positive semidefinite cone and that the range space is unchanged.

Lemma 9. Given a positive semidefinite matrix $X_t$, the matrix $X_{t+1}$ from the update in (21) is positive semidefinite with $\mathrm{range}(X_{t+1}) = \mathrm{range}(X_t)$, assuming that (6) is feasible.

Proof. Factor the positive semidefinite matrix $X_t$ as $GG^T$, where $G$ is an $n \times r$ matrix and $r = \mathrm{rank}(X_t)$. The LogDet update produces
$$X_{t+1} = GG^T + \beta GG^T zz^T GG^T = G(I + \beta G^T zz^T G)G^T.$$
We deduce immediately that $\mathrm{range}(X_{t+1}) \subseteq \mathrm{range}(G) = \mathrm{range}(X_t)$. In order to prove that $X_{t+1}$ is positive semidefinite and that the range space does not shrink, it suffices to show that all eigenvalues of $\beta G^T zz^T G$ are strictly larger than $-1$, implying that $I + \beta G^T zz^T G$ remains positive definite. The only non-zero eigenvalue of the rank-one $\beta G^T zz^T G$ equals $\mathrm{tr}(\beta G^T zz^T G) = \beta z^T GG^T z = \beta p$. According to (19) and (20) we calculate $\beta p = b/p - 1$, and noting that $b, p > 0$ completes the proof.

The update for $X_{t+1}$ can alternatively be written using the pseudoinverse (Golub and Van Loan, 1996):
$$X_{t+1} = (X_t^\dagger - \alpha\tilde{A}_i)^\dagger,$$
where $\tilde{A}_i = WW^T A_i WW^T$ for an orthogonal matrix $W$ satisfying $\mathrm{range}(W) = \mathrm{range}(X_t)$. (In Kulis et al. (2006), we had inadvertently used $A_i$ instead of $\tilde{A}_i$.)

Lemma 10. Let $\tilde{A}_i = WW^T A_i WW^T$ and $W$ be an orthogonal $n \times r$ matrix such that $\mathrm{range}(W) = \mathrm{range}(X_t)$. Then the following updates are equivalent:
a) $X_{t+1} = W\big((W^T X_t W)^{-1} - \alpha(W^T A_i W)\big)^{-1}W^T$
b) $X_{t+1} = (X_t^\dagger - \alpha\tilde{A}_i)^\dagger$

Proof. Since $\mathrm{range}(W) = \mathrm{range}(X_t)$, by Lemma 3 we have $\hat{X}_t = W^T X_t W$ and $X_t = W\hat{X}_t W^T$, where $\hat{X}_t$ is an invertible $r \times r$ matrix. Note that $W\hat{X}_t^{-1}W^T$ satisfies the properties required for the Moore-Penrose pseudoinverse, for example
$$X_t(W\hat{X}_t^{-1}W^T)X_t = W\hat{X}_t W^T W\hat{X}_t^{-1}W^T W\hat{X}_t W^T = W\hat{X}_t W^T = X_t.$$
Thus $X_t^\dagger = W\hat{X}_t^{-1}W^T$ and we finish the proof by rewriting
$$(X_t^\dagger - \alpha\tilde{A}_i)^\dagger = (W\hat{X}_t^{-1}W^T - \alpha WW^T A_i WW^T)^\dagger = \big(W(\hat{X}_t^{-1} - \alpha W^T A_i W)W^T\big)^\dagger = W\big((W^T X_t W)^{-1} - \alpha(W^T A_i W)\big)^{-1}W^T.$$

5.1.2 Update for the Factored Matrix

A naive implementation of the update given in (21) costs $O(n^2)$ operations per iteration. However, we can achieve a more efficient update for low-rank matrices by working on a suitable factored form of the matrix $X_t$, resulting in an $O(r^2)$ algorithm. Both the reduced eigendecomposition and the Cholesky factorization are possible candidates for the factorization; we prefer the latter because the resulting algorithm does not have to rely on iterative methods. The positive semidefinite rank-$r$ matrix $X_t$ can be factored as $GG^T$, where $G$ is an $n \times r$ matrix, and thus the update can be written as
$$X_{t+1} = G(I + \beta G^T zz^T G)G^T.$$
The matrix $I + \beta\tilde{z}_i\tilde{z}_i^T$, where $\tilde{z}_i = G^T z$, is an $r \times r$ matrix. To update $G$ for the next iteration, we factor this matrix as $LL^T$; then our new $G$ is updated to $GL$. Since $I + \beta\tilde{z}_i\tilde{z}_i^T$ is a rank-one perturbation of the identity, this update can be done in $O(r^2)$ time using a standard Cholesky rank-one update routine. To increase computational efficiency, we note that $G = G_0 B$, where $B$ is the product of all the $L$ matrices from every iteration and $X_0 = G_0 G_0^T$. Instead of updating $G$ explicitly at each iteration, we simply update $B$ to $BL$. The matrix $I + \beta G^T zz^T G$ is then $I + \beta B^T G_0^T zz^T G_0 B$. In the case of distance constraints, we can compute $G_0^T z$ in $O(r)$ time as the difference of two rows of $G_0$. The multiplication update $BL$ appears to have $O(r^3)$ complexity, dominating the runtime. In the next section we derive an algorithm that combines the Cholesky rank-one update with the matrix multiplication into a single $O(r^2)$ routine.

5.1.3 Fast Multiplication with a Cholesky Update Factor

We efficiently combine the Cholesky rank-one update with the matrix multiplication in CHOLUPDATEMULT, given by Algorithm 1. A simple analysis of this algorithm reveals that it requires $3r^2 + 2r$ floating point operations (flops). This is opposed to the usual $O(r^3)$ time needed by matrix multiplication. We devote this section to the development of this fast multiplication algorithm.

Algorithm 1 CHOLUPDATEMULT(a, x, B): Right multiplication of a lower triangular $r \times r$ matrix $B$ with the Cholesky factor of $I + axx^T$ in $O(r^2)$ time.
Input: $a$, $x$, $B$, with $I + axx^T \succ 0$, $B$ lower triangular.
Output: $B \leftarrow BL$, with $LL^T = I + axx^T$.
1: $a_1 = a$
2: for $i = 1$ to $r$ do
3:   $t = 1 + a_i x_i^2$
4:   $h_i = \sqrt{t}$
5:   $a_{i+1} = a_i / t$
6:   $t = B_{ii}$
7:   $s = 0$
8:   $B_{ii} = B_{ii} h_i$
9:   for $j = i - 1$ downto $1$ do
10:    $s = s + t\, x_{j+1}$
11:    $t = B_{ij}$
12:    $B_{ij} = (B_{ij} + a_{j+1} x_j s)\, h_j$
13:  end for
14: end for

Recall the algorithm used for the Cholesky factorization of an $r \times r$ matrix $A$ (see Demmel, 1997, page 78):
for $j = 1$ to $r$ do
  $l_{jj} = \big(a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2\big)^{1/2}$
  for $i = j + 1$ to $r$ do
    $l_{ij} = \big(a_{ij} - \sum_{k=1}^{j-1} l_{ik}l_{jk}\big)/l_{jj}$
  end for
end for
We will derive a corresponding algorithm in a manner similar to Demmel (1997), while exploiting the special structure present in our problem. Let us denote $I_r + a_1 xx^T$ by $A$ (with $a_1 = a$) and write it as a product of three block matrices:
$$A = \begin{bmatrix}\sqrt{1 + a_1x_1^2} & 0 \\ \frac{a_1x_1}{\sqrt{1 + a_1x_1^2}}\,x_{2:r} & I_{r-1}\end{bmatrix}\begin{bmatrix}1 & 0 \\ 0 & \tilde{A}_{22}\end{bmatrix}\begin{bmatrix}\sqrt{1 + a_1x_1^2} & \frac{a_1x_1}{\sqrt{1 + a_1x_1^2}}\,x_{2:r}^T \\ 0 & I_{r-1}\end{bmatrix}.$$
It follows that $\tilde{A}_{22} + \frac{a_1^2x_1^2}{1 + a_1x_1^2}x_{2:r}x_{2:r}^T = A_{22} = I_{r-1} + a_1x_{2:r}x_{2:r}^T$, leading to
$$\tilde{A}_{22} = I_{r-1} + \frac{a_1}{1 + a_1x_1^2}x_{2:r}x_{2:r}^T.$$
Introduce $a_2 = a_1/(1 + a_1x_1^2)$ and proceed by induction. We extract the following algorithm which calculates $L$ satisfying $LL^T = A$:
for $j = 1$ to $r$ do
  $t = 1 + a_j x_j^2$
  $l_{jj} = \sqrt{t}$
  $a_{j+1} = a_j / t$
  $t = a_j x_j / l_{jj}$
  for $i = j + 1$ to $r$ do
    $l_{ij} = t\, x_i$
  end for
end for
The above algorithm uses $\frac{1}{2}r^2 + \frac{13}{2}r$ flops to calculate $L$, while $\frac{1}{2}r^3 + O(r^2)$ are needed for the general algorithm. However, we do not necessarily have to calculate $L$ explicitly, since the parameters $a_i$ together with $x$ implicitly determine $L$. Notice that the cost of calculating $a_1, a_2, \ldots, a_r$ is linear in $r$. Next we show how to calculate $u^T L$ for a given vector $u$ without explicitly calculating $L$, and arrive at an $O(r)$ algorithm for this vector-matrix multiplication. The elements of $v^T = u^T L$ are equal to
$$v_j = u_j\sqrt{1 + a_jx_j^2} + a_{j+1}\sqrt{1 + a_jx_j^2}\;x_j\,(u_{j+1}x_{j+1} + u_{j+2}x_{j+2} + \cdots + u_rx_r), \quad j = 1, \ldots, r-1,$$
and $v_r = u_r\sqrt{1 + a_rx_r^2}$. We can avoid the recalculation of some intermediate results if we evaluate $v_r$ first, followed by $v_{r-1}$ up to $v_1$. This strategy leads to the following algorithm:
$v_r = u_r\sqrt{1 + a_rx_r^2}$
$s = 0$
for $j = r - 1$ downto $1$ do
  $s = s + u_{j+1}x_{j+1}$
  $t = \sqrt{1 + a_jx_j^2}$
  $v_j = (u_j + a_{j+1}x_j s)\,t$
end for
Exactly $11r - 6$ flops are required by the above algorithm, and therefore we can readily multiply an $r \times r$ matrix by $L$ using $11r^2 - 6r$ flops. Even fewer flops are sufficient to implement the matrix multiplication if we observe that the square root expression above is repeatedly calculated for each row, since it depends only on $x_j$ and $a_j$. Additionally, when multiplying with a lower triangular matrix, the presence of zeros allows further simplifications. Taking these considerations into account we arrive at the previously presented Algorithm 1, which uses exactly $3r^2 + 2r$ flops.

5.1.4 Kernel Learning Algorithm

We are now ready to present the overall algorithm for distance inequality constraints using the LogDet divergence; see Algorithm 2.

Algorithm 2 Learning a low-rank kernel matrix in LogDet divergence under distance constraints.
Input: $G_0$: input kernel factor matrix, that is, $X_0 = G_0G_0^T$; $r$: desired rank; $\{A_i\}_{i=1}^c$: distance constraints, where $A_i = (e_{i_1} - e_{i_2})(e_{i_1} - e_{i_2})^T$.
Output: $G$: output low-rank kernel factor matrix, that is, $X = GG^T$.
1: Set $B = I_r$, $i = 1$, and $\nu_j = 0$ for all constraints $j$.
2: repeat
3:   $v^T = G_0(i_1,:) - G_0(i_2,:)$
4:   $w = B^T v$
5:   $\alpha = \min\big(\nu_i,\ \frac{1}{\|w\|^2} - \frac{1}{b_i}\big)$
6:   $\nu_i \leftarrow \nu_i - \alpha$
7:   $\beta = \alpha/(1 - \alpha\|w\|^2)$
8:   Call CHOLUPDATEMULT($\beta$, $w$, $B$) to factor $I + \beta ww^T = LL^T$ and update $B \leftarrow BL$.
9:   Set $i \leftarrow \mathrm{mod}(i + 1, c)$.
10: until convergence of $\nu$
11: return $G = G_0B$

As discussed in the previous section, every projection can be done in $O(r^2)$ time. Thus, cycling through all $c$ constraints requires $O(cr^2)$ time, but the total number of Bregman iterations may be large. The only dependence on the number of data points $n$ occurs in steps 3 and 11. The last step, multiplying $G = G_0B$, takes $O(nr^2)$ time. As discussed earlier, convergence is guaranteed since we have mapped the original low-rank problem into a full-rank problem in a lower-dimensional space. Convergence can be checked by using the dual variables $\nu$. The cyclic projection algorithm can be viewed as a dual coordinate ascent algorithm, thus convergence can be measured as follows: after cycling through all constraints, we check to see how much $\nu$ has changed after a full cycle. At convergence, this change (as measured with a vector norm) should be small.
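The following Python sketch (ours) mirrors the control flow of Algorithm 2 under the stated assumptions, except that it refactors $I + \beta ww^T$ with a dense $O(r^3)$ Cholesky call in place of the $O(r^2)$ CHOLUPDATEMULT routine; it is meant to illustrate the structure of the algorithm, not the optimized implementation.

```python
import numpy as np

def learn_lowrank_kernel_logdet(G0, constraints, tol=1e-3, max_cycles=1000):
    """Learn X = G G^T under distance constraints (i1, i2, b) meaning
    ||phi(x_i1) - phi(x_i2)||^2 <= b, using LogDet Bregman projections."""
    n, r = G0.shape
    B = np.eye(r)
    nu = np.zeros(len(constraints))              # dual variables, step 1
    for _ in range(max_cycles):
        nu_old = nu.copy()
        for i, (i1, i2, b) in enumerate(constraints):
            v = G0[i1] - G0[i2]                  # step 3
            w = B.T @ v                          # step 4
            p = w @ w                            # current squared distance z^T X z
            alpha = min(nu[i], 1.0 / p - 1.0 / b)   # step 5 (inequality correction)
            nu[i] -= alpha                       # step 6
            beta = alpha / (1.0 - alpha * p)     # step 7
            L = np.linalg.cholesky(np.eye(r) + beta * np.outer(w, w))
            B = B @ L                            # step 8 (stand-in for CHOLUPDATEMULT)
        if np.linalg.norm(nu - nu_old) < tol:    # step 10: convergence of nu
            break
    return G0 @ B                                # step 11
```

As in the paper, convergence is judged by how much the dual vector $\nu$ changes over one full cycle through the constraints.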
5.2 Von Neumann Divergence

In this section we develop a cyclic projection algorithm to solve (5) when the matrix divergence is the von Neumann divergence.

5.2.1 Matrix Updates

Consider minimizing $D_{vN}(X, X_0)$, the von Neumann divergence between $X$ and $X_0$. Recall the projection update (15) for constraint $i$:
$$X_{t+1} = V_t\exp\big(\log(\Lambda_t) + \alpha V_t^T zz^T V_t\big)V_t^T.$$
If the eigendecomposition of the exponent $\log(\Lambda_t) + \alpha V_t^T zz^T V_t$ is $U_t\Theta_{t+1}U_t^T$, then the eigendecomposition of $X_{t+1}$ is given by $V_{t+1} = V_tU_t$ and $\Lambda_{t+1} = \exp(\Theta_{t+1})$. This special eigenvalue problem (diagonal plus rank-one update) can be solved in $O(r^2)$ time; see Golub (1973), Gu and Eisenstat (1994) and Demmel (1997). This means that the matrix multiplication $V_{t+1} = V_tU_t$ becomes the most expensive step in the computation, yielding $O(nr^2)$ complexity. We reduce this cost by modifying the decomposition slightly. Let $X_t = V_tW_t\Lambda_tW_t^TV_t^T$ be the factorization of $X_t$, where $W_t$ is an $r \times r$ orthogonal matrix, initially $W_0 = I_r$, while $V_t$ and $\Lambda_t$ are defined as before. The matrix update becomes
$$X_{t+1} = V_tW_t\exp\big(\log\Lambda_t + \alpha W_t^TV_t^Tzz^TV_tW_t\big)W_t^TV_t^T,$$
yielding the following formulae:
$$V_{t+1} = V_t, \qquad W_{t+1} = W_tU_t, \qquad \Lambda_{t+1} = \exp(\Theta_{t+1}),$$
where $\log\Lambda_t + \alpha W_t^TV_t^Tzz^TV_tW_t = U_t\Theta_{t+1}U_t^T$. For a general rank-one constraint the product $V_t^Tz$ is calculated in $O(nr)$ time, but for distance constraints $O(r)$ operations are sufficient. The calculation of $W_t^TV_t^Tz$ and computing the eigendecomposition $U_t\Theta_{t+1}U_t^T$ will both take $O(r^2)$ time. The matrix product $W_tU_t$ appears to cost $O(r^3)$ time, but in fact a right multiplication by $U_t$ can be approximated very accurately in $O(r^2\log r)$ and even in $O(r^2)$ time using the fast multipole method; see Barnes and Hut (1986) and Greengard and Rokhlin (1987). Since we repeat the above update calculations until convergence, we can avoid calculating the logarithm of $\Lambda_t$ at every step by maintaining $\Theta_t = \log\Lambda_t$ throughout the algorithm.

5.2.2 Computing the Projection Parameter

In the previous section, we assumed that the projection parameter $\alpha$ has already been calculated. In contrast to the LogDet divergence, this parameter does not have a closed form solution. Instead, we must compute $\alpha$ by solving the nonlinear system of equations given by (7). In the von Neumann case and in the presence of distance constraints, the problem amounts to finding the unique root of the function
$$f(\alpha) = z^TV_tW_t\exp\big(\Theta_t + \alpha W_t^TV_t^Tzz^TV_tW_t\big)W_t^TV_t^Tz - b.$$
It can be shown that $f(\alpha)$ is monotone in $\alpha$; see Sustik and Dhillon (2008). Using the approach from the previous section, $f(\alpha)$ can be computed in $O(r^2)$ time. One natural choice to find the root of $f(\alpha)$ is to apply Brent's general root-finding method (Brent, 1973), which does not need the calculation of the derivative of $f(\alpha)$.
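As an illustration of this root-finding step, the sketch below (ours; SciPy's Brent solver as a stand-in for the custom root finder) evaluates $f(\alpha) = w^T\exp(\Theta_t + \alpha ww^T)w - b$ with $w = W_t^TV_t^Tz$, brackets the root by exploiting monotonicity, and then performs the eigendecomposition update of Section 5.2.1.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import brentq

def f_alpha(alpha, Theta, w, b):
    # f(alpha) = w^T exp(Theta + alpha w w^T) w - b, with Theta = log(Lambda_t)
    # and w = W_t^T V_t^T z, so that z^T X_{t+1} z = w^T exp(...) w.
    return w @ expm(Theta + alpha * np.outer(w, w)) @ w - b

def von_neumann_projection(Theta, w, b):
    # Bracket the unique root of the monotone f, solve with Brent's method,
    # then update the factors: Theta + alpha w w^T = U Theta_new U^T.
    lo, hi = -1.0, 1.0
    while f_alpha(lo, Theta, w, b) > 0:
        lo *= 2.0
    while f_alpha(hi, Theta, w, b) < 0:
        hi *= 2.0
    alpha = brentq(f_alpha, lo, hi, args=(Theta, w, b))
    theta_new, U = np.linalg.eigh(Theta + alpha * np.outer(w, w))
    return alpha, theta_new, U            # W_{t+1} = W_t U, Lambda_{t+1} = exp(theta_new)

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    Theta = np.diag(rng.standard_normal(4))      # current log-eigenvalues
    w = rng.standard_normal(4)
    b = 0.5 * (w @ expm(Theta) @ w)              # a satisfiable target value
    alpha, theta_new, U = von_neumann_projection(Theta, w, b)
    print(abs(f_alpha(alpha, Theta, w, b)) < 1e-8)   # constraint satisfied at the root
```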
We have built an even more efficient custom root finder that is optimized for this problem. We rarely need more than six evaluations per projection to accurately compute $\alpha$. A more detailed description of our root finder is beyond the scope of this paper; see Sustik and Dhillon (2008) for details.

5.2.3 Kernel Learning Algorithm

The algorithm for distance inequality constraints using the von Neumann divergence is given as Algorithm 3. By using the fast multipole method every projection can be done in $O(r^2)$ time. Note that the asymptotic running time of this algorithm is the same as the LogDet divergence algorithm, although the root finding step makes Algorithm 3 slightly slower than Algorithm 2.

Algorithm 3 Learning a low-rank kernel matrix in von Neumann matrix divergence under distance constraints.
Input: $V_0$, $\Lambda_0$: input kernel factors, i.e., $X_0 = V_0\Lambda_0V_0^T$; $r$: rank of desired kernel matrix; $\{A_i\}_{i=1}^c$: distance constraints, where $A_i = (e_{i_1} - e_{i_2})(e_{i_1} - e_{i_2})^T$.
Output: $V$, $\Lambda$: output low-rank kernel factors, i.e., $X = V\Lambda V^T$.
1: Set $W = I_r$, $\Theta = \log\Lambda_0$, $i = 1$, and $\nu_j = 0$ for all constraints $j$.
2: repeat
3:   $v^T = V_0(i_1,:) - V_0(i_2,:)$
4:   $w = W^Tv$
5:   Call ZEROFINDER($\Theta$, $w$, $b_i$) to determine $\alpha$.
6:   $\beta = \min(\nu_i, \alpha)$
7:   $\nu_i \leftarrow \nu_i - \beta$
8:   Compute the eigendecomposition $\Theta + \beta ww^T = U\tilde{\Theta}U^T$, set $\Theta \leftarrow \tilde{\Theta}$.
9:   Update $W \leftarrow WU$.
10:  $i = \mathrm{mod}(i + 1, c)$
11: until convergence of $\nu$
12: return $V = V_0W$, $\Lambda = \exp(\Theta)$

5.3 Limitations of our Approach

We conclude this section by briefly mentioning some limitations of our approach. First, though we are able to learn low-rank kernel matrices in our framework, the initial kernel matrix must be low-rank. As a result, we cannot use our methods to reduce the dimensionality of our data. Secondly, the method of Bregman projections may require iterating many cycles through all constraints to reach convergence. Although we have heavily optimized each iteration in this paper, it may be beneficial to search for an algorithm with faster convergence.

6. Discussion

We now further develop several aspects of the algorithms presented in the previous section. In particular, we discuss ways to move beyond the transductive setting in our framework for learning kernels, we briefly overview some generalizations of the problem formulation, we highlight how special cases of our method are related to existing techniques, and we briefly analyze connections between our approach and semidefinite programming.

6.1 Handling New Data Points

The kernel learning algorithms of Section 5 are restricted to learning in the transductive setting; that is, we assume that we have all the data points up front, but labels or other forms of supervision are only available for some of the data points. This approach suffers when new, unseen data points are added, since this would require re-learning the entire kernel matrix. Though we do not consider this situation in depth in the current paper, we can circumvent it. When learning a kernel matrix minimizing either the LogDet divergence or the von Neumann divergence, the range space restriction implies that the learned kernel $K$ has the form $K = G_0BB^TG_0^T$, where $B$ is $r \times r$ and the input kernel is $K_0 = G_0G_0^T$. We can view the matrix $B$ as a linear transformation applied to our input data vectors $G_0$ and therefore we can apply this linear transformation to new points. In particular, recent work (Davis et al., 2007) has shown that the learning algorithms considered in this paper can equivalently be viewed as learning a Mahalanobis distance function given constraints on the data, which is simply the Euclidean distance after applying a linear transformation over the input data. Since our algorithms can be viewed as Mahalanobis distance learning techniques, it is natural to compare against other existing metric learning algorithms. Such methods include metric learning by collapsing classes (MCML) (Globerson and Roweis, 2005), large-margin nearest neighbor metric learning (LMNN) (Weinberger et al., 2005), and many others. In the experimental results section, we provide some results comparing our algorithms with these existing methods.

6.2 Generalizations

There are several simple ways to extend the algorithms developed in Section 5. In this section, we introduce slack variables and discuss more general constraints than the distance constraints discussed earlier.

6.2.1 Slack Variables

In many cases, especially if the number of constraints is large, no feasible solution will exist to the Bregman divergence optimization problem given in (4). When no feasible solution exists, a common approach is to incorporate slack variables, which allow constraints to be violated but penalize such violations.
There are many ways to introduce slack variables into the optimization problem (4). We add a new vector variable $b$ with coordinates representing perturbed right hand sides of the linear constraints, and use the corresponding vector divergence to measure the deviation from the original constraints described by the input vector $b_0$. The resulting optimization problem is as follows:
$$\text{minimize}_{X,b}\ D_f(X, X_0) + \gamma D_\varphi(b, b_0) \quad \text{subject to } \mathrm{tr}(XA_i) \le e_i^Tb,\ 1 \le i \le c; \quad X \succeq 0. \qquad (22)$$
Note that we use $D_\varphi(b, b_0)$ as the penalty for constraints as it is computationally simple; other choices of constraint penalty are possible, as long as the resulting objective function is convex. The $\gamma > 0$ parameter governs the tradeoff between satisfying the constraints and minimizing the divergence between $X$ and $X_0$. Note that we have also removed the implicit rank constraint for simplicity.

We form the Lagrangian to solve the above problem; recall the similar system of equations (7) in Section 3.4. Define $L(X_{t+1}, b_{t+1}, \alpha_i) = D_f(X_{t+1}, X_t) + \gamma D_\varphi(b_{t+1}, b_t) + \alpha_i(e_i^Tb_{t+1} - \mathrm{tr}(X_{t+1}A_i))$, and set the gradients to zero with respect to $X_{t+1}$, $b_{t+1}$ and $\alpha_i$ to get the following update equations:
$$\nabla f(X_{t+1}) = \nabla f(X_t) + \alpha_iA_i, \qquad \nabla\varphi(b_{t+1}) = \nabla\varphi(b_t) - \frac{\alpha_i}{\gamma}e_i, \qquad \mathrm{tr}(X_{t+1}A_i) = e_i^Tb_{t+1}. \qquad (23)$$
In particular, for the LogDet divergence we arrive at the following updates:
$$X_{t+1}^{-1} = X_t^{-1} - \alpha_iA_i, \qquad e_i^Tb_{t+1} = \frac{\gamma\, e_i^Tb_t}{\gamma + \alpha_i e_i^Tb_t}, \qquad \mathrm{tr}(X_{t+1}A_i) - e_i^Tb_{t+1} = 0.$$
Assuming $A_i = z_iz_i^T$, we can still compute $\alpha_i$ in closed form as
$$\alpha_i = \frac{\gamma}{\gamma + 1}\Big(\frac{1}{p} - \frac{1}{b_i}\Big), \qquad \text{where } p = z_i^TX_tz_i,\ b_i = e_i^Tb_t.$$
The $i$th element of $b_t$ is updated to $\gamma b_i/(\gamma + \alpha_ib_i)$ and the matrix update is calculated as in the non-slack case. In case of the von Neumann divergence the update rules turn out to be
$$\log X_{t+1} = \log X_t + \alpha_iA_i, \qquad e_i^Tb_{t+1} = e_i^Tb_t\,e^{-\alpha_i/\gamma}, \qquad \mathrm{tr}(X_{t+1}A_i) - e_i^Tb_{t+1} = 0.$$
The projection parameter $\alpha_i$ is not available in closed form; instead we calculate it as the root of the non-linear equation
$$\log\mathrm{tr}\big(\exp(\log X_t + \alpha_iA_i)A_i\big) + \frac{\alpha_i}{\gamma} - \log e_i^Tb_t = 0.$$
The scale invariance of the LogDet divergence implies that $D_{\ell d}(b, b_0) = D_{\ell d}(b/b_0, 1)$ where the division is elementwise, and therefore we implicitly measure "relative" error, that is, the magnitude of the coordinates of vector $b_0$ are naturally taken into account. For the von Neumann divergence scale invariance does not hold, but one may alternatively use $D_{vN}(b/b_0, 1)$ as the penalty function.
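For the LogDet case, the slack updates above admit a compact implementation. The sketch below (ours, full-matrix form, with the inequality correction of (8) omitted for brevity) computes the closed-form projection parameter, the perturbed right-hand side, and the matrix update as in the non-slack case.

```python
import numpy as np

def logdet_slack_projection(X, z, b_i, gamma):
    # One LogDet projection with slack (Section 6.2.1):
    #   alpha = gamma/(gamma+1) * (1/p - 1/b_i), p = z^T X z,
    #   new right-hand side: gamma * b_i / (gamma + alpha * b_i),
    #   matrix update as in (21).
    p = z @ X @ z
    alpha = gamma / (gamma + 1.0) * (1.0 / p - 1.0 / b_i)
    beta = alpha / (1.0 - alpha * p)
    X_new = X + beta * np.outer(X @ z, X @ z)
    b_new = gamma * b_i / (gamma + alpha * b_i)
    return X_new, b_new, alpha
```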
6.2.2 Other Constraints

In earlier sections, we focused on the case of distance constraints. We now briefly discuss generalizing to other constraints. When the constraints are similarity constraints (i.e., $K_{jl} \le b_i$), the updates can be easily modified. As discussed earlier, we must retain symmetry of the $A_i$ constraint matrices, so for similarity constraints, we require $A_i = \frac{1}{2}(e_je_l^T + e_le_j^T)$. Notice that the constraint matrix now has rank two. This complicates the algorithms slightly; for example, with the LogDet divergence, the Sherman-Morrison formula must be applied twice, leading to a more complicated update rule (albeit one that still has a closed-form solution and can be computed in $O(r^2)$ time).

Other constraints are possible as well; for example, one could incorporate distance constraints such as $\|a_j - a_k\|^2 \le \|a_l - a_m\|^2$, or further similarity constraints such as $K_{jk} \ge K_{lm}$. These are sometimes referred to as relative comparison constraints, and are useful in ranking algorithms (Schultz and Joachims, 2003); for these also, the cost per projection remains $O(r^2)$. Finally, arbitrary linear constraints can also be applied; the cost per projection update will then be $O(nr)$.

6.3 Special Cases

If we use the von Neumann divergence, let $r = n$, and set $b_i = 0$ for all constraints, we exactly obtain the DefiniteBoost optimization problem from Tsuda et al. (2005). In this case, our algorithm computes the projection update in $O(n^2)$ time. In contrast, the algorithm from Tsuda et al. (2005) computes the projection in a more expensive manner in $O(n^3)$ time. Another difference with our approach is that we compute the exact projection, whereas Tsuda et al. (2005) compute an approximate projection. Though computing an approximate projection may lead to a faster per-iteration cost, it takes many more iterations to converge to the optimal solution. We illustrate this further in the experimental results section.

The online-PCA problem discussed in Warmuth and Kuzmin (2006) employs a similar update based on the von Neumann divergence. As with the DefiniteBoost algorithm, the projection is not an exact Bregman projection; however, the projection can be computed in the same way as in our von Neumann kernel learning projection. As a result, the cost of an iteration of online-PCA can be improved from $O(n^3)$ to $O(n^2)$ with our approach.

Another special case is the nearest correlation matrix problem (Higham, 2002) that arises in financial applications. A correlation matrix is a positive semidefinite matrix with unit diagonal. For this case, we set the constraints to be $K_{ii} = 1$ for all $i$. Our algorithms from this paper give new divergence measures and methods for finding low-rank correlation matrices. Previous algorithms scale cubically in $n$, whereas our method scales linearly with $n$ and quadratically with $r$ for low-rank correlation matrices.

Our formulation can also be employed to solve a problem similar to that of Weinberger et al. (2004). The enforced constraints on the kernel matrix (centering and isometry) are linear, and thus can be encoded into our framework. The only difference is that Weinberger et al. maximize the trace of $K$, whereas we minimize a matrix divergence. A comparison between these approaches is a potential area of future research.
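As an illustration of the nearest correlation matrix special case, the following sketch (ours, full-rank for simplicity) cycles exact LogDet projections onto the equality constraints $X_{ii} = 1$; because these are equality constraints, no dual-variable corrections are needed.

```python
import numpy as np

def nearest_correlation_logdet(X0, tol=1e-10, max_cycles=1000):
    # Cyclic Bregman projections for the constraints X_ii = 1 (Section 6.3).
    # Each projection uses the closed-form LogDet update (21) with z = e_i, b = 1.
    X = X0.copy()
    n = X.shape[0]
    for _ in range(max_cycles):
        for i in range(n):
            p = X[i, i]
            alpha = 1.0 / p - 1.0
            beta = alpha / (1.0 - alpha * p)
            X = X + beta * np.outer(X[:, i], X[:, i])
        if np.max(np.abs(np.diag(X) - 1.0)) < tol:
            break
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    A = rng.standard_normal((5, 5))
    C = nearest_correlation_logdet(A @ A.T + np.eye(5))
    print(np.allclose(np.diag(C), 1.0), np.all(np.linalg.eigvalsh(C) > 0))
```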
6.4 Connections to Semidefinite Programming

In this section, we present a connection between minimizing the LogDet divergence and the solution to a semidefinite programming problem (SDP). As an example, we consider the min-balanced-cut problem. Suppose we are given an $n$-vertex graph, whose adjacency matrix is $A$. Let $L = \mathrm{diag}(Ae) - A$ be the Laplacian of $A$, where $e$ is the vector of all ones. The semidefinite relaxation to the minimum balanced cut problem (Lang, 2005) results in the following SDP:
$$\min_X\ \mathrm{tr}(LX) \quad \text{subject to } \mathrm{diag}(X) = e, \quad \mathrm{tr}(Xee^T) = 0, \quad X \succeq 0.$$
Let $L^\dagger$ denote the pseudoinverse of the Laplacian, and let $V$ be an orthonormal basis for the range space of $L$. Let $r$ denote the rank of $L$; it is well known that $n - r$ equals the number of connected components of the graph. Consider the LogDet divergence between $X$ and $\varepsilon L^\dagger$:
$$D_{\ell d}(X, \varepsilon L^\dagger) = D_{\ell d}\big(V^TXV,\ V^T(\varepsilon L^\dagger)V\big)$$
$$= \mathrm{tr}\big(V^TXV(V^T(\varepsilon L^\dagger)V)^{-1}\big) - \log\det\big(V^TXV(V^T(\varepsilon L^\dagger)V)^{-1}\big) - r$$
$$= \mathrm{tr}\Big(V^TXV\,\tfrac{1}{\varepsilon}V^TLV\Big) - \log\det\Big(V^TXV\,\tfrac{1}{\varepsilon}V^TLV\Big) - r$$
$$= \tfrac{1}{\varepsilon}\mathrm{tr}(LX) - \log\det\Big(\tfrac{1}{\varepsilon}V^TXV\,V^TLV\Big) - r$$
$$= \tfrac{1}{\varepsilon}\mathrm{tr}(LX) - \log\det\big(V^TXV\,V^TLV\big) - r - \log(\varepsilon^{-r}).$$
The fourth line uses Lemma 8 to replace $\mathrm{tr}(V^TXV(\tfrac{1}{\varepsilon}V^TLV))$ with $\tfrac{1}{\varepsilon}\mathrm{tr}(LX)$. If we aim to minimize this divergence, we can drop the last two terms, as they are constants. In this case, we have
$$\arg\min_X D_{\ell d}(X, \varepsilon L^\dagger) = \arg\min_X\ \tfrac{1}{\varepsilon}\mathrm{tr}(LX) - \log\det(V^TXV\,V^TLV) = \arg\min_X\ \mathrm{tr}(LX) - \varepsilon\log\det(V^TXV\,V^TLV).$$
As $\varepsilon$ becomes small, the $\mathrm{tr}(LX)$ term dominates the objective function. Now consider the constraints from the min balanced cut SDP. The $\mathrm{diag}(X) = e$ constraint is exactly the constraint from the nearest correlation matrix problem (that is, $X_{ii} = 1$ for all $i$). The $X \succeq 0$ constraint is implicitly satisfied when LogDet is used, leaving the balancing constraint $e^TXe = 0$. Recall that $\mathrm{null}(X) = \mathrm{null}(L^\dagger)$, and from standard spectral graph theory, that $Le = 0$. Therefore $L^\dagger e = 0$, which further implies $e^TL^\dagger e = 0 \Rightarrow e^TXe = 0$ by the fact that the LogDet divergence preserves the range space of $L^\dagger$. Thus, the null space restriction in the LogDet divergence naturally yields the constraint $e^TXe = 0$, and so the min balanced cut SDP problem on $A$ is equivalent (for sufficiently small $\varepsilon$) to finding the nearest correlation matrix to $\varepsilon L^\dagger$ under the LogDet divergence.

Many other semidefinite programming problems can be solved in a similar manner to the one given above for the min balanced cut problem. It is beyond the scope of this paper to determine the practicality of such an optimization method; our aim here is simply to demonstrate an intriguing relationship between semidefinite programming and minimizing the LogDet divergence. For further information on this relationship, see Kulis et al. (2007a).

6.5 Changing the Range Space

The key property of the LogDet and von Neumann divergences which allows the algorithms to learn low-rank kernel matrices is their range space preservation property, discussed in Section 4. While this property leads to efficient algorithms for learning low-rank kernel matrices, it also forces the learned matrices to maintain the range space of the input matrix, which is a limitation of our method. Allowing the range space to change is potentially a very useful tool, but is still an open research question. One possibility worth exploring is to augment the input matrix $X_0$ with a matrix capturing the complementary space, that is, the input would be $X_0 + \varepsilon N$, where $N$ captures the null space of $X_0$. Even if there does not exist a solution to the optimization problem of minimizing $D_f(X, X_0)$, there is always a solution to minimizing $D_f(X, X_0 + \varepsilon N)$. Details of this approach are to be pursued as future work. A related approach to circumvent this problem is to apply an appropriate kernel function over the input data, the result of which is that the algorithm learns a non-linear transformation of the data over the input space. Recently, Davis et al. (2007) discussed how one can generalize to new data points with the LogDet divergence even after applying such a kernel function, making this approach practical in many situations.

7. Experiments

In this section, we present a number of results using our algorithms, and provide references to recent work which has applied our algorithms to large-scale learning tasks. We first begin with the basic algorithm, and present results on transductive learning (the scenario where all data is provided up front but labels are available for only a subset of the data) and clustering. We then discuss some results comparing our methods to existing metric learning algorithms. We run the kernel learning algorithms in MATLAB, with some routines written in C and compiled with the MEX compiler. An efficient implementation of the special matrix multiplication appearing in step 9 of Algorithm 3 could achieve further run-time improvements.
Unfortunately, an efficient and accurate implementation based on the fast multipole method is not readily available.

Transductive Learning and Clustering Results

To show the effectiveness of our algorithms, we present results from clustering as well as classification. We consider several data sets from real-life applications:

1. Digits: a subset of the Pendigits data from the UCI repository that contains handwritten samples of the digits 3, 8, and 9. The raw data for each digit is 16-dimensional; this subset contains 317 digits and is a standard data set for semi-supervised clustering (e.g., see Bilenko et al., 2004).
2. GyrB: a protein data set: a 52 x 52 kernel matrix among bacteria proteins, containing three bacteria species. This matrix is identical to the one used to test the DefiniteBoost algorithm in Tsuda et al. (2005).
3. Spambase: a data set from the UCI repository containing 4601 email messages, classified as spam or not spam; 1813 of the emails are spam (39.4 percent). This data has 57 attributes.
We first show that the low-rank kernels learned by our algorithms attain good clustering and classification accuracies. We ran our algorithms on the Digits data set to learn a rank-16 kernel matrix using randomly generated constraints. The Gram matrix of the original Digits data was used as our initial (rank-16) kernel matrix. The left plot of Figure 1 shows clustering NMI values with increasing constraints. Adding just a few constraints improves the results significantly, and both of the kernel learning algorithms perform comparably. Classification accuracy using the k-nearest neighbor method was also computed for this data set: marginal classification accuracy gains were observed with the addition of constraints (an increase from 94 to 97 percent accuracy for both divergences). We also recorded convergence data on Digits in terms of the number of cycles through all constraints; for the von Neumann divergence, convergence was attained in 11 cycles for 30 constraints and in 105 cycles for 420 constraints, while for the LogDet divergence, between 17 and 354 cycles were needed for convergence. This experiment highlights how our algorithm can use constraints to learn a low-rank kernel matrix. It is noteworthy that the learned kernel performs better than the original kernel for clustering as well as classification. Furthermore, in the right plot of Figure 1, we plot the number of constraints satisfied within a tolerance of 10^-3 (300 total constraints) as a function of the number of cycles through all constraints, when using the LogDet divergence. Similar results are obtained with the von Neumann divergence. The number of cycles required for convergence to high accuracy can potentially be large, but for typical machine learning tasks, high accuracy solutions are not necessary and low to medium accuracy solutions can be achieved much faster.

Figure 1: (Left) Normalized Mutual Information results of clustering the Digits data set; both algorithms improve with a small number of constraints. (Right) Percentage of constraints satisfied (to a tolerance of 10^-3) in Bregman's algorithm as a function of the number of cycles, using the LogDet divergence.

Figure 2: Classification accuracy (left) and convergence (right) on the GyrB data set.
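The satisfaction percentages in the right plot of Figure 1 come from a simple per-cycle check: after each pass over the constraint list, we count how many constraints hold to within the tolerance. A minimal sketch of that check, reusing the (i, j, bound, sign) encoding above (the routine name is again only illustrative):

def fraction_satisfied(K, constraints, tol=1e-3):
    """Fraction of constraints satisfied by the current kernel K within tol.

    sign = -1 encodes d(i, j) <= bound and sign = +1 encodes d(i, j) >= bound,
    with d(i, j) = K[i, i] + K[j, j] - 2 K[i, j].
    """
    satisfied = 0
    for (i, j, bound, sign) in constraints:
        d = K[i, i] + K[j, j] - 2 * K[i, j]
        if sign * (d - bound) >= -tol:
            satisfied += 1
    return satisfied / len(constraints)

One natural stopping rule, consistent with the tolerance reported above, is to cycle until this fraction reaches 1; lower-accuracy solutions are obtained simply by stopping earlier.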
As a second experiment, we performed experiments on GyrB and comparisons to the DefiniteBoost algorithm of Tsuda et al. (2005) (modified to correctly handle inequality constraints). Using only constraints, we attempt to learn a kernel matrix that achieves high classification accuracy. In order to compare to DefiniteBoost, we learned a full-rank kernel matrix starting from the scaled identity matrix as in Tsuda et al. (2005). In our experiments, we observed that using approximate projections, as done in DefiniteBoost, considerably increases the number of cycles needed for convergence. For example, starting with the scaled identity as the initial kernel matrix and 100 constraints, it took our von Neumann algorithm only 11 cycles to converge, whereas it took 3220 cycles for the DefiniteBoost algorithm to converge. Since the optimal solutions are the same for approximate versus exact projections, we converge to the same kernel matrix as DefiniteBoost but in far fewer iterations. Furthermore, our algorithm has the ability to learn low-rank kernels (as in the case of the Digits example). Figure 2 depicts the classification accuracy and convergence of our LogDet and von Neumann algorithms on this data set. The slow convergence of the DefiniteBoost algorithm did not allow us to run it with a larger set of constraints. For the LogDet and the von Neumann exact projection algorithms, the number of cycles required for convergence never exceeded 600 on runs of up to 1000 constraints on GyrB. The classification accuracy on the original matrix is .948, and so our learned kernels achieve even higher accuracy than the target kernel with a sufficient number of constraints. These results highlight that excellent classification accuracy can be obtained using a kernel that is learned using only distance constraints. Note that the starting kernel was the identity matrix, and so did not encode any domain information.

On the larger data sets Spambase and Nursery, we found similar improvements in clustering accuracy while learning rank-57 and rank-8 kernel matrices, respectively, using the same setup as in the Digits experiment. Figures 3 and 4 show the clustering accuracy and convergence on these data sets. Note that both data sets are too large for DefiniteBoost to handle, and furthermore, storing the full kernel matrix would require significant memory overhead (for example, MATLAB runs out of memory while attempting to store the full kernel matrix for Nursery); this scenario highlights the memory advantage of using low-rank kernels. We see that on the Spambase data set, the LogDet divergence algorithm produces clustering accuracy improvements with very little supervision, but the number of cycles to converge is much larger than for the von Neumann algorithm (for 500 constraints, the von Neumann algorithm requires only 29 cycles to converge). The slower convergence of the LogDet divergence algorithm is a topic of future work; note, however, that even though the number of cycles for LogDet to converge is higher than for von Neumann, the overall running time is often lower due to the efficiency of each LogDet iteration and the lack of a suitable implementation of the fast multipole method. On the Nursery data set, both algorithms perform similarly in terms of clustering accuracy, and the discrepancy in terms of cycles to converge between the algorithms is less drastic. We stress that other methods for learning kernel matrices, such as using the squared Frobenius norm in place of the LogDet or von Neumann divergences, would lead to learning full-rank matrices and would require significantly more memory and computational resources.

Figure 3: Clustering accuracy on Spambase (left) and Nursery (right).

Figure 4: Convergence on Spambase (left) and Nursery (right).
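The memory argument above is easiest to see in factored form. The sketch below assumes the low-rank kernel is maintained as K = G G^T for an n×r factor G (one natural way to realize the savings discussed above; the helper names are illustrative), so that kernel entries and distances can be computed on demand without ever forming the n×n matrix:

import numpy as np

def kernel_entry(G, i, j):
    # K[i, j] = g_i^T g_j, computed from the n x r factor G
    return G[i] @ G[j]

def kernel_distance(G, i, j):
    # squared feature-space distance d(i, j) = K[i, i] + K[j, j] - 2 K[i, j]
    diff = G[i] - G[j]
    return diff @ diff

# For Nursery-sized data (n = 12960, r = 8), G holds n * r = 103,680 entries,
# whereas the full kernel matrix would require n * n, roughly 1.7 * 10^8 entries.
G = np.random.default_rng(0).standard_normal((12960, 8))
print(kernel_entry(G, 0, 1), kernel_distance(G, 0, 1))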
7.2 Metric Learning and Large-Scale Experiments

We now move beyond the transductive setting in order to compare with existing metric learning algorithms. As briefly discussed in Section 6.1, learning a low-rank kernel with the same range space as the input kernel is equivalent to learning a linear transformation of the input data, so we now compare against some existing methods for learning linear transformations. In particular, we compare against two popular Mahalanobis metric learning algorithms: metric learning by collapsing classes (MCML) (Globerson and Roweis, 2005) and large-margin nearest neighbor metric learning (LMNN) (Weinberger et al., 2005). As a baseline, we consider the squared Euclidean distance.

For each data set, we use two-fold cross validation to learn a linear transformation of our training data. This transformation is used in a k-nearest neighbor classifier on our test data, and the resulting error is reported. For the LogDet and von Neumann algorithms, we cross-validate the γ parameter and use the constraint selection procedure described in Davis et al. (2007). In particular, we generate constraints of the form d(i, j) ≤ u for same-class pairs and d(i, j) ≥ ℓ for different-class pairs, where u and ℓ are chosen based on the 5th and 95th percentiles of all distances in the training data. We randomly choose 40c^2 constraints, where c is the number of classes in the data.

We ran on a number of standard UCI data sets to compare with existing methods. In Figure 5, we display k-NN accuracy (k = 4) over five data sets. Overall, we see that both the LogDet and von Neumann algorithms compare favorably with existing methods.

Figure 5: Classification error rates for k-nearest neighbor classification via different learned metrics (von Neumann, LogDet, and the Euclidean baseline, compared with MCML and LMNN) over the UCI data sets Wine, Ionosphere, Balance Scale, Iris, and Soybean. The von Neumann and LogDet algorithms are competitive with existing state of the art metric learning methods, and significantly outperform the baseline.

In terms of large-scale experiments, we also ran on the MNIST data set, a handwritten digits data set. This data has 60,000 training points and 10,000 test points. After deskewing the images and performing dimensionality reduction to the first 100 principal components of MNIST, we ran our methods successfully with 10,000 and 100,000 constraints, chosen as in the metric learning experiments above. The baseline error for a 1-nearest neighbor search over the top 100 principal components after deskewing is 2.35%. For 10,000 constraints, the LogDet algorithm ran in 4.8 minutes and achieved a test error of 2.29%, while the von Neumann algorithm ran in 7.7 minutes and achieved 2.30% error. For 100,000 constraints, LogDet ran in 11.3 minutes and achieved an error of 2.18%, while von Neumann ran in 71.4 minutes and achieved an error of 2.17%.

Finally, we note that our methods have recently been applied to very large problems in the computer vision domain. We refer the reader to Jain et al. (2008), which adapts the algorithms discussed in this paper to three large-scale vision experiments: human body pose estimation, feature indexing for 3-d reconstruction, and object classification. For pose estimation, the size of the data was 500,000 images, and for feature indexing, there were 300,000 image patches. These experiments validate the use of our methods on large-scale data, and also demonstrate that they can be used to outperform state of the art methods in computer vision.
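A minimal sketch of this constraint selection (the routine name, the use of squared Euclidean distances, and the (i, j, bound, sign) encoding are illustrative assumptions rather than the exact experimental code):

import numpy as np

def metric_learning_constraints(X, labels, seed=None):
    """Sample 40 c^2 constraints for metric learning over a data matrix X.

    Same-class pairs receive d(i, j) <= u and different-class pairs receive
    d(i, j) >= l, where u and l are the 5th and 95th percentiles of the
    pairwise (squared Euclidean) distances in the training data.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T          # pairwise squared distances
    u, l = np.percentile(D[np.triu_indices(n, k=1)], [5, 95])
    c = len(np.unique(labels))
    constraints = []
    while len(constraints) < 40 * c ** 2:
        i, j = rng.integers(n, size=2)
        if i == j:
            continue
        if labels[i] == labels[j]:
            constraints.append((i, j, u, -1))            # d(i, j) <= u
        else:
            constraints.append((i, j, l, +1))            # d(i, j) >= l
    return constraints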
8. Conclusions

In this paper, we have developed algorithms for using Bregman matrix divergences for low-rank matrix nearness problems. In particular, we have developed a framework for using the LogDet divergence and the von Neumann divergence when the initial matrices are low-rank; this is achieved via a restriction of the range space of the matrices. Unlike previous kernel learning algorithms, which have running times that are cubic in the number of data points, our resulting algorithms are efficient: both algorithms have running times linear in the number of data points and quadratic in the rank of the kernel. Furthermore, our algorithms can be used in conjunction with a number of kernel-based learning algorithms that are optimized for low-rank kernel representations. The experimental results demonstrate that our algorithms are effective in learning low-rank and full-rank kernels for classification and clustering problems on large-scale data sets that arise in diverse applications.

There is still much to be gained from studying these divergences. Algorithmically, we have considered only projections onto a single constraint, but it is worth pursuing other approaches to solve the optimization problems, such as methods that optimize with respect to multiple constraints simultaneously. We discussed how the LogDet divergence exhibits connections to Mahalanobis metric learning and semidefinite programming, and there is significant ongoing work in these areas. Finally, we hope that our framework will lead to new insights and algorithms for other machine learning problems where matrix nearness problems need to be solved.

Acknowledgments

This research was supported by NSF grant CCF-0431257, NSF Career Award ACI-0093404, and NSF-ITR award IIS-0325116. We thank Koji Tsuda for the protein data set and Kilian Weinberger for the image deskewing code.

Appendix A. Properties of Bregman Matrix Divergences

In this section, we highlight some important properties of Bregman matrix divergences.

Proposition 11 Let Q be a square, orthogonal matrix, that is, Q^T Q = QQ^T = I. Then for all spectral Bregman matrix divergences, D_f(Q^T X Q, Q^T Y Q) = D_f(X, Y).

Proof This follows from Lemma 1 by observing that if X = VΛV^T and Y = UΘU^T, then the i-th eigenvector of Q^T X Q is Q^T v_i and the j-th eigenvector of Q^T Y Q is Q^T u_j. The eigenvalues remain unchanged by the orthogonal similarity transformation. Thus, the term ((Q^T v_i)^T (Q^T u_j))^2 in Lemma 1 simplifies to (v_i^T u_j)^2 using the fact that QQ^T = I.

Proposition 12 Let M be a square, non-singular matrix, and let X and Y be n×n positive definite matrices. Then D_ℓd(M^T X M, M^T Y M) = D_ℓd(X, Y).

Proof First observe that if X is positive definite, then M^T X M is positive definite. Then:

    D_ℓd(M^T X M, M^T Y M) = tr(M^T X M (M^T Y M)^{-1}) - log det(M^T X M (M^T Y M)^{-1}) - n
                           = tr(M^T X M M^{-1} Y^{-1} M^{-T}) - log det(M^T X M M^{-1} Y^{-1} M^{-T}) - n
                           = tr(X Y^{-1}) - log det(X Y^{-1}) - n
                           = D_ℓd(X, Y).

Note that, as a corollary, we have that the LogDet divergence is scale-invariant, that is, D_ℓd(X, Y) = D_ℓd(cX, cY), for any positive scalar c. Furthermore, this proposition may be extended to the case when X and Y are positive semidefinite, using the definition of the LogDet divergence for rank-deficient matrices.
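These identities are easy to check numerically. The sketch below verifies Proposition 12 on random positive definite matrices, using the expression D_ℓd(X, Y) = tr(XY^{-1}) - log det(XY^{-1}) - n from the proof; the helper names and the use of numpy are illustrative only.

import numpy as np

def logdet_div(X, Y):
    """LogDet divergence D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n."""
    n = X.shape[0]
    A = X @ np.linalg.inv(Y)
    return np.trace(A) - np.linalg.slogdet(A)[1] - n

def random_spd(rng, n):
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)        # well-conditioned positive definite matrix

rng = np.random.default_rng(0)
n = 5
X, Y = random_spd(rng, n), random_spd(rng, n)
M = rng.standard_normal((n, n))           # non-singular with probability one
assert abs(logdet_div(M.T @ X @ M, M.T @ Y @ M) - logdet_div(X, Y)) < 1e-6

Setting M = cI for a positive scalar c checks the scale-invariance corollary in the same way.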
Proposition 13 Let Y be a positive definite matrix, and let X be any positive semidefinite matrix. Then

    D_ℓd(X, Y) = D_vN(I, Y^{-1/2} X Y^{-1/2}) = D_ℓd(Y^{-1/2} X Y^{-1/2}, I).

Proof The first equality follows by expanding the definition of the von Neumann divergence:

    D_vN(I, Y^{-1/2} X Y^{-1/2}) = tr(I log I - I log(Y^{-1/2} X Y^{-1/2}) - I + Y^{-1/2} X Y^{-1/2})
                                 = tr(Y^{-1/2} X Y^{-1/2}) - tr(log(Y^{-1/2} X Y^{-1/2})) - tr(I)
                                 = tr(X Y^{-1}) - log det(Y^{-1/2} X Y^{-1/2}) - n
                                 = tr(X Y^{-1}) - log det(X Y^{-1}) - n
                                 = D_ℓd(X, Y),

where we have used the fact that tr(log A) = log det(A). The second equality follows by applying Proposition 12 to D_ℓd(X, Y) with M = Y^{-1/2}.

Appendix B. Necessity of Dual Variable Corrections

In Section 3.4, we presented the method of Bregman projections and showed that, in the presence of inequality constraints, a dual variable correction is needed to guarantee convergence to the globally optimal solution. In some recent machine learning papers such as Tsuda et al. (2005), this correction has been omitted from the algorithm. We now briefly demonstrate why such corrections are needed.

A very simple example illustrating the failure of Bregman projections without corrections is finding the nearest 2×2 (positive definite) matrix X to the identity matrix that satisfies a single linear constraint:

    minimize_X    D_f(X, I)
    subject to    X_11 + X_22 - 2 X_12 ≥ 1
                  X ⪰ 0.

The starting (identity) matrix satisfies the linear constraint, but a single projection step (to the corresponding equality constraint) without corrections will produce a suboptimal matrix, regardless of the divergence used. On the other hand, employing corrections leads to α = 0, and the input matrix remains unchanged.

It is also not sufficient to repeatedly perform Bregman projections only on violated constraints (without any corrections) until all constraints are satisfied; this approach is used by Tsuda et al. (2005). To demonstrate this more involved case, we consider an example over vectors, where the goal is to minimize the relative entropy to a given vector x_0 under linear constraints. Note that the argument carries over to the matrix case, but is more difficult to visualize.

    minimize_x    KL(x, x_0)
    subject to    x^T [ 0.0912   0.9385  -0.4377 ]^T ≥ 0.0238
                  x^T [ 0.6020   0.6020  -0.4377 ]^T ≥ 0.2554
                  x^T [ 1        1        1      ]^T = 1.                    (24)

In the above problem, KL(x, x_0) refers to the relative entropy, defined earlier. The first two constraints are linear inequality constraints, while the third constraint forces x to remain on the unit simplex so that it is a probability vector. Let x_0 equal [0.1  0.1  0.8]. If we process the constraints in the above order without corrections, after two projection steps (one projection onto each of the first two constraints), we arrive at the vector [1/5  7/15  1/3], which satisfies all three constraints. Therefore this vector is returned as the optimal solution if no corrections are applied. In comparison, a proper execution of Bregman's algorithm employing corrections arrives at [1/3  1/3  1/3], which is the maximum entropy vector. It is also worth noting that if we project to the second constraint first, then we arrive at the optimal solution immediately. By appropriately modifying the constraints and the starting point in this example, the incorrect algorithm can be made to converge arbitrarily close to [0  0  1], a minimum entropy vector, thus drastically demonstrating the necessity of dual variable corrections.

Figure 6: Termination points for the relative entropy example described by (24). The shaded regions illustrate the hyperplanes associated with the two inequality constraints of the optimization problem. The optimal solution is the maximum entropy vector x_opt = [1/3  1/3  1/3], but without corrections we arrive at x_2 = [1/5  7/15  1/3] after two projection steps starting from x_0 = [0.1  0.1  0.8].
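The vector example is easy to reproduce. The following sketch (helper names and the root-finding bracket are illustrative choices) carries out exact relative entropy projections onto the two inequality constraints of (24), treated as equalities and processed in the two orders discussed above; it illustrates the failure mode only, and is not an implementation of Bregman's algorithm with corrections. The projection of y onto a single hyperplane a^T x = b under the (unnormalized) relative entropy has the form x_i = y_i exp(λ a_i), with λ found by one-dimensional root finding.

import numpy as np
from scipy.optimize import brentq

# Constraint data from (24): each inequality is a^T x >= b.
a1, b1 = np.array([0.0912, 0.9385, -0.4377]), 0.0238
a2, b2 = np.array([0.6020, 0.6020, -0.4377]), 0.2554
x0 = np.array([0.1, 0.1, 0.8])

def kl_project(y, a, b):
    """Relative entropy projection of y onto the hyperplane a^T x = b.

    The minimizer has the form x_i = y_i * exp(lam * a_i); the scalar lam
    is found by root finding on the monotone function a^T x(lam) - b.
    """
    g = lambda lam: a @ (y * np.exp(lam * a)) - b
    lam = brentq(g, -50.0, 50.0)
    return y * np.exp(lam * a)

# Processing violated constraints in order, without corrections:
x1 = kl_project(x0, a1, b1)
x2 = kl_project(x1, a2, b2)
print(np.round(x2, 4))                      # approximately [1/5, 7/15, 1/3]

# Projecting onto the second constraint first reaches the optimum directly:
print(np.round(kl_project(x0, a2, b2), 4))  # approximately [1/3, 1/3, 1/3]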
References

F. Bach and M. Jordan. Predictive low-rank decomposition for kernel methods. In Proc. 22nd International Conference on Machine Learning (ICML), 2005.

J. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324:446–449, 1986.

M. Bilenko, S. Basu, and R. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proc. 21st International Conference on Machine Learning (ICML), 2004.

L. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics, 7:200–217, 1967.

R. P. Brent. Algorithms for Minimization without Derivatives. Prentice-Hall, Englewood Cliffs, NJ, 1973. ISBN 0-13-022335-2.

Y. Censor and S. Zenios. Parallel Optimization. Oxford University Press, 1997.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems (NIPS) 14, 2002.

J. Davis, B. Kulis, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proc. 24th International Conference on Machine Learning (ICML), 2007.

J. D. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.

I. S. Dhillon and J. A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, 2007.

I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In Proc. 10th ACM SIGKDD Conference, 2004.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

R. Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1), February 1991.

A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS) 18, 2005.

G. Golub. Some modified matrix eigenvalue problems. SIAM Review, 15:318–334, 1973.

G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73:325–348, 1987.

M. Gu and S. Eisenstat. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15:1266–1276, 1994.
N. Higham. Computing the nearest correlation matrix—a problem from finance. IMA J. Numerical Analysis, 22(3):329–343, 2002.

P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

W. James and C. Stein. Estimation with quadratic loss. In Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. Univ. of California Press, 1961.

B. Kulis, M. A. Sustik, and I. S. Dhillon. Learning low-rank kernel matrices. In Proc. 23rd International Conference on Machine Learning (ICML), 2006.

B. Kulis, S. Sra, S. Jegelka, and I. S. Dhillon. Scalable semidefinite programming using convex perturbations. Technical Report TR-07-47, University of Texas at Austin, September 2007a.

B. Kulis, A. Surendran, and J. Platt. Fast low-rank semidefinite programming for embedding and clustering. In Proc. 11th International Conference on AI and Statistics (AISTATS), 2007b.

J. Kwok and I. Tsang. Learning with idealized kernels. In Proc. 20th International Conference on Machine Learning (ICML), 2003.

G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

K. Lang. Fixing two weaknesses of the spectral method. In Advances in Neural Information Processing Systems (NIPS) 18, 2005.

M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems (NIPS) 16, 2003.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

J. Sherman and W. J. Morrison. Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix. Annals of Mathematical Statistics, 20, 1949.

A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI), 2000.

M. A. Sustik and I. S. Dhillon. On some modified root-finding problems. Working manuscript, 2008.

L. Torresani and K. Lee. Large margin component analysis. In Advances in Neural Information Processing Systems (NIPS) 19, 2006.

K. Tsuda, G. Rätsch, and M. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.

M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems (NIPS) 20, 2006.

K. Weinberger, F. Sha, and L. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proc. 21st International Conference on Machine Learning (ICML), 2004.

K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (NIPS) 18, 2005.

K. Weinberger, F. Sha, Q. Zhu, and L. Saul. Graph Laplacian methods for large-scale semidefinite programming, with an application to sensor localization. In Advances in Neural Information Processing Systems (NIPS) 19, 2006.