Journal of Machine Learning Research 13 (2012) 519-547. Submitted 2/11; Revised 7/11; Published 3/12.

Metric and Kernel Learning Using a Linear Transformation

Prateek Jain (PRAJAIN@MICROSOFT.EDU), Microsoft Research India, #9 Lavelle Road, Bangalore 560003, India
Brian Kulis (KULIS@CSE.OHIO-STATE.EDU), 599 Dreese Labs, Ohio State University, Columbus, OH 43210, USA
Jason V. Davis (JVDAVIS@GMAIL.COM), Etsy Inc., 55 Washington Street, Ste. 512, Brooklyn, NY 11201
Inderjit S. Dhillon (INDERJIT@CS.UTEXAS.EDU), The University of Texas at Austin, 1 University Station C0500, Austin, TX 78712, USA

Editors: Sören Sonnenburg, Francis Bach and Cheng Soon Ong

Abstract

Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework: minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real-world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction.

Keywords: metric learning, kernel learning, linear transformation, matrix divergences, LogDet divergence

1. Introduction

One of the basic requirements of many machine learning algorithms (e.g., semi-supervised clustering algorithms, nearest neighbor classification algorithms) is the ability to compare two objects to compute a similarity or distance between them. In many cases, off-the-shelf distance or similarity functions such as the Euclidean distance or cosine similarity are used; for example, in text retrieval applications, the cosine similarity is a standard function to compare two text documents.
However, such standard distance or similarity functions are not appropriate for all problems.

Recently, there has been significant effort focused on task-specific learning for comparing data objects. One prominent approach has been to learn a distance metric between objects given additional side information such as pairwise similarity and dissimilarity constraints over the data. A class of distance metrics that has shown excellent generalization properties is the learned Mahalanobis distance function (Davis et al., 2007; Xing et al., 2002; Weinberger et al., 2005; Goldberger et al., 2004; Shalev-Shwartz et al., 2004). The Mahalanobis distance can be viewed as a method in which data is subject to a linear transformation, and the goal of such metric learning methods is to learn the linear transformation for a given task. Despite their simplicity and generalization ability, Mahalanobis distances suffer from two major drawbacks: 1) the number of parameters to learn grows quadratically with the dimensionality of the data, making it difficult to learn distance functions over high-dimensional data; 2) learning a linear transformation is inadequate for data sets with non-linear decision boundaries.

To address the latter shortcoming, kernel learning algorithms typically attempt to learn a kernel matrix over the data. Limitations of linear methods can be overcome by employing a non-linear input kernel, which implicitly maps the data non-linearly to a high-dimensional feature space. However, many existing kernel learning methods are still limited in that the learned kernels do not generalize to new points (Kwok and Tsang, 2003; Kulis et al., 2006; Tsuda et al., 2005). These methods are therefore restricted to learning in the transductive setting, where all the data (labeled and unlabeled) is assumed to be given up front. There has been some work on learning kernels that generalize to new points, most notably work on hyperkernels (Ong et al., 2005), but the resulting optimization problems are expensive and cannot be scaled to large or even medium-sized data sets. Another approach is multiple kernel learning (Lanckriet et al., 2004), which learns a mixture of base kernels; this approach is inductive, but the class of learnable kernels can be restrictive.

In this paper, we explore metric learning with linear transformations over arbitrarily high-dimensional spaces; as we will see, this is equivalent to learning a linear transformation kernel function $\phi(x)^T W \phi(y)$ given an input kernel function $\phi(x)^T \phi(y)$. In the first part of the paper, we formulate a metric learning problem that uses a particular loss function, the LogDet divergence, for learning the positive definite matrix $W$. This loss function is advantageous for several reasons: it is defined only over positive definite matrices, which makes the optimization simpler, as we will be able to effectively ignore the positive definiteness constraint on $W$. Furthermore, the loss function has precedence in optimization (Fletcher, 1991) and statistics (James and Stein, 1961). An important advantage of our method is that the proposed optimization algorithm is scalable to very large data sets of the order of millions of data objects. But perhaps most importantly, the loss function permits efficient kernelization, allowing efficient learning of a linear transformation in kernel space. As a result, unlike transductive kernel learning methods, our method easily handles out-of-sample extensions, that is, it can be applied to unseen data.

We build upon our kernelization results for the LogDet formulation to develop a general framework for learning linear transformation kernel functions, and show that such kernels can be efficiently learned over a wide class of convex constraints and loss functions. Our result can be viewed as a representer theorem, where the optimal parameters can be expressed purely in terms of the training data. In our case, even though the matrix $W$ may be infinite-dimensional, it can be
fully represented in terms of the constrained data points, making it possible to compute the learned kernel function over arbitrary points.

We demonstrate the benefits of a generalized framework for inductive kernel learning by applying our techniques to the problem of inductive kernelized semi-supervised dimensionality reduction. By choosing the trace-norm as a loss function, we obtain a novel kernel learning method that learns low-rank linear transformations; unlike previous kernel dimensionality reduction methods, which are either unsupervised or cannot easily be applied inductively to new data, our method intrinsically possesses both desirable properties.

Finally, we apply our metric and kernel learning algorithms to a number of challenging learning problems, including ones from the domains of computer vision and text mining. Unlike existing techniques, we can learn linear transformation-based distance or kernel functions over these domains, and we show that the resulting functions lead to improvements over state-of-the-art techniques for a variety of problems.

2. Related Work

Most of the existing work in metric learning has been done in the Mahalanobis distance (or metric) learning paradigm, which has been found to be a sufficiently powerful class of metrics for a variety of data. In one of the earliest papers on metric learning, Xing et al. (2002) propose a semidefinite programming formulation under similarity and dissimilarity constraints for learning a Mahalanobis distance, but the resulting formulation is slow to optimize and has been outperformed by more recent methods. Weinberger et al. (2005) formulate the metric learning problem in a large margin setting, with a focus on k-NN classification. They also formulate the problem as a semidefinite programming problem and consequently solve it using a method that combines sub-gradient descent and alternating projections. Goldberger et al. (2004) proceed to learn a linear transformation in the fully supervised setting. Their formulation seeks to 'collapse classes' by constraining within-class distances to be zero while maximizing between-class distances. While each of these algorithms was shown to yield improved classification performance over the baseline metrics, their constraints do not generalize outside of their particular problem domains; in contrast, our approach allows arbitrary linear constraints on the Mahalanobis matrix. Furthermore, these algorithms all require eigenvalue decompositions or semi-definite programming, which is at least cubic in the dimensionality of the data.

Other notable work on learning Mahalanobis metrics includes the Pseudo-metric Online Learning Algorithm (POLA) (Shalev-Shwartz et al., 2004), Relevant Components Analysis (RCA) (Schultz and Joachims, 2003), Neighborhood Components Analysis (NCA) (Goldberger et al., 2004), and locally-adaptive discriminative methods (Hastie and Tibshirani, 1996). In particular, Shalev-Shwartz et al. (2004) provided the first demonstration of Mahalanobis distance learning in kernel space. Their construction, however, is expensive to compute, requiring cubic time per iteration to update the parameters. As we will see, our LogDet-based algorithm can be implemented more efficiently.

Non-linear transformation based metric learning methods have also been proposed, though these methods usually suffer from suboptimal performance, non-convexity, or computational complexity. Examples include the convolutional neural net based method of Chopra et al. (2005), and a general Riemannian metric learning method (Lebanon, 2006).

Most of the existing work on kernel learning can be classified into two broad categories. The first category includes parametric approaches, where the learned kernel function is restricted to be of a
specific form and then the relevant parameters are learned according to the provided data. Prominent methods include multiple kernel learning (Lanckriet et al., 2004), hyperkernels (Ong et al., 2005), and hyper-parameter cross-validation (Seeger, 2006). Most of these methods either lack modeling flexibility, require non-convex optimization, or are restricted to a supervised learning scenario. The second category includes non-parametric methods, which explicitly model geometric structure in the data. Examples include spectral kernel learning (Zhu et al., 2005), manifold-based kernel learning (Bengio et al., 2004), and kernel target alignment (Cristianini et al., 2001). However, most of these approaches are limited to the transductive setting and cannot be used to naturally generalize to new points. In comparison, our kernel learning method combines both of the above approaches. We propose a general non-parametric kernel matrix learning framework, similar to methods of the second category. However, based on our choice of regularization and constraints, we show that our learned kernel matrix corresponds to a linear transformation kernel function parameterized by a PSD matrix. As a result, our method can be applied to inductive settings without sacrificing significant modeling power. In addition, our kernel learning method naturally provides kernelization for many existing metric learning methods. Recently, Chatpatanasiri et al. (2010) showed kernelization for a class of metric learning algorithms including LMNN and NCA; as we will see, our result is more general: we can prove kernelization over a larger class of problems and can also reduce the number of parameters to be learned. Furthermore, our methods can be applied to a variety of domains and with a variety of forms of side-information. Independent of our work, Argyriou et al. (2010) recently proved a representer-type theorem for spectral regularization functions. However, the framework they consider is different from ours in that they are interested in sensing an underlying high-dimensional matrix using given measurements.

The research in this paper combines and extends work done in Davis et al. (2007), Kulis et al. (2006), Davis and Dhillon (2008), and Jain et al. (2010). The focus in Davis et al. (2007) and Davis and Dhillon (2008) was solely on the LogDet divergence, while the main goal in Kulis et al. (2006) was to demonstrate the computational benefits of using the LogDet and von Neumann divergences for learning low-rank kernel matrices. In Jain et al. (2010), we showed the equivalence between a general class of kernel learning problems and metric learning problems. In this paper, we unify and summarize the work in the existing conference papers and also provide detailed proofs of the theorems in Jain et al. (2010).

3. LogDet Divergence Based Metric Learning

We begin by studying a particular method, based on the LogDet divergence, for learning metrics via learning linear transformations given pairwise distance constraints. We discuss kernelization of this formulation and present efficient optimization algorithms. Finally, we address limitations of the method when the amount of training data is large, and propose a modified algorithm to efficiently learn a kernel under such circumstances. In subsequent sections, we take the ingredients developed in this section and show how to generalize them to a much larger class of loss functions and constraints, which will encompass most of the previously studied approaches for Mahalanobis metric learning.

3.1 Mahalanobis Distances and Parameterized Kernels

First we introduce the framework for metric and kernel learning that is employed in this paper. Given a data set of objects $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^{d_0}$ (when working in kernel space, the data
matrix will be represented as $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$, where $\phi$ is the mapping to feature space, that is, $\phi : \mathbb{R}^{d_0} \rightarrow \mathbb{R}^{d}$), we are interested in finding an appropriate distance function to compare two objects. We consider the Mahalanobis distance, parameterized by a positive definite matrix $W$; the squared distance between $x_i$ and $x_j$ is given by

$d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j).$   (1)

This distance function can be viewed as learning a linear transformation of the data and measuring the squared Euclidean distance in the transformed space. This is seen by factorizing the matrix $W = G^T G$ and observing that $d_W(x_i, x_j) = \|G x_i - G x_j\|_2^2$. However, if the data is not linearly separable in the input space, then the resulting distance function may not be powerful enough for the desired application. As a result, we are interested in working in kernel space; that is, we will express the Mahalanobis distance in kernel space using an appropriate mapping $\phi$ from input to feature space:

$d_W(\phi(x_i), \phi(x_j)) = (\phi(x_i) - \phi(x_j))^T W (\phi(x_i) - \phi(x_j)).$

Note that when we choose $\phi$ to be the identity, we obtain (1); we will use the more general form throughout this paper. As is standard with kernel-based algorithms, we require that this distance be computable given the ability to compute the kernel function $k_0(x, y) = \phi(x)^T \phi(y)$. We can therefore equivalently pose the problem as learning a parameterized kernel function $k(x, y) = \phi(x)^T W \phi(y)$ given some input kernel function $k_0(x, y) = \phi(x)^T \phi(y)$.

To learn the resulting metric/kernel, we assume that we are given constraints on the desired distance function. In this paper, we assume that pairwise similarity and dissimilarity constraints are given over the data, that is, pairs of points that should be similar under the learned metric/kernel, and pairs of points that should be dissimilar under the learned metric/kernel. Such constraints are natural in many settings; for example, given class labels over the data, points in the same class should be similar to one another and dissimilar to points in different classes. However, our approach is general and can accommodate other potential constraints over the distance function, such as relative distance constraints.

The main challenge is in finding an appropriate loss function for learning the matrix $W$ so that 1) the resulting algorithm is scalable and efficiently computable in kernel space, and 2) the resulting metric/kernel yields improved performance on the underlying learning problem, such as classification, semi-supervised clustering, etc. We now move on to the details.

3.2 LogDet Metric Learning

The LogDet divergence between two positive definite matrices $W, W_0 \in \mathbb{R}^{d \times d}$ is defined to be

$D_{\ell d}(W, W_0) = \mathrm{tr}(W W_0^{-1}) - \log\det(W W_0^{-1}) - d.$

(The definition of the LogDet divergence can be extended to the case when $W_0$ and $W$ are rank deficient by appropriate use of the pseudo-inverse; the interested reader may refer to Kulis et al., 2008.) We are interested in finding $W$ that is closest to $W_0$ as measured by the LogDet divergence but that satisfies our desired constraints. When $W_0 = I$, we can interpret the learning problem as a maximum entropy problem.
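To make these two building blocks concrete, the following is a minimal NumPy sketch (ours, not part of the original paper) of the squared Mahalanobis distance (1) and the LogDet divergence; the function names are our own.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """Squared Mahalanobis distance d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ W @ diff)

def logdet_divergence(W, W0):
    """LogDet divergence D_ld(W, W0) = tr(W W0^{-1}) - log det(W W0^{-1}) - d,
    defined for positive definite W and W0."""
    d = W.shape[0]
    M = W @ np.linalg.inv(W0)
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - d)

# Usage: factorizing W = G^T G gives d_W(x, y) = ||Gx - Gy||^2, as noted above.
rng = np.random.default_rng(0)
G = rng.normal(size=(3, 3))
W = G.T @ G
x, y = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(mahalanobis_sq(x, y, W), np.linalg.norm(G @ x - G @ y) ** 2)
print(logdet_divergence(W, np.eye(3)))
```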
Given a set of similarity constraints $S$ and dissimilarity constraints $D$, we propose the following problem:

$\min_{W \succeq 0} \; D_{\ell d}(W, I)$, s.t. $d_W(\phi(x_i), \phi(x_j)) \le u$, $(i,j) \in S$; $d_W(\phi(x_i), \phi(x_j)) \ge \ell$, $(i,j) \in D$.   (2)

We make a few remarks about this formulation. The above problem was proposed and studied in Davis et al. (2007). LogDet has many important properties that make it useful for machine learning and optimization, including scale-invariance and preservation of the range space; see Kulis et al. (2008) for a detailed discussion of the properties of LogDet. Beyond this, we prefer LogDet over other loss functions (including the squared Frobenius loss as used in Shalev-Shwartz et al., 2004, or a linear objective as in Weinberger et al., 2005) because the resulting algorithm turns out to be simple and efficiently kernelizable, as we will see. We note that formulation (2) minimizes the LogDet divergence to the identity matrix $I$. This can easily be generalized to arbitrary positive definite matrices $W_0$. Further, (2) considers simple similarity and dissimilarity constraints over the learned Mahalanobis distance, but other linear constraints are possible. Finally, the above formulation assumes that there exists a feasible solution to the proposed optimization problem; extensions to the infeasible case involving slack variables are discussed later (see Section 3.5).

3.3 Kernelizing the LogDet Metric Learning Problem

We now consider the problem of kernelizing the metric learning problem. Subsequently, we will present an efficient algorithm and discuss generalization to new points.

Given a set of $n$ constrained data points, let $K_0$ denote the input kernel matrix for the data, that is, $K_0(i,j) = k_0(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Note that the squared Euclidean distance in kernel space may be written as $K(i,i) + K(j,j) - 2K(i,j)$, where $K$ is the learned kernel matrix; equivalently, we may write the distance as $\mathrm{tr}(K (e_i - e_j)(e_i - e_j)^T)$, where $e_i$ is the $i$-th canonical basis vector. Consider the following problem to find $K$:

$\min_{K \succeq 0} \; D_{\ell d}(K, K_0)$, s.t. $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \le u$, $(i,j) \in S$; $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \ge \ell$, $(i,j) \in D$.   (3)

This kernel learning problem was first proposed in the transductive setting in Kulis et al. (2008), though no extensions to the inductive case were considered. Note that problem (2) optimizes over a $d \times d$ matrix $W$, while the kernel learning problem (3) optimizes over an $n \times n$ matrix $K$. We now present our key theorem connecting (2) and (3).

Theorem 1  Let $K_0 \succ 0$. Let $W^*$ be the optimal solution to problem (2) and let $K^*$ be the optimal solution to problem (3). Then the optimal solutions are related by

$K^* = \Phi^T W^* \Phi$, $\quad W^* = I + \Phi S \Phi^T$, where $S = K_0^{-1}(K^* - K_0)K_0^{-1}$, $K_0 = \Phi^T \Phi$, $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$.

The above theorem shows that the LogDet metric learning problem (2) can be solved implicitly by solving an equivalent kernel learning problem (3). In fact, in Section 4 we show an equivalence between metric and kernel learning for a general class of regularization functions. The above theorem follows as a corollary to our general theorem (see Theorem 4), which will be proven later.

Next, we generalize the above theorem to regularize against arbitrary positive definite matrices $W_0$.

Corollary 2  Consider the following problem:

$\min_{W \succeq 0} \; D_{\ell d}(W, W_0)$, s.t. $d_W(\phi(x_i), \phi(x_j)) \le u$, $(i,j) \in S$; $d_W(\phi(x_i), \phi(x_j)) \ge \ell$, $(i,j) \in D$.   (4)

Let $W^*$ be the optimal solution to problem (4) and let $K^*$ be the optimal solution to problem (3). Then the optimal solutions are related by

$K^* = \Phi^T W^* \Phi$, $\quad W^* = W_0 + W_0 \Phi S \Phi^T W_0$, where $S = K_0^{-1}(K^* - K_0)K_0^{-1}$, $K_0 = \Phi^T W_0 \Phi$, $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$.

Proof  Note that $D_{\ell d}(W, W_0) = D_{\ell d}(W_0^{-1/2} W W_0^{-1/2}, I)$. Let $\widetilde{W} = W_0^{-1/2} W W_0^{-1/2}$. Problem (4) is now equivalent to:

$\min_{\widetilde{W} \succeq 0} \; D_{\ell d}(\widetilde{W}, I)$, s.t. $d_{\widetilde{W}}(\tilde{\phi}(x_i), \tilde{\phi}(x_j)) \le u$, $(i,j) \in S$; $d_{\widetilde{W}}(\tilde{\phi}(x_i), \tilde{\phi}(x_j)) \ge \ell$, $(i,j) \in D$,   (5)

where $\widetilde{\Phi} = W_0^{1/2} \Phi$ and $\widetilde{\Phi} = [\tilde{\phi}(x_1), \tilde{\phi}(x_2), \ldots, \tilde{\phi}(x_n)]$. Now, using Theorem 1, the optimal solution $\widetilde{W}^*$ of problem (5) is related to the optimal $K^*$ of problem (3) by $K^* = \widetilde{\Phi}^T \widetilde{W}^* \widetilde{\Phi} = \Phi^T W_0^{1/2} W_0^{-1/2} W^* W_0^{-1/2} W_0^{1/2} \Phi = \Phi^T W^* \Phi$. Similarly, $W^* = W_0^{1/2} \widetilde{W}^* W_0^{1/2} = W_0 + W_0 \Phi S \Phi^T W_0$, where $S = K_0^{-1}(K^* - K_0) K_0^{-1}$.
Since the kernelized version of LogDet metric learning is also a linearly constrained optimization problem with a LogDet objective, similar algorithms can be used to solve either problem. This equivalence implies that we can implicitly solve the metric learning problem by instead solving for the optimal kernel matrix $K^*$. Note that using the LogDet divergence as the objective function has two significant benefits over many other popular loss functions: 1) the metric and kernel learning problems (2) and (3) are equivalent, and therefore solving the kernel learning formulation directly provides an out-of-sample extension (see Section 3.4 for details); 2) projection with respect to the LogDet divergence onto a single distance constraint has a closed-form solution, thus making it amenable to an efficient cyclic projection algorithm (refer to Section 3.5).

3.4 Generalizing to New Points

In this section, we see how to generalize to new points using the learned kernel matrix $K^*$. Suppose that we have solved the kernel learning problem for $K^*$ (from now on, we drop the superscript and assume that $K$ and $W$ are at optimality). The distance between two points $\phi(x_i)$ and $\phi(x_j)$ that are in the training set can be computed directly from the learned kernel matrix as $K(i,i) + K(j,j) - 2K(i,j)$. We now consider the problem of computing the learned distance between two points $\phi(z_1)$ and $\phi(z_2)$ that may not be in the training set.

In Theorem 1, we showed that the optimal solution to the metric learning problem can be expressed as $W = I + \Phi S \Phi^T$. To compute the Mahalanobis distance in kernel space, we see that the inner product $\phi(z_1)^T W \phi(z_2)$ can be computed entirely via inner products between points:

$\phi(z_1)^T W \phi(z_2) = \phi(z_1)^T (I + \Phi S \Phi^T) \phi(z_2) = k_0(z_1, z_2) + k_1^T S k_2$, where $k_i = [k_0(z_i, x_1), \ldots, k_0(z_i, x_n)]^T$.   (6)

Thus, the expression above can be used to evaluate kernelized distances with respect to the learned kernel function between arbitrary data objects. In summary, the connection between kernel learning and metric learning allows us to generalize our metrics to new points in kernel space. This is performed by first solving the kernel learning problem for $K$, then using the learned kernel matrix and the input kernel function to compute learned distances via (6).

3.5 Kernel Learning Algorithm

Given the connection between the Mahalanobis metric learning problem for the $d \times d$ matrix $W$ and the kernel learning problem for the $n \times n$ kernel matrix $K$, we develop an algorithm for efficiently performing metric learning in kernel space. Specifically, we provide an algorithm (see Algorithm 1) for solving the kernelized LogDet metric learning problem (3).

First, to avoid problems with infeasibility, we incorporate slack variables into our formulation. These provide a tradeoff between minimizing the divergence between $K$ and $K_0$ and satisfying the constraints. Note that our earlier results (see Theorem 1) easily generalize to the slack case:

$\min_{K \succeq 0, \, \xi} \; D_{\ell d}(K, K_0) + \gamma \, D_{\ell d}(\mathrm{diag}(\xi), \mathrm{diag}(\xi_0))$, s.t. $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \le \xi_{ij}$, $(i,j) \in S$; $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \ge \xi_{ij}$, $(i,j) \in D$.   (7)

The parameter $\gamma$ above controls the tradeoff between satisfying the constraints and minimizing $D_{\ell d}(K, K_0)$, and the entries of $\xi_0$ are set to $u$ for similarity constraints and $\ell$ for dissimilarity constraints.

To solve problem (7), we employ the technique of Bregman projections, as discussed in the transductive setting (Kulis et al., 2008). At each iteration, we choose a constraint $(i,j)$ from $S$ or $D$. We then apply a Bregman projection such that $K$ satisfies the constraint after projection; note that the projection is not an orthogonal projection but is rather tailored to the particular function that we are optimizing. Algorithm 1 details the steps for Bregman's method on this optimization problem.
Each update is a rank-one update

$K \leftarrow K + \beta \, K (e_i - e_j)(e_i - e_j)^T K,$

where $\beta$ is a projection parameter that can be computed in closed form (see Algorithm 1).

Algorithm 1: Metric/Kernel Learning with the LogDet Divergence

Input: $K_0$: input $n \times n$ kernel matrix; $S$: set of similar pairs; $D$: set of dissimilar pairs; $u, \ell$: distance thresholds; $\gamma$: slack parameter.
Output: $K$: output kernel matrix.
1. $K \leftarrow K_0$, $\lambda_{ij} \leftarrow 0$ for all $i, j$
2. $\xi_{ij} \leftarrow u$ for $(i,j) \in S$; otherwise $\xi_{ij} \leftarrow \ell$
3. repeat
   3.1. Pick a constraint $(i,j) \in S$ or $D$
   3.2. $p \leftarrow (e_i - e_j)^T K (e_i - e_j)$
   3.3. $\delta \leftarrow 1$ if $(i,j) \in S$, $-1$ otherwise
   3.4. $\alpha \leftarrow \min\big(\lambda_{ij}, \; \delta \tfrac{\gamma}{\gamma + 1}\big(\tfrac{1}{p} - \tfrac{1}{\xi_{ij}}\big)\big)$
   3.5. $\beta \leftarrow \delta\alpha / (1 - \delta\alpha p)$
   3.6. $\xi_{ij} \leftarrow \gamma \xi_{ij} / (\gamma + \delta\alpha\xi_{ij})$
   3.7. $\lambda_{ij} \leftarrow \lambda_{ij} - \alpha$
   3.8. $K \leftarrow K + \beta K (e_i - e_j)(e_i - e_j)^T K$
4. until convergence
return $K$

Algorithm 1 has a number of key properties which make it useful for various kernel learning tasks. First, the Bregman projections can be computed in closed form, assuring that the projection updates are efficient ($O(n^2)$). Note that, if the feature space dimensionality $d$ is less than $n$, then a similar algorithm can be used directly in the feature space (see Davis et al., 2007). Instead of LogDet, if we use the von Neumann divergence, another potential loss function for this problem, $O(n^2)$ updates are possible, but they are much more complicated and require use of the fast multipole method, which cannot be employed easily in practice. Secondly, the projections maintain positive definiteness, which avoids any eigenvector computation or semidefinite programming. This is in stark contrast with the Frobenius loss, which requires additional computation to maintain positive definiteness, leading to $O(n^3)$ updates.
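The following is a minimal NumPy sketch (ours, not from the paper) of Algorithm 1 together with the out-of-sample evaluation (6). It assumes the constraint set is given as a list of tuples (i, j, delta) with delta = +1 for similarity and -1 for dissimilarity constraints, and it replaces the convergence test with a fixed number of sweeps.

```python
import numpy as np

def learn_kernel_logdet(K0, constraints, u, ell, gamma=1.0, n_sweeps=50):
    """Algorithm 1 (as reconstructed above): cyclic Bregman projections for (3)/(7).
    constraints: list of (i, j, delta) with delta=+1 (similar) or -1 (dissimilar)."""
    K = K0.copy()
    lam = np.zeros(len(constraints))                      # dual variables lambda_ij
    xi = np.array([u if d == 1 else ell for (_, _, d) in constraints], float)
    for _ in range(n_sweeps):
        for c, (i, j, delta) in enumerate(constraints):
            v = np.zeros(K.shape[0]); v[i], v[j] = 1.0, -1.0   # e_i - e_j
            p = v @ K @ v
            alpha = min(lam[c], delta * gamma / (gamma + 1.0) * (1.0 / p - 1.0 / xi[c]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[c] = gamma * xi[c] / (gamma + delta * alpha * xi[c])
            lam[c] -= alpha
            Kv = K @ v
            K += beta * np.outer(Kv, Kv)                  # rank-one update (step 3.8)
    return K

def learned_kernel(z1, z2, X, K0, K, k0):
    """Out-of-sample evaluation (6): k(z1, z2) = k0(z1, z2) + k1^T S k2,
    with S = K0^{-1} (K - K0) K0^{-1}."""
    K0_inv = np.linalg.inv(K0)
    S = K0_inv @ (K - K0) @ K0_inv
    k1 = np.array([k0(z1, x) for x in X])
    k2 = np.array([k0(z2, x) for x in X])
    return k0(z1, z2) + k1 @ S @ k2
```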
3.6 Metric/Kernel Learning with Large Data Sets

In Sections 3.1 and 3.3, we proposed a LogDet divergence-based Mahalanobis metric learning problem (2) and an equivalent kernel learning problem (3). The number of parameters involved in these problems is $O(\min(n,d)^2)$, where $n$ is the number of training points and $d$ is the dimensionality of the data. The quadratic dependence affects not only the running time for training and testing, but also requires estimating a large number of parameters. For example, a data set with 10,000 dimensions leads to a Mahalanobis matrix with 100 million entries. This represents a fundamental limitation of existing approaches, as many modern data mining problems possess relatively high dimensionality.

In this section, we present a heuristic for learning structured Mahalanobis distance (kernel) functions that scale linearly with the dimensionality (or training set size). Instead of representing the Mahalanobis distance/kernel matrix as a full $d \times d$ (or $n \times n$) matrix with $O(\min(n,d)^2)$ parameters, our method uses compressed representations, admitting matrices parameterized by $O(\min(n,d))$ values. This enables the Mahalanobis distance/kernel function to be learned, stored, and evaluated efficiently in the context of high dimensionality and large training set size. In particular, we propose a method to efficiently learn an identity plus low-rank Mahalanobis distance matrix and its equivalent kernel function.

We now formulate this approach, which we call the high-dimensional identity plus low-rank (IPLR) metric learning problem. Consider a low-dimensional subspace in $\mathbb{R}^d$ and let the columns of $U$ form an orthogonal basis of this subspace. We constrain the learned Mahalanobis distance matrix to be of the form

$W = I_d + W_l = I_d + U L U^T,$

where $I_d$ is the $d \times d$ identity matrix, $W_l$ denotes the low-rank part of $W$, and $L \in S_+^{k \times k}$ with $k \ll \min(n,d)$. Analogous to (2), we propose the following problem to learn an identity plus low-rank Mahalanobis distance function:

$\min_{W, L \succeq 0} \; D_{\ell d}(W, I_d)$, s.t. $d_W(\phi(x_i), \phi(x_j)) \le u$, $(i,j) \in S$; $d_W(\phi(x_i), \phi(x_j)) \ge \ell$, $(i,j) \in D$; $W = I_d + U L U^T$.   (8)

Note that the above problem is identical to (2) except for the added constraint $W = I_d + U L U^T$.

Let $F = I_k + L$. Now we have

$D_{\ell d}(W, I_d) = \mathrm{tr}(I_d + U L U^T) - \log\det(I_d + U L U^T) - d = \mathrm{tr}(I_k + L) + d - k - \log\det(I_k + L) - d = D_{\ell d}(F, I_k),$   (9)

where the second equality follows from the fact that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ and Sylvester's determinant lemma. Also note that for all $C \in \mathbb{R}^{n \times n}$,

$\mathrm{tr}(W \Phi C \Phi^T) = \mathrm{tr}((I_d + U L U^T)\Phi C \Phi^T) = \mathrm{tr}(\Phi C \Phi^T) + \mathrm{tr}(L U^T \Phi C \Phi^T U) = \mathrm{tr}(\Phi C \Phi^T) - \mathrm{tr}(\Phi' C \Phi'^T) + \mathrm{tr}(F \Phi' C \Phi'^T),$

where $\Phi' = U^T \Phi$ is the reduced-dimensional representation of $\Phi$. Therefore,

$d_W(\phi(x_i), \phi(x_j)) = \mathrm{tr}(W \Phi (e_i - e_j)(e_i - e_j)^T \Phi^T) = d_I(\phi(x_i), \phi(x_j)) - d_I(\phi'(x_i), \phi'(x_j)) + d_F(\phi'(x_i), \phi'(x_j)).$   (10)

Using (9) and (10), problem (8) is equivalent to the following:

$\min_{F \succeq 0} \; D_{\ell d}(F, I_k)$, s.t. $d_F(\phi'(x_i), \phi'(x_j)) \le u - d_I(\phi(x_i), \phi(x_j)) + d_I(\phi'(x_i), \phi'(x_j))$, $(i,j) \in S$; $d_F(\phi'(x_i), \phi'(x_j)) \ge \ell - d_I(\phi(x_i), \phi(x_j)) + d_I(\phi'(x_i), \phi'(x_j))$, $(i,j) \in D$.   (11)

Note that the above formulation is an instance of problem (2) and can be solved using an algorithm similar to Algorithm 1. Furthermore, the above problem solves for a $k \times k$ matrix rather than the $d \times d$ matrix seemingly required by (8). The optimal $W^*$ is obtained as $W^* = I_d + U(F^* - I_k)U^T$.

Next, we show that problem (11), and equivalently (8), can be solved efficiently in feature space by selecting an appropriate basis $R$ (with $U = R(R^T R)^{-1/2}$). Let $R = \Phi J$, where $J \in \mathbb{R}^{n \times k}$. Note that $U = \Phi J (J^T K_0 J)^{-1/2}$ and $\Phi' = U^T \Phi = (J^T K_0 J)^{-1/2} J^T K_0$, that is, $\Phi' \in \mathbb{R}^{k \times n}$ can be computed efficiently in the feature space (requiring inversion of only a $k \times k$ matrix). Hence, problem (11) can be solved efficiently in feature space using Algorithm 1, and the optimal kernel $K^*$ is given by

$K^* = \Phi^T W^* \Phi = K_0 + K_0 J (J^T K_0 J)^{-1/2} (F^* - I_k)(J^T K_0 J)^{-1/2} J^T K_0.$

Note that (11) can be solved via Algorithm 1 using $O(k^2)$ computational steps per iteration. Additionally, $O(\min(n,d)\,k)$ steps are required to prepare the data.
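As a small illustration of the reduced-basis computations just described, the following NumPy sketch (ours; the helper names are hypothetical) forms the reduced representation $\Phi' = (J^T K_0 J)^{-1/2} J^T K_0$ and assembles the learned kernel $K^*$ from a given optimal $F^*$.

```python
import numpy as np

def inv_sqrt_psd(A, eps=1e-10):
    """A^{-1/2} for a symmetric positive definite matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def iplr_reduced_data(K0, J):
    """Reduced k x n representation Phi' = (J^T K0 J)^{-1/2} J^T K0."""
    M = inv_sqrt_psd(J.T @ K0 @ J)
    return M @ J.T @ K0

def iplr_learned_kernel(K0, J, F_star):
    """Learned kernel K* = K0 + K0 J (J^T K0 J)^{-1/2} (F* - I_k) (J^T K0 J)^{-1/2} J^T K0."""
    k = F_star.shape[0]
    M = inv_sqrt_psd(J.T @ K0 @ J)
    B = K0 @ J @ M
    return K0 + B @ (F_star - np.eye(k)) @ B.T
```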
Also, the optimal solution $W^*$ (or $K^*$) can be stored implicitly using $O(\min(n,d)\,k)$ memory and, similarly, the Mahalanobis distance between any two points can be computed in $O(\min(n,d)\,k)$ time.

The metric learning problem presented here depends critically on the basis selected. For the case when $d$ is not significantly larger than $n$ and the feature space vectors $\Phi$ are available explicitly, the basis $R$ can be selected by using one of the following heuristics (see Section 5 of Davis and Dhillon, 2008 for more details):
- using the top $k$ singular vectors of $\Phi$;
- clustering the columns of $\Phi$ and using the mean vectors as the basis $R$;
- for the fully-supervised case, if the number of classes ($c$) is greater than the required dimensionality ($k$), then clustering the class-mean vectors into $k$ clusters and using the obtained cluster centers for forming the basis $R$; if $c \le k$, then clustering each class into $k/c$ clusters and using the cluster centers to form $R$.

For learning the kernel function, the basis $R = \Phi J$ can be selected by: 1) using a randomly sampled coefficient matrix $J$; 2) clustering $\Phi$ using kernel k-means or a spectral clustering method; 3) choosing a random subset of $\Phi$, that is, the columns of $J$ are random indicator vectors. A more careful selection of the basis $R$ should further improve the accuracy of our method and is left as a topic for future research.

4. Kernel Learning with Other Convex Loss Functions

One of the key benefits of our kernel learning formulation using the LogDet divergence (3) is the ability to efficiently learn a linear transformation (LT) kernel function (a kernel of the form $\phi(x)^T W \phi(y)$ for some matrix $W \succeq 0$), which allows the learned kernel function to be computed over new data points. A natural question is whether one can learn similar kernel functions with other loss functions, such as those considered previously in the literature for Mahalanobis metric learning.

In this section, we propose and analyze a general kernel matrix learning problem similar to (3) but using a more general class of loss functions. As in the LogDet case, we show that our kernel matrix learning problem is equivalent to learning a linear transformation (LT) kernel function with a specific loss function. This implies that the learned LT kernel function can be naturally applied to new data. Additionally, since a large class of metric learning methods can be seen as learning an LT kernel function, our result provides a constructive method for kernelizing these methods. Our analysis recovers some recent kernelization results for metric learning, but also implies several new results.

4.1 A General Kernel Learning Framework

Recall that $k_0 : \mathbb{R}^{d_0} \times \mathbb{R}^{d_0} \rightarrow \mathbb{R}$ is the input kernel function. We assume that the data vectors in $X$ have been mapped via $\phi$, resulting in $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$. As before, denote the input kernel matrix as $K_0 = \Phi^T \Phi$. The goal is to learn a kernel function $k$ that is regularized against $k_0$ but incorporates the provided side-information. As in the LogDet formulation, we first consider a transductive scenario, where we learn a kernel matrix $K$ that is regularized against $K_0$ while satisfying the available side-information.
Recall that the LogDet divergence based loss function in the kernel matrix learning problem (3) is given by

$D_{\ell d}(K, K_0) = \mathrm{tr}(K K_0^{-1}) - \log\det(K K_0^{-1}) - n = \mathrm{tr}(K_0^{-1/2} K K_0^{-1/2}) - \log\det(K_0^{-1/2} K K_0^{-1/2}) - n.$

The kernel matrix learning problem (3) can therefore be rewritten as

$\min_{K \succeq 0} \; f(K_0^{-1/2} K K_0^{-1/2})$, s.t. $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \le u$, $(i,j) \in S$; $\mathrm{tr}(K(e_i - e_j)(e_i - e_j)^T) \ge \ell$, $(i,j) \in D$,

where $f(A) = \mathrm{tr}(A) - \log\det(A)$.

In this section, we generalize our optimization problem to include loss functions beyond the LogDet-based loss function specified above. We also generalize our constraints to include arbitrary constraints over the kernel matrix $K$ rather than just the pairwise distance constraints in the above problem. With these generalizations, the optimization problem that we obtain is

$\min_{K \succeq 0} \; f(K_0^{-1/2} K K_0^{-1/2})$, s.t. $g_i(K) \le b_i$, $1 \le i \le m$,   (12)

where $f$ and the $g_i$ are functions from $\mathbb{R}^{n \times n} \rightarrow \mathbb{R}$. We call $f$ the loss function (or regularizer) and the $g_i$ the constraints. Note that if $f$ and the constraints $g_i$ are all convex, then the above problem can be solved optimally (under mild conditions) using standard convex optimization algorithms (Grötschel et al., 1988). Our results also hold for unconstrained variants of the above problem, as well as variants with slack variables.

In general, such formulations are limited in that the learned kernel cannot readily be applied to new data points. However, we will show that the above problem is equivalent to learning linear transformation (LT) kernel functions. Formally, an LT kernel function $k_W$ is a kernel function of the form $k_W(x, y) = \phi(x)^T W \phi(y)$, where $W$ is a positive semi-definite (PSD) matrix. A natural way to learn an LT kernel function is to learn the parameterization matrix $W$ using the provided side-information. To this end, we consider the following generalization of our LogDet based learning problem (2):

$\min_{W \succeq 0} \; f(W)$, s.t. $g_i(\Phi^T W \Phi) \le b_i$, $1 \le i \le m$,   (13)

where, as before, the function $f$ is the loss function and the functions $g_i$ are the constraints that encode the side information. The constraints $g_i$ are assumed to be functions of the matrix $\Phi^T W \Phi$ of learned kernel values over the training data. Note that most Mahalanobis metric learning methods may be viewed as a special case of the above framework (see Section 5). Also, for data mapped to high-dimensional spaces via kernel functions, this problem is seemingly impossible to optimize, since the size of $W$ grows quadratically with the dimensionality.

4.2 Analysis

We now analyze the connection between problems (12) and (13). We will show that the solutions to the two problems are equivalent, that is, by optimally solving one of the problems, the solution
to the other can be computed in closed form. Further, this result will yield insight into the type of kernel that is learned by the kernel learning problem. We begin by defining the class of loss functions considered in our analysis.

Definition 3  We say that $f : \mathbb{R}^{n \times n} \rightarrow \mathbb{R}$ is a spectral function if $f(A) = \sum_i f_s(\lambda_i)$, where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $A$ and $f_s : \mathbb{R} \rightarrow \mathbb{R}$ is a real-valued scalar function. Note that if $f_s$ is a convex scalar function, then $f$ is also convex.

Note that the LogDet based loss function in (3) is a spectral function. Similarly, most of the existing metric learning formulations have a spectral function as their objective function. Now we state our main result: for a spectral function, problems (12) and (13) are equivalent.

Theorem 4  Let $K_0 = \Phi^T \Phi \succ 0$, let $f$ be a spectral function as in Definition 3, and assume that the global minimum of the corresponding strictly convex scalar function $f_s$ is $\alpha > 0$. Let $W^*$ be an optimal solution to (13) and $K^*$ be an optimal solution to (12). Then

$W^* = \alpha I_d + \Phi S \Phi^T$, where $S = K_0^{-1}(K^* - \alpha K_0)K_0^{-1}$. Furthermore, $K^* = \Phi^T W^* \Phi$.

The first part of the theorem demonstrates that, given an optimal solution $K^*$ to (12), one can construct the corresponding solution $W^*$ to (13), while the second part shows the reverse. Note the similarities between this theorem and the earlier Theorem 1. We provide the proof of this theorem below. The main idea behind the proof is to first show that the optimal solution to (13) is always of the form $W = \alpha I_d + \Phi S \Phi^T$, and then to obtain the closed-form expression for $S$ using simple algebraic manipulations.

First we introduce and analyze an auxiliary optimization problem that will help in proving the above theorem. Consider the following problem:

$\min_{W \succeq 0, \, L} \; f(W)$, s.t. $g_i(\Phi^T W \Phi) \le b_i$, $1 \le i \le m$; $W = \alpha I_d + U L U^T$,   (14)

where $L \in \mathbb{R}^{k \times k}$, $U \in \mathbb{R}^{d \times k}$ is a column-orthogonal matrix, and $I_d$ is the $d \times d$ identity matrix. In general, $k$ can be significantly smaller than $\min(n,d)$. Note that the above problem is identical to (13) except for the added constraint $W = \alpha I_d + U L U^T$. We now show that (14) is equivalent to a problem over $k \times k$ matrices; in particular, (14) is equivalent to (15) defined below.

Lemma 5  Let $f$ be a spectral function as in Definition 3 and let $\alpha > 0$ be any scalar. Then (14) is equivalent to:

$\min_{L \succeq -\alpha I_k} \; f(\alpha I_k + L)$, s.t. $g_i(\alpha \Phi^T \Phi + \Phi^T U L U^T \Phi) \le b_i$, $1 \le i \le m$.   (15)

Proof  The last constraint in (14) asserts that $W = \alpha I_d + U L U^T$, which implies that there is a one-to-one mapping between $W$ and $L$: given $W$, $L$ can be computed, and vice-versa. As a result, we can eliminate the variable $W$ from (14) by substituting $\alpha I_d + U L U^T$ for $W$ (via the last constraint in (14)). The resulting optimization problem is:

$\min_{L \succeq -\alpha I_k} \; f(\alpha I_d + U L U^T)$, s.t. $g_i(\alpha \Phi^T \Phi + \Phi^T U L U^T \Phi) \le b_i$, $1 \le i \le m$.   (16)

Note that (15) and (16) are the same except for their objective functions. Below, we show that the two objective functions are equal up to a constant, so they are interchangeable in the optimization problem. Let $U' \in \mathbb{R}^{d \times d}$ be an orthonormal matrix obtained by completing the basis represented by $U$, that is, $U' = [U \; U_\perp]$ for some $U_\perp \in \mathbb{R}^{d \times (d-k)}$ such that $U^T U_\perp = 0$ and $U_\perp^T U_\perp = I_{d-k}$. Now,

$W = \alpha I_d + U L U^T = U' \begin{pmatrix} \alpha I_k + L & 0 \\ 0 & \alpha I_{d-k} \end{pmatrix} U'^T.$

It is straightforward to see that for a spectral function $f$, $f(V W V^T) = f(W)$ for any orthogonal matrix $V$, and that for all square $A, B$, $f\!\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} = f(A) + f(B)$. Using the above observations, we get:

$f(W) = f(\alpha I_d + U L U^T) = f\!\begin{pmatrix} \alpha I_k + L & 0 \\ 0 & \alpha I_{d-k} \end{pmatrix} = f(\alpha I_k + L) + (d - k)\, f_s(\alpha).$   (17)

Therefore, the objective functions of (15) and (16) differ by only a constant, that is, they are equivalent with respect to the optimization problem. The lemma follows.
We now show that for strictly convex spectral functions (see Definition 3) the optimal solution $W^*$ to (13) is of the form $W^* = \alpha I_d + \Phi S \Phi^T$ for some $S$.

Lemma 6  Suppose $f$, $K_0$ and $\alpha$ satisfy the conditions given in Theorem 4. Then the optimal solution to (13) is of the form $W^* = \alpha I_d + \Phi S \Phi^T$, where $S$ is an $n \times n$ matrix.

Proof  Note that $K_0 \succ 0$ implies that $d \ge n$. Our results can be extended to the case $d < n$, that is, $K_0 \succeq 0$, by using the pseudo-inverse of $K_0$ instead of the inverse; for simplicity we only present the full-rank case. Let $W = U \Lambda U^T = \sum_j \lambda_j u_j u_j^T$ be the eigenvalue decomposition of $W$. Consider a constraint $g_i(\Phi^T W \Phi) \le b_i$ as specified in (13). If the $j$-th eigenvector $u_j$ of $W$ is orthogonal to the range space of $\Phi$, that is, $\Phi^T u_j = 0$, then the corresponding eigenvalue $\lambda_j$ is not constrained (except for the non-negativity constraint imposed by the positive semi-definiteness constraint). Since the range space of $\Phi$ is at most $n$-dimensional, we can assume that the $\lambda_j$, for all $j > n$, are not constrained by the linear inequality constraints in (13). Since $f$ satisfies the conditions of Theorem 4, $f(W) = \sum_j f_s(\lambda_j)$ and $f_s(\alpha) = \min_x f_s(x)$. Hence, to minimize $f(W)$, we can select $\lambda_j = \alpha > 0$ for all $j > n$ (note that the non-negativity constraint is also satisfied here). Furthermore, the eigenvectors $u_j$, for $j \le n$, lie in the range space of $\Phi$, that is, $u_j = \Phi z_j$ for some $z_j \in \mathbb{R}^n$. Therefore,

$W = \sum_{j=1}^{n} \lambda_j u_j u_j^T + \alpha \sum_{j=n+1}^{d} u_j u_j^T = \sum_{j=1}^{n} (\lambda_j - \alpha) u_j u_j^T + \alpha \sum_{j=1}^{d} u_j u_j^T = \Phi S \Phi^T + \alpha I_d,$

where $S = \sum_{j=1}^{n} (\lambda_j - \alpha) z_j z_j^T$.

Now we use Lemmas 5 and 6 to prove Theorem 4.

Proof (of Theorem 4)  Let $\Phi = U_\Phi \Sigma V_\Phi^T$ be the singular value decomposition (SVD) of $\Phi$. Note that $K_0 = \Phi^T \Phi = V_\Phi \Sigma^2 V_\Phi^T$, so $\Sigma V_\Phi^T = V_\Phi^T K_0^{1/2}$. Also, assuming $\Phi \in \mathbb{R}^{d \times n}$ to be full-rank and $d > n$, $V_\Phi V_\Phi^T = I$. Using Lemma 6, the optimal solution to (13) is restricted to be of the form

$W = \alpha I_d + \Phi S \Phi^T = \alpha I_d + U_\Phi \Sigma V_\Phi^T S V_\Phi \Sigma U_\Phi^T = \alpha I_d + U_\Phi V_\Phi^T K_0^{1/2} S K_0^{1/2} V_\Phi U_\Phi^T = \alpha I_d + U_\Phi V_\Phi^T L\, V_\Phi U_\Phi^T,$

where $L = K_0^{1/2} S K_0^{1/2}$. As a result, for spectral functions, (13) is equivalent to (14), so using Lemma 5, (13) is equivalent to (15) with $U = U_\Phi V_\Phi^T$ and $L = K_0^{1/2} S K_0^{1/2}$. Also, note that the constraints in (15) can be simplified to

$g_i(\alpha \Phi^T \Phi + \Phi^T U L U^T \Phi) \le b_i \;\;\Longleftrightarrow\;\; g_i(\alpha K_0 + K_0^{1/2} L K_0^{1/2}) \le b_i.$

Now let $K = \alpha K_0 + K_0^{1/2} L K_0^{1/2} = \alpha K_0 + K_0 S K_0$, that is, $L = K_0^{-1/2}(K - \alpha K_0) K_0^{-1/2}$. Theorem 4 now follows by substituting for $L$ in (15).
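As an informal numerical check of the structure asserted by Theorem 4 (ours, not from the paper): for any $S$ and $\alpha$, a matrix of the form $W = \alpha I_d + \Phi S \Phi^T$ induces the training-data kernel $\Phi^T W \Phi = \alpha K_0 + K_0 S K_0$, which is exactly the correspondence used above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, alpha = 8, 5, 1.0
Phi = rng.normal(size=(d, n))            # feature-space data matrix (d >= n, full rank)
K0 = Phi.T @ Phi                         # input kernel matrix
S = rng.normal(size=(n, n)); S = (S + S.T) / 2

W = alpha * np.eye(d) + Phi @ S @ Phi.T  # form asserted by Lemma 6 / Theorem 4
K = Phi.T @ W @ Phi                      # learned kernel values over the training data

assert np.allclose(K, alpha * K0 + K0 @ S @ K0)
# Recovering S from K as in Theorem 4: S = K0^{-1} (K - alpha*K0) K0^{-1}
K0_inv = np.linalg.inv(K0)
assert np.allclose(S, K0_inv @ (K - alpha * K0) @ K0_inv)
```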
As a first consequence of this result, we can achieve induction over the learned kernels, analogous to (6) for the LogDet case. Given that $K = \Phi^T W \Phi$, we can see that the learned kernel function is a linear transformation kernel, that is, $k(x_i, x_j) = \phi(x_i)^T W \phi(x_j)$. Given a pair of new data points $z_1$ and $z_2$, we use the fact that the learned kernel is a linear transformation kernel, along with the first result of the theorem ($W^* = \alpha I_d + \Phi S \Phi^T$), to compute the learned kernel as

$\phi(z_1)^T W \phi(z_2) = \alpha \, k_0(z_1, z_2) + k_1^T S k_2$, where $k_i = [k_0(z_i, x_1), \ldots, k_0(z_i, x_n)]^T$.   (18)

Since the LogDet divergence is also a spectral function, Theorem 4 is a generalization of Theorem 1 and implies kernelization for our metric learning formulation (2). Moreover, many Mahalanobis metric learning methods can be viewed as special cases of (13), so a corollary of Theorem 4 is that we can constructively apply these metric learning methods in kernel space by solving their corresponding kernel learning problem, and then compute the learned metrics via (18). Kernelization of Mahalanobis metric learning has previously been established for some special cases; our results generalize and extend previous methods, as well as provide simpler techniques in some cases. We will further elaborate in Section 5 with several special cases.

4.3 Parameter Reduction

As noted in Section 3.6, the size of the kernel matrices $K$ and the parameter matrices $S$ is $n \times n$, and thus grows quadratically with the number of data points. Similar to the special case of the LogDet divergence (see Section 3.6), we would like a way to restrict our general optimization problem (12) to a smaller number of parameters. We now discuss a generalization of (13) that introduces an additional constraint to make it possible to reduce the number of parameters to learn, permitting scalability to data sets with many training points and with very high dimensionality.

Theorem 4 shows that the optimal $K$ is of the form $\Phi^T W \Phi = \alpha K_0 + K_0 S K_0$. In order to accommodate fewer parameters to learn, a natural option is to replace the unknown $S$ matrix with a low-rank matrix $J L J^T$, where $J \in \mathbb{R}^{n \times k}$ is a pre-specified matrix, $L \in \mathbb{R}^{k \times k}$ is unknown (we use $L$ instead of $S$ to emphasize that $S$ is of size $n \times n$ whereas $L$ is $k \times k$), and the rank $k$ is a parameter of the algorithm. We then explicitly enforce that the learned kernel is of this form. Plugging $K = \alpha K_0 + K_0 S K_0$ into (12) and replacing $S$ with $J L J^T$, the resulting optimization problem is

$\min_{L \succeq 0} \; f(\alpha I_n + K_0^{1/2} J L J^T K_0^{1/2})$, s.t. $g_i(\alpha K_0 + K_0 J L J^T K_0) \le b_i$, $1 \le i \le m$.   (19)

Note that the above problem is a strict generalization of our LogDet based parameter reduction approach (see Section 3.6). While the above problem involves only $k \times k$ variables, the functions $f$ and $g_i$ are applied to $n \times n$ matrices and the problem may therefore still be computationally expensive to optimize. Below, we show that for any spectral function $f$ and linear constraints $g_i(K) = \mathrm{tr}(C_i K)$, (19) reduces to a problem that applies $f$ and the $g_i$ to $k \times k$ matrices only, which provides significant scalability.

Theorem 7  Let $K_0 = \Phi^T \Phi$ and let $J$ be some matrix in $\mathbb{R}^{n \times k}$. Also, let the loss function $f$ be a spectral function (see Definition 3) such that the corresponding strictly convex scalar function $f_s$ has its global minimum at $\alpha > 0$. Then problem (19) with $g_i(K) = \mathrm{tr}(C_i K)$ is equivalent to the following alternative optimization problem:

$\min_{L \succeq -\alpha (K_J)^{-1}} \; f\big((K_J)^{-1/2}(\alpha K_J + K_J L K_J)(K_J)^{-1/2}\big)$, s.t. $\mathrm{tr}(L J^T K_0 C_i K_0 J) \le b_i - \mathrm{tr}(\alpha K_0 C_i)$, $1 \le i \le m$,   (20)

where $K_J = J^T K_0 J$.

Proof  Let $U = K_0^{1/2} J (J^T K_0 J)^{-1/2}$; if $J$ is a full-rank matrix, then $U$ has orthonormal columns. Using (17), we get

$f(\alpha I_n + U (J^T K_0 J)^{1/2} L (J^T K_0 J)^{1/2} U^T) = f(\alpha I_k + (J^T K_0 J)^{1/2} L (J^T K_0 J)^{1/2}) + (n-k)\,f_s(\alpha),$

so the two objectives differ only by a constant. Now consider a linear constraint $\mathrm{tr}(C_i(\alpha K_0 + K_0 J L J^T K_0)) \le b_i$. This can easily be simplified to $\mathrm{tr}(L J^T K_0 C_i K_0 J) \le b_i - \mathrm{tr}(\alpha K_0 C_i)$. Similar simple algebraic manipulations applied to the PSD constraint complete the proof.
Note that (20) is over $k \times k$ matrices (after initial pre-processing) and is in fact similar to the kernel learning problem (12), but with a kernel $K_J$ of smaller size $k \times k$, $k \ll n$.

Similar to (12), we can show that (19) is also equivalent to LT kernel function learning. This enables us to naturally apply the above kernel learning problem in the inductive setting.

Theorem 8  Consider (19) with $g_i(K) = \mathrm{tr}(C_i K)$ and a spectral function $f$ whose corresponding scalar function $f_s$ has a global minimum at $\alpha > 0$. Let $J \in \mathbb{R}^{n \times k}$. Then (19) and (20) with $g_i(K) = \mathrm{tr}(C_i K)$ are equivalent to the following linear transformation kernel learning problem (analogous to the connection between (12) and (13)):

$\min_{W \succeq 0, \, L} \; f(W)$, s.t. $\mathrm{tr}(C_i \Phi^T W \Phi) \le b_i$, $1 \le i \le m$; $W = \alpha I_d + \Phi J L J^T \Phi^T$.   (21)

Proof  Consider the last constraint in (21): $W = \alpha I_d + \Phi J L J^T \Phi^T$. Let $\Phi = U \Sigma V^T$ be the SVD of $\Phi$. Hence, $W = \alpha I_d + U V^T V \Sigma V^T J L J^T V \Sigma V^T V U^T = \alpha I_d + U V^T K_0^{1/2} J L J^T K_0^{1/2} V U^T$, where we used $K_0^{1/2} = V \Sigma V^T$. For disambiguity, rename $L$ as $L'$ and $U$ as $U'$. The result now follows by using Lemma 5 with $U = U' V^T$ and $L = K_0^{1/2} J L' J^T K_0^{1/2}$.

Note that, in contrast to (13), where the last constraint over $W$ is achieved automatically, (21) requires this constraint on $W$ to be satisfied during the optimization process, which leads to a reduced number of parameters for our kernel learning problem. The above theorem shows that our reduced-parameter kernel learning method (19) also implicitly learns a linear transformation kernel function, hence we can generalize the learned kernel to unseen data points using an expression similar to (18).

5. Special Cases

In the previous section, we proved a general result showing the connections between metric and kernel learning using a wide class of loss functions and constraints. In this section, we consider a few special cases of interest: the von Neumann divergence, the squared Frobenius norm, and semidefinite programming. For each case, we derive the required optimization problem and mention the relevant optimization algorithms that can be used.

5.1 Von Neumann Divergence

The von Neumann divergence is a generalization of the well-known KL-divergence to matrices. It is used extensively in quantum computing to compare density matrices of two different systems (Nielsen and Chuang, 2000). It is also used in the exponentiated matrix gradient method by Tsuda et al. (2005), the online-PCA method by Warmuth and Kuzmin (2008), and the fast SVD solver by Arora and Kale (2007). The von Neumann divergence between $A$ and $A_0$ is defined to be

$D_{vN}(A, A_0) = \mathrm{tr}(A \log A - A \log A_0 - A + A_0),$

where both $A$ and $A_0$ are positive definite. Computing the von Neumann divergence with respect to the identity matrix, we get $f_{vN}(A) = \mathrm{tr}(A \log A - A + I)$. Note that $f_{vN}$ is a spectral function with corresponding scalar function $f_{vN}(\lambda) = \lambda \log \lambda - \lambda$, whose minimum is at $\lambda = 1$. Now, the kernel learning problem (12) with loss function $f_{vN}$ and linear constraints is

$\min_{K \succeq 0} \; f_{vN}(K_0^{-1/2} K K_0^{-1/2})$, s.t. $\mathrm{tr}(K C_i) \le b_i$, $\forall\, 1 \le i \le m$.   (22)

As $f_{vN}$ is a spectral function, using Theorem 4, the above kernel learning problem is equivalent to the following metric learning problem:

$\min_{W \succeq 0} \; D_{vN}(W, I)$, s.t. $\mathrm{tr}(W \Phi C_i \Phi^T) \le b_i$, $\forall\, 1 \le i \le m$.

Using elementary linear algebra, we obtain the following simplified dual:

$\min_{\lambda_1, \lambda_2, \ldots, \lambda_m \ge 0} \; F(\lambda) = \mathrm{tr}(\exp(-C(\lambda) K_0)) + b(\lambda)$, where $C(\lambda) = \sum_i \lambda_i C_i$ and $b(\lambda) = \sum_i \lambda_i b_i$.

Further, $\partial F / \partial \lambda_i = -\mathrm{tr}(\exp(-C(\lambda) K_0)\, C_i K_0) + b_i$, so any first-order smooth optimization method can be used to solve the above dual problem. Alternatively, similar to the method of Kulis et al. (2008), Bregman's projection method can be used to solve the primal problem (22).
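A minimal NumPy sketch (ours, not from the paper) of the von Neumann divergence for symmetric positive definite matrices, computing the matrix logarithm through an eigendecomposition:

```python
import numpy as np

def spd_logm(A):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.T

def von_neumann_divergence(A, A0):
    """D_vN(A, A0) = tr(A log A - A log A0 - A + A0), for symmetric positive definite A, A0."""
    return float(np.trace(A @ spd_logm(A) - A @ spd_logm(A0) - A + A0))

# Sanity check: D_vN(A, A) = 0.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4)); A = B @ B.T + np.eye(4)
assert abs(von_neumann_divergence(A, A)) < 1e-8
```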
5.2 Pseudo Online Metric Learning (POLA)

Shalev-Shwartz et al. (2004) proposed the following batch metric learning formulation:

$\min_{W \succeq 0} \; \|W\|_F^2$, s.t. $y_{ij}\,(b - d_W(x_i, x_j)) \ge 1$, $\forall (i,j) \in P$,

where $y_{ij} = 1$ if $x_i$ and $x_j$ are similar, $y_{ij} = -1$ if $x_i$ and $x_j$ are dissimilar, and $P$ is a set of pairs of points with known distance constraints. POLA is an instantiation of (13) with $f(A) = \frac{1}{2}\|A\|_F^2$ and side-information available in the form of pairwise distance constraints. Note that the regularizer $f(A) = \frac{1}{2}\|A\|_F^2$ was also employed in Schultz and Joachims (2003) and Kwok and Tsang (2003), and these methods also fall under our general formulation. In this case, $f$ is once again a strictly convex spectral function, and its global minimum is $\alpha = 0$, so we can use (12) to solve for the learned kernel $K$ as

$\min_{K \succeq 0} \; \|K K_0^{-1}\|_F^2$, s.t. $g_i(K) \le b_i$, $1 \le i \le m$.

The constraints $g_i$ for this problem can be easily constructed by rewriting each of POLA's constraints as a function of $\Phi^T W \Phi$. Note that the above approach to kernelization is much simpler than the method suggested in Shalev-Shwartz et al. (2004), which involves a kernelized Gram-Schmidt procedure at each step of the algorithm.

5.3 SDPs

Weinberger et al. (2005) and Globerson and Roweis (2005) proposed metric learning formulations that can be rewritten as semidefinite programs (SDPs), a special case of (13) with a linear loss function. Consider the following general semidefinite program to learn a linear transformation $W$:

$\min_{W \succeq 0} \; \mathrm{tr}(W \Phi C_0 \Phi^T)$, s.t. $\mathrm{tr}(W \Phi C_i \Phi^T) \le b_i$, $\forall\, 1 \le i \le m$.   (23)

Here we show that this problem can be efficiently solved for high-dimensional data in its kernel space, hence kernelizing the metric learning methods introduced by Weinberger et al. (2005) and Globerson and Roweis (2005).

Theorem 9  Problem (23) is kernelizable.

Proof  Problem (23) has a linear objective, that is, it is a non-strict convex problem that may have multiple solutions. A variety of regularizations can be considered that lead to slightly different solutions. Here, we consider the LogDet regularization, which seeks the solution with maximum determinant. To this effect, we add a log-determinant term:

$\min_{W \succeq 0} \; \mathrm{tr}(W \Phi C_0 \Phi^T) - \gamma \log\det W$, s.t. $\mathrm{tr}(W \Phi C_i \Phi^T) \le b_i$, $\forall\, 1 \le i \le m$.   (24)

The above regularization was also considered by Kulis et al. (2009b), who provided a fast projection algorithm for the case when each $C_i$ is a rank-one matrix and discussed conditions under which the optimal solution to the regularized problem is an optimal solution to the original SDP. The above formulation also generalizes the metric learning formulation of RCA (Bar-Hillel et al., 2005). Consider the following variational formulation of (24):

$\min_{t} \; \min_{W \succeq 0} \; -\gamma \log\det W$, s.t. $\mathrm{tr}(W \Phi C_i \Phi^T) \le b_i$, $\forall\, 1 \le i \le m$; $\mathrm{tr}(W \Phi C_0 \Phi^T) \le t$.   (25)

Note that the objective function of the inner optimization problem of (25) is a spectral function, and hence, using Theorem 4, (25), or equivalently (24), is kernelizable.
5.4 Trace-norm Based Inductive Semi-supervised Kernel Dimensionality Reduction (Trace-SSIKDR)

Finally, we apply our framework to semi-supervised kernel dimensionality reduction, which provides a novel and practical application of our framework. While there exists a variety of methods for kernel dimensionality reduction, most of these methods are unsupervised (e.g., kernel-PCA) or are restricted to the transductive setting. In contrast, we can use our kernel learning framework to implicitly learn a low-rank transformation of the feature vectors that in turn provides a low-dimensional embedding of the data set. Furthermore, our framework permits a variety of side-information such as pairwise or relative distance constraints, beyond the class label information allowed by existing transductive methods.

We describe our method starting from the linear transformation problem. Our goal is to learn a low-rank linear transformation $W$ whose corresponding low-dimensional mapped embedding of $x_i$ is $W^{1/2}\phi(x_i)$. Even when the dimensionality of $\phi(x_i)$ is very large, if the rank of $W$ is low enough, then the mapped embedding will have small dimensionality. With that in mind, a possible regularizer could be the rank, that is, $f(A) = \mathrm{rank}(A)$; one can easily show that this satisfies the definition of a spectral function. Unfortunately, optimization is intractable in general with the non-convex rank function, so we use the trace-norm relaxation of the matrix rank function, that is, we set $f(A) = \mathrm{tr}(A)$. This function has been extensively studied as a relaxation for the rank function in Recht et al. (2010), and it satisfies the definition of a spectral function (with $\alpha = 0$). We also add a small Frobenius norm regularization for ease of optimization (this does not affect the spectral property of the regularization function). Then, using Theorem 4, the resulting relaxed kernel learning problem is

$\min_{K \succeq 0} \; \tau\,\mathrm{tr}(K_0^{-1/2} K K_0^{-1/2}) + \|K_0^{-1/2} K K_0^{-1/2}\|_F^2$, s.t. $\mathrm{tr}(C_i K) \le b_i$, $1 \le i \le m$,   (26)

where $\tau > 0$ is a parameter. The above problem can be solved using a method based on Uzawa's inexact algorithm, similar to Cai et al. (2008). We briefly describe the steps taken by our method at each iteration. For simplicity, denote $\widetilde{K} = K_0^{-1/2} K K_0^{-1/2}$; we optimize with respect to $\widetilde{K}$ instead of $K$. Let $\widetilde{K}^t$ be the $t$-th iterate. Associate a variable $z_i^t$, $1 \le i \le m$, with each constraint at each iteration, and let $z_i^0 = 0$ for all $i$. Let $\delta_t$ be the step size at iteration $t$. The algorithm performs the following updates:

$U \Sigma U^T \leftarrow K_0^{1/2} C K_0^{1/2}$, $\qquad \widetilde{K}^t \leftarrow U \max(\Sigma - \tau I, 0)\, U^T$,   (27)

$z_i^t \leftarrow z_i^{t-1} - \delta_t \max\big(\mathrm{tr}(C_i K_0^{1/2} \widetilde{K}^t K_0^{1/2}) - b_i, \, 0\big)$, $\forall i$,   (28)

where $C = \sum_i z_i^{t-1} C_i$. The above updates require computation of $K_0^{1/2}$, which is expensive for large high-rank matrices. However, using elementary linear algebra, we can show that $\widetilde{K}$ and the learned kernel function can be computed efficiently without computing $K_0^{1/2}$, by maintaining $S = K_0^{-1/2}\widetilde{K}K_0^{-1/2}$ from step to step.
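The following NumPy sketch (ours, not from the paper) shows one iteration of the updates (27)-(28) in their direct form, that is, with $K_0^{1/2}$ computed explicitly; the efficient variant that avoids forming $K_0^{1/2}$ is derived below.

```python
import numpy as np

def sqrt_psd(A, eps=1e-12):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.maximum(w, eps))) @ V.T

def trace_ssikdr_step(K0_half, C_list, b, z, tau, delta):
    """One iteration of updates (27)-(28): eigenvalue shrinkage followed by a dual step."""
    C = sum(z_i * C_i for z_i, C_i in zip(z, C_list))
    M = K0_half @ C @ K0_half
    w, U = np.linalg.eigh(M)                                  # decompose K0^{1/2} C K0^{1/2}
    K_tilde = U @ np.diag(np.maximum(w - tau, 0.0)) @ U.T     # (27): shrink eigenvalues by tau
    K = K0_half @ K_tilde @ K0_half                           # learned kernel over training data
    z_new = [z_i - delta * max(np.trace(C_i @ K) - b_i, 0.0)
             for z_i, C_i, b_i in zip(z, C_list, b)]          # (28): dual variable update
    return K_tilde, z_new
```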
We first prove a technical lemma relating the eigenvectors $U$ of $K_0^{1/2} C K_0^{1/2}$ to the eigenvectors $V$ of the matrix $C K_0$.

Lemma 10  Let $K_0^{1/2} C K_0^{1/2} = U_k \Sigma_k U_k^T$, where $U_k$ contains the top-$k$ eigenvectors and $\Sigma_k$ the top-$k$ eigenvalues of $K_0^{1/2} C K_0^{1/2}$. Similarly, let $C K_0 = V_k \Lambda_k V_k^{-1}$, where $V_k$ contains the top-$k$ right eigenvectors and $\Lambda_k$ the top-$k$ eigenvalues of $C K_0$. Then

$U_k = K_0^{1/2} V_k D_k$, $\quad \Sigma_k = \Lambda_k$,

where $D_k$ is a diagonal matrix with $i$-th diagonal element $D_k(i,i) = 1/\sqrt{v_i^T K_0 v_i}$. Note that the eigenvalue decomposition is unique up to sign, so we assume that the sign has been set correctly.

Proof  Let $v_i$ be the $i$-th eigenvector of $C K_0$. Then $C K_0 v_i = \lambda_i v_i$. Multiplying both sides by $K_0^{1/2}$, we get $K_0^{1/2} C K_0^{1/2}\,(K_0^{1/2} v_i) = \lambda_i\,(K_0^{1/2} v_i)$. After normalization we get:

$\big(K_0^{1/2} C K_0^{1/2}\big)\,\dfrac{K_0^{1/2} v_i}{\sqrt{v_i^T K_0 v_i}} = \lambda_i\,\dfrac{K_0^{1/2} v_i}{\sqrt{v_i^T K_0 v_i}}.$

Hence $K_0^{1/2} v_i / \sqrt{v_i^T K_0 v_i} = K_0^{1/2} v_i D_k(i,i)$ is the $i$-th eigenvector $u_i$ of $K_0^{1/2} C K_0^{1/2}$. Also, $\Sigma_k(i,i) = \lambda_i$.

Using the above lemma and (27), we get $\widetilde{K} = K_0^{1/2} V_k D_k \Sigma_k D_k V_k^T K_0^{1/2}$. Therefore, the update for the $z$ variables (see (28)) reduces to:

$z_i^t \leftarrow z_i^{t-1} - \delta_t \max\big(\mathrm{tr}(C_i K_0 V_k D_k \Sigma_k D_k V_k^T K_0) - b_i, \, 0\big)$, $\forall i$.

Lemma 10 also implies that if $k$ eigenvalues of $C K_0$ are larger than $\tau$, then we only need the top $k$ eigenvalues of $C K_0$. Since $k$ is typically significantly smaller than $n$, this update should be significantly more efficient than computing the whole eigenvalue decomposition.

Algorithm 2: Trace-SSIKDR

Input: $K_0$, $(C_i, b_i)$ for $1 \le i \le m$, $\tau$, $\delta$.
1: Initialize: $z_i^0 = 0$, $t = 0$
2: repeat
3:   $t = t + 1$
4:   Compute $V_k$ and $\Sigma_k$, the top-$k$ eigenvectors and eigenvalues of $\sum_i z_i^{t-1} C_i K_0$, where $k$ is the number of eigenvalues $\sigma_j > \tau$
5:   $D_k(i,i) \leftarrow 1/\sqrt{v_i^T K_0 v_i}$, $1 \le i \le k$
6:   $z_i^t \leftarrow z_i^{t-1} - \delta\,\max\big(\mathrm{tr}(C_i K_0 V_k D_k \Sigma_k D_k V_k^T K_0) - b_i, \, 0\big)$, $\forall i$
7: until convergence
8: return $\Sigma_k$, $D_k$, $V_k$

Algorithm 2 details an efficient method for optimizing (26); it returns matrices $\Sigma_k$, $D_k$ and $V_k$, all of which contain only $O(nk)$ parameters, where $k$ is the rank of $\widetilde{K}^t$, which changes from iteration to iteration. Note that step 4 of the algorithm computes $k$ singular vectors and requires only $O(nk^2)$ computation. Note also that the learned embedding $x_i \rightarrow \widetilde{K}^{1/2} K_0^{-1/2} k_i$, where $k_i$ is the vector of input kernel function values between $x_i$ and the training data, can be computed efficiently as $x_i \rightarrow \Sigma_k^{1/2} D_k V_k^T k_i$, which does not require $K_0^{-1/2}$ explicitly.

6. Experimental Results

In Section 3, we presented metric learning as a constrained LogDet optimization problem to learn a linear transformation, and we showed that the problem can be efficiently kernelized to learn linear transformation kernels. Kernel learning yields two fundamental advantages over standard non-kernelized metric learning. First, a non-linear kernel can be used to learn non-linear decision boundaries common in applications such as image analysis. Second, in Section 3.6, we showed that the kernelized problem can be learned with respect to a reduced basis of size $k$, admitting a learned kernel parameterized by $O(k^2)$ values. When the number of training examples $n$ is large, this represents a substantial improvement over optimizing over the entire $O(n^2)$ matrix, both in terms of computational efficiency and statistical robustness. In Section 4, we generalized kernel function learning to other loss functions. A special case of our approach is the trace-norm based kernel function learning problem, which can be applied to the task of semi-supervised inductive kernel dimensionality reduction.

In this section, we present experiments from several domains: benchmark UCI data, automated software debugging, text analysis, and object recognition for computer vision. We evaluate performance of our learned distance metrics or kernel functions in the context of a) classification accuracy for the k-nearest neighbor algorithm, and b) kernel dimensionality reduction. For the classification task, our k-nearest neighbor classifier uses k = 10 nearest neighbors (except for Section 6.3, where we use k = 1), breaking ties arbitrarily. We select the value of k arbitrarily and expect to get slightly better accuracies using cross-validation. Accuracy is defined as the number of correctly classified examples divided by the total number of classified examples. For the dimensionality reduction task, we visualize the two-dimensional embedding of the data using our trace-norm based method with pairwise similarity/dissimilarity constraints.

For our proposed algorithms, pairwise constraints are inferred from true class labels. For each class $i$, 100 pairs of points are randomly chosen from within class $i$ and are constrained to be similar, and 100 pairs of points are drawn from classes other than $i$ to form dissimilarity constraints. Given $c$ classes, this results in $100c$ similarity constraints and $100c$ dissimilarity constraints, for a total of $200c$ constraints. The upper and lower bounds $u$ and $\ell$ for the similarity and dissimilarity constraints are determined empirically as the 1st and 99th percentiles of the distribution of distances computed using a baseline Mahalanobis distance parameterized by $W_0$. Finally, the slack penalty parameter $\gamma$ used by our algorithms is cross-validated over the values {0.01, 0.1, 1, 10, 100, 1000}.

All metrics/kernels are trained using data only in the training set. Test instances are drawn from the test set and are compared to examples in the training set using the learned distance/kernel function. The test and training sets are established using a standard two-fold cross validation approach.
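The constraint-generation protocol just described can be summarized in the following NumPy sketch (ours; the function and variable names are hypothetical). It samples 100 similarity and 100 dissimilarity pairs per class and sets $u$ and $\ell$ from the 1st and 99th percentiles of baseline distances, producing (i, j, delta) tuples compatible with the Algorithm 1 sketch given earlier.

```python
import numpy as np

def make_pairwise_constraints(y, baseline_dists, pairs_per_class=100, seed=0):
    """Build (i, j, delta) constraints from class labels y, plus thresholds (u, ell).
    baseline_dists: n x n matrix of distances under the baseline metric W0."""
    rng = np.random.default_rng(seed)
    constraints = []
    for c in np.unique(y):
        same = np.flatnonzero(y == c)
        diff = np.flatnonzero(y != c)
        for _ in range(pairs_per_class):
            i, j = rng.choice(same, size=2, replace=False)
            constraints.append((i, j, +1))                                 # similar pair
            constraints.append((rng.choice(same), rng.choice(diff), -1))   # dissimilar pair
    offdiag = baseline_dists[np.triu_indices_from(baseline_dists, k=1)]
    u = np.percentile(offdiag, 1)      # upper bound for similar pairs
    ell = np.percentile(offdiag, 99)   # lower bound for dissimilar pairs
    return constraints, u, ell
```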
For experiments in which a baseline distance metric/kernel is evaluated (for example, the squared Euclidean distance), nearest neighbor searches are again computed from test instances to only those instances in the training set.

For additional large-scale results, see Kulis et al. (2009a), which uses our parameter-reduction strategy to learn kernels over a data set containing nearly half a million images in 24,000-dimensional space for the problem of human-body pose estimation; we also applied our algorithms to the MNIST data set of 60,000 digits in Kulis et al. (2008).

6.1 Low-Dimensional Data Sets

First we evaluate our LogDet divergence based metric learning method (see Algorithm 1) on the standard UCI data sets in the low-dimensional (non-kernelized) setting, to directly compare with several existing metric learning methods. In Figure 1(a), we compare the LogDet Linear ($K_0$ equals the linear kernel) and LogDet Gaussian ($K_0$ equals the Gaussian kernel in kernel space) algorithms
thedocumentcompiler,9classes),Mpg321(anmp3player,4classes),Foxpro(adatabasemanager,4classes),andIptables(aLinuxkernelapplication,5classes).540 METRICANDKERNELLEARNINGUSINGALINEARTRANSFORMATION DataSet n d Unsupervised LogDetLinear HMRF-KMeans Ionosphere 351 34 0.314 0.113 0.256 Digits-389 317 16 0.226 0.175 0.286 Table1:Unsupervisedk-meansclusteringerrorusingthebaselinesquaredEuclideandistance,alongwithsemi-supervisedclusteringerrorwith50constraints.OurexperimentsontheClarifysystem,liketheUCIdata,areoverfairlylow-dimensionaldata.Haetal.(2007)showedthathighclassicationaccuracycanbeobtainedbyusingarelativelysmallsubsetofavailablefeatures.Thus,foreachdataset,weuseastandardinformationgainfeatureselectiontesttoobtainareducedfeaturesetofsize20.Fromthis,welearnmetricsfork-NNclas-sicationusingthemethodsdevelopedinthispaper.ResultsaregiveninFigure1(b).TheLogDetLinearalgorithmyieldssignicantgainsfortheLatexbenchmark.NotethatfordatasetswhereEuclideandistanceperformsbetterthantheinversecovariancemetric,theLogDetLinearalgorithmthatnormalizestothestandardEuclideandistanceyieldshigheraccuracythanthatregularizedtoinversecovariance(LogDet-InverseCovariance).Ingeneral,fortheMpg321,Foxpro,andIptablesdatasets,learnedmetricsyieldonlymarginalgainsoverthebaselineEuclideandistancemeasure.Figure1(c)showstheerrorratefortheLatexdatasetswithavaryingnumberoffeatures(thefeaturesetsareagainchosenusingtheinformationgaincriteria).WeseeherethatLogDetLinearissurprisinglyrobust.Euclideandistance,MCML,andLMNNallachievetheirbesterrorratesforvedimensions.LogDetLinear,however,attainsitslowesterrorrateof.15atd=20dimensions.Wealsobrieypresentsomesemi-supervisedclusteringresultsfortwooftheUCIdatasets.NotethatbothMCMLandLMNNarenotamenabletooptimizationsubjecttopairwisedistanceconstraints.Instead,wecompareourmethodtothesemi-supervisedclusteringalgorithmHMRF-KMeans(Basuetal.,2004).Weuseastandard2-foldcrossvalidationapproachforevaluatingsemi-supervisedclusteringresults.Distancesareconstrainedtobeeithersimilarordissimilar,basedonclassvalues,andaredrawnonlyfromthetrainingset.Theentiredatasetisthenclusteredintocclustersusingk-means(wherecisthenumberofclasses)anderroriscomputedusingonlythetestset.Table1providesresultsforthebaselinek-meanserror,aswellassemi-supervisedclusteringresultswith50constraints.6.2MetricLearningforTextClassicationNextwepresentresultsinthetextdomain.Ourtextdatasetsarecreatedbystandardbag-of-wordsTf-Idfrepresentations.WordsarestemmedusingastandardPorterstemmerandcommonstopwordsareremoved,andthetextmodelsarelimitedtothe5,000wordswiththelargestdocu-mentfrequencycounts.Weprovideexperimentsfortwodatasets:CMU20-NewsgroupsDataSet(2008),andClassic3DataSet(2008).Classic3isarelativelysmall3classproblemwith3,891in-stances.Thenewsgroupdatasetismuchlarger,having20differentclassesfromvariousnewsgroupcategoriesand20,000instances.Ourtextexperimentsemployalinearkernel,andweuseasetofbasisvectorsthatisconstructedfromtheclasslabelsviathefollowingprocedure.Letcbethenumberofdistinctclassesandletkbethesizeofthedesiredbasis.Ifk=c,theneachclassmeanriiscomputedtoformthebasisR=[r1:::rc].Ifkcasimilarprocessisusedbutrestrictedtoarandomlyselectedsubsetof541 JAIN,KULIS,DAVISANDDHILLON 2 2.5 3 3.5 4 4.5 5 0.96 0.965 0.97 0.975 0.98 0.985 0.99 0.995 1 Kernel Basis SizeAccuracy LogDet Linear LSA LMNN Euclidean 5 10 15 20 25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Basis SizeAccuracy LogDet Linear LSA LMNN Euclidean -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 -0.25 -0.2 -0.15 -0.1 -0.05 0 0.05 -0.1 -0.05 0 0.05 0.1 
[Figure 2 plots: (a), (b) accuracy versus kernel basis size for LogDet Linear, LSA, LMNN, and Euclidean on Classic3 and 20-Newsgroups; (c), (d) two-dimensional embeddings of the USPS digits.]

Figure 2: (a), (b): Classification accuracy for our Mahalanobis metrics learned over bases of different dimensionality. Overall, our method (LogDet Linear) significantly outperforms existing methods. (c): Two-dimensional embedding of 2000 USPS digits obtained using our method Trace-SSIKDR for a training set of just 100 USPS digits. Note that we use the inductive setting here, and the embedding is color coded according to the underlying digit. (d): Embedding of the USPS digits data set obtained using kernel-PCA.

Figure 2 shows classification accuracy across bases of varying sizes for the Classic3 data set, along with the newsgroup data set. As baseline measures, the standard squared Euclidean distance is shown, along with Latent Semantic Analysis (LSA) (Deerwester et al., 1990), which works by projecting the data via principal components analysis (PCA) and computing distances in this projected space. Comparing our algorithm to the baseline Euclidean measure, we can see that for smaller bases the accuracy of our algorithm is similar to the Euclidean measure. As the size of the basis increases, our method obtains significantly higher accuracy than the baseline Euclidean measure.

6.3 Kernel Learning for Visual Object Recognition

Next we evaluate our method over high-dimensional data applied to the object-recognition task using the Caltech-101 Data Set (2004), a common benchmark for this task. The goal is to predict the category of the object in the given image using a k-NN classifier. We computed distances between images using learned kernels with three different base image kernels: 1) PMK: Grauman and Darrell's pyramid match kernel (Grauman and Darrell, 2007) applied to SIFT features; 2) CORR: the kernel designed by Zhang et al. (2006) applied to geometric blur features; and 3) SUM: the average of four image kernels, namely, PMK (Grauman and Darrell, 2007), Spatial PMK (Lazebnik et al., 2006), and two kernels obtained via geometric blur (Berg and Malik, 2001). Note that the underlying dimensionality of these embeddings is typically in the millions of dimensions.

We evaluate the effectiveness of metric/kernel learning on this data set. We pose a k-NN classification task, and evaluate both the original (SUM, PMK or CORR) and learned kernels. We set k = 1 for our experiments; this value was chosen arbitrarily. We vary the number of training examples T per class for the database, using the remainder as test examples, and measure accuracy in terms of the mean recognition rate per class, as is standard practice for this data set.
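Because the images are accessed only through kernel values, this evaluation protocol can be carried out entirely from precomputed kernel matrices via the standard identity d^2(x, y) = K(x, x) + K(y, y) - 2 K(x, y). The sketch below is our own illustration of the protocol (labels are assumed to be NumPy arrays, and all names are ours); it is not part of the learning algorithm itself.

    import numpy as np

    def mean_recognition_rate(K_test_train, k_test_diag, k_train_diag, y_train, y_test):
        """1-NN accuracy averaged over classes, computed from kernel values alone.

        K_test_train: (n_test, n_train) kernel values between test and training images
        k_test_diag : (n_test,) self-similarities K(x, x) for the test images
        k_train_diag: (n_train,) self-similarities K(x, x) for the training images
        """
        # Kernel-induced squared distances: d^2(x, y) = K(x, x) + K(y, y) - 2 K(x, y)
        d2 = k_test_diag[:, None] + k_train_diag[None, :] - 2.0 * K_test_train
        pred = y_train[np.argmin(d2, axis=1)]  # 1-NN prediction (k = 1, as in our experiments)
        # Average the per-class accuracies (mean recognition rate per class)
        rates = [np.mean(pred[y_test == cls] == cls) for cls in np.unique(y_test)]
        return float(np.mean(rates))

Under this protocol, the SUM base kernel would correspond to the element-wise average of the four individual base kernel matrices.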
[Figure 3 plots mean recognition rate per class versus the number of training examples per class. Panel (a), "Caltech 101: Comparison to Existing Methods," shows LogDet+SUM, LogDet+CORR, and LogDet+PMK alongside Frome et al. (ICCV07), Zhang et al. (CVPR06), Lazebnik et al. (CVPR06), Berg (thesis), Mutch & Lowe (CVPR06), Grauman & Darrell (ICCV 2005), Berg et al. (CVPR05), Wang et al. (CVPR06), Holub et al. (ICCV05), Serre et al. (CVPR05), Fei-Fei et al. (ICCV03), and an SSD baseline. Panel (b), "Caltech 101: Gains over Baseline," compares LogDet+Sum(PMK, SPMK, Geoblur), LogDet+Zhang et al. (CVPR06), and LogDet+PMK with NN+Sum(PMK, SPMK, Geoblur), NN+Zhang et al. (CVPR06), and NN+PMK.]

Figure 3: Results on Caltech-101. LogDet+SUM refers to our learned kernel when the base kernel is the average of four kernels (PMK, SPMK, Geoblur-1, Geoblur-2), LogDet+PMK refers to the learned kernel when the base kernel is the pyramid match kernel, and LogDet+CORR refers to the learned kernel when the base kernel is the correspondence kernel of Zhang et al. (2006). (a): Comparison of the LogDet based metric learning method with other state-of-the-art object recognition methods. Our method outperforms all other single metric/kernel approaches. (b): Our learned kernels significantly improve NN recognition accuracy relative to their non-learned counterparts, the SUM (average of four kernels), CORR, and PMK kernels.

Figure 3(a) shows our results relative to several other existing techniques that have been applied to this data set. Our approach outperforms all existing single-kernel classifier methods when using the learned CORR kernel: we achieve 61.0% accuracy for T = 15 and 69.6% accuracy for T = 30. Our learned PMK achieves 52.2% accuracy for T = 15 and 62.1% accuracy for T = 30. Similarly, our learned SUM kernel achieves 73.7% accuracy for T = 15. Figure 3(b) specifically shows the comparison of the original baseline kernels for NN classification. The plot reveals gains in 1-NN classification accuracy; notably, our learned kernels with simple NN classification also outperform the baseline kernels when used with SVMs (Zhang et al., 2006; Grauman and Darrell, 2007).

6.4 USPS Digits

Finally, we qualitatively evaluate our dimensionality reduction method (see Section 5.4) on the USPS digits data set. Here, we train our method using 100 examples to learn a mapping to two dimensions, that is, a rank-2 matrix W. For the baseline kernel, we use the data-dependent kernel function proposed by Sindhwani et al. (2005) that also accounts for the manifold structure of the data within the kernel function. We then embed 2000 (unseen) test examples into two dimensions using our learned low-rank transformation. Figure 2(c) shows the embedding obtained by our Trace-SSIKDR method, while Figure 2(d) shows the embedding obtained by the kernel-PCA algorithm. Each point is color coded according to the underlying digit. Note that our method is able to separate out seven of the digits reasonably well, while kernel-PCA is able to separate out only three of the digits.
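For reference, the kernel-PCA baseline shown in Figure 2(d) follows the standard recipe of centering the kernel matrix, taking the leading eigenvectors, and projecting the centered test kernel rows. The sketch below is that generic baseline recipe, not our Trace-SSIKDR method; the names are ours.

    import numpy as np

    def kernel_pca_embedding(K_train, K_test_train, dim=2):
        """Standard kernel PCA baseline: embed test points into `dim` dimensions.

        K_train     : (n, n) base kernel matrix on the training sample
        K_test_train: (m, n) kernel values between test and training points
        """
        n = K_train.shape[0]
        one_n = np.full((n, n), 1.0 / n)
        # Center the training kernel matrix in feature space
        Kc = K_train - one_n @ K_train - K_train @ one_n + one_n @ K_train @ one_n
        vals, vecs = np.linalg.eigh(Kc)
        top = np.argsort(vals)[::-1][:dim]  # leading eigenpairs
        alphas = vecs[:, top] / np.sqrt(np.maximum(vals[top], 1e-12))
        # Center the test kernel rows consistently, then project
        m = K_test_train.shape[0]
        one_mn = np.full((m, n), 1.0 / n)
        Kt = K_test_train - one_mn @ K_train - K_test_train @ one_n + one_mn @ K_train @ one_n
        return Kt @ alphas  # (m, dim) embedding of the test points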
7. Conclusions

In this paper, we considered the general problem of learning a linear transformation of the input data and applied it to the problems of metric and kernel learning, with a focus on establishing connections between the two problems. We showed that the LogDet divergence is a useful loss for learning a linear transformation over very high-dimensional data, as the algorithm can easily be generalized to work in kernel space, and we proposed an algorithm based on Bregman projections to learn a kernel function over the data points efficiently under this loss. We also showed that our learned metric can be restricted to a small dimensional basis efficiently, thereby permitting scalability of our method to large data sets with high-dimensional feature spaces. Then we considered how to generalize this result to a larger class of convex loss functions for learning the metric/kernel using a linear transformation of the data. We proved that many loss functions can lead to efficient kernel function learning, though the resulting optimizations may be more expensive to solve than the simpler LogDet formulation. A key consequence of our analysis is that a number of existing approaches for Mahalanobis metric learning may be applied in kernel space using our kernel learning formulation. Finally, we presented several experiments on benchmark data, high-dimensional vision and text classification problems, as well as a semi-supervised kernel dimensionality reduction problem, demonstrating our method compared to several existing state-of-the-art techniques.

There are several potential directions for future work. To facilitate even larger data sets than the ones considered in this paper, online learning methods are one promising research direction; in Jain et al. (2008), an online learning algorithm was proposed based on LogDet regularization, and this remains a part of our ongoing efforts. Recently, there has been some interest in learning multiple local metrics over the data; Weinberger and Saul (2008) considered this problem. We plan to explore this setting with the LogDet divergence, with a focus on scalability to very large data sets.

Acknowledgments

This research was supported by NSF grant CCF-0728879. We would like to acknowledge Suvrit Sra for various helpful discussions and the anonymous reviewers for various helpful suggestions.

References

A. Argyriou, C. A. Micchelli, and M. Pontil. On spectral learning. Journal of Machine Learning Research (JMLR), 11:935–953, 2010.
S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 227–236, 2007.
A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research (JMLR), 6:937–965, 2005.
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2004.
Y. Bengio, O. Delalleau, N. Le Roux, J. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219, 2004.
A. C. Berg and J. Malik. Geometric blur for template matching. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 607–614, 2001.
J. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. arXiv:0810.3286, 2008.
Caltech-101 Data Set. http://www.vision.caltech.edu/Image_Datasets/Caltech101/, 2004.
R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73(10–12):1570–1579, 2010.
S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
Classic3 Data Set. ftp.cs.cornell.edu/pub/smart, 2008.
CMU 20-Newsgroups Data Set. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html, 2008.
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2001.
J. V. Davis and I. S. Dhillon. Structured metric learning for high dimensional problems. In Proceedings of the ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), pages 195–203, 2008.
J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 209–216, 2007.
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
R. Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1), 1991.
A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2005.
J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood component analysis. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2004.
K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research (JMLR), 8:725–760, April 2007.
M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, 1988.
J. Ha, C. J. Rossbach, J. V. Davis, I. Roy, H. E. Ramadan, D. E. Porter, D. L. Chen, and E. Witchel. Improved error reporting for software that uses black-box components. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 101–111, 2007.
T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18:607–616, 1996.
P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 761–768, 2008.
P. Jain, B. Kulis, and I. S. Dhillon. Inductive regularized learning of kernel functions. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2010.
W. James and C. Stein. Estimation with quadratic loss. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. Univ. of California Press, 1961.
B. Kulis, M. Sustik, and I. S. Dhillon. Learning low-rank kernel matrices. In Proceedings of the International Conference on Machine Learning (ICML), pages 505–512, 2006.
B. Kulis, M. Sustik, and I. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 2008.
B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(12):2143–2157, 2009a.
B. Kulis, S. Sra, and I. S. Dhillon. Convex perturbations for scalable semidefinite programming. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009b.
J. T. Kwok and I. W. Tsang. Learning with idealized kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2003.
G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research (JMLR), 5:27–72, 2004.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178, 2006.
G. Lebanon. Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(4):497–508, 2006. ISSN 0162-8828.
M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research (JMLR), 6:1043–1071, 2005.
B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2003.
M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1233–1240, 2006.
S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the International Conference on Machine Learning (ICML), 2004.
V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 824–831, 2005.
K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research (JMLR), 6:995–1018, 2005.
M. K. Warmuth and D. Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287–2320, 2008.
K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2005.
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 505–512, 2002.
H. Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 2126–2136, 2006.
X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Proceedings of Advances in Neural Information Processing Systems (NIPS), volume 17, pages 1641–1648, 2005.