Journal of Machine Learning Research 5 (2004) 27-72. Submitted 10/02; Revised 8/03; Published 1/04.

Learning the Kernel Matrix with Semidefinite Programming

Gert R. G. Lanckriet (gert@eecs.berkeley.edu), Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
Nello Cristianini (nello@support-vector.net), Department of Statistics, University of California, Davis, CA 95616, USA
Peter Bartlett (bartlett@stat.berkeley.edu), Department of Electrical Engineering and Computer Science and Department of Statistics, Berkeley, CA 94720, USA
Laurent El Ghaoui (elghaoui@eecs.berkeley.edu), Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
Michael I. Jordan (jordan@stat.berkeley.edu), Department of Electrical Engineering and Computer Science and Department of Statistics, University of California, Berkeley, CA 94720, USA

Editor: Bernhard Schölkopf

Abstract

Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space: classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm: using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem.

Keywords: kernel methods, learning kernels, transduction, model selection, support vector machines, convex optimization, semidefinite programming

(c) 2004 Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui and Michael I. Jordan.
1. Introduction

Recent advances in kernel-based learning algorithms have brought the field of machine learning closer to the desirable goal of autonomy: the goal of providing learning systems that require as little intervention as possible on the part of a human user. In particular, kernel-based algorithms are generally formulated in terms of convex optimization problems, which have a single global optimum and thus do not require heuristic choices of learning rates, starting configurations or other free parameters. There are, of course, statistical model selection problems to be faced within the kernel approach; in particular, the choice of the kernel and the corresponding feature space are central choices that must generally be made by a human user. While this provides opportunities for prior knowledge to be brought to bear, it can also be difficult in practice to find prior justification for the use of one kernel instead of another. It would be desirable to explore model selection methods that allow kernels to be chosen in a more automatic way based on data.

It is important to observe that we do not necessarily need to choose a kernel function, specifying the inner product between the images of all possible data points when mapped from an input space $X$ to an appropriate feature space $F$. Since kernel-based learning methods extract all information needed from inner products of training data points in $F$, the values of the kernel function at pairs which are not present are irrelevant. So, there is no need to learn a kernel function over the entire sample space to specify the embedding of a finite training data set via a kernel function mapping. Instead, it is sufficient to specify a finite-dimensional kernel matrix (also known as a Gram matrix) that contains as its entries the inner products in $F$ between all pairs of data points. Note also that it is possible to show that any symmetric positive semidefinite matrix is a valid Gram matrix, based on an inner product in some Hilbert space. This suggests viewing the model selection problem in terms of Gram matrices rather than kernel functions.

In this paper our main focus is transduction: the problem of completing the labeling of a partially labeled dataset. In other words, we are required to make predictions only at a finite set of points, which are specified a priori. Thus, instead of learning a function, we only need to learn a set of labels. There are many practical problems in which this formulation is natural; an example is the prediction of gene function, where the genes of interest are specified a priori, but the function of many of these genes is unknown. We will address this problem by learning a kernel matrix corresponding to the entire dataset, a matrix that optimizes a certain cost function that depends on the available labels. In other words, we use the available labels to learn a good embedding, and we apply it to both the labeled and the unlabeled data. The resulting kernel matrix can then be used in combination with any of a number of existing learning algorithms that use kernels. One example that we discuss in detail is the support vector machine (SVM), where our methods yield a new transduction method for SVMs that scales polynomially with the number of test points. Furthermore, this approach will offer us a method to optimize the 2-norm soft margin parameter for these SVM learning algorithms, solving an important open problem. All this can be done in full generality by using techniques from semidefinite programming (SDP), a branch of convex optimization that deals with the optimization of convex functions over the convex cone of positive semidefinite matrices, or convex subsets thereof. Any convex set of kernel matrices is a set of this kind. Furthermore, it turns out that many natural cost functions, motivated by error bounds, are convex in the kernel matrix.

A second application of the ideas that we present here is to the problem of combining data from multiple sources. Specifically, assume that each source is associated with a kernel function, such that a training set yields a set of kernel matrices. The tools that we develop in this paper make it possible to optimize over the coefficients in a linear combination of such kernel matrices.
These coefficients can then be used to form linear combinations of kernel functions in the overall classifier. Thus this approach allows us to combine possibly heterogeneous data sources, making use of the reduction of heterogeneous data types to the common framework of kernel matrices, and choosing coefficients that emphasize those sources most useful in the classification decision.

In Section 2, we recall the main ideas from kernel-based learning algorithms, and introduce a variety of criteria that can be used to assess the suitability of a kernel matrix: the hard margin, the 1-norm and 2-norm soft margin, and the kernel alignment. Section 3 reviews the basic concepts of semidefinite programming. In Section 4 we put these ideas together and consider the optimization of the various criteria over sets of kernel matrices. For a set of linear combinations of fixed kernel matrices, these optimization problems reduce to SDP. If the linear coefficients are constrained to be positive, they can be simplified even further, yielding a quadratically-constrained quadratic program, a special case of the SDP framework. If the linear combination contains the identity matrix, we obtain a convex method for optimizing the 2-norm soft margin parameter in support vector machines. Section 5 presents statistical error bounds that motivate one of our cost functions. Empirical results are reported in Section 6.

Notation. Vectors are represented in bold notation, e.g., $\mathbf{v} \in \mathbb{R}^n$, and their scalar components in italic script, e.g., $v_1, v_2, \ldots, v_n$. Matrices are represented in italic script, e.g., $X \in \mathbb{R}^{m \times n}$. For a square, symmetric matrix $X$, $X \succeq 0$ means that $X$ is positive semidefinite, while $X \succ 0$ means that $X$ is positive definite. For a vector $\mathbf{v}$, the notations $\mathbf{v} \geq 0$ and $\mathbf{v} > 0$ are understood componentwise.

2. Kernel Methods

Kernel-based learning algorithms (see, for example, Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004) work by embedding the data into a Hilbert space, and searching for linear relations in such a space. The embedding is performed implicitly, by specifying the inner product between each pair of points rather than by giving their coordinates explicitly. This approach has several advantages, the most important deriving from the fact that the inner product in the embedding space can often be computed much more easily than the coordinates of the points themselves. Given an input set $X$ and an embedding space $F$, we consider a map $\Phi : X \to F$. Given two points $x_i \in X$ and $x_j \in X$, the function that returns the inner product between their images in the space $F$ is known as the kernel function.

Definition 1. A kernel is a function $k$ such that $k(x, z) = \langle \Phi(x), \Phi(z) \rangle$ for all $x, z \in X$, where $\Phi$ is a mapping from $X$ to an (inner product) feature space $F$. A kernel matrix is a square matrix $K \in \mathbb{R}^{n \times n}$ such that $K_{ij} = k(x_i, x_j)$ for some $x_1, \ldots, x_n \in X$ and some kernel function $k$.

The kernel matrix is also known as the Gram matrix. It is a symmetric, positive semidefinite matrix, and since it specifies the inner products between all pairs of points $\{x_i\}_{i=1}^n$, it completely determines the relative positions of those points in the embedding space. Since in this paper we will consider a finite input set $X$, we can characterize kernel functions and matrices in the following simple way.

Proposition 2. Every positive semidefinite and symmetric matrix is a kernel matrix. Conversely, every kernel matrix is symmetric and positive semidefinite.
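As a small illustration of Definition 1 and Proposition 2 (a sketch added here, not taken from the paper; the Gaussian kernel and the random data are assumptions), the following snippet builds a Gram matrix from a kernel function and checks the two properties of Proposition 2 numerically.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-0.5 * ||x - z||^2 / sigma)."""
    diff = x - z
    return np.exp(-0.5 * diff.dot(diff) / sigma)

def gram_matrix(X, kernel):
    """K_ij = k(x_i, x_j) for the points stored in the rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.random.randn(10, 3)                       # ten toy points in R^3
K = gram_matrix(X, gaussian_kernel)
assert np.allclose(K, K.T)                       # symmetric
assert np.min(np.linalg.eigvalsh(K)) >= -1e-10   # positive semidefinite (up to round-off)
```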
Notice that, if we have a kernel matrix, we do not need to know the kernel function, nor the implicitly defined map $\Phi$, nor the coordinates of the points $\Phi(x_i)$. We do not even need $X$ to be a vector space; in fact in this paper it will be a generic finite set. We are guaranteed that the data are implicitly mapped to some Hilbert space by simply checking that the kernel matrix is symmetric and positive semidefinite.

The solutions sought by kernel-based algorithms such as the support vector machine (SVM) are affine functions in the feature space: $f(x) = \langle \mathbf{w}, \Phi(x) \rangle + b$, for some weight vector $\mathbf{w} \in F$. The kernel can be exploited whenever the weight vector can be expressed as a linear combination of the training points, $\mathbf{w} = \sum_{i=1}^n \alpha_i \Phi(x_i)$, implying that we can express $f$ as
$$f(x) = \sum_{i=1}^n \alpha_i k(x_i, x) + b.$$
For example, for binary classification, we can use a thresholded version of $f(x)$, i.e., $\mathrm{sign}(f(x))$, as a decision function to classify unlabeled data. If $f(x)$ is positive, then we classify $x$ as belonging to class $+1$; otherwise, we classify $x$ as belonging to class $-1$. An important issue in applications is that of choosing a kernel $k$ for a given learning task; intuitively, we wish to choose a kernel that induces the "right" metric in the input space.

2.1 Criteria Used in Kernel Methods

Kernel methods choose a function that is linear in the feature space by optimizing some criterion over the sample. This section describes several such criteria (see, for example, Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). All of these criteria can be considered as measures of separation of the labeled data. We first consider the hard margin optimization problem.

Definition 3 (Hard Margin). Given a labeled sample $S_l = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the hyperplane $(\mathbf{w}^*, b^*)$ that solves the optimization problem
$$\min_{\mathbf{w}, b} \ \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{subject to} \quad y_i(\langle \mathbf{w}, \Phi(x_i) \rangle + b) \geq 1, \ i = 1, \ldots, n, \tag{1}$$
realizes the maximal margin classifier with geometric margin $\gamma = 1/\|\mathbf{w}^*\|_2$, assuming it exists.

Geometrically, $\gamma$ corresponds to the distance between the convex hulls (the smallest convex sets that contain the data in each class) of the two classes (Bennett and Bredensteiner, 2000). By transforming (1) into its corresponding Lagrangian dual problem, the solution is given by
$$\omega(K) = 1/\gamma^2 = \langle \mathbf{w}^*, \mathbf{w}^* \rangle = \max_{\alpha} \ 2\alpha^T e - \alpha^T G(K)\alpha \ : \ \alpha \geq 0, \ \alpha^T y = 0, \tag{2}$$
where $e$ is the $n$-vector of ones, $\alpha \in \mathbb{R}^n$, $G(K)$ is defined by $G_{ij}(K) = [K]_{ij} y_i y_j = k(x_i, x_j) y_i y_j$, and $\alpha \geq 0$ means $\alpha_i \geq 0$, $i = 1, \ldots, n$.

The hard margin solution exists only when the labeled sample is linearly separable in feature space. For a non-linearly-separable labeled sample $S_l$, we can define the soft margin. We consider the 1-norm and 2-norm soft margins.
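Once the kernel matrix is fixed, the dual (2) is an ordinary quadratic program. The following is a minimal sketch of evaluating $\omega(K) = 1/\gamma^2$ for a given $K$ and label vector, assuming the cvxpy modeling package (an assumption added for illustration; the paper itself uses SeDuMi/Mosek):

```python
import numpy as np
import cvxpy as cp

def omega_hard_margin(K, y):
    """Evaluate (2): max_alpha 2 e^T alpha - alpha^T G(K) alpha, alpha >= 0, alpha^T y = 0,
    with G(K) = diag(y) K diag(y). If the embedded data are not linearly
    separable the dual is unbounded and the solver reports +infinity."""
    n = len(y)
    G = np.diag(y) @ K @ np.diag(y)
    G = 0.5 * (G + G.T) + 1e-10 * np.eye(n)      # symmetrize and guard round-off for quad_form
    alpha = cp.Variable(n, nonneg=True)
    objective = cp.Maximize(2 * cp.sum(alpha) - cp.quad_form(alpha, G))
    prob = cp.Problem(objective, [y @ alpha == 0])
    prob.solve()
    return prob.value, alpha.value               # 1/gamma^2 and the dual variables alpha
```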
Definition 4 (1-norm Soft Margin). Given a labeled sample $S_l = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the hyperplane $(\mathbf{w}^*, b^*)$ that solves the optimization problem
$$\min_{\mathbf{w}, b, \xi} \ \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(\langle \mathbf{w}, \Phi(x_i) \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, \ldots, n, \tag{3}$$
realizes the 1-norm soft margin classifier with geometric margin $\gamma = 1/\|\mathbf{w}^*\|_2$. This margin is also called the 1-norm soft margin.

As for the hard margin, we can express the solution of (3) in a revealing way by considering the corresponding Lagrangian dual problem:
$$\omega_{S1}(K) = \langle \mathbf{w}^*, \mathbf{w}^* \rangle + C \sum_{i=1}^n \xi_{i,*} = \max_{\alpha} \ 2\alpha^T e - \alpha^T G(K)\alpha \ : \ C \geq \alpha \geq 0, \ \alpha^T y = 0. \tag{4}$$

Definition 5 (2-norm Soft Margin). Given a labeled sample $S_l = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the hyperplane $(\mathbf{w}^*, b^*)$ that solves the optimization problem
$$\min_{\mathbf{w}, b, \xi} \ \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^n \xi_i^2 \quad \text{subject to} \quad y_i(\langle \mathbf{w}, \Phi(x_i) \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, \ldots, n, \tag{5}$$
realizes the 2-norm soft margin classifier with geometric margin $\gamma = 1/\|\mathbf{w}^*\|_2$. This margin is also called the 2-norm soft margin.

Again, by considering the corresponding dual problem, the solution of (5) can be expressed as
$$\omega_{S2}(K) = \langle \mathbf{w}^*, \mathbf{w}^* \rangle + C \sum_{i=1}^n \xi_{i,*}^2 = \max_{\alpha} \ 2\alpha^T e - \alpha^T \left( G(K) + \tfrac{1}{C} I_n \right) \alpha \ : \ \alpha \geq 0, \ \alpha^T y = 0. \tag{6}$$

With a fixed kernel, all of these criteria give upper bounds on misclassification probability (see, for example, Chapter 4 of Cristianini and Shawe-Taylor, 2000). Solving these optimization problems for a single kernel matrix is therefore a way of optimizing an upper bound on error probability. In this paper, we allow the kernel matrix to be chosen from a class of kernel matrices. Previous error bounds are not applicable in this case. However, as we will see in Section 5, the margin $\gamma$ can be used to bound the performance of support vector machines for transduction, with a linearly parameterized class of kernels. We do not discuss further the merit of these different cost functions, deferring to the current literature on classification, where these cost functions are widely used with fixed kernels. Our goal is to show that these cost functions can be optimized, with respect to the kernel matrix, in an SDP setting.

Finally, we define the alignment of two kernel matrices (Cristianini et al., 2001, 2002). Given an (unlabeled) sample $S = \{x_1, \ldots, x_n\}$, we use the following (Frobenius) inner product between Gram matrices: $\langle K_1, K_2 \rangle_F = \mathrm{trace}(K_1^T K_2) = \sum_{i,j=1}^n k_1(x_i, x_j) k_2(x_i, x_j)$.

Definition 6 (Alignment). The (empirical) alignment of a kernel $k_1$ with a kernel $k_2$ with respect to the sample $S$ is the quantity
$$\hat{A}(S, k_1, k_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}},$$
where $K_i$ is the kernel matrix for the sample $S$ using kernel $k_i$.

This can also be viewed as the cosine of the angle between two bi-dimensional vectors $K_1$ and $K_2$ representing the Gram matrices. Notice that we do not need to know the labels for the sample $S$ in order to define the alignment of two kernels with respect to $S$. However, when the vector $y$ of $\{\pm 1\}$ labels for the sample is known, we can consider $K_2 = yy^T$, the optimal kernel, since $k_2(x_i, x_j) = 1$ if $y_i = y_j$ and $k_2(x_i, x_j) = -1$ if $y_i \neq y_j$. The alignment of a kernel $k$ with $k_2$ with respect to $S$ can be considered as a quality measure for $k$:
$$\hat{A}(S, K, yy^T) = \frac{\langle K, yy^T \rangle_F}{\sqrt{\langle K, K \rangle_F \langle yy^T, yy^T \rangle_F}} = \frac{\langle K, yy^T \rangle_F}{n \sqrt{\langle K, K \rangle_F}}, \tag{7}$$
since $\langle yy^T, yy^T \rangle_F = n^2$.
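A minimal sketch of Definition 6 and Equation (7) in code (numpy assumed; note that $\langle K, yy^T \rangle_F = y^T K y$, which the labeled version exploits):

```python
import numpy as np

def frobenius_inner(K1, K2):
    """<K1, K2>_F = trace(K1^T K2)."""
    return np.trace(K1.T @ K2)

def alignment(K1, K2):
    """Empirical alignment of Definition 6."""
    return frobenius_inner(K1, K2) / np.sqrt(frobenius_inner(K1, K1) * frobenius_inner(K2, K2))

def alignment_with_labels(K, y):
    """Equation (7): A_hat(S, K, y y^T) = <K, y y^T>_F / (n * sqrt(<K, K>_F))."""
    n = len(y)
    return (y @ K @ y) / (n * np.sqrt(frobenius_inner(K, K)))
```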
3. Semidefinite Programming (SDP)

In this section we review the basic definition of semidefinite programming as well as some important concepts and key results. Details and proofs can be found in Boyd and Vandenberghe (2003). Semidefinite programming (Nesterov and Nemirovsky, 1994; Vandenberghe and Boyd, 1996; Boyd and Vandenberghe, 2003) deals with the optimization of convex functions over the convex cone of symmetric, positive semidefinite matrices
$$\mathcal{P} = \left\{ X \in \mathbb{R}^{p \times p} \mid X = X^T, \ X \succeq 0 \right\},$$
or affine subsets of this cone. (A set $S \subseteq \mathbb{R}^d$ is a convex cone if and only if for all $x, y \in S$ and all $\lambda, \mu \geq 0$ we have $\lambda x + \mu y \in S$.) Given Proposition 2, $\mathcal{P}$ can be viewed as a search space for possible kernel matrices. This consideration leads to the key problem addressed in this paper: we wish to specify a convex cost function that will enable us to learn the optimal kernel matrix within $\mathcal{P}$ using semidefinite programming.

3.1 Definition of Semidefinite Programming

A linear matrix inequality, abbreviated LMI, is a constraint of the form
$$F(u) := F_0 + u_1 F_1 + \ldots + u_q F_q \preceq 0.$$
Here, $u$ is the vector of decision variables, and $F_0, \ldots, F_q$ are given symmetric $p \times p$ matrices. The notation $F(u) \preceq 0$ means that the symmetric matrix $F(u)$ is negative semidefinite. Note that such a constraint is in general a nonlinear constraint; the term "linear" in the name LMI merely emphasizes that $F$ is affine in $u$. Perhaps the most important feature of an LMI constraint is its convexity: the set of $u$ that satisfy the LMI is a convex set.

An LMI constraint can be seen as an infinite set of scalar, affine constraints. Indeed, for a given $u$, $F(u) \preceq 0$ if and only if $z^T F(u) z \leq 0$ for every $z$; every constraint indexed by $z$ is an affine inequality, in the ordinary sense, i.e., the left-hand side of the inequality is a scalar, composed of a linear term in $u$ and a constant term. Alternatively, using a standard result from linear algebra, we may state the constraint as
$$\forall Z \in \mathcal{P} : \ \mathrm{trace}(F(u)\, Z) \leq 0. \tag{8}$$
This can be seen by writing down the spectral decomposition of $Z$ and using the fact that $z^T F(u) z \leq 0$ for every $z$.

A semidefinite program (SDP) is an optimization problem with a linear objective, and linear matrix inequality and affine equality constraints.

Definition 7. A semidefinite program is a problem of the form
$$\min_u \ c^T u \quad \text{subject to} \quad F^j(u) = F^j_0 + u_1 F^j_1 + \ldots + u_q F^j_q \preceq 0, \ j = 1, \ldots, L, \quad A u = b, \tag{9}$$
where $u \in \mathbb{R}^q$ is the vector of decision variables, $c \in \mathbb{R}^q$ is the objective vector, and matrices $F^j_i = (F^j_i)^T \in \mathbb{R}^{p \times p}$ are given.

Given the convexity of its LMI constraints, SDPs are convex optimization problems. The usefulness of the SDP formalism stems from two important facts. First, despite the seemingly very specialized form of SDPs, they arise in a host of applications; second, there exist interior-point algorithms to solve SDPs that have good theoretical and practical computational efficiency (Vandenberghe and Boyd, 1996). One very useful tool to reduce a problem to an SDP is the so-called Schur complement lemma; it will be invoked repeatedly.

Lemma 8 (Schur Complement Lemma). Consider the partitioned symmetric matrix
$$X = X^T = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix},$$
where $A$, $C$ are square and symmetric. If $\det(A) \neq 0$, we define the Schur complement of $A$ in $X$ by the matrix $S = C - B^T A^{-1} B$. The Schur Complement Lemma states that if $A \succ 0$, then $X \succeq 0$ if and only if $S \succeq 0$.
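A quick numerical check of Lemma 8 (a sketch added here, not from the paper; the random data are toy assumptions): for a positive definite $A$, the block matrix and its Schur complement are positive semidefinite together.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Smallest eigenvalue test for positive semidefiniteness."""
    return np.min(np.linalg.eigvalsh(M)) >= -tol

rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3)); A = A @ A.T          # A > 0
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 2)); C = C @ C.T + 0.5 * np.eye(2)

X = np.block([[A, B], [B.T, C]])
S = C - B.T @ np.linalg.solve(A, B)        # Schur complement of A in X
print(is_psd(X), is_psd(S))                # the two flags agree, as the lemma predicts
```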
To illustrate how this lemma can be used to cast a nonlinear convex optimization problem as an SDP, consider the following result.

Lemma 9. The quadratically constrained quadratic program (QCQP)
$$\min_u \ f_0(u) \quad \text{subject to} \quad f_i(u) \leq 0, \ i = 1, \ldots, M, \tag{10}$$
with $f_i(u) := (A_i u + b_i)^T (A_i u + b_i) - c_i^T u - d_i$, is equivalent to the semidefinite programming problem
$$\min_{u, t} \ t \quad \text{subject to} \quad \begin{pmatrix} I & A_0 u + b_0 \\ (A_0 u + b_0)^T & c_0^T u + d_0 + t \end{pmatrix} \succeq 0, \quad \begin{pmatrix} I & A_i u + b_i \\ (A_i u + b_i)^T & c_i^T u + d_i \end{pmatrix} \succeq 0, \ i = 1, \ldots, M. \tag{11}$$

This can be seen by rewriting the QCQP (10) as
$$\min_{u, t} \ t \quad \text{subject to} \quad t - f_0(u) \geq 0, \quad -f_i(u) \geq 0, \ i = 1, \ldots, M.$$
Note that for a fixed and feasible $u$, $t = f_0(u)$ is the optimal solution. The convex quadratic inequality
$$t - f_0(u) = (t + c_0^T u + d_0) - (A_0 u + b_0)^T I^{-1} (A_0 u + b_0) \geq 0$$
is now equivalent to the following LMI, using the Schur Complement Lemma 8:
$$\begin{pmatrix} I & A_0 u + b_0 \\ (A_0 u + b_0)^T & c_0^T u + d_0 + t \end{pmatrix} \succeq 0.$$
Similar steps for the other quadratic inequality constraints finally yield (11), an SDP in standard form (9), equivalent to (10). This shows that a QCQP can be cast as an SDP. Of course, in practice a QCQP should not be solved using general-purpose SDP solvers, since the particular structure of the problem at hand can be efficiently exploited. The above shows that QCQPs, and in particular linear programming problems, belong to the SDP family.

3.2 Duality

An important principle in optimization, perhaps even the most important principle, is that of duality. To illustrate duality in the case of an SDP, we will first review basic concepts in duality theory and then show how they can be extended to semidefinite programming. In particular, duality will give insights into optimality conditions for the semidefinite program. Consider an optimization problem with $n$ variables and $m$ scalar constraints:
$$\min_u \ f_0(u) \quad \text{subject to} \quad f_i(u) \leq 0, \ i = 1, \ldots, m, \tag{12}$$
where $u \in \mathbb{R}^n$. In the context of duality, problem (12) is called the primal problem; we denote its optimal value $p^*$. For now, we do not assume convexity.

Definition 10 (Lagrangian). The Lagrangian $L : \mathbb{R}^{n+m} \to \mathbb{R}$ corresponding to the minimization problem (12) is defined as
$$L(u, \lambda) = f_0(u) + \lambda_1 f_1(u) + \ldots + \lambda_m f_m(u).$$
The $\lambda_i \in \mathbb{R}$, $i = 1, \ldots, m$, are called Lagrange multipliers or dual variables.

One can now notice that
$$h(u) = \max_{\lambda \geq 0} L(u, \lambda) = \begin{cases} f_0(u) & \text{if } f_i(u) \leq 0, \ i = 1, \ldots, m, \\ +\infty & \text{otherwise.} \end{cases}$$
So, the function $h(u)$ coincides with the objective $f_0(u)$ in regions where the constraints $f_i(u) \leq 0$, $i = 1, \ldots, m$, are satisfied, and $h(u) = +\infty$ in infeasible regions. In other words, $h$ acts as a "barrier" of the feasible set of the primal problem. Thus we can as well use $h(u)$ as objective function and rewrite the original primal problem (12) as an unconstrained optimization problem:
$$p^* = \min_u \max_{\lambda \geq 0} L(u, \lambda). \tag{13}$$
The notion of weak duality amounts to exchanging the "min" and "max" operators in the above formulation, resulting in a lower bound on the optimal value of the primal problem. Strong duality refers to the case when this exchange can be done without altering the value of the result: the lower bound is actually equal to the optimal value $p^*$. While weak duality always holds, even if the primal problem (13) is not convex, strong duality may not hold. However, for a large class of generic convex problems, strong duality holds.

Lemma 11 (Weak duality). For all functions $f_0, f_1, \ldots, f_m$ in (12), not necessarily convex, we can exchange the max and the min and get a lower bound on $p^*$:
$$d^* = \max_{\lambda \geq 0} \min_u L(u, \lambda) \leq \min_u \max_{\lambda \geq 0} L(u, \lambda) = p^*.$$
The objective function of the maximization problem is now called the (Lagrange) dual function.

Definition 12 ((Lagrange) dual function). The (Lagrange) dual function $g : \mathbb{R}^m \to \mathbb{R}$ is defined as
$$g(\lambda) = \min_u L(u, \lambda) = \min_u \ f_0(u) + \lambda_1 f_1(u) + \ldots + \lambda_m f_m(u). \tag{14}$$
Furthermore $g(\lambda)$ is concave, even if the $f_i(u)$ are not convex.

The concavity can easily be seen by considering first that for a given $u$, $L(u, \lambda)$ is an affine function of $\lambda$ and hence is a concave function. Since $g(\lambda)$ is the pointwise minimum of such concave functions, it is concave.

Definition 13 (Lagrange dual problem). The Lagrange dual problem is defined as
$$d^* = \max_{\lambda \geq 0} g(\lambda).$$
Since $g(\lambda)$ is concave, this will always be a convex optimization problem, even if the primal is not. By weak duality, we always have $d^* \leq p^*$, even for non-convex problems. The value $p^* - d^*$ is called the duality gap. For convex problems, we usually (although not always) have strong duality at the optimum, i.e., $d^* = p^*$, which is also referred to as a zero duality gap.
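As a small worked instance of Definitions 10 through 13 (an illustration added here, not taken from the paper), consider the one-dimensional convex problem
$$p^* = \min_u \ u^2 \quad \text{subject to} \quad 1 - u \leq 0.$$
Its Lagrangian is $L(u, \lambda) = u^2 + \lambda(1 - u)$ with $\lambda \geq 0$; minimizing over $u$ gives $u = \lambda/2$, so the dual function is $g(\lambda) = \lambda - \lambda^2/4$. The dual problem $\max_{\lambda \geq 0} g(\lambda)$ is solved by $\lambda^* = 2$ with $d^* = 1$, which equals the primal optimum $p^* = 1$ attained at $u^* = 1$. The problem is convex and strictly feasible (e.g., $u_0 = 2$), so the duality gap is zero, exactly as the condition stated next guarantees.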
For convex problems, a sufficient condition for a zero duality gap is provided by Slater's condition.

Lemma 14 (Slater's condition). If the primal problem (12) is convex and is strictly feasible, i.e., $\exists u_0 : f_i(u_0) < 0$, $i = 1, \ldots, m$, then $p^* = d^*$.

3.3 SDP Duality and Optimality Conditions

Consider for simplicity the case of an SDP with a single LMI constraint, and no affine equalities:
$$p^* = \min_u \ c^T u \quad \text{subject to} \quad F(u) = F_0 + u_1 F_1 + \ldots + u_q F_q \preceq 0. \tag{15}$$
The general case of multiple LMI constraints and affine equalities can be handled by elimination of the latter and using block-diagonal matrices to represent the former as a single LMI.

The classical Lagrange duality theory outlined in the previous section does not directly apply here, since we are not dealing with finitely many constraints in scalar form; as noted earlier, the LMI constraint involves an infinite number of such constraints, of the form (8). One way to handle such constraints is to introduce a Lagrangian of the form
$$L(u, Z) = c^T u + \mathrm{trace}(Z F(u)),$$
where the dual variable $Z$ is now a symmetric matrix, of the same size as $F(u)$. We can check that such a Lagrange function fulfills the same role assigned to the function defined in Definition 10 for the case with scalar constraints. Indeed, if we define $h(u) = \max_{Z \succeq 0} L(u, Z)$ then
$$h(u) = \max_{Z \succeq 0} L(u, Z) = \begin{cases} c^T u & \text{if } F(u) \preceq 0, \\ +\infty & \text{otherwise.} \end{cases}$$
Thus, $h(u)$ is a barrier for the primal SDP (15), that is, it coincides with the objective of (15) on its feasible set, and is infinite otherwise. Notice that to the LMI constraint we now associate a multiplier matrix, which will be constrained to the positive semidefinite cone. In the above, we made use of the fact that, for a given symmetric matrix $F$,
$$\phi(F) := \sup_{Z \succeq 0} \mathrm{trace}(Z F)$$
is $+\infty$ if $F$ has a positive eigenvalue, and zero if $F$ is negative semidefinite. This property is obvious for diagonal matrices, since in that case the variable $Z$ can be constrained to be diagonal without loss of generality. The general case follows from the fact that if $F$ has the eigenvalue decomposition $F = U \Lambda U^T$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of $F$ and $U$ is orthogonal, then $\mathrm{trace}(Z F) = \mathrm{trace}(Z' \Lambda)$, where $Z' = U^T Z U$ spans the positive semidefinite cone whenever $Z$ does.

Using the above Lagrangian, one can cast the original problem (15) as an unconstrained optimization problem:
$$p^* = \min_u \max_{Z \succeq 0} L(u, Z).$$
By weak duality, we obtain a lower bound on $p^*$ by exchanging the min and max:
$$d^* = \max_{Z \succeq 0} \min_u L(u, Z) \leq \min_u \max_{Z \succeq 0} L(u, Z) = p^*.$$
The inner minimization problem is easily solved analytically, due to the special structure of the SDP. We obtain a closed form for the (Lagrange) dual function:
$$g(Z) = \min_u L(u, Z) = \min_u \ c^T u + \mathrm{trace}(Z F_0) + \sum_{i=1}^q u_i \, \mathrm{trace}(Z F_i) = \begin{cases} \mathrm{trace}(Z F_0) & \text{if } c_i = -\mathrm{trace}(Z F_i), \ i = 1, \ldots, q, \\ -\infty & \text{otherwise.} \end{cases}$$
The dual problem can be explicitly stated as follows:
$$d^* = \max_{Z \succeq 0} \min_u L(u, Z) = \max_Z \ \mathrm{trace}(Z F_0) \quad \text{subject to} \quad Z \succeq 0, \ c_i = -\mathrm{trace}(Z F_i), \ i = 1, \ldots, q. \tag{16}$$
We observe that the above problem is an SDP, with a single LMI constraint and $q$ affine equalities in the matrix dual variable $Z$.

While weak duality always holds, strong duality may not, even for SDPs. Not surprisingly, a Slater-type condition ensures strong duality. Precisely, if the primal SDP (15) is strictly feasible, that is, there exists a $u_0$ such that $F(u_0) \prec 0$, then $p^* = d^*$. If, in addition, the dual problem is also strictly feasible, meaning that there exists a $Z \succ 0$ such that $c_i = -\mathrm{trace}(Z F_i)$, $i = 1, \ldots, q$, then both primal and dual optimal values are attained by some optimal pair $(u^*, Z^*)$. In that case, we can characterize such optimal pairs as follows. In view of the equality constraints of the dual problem, the duality gap can be expressed as
$$p^* - d^* = c^T u^* - \mathrm{trace}(Z^* F_0) = -\mathrm{trace}(Z^* F(u^*)).$$
A zero duality gap is equivalent to $\mathrm{trace}(Z^* F(u^*)) = 0$, which in turn is equivalent to $Z^* F(u^*) = O$, where $O$ denotes the zero matrix, since the product of a positive semidefinite and a negative semidefinite matrix has zero trace if and only if it is zero.

To summarize, consider the SDP (15) and its Lagrange dual (16). If either problem is strictly feasible, then they share the same optimal value. If both problems are strictly feasible, then the optimal values of both problems are attained and coincide. In this case, a primal-dual pair $(u^*, Z^*)$ is optimal if and only if
$$F(u^*) \preceq 0, \quad Z^* \succeq 0, \quad c_i = -\mathrm{trace}(Z^* F_i), \ i = 1, \ldots, q, \quad Z^* F(u^*) = O.$$
The above conditions represent the expression of the general Karush-Kuhn-Tucker (KKT) conditions in the semidefinite programming setting. The first three sets of conditions express that $u^*$ and $Z^*$ are feasible for their respective problems; the last condition expresses a complementarity condition.

For a pair of strictly feasible primal-dual SDPs, solving the primal minimization problem is equivalent to maximizing the dual problem and both can thus be considered simultaneously. Algorithms indeed make use of this relationship and use the duality gap as a stopping criterion. A general-purpose program such as SeDuMi (Sturm, 1999) handles those problems efficiently. This code uses interior-point methods for SDP (Nesterov and Nemirovsky, 1994); these methods have a worst-case complexity of $O(q^2 p^{2.5})$ for the general problem (15). In practice, problem structure can be exploited for great computational savings: e.g., when $F(u) \in \mathbb{R}^{p \times p}$ consists of $L$ diagonal blocks of size $p_i$, $i = 1, \ldots, L$, these methods have a worst-case complexity of $O(q^2 (\sum_{i=1}^L p_i^2) p^{0.5})$ (Vandenberghe and Boyd, 1996).

4. Algorithms for Learning Kernels

We work in a transduction setting, where some of the data (the training set $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$) are labeled, and the remainder (the test set $T_{n_t} = \{x_{n_{tr}+1}, \ldots, x_{n_{tr}+n_t}\}$) are unlabeled, and the aim is to predict the labels of the test data.
In this setting, optimizing the kernel corresponds to choosing a kernel matrix. This matrix has the form
$$K = \begin{pmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^T & K_t \end{pmatrix}, \tag{17}$$
where $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$, $i, j = 1, \ldots, n_{tr}, n_{tr}+1, \ldots, n_{tr}+n_t$. By optimizing a cost function over the "training-data block" $K_{tr}$, we want to learn the optimal mixed block $K_{tr,t}$ and the optimal "test-data block" $K_t$. This implies that training and test-data blocks must somehow be entangled: tuning training-data entries in $K$ (to optimize their embedding) should imply that test-data entries are automatically tuned in some way as well. This can be achieved by constraining the search space of possible kernel matrices: we control the capacity of the search space of possible kernel matrices in order to prevent overfitting and achieve good generalization on test data.

We first consider a general optimization problem in which the kernel matrix $K$ is restricted to a convex subset $\mathcal{K}$ of $\mathcal{P}$, the positive semidefinite cone. We then consider two specific examples. The first is the set of positive semidefinite matrices with bounded trace that can be expressed as a linear combination of kernel matrices from the set $\{K_1, \ldots, K_m\}$. That is, $\mathcal{K}$ is the set of matrices $K$ satisfying
$$K = \sum_{i=1}^m \mu_i K_i, \quad K \succeq 0, \quad \mathrm{trace}(K) \leq c. \tag{18}$$
In this case, the set $\mathcal{K}$ lies in the intersection of a low-dimensional linear subspace with the positive semidefinite cone $\mathcal{P}$. Geometrically this can be viewed as computing all embeddings (for every $K_i$), in disjoint feature spaces, and then weighting these. The set $\{K_1, \ldots, K_m\}$ could be a set of initial "guesses" of the kernel matrix, e.g., linear, Gaussian or polynomial kernels with different kernel parameter values. Instead of fine-tuning the kernel parameter for a given kernel using cross-validation, one can now evaluate the given kernel for a range of kernel parameters and then optimize the weights in the linear combination of the obtained kernel matrices. Alternatively, the $K_i$ could be chosen as the rank-one matrices $K_i = v_i v_i^T$, with $v_i$ a subset of the eigenvectors of $K_0$, an initial kernel matrix, or with $v_i$ some other set of orthogonal vectors. A practically important form is the case in which a diverse set of possibly good Gram matrices $K_i$ (similarity measures/representations) has been constructed, e.g., using heterogeneous data sources. The challenge is to combine these measures into one optimal similarity measure (embedding), to be used for learning.

The second example of a restricted set $\mathcal{K}$ of kernels is the set of positive semidefinite matrices with bounded trace that can be expressed as a linear combination of kernel matrices from the set $\{K_1, \ldots, K_m\}$, but with the parameters $\mu_i$ constrained to be non-negative. That is, $\mathcal{K}$ is the set of matrices $K$ satisfying
$$K = \sum_{i=1}^m \mu_i K_i, \quad \mu_i \geq 0, \ i \in \{1, \ldots, m\}, \quad K \succeq 0, \quad \mathrm{trace}(K) \leq c.$$
This further constrains the class of functions that can be represented. It has two advantages: we shall see that the corresponding optimization problem has significantly reduced computational complexity, and it is more convenient for studying the statistical properties of a class of kernel matrices.
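A minimal sketch of the bookkeeping behind (17) and (18) (numpy assumed; the helper names are illustrative): the candidate matrices $K_i$ are built over the full training plus test sample, combined with weights, rescaled to satisfy the trace constraint, and then partitioned into the blocks of (17).

```python
import numpy as np

def combine_kernels(K_list, mu, c):
    """Form K = sum_i mu_i K_i over the full (training + test) sample and
    rescale so that trace(K) = c, matching the constraint set (18)."""
    K = sum(m * Ki for m, Ki in zip(mu, K_list))
    return c * K / np.trace(K)

def split_blocks(K, n_tr):
    """Partition K as in (17) into training, mixed and test blocks."""
    return K[:n_tr, :n_tr], K[:n_tr, n_tr:], K[n_tr:, n_tr:]
```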
As we will see in Section 5, we can estimate the performance of support vector machines for transduction using properties of the class $\mathcal{K}$. As explained in Section 2, we can use a thresholded version of $f(x)$, i.e., $\mathrm{sign}(f(x))$, as a binary classification decision. Using this decision function, we will prove that the proportion of errors on the test data $T_n$ (where, for convenience, we suppose that training and test data have the same size $n_{tr} = n_t = n$) is, with probability $1 - \delta$ (over the random draw of the training set $S_n$ and test set $T_n$), bounded by
$$\frac{1}{n} \sum_{i=1}^n \max\{1 - y_i f(x_i), 0\} + \frac{1}{\sqrt{n}} \left( 4 + \sqrt{2 \log(1/\delta)} + \sqrt{\frac{C(\mathcal{K})}{n \gamma^2}} \right), \tag{19}$$
where $\gamma$ is the 1-norm soft margin on the data and $C(\mathcal{K})$ is a certain measure of the complexity of the kernel class $\mathcal{K}$. For instance, for the class $\mathcal{K}$ of positive linear combinations defined above, $C(\mathcal{K}) \leq m c$, where $m$ is the number of kernel matrices in the combination and $c$ is the bound on the trace. So, the proportion of errors on the test data is bounded by the average error on the training set and a complexity term, determined by the richness of the class $\mathcal{K}$ and the margin $\gamma$. Good generalization can thus be expected if the error on the training set is small, while having a large margin and a class $\mathcal{K}$ that is not too rich.

The next section presents the main optimization result of the paper: minimizing a generalized performance measure $\omega_{C,\tau}(K)$ with respect to the kernel matrix $K$ can be realized in a semidefinite programming framework. Afterwards, we prove a second general result showing that minimizing $\omega_{C,\tau}(K)$ with respect to a kernel matrix $K$, constrained to the linear subspace $K = \sum_{i=1}^m \mu_i K_i$ with $\mu \geq 0$, leads to a quadratically constrained quadratic programming (QCQP) problem. Maximizing the margin of a hard margin SVM with respect to $K$, as well as both soft margin cases, can then be treated as specific instances of this general result and will be discussed in later sections.

4.1 General Optimization Result

In this section, we first of all show that minimizing the generalized performance measure
$$\omega_{C,\tau}(K) = \max_{\alpha} \ 2\alpha^T e - \alpha^T (G(K) + \tau I)\alpha \ : \ C \geq \alpha \geq 0, \ \alpha^T y = 0, \tag{20}$$
with $\tau \geq 0$, on the training data with respect to the kernel matrix $K$, in some convex subset $\mathcal{K}$ of positive semidefinite matrices with trace equal to $c$,
$$\min_{K \in \mathcal{K}} \ \omega_{C,\tau}(K_{tr}) \quad \text{s.t.} \quad \mathrm{trace}(K) = c, \tag{21}$$
can be realized in a semidefinite programming framework. We first note a fundamental property of the generalized performance measure, a property that is crucial for the remainder of the paper.

Proposition 15. The quantity $\omega_{C,\tau}(K) = \max_{\alpha} \ 2\alpha^T e - \alpha^T (G(K) + \tau I)\alpha \ : \ C \geq \alpha \geq 0, \ \alpha^T y = 0$, is convex in $K$.
This is easily seen by considering first that $2\alpha^T e - \alpha^T (G(K) + \tau I)\alpha$ is an affine function of $K$, and hence is a convex function as well. Secondly, we notice that $\omega_{C,\tau}(K)$ is the pointwise maximum of such convex functions and is thus convex. The constraints $C \geq \alpha \geq 0$, $\alpha^T y = 0$ are obviously convex. Problem (21) is now a convex optimization problem. The following theorem shows that, for a suitable choice of the set $\mathcal{K}$, this problem can be cast as an SDP.

Theorem 16. Given a labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K \in \mathcal{K}$ that optimizes (21), with $\tau \geq 0$, can be found by solving the following convex optimization problem:
$$\begin{aligned} \min_{K, t, \lambda, \nu, \delta} \ & t \\ \text{subject to} \ & \mathrm{trace}(K) = c, \quad K \in \mathcal{K}, \\ & \begin{pmatrix} G(K_{tr}) + \tau I_{n_{tr}} & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0, \quad \nu \geq 0, \ \delta \geq 0. \end{aligned} \tag{22}$$

Proof. We begin by substituting $\omega_{C,\tau}(K_{tr})$, as defined in (20), into (21), which yields
$$\min_{K \in \mathcal{K}} \max_{\alpha} \ 2\alpha^T e - \alpha^T (G(K_{tr}) + \tau I_{n_{tr}})\alpha \ : \ C \geq \alpha \geq 0, \ \alpha^T y = 0, \ \mathrm{trace}(K) = c, \tag{23}$$
with $c$ a constant. Assume that $K_{tr} \succ 0$, hence $G(K_{tr}) \succ 0$ and $G(K_{tr}) + \tau I_{n_{tr}} \succ 0$ since $\tau \geq 0$ (the following can be extended to the general semidefinite case). From Proposition 15, we know that $\omega_{C,\tau}(K_{tr})$ is convex in $K_{tr}$ and thus in $K$. Given the convex constraints in (23), the optimization problem is thus certainly convex in $K$. We write this as
$$\min_{K \in \mathcal{K}, t} \ t \ : \ t \geq \max_{\alpha} \ 2\alpha^T e - \alpha^T (G(K_{tr}) + \tau I_{n_{tr}})\alpha, \quad C \geq \alpha \geq 0, \ \alpha^T y = 0, \quad \mathrm{trace}(K) = c. \tag{24}$$
We now express the constraint $t \geq \max_{\alpha} 2\alpha^T e - \alpha^T (G(K_{tr}) + \tau I_{n_{tr}})\alpha$ as an LMI using duality. In particular, duality will allow us to drop the maximization and the Schur complement lemma then yields an LMI. Define the Lagrangian of the maximization problem (20) by
$$L(\alpha, \nu, \lambda, \delta) = 2\alpha^T e - \alpha^T (G(K_{tr}) + \tau I_{n_{tr}})\alpha + 2\nu^T \alpha + 2\lambda y^T \alpha + 2\delta^T (Ce - \alpha),$$
where $\lambda \in \mathbb{R}$ and $\nu, \delta \in \mathbb{R}^{n_{tr}}$. By duality, we have
$$\omega_{C,\tau}(K_{tr}) = \max_{\alpha} \min_{\nu \geq 0, \delta \geq 0, \lambda} L(\alpha, \nu, \lambda, \delta) = \min_{\nu \geq 0, \delta \geq 0, \lambda} \max_{\alpha} L(\alpha, \nu, \lambda, \delta).$$
Since $G(K_{tr}) + \tau I_{n_{tr}} \succ 0$, at the optimum we have
$$\alpha = (G(K_{tr}) + \tau I_{n_{tr}})^{-1} (e + \nu - \delta + \lambda y),$$
and we can form the dual problem
$$\omega_{C,\tau}(K_{tr}) = \min_{\nu, \delta, \lambda} \ (e + \nu - \delta + \lambda y)^T (G(K_{tr}) + \tau I_{n_{tr}})^{-1} (e + \nu - \delta + \lambda y) + 2C\delta^T e \ : \ \nu \geq 0, \ \delta \geq 0.$$
This implies that for any $t > 0$, the constraint $\omega_{C,\tau}(K_{tr}) \leq t$ holds if and only if there exist $\nu \geq 0$, $\delta \geq 0$ and $\lambda$ such that
$$(e + \nu - \delta + \lambda y)^T (G(K_{tr}) + \tau I_{n_{tr}})^{-1} (e + \nu - \delta + \lambda y) + 2C\delta^T e \leq t,$$
or, equivalently (using the Schur complement lemma), such that
$$\begin{pmatrix} G(K_{tr}) + \tau I_{n_{tr}} & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0$$
holds. Taking this into account, (24) can be expressed as (22). Notice that $\nu \geq 0$ is equivalent to $\mathrm{diag}(\nu) \succeq 0$, and is thus an LMI; similarly for $\delta \geq 0$.

Notice that if $\mathcal{K} = \{K \succeq 0\}$, this optimization problem is an SDP in the standard form (9). Of course, in that case there is no constraint to ensure entanglement of training and test-data blocks. Indeed, it is easy to see that the criterion would be optimized with a test matrix $K_t = O$. Consider the constraint $\mathcal{K} = \mathrm{span}\{K_1, \ldots, K_m\} \cap \{K \succeq 0\}$. We obtain the following convex optimization problem:
$$\min_K \ \omega_{C,\tau}(K_{tr}) \quad \text{subject to} \quad \mathrm{trace}(K) = c, \ K \succeq 0, \ K = \sum_{i=1}^m \mu_i K_i, \tag{25}$$
which can be written in the standard form of a semidefinite program, in a manner analogous to (22):
$$\begin{aligned} \min_{\mu, t, \lambda, \nu, \delta} \ & t \\ \text{subject to} \ & \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i\Big) = c, \quad \sum_{i=1}^m \mu_i K_i \succeq 0, \\ & \begin{pmatrix} G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) + \tau I_{n_{tr}} & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0, \quad \nu \geq 0, \ \delta \geq 0. \end{aligned} \tag{26}$$
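A sketch of how (26) could be handed to a modern conic solver through cvxpy (an assumption added for illustration; the paper itself formulates the problem for SeDuMi). To keep the linear matrix inequality unambiguous for the modeling layer, the bordered matrix is introduced as an explicit symmetric variable tied to its blocks by equality constraints; this is a modeling convenience, not part of the paper's formulation.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_sdp(K_list, y, c, C, tau):
    """Sketch of the SDP (26): learn weights mu (free sign) for K = sum_i mu_i K_i.
    K_list holds the full (n x n) kernel matrices; y holds the n_tr training labels."""
    m, n = len(K_list), K_list[0].shape[0]
    n_tr = len(y)
    e = np.ones(n_tr)
    G_list = [np.diag(y) @ Ki[:n_tr, :n_tr] @ np.diag(y) for Ki in K_list]

    mu = cp.Variable(m)
    t = cp.Variable()
    lam = cp.Variable()
    nu = cp.Variable(n_tr, nonneg=True)
    delta = cp.Variable(n_tr, nonneg=True)

    K = sum(mu[i] * K_list[i] for i in range(m))                 # full combined kernel
    G = sum(mu[i] * G_list[i] for i in range(m)) + tau * np.eye(n_tr)
    v = e + nu - delta + lam * y

    # Symmetric variables standing for sum_i mu_i K_i and for the bordered LMI matrix.
    Kfull = cp.Variable((n, n), symmetric=True)
    M = cp.Variable((n_tr + 1, n_tr + 1), symmetric=True)
    constraints = [
        Kfull == K, Kfull >> 0,                                  # sum_i mu_i K_i is PSD
        cp.trace(Kfull) == c,                                    # trace constraint
        M[:n_tr, :n_tr] == G,                                    # top-left block of the LMI
        M[:n_tr, n_tr] == v,                                     # border vector
        M[n_tr, n_tr] == t - 2 * C * cp.sum(delta),              # bottom-right scalar
        M >> 0,
    ]
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve()
    return mu.value, prob.value
```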
To solve this general optimization problem, one has to solve a semidefinite programming problem. General-purpose programs such as SeDuMi (Sturm, 1999) use interior-point methods to solve SDP problems (Nesterov and Nemirovsky, 1994). These methods are polynomial time. However, applying the complexity results mentioned in Section 3.3 leads to a worst-case complexity $O\big((m + n_{tr})^2 (n^2 + n_{tr}^2)(n + n_{tr})^{0.5}\big)$, or roughly $O\big((m + n_{tr})^2 n^{2.5}\big)$, in this particular case.

Consider a further restriction on the set of kernel matrices, where the matrices are restricted to positive linear combinations of kernel matrices $\{K_1, \ldots, K_m\} \cap \{K \succeq 0\}$:
$$K = \sum_{i=1}^m \mu_i K_i, \quad \mu \geq 0.$$
For this restricted linear subspace of the positive semidefinite cone $\mathcal{P}$, we can prove the following theorem.

Theorem 17. Given a labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K = \sum_{i=1}^m \mu_i K_i$ that optimizes (21), with $\tau \geq 0$, under the additional constraint $\mu \geq 0$ can be found by solving the following convex optimization problem, and considering its dual solution:
$$\begin{aligned} \max_{\alpha, t} \ & 2\alpha^T e - \tau \alpha^T \alpha - c t \\ \text{subject to} \ & t \geq \frac{1}{r_i} \alpha^T G(K_{i,tr}) \alpha, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \quad C \geq \alpha \geq 0, \end{aligned} \tag{27}$$
where $r \in \mathbb{R}^m$ with $\mathrm{trace}(K_i) = r_i$.

Proof. Solving problem (21) subject to $K = \sum_{i=1}^m \mu_i K_i$, with $K_i \succeq 0$, and the extra constraint $\mu \geq 0$ yields
$$\min_K \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ 2\alpha^T e - \alpha^T (G(K_{tr}) + \tau I_{n_{tr}})\alpha \quad \text{subject to} \quad \mathrm{trace}(K) = c, \ K \succeq 0, \ K = \sum_{i=1}^m \mu_i K_i, \ \mu \geq 0,$$
when $\omega_{C,\tau}(K_{tr})$ is expressed using (20). We can omit the second constraint, because this is implied by the last two constraints, if $K_i \succeq 0$. The problem then reduces to
$$\min_{\mu} \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ 2\alpha^T e - \alpha^T \Big( G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) + \tau I_{n_{tr}} \Big) \alpha \quad \text{subject to} \quad \mu^T r = c, \ \mu \geq 0,$$
where $K_{i,tr} = K_i(1:n_{tr}, 1:n_{tr})$. We can write this as
$$\begin{aligned} & \min_{\mu : \mu \geq 0, \, \mu^T r = c} \ \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ 2\alpha^T e - \alpha^T \Big( \mathrm{diag}(y)\big(\sum_{i=1}^m \mu_i K_{i,tr}\big)\mathrm{diag}(y) + \tau I_{n_{tr}} \Big)\alpha \\ = \ & \min_{\mu : \mu \geq 0, \, \mu^T r = c} \ \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ 2\alpha^T e - \sum_{i=1}^m \mu_i \, \alpha^T G(K_{i,tr})\alpha - \tau \alpha^T \alpha \\ = \ & \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ \min_{\mu : \mu \geq 0, \, \mu^T r = c} \ 2\alpha^T e - \sum_{i=1}^m \mu_i \, \alpha^T G(K_{i,tr})\alpha - \tau \alpha^T \alpha, \end{aligned}$$
with $G(K_{i,tr}) = \mathrm{diag}(y) K_{i,tr} \mathrm{diag}(y)$. The interchange of the order of the minimization and the maximization is justified (see, e.g., Boyd and Vandenberghe, 2003) because the objective is convex in $\mu$ (it is linear in $\mu$) and concave in $\alpha$, because the minimization problem is strictly feasible in $\mu$, and the maximization problem is strictly feasible in $\alpha$ (we can skip the case of all elements of $y$ having the same sign, because we cannot even define a margin in such a case). We thus obtain
$$\begin{aligned} & \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ \Big[ 2\alpha^T e - \tau\alpha^T\alpha - \max_{\mu : \mu \geq 0, \, \mu^T r = c} \Big( \sum_{i=1}^m \mu_i \, \alpha^T G(K_{i,tr})\alpha \Big) \Big] \\ = \ & \max_{\alpha : C \geq \alpha \geq 0, \, \alpha^T y = 0} \ \Big[ 2\alpha^T e - \tau\alpha^T\alpha - \max_i \Big( \frac{c}{r_i} \alpha^T G(K_{i,tr})\alpha \Big) \Big]. \end{aligned}$$
Finally, this can be reformulated as (27), which proves the theorem.

This convex optimization problem, a QCQP more precisely, is a special instance of an SOCP (second-order cone programming problem), which is in turn a special form of SDP (Boyd and Vandenberghe, 2003). SOCPs can be solved efficiently with programs such as SeDuMi (Sturm, 1999) or Mosek (Andersen and Andersen, 2000). These codes use interior-point methods (Nesterov and Nemirovsky, 1994), which yield a worst-case complexity of $O(m n^3)$. This implies a major improvement compared to the worst-case complexity of a general SDP. Furthermore, the codes simultaneously solve the above problem and its dual form. They thus return optimal values for the dual variables as well; this allows us to obtain the optimal weights $\mu_i$, for $i = 1, \ldots, m$.
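A corresponding sketch of the QCQP (27), again with cvxpy standing in for the SOCP solvers named above (an assumption). Following the proof, the weights $\mu_i$ can be read off from the dual variables of the $m$ quadratic constraints; the exact rescaling used below ($\mu_i$ equal to the dual value divided by $r_i$, so that $\sum_i \mu_i r_i = c$) is an inference from that derivation and from the solver's sign conventions, not a formula stated in the paper.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_qcqp(K_list, y, c, C, tau):
    """Sketch of the QCQP (27): positive combination K = sum_i mu_i K_i, mu_i >= 0."""
    n_tr = len(y)
    r = np.array([np.trace(Ki) for Ki in K_list])
    G_list = [np.diag(y) @ Ki[:n_tr, :n_tr] @ np.diag(y) for Ki in K_list]
    G_list = [0.5 * (G + G.T) + 1e-10 * np.eye(n_tr) for G in G_list]   # keep quad_form numerically PSD

    alpha = cp.Variable(n_tr, nonneg=True)
    t = cp.Variable()
    quad_cons = [t >= cp.quad_form(alpha, G) / ri for G, ri in zip(G_list, r)]
    constraints = quad_cons + [y @ alpha == 0, alpha <= C]
    objective = cp.Maximize(2 * cp.sum(alpha) - tau * cp.sum_squares(alpha) - c * t)
    prob = cp.Problem(objective, constraints)
    prob.solve()

    # Dual variables of the m quadratic constraints give the kernel weights
    # (typically only the active constraints carry nonzero weight).
    duals = np.array([con.dual_value for con in quad_cons])
    mu = duals / r
    return alpha.value, mu, prob.value
```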
4.2 Hard Margin

In this section, we show how maximizing the margin of a hard margin SVM with respect to the kernel matrix can be realized in the semidefinite programming framework derived in Theorem 16. Inspired by (19), let us try to find the kernel matrix $K$ in some convex subset $\mathcal{K}$ of positive semidefinite matrices for which the corresponding embedding shows maximal margin on the training data, keeping the trace of $K$ constant:
$$\min_{K \in \mathcal{K}} \ \omega(K_{tr}) \quad \text{s.t.} \quad \mathrm{trace}(K) = c. \tag{28}$$
Note that $\omega(K_{tr}) = \omega_{\infty,0}(K_{tr})$. From Proposition 15, we then obtain the following important result.

Proposition 18. The quantity $\omega(K) = \max_{\alpha} \ 2\alpha^T e - \alpha^T G(K)\alpha \ : \ \alpha \geq 0, \ \alpha^T y = 0$, is convex in $K$.

So, a fundamental property of the inverse margin is that it is convex in $K$. This is essential, since it allows us to optimize this quantity in a convex framework. The following theorem shows that, for a suitable choice of the set $\mathcal{K}$, this convex optimization problem can be cast as an SDP.

Theorem 19. Given a linearly separable labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K \in \mathcal{K}$ that optimizes (28) can be found by solving the following problem:
$$\begin{aligned} \min_{K, t, \lambda, \nu} \ & t \\ \text{subject to} \ & \mathrm{trace}(K) = c, \quad K \in \mathcal{K}, \\ & \begin{pmatrix} G(K_{tr}) & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu \geq 0. \end{aligned} \tag{29}$$
Proof. Observe $\omega(K_{tr}) = \omega_{\infty,0}(K_{tr})$. Apply Theorem 16 for $C = \infty$ and $\tau = 0$.

If $\mathcal{K} = \{K \succeq 0\}$, there is no constraint to ensure that a large margin on the training data will give a large margin on the test data: a test matrix $K_t = O$ would optimize the criterion. If we restrict the kernel matrix to a linear subspace $\mathcal{K} = \mathrm{span}\{K_1, \ldots, K_m\} \cap \{K \succeq 0\}$, we obtain
$$\min_K \ \omega(K_{tr}) \quad \text{subject to} \quad \mathrm{trace}(K) = c, \ K \succeq 0, \ K = \sum_{i=1}^m \mu_i K_i, \tag{30}$$
which can be written in the standard form of a semidefinite program, in a manner analogous to (29):
$$\begin{aligned} \min_{\mu_i, t, \lambda, \nu} \ & t \\ \text{subject to} \ & \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i\Big) = c, \quad \sum_{i=1}^m \mu_i K_i \succeq 0, \\ & \begin{pmatrix} G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu \geq 0. \end{aligned} \tag{31}$$
Notice that the SDP approach is consistent with the bound in (19). The margin is optimized over the labeled data (via the use of $K_{i,tr}$), while the positive semidefiniteness and the trace constraint are imposed for the entire kernel matrix $K$ (via the use of $K_i$). This leads to a general method for learning the kernel matrix with semidefinite programming, when using a margin criterion for hard margin SVMs. Applying the complexity results mentioned in Section 3.3 leads to a worst-case complexity $O\big((m + n_{tr})^2 n^{2.5}\big)$ when using general-purpose interior-point methods to solve this particular SDP. Furthermore, this gives a new transduction method for hard margin SVMs. Whereas Vapnik's original method for transduction scales exponentially in the number of test samples, the new SDP method has polynomial time complexity.

Remark. For the specific case in which the $K_i$ are rank-one matrices $K_i = v_i v_i^T$, with $v_i$ orthonormal (e.g., the normalized eigenvectors of an initial kernel matrix $K_0$), the semidefinite program reduces to a QCQP:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq (\bar{v}_i^T \alpha)^2, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{32}$$
with $\bar{v}_i = \mathrm{diag}(y)\, v_i(1:n_{tr})$. This can be seen by observing that, for $K_i = v_i v_i^T$ with $v_i^T v_j = \delta_{ij}$, we have that $\sum_{i=1}^m \mu_i K_i \succeq 0$ is equivalent to $\mu \geq 0$. So, we can apply Theorem 17, with $\tau = 0$ and $C = \infty$, where $\frac{1}{r_i} \alpha^T G(K_{i,tr}) \alpha = \alpha^T \mathrm{diag}(y) v_i(1:n_{tr}) v_i(1:n_{tr})^T \mathrm{diag}(y) \alpha = (\bar{v}_i^T \alpha)^2$.

4.3 Hard Margin with Kernel Matrices that are Positive Linear Combinations

To learn a kernel matrix from this linear class $\mathcal{K}$, one has to solve a semidefinite programming problem: interior-point methods (Nesterov and Nemirovsky, 1994) are polynomial time, but have a worst-case complexity $O\big((m + n_{tr})^2 n^{2.5}\big)$ in this particular case. We now restrict $\mathcal{K}$ to the positive linear combinations of kernel matrices:
$$K = \sum_{i=1}^m \mu_i K_i, \quad \mu \geq 0.$$
Assuming positive weights yields a smaller set of kernel matrices, because the weights need not be positive for $K$ to be positive semidefinite, even if the components $K_i$ are positive semidefinite. Moreover, the restriction has beneficial computational effects: (1) the general SDP reduces to a QCQP, which can be solved with significantly lower complexity $O(m n^3)$; (2) the constraint can result in improved numerical stability, since it prevents the algorithm from using large weights with opposite sign that cancel. Finally, we shall see in Section 5 that the constraint also yields better estimates of the generalization performance of these algorithms.

Theorem 20. Given a labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K = \sum_{i=1}^m \mu_i K_i$ that optimizes (28) under the additional constraint $\mu \geq 0$ can be found by solving the following convex optimization problem, and considering its dual solution:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq \frac{1}{r_i}\alpha^T G(K_{i,tr})\alpha, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{33}$$
where $r \in \mathbb{R}^m$ with $\mathrm{trace}(K_i) = r_i$.

Proof. Apply Theorem 17 for $C = \infty$ and $\tau = 0$.

Note once again that the optimal weights $\mu_i$, $i = 1, \ldots, m$, can be recovered from the primal-dual solution found by standard software such as SeDuMi (Sturm, 1999) or Mosek (Andersen and Andersen, 2000).

4.4 1-Norm Soft Margin

For the case of non-linearly separable data, we can consider the 1-norm soft margin cost function in (3). Training the SVM for a given kernel involves minimizing this quantity with respect to $\mathbf{w}$, $b$, and $\xi$, which yields the optimal value (4): obviously this minimum is a function of the particular choice of $K$, which is expressed explicitly in (4) as a dual problem. Let us now optimize this quantity with respect to the kernel matrix $K$, i.e., let us try to find the kernel matrix $K \in \mathcal{K}$ for which the corresponding embedding yields minimal $\omega_{S1}(K_{tr})$, keeping the trace of $K$ constant:
$$\min_{K \in \mathcal{K}} \ \omega_{S1}(K_{tr}) \quad \text{s.t.} \quad \mathrm{trace}(K) = c. \tag{34}$$
This is again a convex optimization problem.

Theorem 21. Given a labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K \in \mathcal{K}$ that optimizes (34) can be found by solving the following convex optimization problem:
$$\begin{aligned} \min_{K, t, \lambda, \nu, \delta} \ & t \\ \text{subject to} \ & \mathrm{trace}(K) = c, \quad K \in \mathcal{K}, \\ & \begin{pmatrix} G(K_{tr}) & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0, \quad \nu \geq 0, \ \delta \geq 0. \end{aligned} \tag{35}$$
Proof. Observe $\omega_{S1}(K_{tr}) = \omega_{C,0}(K_{tr})$. Apply Theorem 16 for $\tau = 0$.

Again, if $\mathcal{K} = \{K \succeq 0\}$, this is an SDP. Adding the additional constraint (18) that $K$ is a linear combination of fixed kernel matrices leads to the following SDP:
$$\begin{aligned} \min_{\mu_i, t, \lambda, \nu, \delta} \ & t \\ \text{subject to} \ & \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i\Big) = c, \quad \sum_{i=1}^m \mu_i K_i \succeq 0, \\ & \begin{pmatrix} G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0, \quad \nu, \delta \geq 0. \end{aligned} \tag{36}$$

Remark. For the specific case in which the $K_i$ are rank-one matrices $K_i = v_i v_i^T$, with $v_i$ orthonormal (e.g., the normalized eigenvectors of an initial kernel matrix $K_0$), the SDP reduces to a QCQP using Theorem 17, with $\tau = 0$, in a manner analogous to the hard margin case:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq (\bar{v}_i^T \alpha)^2, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ C \geq \alpha \geq 0, \tag{37}$$
with $\bar{v}_i = \mathrm{diag}(y)\, v_i(1:n_{tr})$. Solving the original learning problem subject to the extra constraint $\mu \geq 0$ yields, after applying Theorem 17 with $\tau = 0$:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq \frac{1}{r_i}\alpha^T G(K_{i,tr})\alpha, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ C \geq \alpha \geq 0. \tag{38}$$
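In code, the hard margin problem (33) and the 1-norm soft margin problem (38) are just two parameter settings of the Theorem 17 sketch given earlier (a usage sketch under assumptions: `learn_kernel_qcqp` is the hypothetical helper defined after Theorem 17, the toy kernels and labels are made up, and the large finite value standing in for $C = \infty$ is a numerical convenience, not part of the paper):

```python
import numpy as np

# Toy data: two random PSD kernel matrices over 8 points, the first 6 labeled.
rng = np.random.default_rng(1)
A1, A2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
K_list = [A1 @ A1.T, A2 @ A2.T]
y = np.array([1., 1., 1., -1., -1., -1.])

# Hard margin (33): tau = 0, no box constraint on alpha (C = infinity in the paper).
alpha_hm, mu_hm, _ = learn_kernel_qcqp(K_list, y, c=8.0, C=1e6, tau=0.0)

# 1-norm soft margin (38): tau = 0 and a finite box constraint C >= alpha >= 0.
alpha_s1, mu_s1, _ = learn_kernel_qcqp(K_list, y, c=8.0, C=1.0, tau=0.0)
```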
4.5 2-Norm Soft Margin

For the case of non-linearly separable data, we can also consider the 2-norm soft margin cost function (5). Again, training for a given kernel will minimize this quantity with respect to $\mathbf{w}$, $b$, and $\xi$, and the minimum is a function of the particular choice of $K$, as expressed in (6) in dual form. Let us now optimize this quantity with respect to the kernel matrix $K$:
$$\min_{K \in \mathcal{K}} \ \omega_{S2}(K_{tr}) \quad \text{s.t.} \quad \mathrm{trace}(K) = c. \tag{39}$$
This is again a convex optimization problem, and can be restated as follows.

Theorem 22. Given a labeled sample $S_{n_{tr}} = \{(x_1, y_1), \ldots, (x_{n_{tr}}, y_{n_{tr}})\}$ with the set of labels denoted $y \in \mathbb{R}^{n_{tr}}$, the kernel matrix $K \in \mathcal{K}$ that optimizes (39) can be found by solving the following optimization problem:
$$\begin{aligned} \min_{K, t, \lambda, \nu} \ & t \\ \text{subject to} \ & \mathrm{trace}(K) = c, \quad K \in \mathcal{K}, \\ & \begin{pmatrix} G(K_{tr}) + \tfrac{1}{C} I_{n_{tr}} & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu \geq 0. \end{aligned} \tag{40}$$
Proof. Observe $\omega_{S2}(K_{tr}) = \omega_{\infty,\tau}(K_{tr})$ with $\tau = 1/C$. Apply Theorem 16 for $C = \infty$.

Again, if $\mathcal{K} = \{K \succeq 0\}$, this is an SDP. Moreover, constraining $K$ to be a linear combination of fixed kernel matrices, we obtain
$$\begin{aligned} \min_{\mu_i, t, \lambda, \nu} \ & t \\ \text{subject to} \ & \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i\Big) = c, \quad \sum_{i=1}^m \mu_i K_i \succeq 0, \\ & \begin{pmatrix} G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) + \tfrac{1}{C} I_{n_{tr}} & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu \geq 0. \end{aligned} \tag{41}$$
Also, when the $K_i$ are rank-one matrices, $K_i = v_i v_i^T$, with $v_i$ orthonormal, we obtain a QCQP:
$$\max_{\alpha, t} \ 2\alpha^T e - \tfrac{1}{C}\alpha^T\alpha - c t \quad \text{subject to} \quad t \geq (\bar{v}_i^T\alpha)^2, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{42}$$
and, finally, imposing the constraint $\mu \geq 0$ yields
$$\max_{\alpha, t} \ 2\alpha^T e - \tfrac{1}{C}\alpha^T\alpha - c t \quad \text{subject to} \quad t \geq \frac{1}{r_i}\alpha^T G(K_{i,tr})\alpha, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{43}$$
following a similar derivation as before: apply Theorem 17 with $C = \infty$, and, for (42), observe that $\mu \geq 0$ is equivalent to $\sum_{i=1}^m \mu_i K_i \succeq 0$ if $K_i = v_i v_i^T$ and $v_i^T v_j = \delta_{ij}$.

4.6 Learning the 2-Norm Soft Margin Parameter $\tau = 1/C$

This section shows how the 2-norm soft margin parameter of SVMs can be learned using SDP or QCQP. More details can be found in De Bie et al. (2003). In the previous section, we tried to find the kernel matrix $K \in \mathcal{K}$ for which the corresponding embedding yields minimal $\omega_{S2}(K_{tr})$, keeping the trace of $K$ constant. Since in the dual formulation (6) the identity matrix induced by the 2-norm formulation appears in exactly the same way as the other matrices $K_i$, we can treat it on the same basis and optimize its weight to obtain the optimal dual formulation, i.e., to minimize $\omega_{S2}(K_{tr})$. Since this weight now happens to correspond to the parameter $\tau = 1/C$, optimizing it corresponds to learning the 2-norm soft margin parameter and thus has a significant meaning. Since the parameter $\tau = 1/C$ can be treated in the same way as the weights $\mu_i$, tuning it such that the quantity $\omega_{S2}(K_{tr}, \tau)$ is minimized can be viewed as a method for choosing $\tau$.

First of all, consider the dual formulation (6) and notice that $\omega_{S2}(K_{tr}, \tau)$ is convex in $\tau = 1/C$ (being the pointwise maximum of affine and thus convex functions in $\tau$). Secondly, since $\tau \to \infty$ leads to $\omega_{S2}(K_{tr}, \tau) \to 0$, we impose the constraint $\mathrm{trace}(K + \tau I_n) = c$. This results in the following convex optimization problem:
$$\min_{K \in \mathcal{K}, \, \tau \geq 0} \ \omega_{S2}(K_{tr}, \tau) \quad \text{s.t.} \quad \mathrm{trace}(K + \tau I_n) = c.$$
According to Theorem 22, this can be restated as follows:
$$\begin{aligned} \min_{K, t, \lambda, \nu, \tau} \ & t \\ \text{subject to} \ & \mathrm{trace}(K + \tau I_n) = c, \quad K \in \mathcal{K}, \\ & \begin{pmatrix} G(K_{tr}) + \tau I_{n_{tr}} & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu, \tau \geq 0. \end{aligned} \tag{44}$$
Again, if $\mathcal{K} = \{K \succeq 0\}$, this is an SDP. Imposing the additional constraint that $K$ is a linear function of fixed kernel matrices, we obtain the SDP
$$\begin{aligned} \min_{\mu_i, t, \lambda, \nu, \tau} \ & t \\ \text{subject to} \ & \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i + \tau I_n\Big) = c, \quad \sum_{i=1}^m \mu_i K_i \succeq 0, \\ & \begin{pmatrix} G\big(\sum_{i=1}^m \mu_i K_{i,tr}\big) + \tau I_{n_{tr}} & e + \nu + \lambda y \\ (e + \nu + \lambda y)^T & t \end{pmatrix} \succeq 0, \quad \nu, \tau \geq 0, \end{aligned} \tag{45}$$
and imposing the additional constraint that the $K_i$ are rank-one matrices, we obtain a QCQP:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq (\bar{v}_i^T\alpha)^2, \ i = 1, \ldots, m, \quad t \geq \tfrac{1}{n}\alpha^T\alpha, \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{46}$$
with $\bar{v}_i = \mathrm{diag}(y)\,\tilde{v}_i = \mathrm{diag}(y)\, v_i(1:n_{tr})$. Finally, imposing the constraint that $\mu \geq 0$ yields the following:
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq \frac{1}{r_i}\alpha^T G(K_{i,tr})\alpha, \ i = 1, \ldots, m, \quad t \geq \tfrac{1}{n}\alpha^T\alpha \ \ (48), \quad \alpha^T y = 0, \ \alpha \geq 0, \tag{47}$$
which, as before, is a QCQP. Solving (47) corresponds to learning the kernel matrix as a positive linear combination of kernel matrices according to a 2-norm soft margin criterion and simultaneously learning the 2-norm soft margin parameter $\tau = 1/C$. Comparing (47) with (33), we can see that this reduces to learning an augmented kernel matrix $K'$ as a positive linear combination of kernel matrices and the identity matrix, $K' = K + \tau I_n = \sum_{i=1}^m \mu_i K_i + \tau I_n$, using a hard margin criterion. However, there is an important difference: when evaluating the resulting classifier, the actual kernel matrix $K$ is used, instead of the augmented $K'$ (see, for example, Shawe-Taylor and Cristianini, 1999). For $m = 1$, we notice that (45) directly reduces to (47) if $K_1 \succeq 0$. This corresponds to automatically tuning the parameter $\tau = 1/C$ for a 2-norm soft margin SVM with kernel matrix $K_1$. So, even when not learning the kernel matrix, this approach can be used to tune the 2-norm soft margin parameter $\tau = 1/C$ automatically.
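A sketch of (47)-(48) in code (cvxpy again standing in for the SOCP solvers named in the text): the extra constraint $t \geq \frac{1}{n}\alpha^T\alpha$ plays the role of one more "kernel" (the identity), and by the same reasoning used to recover the $\mu_i$, its dual variable gives the learned $\tau = 1/C$. That recovery rule is an inference from the derivation, not a formula stated in the paper.

```python
import numpy as np
import cvxpy as cp

def learn_kernel_and_tau(K_list, y, c):
    """Sketch of the QCQP (47)-(48): learn positive weights mu and the 2-norm
    soft margin parameter tau = 1/C together, treating the identity matrix as
    an extra kernel via the constraint t >= alpha^T alpha / n."""
    n_tr, n = len(y), K_list[0].shape[0]
    r = np.array([np.trace(Ki) for Ki in K_list])
    G_list = [np.diag(y) @ Ki[:n_tr, :n_tr] @ np.diag(y) for Ki in K_list]
    G_list = [0.5 * (G + G.T) + 1e-10 * np.eye(n_tr) for G in G_list]

    alpha = cp.Variable(n_tr, nonneg=True)
    t = cp.Variable()
    kernel_cons = [t >= cp.quad_form(alpha, G) / ri for G, ri in zip(G_list, r)]
    identity_con = t >= cp.sum_squares(alpha) / n
    prob = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - c * t),
                      kernel_cons + [identity_con, y @ alpha == 0])
    prob.solve()

    mu = np.array([con.dual_value for con in kernel_cons]) / r   # kernel weights
    tau = identity_con.dual_value / n                            # weight on the identity, i.e. 1/C
    return mu, tau, alpha.value
```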
4.7 Alignment

In this section, we consider the problem of optimizing the alignment between a set of labels and a kernel matrix from some class $\mathcal{K}$ of positive semidefinite kernel matrices. We show that, if $\mathcal{K}$ is a class of linear combinations of fixed kernel matrices, this problem can be cast as an SDP. This result generalizes the approach presented in Cristianini et al. (2001, 2002).

Theorem 23. The kernel matrix $K \in \mathcal{K}$ which is maximally aligned with the set of labels $y \in \mathbb{R}^{n_{tr}}$ can be found by solving the following optimization problem:
$$\max_{A, K} \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \mathrm{trace}(A) \leq 1, \quad \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0, \quad K \in \mathcal{K}, \tag{49}$$
where $I_n$ is the identity matrix of dimension $n$.

Proof. We want to find the kernel matrix $K$ which is maximally aligned with the set of labels $y$:
$$\max_K \ \hat{A}(S, K_{tr}, yy^T) \quad \text{subject to} \quad K \in \mathcal{K}, \ \mathrm{trace}(K) = 1.$$
This is equivalent to the following optimization problem:
$$\max_K \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \langle K, K \rangle_F = 1, \ K \in \mathcal{K}. \tag{50}$$
To express this in the standard form (9) of a semidefinite program, we need to express the quadratic equality constraint $\langle K, K \rangle_F = 1$ as an LMI. First, notice that (50) is equivalent to
$$\max_K \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \langle K, K \rangle_F \leq 1, \ K \in \mathcal{K}. \tag{51}$$
Indeed, we are maximizing an objective which is linear in the entries of $K$, so at the optimum $K = K^*$ the constraint $\langle K, K \rangle_F = \mathrm{trace}(K^T K) \leq 1$ is achieved: $\langle K^*, K^* \rangle_F = 1$. The quadratic inequality constraint in (51) is now equivalent to
$$\exists A : \ K^T K \preceq A \quad \text{and} \quad \mathrm{trace}(A) \leq 1.$$
Indeed, $A - K^T K \succeq 0$ implies $\mathrm{trace}(A - K^T K) = \mathrm{trace}(A) - \mathrm{trace}(K^T K) \geq 0$ because of linearity of the trace. Using the Schur complement lemma, we can express $A - K^T K \succeq 0$ as an LMI:
$$A - K^T K \succeq 0 \ \Leftrightarrow \ \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0.$$
We can thus rewrite the optimization problem (50) as
$$\max_{A, K} \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \mathrm{trace}(A) \leq 1, \quad \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0, \quad K \in \mathcal{K},$$
which corresponds to (49).

Notice that, when $\mathcal{K}$ is the set of all positive semidefinite matrices, this is an SDP (an inequality constraint corresponds to a one-dimensional LMI; consider the entries of the matrices $A$ and $K$ as the unknowns, corresponding to the $u_i$ in (9)). In that case, one solution of (49) is found by simply selecting $K_{tr} = \frac{c}{n} y y^T$, for which the alignment (7) is equal to one and thus maximized. Adding the additional constraint (18) that $K$ is a linear combination of fixed kernel matrices leads to
$$\max_K \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \langle K, K \rangle_F \leq 1, \ K \succeq 0, \ K = \sum_{i=1}^m \mu_i K_i, \tag{52}$$
which can be written in the standard form of a semidefinite program, in a similar way as for (49):
$$\max_{A, \mu_i} \ \Big\langle \sum_{i=1}^m \mu_i K_{i,tr}, \, yy^T \Big\rangle_F \quad \text{subject to} \quad \mathrm{trace}(A) \leq 1, \quad \begin{pmatrix} A & \sum_{i=1}^m \mu_i K_i^T \\ \sum_{i=1}^m \mu_i K_i & I_n \end{pmatrix} \succeq 0, \quad \sum_{i=1}^m \mu_i K_i \succeq 0. \tag{53}$$

Remark. For the specific case where the $K_i$ are rank-one matrices $K_i = v_i v_i^T$, with $v_i$ orthonormal (e.g., the normalized eigenvectors of an initial kernel matrix $K_0$), the semidefinite program reduces to a QCQP (see Appendix A):
$$\max_{\mu_i} \ \sum_{i=1}^m \mu_i (\tilde{v}_i^T y)^2 \quad \text{subject to} \quad \sum_{i=1}^m \mu_i^2 \leq 1, \quad \mu_i \geq 0, \ i = 1, \ldots, m, \tag{54}$$
with $\tilde{v}_i = v_i(1:n_{tr})$. This corresponds exactly to the QCQP obtained as an illustration in Cristianini et al. (2002), which is thus entirely captured by the general SDP result obtained in this section.

Solving the original learning problem (52) subject to the extra constraint $\mu \geq 0$ yields
$$\max_K \ \langle K_{tr}, yy^T \rangle_F \quad \text{subject to} \quad \langle K, K \rangle_F \leq 1, \ K \succeq 0, \ K = \sum_{i=1}^m \mu_i K_i, \ \mu \geq 0.$$
We can omit the second constraint, because this is implied by the last two constraints, if $K_i \succeq 0$. This reduces to
$$\max_{\mu} \ \Big\langle \sum_{i=1}^m \mu_i K_{i,tr}, \, yy^T \Big\rangle_F \quad \text{subject to} \quad \Big\langle \sum_{i=1}^m \mu_i K_i, \, \sum_{j=1}^m \mu_j K_j \Big\rangle_F \leq 1, \ \mu \geq 0,$$
where $K_{i,tr} = K_i(1:n_{tr}, 1:n_{tr})$. Expanding this further yields
$$\Big\langle \sum_{i=1}^m \mu_i K_{i,tr}, \, yy^T \Big\rangle_F = \sum_{i=1}^m \mu_i \langle K_{i,tr}, yy^T \rangle_F = \mu^T q, \tag{55}$$
$$\Big\langle \sum_{i=1}^m \mu_i K_i, \, \sum_{j=1}^m \mu_j K_j \Big\rangle_F = \sum_{i,j=1}^m \mu_i \mu_j \langle K_i, K_j \rangle_F = \mu^T S \mu, \tag{56}$$
with $q_i = \langle K_{i,tr}, yy^T \rangle_F = \mathrm{trace}(K_{i,tr} yy^T) = y^T K_{i,tr} y$ and $S_{ij} = \langle K_i, K_j \rangle_F$, where $q \in \mathbb{R}^m$, $S \in \mathbb{R}^{m \times m}$. We used the fact that $\mathrm{trace}(ABC) = \mathrm{trace}(BCA)$ (if the products are well-defined). We obtain the following learning problem:
$$\max_{\mu} \ \mu^T q \quad \text{subject to} \quad \mu^T S \mu \leq 1, \ \mu \geq 0,$$
which is a QCQP.
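The final problem involves only the $m$ weights, so it is easy to set up directly. A minimal sketch (numpy and cvxpy assumed) of computing $q$ and $S$ from (55)-(56) and solving $\max_\mu \mu^T q$ subject to $\mu^T S \mu \leq 1$, $\mu \geq 0$:

```python
import numpy as np
import cvxpy as cp

def align_kernel_weights(K_list, y):
    """Sketch of the alignment QCQP derived from (55)-(56)."""
    n_tr = len(y)
    q = np.array([y @ Ki[:n_tr, :n_tr] @ y for Ki in K_list])                    # q_i = y^T K_{i,tr} y
    S = np.array([[np.trace(Ki.T @ Kj) for Kj in K_list] for Ki in K_list])      # S_ij = <K_i, K_j>_F
    S = 0.5 * (S + S.T) + 1e-12 * np.eye(len(K_list))                            # PSD up to round-off

    mu = cp.Variable(len(K_list), nonneg=True)
    prob = cp.Problem(cp.Maximize(q @ mu), [cp.quad_form(mu, S) <= 1])
    prob.solve()
    return mu.value
```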
4.8 Induction

In previous sections we have considered the transduction setting, where it is assumed that the covariate vectors for both training (labeled) and test (unlabeled) data are known beforehand. While this setting captures many realistic learning problems, it is also of interest to consider possible extensions of our approach to the more general setting of induction, in which the covariates are known beforehand only for the training data.

Consider the following situation. We learn the kernel matrix as a positive linear combination of normalized kernel matrices $K_i$. Those $K_i$ are obtained through the evaluation of a kernel function or through a known procedure (e.g., a string-matching kernel), yielding $K_i \succeq 0$. So, $K = \sum_{i=1}^m \mu_i K_i \succeq 0$. Normalization is done by replacing $K_i(k, l)$ by $K_i(k, l)/\sqrt{K_i(k, k) \cdot K_i(l, l)}$. In this case, the extension to an induction setting is elegant and simple.

Let $n_{tr}$ be the number of training data points (all labeled). Consider the transduction problem for those $n_{tr}$ data points and one unknown test point, e.g., for a hard margin SVM. The optimal weights $\mu_i^*$, $i = 1, \ldots, m$, are learned by solving (33):
$$\max_{\alpha, t} \ 2\alpha^T e - c t \quad \text{subject to} \quad t \geq \frac{1}{n_{tr}+1}\alpha^T G(K_{i,tr})\alpha, \ i = 1, \ldots, m, \quad \alpha^T y = 0, \ \alpha \geq 0. \tag{57}$$
Even without knowing the test point and the entries of the $K_i$'s related to it (column and row $n_{tr}+1$), we know that $K_i(n_{tr}+1, n_{tr}+1) = 1$ because of the normalization. So, $\mathrm{trace}(K_i) = n_{tr} + 1$. This allows solving for the optimal weights $\mu_i^*$, $i = 1, \ldots, m$, and the optimal SVM parameters $\alpha_j^*$, $j = 1, \ldots, n_{tr}$, and $b^*$, without knowing the test point. When a test point becomes available, we complete the $K_i$'s by computing their $(n_{tr}+1)$-th column and row (evaluate the kernel function or follow the procedure and normalize). Combining those $K_i$ with weights $\mu_i^*$ yields the final kernel matrix $K$, which can then be used to label the test point:
$$y = \mathrm{sign}\Big( \sum_{i=1}^m \sum_{j=1}^{n_{tr}} \mu_i^* \alpha_j^* K_i(x_j, x) \Big).$$
Remark: The optimal weights are independent of the number of unknown test points that are considered in this setting. Consider the transduction problem (57) for $l$ unknown test points instead of one unknown test point:
$$\max_{\tilde\alpha, \tilde t} \ 2\tilde\alpha^T e - c \tilde t \quad \text{subject to} \quad \tilde t \geq \frac{1}{n_{tr}+l}\tilde\alpha^T G(K_{i,tr})\tilde\alpha, \ i = 1, \ldots, m, \quad \tilde\alpha^T y = 0, \ \tilde\alpha \geq 0. \tag{58}$$
One can see that solving (58) is equivalent to solving (57), where the optimal values relate as $\tilde\alpha^* = \frac{n_{tr}+l}{n_{tr}+1}\alpha^*$ and $\tilde t^* = \frac{n_{tr}+l}{n_{tr}+1} t^*$ and where the optimal weights $\mu_i^*$, $i = 1, \ldots, m$, are the same.
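A minimal sketch of the normalization and labeling rule used in this induction recipe (numpy assumed; `mu_star`, `alpha_star` and `kernel_fns` are hypothetical names for the learned weights, the learned SVM parameters on the training points, and the $m$ underlying kernel functions):

```python
import numpy as np

def normalize_kernel(K):
    """Replace K(k, l) by K(k, l) / sqrt(K(k, k) * K(l, l)); every diagonal entry becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def label_new_point(x, X_train, mu_star, alpha_star, kernel_fns):
    """Label a new point as y = sign( sum_i sum_j mu*_i alpha*_j K_i(x_j, x) ),
    evaluating and normalizing each kernel column against the training points."""
    score = 0.0
    for mu_i, k_i in zip(mu_star, kernel_fns):
        col = np.array([k_i(xj, x) for xj in X_train])
        col /= np.sqrt(np.array([k_i(xj, xj) for xj in X_train]) * k_i(x, x))   # normalization
        score += mu_i * (alpha_star @ col)
    return np.sign(score)
```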
Tackling the induction problem in full generality remains a challenge for future work. Obviously, one could consider the transduction case with zero test points, yielding the induction case. If the weights $\mu_i$ are constrained to be nonnegative and furthermore the matrices $K_i$ are guaranteed to be positive semidefinite, the weights can be reused at new test points. To deal with induction in a general SDP setting, one could solve a transduction problem for each new test point. For every test point, this leads to solving an SDP of dimension $n_{tr}+1$, which is computationally expensive. Clearly there is a need to explore recursive solutions to the SDP problem that allow the solution of the SDP of dimension $n_{tr}$ to be used in the solution of an SDP of dimension $n_{tr}+1$. Such solutions would of course also have immediate applications to on-line learning problems.

5. Error Bounds for Transduction

In the problem of transduction, we have access to the unlabeled test data, as well as the labeled training data, and the aim is to optimize accuracy in predicting the test data. We assume that the data are fixed, and that the order is chosen randomly, yielding a random partition into a labeled training set and an unlabeled test set. For convenience, we suppose here that the training and test sets have the same size. Of course, if we can show a performance guarantee that holds with high probability over uniformly chosen training/test partitions of this kind, it also holds with high probability over an i.i.d. choice of the training and test data, since permuting an i.i.d. sample leaves the distribution unchanged.

The following theorem gives an upper bound on the error of a kernel classifier on the test data in terms of the average over the training data of a certain margin cost function, together with properties of the kernel matrix. We focus on the 1-norm soft margin classifier, although our results extend in a straightforward way to other cases, including the 2-norm soft margin classifier. The 1-norm soft margin classifier chooses a kernel classifier $f$ to minimize a weighted combination of a regularization term, $\|\mathbf{w}\|^2$, and the average over the training sample of the slack variables, $\xi_i = \max(1 - y_i f(x_i), 0)$. We can view this regularized empirical criterion as the Lagrangian for the constrained minimization of
$$\frac{1}{n}\sum_{i=1}^n \xi_i = \frac{1}{n}\sum_{i=1}^n \max(1 - y_i f(x_i), 0)$$
subject to the upper bound $\|\mathbf{w}\|^2 \leq 1/\gamma^2$.

Fix a sequence of $2n$ pairs $(X_1, Y_1), \ldots, (X_{2n}, Y_{2n})$ from $X \times Y$. Let $\pi : \{1, \ldots, 2n\} \to \{1, \ldots, 2n\}$ be a random permutation, chosen uniformly, and let $(x_i, y_i) = (X_{\pi(i)}, Y_{\pi(i)})$. The first half of this randomly ordered sequence is the training data $T_n = ((x_1, y_1), \ldots, (x_n, y_n))$, and the second half is the test data $S_n = ((x_{n+1}, y_{n+1}), \ldots, (x_{2n}, y_{2n}))$. For a function $f : X \to \mathbb{R}$, the proportion of errors on the test data of a thresholded version of $f$ can be written as
$$\mathrm{er}(f) = \frac{1}{n}\big| \{ n+1 \leq i \leq 2n : y_i f(x_i) \leq 0 \} \big|.$$
We consider kernel classifiers obtained by thresholding kernel expansions of the form
$$f(x) = \langle \mathbf{w}, \Phi(x) \rangle = \sum_{i=1}^{2n} \alpha_i k(x_i, x), \tag{59}$$
where $\mathbf{w} = \sum_{i=1}^{2n} \alpha_i \Phi(x_i)$ is chosen with bounded norm,
$$\|\mathbf{w}\|^2 = \sum_{i,j=1}^{2n} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha \leq \frac{1}{\gamma^2}, \tag{60}$$
and $K$ is the $2n \times 2n$ kernel matrix with $K_{ij} = k(x_i, x_j)$.
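A minimal sketch (numpy assumed) of the two quantities just defined: the test error $\mathrm{er}(f)$ of a thresholded kernel expansion (59) and the norm quantity bounded in (60).

```python
import numpy as np

def kernel_expansion_error(K, alpha, y, n):
    """For the 2n-point kernel matrix K and expansion coefficients alpha,
    f(x_j) = sum_i alpha_i k(x_i, x_j); er(f) is the fraction of test points
    (indices n..2n-1) with y_j f(x_j) <= 0, cf. (59)."""
    f = K.T @ alpha                     # f(x_j) for all 2n points
    test = slice(n, 2 * n)
    return np.mean(y[test] * f[test] <= 0)

def squared_norm(K, alpha):
    """||w||^2 = alpha^T K alpha, the quantity bounded by 1/gamma^2 in (60)."""
    return float(alpha @ K @ alpha)
```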
Let $F_{\mathcal{K}}$ denote the class of functions on $S$ of the form (59) satisfying (60), for some $K \in \mathcal{K}$,
$$F_{\mathcal{K}} = \Big\{ x_j \mapsto \sum_{i=1}^{2n} \alpha_i K_{ij} \ : \ K \in \mathcal{K}, \ \alpha^T K \alpha \leq \tfrac{1}{\gamma^2} \Big\},$$
where $\mathcal{K}$ is a set of positive semidefinite $2n \times 2n$ matrices. We also consider the class of kernel expansions obtained from certain linear combinations of a fixed set $\{K_1, \ldots, K_m\}$ of kernel matrices: define the class $F_{\mathcal{K}_c}$ with
$$\mathcal{K}_c = \Big\{ K = \sum_{j=1}^m \mu_j K_j \ : \ K \succeq 0, \ \mu_j \in \mathbb{R}, \ \mathrm{trace}(K) \leq c \Big\},$$
and the class $F_{\mathcal{K}_c^+}$ with
$$\mathcal{K}_c^+ = \Big\{ \sum_{j=1}^m \mu_j K_j \ : \ K \succeq 0, \ \mu_j \geq 0, \ \mathrm{trace}(K) \leq c \Big\}.$$

Theorem 24. For every $\gamma > 0$, with probability at least $1 - \delta$ over the data $(x_i, y_i)$ chosen as above, every function $f \in F_{\mathcal{K}}$ has $\mathrm{er}(f)$ no more than
$$\frac{1}{n}\sum_{i=1}^n \max\{1 - y_i f(x_i), 0\} + \frac{1}{\sqrt{n}}\left( 4 + \sqrt{2\log(1/\delta)} + \sqrt{\frac{C(\mathcal{K})}{n\gamma^2}} \right),$$
where $C(\mathcal{K}) = \mathbb{E}\, \max_{K \in \mathcal{K}} \sigma^T K \sigma$, with the expectation over $\sigma$ chosen uniformly from $\{\pm 1\}^{2n}$. Furthermore,
$$C(\mathcal{K}_c) = c \, \mathbb{E}\, \max_{K \in \mathcal{K}} \frac{\sigma^T K \sigma}{\mathrm{trace}(K)},$$
and this is always no more than $c n$, and
$$C(\mathcal{K}_c^+) \leq c \min\Big( m, \ n \max_j \frac{\lambda_j}{\mathrm{trace}(K_j)} \Big),$$
where $\lambda_j$ is the largest eigenvalue of $K_j$.

Notice that the test error is bounded by a sum of the average over the training data of a margin cost function plus a complexity penalty term that depends on the ratio between the trace of the kernel matrix and the squared margin parameter, $\gamma^2$. The kernel matrix here is the full matrix, combining both test and training data. The proof of the theorem is in Appendix B. The proof technique for the first part of the theorem was introduced by Koltchinskii and Panchenko (2002), who used it to give error bounds for boosting algorithms.

Although the theorem requires the margin parameter $\gamma$ to be specified in advance, it is straightforward to extend the result to give an error bound that holds with high probability over all values of $\gamma$. In this case, the $\log(1/\delta)$ in the bound would be replaced by $\log(1/\delta) + |\log(1/\gamma)|$ and the constants would increase slightly. See, for example, Proposition 8 and its applications in the work of Bartlett (1998). The result is presented for the 1-norm soft margin classifier, but the proof uses only two properties of the cost function $a \mapsto \max\{1 - a, 0\}$: that it is an upper bound on the indicator function for $a \leq 0$, and that it satisfies a Lipschitz constraint on $[0, \infty)$. These conditions are also satisfied by the cost function associated with the 2-norm soft margin classifier, $a \mapsto (\max\{1 - a, 0\})^2$, for example.

The bound on the complexity $C(\mathcal{K}_B^+)$ of the kernel class $\mathcal{K}_B^+$ is easier to check than the bound on $C(\mathcal{K}_B)$. The first term in the minimum shows that the set of positive linear combinations of a small set of kernel matrices is not very complex. The second term shows that if, for each matrix in the set, the largest eigenvalue does not dominate the sum of the eigenvalues (the trace), then the set of positive linear combinations is not too complex, even if the set is large. In either case, the upper bound is linear in $c$, the upper bound on the trace of the combined kernel matrix.

6. Empirical Results

We first present results on benchmark data sets, using kernels $K_i$ that are derived from the same input vector. The goal here is to explore different possible representations of the same data source, and to choose a representation or combination of representations that yields the best performance. We compare to the soft margin SVM with an RBF kernel, in which the hyperparameter is tuned via cross-validation. Note that in our framework there is no need for cross-validation to tune the corresponding kernel hyperparameters. Moreover, when using the 2-norm soft margin SVM, the methods are directly comparable, because the hyperparameter $C$ is present in both cases.

In the second section we explore the use of our framework to combine kernels that are built using data from heterogeneous sources. Here our main interest is in comparing the combined classifier to the best individual classifier. To the extent that the heterogeneous data sources provide complementary information, we might expect that the performance of the combined classifier can dominate that of the best individual classifier.

6.1 Benchmark Data Sets

We present results for hard margin and soft margin support vector machines. We use a kernel matrix $K = \sum_{i=1}^3 \mu_i K_i$, where the $K_i$'s are initial "guesses" of the kernel matrix.
6. Empirical Results

We first present results on benchmark data sets, using kernels $K_i$ that are derived from the same input vector. The goal here is to explore different possible representations of the same data source, and to choose a representation or combination of representations that yields the best performance. We compare to the soft margin SVM with an RBF kernel, in which the hyperparameter is tuned via cross-validation. Note that in our framework there is no need for cross-validation to tune the corresponding kernel hyperparameters. Moreover, when using the 2-norm soft margin SVM, the methods are directly comparable, because the hyperparameter $C$ is present in both cases.

In the second section we explore the use of our framework to combine kernels that are built using data from heterogeneous sources. Here our main interest is in comparing the combined classifier to the best individual classifier. To the extent that the heterogeneous data sources provide complementary information, we might expect that the performance of the combined classifier can dominate that of the best individual classifier.

6.1 Benchmark Data Sets

We present results for hard margin and soft margin support vector machines. We use a kernel matrix $K = \sum_{i=1}^{3} \mu_i K_i$, where the $K_i$'s are initial "guesses" of the kernel matrix. We use a polynomial kernel function $k_1(x_1, x_2) = (1 + x_1^T x_2)^d$ for $K_1$, a Gaussian kernel function $k_2(x_1, x_2) = \exp(-0.5\,(x_1 - x_2)^T (x_1 - x_2)/\sigma)$ for $K_2$ and a linear kernel function $k_3(x_1, x_2) = x_1^T x_2$ for $K_3$. Afterwards, all $K_i$ are normalized.
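For concreteness, the initial kernel matrices can be computed as in the following sketch (illustrative only, not the code used for the experiments; in particular, the unit-diagonal normalization $K_{ij} \leftarrow K_{ij}/\sqrt{K_{ii} K_{jj}}$ is one common choice of normalization and is an assumption on our part):

    import numpy as np

    def normalize(K):
        # Unit-diagonal normalization: K_ij / sqrt(K_ii * K_jj).
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    def initial_kernels(X, d=2, sigma=0.5):
        lin = X @ X.T                                  # x_i^T x_j
        K1 = (1.0 + lin) ** d                          # polynomial kernel
        sq = np.sum(X ** 2, axis=1)
        K2 = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * lin) / sigma)  # Gaussian
        K3 = lin                                       # linear kernel
        return [normalize(K) for K in (K1, K2, K3)]

    X = np.random.default_rng(0).standard_normal((20, 5))   # toy input vectors
    K1, K2, K3 = initial_kernels(X)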
After evaluating the initial kernel matrices $\{K_i\}_{i=1}^{3}$, the weights $\{\mu_i\}_{i=1}^{3}$ are optimized in a transduction setting according to a hard margin, a 1-norm soft margin and a 2-norm soft margin criterion, respectively; the semidefinite programs (31), (36) and (41) are solved using the general-purpose optimization software SeDuMi (Sturm, 1999), leading to optimal weights $\{\mu_i^*\}_{i=1}^{3}$. Next, the weights $\{\mu_i\}_{i=1}^{3}$ are constrained to be non-negative and optimized according to the same criteria, again in a transduction setting: the second order cone programs (33), (38) and (43) are solved using the general-purpose optimization software Mosek (Andersen and Andersen, 2000), leading to optimal weights $\{\mu_{i,+}^*\}_{i=1}^{3}$. For positive weights, we also report results in which the 2-norm soft margin hyperparameter $C$ is tuned according to (47).
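To give a feel for the structure of these programs, the following sketch poses the non-negative-weight, hard-margin case as a quadratically constrained program with a generic modeling tool. This is our own rendering of the underlying max-min problem and only loosely follows program (33); CVXPY stands in for the SeDuMi/Mosek software actually used in the experiments, and for simplicity the trace budget is expressed in terms of the training blocks only.

    import numpy as np
    import cvxpy as cp

    def learn_hard_margin_weights(K_tr_list, y, c):
        # y: labels in {-1, +1} for the training points; K_tr_list: training
        # blocks of the candidate kernel matrices K_1, ..., K_m.
        n = len(y)
        Y = np.diag(y.astype(float))
        alpha = cp.Variable(n, nonneg=True)
        t = cp.Variable()
        constraints = [y.astype(float) @ alpha == 0]
        for K in K_tr_list:
            G = Y @ K @ Y
            constraints.append(cp.quad_form(alpha, cp.psd_wrap(G)) / np.trace(K) <= t)
        problem = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - c * t), constraints)
        problem.solve()
        # The weights mu_i are proportional to the dual variables of the
        # quadratic constraints divided by trace(K_i) (modulo the sign
        # conventions of the modeling tool); here we return the raw duals.
        return problem.value, [con.dual_value for con in constraints[1:]]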
Empirical results on standard benchmark data sets are summarized in Tables 1, 2 and 3.² The Wisconsin breast cancer dataset contained 16 incomplete examples which were not used. The breast cancer, ionosphere and sonar data were obtained from the UCI repository. The heart data were obtained from STATLOG and normalized. Data for the 2-norm problem were generated as specified by Breiman (1998). Each data set was randomly partitioned into 80% training and 20% test sets. The reported results are the averages over 30 random partitions. The kernel parameters for $K_1$ and $K_2$ are given in Tables 1, 2 and 3 by $d$ and $\sigma$ respectively.

² It is worth noting that the first three columns of these tables are based on an inductive algorithm whereas the last two columns are based on a transductive algorithm. This may favor the kernel combinations in the last two columns and thus the results should be interpreted with caution. However, it is also worth noting that the transduction is a weak form of transduction that is based only on the norm of the test data point.

For each of the kernel matrices, an SVM is trained using the training block $K_{tr}$ and tested using the mixed block $K_{tr,t}$ as defined in (17). The margin $\gamma$ (for a hard margin criterion) and the optimal soft margin cost functions $\omega^*_{S1}$ and $\omega^*_{S2}$ (for soft margin criteria) are reported for the initial kernel matrices $K_i$, as well as for the optimal $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$. Furthermore, the average test set accuracy (TSA), the average value for $C$ and the average weights over the 30 partitions are listed. For comparison, the performance of the best soft margin SVM with an RBF (Gaussian) kernel is reported; the soft margin hyperparameter $C$ and the kernel parameter $\sigma$ for the Gaussian kernel were tuned using cross-validation over 30 random partitions of the training set.

Note that not every $K_i$ gives rise to a linearly separable embedding of the training data, in which case no hard margin classifier can be found (indicated with a dash). The matrices $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$, however, always allow the training of a hard margin SVM and their margin is indeed larger than the margin for each of the different components $K_i$; this is consistent with the SDP/QCQP optimization. For the soft margin criteria, the optimal value of the cost function for $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$ is smaller than its value for the individual $K_i$, again consistent with the SDP/QCQP optimizations. Notice that constraining the weights $\mu_i$ to be positive results in slightly smaller margins and larger cost functions, as expected. Furthermore, the number of test set errors for $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$ is in general comparable in magnitude to the best value achieved among the different components $K_i$. Also notice that $\sum_i \mu_{i,+}^* K_i$ often does almost as well as $\sum_i \mu_i^* K_i$, and sometimes even better: we can thus achieve a substantial reduction in computational complexity without a significant loss of performance. Moreover, the performance of $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$ is comparable with the best soft margin SVM with an RBF kernel. In making this comparison note that the RBF SVM requires tuning of the kernel parameter using cross-validation, while the kernel learning approach achieves a similar effect without cross-validation.³ Moreover, when using the 2-norm soft margin SVM with tuned hyperparameter $C$, we no longer need to do cross-validation for $C$. This leads to a smaller value of the optimal cost function $\omega^*_{S2}$ (compared to the case SM2, with $C = 1$) and performs well on the test set, while offering the advantage of automatically adjusting $C$.

³ The experiments were run on a 2 GHz Windows XP machine. We used the program SeDuMi to solve the SDP for kernel learning, and Mosek to solve multiple QPs for the cross-validated SVM and the QCQP for kernel learning with positive weights. The runtime for the SDP is on the order of minutes (approximately 10 minutes for 300 data points and 5 kernels), while the runtime for the QP and QCQP is on the order of seconds (approximately 1 second for 300 data points and 1 kernel, and approximately 3 seconds for 300 data points and 5 kernels). Thus, we see that kernel learning with positive weights, which requires only a QCQP solution, achieves an accuracy which is comparable to the full SDP approach at a fraction of the computational cost, and our tentative recommendation is that the QCQP approach is to be preferred. It is worth noting, however, that special-purpose implementations of SDPs that take advantage of the structure of the kernel learning problem may well yield significant speed-ups, and the recommendation should be taken with caution. Finally, the QCQP approach also compares favorably in terms of runtime to the multiple runs of a QP that are required for cross-validation, and should be considered a viable alternative to cross-validation, particularly given the high variance associated with cross-validation in small data sets.

One might wonder why there is a difference between the SDP and the QCQP approach for the 2-norm data, since both seem to find positive weights $\mu_i$. However, it must be recalled that the values in Table 3 are averages over 30 randomizations; for some randomizations the SDP has actually found negative weights, although the averages are positive.

As a further example illustrating the flexibility of the SDP framework, consider the following setup. Let $\{K_i\}_{i=1}^{5}$ be Gaussian kernels with $\sigma = 0.01, 0.1, 1, 10, 100$ respectively. Combining those optimally with $\mu_i \ge 0$ for a 2-norm soft margin SVM, with tuning of $C$, yields the results in Table 4, averages over 30 randomizations into 80% training and 20% test sets. The test set accuracies obtained for $\sum_i \mu_{i,+}^* K_i$ are competitive with those for the best soft margin SVM with an RBF kernel, tuned using cross-validation. The average weights show that some kernels are selected and others are not. Effectively we obtain a data-based choice of smoothing parameter without recourse to cross-validation.
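The transductive evaluation protocol described above (fit on the training block, score on the test rows of the full kernel matrix) can be sketched as follows. This is a simplified illustration, not the experimental code: scikit-learn's 1-norm soft margin SVM with a precomputed kernel stands in for the SVM solvers used in the paper, and the index-block notation mirrors (17).

    import numpy as np
    from sklearn.svm import SVC

    def transductive_accuracy(K, y, train_idx, test_idx, C=1.0):
        # K: full (training + test) kernel matrix, e.g. sum_i mu_i K_i;
        # only the labels of the training points are used for fitting.
        K_tr = K[np.ix_(train_idx, train_idx)]      # training block
        K_t_tr = K[np.ix_(test_idx, train_idx)]     # test-versus-training block
        svm = SVC(C=C, kernel="precomputed").fit(K_tr, y[train_idx])
        return np.mean(svm.predict(K_t_tr) == y[test_idx])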
Table 1: SVMs trained and tested with the initial kernel matrices $K_1, K_2, K_3$ and with the optimal kernel matrices $\sum_i \mu_i^* K_i$ and $\sum_i \mu_{i,+}^* K_i$. For hard margin SVMs (HM), the resulting margin $\gamma$ is given, a dash meaning that no hard margin classifier could be found; for soft margin SVMs (SM1 = 1-norm soft margin with $C = 1$, SM2 = 2-norm soft margin with $C = 1$, and SM2,C = 2-norm soft margin with automatic tuning of $C$) the optimal value of the cost function $\omega^*_{S1}$ or $\omega^*_{S2}$ is given. Furthermore, the test set accuracy (TSA), the average weights and the average $C$-values are given. For $c$ we used $c = \sum_i \operatorname{trace}(K_i)$ for HM, SM1 and SM2. The initial kernel matrices are evaluated after being multiplied by 3. This assures we can compare the different $\gamma$ for HM, $\omega^*_{S1}$ for SM1 and $\omega^*_{S2}$ for SM2, since the resulting kernel matrix has a constant trace (thus, everything is on the same scale). For SM2,C we use $c = \sum_i \operatorname{trace}(K_i) + \operatorname{trace}(I_n)$. This not only allows comparing the different $\omega^*_{S2}$ for SM2,C but also allows comparing $\omega^*_{S2}$ between SM2 and SM2,C (since we choose $C = 1$ for SM2, we have that $\operatorname{trace}(\sum_{i=1}^{m} \mu_i K_i + \frac{1}{C} I_n)$ is constant in both cases, so again, we are on the same scale). Finally, the column "best c/v RBF" reports the performance of the best soft margin SVM with RBF kernel, tuned using cross-validation. In each row below the columns are, in order: $K_1$ | $K_2$ | $K_3$ | $\sum_i \mu_i^* K_i$ | $\sum_i \mu_{i,+}^* K_i$ | best c/v RBF.

Heart (d = 2, σ = 0.5)
  HM     γ         0.0369 | 0.1221 | --     | 0.1531 | 0.1528 |
         TSA       72.9%  | 59.5%  | --     | 84.8%  | 84.6%  | 77.7%
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -0.09/2.68/0.41 | 0.01/2.60/0.39 |
  SM1    ω*_S1     58.169 | 33.536 | 74.302 | 21.361 | 21.446 |
         TSA       79.3%  | 59.5%  | 84.3%  | 84.8%  | 84.6%  | 83.9%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -0.09/2.68/0.41 | 0.01/2.60/0.39 |
  SM2    ω*_S2     32.726 | 25.386 | 45.891 | 15.988 | 16.034 |
         TSA       78.1%  | 59.0%  | 84.3%  | 84.8%  | 84.6%  | 83.2%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -0.08/2.54/0.54 | 0.01/2.47/0.53 |
  SM2,C  ω*_S2     19.643 | 25.153 | 16.004 | -- | 15.985 |
         TSA       81.3%  | 59.6%  | 84.7%  | -- | 84.6%  | 83.2%
         C         0.3378 | 1.18e+7 | 0.2880 | -- | 0.4365 |
         μ1/μ2/μ3  1.04/0/0 | 0/3.99/0 | 0/0/0.53 | -- | 0.01/0.80/0.53 |

Sonar (d = 2, σ = 0.1)
  HM     γ         0.0246 | 0.1460 | 0.0021 | 0.1517 | 0.1459 |
         TSA       80.9%  | 85.8%  | 74.2%  | 84.6%  | 85.8%  | 84.2%
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -2.23/3.52/1.71 | 0/3/0 |
  SM1    ω*_S1     87.657 | 23.288 | 102.68 | 21.637 | 23.289 |
         TSA       78.1%  | 85.6%  | 73.3%  | 84.6%  | 85.6%  | 84.2%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -2.20/3.52/1.69 | 0/3/0 |
  SM2    ω*_S2     45.048 | 15.893 | 53.292 | 15.219 | 15.893 |
         TSA       79.1%  | 85.2%  | 76.7%  | 84.5%  | 85.2%  | 84.2%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | -1.78/3.46/1.32 | 0/3/0 |
  SM2,C  ω*_S2     20.520 | 15.640 | 20.620 | -- | 15.640 |
         TSA       60.9%  | 84.6%  | 51.0%  | -- | 84.6%  | 84.2%
         C         0.2591 | 0.6087 | 0.2510 | -- | 0.6087 |
         μ1/μ2/μ3  0.14/0/0 | 0/2.36/0 | 0/0/0.02 | -- | 0/2.34/0 |

Table 2: See the caption to Table 1 for explanation.

Breast cancer (d = 2, σ = 0.5)
  HM     γ         0.0036 | 0.1055 | --     | 0.1369 | 0.1219 |
         TSA       92.9%  | 89.0%  | --     | 95.5%  | 94.4%  | 96.1%
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 1.90/2.35/-1.25 | 0.65/2.35/0 |
  SM1    ω*_S1     77.012 | 44.913 | 170.26 | 26.694 | 33.689 |
         TSA       96.4%  | 89.0%  | 87.7%  | 95.5%  | 94.4%  | 96.7%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 1.90/2.35/-1.25 | 0.65/2.35/0 |
  SM2    ω*_S2     43.138 | 35.245 | 102.51 | 20.696 | 21.811 |
         TSA       96.4%  | 88.5%  | 87.4%  | 95.4%  | 94.3%  | 96.8%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 2.32/2.13/-1.46 | 0.89/2.11/0 |
  SM2,C  ω*_S2     27.682 | 33.685 | 41.023 | -- | 25.267 |
         TSA       94.5%  | 89.0%  | 87.3%  | -- | 94.4%  | 96.8%
         C         0.3504 | 1.48e+8 | 0.3051 | -- | 6.77e+7 |
         μ1/μ2/μ3  1.15/0/0 | 0/3.99/0 | 0/0/0.72 | -- | 0.87/3.13/0 |

Ionosphere (d = 2, σ = 0.5)
  HM     γ         0.0613 | 0.1452 | --     | 0.1623 | 0.1616 |
         TSA       91.2%  | 92.0%  | --     | 94.4%  | 94.4%  | 93.9%
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 1.08/2.18/-0.26 | 0.79/2.21/0 |
  SM1    ω*_S1     30.786 | 23.233 | 52.312 | 18.117 | 18.303 |
         TSA       94.5%  | 92.1%  | 83.1%  | 94.8%  | 94.5%  | 94.0%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 1.23/2.07/-0.30 | 0.90/2.10/0 |
  SM2    ω*_S2     18.533 | 17.907 | 31.662 | 13.382 | 13.542 |
         TSA       94.7%  | 92.0%  | 91.6%  | 94.5%  | 94.4%  | 94.2%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 1.68/1.73/-0.41 | 1.23/1.78/0 |
  SM2,C  ω*_S2     14.558 | 17.623 | 18.975 | -- | 13.5015 |
         TSA       93.5%  | 92.1%  | 90.0%  | -- | 94.6%  | 94.2%
         C         0.4144 | 5.8285 | 0.3442 | -- | 0.8839 |
         μ1/μ2/μ3  1.59/0/0 | 0/3.83/0 | 0/0/1.09 | -- | 1.24/1.61/0 |

Table 3: See the caption to Table 1 for explanation.

2-norm (d = 2, σ = 0.1)
  HM     γ         0.1436 | 0.1072 | 0.0509 | 0.2170 | 0.2169 |
         TSA       94.6%  | 55.4%  | 94.3%  | 96.6%  | 96.6%  | 96.3%
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 0.03/1.91/1.06 | 0.06/1.88/1.06 |
  SM1    ω*_S1     23.835 | 43.509 | 22.262 | 10.636 | 10.641 |
         TSA       95.0%  | 55.4%  | 95.7%  | 96.6%  | 96.6%  | 97.5%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 0.03/1.91/1.06 | 0.06/1.88/1.06 |
  SM2    ω*_S2     16.134 | 32.631 | 11.991 | 7.9780 | 7.9808 |
         TSA       95.9%  | 55.4%  | 95.6%  | 96.6%  | 96.6%  | 97.2%
         C         1 | 1 | 1 | 1 | 1 |
         μ1/μ2/μ3  3/0/0  | 0/3/0  | 0/0/3  | 0.05/1.54/1.41 | 0.08/1.51/1.41 |
  SM2,C  ω*_S2     16.057 | 32.633 | 7.9880 | -- | 7.9808 |
         TSA       96.2%  | 55.4%  | 96.6%  | -- | 96.6%  | 97.2%
         C         0.8213 | 0.5000 | 0.3869 | -- | 0.8015 |
         μ1/μ2/μ3  2.78/0/0 | 0/2/0  | 0/0/1.42 | -- | 0.08/1.25/1.41 |

Table 4: The initial kernel matrices $\{K_i\}_{i=1}^{5}$ are Gaussian kernels with $\sigma = 0.01, 0.1, 1, 10, 100$ respectively. For $c$ we used $c = \sum_i \operatorname{trace}(K_i) + \operatorname{trace}(I_n)$. $\{\mu_{i,+}\}_{i=1}^{5}$ are the average weights of the optimal kernel matrix $\sum_i \mu_{i,+}^* K_i$ for a 2-norm soft margin SVM with $\mu_i \ge 0$ and tuning of $C$. The average $C$-value is given as well. The test set accuracies (TSA) of the optimal 2-norm soft margin SVM with tuning of $C$ (SM2,C) and the best cross-validation soft margin SVM with RBF kernel (best c/v RBF) are reported.

                  μ1,+ | μ2,+ | μ3,+ | μ4,+ | μ5,+ | C       | TSA SM2,C | TSA best c/v RBF
  Breast Cancer   0    | 0    | 3.24 | 0.94 | 0.82 | 3.6e+08 | 97.1%     | 96.8%
  Ionosphere      0.85 | 0.85 | 2.63 | 0.68 | 0    | 4.0e+06 | 94.5%     | 94.2%
  Heart           0    | 3.89 | 0.06 | 1.05 | 0    | 2.5e+05 | 84.1%     | 83.2%
  Sonar           0    | 3.93 | 1.07 | 0    | 0    | 3.2e+07 | 84.8%     | 84.2%
  2-norm          0.49 | 0.49 | 0    | 3.51 | 0    | 2.0386  | 96.5%     | 97.2%

In Cristianini et al. (2002) empirical results are given for optimization of the alignment using a kernel matrix $K = \sum_{i=1}^{N} \mu_i v_i v_i^T$. The results show that optimizing the alignment indeed improves the generalization power of Parzen window classifiers. As explained in Section 4.7, it turns out that in this particular case, the SDP in (53) reduces to exactly the quadratic program that is obtained in Cristianini et al. (2002), and thus those results also provide support for the general framework presented in the current paper.
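For reference, the empirical alignment referred to here is the normalized Frobenius inner product between the kernel matrix and the label matrix $y y^T$. The following one-liner is our own illustration of that quantity (following the definition in Cristianini et al., 2002), not code from the experiments:

    import numpy as np

    def alignment(K, y):
        # <K, y y^T>_F / (||K||_F * ||y y^T||_F); note ||y y^T||_F = n
        # for labels y in {-1, +1}^n, and <K, y y^T>_F = y^T K y.
        return float(y @ K @ y) / (np.linalg.norm(K, "fro") * len(y))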
6.2 Combining Heterogeneous Data

6.2.1 Reuters-21578 Data Set

To explore the value of this approach for combining data from heterogeneous sources, we run experiments on the Reuters-21578 data set, using two different kernels. The first kernel $K_1$ is derived as a linear kernel from the "bag-of-words" representation of the different documents, capturing information about the frequency of terms in the different documents (Salton and McGill, 1983). $K_1$ is centered and normalized. The second kernel $K_2$ is constructed by extracting 500 concepts from documents via probabilistic latent semantic analysis (Cai and Hofmann, 2003). This kernel can be viewed as arising from a document-concept-term graphical model, with the concepts as hidden nodes. After inferring the conditional probabilities of the concepts, given a document, a linear kernel is applied to the vector of these probabilistic "concept memberships" representing each document. $K_2$ is then also centered and normalized. The concept-based document information contained in $K_2$ is likely to be partly overlapping and partly complementary to the term-frequency information in $K_1$. Although the "bag-of-words" and graphical model representations are clearly heterogeneous, they can both be cast into a homogeneous framework of kernel matrices, allowing the information that they convey to be combined according to $K = \mu_1 K_1 + \mu_2 K_2$.
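Centering and normalizing a kernel matrix, as done for $K_1$ and $K_2$ above, can be written in a few lines. The sketch below is a generic illustration (the standard feature-space centering identity, not code from the paper; the bag-of-words matrix in the usage comment is hypothetical):

    import numpy as np

    def center(K):
        # Center the implicit feature vectors: K <- (I - J/n) K (I - J/n),
        # where J is the all-ones n x n matrix.
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def normalize(K):
        # Scale so every embedded point has unit norm: K_ij / sqrt(K_ii K_jj).
        d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        return K / np.outer(d, d)

    # e.g. K1 = normalize(center(X_bow @ X_bow.T)) for a bag-of-words matrix X_bow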
The Reuters-21578 data set consists of Reuters newswire stories from 1987 (www.davidlewis.com/resources/testcollections/reuters21578/). After a preprocessing stage that includes tokenization and stop word removal, 37926 word types remained. We used the modified Apte ("ModApte") split to split the collection into 12902 used and 8676 unused documents. The 12902 used documents consist of 9603 training documents and 3299 test documents. From the 9603 training documents, we randomly select a 1000-document subset as training set for a soft margin support vector machine with $C = 1$. We train the SVM for the binary classification tasks of distinguishing documents about a certain topic versus those not about that topic. We restrict our attention to the topics that appear in the most documents (cf. Cai and Hofmann (2003); Huang (2003); Eyheramendy et al. (2003)); in particular, we focused on the top five Reuters-21578 topics. After training the SVM on the randomly selected documents using either $K_1$ or $K_2$, the accuracy is tested on the 3299 test documents from the ModApte split. This is done 20 times, i.e., for 20 randomly chosen 1000-document training sets. The average accuracies and standard errors are reported in Figure 1.

After evaluating the performance of $K_1$ and $K_2$, the weights $\mu_1$ and $\mu_2$ are constrained to be non-negative and optimized (using only the training data) according to (38). The test set performance of the optimal combination is then evaluated and the average accuracy reported in Figure 1. The optimal weights, $\mu_1^*$ and $\mu_2^*$, do not vary greatly over the different topics, with averages of 1.37 for $\mu_1^*$ and 0.63 for $\mu_2^*$. We see that in four cases out of five the optimal combination of kernels performs better than either of the individual kernels. This suggests that these kernels indeed provide complementary information for the classification decision, and that the SDP approach is able to find a combination that exploits this complementarity.

[Figure 1: bar chart; horizontal axis lists the categories EARN, ACQ, MONEY-FX, GRAIN and CRUDE; vertical axis shows test set accuracy from 92 to 99%.]

Figure 1: Classification performance for the top five Reuters-21578 topics. The height of each bar is proportional to the average test set accuracy for a 1-norm soft margin SVM with $C = 1$. Black bars correspond to using only kernel matrix $K_1$; grey bars correspond to using only kernel matrix $K_2$, and white bars correspond to the optimal combination $\mu_1^* K_1 + \mu_2^* K_2$. The kernel matrices $K_1$ and $K_2$ are derived from different types of data, i.e., from the "bag-of-words" representation of documents and the concept-based graphical model representation (with 500 concepts) of documents respectively. For $c$ we used $c = \operatorname{trace}(K_1) + \operatorname{trace}(K_2) = 4000$. The standard errors across the 20 experiments are approximately 0.1 or smaller; indeed, all of the depicted differences between the optimal combination and the individual kernels are statistically significant except for EARN.

6.2.2 Protein Function Prediction

Here we illustrate the SDP approach for fusing heterogeneous genomic data in order to predict protein function in yeast; see Lanckriet et al. (2004) for more details. The task is to predict functional classifications associated with yeast proteins. We use as a gold standard the functional catalogue provided by the MIPS Comprehensive Yeast Genome Database (CYGD; mips.gsf.de/proj/yeast). The top-level categories in the functional hierarchy produce 13 classes, which contain 3588 proteins; the remaining yeast proteins have uncertain function and are therefore not used in evaluating the classifier. Because a given protein can belong to several functional classes, we cast the prediction problem as 13 binary classification tasks, one for each functional class. Using this setup, we follow the experimental paradigm of Deng et al. (2003). The primary input to the classification algorithm is a collection of kernel matrices representing different types of data:

1. Amino acid sequences: this kernel incorporates information about the domain structure of each protein, by looking at the presence or absence in the protein of Pfam domains (pfam.wustl.edu). The corresponding kernel is simply the inner product between binary vectors describing the presence or absence of each Pfam domain. Afterwards, we also construct a richer kernel by replacing the binary scoring with log E-values using the HMMER software toolkit (hmmer.wustl.edu). Moreover, an additional kernel matrix is constructed by applying the Smith-Waterman (SW) pairwise sequence comparison algorithm (Smith and Waterman, 1981) to the yeast protein sequences and applying the empirical kernel map (Tsuda, 1999).

2. Protein-protein interactions: this type of data can be represented as a graph, with proteins as nodes and interactions as edges. Such an interaction graph allows similarities among proteins to be established through the construction of a corresponding diffusion kernel (Kondor and Lafferty, 2002); a small construction sketch is given at the end of this subsection.

3. Genetic interactions: in a similar way, these interactions give rise to a diffusion kernel.

4. Protein complex data: co-participation in a protein complex can be seen as a weak sort of interaction, giving rise to a third diffusion kernel.

5. Expression data: two genes with similar expression profiles are likely to have similar functions; accordingly, Deng et al. (2003) convert the expression matrix to a square binary interaction matrix in which a 1 indicates that the corresponding pair of expression profiles exhibits a Pearson correlation greater than 0.8. This can be used to define a diffusion kernel. Also, a richer Gaussian kernel is defined directly on the expression profiles.

In order to compare the SDP/SVM approach to the Markov random field (MRF) method of Deng et al. (2003), Lanckriet et al. (2004) perform two variants of the experiment: one in which the five kernels are restricted to contain precisely the same binary information as used by the MRF method, and a second experiment in which the richer Pfam and expression kernels are used and the SW kernel is added. They show that a combined SVM classifier trained with the SDP approach performs better than an SVM trained on any single type of data. Moreover, it outperforms the MRF method designed for this data set. To illustrate the latter, Figure 2 presents the average ROC scores on the test set when performing five-fold cross-validation three times. The figure shows that, for each of the 13 classifications, the ROC score of the SDP/SVM method is better than that of the MRF method. Overall, the mean ROC improves from 0.715 to 0.854. The improvement of the SDP/SVM method over the MRF method is consistent and statistically significant across all 13 classes. An additional improvement, though not as large and only statistically significant for nine of the 13 classes, is gained by using richer kernels and adding the SW kernel.

[Figure 2: bar chart; horizontal axis lists the 13 function classes; vertical axis shows ROC scores from 0.5 to 1.]

Figure 2: Classification performance for the 13 functional protein classes. The height of each bar is proportional to the ROC score. The standard error across the 13 experiments is usually 0.01 or smaller, so most of the depicted differences are statistically significant: between black and grey bars, all depicted differences are statistically significant, while nine of the 13 differences between grey and white bars are statistically significant. Black bars correspond to the MRF method of Deng et al.; grey bars correspond to the SDP/SVM method using five kernels computed on binary data, and white bars correspond to the SDP/SVM using the enriched Pfam kernel and replacing the expression kernel with the SW kernel. See Lanckriet et al. (2004) for more details.
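The diffusion kernels mentioned in items 2 through 5 can be constructed from an interaction graph roughly as follows. This sketch follows the definition of Kondor and Lafferty (2002) as we understand it and is not code from Lanckriet et al. (2004); in particular the diffusion parameter beta and the use of the unnormalized graph Laplacian are assumptions made here for illustration.

    import numpy as np
    from scipy.linalg import expm

    def diffusion_kernel(A, beta=1.0):
        # A: symmetric 0/1 adjacency matrix of the interaction graph
        # (proteins as nodes, interactions as edges).
        # L = D - A is the graph Laplacian; K = exp(-beta * L) is symmetric
        # positive semidefinite and therefore a valid kernel matrix.
        D = np.diag(A.sum(axis=1))
        return expm(-beta * (D - A))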
7. Discussion

In this paper we have presented a new method for learning a kernel matrix from data. Our approach makes use of semidefinite programming (SDP) ideas. It is motivated by the fact that every symmetric, positive semidefinite matrix can be viewed as a kernel matrix (corresponding to a certain embedding of a finite set of data), and the fact that SDP deals with the optimization of convex cost functions over the convex cone of positive semidefinite matrices (or convex subsets of this cone). Thus convex optimization and machine learning concerns merge to provide a powerful methodology for learning the kernel matrix with SDP.

We have focused on the transductive setting, where the labeled data are used to learn an embedding, which is then applied to the unlabeled part of the data. Based on a new generalization bound for transduction, we have shown how to impose convex constraints that effectively control the capacity of the search space of possible kernels and yield an efficient learning procedure that can be implemented by SDP. Furthermore, this approach leads to a convex method to learn the 2-norm soft margin parameter in support vector machines, solving an important open problem. Promising empirical results are reported on standard benchmark data sets; these results show that the new approach provides a principled way to combine multiple kernels to yield a classifier that is comparable with the best individual classifier, and can perform better than any individual kernel. Performance is also comparable with a classifier in which the kernel hyperparameter is tuned with cross-validation; our approach achieves the effect of this tuning without cross-validation.

We have also shown how optimizing a linear combination of kernel matrices provides a novel method for fusing heterogeneous data sources. In this case, the empirical results show a significant improvement of the classification performance for the optimal combination of kernels when compared to individual kernels.

There are several challenges that need to be met in future research on SDP-based learning algorithms. First, it is clearly of interest to explore other convex quality measures for a kernel matrix, which may be appropriate for other learning algorithms. For example, in the setting of Gaussian processes, the relative entropy between the zero-mean Gaussian process prior $P$ with covariance kernel $K$ and the corresponding Gaussian process approximation $Q$ to the true intractable posterior process depends on $K$ as
\[
D[P \,\|\, Q] = \frac{1}{2} \log \det K + \frac{1}{2} \operatorname{trace}\!\left(y^T K y\right) + d,
\]
where the constant $d$ is independent of $K$. One can verify that $D[P \| Q]$ is convex with respect to $R = K^{-1}$ (see, e.g., Vandenberghe et al., 1998). Minimizing this measure with respect to $R$, and thus $K$, is motivated from PAC-Bayesian generalization error bounds for Gaussian processes (see, e.g., Seeger, 2002) and can be achieved by solving a so-called maximum-determinant problem (Vandenberghe et al., 1998), an even more general framework that contains semidefinite programming as a special case.

Second, the investigation of other parameterizations of the kernel matrix is an important topic for further study. While the linear combination of kernels that we have studied here is likely to be useful in many practical problems (capturing a notion of combining Gram matrix "experts"), it is also worth considering other parameterizations. Any such parameterizations have to respect the constraint that the quality measure for the kernel matrix is convex with respect to the parameters of the proposed parameterization. One class of examples arises via the positive definite matrix completion problem (Vandenberghe et al., 1998). Here we are given a symmetric kernel matrix $K$ that has some entries which are fixed. The remaining entries, the parameters in this case, are to be chosen such that the resulting matrix is positive definite, while simultaneously a certain cost function is optimized, e.g., $\operatorname{trace}(SK) + \log \det K^{-1}$, where $S$ is a given matrix. This specific case reduces to solving a maximum-determinant problem which is convex in the unknown entries of $K$, the parameters of the proposed parameterization.
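A minimal sketch of this completion problem, written with a generic convex modeling tool, may make the parameterization concrete. It is our own illustration (CVXPY is not used in the paper, and the matrix size, the fixed entries and the cost matrix $S$ below are arbitrary):

    import numpy as np
    import cvxpy as cp

    n = 5
    S = np.eye(n)                                      # given cost matrix (arbitrary)
    fixed = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.3}    # entries of K that are known

    K = cp.Variable((n, n), PSD=True)                  # symmetric, positive semidefinite
    constraints = [K[i, j] == v for (i, j), v in fixed.items()]
    # trace(S K) + log det K^{-1} = trace(S K) - log det K, convex in K.
    problem = cp.Problem(cp.Minimize(cp.trace(S @ K) - cp.log_det(K)), constraints)
    problem.solve()
    print(np.round(K.value, 3))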
A third important area for future research consists in finding faster implementations of semidefinite programming. As in the case of quadratic programming (Platt, 1999), it seems likely that special purpose methods can be developed to exploit the exchangeable nature of the learning problem in classification and result in more efficient algorithms.

Finally, by providing a general approach for combining heterogeneous data sources in the setting of kernel-based statistical learning algorithms, this line of research suggests an important role for kernel matrices as general building blocks of statistical models. Much as in the case of finite-dimensional sufficient statistics, kernel matrices generally involve a significant reduction of the data and represent the only aspects of the data that are used by subsequent algorithms. Moreover, given the panoply of methods that are available to accommodate not only the vectorial and matrix data that are familiar in classical statistical analysis, but also more exotic data types such as strings, trees and graphs, kernel matrices have an appealing universality. It is natural to envision libraries of kernel matrices in fields such as bioinformatics, computational vision, and information retrieval, in which multiple data sources abound. Such libraries would summarize the statistically-relevant features of primary data, and encapsulate domain specific knowledge. Tools such as the semidefinite programming methods that we have presented here can be used to bring these multiple data sources together in novel ways to make predictions and decisions.

Acknowledgements

We acknowledge support from ONR MURI N00014-00-1-0637 and NSF grant IIS-9988642. Sincere thanks to Tijl De Bie for helpful conversations and suggestions, as well as to Lijuan Cai and Thomas Hofmann for providing the data for the Reuters-21578 experiments.

Appendix A. Proof of Result (54)

For the case $K_i = v_i v_i^T$, with $v_i$ orthonormal, the original learning problem (52) becomes
\[
\max_K \;\; \langle K_{tr}, y y^T \rangle_F
\quad \text{subject to} \quad
\langle K, K \rangle_F \le 1, \;\; K \succeq 0, \;\; K = \sum_{i=1}^{m} \mu_i v_i v_i^T. \tag{61}
\]
Expanding this further gives
\[
\langle K_{tr}, y y^T \rangle_F
= \operatorname{trace}\!\big(K(1{:}n_{tr}, 1{:}n_{tr})\, y y^T\big)
= \operatorname{trace}\!\Big(\Big(\sum_{i=1}^{m} \mu_i v_i(1{:}n_{tr})\, v_i(1{:}n_{tr})^T\Big) y y^T\Big)
= \sum_{i=1}^{m} \mu_i \operatorname{trace}(\bar{v}_i \bar{v}_i^T y y^T)
= \sum_{i=1}^{m} \mu_i (\bar{v}_i^T y)^2, \tag{62}
\]
\[
\langle K, K \rangle_F
= \operatorname{trace}(K^T K) = \operatorname{trace}(K K)
= \operatorname{trace}\!\Big(\Big(\sum_{i=1}^{m} \mu_i v_i v_i^T\Big)\Big(\sum_{j=1}^{m} \mu_j v_j v_j^T\Big)\Big)
= \operatorname{trace}\!\Big(\sum_{i,j=1}^{m} \mu_i \mu_j v_i v_i^T v_j v_j^T\Big)
= \operatorname{trace}\!\Big(\sum_{i=1}^{m} \mu_i^2 v_i v_i^T\Big)
= \sum_{i=1}^{m} \mu_i^2 \operatorname{trace}(v_i v_i^T)
= \sum_{i=1}^{m} \mu_i^2 \operatorname{trace}(v_i^T v_i)
= \sum_{i=1}^{m} \mu_i^2, \tag{63}
\]
with $\bar{v}_i = v_i(1{:}n_{tr})$. We used the fact that $\operatorname{trace}(ABC) = \operatorname{trace}(BCA)$ (if the products are well-defined) and that the vectors $v_i$, $i = 1, \ldots, n$, are orthonormal: $v_i^T v_j = \delta_{ij}$. Furthermore, because the $v_i$ are orthogonal, the $\mu_i$ in $K = \sum_{i=1}^{m} \mu_i v_i v_i^T$ are the eigenvalues of $K$. This implies
\[
K \succeq 0 \;\Leftrightarrow\; \mu \ge 0 \;\Leftrightarrow\; \mu_i \ge 0, \quad i = 1, \ldots, m. \tag{64}
\]
Using (62), (63) and (64) in (61), we obtain the following optimization problem:
\[
\max_{\mu_i} \;\; \sum_{i=1}^{m} \mu_i (\bar{v}_i^T y)^2
\quad \text{subject to} \quad
\sum_{i=1}^{m} \mu_i^2 \le 1, \;\; \mu_i \ge 0, \; i = 1, \ldots, m,
\]
which yields the result (54).
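The final problem above maximizes a linear function with non-negative coefficients $c_i = (\bar{v}_i^T y)^2$ over the intersection of the unit Euclidean ball with the non-negative orthant, so its maximizer is $\mu = c/\|c\|_2$ with optimal value $\|c\|_2$. The following check is our own illustration of that reduction (the data are random and the closed form is ours; result (54) itself is stated earlier in the paper and not repeated here):

    import numpy as np

    rng = np.random.default_rng(1)
    m, n_tr = 4, 10
    V, _ = np.linalg.qr(rng.standard_normal((2 * n_tr, m)))   # orthonormal v_i (columns)
    y = rng.choice([-1.0, 1.0], size=n_tr)

    c = (V[:n_tr].T @ y) ** 2            # c_i = (vbar_i^T y)^2 >= 0
    mu = c / np.linalg.norm(c)           # feasible: ||mu||_2 = 1, mu >= 0
    assert abs(c @ mu - np.linalg.norm(c)) < 1e-12   # attains the optimal value ||c||_2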
Appendix B. Proof of Theorem 24

For a function $g: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, define
\[
\hat{E}_1 g(X, Y) = \frac{1}{n} \sum_{i=1}^{n} g(x_i, y_i), \qquad
\hat{E}_2 g(X, Y) = \frac{1}{n} \sum_{i=1}^{n} g(x_{n+i}, y_{n+i}).
\]
Define a margin cost function $\phi: \mathbb{R} \to \mathbb{R}^+$ as
\[
\phi(a) = \begin{cases} 1 & \text{if } a \le 0, \\ 1 - a & \text{if } 0 < a \le 1, \\ 0 & \text{if } a > 1. \end{cases}
\]
Notice that in the 1-norm soft margin cost function, the slack variable $\xi_i$ is a convex upper bound on $\phi(y_i f(x_i))$ for the kernel classifier $f$, that is,
\[
\max\{1 - a, 0\} \ge \phi(a) \ge \mathbf{1}[a \le 0],
\]
where the last expression is the indicator function of $a \le 0$.

The proof of the first part is due to Koltchinskii and Panchenko (2002), and involves the following five steps.

Step 1. For any class $F$ of real functions defined on $\mathcal{X}$,
\[
\sup_{f \in F} \left( \operatorname{er}(f) - \hat{E}_1 \phi(Y f(X)) \right) \le \sup_{f \in F} \left( \hat{E}_2 \phi(Y f(X)) - \hat{E}_1 \phi(Y f(X)) \right).
\]
To see this, notice that $\operatorname{er}(f)$ is the average over the test set of the indicator function of $Y f(X) \le 0$, and that $\phi(Y f(X))$ bounds this function.

Step 2. For any class $G$ of $[0,1]$-valued functions,
\[
\Pr\left( \sup_{g \in G} \left( \hat{E}_2 g - \hat{E}_1 g \right) \ge \mathbb{E}\left( \sup_{g \in G} \left( \hat{E}_2 g - \hat{E}_1 g \right) \right) + \epsilon \right) \le \exp\left( -\frac{\epsilon^2 n}{4} \right),
\]
where the expectation is over the random permutation. This follows from McDiarmid's inequality. To see this, we need to define the random permutation $\pi$ using a set of $2n$ independent random variables. To this end, choose $\pi_1, \ldots, \pi_{2n}$ uniformly at random from the interval $[0, 1]$. These are almost surely distinct. For $j = 1, \ldots, 2n$, define $\pi(j) = |\{i : \pi_i \le \pi_j\}|$, that is, $\pi(j)$ is the position of $\pi_j$ when the random variables are ordered by size. It is easy to see that, for any $g$, $\hat{E}_2 g - \hat{E}_1 g$ changes by no more than $2/n$ when one of the $\pi_i$ changes. McDiarmid's bounded difference inequality (McDiarmid, 1989) implies the result.

Step 3. For any class $G$ of $[0,1]$-valued functions,
\[
\mathbb{E}\left( \sup_{g \in G} \left( \hat{E}_2 g - \hat{E}_1 g \right) \right) \le \hat{R}_{2n}(G) + \frac{4}{\sqrt{n}},
\]
where $\hat{R}_{2n}(G) = \mathbb{E} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^{2n} \sigma_i g(X_i, Y_i)$, and the expectation is over the independent, uniform, $\{\pm 1\}$-valued random variables $\sigma_1, \ldots, \sigma_{2n}$. This result is essentially Lemma 3 of Bartlett and Mendelson (2002); that lemma contained a similar bound for i.i.d. data, but the same argument holds for fixed data, randomly permuted.

Step 4. If the class $F$ of real-valued functions defined on $\mathcal{X}$ is closed under negations,
\[
\hat{R}_{2n}(\phi \circ F) \le \hat{R}_{2n}(F),
\]
where each $f \in F$ defines a $g \in \phi \circ F$ by $g(x, y) = \phi(y f(x))$. This bound is the contraction lemma of Ledoux and Talagrand (1991).

Step 5. For the class $\mathcal{F}_{\mathcal{K}}$ of kernel expansions, notice (as in the proof of Lemma 26 of Bartlett and Mendelson (2002)) that
\[
\hat{R}_{2n}(\mathcal{F}_{\mathcal{K}})
= \frac{1}{n} \mathbb{E} \max_{f \in \mathcal{F}_{\mathcal{K}}} \sum_{i=1}^{2n} \sigma_i f(X_i)
= \frac{1}{n} \mathbb{E} \max_{K \in \mathcal{K}} \max_{\|w\| \le 1/\gamma} \left\langle w, \sum_{i=1}^{2n} \sigma_i \Phi(X_i) \right\rangle
= \frac{1}{n \gamma} \mathbb{E} \max_{K \in \mathcal{K}} \left\| \sum_{i=1}^{2n} \sigma_i \Phi(X_i) \right\|
\le \frac{1}{n \gamma} \sqrt{\mathbb{E} \max_{K \in \mathcal{K}} \sigma^T K \sigma}
= \frac{1}{n \gamma} \sqrt{C(\mathcal{K})},
\]
where $\sigma = (\sigma_1, \ldots, \sigma_{2n})$ is the vector of Rademacher random variables. Combining gives the first part of the theorem.

For the second part, consider
\[
C(\mathcal{K}_c) = \mathbb{E} \max_{K \in \mathcal{K}_c} \sigma^T K \sigma = \mathbb{E} \max_{\mu} \sum_{j=1}^{m} \mu_j \sigma^T K_j \sigma,
\]
where the max is over $\mu = (\mu_1, \ldots, \mu_m)$ for which the matrix $K = \sum_{j=1}^{m} \mu_j K_j$ satisfies the conditions $K \succeq 0$ and $\operatorname{trace}(K) \le c$. Now,
\[
\operatorname{trace}(K) = \sum_{j=1}^{m} \mu_j \operatorname{trace}(K_j),
\]
and each trace in the sum is positive, so the supremum must be achieved for $\operatorname{trace}(K) = c$. So we can write
\[
C(\mathcal{K}_c) = c \, \mathbb{E} \max_{K \in \mathcal{K}_c} \frac{\sigma^T K \sigma}{\operatorname{trace}(K)}.
\]
Notice that $\sigma^T K \sigma$ is no more than $\lambda \|\sigma\|^2 = n \lambda$, where $\lambda$ is the maximum eigenvalue of $K$. Using $\lambda \le \operatorname{trace}(K) = c$ shows that $C(\mathcal{K}_c) \le cn$.

Finally, for $\mathcal{K}_c^+$ we have
\[
C(\mathcal{K}_c^+) = \mathbb{E} \max_{K \in \mathcal{K}_c^+} \sigma^T K \sigma
= \mathbb{E} \max_{\mu_j} \sum_{j=1}^{m} \mu_j \sigma^T K_j \sigma
= \mathbb{E} \max_j \frac{c}{\operatorname{trace}(K_j)} \sigma^T K_j \sigma.
\]
Since each term in the maximum is non-negative, we can replace it with a sum to show that
\[
C(\mathcal{K}_c^+) \le c \, \mathbb{E} \, \sigma^T \Big( \sum_j \frac{K_j}{\operatorname{trace}(K_j)} \Big) \sigma = cm.
\]
Alternatively, we can write $\sigma^T K_j \sigma \le \lambda_j \|\sigma\|^2 = \lambda_j n$, where $\lambda_j$ is the maximum eigenvalue of $K_j$. This shows that
\[
C(\mathcal{K}_c^+) \le c n \max_j \frac{\lambda_j}{\operatorname{trace}(K_j)}.
\]
References

Andersen, E. D. and Andersen, A. D. (2000). The MOSEK interior point optimizer for linear programming: An implementation of the homogeneous algorithm. In Frenk, H., Roos, C., Terlaky, T., and Zhang, S., editors, High Performance Optimization, pages 197-232. Kluwer Academic Publishers.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536.

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482.

Bennett, K. P. and Bredensteiner, E. J. (2000). Duality and geometry in SVM classifiers. In Proceedings of the 17th International Conference on Machine Learning, pages 57-64. Morgan Kaufmann.

Boyd, S. and Vandenberghe, L. (2003). Convex optimization. Course notes for EE364, Stanford University. Available at http://www.stanford.edu/class/ee364.

Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26(3):801-849.

Cai, L. and Hofmann, T. (2003). Text categorization by boosting automatically extracted concepts. In Proceedings of the 26th ACM-SIGIR International Conference on Research and Development in Information Retrieval. ACM Press.

Cristianini, N., Kandola, J., Elisseeff, A., and Shawe-Taylor, J. (2001). On kernel target alignment. Technical Report NeuroColt 2001-099, Royal Holloway University London.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. (2002). On kernel-target alignment. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14. MIT Press.

De Bie, T., Lanckriet, G., and Cristianini, N. (2003). Convex tuning of the soft margin parameter. Technical Report CSD-03-1289, University of California, Berkeley.

Deng, M., Chen, T., and Sun, F. (2003). An integrated probabilistic model for functional prediction of proteins. In RECOMB, pages 95-103.

Eyheramendy, S., Genkin, A., Ju, W., Lewis, D. D., and Madigan, D. (2003). Sparse Bayesian classifiers for text categorization. Technical report, Department of Statistics, Rutgers University.

Huang, Y. (2003). Support vector machines for text categorization based on latent semantic indexing. Technical report, Electrical and Computer Engineering Department, The Johns Hopkins University.

Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30.

Kondor, R. I. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete input spaces. In Sammut, C. and Hoffmann, A., editors, Proceedings of the International Conference on Machine Learning. Morgan Kaufmann.

Lanckriet, G. R. G., Deng, M., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004). Kernel-based data fusion and its application to protein function prediction in yeast. In Pacific Symposium on Biocomputing.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148-188. Cambridge University Press.

Nesterov, Y. and Nemirovsky, A. (1994). Interior Point Polynomial Methods in Convex Programming: Theory and Applications. SIAM.

Platt, J. (1999). Using sparseness and analytic QP to speed training of support vector machines. In Kearns, M. S., Solla, S. A., and Cohn, D. A., editors, Advances in Neural Information Processing Systems 11. MIT Press.

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.

Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press.

Seeger, M. (2002). PAC-Bayesian generalization error bounds for Gaussian process classification. Technical Report EDI-INF-RR-0094, University of Edinburgh, Division of Informatics.

Shawe-Taylor, J. and Cristianini, N. (1999). Soft margin and margin distribution. In Smola, A., Schölkopf, B., Bartlett, P., and Schuurmans, D., editors, Advances in Large Margin Classifiers. MIT Press.

Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12:625-653. Special issue on Interior Point Methods (CD supplement with software).

Tsuda, K. (1999). Support vector classification with asymmetric kernel function. In Verleysen, M., editor, Proceedings of the European Symposium on Artificial Neural Networks, pages 183-188.

Vandenberghe, L. and Boyd, S. (1996). Semidefinite programming. SIAM Review, 38(1):49-95.

Vandenberghe, L., Boyd, S., and Wu, S.-P. (1998). Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19(2):499-533.