Journal of Machine Learning Research 5 (2004) 819-844. Submitted 1/04; Published 7/04.

Probability Product Kernels

Tony Jebara          JEBARA@CS.COLUMBIA.EDU
Risi Kondor          RISI@CS.COLUMBIA.EDU
Andrew Howard        AHOWARD@CS.COLUMBIA.EDU
Computer Science Department, Columbia University, New York, NY 10027, USA

Editors: Kristin Bennett and Nicolo Cesa-Bianchi

(c) 2004 Tony Jebara, Risi Kondor and Andrew Howard.

Abstract

The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions. In the probability product kernel, data points in the input space are mapped to distributions over the sample space and a general inner product is then evaluated as the integral of the product of pairs of distributions. The kernel is straightforward to evaluate for all exponential family models such as multinomials and Gaussians and yields interesting nonlinear kernels. Furthermore, the kernel is computable in closed form for latent distributions such as mixture models, hidden Markov models and linear dynamical systems. For intractable models, such as switching linear dynamical systems, structured mean-field approximations can be brought to bear on the kernel evaluation. For general distributions, even if an analytic expression for the kernel is not feasible, we show a straightforward sampling method to evaluate it. Thus, the kernel permits discriminative learning methods, including support vector machines, to exploit the properties, metrics and invariances of the generative models we infer from each datum. Experiments are shown using multinomial models for text, hidden Markov models for biological data sets and linear dynamical systems for time series data.

Keywords: kernels, support vector machines, generative models, Hellinger divergence, Kullback-Leibler divergence, Bhattacharyya affinity, expected likelihood, exponential family, graphical models, latent models, hidden Markov models, dynamical systems, mean field

1. Introduction

Recent developments in machine learning, including the emergence of support vector machines, have rekindled interest in kernel methods (Vapnik, 1998; Hastie et al., 2001). Typically, kernels are designed and used by the practitioner to introduce useful nonlinearities, embed prior knowledge and handle unusual input spaces in a learning algorithm (Schölkopf and Smola, 2001). In this article, we consider kernels between distributions. Typically, kernels compute a generalized inner product between two input objects $x$ and $x'$, which is equivalent to applying a mapping function $\Phi$ to each object and then computing a dot product between $\Phi(x)$ and $\Phi(x')$ in a Hilbert space. We will instead consider the case where the mapping $\Phi(x)$ is a probability distribution $p(x|x)$, thus restricting the Hilbert space since the space of distributions is trivially embedded in the Hilbert space of functions. However, this restriction to positive normalized mappings is in fact not too limiting since general $\Phi(x)$ mappings and the nonlinear decision boundaries they generate can often be mimicked by a probabilistic mapping. However, a probabilistic mapping facilitates and elucidates
kernel design in general. It permits kernel design to leverage the large body of tools in statistical and generative modeling, including Bayesian networks (Pearl, 1997). For instance, maximum likelihood or Bayesian estimates may be used to set certain aspects of the mapping to Hilbert space. In this paper, we consider the mapping from datum $x$ to a probability distribution $p(x|x)$, which has a straightforward inner product kernel $k(x,x') = \int p^\rho(x|x)\, p^\rho(x|x')\, dx$ (for a scalar $\rho$ which we typically set to $1$ or $1/2$). This kernel is merely an inner product between two distributions and, if these distributions are known directly, it corresponds to a linear classifier in the space of probability distributions. If the distributions themselves are not available and only observed data is given, we propose using either Bayesian or frequentist estimators to map a single input datum (or multiple inputs) into probability distributions and subsequently compute the kernel. In other words, we map points as $x \to p(x|x)$ and $x' \to p(x|x')$. For the setting $\rho = 1/2$ the kernel is none other than the classical Bhattacharyya similarity measure (Kondor and Jebara, 2003), the affinity corresponding to Hellinger divergence (Jebara and Kondor, 2003), which has interesting connections to Kullback-Leibler divergence (Topsoe, 1999).

One goal of this article is to explore a contact point between discriminative learning (support vector machines and kernels) and generative learning (distributions and graphical models). Discriminative learning directly optimizes performance for a given classification or regression task. Meanwhile, generative learning provides a rich palette of tools for exploring models, accommodating unusual input spaces and handling priors. One approach to marrying the two is to estimate parameters in generative models with discriminative learning algorithms and optimize performance on a particular task. Examples of such approaches include conditional learning (Bengio and Frasconi, 1996) or large margin generative modeling (Jaakkola et al., 1999). Another approach is to use kernels to integrate the generative models within a discriminative learning paradigm. The proposed probability product kernel falls into this latter category. Previous efforts to build kernels that accommodate probability distributions include the Fisher kernel (Jaakkola and Haussler, 1998), the heat kernel (Lafferty and Lebanon, 2002) and kernels arising from exponentiating Kullback-Leibler divergences (Moreno et al., 2004). We discuss, compare and contrast these approaches to the probability product kernel in Section 7. One compelling feature of the new kernel is that it is straightforward and efficient to compute over a wide range of distributions and generative models while still producing interesting nonlinear behavior in, for example, support vector machine (SVM) classifiers. The generative models we consider include Gaussians, multinomials, the exponential family, mixture models, hidden Markov models, a wide range of graphical models and (via sampling and mean-field methods) intractable graphical models. The flexibility in choosing a distribution from the wide range of generative models in the field permits the kernel to readily accommodate many interesting input spaces, including sequences, counts, graphs, fields and so forth. It also inherits the properties, priors and invariances that may be straightforward to design within a generative modeling paradigm and propagates them into the kernel learning machine. This article thus gives a generative modeling route to kernel design to complement the family of other kernel engineering methods, including super-kernels (Ong et al., 2002), convolutional kernels (Haussler, 1999; Collins and Duffy, 2002), alignment kernels (Watkins, 2000), rational kernels (Cortes et al., 2002) and string kernels (Leslie et al., 2002; Vishawanathan and Smola, 2002).

This paper is organized as follows. In Section 2, we introduce the probability product kernel and point out connections to other probabilistic kernels such as Fisher kernels and exponentiated Kullback-Leibler divergences. Estimation methods for mapping points probabilistically to Hilbert
space are outlined. In Section 3 we elaborate the kernel for the exponential family and show its specific form for classical distributions such as the Gaussian and multinomial. Interestingly, the kernel can be computed for mixture models and graphical models in general, and this is elaborated in Section 4. Intractable graphical models are then handled via structured mean-field approximations in Section 6. Alternatively, we describe sampling methods for approximately computing the kernel. Section 7 discusses the differences and similarities between our probability product kernels and previously presented probabilistic kernels. We then show experiments with text data sets and multinomial kernels, with biological sequence data sets using hidden Markov model kernels, and with continuous time series data sets using linear dynamical system kernels. The article then concludes with a brief discussion.

2. A Kernel Between Distributions

Given a positive (semi-)definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on the input space $\mathcal{X}$ and examples $x_1, x_2, \ldots, x_m \in \mathcal{X}$ with corresponding labels $y_1, y_2, \ldots, y_m$, kernel based learning algorithms return a hypothesis of the form $h(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b$. The role of the kernel is to capture our prior assumptions of similarity in $\mathcal{X}$. When $k(x, x')$ is large, we expect that the corresponding labels be similar or the same.

Classically, the inputs $\{x_i\}_{i=1}^{m}$ are often represented as vectors in $\mathbb{R}^n$, and $k$ is chosen to be one of a small number of popular positive definite functions on $\mathbb{R}^n$, such as the Gaussian RBF

$$k(x, x') = e^{-\|x - x'\|^2 / (2\sigma^2)}. \qquad (1)$$

Once we have found an appropriate kernel, the problem becomes accessible to a whole array of non-parametric discriminative kernel based learning algorithms, such as support vector machines, Gaussian processes, etc. (Schölkopf and Smola, 2002).

In contrast, generative models fit probability distributions $p(x)$ to $x_1, x_2, \ldots, x_m$ and base their predictions on the likelihood under different models. When faced with a discriminative task, approaching it from the generative angle is generally suboptimal. On the other hand, it is often easier to capture the structure of complex objects with generative models than directly with kernels. Specific examples include multinomial models for text documents, hidden Markov models for speech, Markov random fields for images and linear dynamical systems for motion. In the current paper we aim to combine the advantages of both worlds by

1. fitting separate probabilistic models $p_1(x), p_2(x), \ldots, p_m(x)$ to $x_1, x_2, \ldots, x_m$;
2. defining a novel kernel $k_{prob}(p, p')$ between probability distributions on $\mathcal{X}$;
3. finally, defining the kernel between examples to equal $k_{prob}$ between the corresponding distributions: $k(x, x') = k_{prob}(p, p')$.

We then plug this kernel into any established kernel based learning algorithm and proceed as usual. We define $k_{prob}$ in a rather general form and then investigate special cases.

Definition 1. Let $p$ and $p'$ be probability distributions on a space $\mathcal{X}$ and $\rho$ be a positive constant. Assume that $p^\rho, p'^\rho \in L_2(\mathcal{X})$, i.e. that $\int_{\mathcal{X}} p(x)^{2\rho}\, dx$ and $\int_{\mathcal{X}} p'(x)^{2\rho}\, dx$ are well defined (not infinity). The probability product kernel between distributions $p$ and $p'$ is defined as

$$k_{prob}(p, p') = \int_{\mathcal{X}} p(x)^\rho\, p'(x)^\rho\, dx = \left\langle p^\rho, p'^\rho \right\rangle_{L_2}. \qquad (2)$$
It is well known that $L_2(\mathcal{X})$ is a Hilbert space, hence for any set $\mathcal{P}$ of probability distributions over $\mathcal{X}$ such that $\int_{\mathcal{X}} p(x)^{2\rho}\, dx$ is finite for any $p \in \mathcal{P}$, the kernel defined by (2) is positive definite. The probability product kernel is also a simple and intuitively compelling notion of similarity between distributions.

2.1 Special Cases and Relationship to Statistical Affinities

For $\rho = 1/2$,

$$k(p, p') = \int \sqrt{p(x)}\, \sqrt{p'(x)}\, dx,$$

which we shall call the Bhattacharyya kernel, because in the statistics literature it is known as Bhattacharyya's affinity between distributions (Bhattacharyya, 1943), related to the better-known Hellinger's distance

$$H(p, p') = \left( \int \left( \sqrt{p(x)} - \sqrt{p'(x)} \right)^2 dx \right)^{1/2}$$

by $H(p, p') = \sqrt{2 - 2\,k(p, p')}$. Hellinger's distance can be seen as a symmetric approximation to the Kullback-Leibler (KL) divergence, and in fact is a bound on KL, as shown in (Topsoe, 1999), where relationships between several common divergences are discussed. The Bhattacharyya kernel was first introduced in (Kondor and Jebara, 2003) and has the important special property $k(x, x) = 1$.

When $\rho = 1$, the kernel takes the form of the expectation of one distribution under the other:

$$k(x, x') = \int p(x)\, p'(x)\, dx = E_p[p'(x)] = E_{p'}[p(x)]. \qquad (3)$$

We call this the expected likelihood kernel.

It is worth noting that when dealing with distributions over discrete spaces $\mathcal{X} = \{x_1, x_2, \ldots\}$, probability product kernels have a simple geometrical interpretation. In this case $p$ can be represented as a vector $p = (p_1, p_2, \ldots)$ where $p_i = \Pr(X = x_i)$. The probability product kernel is then the dot product between $\tilde{p} = (p_1^\rho, p_2^\rho, \ldots)$ and $\tilde{p}' = (p_1'^\rho, p_2'^\rho, \ldots)$. In particular, in the case of the Bhattacharyya kernel, $\tilde{p}$ and $\tilde{p}'$ are vectors on the unit sphere.

This type of similarity measure is not unknown in the literature. In text recognition, for example, the so-called "cosine-similarity" (i.e. the dot product measure) is well entrenched, and it has been observed that preprocessing word counts by taking square roots often improves performance (Goldzmidt and Sahami, 1998; Cutting et al., 1992). Lafferty and Lebanon also arrive at a similar kernel, but from a very different angle, looking at the diffusion kernel on the statistical manifold of multinomial distributions (Lafferty and Lebanon, 2002).
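In the discrete case the kernel is thus literally a dot product of element-wise powered probability vectors. As an illustration only (a minimal sketch of Definition 1 in Python, not code from the paper; names are ours):

    import numpy as np

    def probability_product_kernel(p, p_prime, rho=0.5):
        """Probability product kernel between two discrete distributions.

        p, p_prime: 1-D arrays of probabilities over the same sample space.
        rho = 0.5 gives the Bhattacharyya kernel; rho = 1 gives the
        expected likelihood kernel.
        """
        p = np.asarray(p, dtype=float)
        p_prime = np.asarray(p_prime, dtype=float)
        return float(np.dot(p ** rho, p_prime ** rho))

    # For rho = 1/2 the powered vectors lie on the unit sphere, so the
    # Bhattacharyya kernel of a distribution with itself is 1.
    p = np.array([0.2, 0.3, 0.5])
    q = np.array([0.3, 0.3, 0.4])
    assert abs(probability_product_kernel(p, p) - 1.0) < 1e-12
    print(probability_product_kernel(p, q, rho=0.5))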
2.2 Frequentist and Bayesian Methods of Estimation

Various statistical estimation methods can be used to fit the distributions $p_1, p_2, \ldots, p_m$ to the examples $x_1, x_2, \ldots, x_m$. Perhaps it is most immediate to consider a parametric family $\{p_\theta(x)\}$, take the maximum likelihood estimators $\hat{\theta}_i = \arg\max_\theta p_\theta(x_i)$, and set $p_i(x) = p_{\hat{\theta}_i}(x)$.

The alternative, Bayesian, strategy is to postulate a prior on $\theta$, invoke Bayes' rule

$$p(\theta|x) = \frac{p(x|\theta)\, p(\theta)}{\int p(x|\theta)\, p(\theta)\, d\theta},$$

and then use either the distribution $p(x|\hat{\theta}_{MAP})$ based on the maximum a posteriori estimator $\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|x)$, or the true posterior

$$p(x|x) = \int p(x|\theta)\, p(\theta|x)\, d\theta. \qquad (4)$$

At this point the reader might be wondering to what extent it is justifiable to fit distributions to single data points. It is important to bear in mind that $p_i(x)$ is but an intermediate step in forming the kernel, and is not necessarily meant to model anything in and of itself. The motivation is to exploit the way that probability distributions capture similarity: assuming that our model is appropriate, if a distribution fit to one data point gives high likelihood to another data point, this indicates that the two data points are in some sense similar. This is particularly true of intricate graphical models commonly used to model structured data, such as time series, images, etc. Such models can capture relatively complicated relationships in real world data. Probability product kernels allow kernel methods to harness the power of such models.

When $x$ is just a point in $\mathbb{R}^n$ with no further structure, probability product kernels are less compelling. Nevertheless, it is worth noting that even the Gaussian RBF kernel (1) can be regarded as a probability product kernel (by setting $\rho = 1$ and choosing $p_i(x)$ to be a $\sigma^2/2$-variance Gaussian fit to $x_i$ by maximum likelihood). In particular situations when the point $x$ does not yield a reliable estimate of $p$ (typically because the class of generative models is too flexible relative to the datum), we may consider regularizing the maximum likelihood estimate of each probability by fitting it to the neighbourhood of a point or by employing other more global estimators of the densities. Other related methods that use points to estimate local distributions include kernel density estimation (KDE), where the global density is composed of local density models centered on each datum (Silverman, 1986). We next discuss a variety of distributions (in particular, the exponential family) where maximum likelihood estimation is well behaved and the probability product kernel is straightforward to compute.

3. Exponential Families

A family of distributions parameterized by $\theta \in \mathbb{R}^D$ is said to form an exponential family if its members are of the form

$$p_\theta(x) = \exp\left( A(x) + \theta^T T(x) - K(\theta) \right).$$

Here $A$ is called the measure, $K$ is the cumulant generating function and $T$ are the sufficient statistics.

Some of the most common statistical distributions, including the Gaussian, multinomial, Poisson and Gamma distributions, are exponential families (Table 1). Note that to ensure that $p_\theta(x)$ is normalized, $A$, $K$ and $\theta$ must be related through the Laplace transform

$$K(\theta) = \log \int \exp\left( A(x) + \theta^T T(x) \right) dx.$$

Table 1: Some well-known exponential family distributions.

    Family                                    $T(x)$                $A(x)$                                       $K(\theta)$
    Gaussian ($\theta=\mu$, $\sigma^2=1$)     $x$                   $-\frac{1}{2}x^T x - \frac{D}{2}\log(2\pi)$  $\frac{1}{2}\theta^T\theta$
    Gaussian ($\theta=\sigma^2$, $\mu=0$)     $x$                   $-\frac{1}{2}\log(2\pi)$                     $\frac{1}{2}\log\theta$
    Exponential ($\theta=-1/\beta$)           $x$                   $0$                                          $-\log(-\theta)$
    Gamma ($\theta=\alpha$)                   $\log x$              $-\log x - x$                                $\log\Gamma(\theta)$
    Poisson ($\theta=\log p$)                 $x$                   $-\log(x!)$                                  $\exp\theta$
    Multinomial ($\theta_i=\log\alpha_i$)     $(x_1, x_2, \ldots)$  $\log\left((\sum_i x_i)! / \prod_i x_i!\right)$  $0$ (with $\sum_i \alpha_i = 1$)

The Bhattacharyya kernel ($\rho = 1/2$) can be computed in closed form for any exponential family:

$$\begin{aligned}
k(x, x') = k(p, p') &= \int_x p_\theta(x)^{1/2}\, p_{\theta'}(x)^{1/2}\, dx \\
&= \int_x \exp\left( A(x) + \left(\tfrac{1}{2}\theta + \tfrac{1}{2}\theta'\right)^T T(x) - \tfrac{1}{2}K(\theta) - \tfrac{1}{2}K(\theta') \right) dx \\
&= \exp\left( K\!\left(\tfrac{1}{2}\theta + \tfrac{1}{2}\theta'\right) - \tfrac{1}{2}K(\theta) - \tfrac{1}{2}K(\theta') \right).
\end{aligned}$$

For $\rho \neq 1/2$, $k(x, x')$ can only be written in closed form when $A$ and $T$ obey some special properties. Specifically, if $2\rho A(x) = A(\eta x)$ for some $\eta$ and $T$ is linear in $x$, then

$$\log \int \exp\left( 2\rho A(x) + \rho(\theta + \theta')^T T(x) \right) dx = \log \frac{1}{\eta} \int \exp\left( A(\eta x) + \frac{\rho}{\eta}(\theta + \theta')^T T(\eta x) \right) d(\eta x) = K\!\left( \frac{\rho}{\eta}(\theta + \theta') \right) - \log\eta;$$

hence

$$k(p, p') = \exp\left( K\!\left( \frac{\rho}{\eta}(\theta + \theta') \right) - \log\eta - \rho K(\theta) - \rho K(\theta') \right).$$

In the following we look at some specific examples.

3.1 The Gaussian Distribution

The $D$ dimensional Gaussian distribution $p(x) = \mathcal{N}(\mu, \Sigma)$ is of the form

$$p(x) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right),$$

where $\Sigma$ is a positive definite matrix and $|\Sigma|$ denotes its determinant. For a pair of Gaussians $p = \mathcal{N}(\mu, \Sigma)$ and $p' = \mathcal{N}(\mu', \Sigma')$, completing the square in the exponent gives the general probability product kernel

$$K_\rho(x, x') = K_\rho(p, p') = \int_{\mathbb{R}^D} p(x)^\rho\, p'(x)^\rho\, dx = (2\pi)^{(1-2\rho)D/2}\, \rho^{-D/2}\, |\Sigma^\dagger|^{1/2}\, |\Sigma|^{-\rho/2}\, |\Sigma'|^{-\rho/2} \exp\left( -\tfrac{\rho}{2}\left( \mu^T \Sigma^{-1} \mu + \mu'^T \Sigma'^{-1} \mu' - \mu^{\dagger T} \Sigma^\dagger \mu^\dagger \right) \right), \qquad (5)$$

where $\Sigma^\dagger = \left( \Sigma^{-1} + \Sigma'^{-1} \right)^{-1}$ and $\mu^\dagger = \Sigma^{-1}\mu + \Sigma'^{-1}\mu'$. When the covariance is isotropic and fixed, $\Sigma = \sigma^2 I$, this simplifies to

$$K_\rho(p, p') = (2\rho)^{-D/2} (2\pi\sigma^2)^{(1-2\rho)D/2}\, e^{-\|\mu - \mu'\|^2 / (4\sigma^2/\rho)},$$

which, for $\rho = 1$ (the expected likelihood kernel), simply gives

$$k(p, p') = \frac{1}{(4\pi\sigma^2)^{D/2}}\, e^{-\|\mu' - \mu\|^2 / (4\sigma^2)},$$

recovering, up to a constant factor, the Gaussian RBF kernel (1).
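A minimal sketch of equation (5) in Python (our own transcription, with our variable names; the final assertion checks the RBF correspondence noted above):

    import numpy as np

    def gaussian_ppk(mu, cov, mu_p, cov_p, rho=1.0):
        """Probability product kernel between two Gaussians, eq. (5)."""
        D = len(mu)
        ci, ci_p = np.linalg.inv(cov), np.linalg.inv(cov_p)
        cov_dag = np.linalg.inv(ci + ci_p)          # Sigma-dagger
        mu_dag = ci @ mu + ci_p @ mu_p              # mu-dagger
        quad = mu @ ci @ mu + mu_p @ ci_p @ mu_p - mu_dag @ cov_dag @ mu_dag
        logk = ((1 - 2 * rho) * D / 2) * np.log(2 * np.pi) \
             - (D / 2) * np.log(rho) \
             + 0.5 * np.linalg.slogdet(cov_dag)[1] \
             - (rho / 2) * np.linalg.slogdet(cov)[1] \
             - (rho / 2) * np.linalg.slogdet(cov_p)[1] \
             - (rho / 2) * quad
        return float(np.exp(logk))

    # rho = 1 with fixed isotropic covariance recovers the RBF up to a
    # constant factor: k = (4 pi s^2)^(-D/2) exp(-||mu - mu'||^2 / (4 s^2)).
    s2, mu, mu_p = 0.5, np.array([0.0, 1.0]), np.array([1.0, -1.0])
    k = gaussian_ppk(mu, s2 * np.eye(2), mu_p, s2 * np.eye(2), rho=1.0)
    expected = (4 * np.pi * s2) ** -1 * np.exp(-np.sum((mu - mu_p) ** 2) / (4 * s2))
    assert abs(k - expected) < 1e-12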
3.2 The Bernoulli Distribution

The Bernoulli distribution $p(x) = \gamma^x (1-\gamma)^{1-x}$ with parameter $\gamma \in (0,1)$, and its $D$ dimensional variant, sometimes referred to as Naive Bayes,

$$p(x) = \prod_{d=1}^{D} \gamma_d^{x_d} (1-\gamma_d)^{1-x_d},$$

with $\gamma \in (0,1)^D$, are used to model binary $x \in \{0,1\}$ or multidimensional binary $x \in \{0,1\}^D$ observations. The probability product kernel

$$K_\rho(x, x') = K_\rho(p, p') = \sum_{x \in \{0,1\}^D} \prod_{d=1}^{D} (\gamma_d \gamma'_d)^{\rho x_d} \left( (1-\gamma_d)(1-\gamma'_d) \right)^{\rho(1-x_d)}$$

factorizes as

$$K_\rho(p, p') = \prod_{d=1}^{D} \left[ (\gamma_d \gamma'_d)^\rho + (1-\gamma_d)^\rho (1-\gamma'_d)^\rho \right].$$

3.3 The Multinomial Distribution

The multinomial model

$$p(x) = \frac{s!}{x_1!\, x_2! \cdots x_D!}\, \alpha_1^{x_1} \alpha_2^{x_2} \cdots \alpha_D^{x_D},$$

with parameter vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_D)$ subject to $\sum_{d=1}^{D} \alpha_d = 1$, is commonly used to model discrete integer counts $x = (x_1, x_2, \ldots, x_D)$ with $\sum_{i=1}^{D} x_i = s$ fixed, such as the number of occurrences of words in a document. The maximum likelihood estimate given observations $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ is

$$\hat{\alpha}_d = \frac{\sum_{i=1}^{n} x_d^{(i)}}{\sum_{i=1}^{n} \sum_{d=1}^{D} x_d^{(i)}}.$$

For fixed $s$, the Bhattacharyya kernel ($\rho = 1/2$) can be computed explicitly using the multinomial theorem:

$$k(p, p') = \sum_{\substack{x = (x_1, \ldots, x_D) \\ \sum_i x_i = s}} \frac{s!}{x_1!\, x_2! \cdots x_D!} \prod_{d=1}^{D} (\alpha_d \alpha'_d)^{x_d/2} = \left[ \sum_{d=1}^{D} (\alpha_d \alpha'_d)^{1/2} \right]^s, \qquad (6)$$

which is equivalent to the homogeneous polynomial kernel of order $s$ between the vectors $(\sqrt{\alpha_1}, \ldots, \sqrt{\alpha_D})$ and $(\sqrt{\alpha'_1}, \ldots, \sqrt{\alpha'_D})$. When $s$ is not constant, we can sum over all its possible values,

$$k(p, p') = \sum_{s=0}^{\infty} \left[ \sum_{d=1}^{D} (\alpha_d \alpha'_d)^{1/2} \right]^s = \left( 1 - \sum_{d=1}^{D} (\alpha_d \alpha'_d)^{1/2} \right)^{-1},$$

or weight each power differently, leading to a power series expansion.

3.4 The Gamma and Exponential Distributions

The gamma distribution $\Gamma(\alpha, \beta)$ parameterized by $\alpha > 0$ and $\beta > 0$ is of the form

$$p(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta}.$$

The probability product kernel between $p \sim \Gamma(\alpha, \beta)$ and $p' \sim \Gamma(\alpha', \beta')$ will then be

$$k_\rho(p, p') = \frac{\Gamma(\alpha^\dagger)\, (\beta^\dagger)^{\alpha^\dagger}}{\left[ \Gamma(\alpha)\beta^\alpha\, \Gamma(\alpha')\beta'^{\alpha'} \right]^\rho},$$

where $\alpha^\dagger = \rho(\alpha + \alpha' - 2) + 1$ and $1/\beta^\dagger = \rho(1/\beta + 1/\beta')$. When $\rho = 1/2$ this simplifies to $\alpha^\dagger = (\alpha + \alpha')/2$ and $1/\beta^\dagger = (1/\beta + 1/\beta')/2$. The exponential distribution $p(x) = \frac{1}{\beta} e^{-x/\beta}$ can be regarded as a special case of the gamma family with $\alpha = 1$. In this case

$$k_\rho(p, p') = \left[ \rho\left( 1/\beta + 1/\beta' \right) \right]^{-1} (\beta\beta')^{-\rho}.$$
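Equation (6) is a one-line computation once counts are normalized into maximum likelihood multinomial parameters. A small sketch (our own, not the paper's code):

    import numpy as np

    def multinomial_bhattacharyya(counts, counts_p, s=1):
        """Bhattacharyya kernel (rho = 1/2) between two multinomials, eq. (6).

        counts, counts_p: nonnegative count vectors (e.g. word counts);
        the ML multinomial parameters are the normalized frequencies.
        The kernel is the homogeneous polynomial kernel of order s
        between the square-rooted parameter vectors.
        """
        a = np.asarray(counts, float); a = a / a.sum()
        b = np.asarray(counts_p, float); b = b / b.sum()
        return float(np.dot(np.sqrt(a), np.sqrt(b)) ** s)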
4. Latent and Graphical Models

We next upgrade beyond the exponential family of distributions and derive the probability product kernel for more versatile generative distributions. This is done by considering latent variables and structured graphical models. While introducing additional complexity in the generative model and hence providing a more elaborate probability product kernel, these models do have the caveat that estimators such as maximum likelihood are not as well-behaved as in the simple exponential family models and may involve expectation-maximization or other only locally optimal estimates. We first discuss the simplest latent variable models, mixture models, which correspond to a single unobserved discrete parent of the emission, and then upgrade to more general graphical models such as hidden Markov models.

Latent variable models identify a split between the variables in $x$ such that some are observed and are denoted using $x_o$ (where the lower-case 'o' is short for 'observed') while others are hidden and denoted using $x_h$ (where the lower-case 'h' is short for 'hidden'). In the latent case, the probability product kernel is computed only by the inner product between two distributions $p(x_o)$ and $p'(x_o)$ over the observed components, in other words $k(p, p') = \int p(x_o)\, p'(x_o)\, dx_o$. The additional variables $x_h$ are incomplete observations that are not part of the original sample space (where the datum or data points to kernelize exist) but an augmentation of it. The desired $p(x_o)$, therefore, involves marginalizing away the hidden variables:

$$p(x_o) = \sum_{x_h} p(x_o, x_h) = \sum_{x_h} p(x_h)\, p(x_o|x_h).$$

For instance, we may consider a mixture of exponential families model (a direct generalization of the exponential family model) where $x_h$ is a single discrete variable and where $p(x_o|x_h)$ are exponential family distributions themselves. The probability product kernel is then given as usual between two such distributions $p(x_o)$ and $p'(x_o)$, which expand as follows:

$$k(p, p') = \sum_{x_o} p(x_o)^\rho\, p'(x_o)^\rho = \sum_{x_o} \left( \sum_{x_h} p(x_h)\, p(x_o|x_h) \right)^{\!\rho} \left( \sum_{x'_h} p'(x'_h)\, p'(x_o|x'_h) \right)^{\!\rho}.$$

We next note a slight modification to the kernel which makes latent variable models tractable when $\rho \neq 1$. This alternative kernel is denoted $\tilde{k}(p, p')$ and also satisfies Mercer's condition using the same line of reasoning as was employed for $k(p, p')$ in the previous sections. Essentially, for $\tilde{k}(p, p')$ we will assume that the power operation involving $\rho$ is performed on each entry of the joint probability distribution $p(x_o, x_h)$ instead of on the marginalized $p(x_o)$ alone, as follows:

$$\tilde{k}(p, p') = \sum_{x_o} \sum_{x_h} \left( p(x_h)\, p(x_o|x_h) \right)^\rho \sum_{x'_h} \left( p'(x'_h)\, p'(x_o|x'_h) \right)^\rho.$$

While the original kernel $k(p, p')$ involved the mapping to Hilbert space given by $x \to p(x_o)$, the above $\tilde{k}(p, p')$ corresponds to mapping each datum to an augmented distribution over a more complex marginalized space where probability values are squashed (or raised) by a power of $\rho$. Clearly, $k(p, p')$ and $\tilde{k}(p, p')$ are formally equivalent when $\rho = 1$. However, for other settings of $\rho$, we prefer handling latent variables with $\tilde{k}(p, p')$ (and at times omit the tilde symbol when it is self-evident), since it readily accommodates efficient marginalization and propagation algorithms (such as the junction tree algorithm; Jordan and Bishop, 2004).

One open issue with latent variables is that the $\tilde{k}$ kernels that marginalize over them may produce different values if the underlying latent distributions have different joint distributions over hidden and observed variables, even though the marginal distribution over observed variables stays the same. For instance, we may have a two-component mixture model for $p$ with both components having the same identical emission distributions, yet the kernel $\tilde{k}(p, p')$ evaluates to a different value if we collapse the identical components in $p$ into one emission distribution. This is to be expected since the different latent components imply a slightly different mapping to Hilbert space.

We next describe the kernel for various graphical models, which essentially impose a factorization on the joint probability distribution over the $N$ variables $x_o$ and $x_h$. This article focuses on directed graphs (although undirected graphs can also be kernelized) where this factorization implies that the joint distribution can be written as a product of conditionals over each node or variable $x_i \in \{x_o, x_h\}$, $i = 1 \ldots N$, given its parent nodes or set of parent variables $x_{pa(i)}$, as $p(x_o, x_h) = \prod_{i=1}^{N} p(x_i | x_{pa(i)})$. We now discuss particular cases of graphical models, including mixture models, hidden Markov models and linear dynamical systems.

4.1 Mixture Models

In the simplest scenario, the latent graphical model could be a mixture model, such as a mixture of Gaussians or mixture of exponential family distributions. Here, the hidden variable $x_h$ is a single discrete variable which is a parent of a single $x_o$ emission variable in the exponential family. To derive the full kernel for the mixture model, we first assume it is straightforward to compute an elementary kernel between any pair of entries (indexed via $x_h$ and $x'_h$) from each mixture model, as follows:

$$\tilde{k}\left( p(x_o|x_h),\, p'(x_o|x'_h) \right) = \sum_{x_o} p(x_o|x_h)^\rho\, p'(x_o|x'_h)^\rho.$$

In the above, the conditionals could be, for instance, exponential family (emission) distributions. Then, we can easily compute the kernel for a mixture model of such distributions using these elementary kernel evaluations and a weighted summation of all possible pairings of components of the mixtures:

$$\tilde{k}(p, p') = \sum_{x_o} \sum_{x_h} \left( p(x_h)\, p(x_o|x_h) \right)^\rho \sum_{x'_h} \left( p'(x'_h)\, p'(x_o|x'_h) \right)^\rho = \sum_{x_h, x'_h} \left( p(x_h)\, p'(x'_h) \right)^\rho\, \tilde{k}\left( p(x_o|x_h),\, p'(x_o|x'_h) \right).$$

Effectively, the mixture model kernel involves enumerating all settings of the hidden variables $x_h$ and $x'_h$. Taking the notation $|\cdot|$ to be the cardinality of a variable, we effectively need to perform and sum a total of $|x_h| \times |x'_h|$ elementary kernel computations $\tilde{k}_{x_h, x'_h}$ between the individual emission distributions.
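A sketch of this weighted pairing of components (our own code; the elementary kernel is any routine such as the gaussian_ppk sketch above, and the parameter packing is an assumption of ours):

    def mixture_ppk(weights, components, weights_p, components_p,
                    elementary_kernel, rho=1.0):
        """Probability product kernel between two mixture models.

        weights, weights_p: mixing proportions p(x_h), p'(x'_h).
        components, components_p: per-component parameter tuples, e.g.
        (mu, cov) for Gaussian emissions.
        elementary_kernel: kernel between two emission distributions,
        e.g. gaussian_ppk(mu, cov, mu_p, cov_p, rho=...).
        Implements k = sum_{h,h'} (p(h) p'(h'))^rho k_tilde(h, h').
        """
        k = 0.0
        for w, c in zip(weights, components):
            for w_p, c_p in zip(weights_p, components_p):
                k += (w * w_p) ** rho * elementary_kernel(*c, *c_p, rho=rho)
        return k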
Figure 1: (a) HMM for p. (b) HMM for p'. (c) Combined graphs. Two hidden Markov models and the resulting graphical model as the kernel couples common parents for each node, creating undirected edges between them. [graphic not reproduced]

4.2 Hidden Markov Models

We next upgrade beyond mixtures to graphical models such as hidden Markov models (HMMs). Considering hidden Markov models as the generative model $p$ allows us to build a kernel over sequences of variable lengths. However, HMMs and many popular Bayesian networks have a large number of (hidden and observed) variables, and enumerating all possible hidden states for two distributions $p$ and $p'$ quickly becomes inefficient since $x_h$ and $x'_h$ are multivariate and/or have large cardinalities. However, unlike the plain mixture modeling case, HMMs have an efficient graphical model structure implying a factorization of their distribution which we can leverage to compute our kernel efficiently.

A hidden Markov model over sequences of length $T+1$ has observed variables $x_o = \{x_0, \ldots, x_T\}$ and hidden states $x_h = \{q_0, \ldots, q_T\}$. The graphical model for an HMM in Figure 1(a) reflects its Markov chain assumption and leads to the following probability density function (note here we define $q_{-1} = \{\}$ as a null variable for brevity or, equivalently, $p(q_0|q_{-1}) = p(q_0)$):

$$p(x_o) = \sum_{x_h} p(x_o, x_h) = \sum_{q_0} \cdots \sum_{q_T} \prod_{t=0}^{T} p(x_t|q_t)\, p(q_t|q_{t-1}).$$

Figure 2: The undirected clique graph obtained from the kernel for efficiently summing over both sets of hidden variables $x_h$ and $x'_h$. [graphic not reproduced]

To compute the kernel $\tilde{k}(p, p')$ in a brute force manner, we need to sum over all configurations of $q_0, \ldots, q_T$ and $q'_0, \ldots, q'_T$ while computing for each configuration its corresponding elementary kernel $\tilde{k}\left( p(x_0, \ldots, x_T | q_0, \ldots, q_T),\, p'(x_0, \ldots, x_T | q'_0, \ldots, q'_T) \right)$. Exploring each setting of the state space of the HMMs (shown in Figure 1(a) and (b)) is highly inefficient, requiring $|q_t|^{(T+1)} \times |q'_t|^{(T+1)}$ elementary kernel evaluations. However, we can take advantage of the factorization of the hidden Markov model by using graphical modeling algorithms such as the junction tree algorithm (Jordan and Bishop, 2004) to compute the kernel efficiently. First, we use the factorization of the HMM in the kernel as follows:

$$\begin{aligned}
\tilde{k}(p, p') &= \sum_{x_0, \ldots, x_T} \sum_{q_0 \ldots q_T} \prod_{t=0}^{T} p(x_t|q_t)^\rho\, p(q_t|q_{t-1})^\rho \sum_{q'_0 \ldots q'_T} \prod_{t=0}^{T} p'(x_t|q'_t)^\rho\, p'(q'_t|q'_{t-1})^\rho \\
&= \sum_{q_0 \ldots q_T} \sum_{q'_0 \ldots q'_T} \prod_{t=0}^{T} p(q_t|q_{t-1})^\rho\, p'(q'_t|q'_{t-1})^\rho \left( \sum_{x_t} p(x_t|q_t)^\rho\, p'(x_t|q'_t)^\rho \right) \\
&= \sum_{q_0 \ldots q_T} \sum_{q'_0 \ldots q'_T} \prod_{t=0}^{T} p(q_t|q_{t-1})^\rho\, p'(q'_t|q'_{t-1})^\rho\, \psi(q_t, q'_t) \\
&= \sum_{q_T} \sum_{q'_T} \psi(q_T, q'_T) \prod_{t=1}^{T} \left[ \sum_{q_{t-1}} \sum_{q'_{t-1}} p(q_t|q_{t-1})^\rho\, p'(q'_t|q'_{t-1})^\rho\, \psi(q_{t-1}, q'_{t-1}) \right] p(q_0)^\rho\, p'(q'_0)^\rho.
\end{aligned}$$

In the above we see that we only need to compute the kernel for each setting of the hidden variable for a given time step on its own. Effectively, the kernel couples the two HMMs via their common children as in Figure 1(c). The kernel evaluations, once solved, form non-negative clique potential functions $\psi(q_t, q'_t)$ with a different setting for each value of $q_t$ and $q'_t$. These couple the hidden variable of each HMM for each time step. The elementary kernel evaluations are easily computed (particularly if the emission distributions $p(x_t|q_t)$ are in the exponential family) as

$$\tilde{k}\left( p(x_t|q_t),\, p'(x_t|q'_t) \right) = \psi(q_t, q'_t) = \sum_{x_t} p(x_t|q_t)^\rho\, p'(x_t|q'_t)^\rho.$$
Only a total of $(T+1) \times |q_t| \times |q'_t|$ such elementary kernels need to be evaluated and form a limited number of such clique potential functions. Similarly, each conditional distribution $p(q_t|q_{t-1})^\rho$ and $p'(q'_t|q'_{t-1})^\rho$ can also be viewed as a non-negative clique potential function, i.e. $\phi(q_t, q_{t-1}) = p(q_t|q_{t-1})^\rho$ and $\phi(q'_t, q'_{t-1}) = p'(q'_t|q'_{t-1})^\rho$. Computing the kernel then involves marginalizing over the hidden variables $x_h$, which is done by running a forward-backward or junction-tree algorithm on the resulting undirected clique tree graph shown in Figure 2.

Thus, to compute the kernel for two sequences $x$ and $x'$ of lengths $T_x + 1$ and $T_{x'} + 1$, we train an HMM from each sequence using maximum likelihood and then compute the kernel using the learned HMM transition matrix and emission models for a user-specified sequence length $T+1$ (where $T$ can be chosen according to some heuristic, for instance, the average of all $T_x$ in the training data set). The following is pseudo-code illustrating how to compute the $\tilde{k}(p, p')$ kernel given two hidden Markov models $(p, p')$ and user-specified parameters $T$ and $\rho$:

    $\Phi(q_0, q'_0) = p(q_0)^\rho\, p'(q'_0)^\rho$
    for $t = 1 \ldots T$
        $\Phi(q_t, q'_t) = \sum_{q_{t-1}} \sum_{q'_{t-1}} p(q_t|q_{t-1})^\rho\, p'(q'_t|q'_{t-1})^\rho\, \psi(q_{t-1}, q'_{t-1})\, \Phi(q_{t-1}, q'_{t-1})$
    end
    $\tilde{k}(p, p') = \sum_{q_T} \sum_{q'_T} \Phi(q_T, q'_T)\, \psi(q_T, q'_T)$

Note that the hidden Markov models $p$ and $p'$ need not have the same number of hidden states for the kernel computation.
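A direct vectorized transcription of this pseudo-code in Python (our own sketch; the arrays, their layout and the precomputed table of elementary emission kernels are assumptions of ours):

    import numpy as np

    def hmm_ppk(pi, A, pi_p, A_p, psi, T, rho=1.0):
        """Probability product kernel between two HMMs via the forward
        recursion in the pseudo-code above.

        pi, pi_p: initial state distributions, shapes (M,) and (M',).
        A, A_p:   transition matrices, A[i, j] = p(q_t = j | q_{t-1} = i).
        psi:      (M, M') table of elementary emission kernels,
                  psi[i, j] = sum_x p(x|q=i)^rho p'(x|q'=j)^rho.
        T:        user-specified sequence length (T+1 time steps).
        """
        Ar, Ar_p = A ** rho, A_p ** rho
        Phi = np.outer(pi ** rho, pi_p ** rho)      # Phi(q_0, q'_0)
        for _ in range(T):
            # Sum over previous state pairs, weighted by powered
            # transitions and the emission potential psi.
            Phi = Ar.T @ (psi * Phi) @ Ar_p
        return float(np.sum(Phi * psi))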
4.3 Bayesian Networks

The above method can be applied to Bayesian networks in general, where hidden and observed variables factorize according to $p(x) = \prod_{i=1}^{N} p(x_i|x_{pa(i)})$. The general recipe mirrors the above derivation for the hidden Markov model case. In fact, the two Bayesian networks need not have the same structure as long as they share the same sample space over the emission variables $x_o$. For instance, consider computing the kernel between the Bayesian network in Figure 3(a) and the hidden Markov model in Figure 1(b) over the same sample space. The graphs can be connected with their common children as in Figure 3(b). The parents with common children are married and form cliques of hidden variables. We then only need to evaluate the elementary kernels over these cliques, giving the following non-negative potential functions for each observed node (for instance $x_i \in x_o$) under each setting of all its parents' nodes (from both $p$ and $p'$):

$$\psi\left( x_{h,pa(i)},\, x'_{h',pa(i)} \right) = \tilde{k}\left( p(x_i|x_{h,pa(i)}),\, p'(x_i|x'_{h',pa(i)}) \right) = \sum_{x_i} p(x_i|x_{h,pa(i)})^\rho\, p'(x_i|x'_{h',pa(i)})^\rho.$$

After marrying common parents, the resulting clique graph is then built and used with a junction tree algorithm to compute the overall kernel from the elementary kernel evaluations. Although the resulting undirected clique graph may not always yield a dramatic improvement in efficiency as was seen in the hidden Markov model case, we note that propagation algorithms (loopy or junction tree algorithms) are still likely to provide a more efficient evaluation of the kernel than the brute force enumeration of all latent configurations of both Bayesian networks in the kernel (as was described in the generic mixture model case). While the above computations emphasized discrete latent variables, the kernel is also computable over continuous latent configuration networks, where $x_h$ contains scalars and vectors, as is the case in linear Gaussian models, which we develop in the next subsections.

Figure 3: (a) Bayesian network for p. (b) Combined graphs. The resulting graphical model from a Bayesian network and a hidden Markov model as the kernel couples common parents for each node, creating undirected edges between them, and a final clique graph for the junction tree algorithm. [graphic not reproduced]

4.4 Linear Gaussian Models

A linear Gaussian model on a directed acyclic graph $G$ with variables $x_1, x_2, \ldots, x_N$ associated to its vertices is of the form

$$p(x_1, x_2, \ldots, x_N) = \prod_i \mathcal{N}\left( x_i;\; \beta_{i,0} + \sum_{j \in pa(i)} \beta_{i,j} x_j,\; \Sigma_i \right),$$

where $\mathcal{N}(x; \mu, \Sigma)$ is the multivariate Gaussian distribution, and $pa(i)$ is the index set of parent vertices associated with the $i$'th vertex. The unconditional joint distribution can be recovered from the conditional form and is itself a Gaussian distribution over the variables in the model, $p(x_1, x_2, \ldots, x_N)$. Hence, the probability product kernel between a pair of linear Gaussian models can be computed in closed form by using the unconditional joint distribution and (5).

Latent variable models require an additional marginalization step. Variables are split into two sets: the observed variables $x_o$ and the hidden variables $x_h$. After the joint unconditional distribution $p(x_o, x_h)$ has been computed, the hidden variables can be trivially integrated out:

$$k(p, p') = \int \left( \int p(x_o, x_h)\, dx_h \right)^{\!\rho} \left( \int p'(x_o, x'_h)\, dx'_h \right)^{\!\rho} dx_o = \int p(x_o)^\rho\, p'(x_o)^\rho\, dx_o,$$

so that the kernel is computed between Gaussians over only the observed variables.

4.5 Linear Dynamical Systems

A commonly used linear Gaussian model with latent variables is the linear dynamical system (LDS) (Shumway and Stoffer, 1982), also known as the state space model or Kalman filter model. The LDS is the continuous state analog of the hidden Markov model and shares the same conditional independence graph, Figure 1(a). Thus, it is appropriate for modeling continuous time series data and continuous dynamical systems. The LDS's joint probability model is of the form

$$p(x_0, \ldots, x_T, s_0, \ldots, s_T) = \mathcal{N}(s_0; \mu, \Sigma)\, \mathcal{N}(x_0; Cs_0, R) \prod_{t=1}^{T} \mathcal{N}(s_t; As_{t-1}, Q)\, \mathcal{N}(x_t; Cs_t, R),$$

where $x_t$ is the observed variable at time $t$, and $s_t$ is the latent state space variable at time $t$, obeying the Markov property. The probability product kernel between LDS models is

$$k(p, p') = \int \mathcal{N}(x; \mu_x, \Sigma_{xx})^\rho\, \mathcal{N}(x; \mu'_x, \Sigma'_{xx})^\rho\, dx,$$

where $\mu_x$ and $\Sigma_{xx}$ are the unconditional mean and covariance, which can be computed from the recursions

$$\mu_{s_t} = A\,\mu_{s_{t-1}}, \qquad \mu_{x_t} = C\,\mu_{s_t}, \qquad \Sigma_{s_t s_t} = A\,\Sigma_{s_{t-1} s_{t-1}} A^T + Q, \qquad \Sigma_{x_t x_t} = C\,\Sigma_{s_t s_t} C^T + R.$$

As with the HMM, we need to set the number of time steps before computing the kernel. Note that the most algorithmically expensive calculation when computing the probability product kernel between Gaussians is taking the inverse of the covariance matrices. The dimensionality of the covariance matrix will grow linearly with the number of time steps, and the running time of the matrix inverse grows cubically with the dimensionality. However, the required inverses can be efficiently computed because $\Sigma_{xx}$ is block diagonal. Each extra time step added to the kernel will simply add another block, and the inverse of a block diagonal matrix can be computed by inverting the blocks individually. Therefore, the running time of inverting the block diagonal covariance matrix will only grow linearly with the number of time steps $T$ and cubically with the (small) block size.
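A minimal sketch of the moment recursions (our own code and variable names; the returned per-time-step moments would then be assembled into the Gaussian used in equation (5)):

    import numpy as np

    def lds_unconditional_moments(mu0, S0, A, Q, C, R, T):
        """Per-time-step unconditional means/covariances of an LDS.

        Runs mu_{s_t} = A mu_{s_{t-1}} and
        Sigma_{s_t s_t} = A Sigma A^T + Q, then maps through the
        emission model. Returns observed-variable means and
        covariances for t = 0..T.
        """
        mu_s, S_s = mu0, S0
        mus, covs = [], []
        for _ in range(T + 1):
            mus.append(C @ mu_s)              # mu_{x_t} = C mu_{s_t}
            covs.append(C @ S_s @ C.T + R)    # Sigma_{x_t x_t}
            mu_s = A @ mu_s                   # propagate state mean
            S_s = A @ S_s @ A.T + Q           # propagate state covariance
        return mus, covs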
5. Sampling Approximation

To capture the structure of real data, generative models often need to be quite intricate. Closed form solutions for the probability product kernel are then unlikely to exist, forcing us to rely on approximation methods to compute $k(p, p')$. Provided that evaluating the likelihood $p(x)$ is computationally feasible, and that we can also efficiently sample from the model, in the $\rho = 1$ case (expected likelihood kernel) we may employ the Monte Carlo estimate

$$k(p, p') \approx \frac{\beta}{N} \sum_{i=1}^{N} p'(x_i) + \frac{1-\beta}{N'} \sum_{i=1}^{N'} p(x'_i),$$

where $x_1, \ldots, x_N$ and $x'_1, \ldots, x'_{N'}$ are i.i.d. samples from $p$ and $p'$ respectively, and $\beta \in [0,1]$ is a parameter of our choosing. By the law of large numbers, this approximation will converge to the true kernel value in the limit $N \to \infty$ for any $\beta$. Other sampling techniques, such as importance sampling and a variety of flavors of Markov chain Monte Carlo (MCMC), including Gibbs sampling, can be used to improve the rate of the sampling approximation's convergence to the true kernel.

In some cases it may be possible to use the sampling approximation for a general $\rho$. This is possible if a normalizing factor $Z$ can be computed to renormalize $p(x)^\rho$ into a valid distribution. $Z$ is computable for most discrete distributions and some continuous distributions, such as the Gaussian. The sampling approximation then becomes

$$k(p, p') \approx \frac{\beta}{N} \sum_{i=1}^{N} Z\, p'(\hat{x}_i)^\rho + \frac{1-\beta}{N'} \sum_{i=1}^{N'} Z'\, p(\hat{x}'_i)^\rho,$$

where $Z$ and $Z'$ are the normalizers of $p$ and $p'$ after they are taken to the power of $\rho$. In this formulation $\hat{x}_1, \ldots, \hat{x}_N$ and $\hat{x}'_1, \ldots, \hat{x}'_{N'}$ are i.i.d. samples from $\hat{p}$ and $\hat{p}'$, where $\hat{p} = p^\rho / Z$ and $\hat{p}' = p'^\rho / Z'$.

In practice, for a finite number of samples, the approximate kernel may not be a valid Mercer kernel. The non-symmetric aspect of the approximation can be alleviated by computing only the lower triangular entries in the kernel matrix and replicating them in the upper half. Problems associated with the approximation not being positive definite can be rectified by either using more samples, or by using an implementation of the kernel based classifier algorithm that can converge to a local optimum, such as sequential minimal optimization (Platt, 1999).
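A sketch of the $\rho = 1$ Monte Carlo estimate (our own code; the sampler and density callables are placeholders the caller supplies):

    import numpy as np

    def expected_likelihood_mc(sample_p, pdf_p, sample_p_prime, pdf_p_prime,
                               n=1000, beta=0.5, rng=None):
        """Monte Carlo estimate of the expected likelihood kernel (rho = 1).

        sample_p(n, rng) draws n i.i.d. samples from p; pdf_p evaluates
        the density of p (likewise for p'). Implements
        k ~ (beta/N) sum_i p'(x_i) + ((1-beta)/N') sum_i p(x'_i).
        """
        rng = np.random.default_rng(rng)
        xs = sample_p(n, rng)            # x_i ~ p
        xs_p = sample_p_prime(n, rng)    # x'_i ~ p'
        return beta * np.mean(pdf_p_prime(xs)) \
             + (1 - beta) * np.mean(pdf_p(xs_p))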
Figure 4: Graph for the switching linear dynamical system. [graphic not reproduced]

5.1 Switching Linear Dynamical Systems

In some cases, part of the model can be evaluated by sampling, while the rest is computed in closed form. This is true for the switching linear dynamical system (SLDS) (Pavlovic et al., 2000), which combines the discrete hidden states of the HMM and the continuous state space of the LDS:

$$p(x, s, q) = p(q_0)\, p(s_0|q_0)\, p(x_0|s_0) \prod_{t=1}^{T} p(q_t|q_{t-1})\, p(s_t|s_{t-1}, q_t)\, p(x_t|s_t),$$

where $q_t$ is the discrete hidden state at time $t$, $s_t$ is the continuous hidden state, and $x_t$ are observed emission variables (Figure 4). Sampling $q^1, \ldots, q^N$ according to $p(q)$ and $q'^1, \ldots, q'^N$ according to $p'(q')$, the remaining factors in the corresponding sampling approximation

$$\frac{1}{N} \sum_{i=1}^{N} \int \left( \iint p(s_0|q^i_0)\, p'(s'_0|q'^i_0) \prod_{t=1}^{T} p(s_t|s_{t-1}, q^i_t)\, p'(s'_t|s'_{t-1}, q'^i_t) \prod_{t=0}^{T} p(x_t|s_t)\, p'(x_t|s'_t)\, ds\, ds' \right)^{\!\rho} dx$$

form a non-stationary linear dynamical system. Once the joint distribution is recovered, it can be evaluated in closed form by (5), as was the case with a simple LDS. For $\rho \neq 1$ this is a slightly different formulation than the sampling approximation in the previous section. This hybrid method has the nice property that it is always a valid kernel for any number of samples, since it simply becomes a convex combination of kernels between LDSs. An SLDS model is useful for describing input sequences that have a combination of discrete and continuous dynamics, such as a time series of human motion containing continuous physical dynamics and kinematics variables as well as discrete action variables (Pavlovic et al., 2000).

6. Mean Field Approximation

Another option when the probability product kernel between graphical models cannot be computed in closed form is to use variational bounds, such as the mean field approximation (Jaakkola, 2000). In this method, we introduce a variational distribution $Q(x)$, and using Jensen's inequality, lower bound the kernel:

$$\begin{aligned}
k(p, p') &= \int p(x)^\rho\, p'(x)^\rho\, dx = \int \Psi(x)^\rho\, dx = \exp \log \int \frac{Q(x)}{Q(x)}\, \Psi(x)^\rho\, dx \\
&\geq \exp \int Q(x) \log \frac{\Psi(x)^\rho}{Q(x)}\, dx = \exp\left( \rho\, E_Q[\log\Psi(x)] + H(Q) \right) = B(Q),
\end{aligned}$$

where $\Psi(x) = p(x)\, p'(x)$ is called the potential, $E_Q[\cdot]$ denotes the expectation with respect to $Q(x)$, and $H(Q)$ is the entropy. This transforms the problem from evaluating an intractable integral into that of finding the sufficient statistics of $Q(x)$ and computing the subsequent tractable expectations. Note, using the mean field formulation gives a different (and arguably closer) result to the desired intractable kernel than simply computing the probability product kernels directly between the approximate simple distributions.

It is interesting to note the following form of the bound when it is expressed in terms of the Kullback-Leibler divergence, $D(Q\|p) = \int Q(x) \log \frac{Q(x)}{p(x)}\, dx$, and an entropy term. The bound on the probability product kernel is a function of the Kullback-Leibler divergence to $p$ and $p'$:

$$k(p, p') \geq B(Q) = \exp\left( -\rho\, D(Q\|p) - \rho\, D(Q\|p') + (1 - 2\rho)\, H(Q) \right). \qquad (7)$$
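The equivalence of (7) with the exponent $\rho\, E_Q[\log\Psi(x)] + H(Q)$ follows in one line from $D(Q\|p) = -H(Q) - E_Q[\log p(x)]$ (a short verification we add for completeness):

$$-\rho\, D(Q\|p) - \rho\, D(Q\|p') + (1-2\rho)\, H(Q) = \rho\, E_Q[\log p(x)] + \rho\, E_Q[\log p'(x)] + 2\rho\, H(Q) + (1-2\rho)\, H(Q) = \rho\, E_Q[\log\Psi(x)] + H(Q).$$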
The goal is to define $Q(x)$ in a way that makes computing the sufficient statistics easy. When the graphical structure of $Q(x)$ has no edges, which corresponds to the case that the underlying variables are independent, this is called the mean field method. When $Q(x)$ is made up of a number of more elaborate independent structures, it is called the structured mean field method. In either case, the variational distribution factors in the form $Q_1(x_1)\, Q_2(x_2) \cdots Q_N(x_N)$, with $x_1, x_2, \ldots, x_N$ disjoint subsets of the original variable set $x$.

Once the structure of $Q(x)$ has been set, its (locally) optimal parameterization is found by cycling through the distributions $Q_1(x_1), Q_2(x_2), \ldots, Q_N(x_N)$ and maximizing the lower bound $B(Q)$ with respect to each one, holding the others fixed:

$$\frac{\partial \log B(Q)}{\partial Q_n(x_n)} = \frac{\partial}{\partial Q_n(x_n)} \left[ \rho \int Q_n(x_n)\, E_Q[\log\Psi(x)|x_n]\, dx_n + H(Q_n) + \text{constant} \right] = 0.$$

This leads to the update equations

$$Q_n(x_n) = \frac{1}{Z_n} \exp\left( \rho\, E_Q[\log\Psi(x)|x_n] \right),$$

where the conditional expectation $E_Q[\cdot|x_n]$ is computed by taking the expectation over all variables except $x_n$, and

$$Z_n = \int \exp\left( \rho\, E_Q[\log\Psi(x)|x_n] \right) dx_n$$

is the normalizing factor guaranteeing that $Q(x)$ remains a valid distribution.

In practice this approximation may not give positive definite (PD) kernels. Much current research has been focused on kernel algorithms using non-positive definite kernels. Algorithms such as sequential minimal optimization (Platt, 1999) can converge to locally optimal solutions. The learning theory for non-PD kernels is currently of great interest. See Ong et al. (2004) for an interesting discussion and additional references.

6.1 Factorial Hidden Markov Models

One model that traditionally uses a structured mean field approximation for inference is the factorial hidden Markov model (FHMM) (Ghahramani and Jordan, 1997). Such a model is appropriate for modeling output sequences that are emitted by multiple independent hidden processes. The FHMM is given by

$$p(x_0, x_1, x_2, \ldots, x_T) = \prod_{c=1}^{C} \left[ p(q^c_0) \prod_{t=1}^{T} p(q^c_t|q^c_{t-1}) \right] \prod_{t=0}^{T} p(x_t|q^1_t, \ldots, q^C_t),$$

where $q^c_t$ is the factored discrete hidden state at time $t$ for chain $c$ and $x_t$ is the observed Gaussian emission variable at time $t$. The graphs of the joint FHMM distribution, the potential $\Psi(x, q, q')$, and the variational distribution $Q(x, q, q')$ are depicted in Figure 5.

Figure 5: (a) FHMM $p(x, q)$. (b) Potential $\Psi(x, q, q')$. (c) Variational $Q(x, q, q')$. Graphs associated with the factorial hidden Markov model, its probability product kernel potential, and its structured mean field distribution. [graphic not reproduced]

The structured mean field approximation is parameterized as

$$\begin{aligned}
Q(q^c_0 = i) &\propto \exp\left( \rho \log\phi_c(i) - \tfrac{\rho}{2}\log|\Sigma_i| - \tfrac{\rho}{2}\, E_Q\!\left[ (x_0 - \mu_i)^T \Sigma_i^{-1} (x_0 - \mu_i) \right] \right), \\
Q(q^c_t = i \,|\, q^c_{t-1} = j) &\propto \exp\left( \rho \log\Phi_c(i, j) - \tfrac{\rho}{2}\log|\Sigma_i| - \tfrac{\rho}{2}\, E_Q\!\left[ (x_t - \mu_i)^T \Sigma_i^{-1} (x_t - \mu_i) \right] \right), \\
Q(x_t) &= \mathcal{N}(x_t; \hat{\mu}_t, \hat{\Sigma}_t), \\
\hat{\Sigma}_t &= \frac{1}{\rho}\left( \sum_{i,c} E_Q[q^c_t(i)]\, \Sigma_i^{-1} + \sum_{i,c} E_Q[q'^c_t(i)]\, \Sigma'^{-1}_i \right)^{\!-1}, \\
\hat{\mu}_t &= \rho\, \hat{\Sigma}_t \left( \sum_{i,c} E_Q[q^c_t(i)]\, \Sigma_i^{-1} \mu_i + \sum_{i,c} E_Q[q'^c_t(i)]\, \Sigma'^{-1}_i \mu'_i \right).
\end{aligned}$$

It is necessary to compute the expected sufficient statistics both to update the mean field equations and to evaluate the bound. The expectations $E_Q[x_t]$ and $E_Q[x_t x_t^T]$ can easily be computed from $Q(x_t)$, whereas $Q(q^c_t = i)$ and $Q(q^c_t = i, q^c_{t-1} = j)$ can be computed efficiently by means of a junction tree algorithm on each independent chain in the approximating distribution.

Figure 6: A statistical manifold with a geodesic path between two generative models $\theta$ and $\theta'$, as well as a local tangent space approximation at, for instance, the maximum likelihood model $\theta^*$ for the aggregated data set. [graphic not reproduced]

7. Relationship to Other Probabilistic Kernels

While the probability product kernel is not the only kernel to leverage generative modeling, it does have advantages in terms of computational feasibility as well as in nonlinear flexibility. For instance, heat kernels or diffusion kernels on statistical manifolds (Lafferty and Lebanon, 2002) are a more elegant way to generate a kernel in a probabilistic setting, but computations involve finding geodesics on complicated manifolds and finding closed form expressions for the heat kernel, which are only known for simple structures such as spheres and hyperbolic surfaces. Figure 6 depicts such a statistical manifold. The heat kernel can sometimes be approximated via the geodesic distance from $p$, which is parameterized by $\theta$, to $p'$, parameterized by $\theta'$. But we rarely have compact closed-form expressions for the heat kernel or for these geodesics for general distributions. Only Gaussians with spherical covariance and multinomials are readily accommodated. Curiously, the probability product kernel expression for both these two distributions seems to coincide closely with the more elaborate heat kernel form. For instance, in the multinomial case, both involve an inner product between the square-rooted count frequencies. However, the probability product kernel is straightforward to compute for a much wider range of distributions, providing a practical alternative to heat kernels.

Another probabilistic approach is the Fisher kernel (Jaakkola and Haussler, 1998), which approximates distances and affinities on the statistical manifold by using the local metric at the maximum likelihood estimate $\theta^*$ of the whole data set, as shown in Figure 6.
One caveat, however, is that this approximation can produce a kernel that is quite different from the exact heat kernel above and may not preserve the interesting nonlinearities implied by the generative model. For instance, in the exponential family case, all distributions generate Fisher kernels that are linear in the sufficient statistics. Consider the Fisher kernel

$$k(x, x') = U_x^T\, I_{\theta^*}^{-1}\, U_{x'},$$

where $I_{\theta^*}^{-1}$ is the inverse Fisher information matrix at the maximum likelihood estimate and where we define $U_x = \nabla_\theta \log p(x|\theta)\big|_{\theta^*}$. In exponential families, $p_\theta(x) = \exp(A(x) + \theta^T T(x) - K(\theta))$, producing the following: $U_x = T(x) - G(\theta^*)$, where we define $G(\theta^*) = \nabla_\theta K(\theta)\big|_{\theta^*}$. The resulting Fisher kernel is then

$$k(x, x') = \left( T(x) - G(\theta^*) \right)^T I_{\theta^*}^{-1} \left( T(x') - G(\theta^*) \right),$$

which is an only slightly modified form of the linear kernel in $T(x)$. Therefore, via the Fisher kernel method, a Gaussian distribution over means generates only linear kernels. A Gaussian distribution over means and covariances generates quadratic kernels. Furthermore, multinomials generate log-counts, unlike the square root of the counts in the probability product kernel and the heat kernel. In practical text classification applications, it is known that square root squashing functions on frequencies and counts typically outperform logarithmic squashing functions (Goldzmidt and Sahami, 1998; Cutting et al., 1992).

In (Tsuda et al., 2002; Kashima et al., 2003), another related probabilistic kernel is put forward, the so-called marginalized kernel. This kernel is again similar to the Fisher kernel as well as the probability product kernel. It involves marginalizing a kernel weighted by the posterior over hidden states given the input data for two different points, in other words the output kernel

$$k(x, x') = \sum_h \sum_{h'} p(h|x)\, p(h'|x')\, k\left( (x, h), (x', h') \right).$$

The probability product kernel is similar; however, it involves an inner product over both hidden variables and the input space $x$ with a joint distribution.

A more recently introduced kernel, the exponentiated symmetrized Kullback-Leibler (KL) divergence (Moreno et al., 2004), is also related to the probability product kernel. This kernel involves exponentiating the negated symmetrized KL-divergence between two distributions,

$$k(p, p') = \exp\left( -\alpha\, D(p\|p') - \alpha\, D(p'\|p) + \beta \right), \quad \text{where} \quad D(p\|p') = \int_x p(x) \log\frac{p(x)}{p'(x)}\, dx,$$

and where $\alpha$ and $\beta$ are user-defined scalar parameters. However, this kernel may not always satisfy Mercer's condition, and a formal proof is not available. Another interesting connection is that this form is very similar to our mean-field bound on the probability product kernel (7). However, unlike the probability product kernel (and its bounds), computing the KL-divergence between complicated distributions (mixtures, hidden Markov models, Bayesian networks, linear dynamical systems and intractable graphical models) is intractable and must be approximated from the outset using numerical or sampling procedures, which further decreases the possibility of having a valid Mercer kernel (Moreno et al., 2004).

8. Experiments

In this section we discuss three different learning problems: text classification, biological sequence classification and time series classification. For each problem, we select a generative model that is compatible with the domain, which then leads to a specific form for the probability product kernel. For text, multinomial models are used; for sequences, hidden Markov models are natural; and for time series data we use linear dynamical systems. The subsequent kernel computations are fed to a discriminative classifier (a support vector machine) for training and testing.
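In all three experiments the kernel enters the SVM as a precomputed Gram matrix. A sketch of that pipeline (our own illustration, using scikit-learn's precomputed-kernel interface; the model lists and the ppk callable are placeholders):

    import numpy as np
    from sklearn.svm import SVC

    def gram_matrix(models_a, models_b, kernel):
        """Gram matrix of a probability product kernel between fitted models."""
        return np.array([[kernel(p, q) for q in models_b] for p in models_a])

    # models_train / models_test: generative models fit to each datum;
    # ppk: any of the kernel routines sketched above. For example:
    # K_train = gram_matrix(models_train, models_train, ppk)
    # K_test = gram_matrix(models_test, models_train, ppk)
    # svm = SVC(kernel="precomputed", C=10.0).fit(K_train, y_train)
    # y_pred = svm.predict(K_test)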
8.1 Text Classification

For text classification, we used the standard WebKB data set, which consists of HTML documents in multiple categories. Only the text component of each web page was preserved; HTML markup information and hyperlinks were stripped away. No further stemming or elaborate text preprocessing was performed on the text. Subsequently, a bag-of-words model was used where each document is only represented by the frequency of words that appear within it. This corresponds to a multinomial distribution over counts. The frequencies of words are the maximum likelihood estimate for the multinomial generative model for a particular document, which produces the vector of parameters $\hat{\alpha}$ as in Section 3.

Figure 7: SVM error rates (and standard deviation) for the probability product kernel (set at $\rho = 1/2$ as a Bhattacharyya kernel) for multinomial models, as well as error rates for traditional kernels (linear and RBF with $\sigma \in \{1/4, 1, 4\}$) on the WebKB data set, plotted against the SVM regularization parameter $C$ for two training set sizes: (a) 77 and (b) 622 training points. All results are shown for 20-fold cross-validation. [plot not reproduced]

A support vector machine (which is known to have strong empirical performance on such text classification data) was then used to build a classifier. The SVM was fed our kernel's evaluations via a Gram matrix and was trained to discriminate between faculty and student web pages in the WebKB data set (other categories were omitted). The probability product kernel was computed for multinomials under the parameter settings $\rho = 1/2$ (Bhattacharyya kernel) and $s = 1$ as in Section 3. Comparisons were made with the (linear) dot product kernel as well as Gaussian RBF kernels. The data set contains a total of 1641 student web pages and 1124 faculty web pages. The data for each class is further split into 4 universities and 1 miscellaneous category, and we performed the usual training and testing split as described by (Lafferty and Lebanon, 2002; Joachims et al., 2001), where testing is performed on a held out university. The average error was computed from 20-fold cross-validation for the different kernels as a function of the support vector machine regularization parameter $C$ in Figure 7. The figure shows error rates for different sizes of the training set (77 and 622 training points). In addition, we show the standard deviation of the error rate for the Bhattacharyya kernel. Even though we only explored a single setting of $s = 1$, $\rho = 1/2$ for the probability
product kernel, it outperforms the linear kernel as well as the RBF kernel at multiple settings of the RBF $\sigma$ parameter (we attempted $\sigma \in \{1/4, 1, 4\}$). This result confirms previous intuitions in the text classification community which suggest using squashing functions on word frequencies such as the logarithm or the square root (Goldzmidt and Sahami, 1998; Cutting et al., 1992). This is also related to the power-transformation used in statistics known as the Box-Cox transformation, which may help make data seem more Gaussian (Davidson and MacKinnon, 1993).

8.2 Biological Sequence Classification

For discrete sequences, natural generative models to consider are Markov models and hidden Markov models. We obtained labeled gene sequences from the HS3D data set.(2) The sequences are of variable lengths and have discrete symbol entries from the 4-letter alphabet (G, A, T, C). We built classifiers to distinguish between gene sequences of two different types, introns and exons, using raw sequences of varying lengths (from dozens of characters to tens of thousands of characters). From the original (unwindowed) 4450 introns and 3752 exons extracted directly from GenBank, we selected a random subset of 500 introns and 500 exons. These 1000 sequences were then randomly split into a 50% training and a 50% testing subset.

(2) This is a collection of unprocessed intron and exon class gene sequences referred to as the homo sapiens splice site data set and was downloaded from www.sci.unisannio.it/docenti/rampone.

In the first experiment we used the training data to learn stationary hidden Markov models whose number of states $M$ was related to the length $T_n$ of a given sequence $n$ as follows:

$$M = \left\lfloor \tfrac{1}{2}\sqrt{O^2 + 4(T\zeta + O + 1)} - \tfrac{1}{2}O \right\rfloor + 1.$$

Here, $O$ is the number of possible emission symbols (for gene sequences, $O = 4$) and $\zeta$ is the ratio of parameters to the number of symbols $T$ in the sequence (we used $\zeta = 1/10$, although higher values may help avoid over-parameterizing the models). Each hidden Markov model was built from each sequence using the standard expectation-maximization algorithm (which is iterated for 400 steps to ensure convergence; this is particularly crucial for longer sequences). Gram matrices of size $1000 \times 1000$ were then formed by computing the probability product kernel between all the hidden Markov models. It was noted that all Gram matrices were positive definite by a singular value decomposition. We also used the following standard normalization of the kernel (a typical pre-processing step in the literature):

$$\tilde{k}(p, p') \leftarrow \frac{\tilde{k}(p, p')}{\sqrt{\tilde{k}(p, p)}\, \sqrt{\tilde{k}(p', p')}}.$$

Subsequently, we trained a support vector machine classifier on 500 example sequences and tested on the remaining 500.
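This normalization is a two-liner over a precomputed Gram matrix (a sketch of ours, not code from the paper):

    import numpy as np

    def normalize_gram(K):
        """Normalize a Gram matrix: K[i,j] / sqrt(K[i,i] * K[j,j])."""
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)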
Figure 8: SVM error rates for the probability product kernel at $\rho = 1$ under various levels of regularization: (a) hidden Markov models (1st order), for 5 different settings of $T$ in the probability product kernel; (b) Markov models of 0th to 4th order (equivalent to string or spectrum kernels), for 5 different settings of the order $K$. [plot not reproduced]

Figure 8(a) depicts the resulting error rate as we vary $C$, the regularization parameter of the SVM, as well as $T$, the number of time steps used by the kernel. Throughout these experiments, we only used the setting $\rho = 1$. Note that these models were 1st order hidden Markov models, where each state only depends on the previous state. Varying $T$, the length of the hidden Markov models used in the kernel computation, has a slight effect on performance, and it seems $T = 9$ at an appropriate $C$ value performed best, with an error rate as low as 10-11%.

For comparison, we also computed more standard string kernels such as the k-mer counts or spectrum kernels on the same training and testing gene sequences, as shown in Figure 8(b). These basically correspond to a fully observed (non-hidden) stationary Markov model as in Figure 9.

Figure 9: A higher order (2nd order, or $K = 3$) Markov model. [graphic not reproduced]

The zeroth order ($K = 1$) Markov model does poorly, with its lowest error rate around 34%. The first order Markov model ($K = 2$) performs better, with its lowest error rate at 18%, yet is slightly worse than first order hidden Markov models. This helps motivate the possibility that dependence on the past in first order Markov models could be better modeled by a latent state as opposed to just the previous emission. This is despite the maximum likelihood estimation issues and local minima that we must contend with when training hidden Markov models using expectation-maximization. The second order Markov models with $K = 3$ fare best, with an error rate of about 7-8%. However, higher orders ($K = 4$ and $K = 5$) seem to reduce performance. Therefore, a natural next experiment would be to consider higher order hidden Markov models, most notably second order ones where the emission distributions are conditioned on the past two states as opposed to the past two emitted symbols in the sequence.

8.3 Time Series Experiments

In this experiment we compare the performance of the probability product kernel between LDSs and the sampling approximation of the SLDS kernel with more traditional kernels and maximum likelihood techniques on synthetic data. We sampled from 20 exemplar distributions to generate synthetic time series data. The sampled distributions were 5 state, 2 dimensional SLDS models generated at random. Each model was sampled 5 times for 25 time steps, with 10 models being assigned to each class. This gave 50 time series of length 25 per class, creating a challenging classification problem. For comparison, maximum likelihood models for LDSs and SLDSs were fit to the data and used for classification, as well as SVMs trained using Gaussian RBF kernels. All testing was performed using leave one out cross validation. The SLDS kernel was approximated using sampling (50 samples), and the probability product kernels were normalized to improve performance.

The maximum likelihood LDS had an error rate of .35, the SLDS .30, and the Gaussian RBF .34. Exploring the setting of the parameters $\rho$ and $T$ for the LDS and SLDS kernels, we were able to outperform the maximum likelihood methods and the Gaussian RBF kernel, with an optimal error for the LDS and SLDS kernels of .28 and .25 respectively. Figure 10(a) and (b) show the error rate versus $T$ and $\rho$ respectively. It can be seen that increasing $T$ generally improves performance, which is to be expected, although it does appear that $T$ can be chosen too large. Meanwhile, $\rho$ generally appears to perform better when it is smaller (a counterexample is the LDS kernel at the setting $T = 5$). Overall, the kernels performed comparably to or better than standard methods for most settings of $\rho$ and $T$, with the exception of extreme settings.

9. Conclusions

Discriminative learning, statistical learning theory and kernel-based algorithms have brought mathematical rigor and practical success to the field, making it easy to overlook the advantages of generative probabilistic modeling. Yet generative models are intuitive and offer flexibility for inserting structure, handling unusual input spaces, and accommodating prior knowledge. By proposing a kernel between the models themselves, this paper provides a bridge between the two schools of thought.

The form of the kernel is almost as simple as possible, yet it still gives rise to interesting nonlinearities and properties when applied to different generative models. It can be computed explicitly for some of the most commonly used parametric distributions in statistics, such as the exponential family. For more elaborate models (graphical models, hidden Markov models, linear dynamical systems, etc.), computing the kernel reduces to using
standard algorithms in the field. Experiments show that probability product kernels hold promise in practical domains. Ultimately, engineering kernels between structured, discrete or unusual objects is an area of active research and, via generative modeling, probability product kernels can bring additional flexibility, formalism and potential.

Figure 10: A comparison of the choice of parameters (a) $T$ and (b) $\rho$ for the LDS kernel; the x-axis is the parameter and the y-axis is the error rate. [plot not reproduced]

Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. This work was funded in part by the National Science Foundation (grants IIS-0347499 and CCR-0312690).

References

Y. Bengio and P. Frasconi. Input-output HMM's for sequence processing. IEEE Transactions on Neural Networks, 7(5):1231-1249, September 1996.

A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 1943.

M. Collins and N. Duffy. Convolution kernels for natural language. In Neural Information Processing Systems 14, 2002.

C. Cortes, P. Haffner, and M. Mohri. Rational kernels. In Neural Information Processing Systems 15, 2002.

D. R. Cutting, D. R. Karger, J. O. Pederson, and J. W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings ACM/SIGIR, 1992.

R. Davidson and J. MacKinnon. Estimation and Inference in Econometrics. Oxford University Press, 1993.

Z. Ghahramani and M. Jordan. Factorial hidden Markov models. Machine Learning, 29:245-273, 1997.

M. Goldzmidt and M. Sahami. A probabilistic approach to full-text document clustering. Technical report, Stanford University, 1998. Database Group Publication 60.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer Verlag, 2001.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.

T. Jaakkola. Tutorial on variational approximation methods. In Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.

T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Neural Information Processing Systems 11, 1998.

T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Neural Information Processing Systems 12, 1999.

T. Jebara and R. Kondor. Bhattacharyya and expected likelihood kernels. In Conference on Learning Theory, 2003.

T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In International Conference on Machine Learning, 2001.

M. Jordan and C. Bishop. Introduction to Graphical Models. In progress, 2004.

H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Machine Learning: Tenth International Conference, ICML, 2003.

R. Kondor and T. Jebara. A kernel between sets of vectors. In Machine Learning: Tenth International Conference, 2003.

J. Lafferty and G. Lebanon. Information diffusion kernels. In Neural Information Processing Systems, 2002.

C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In Neural Information Processing Systems, 2002.

P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Neural Information Processing Systems, 2004.

C. Ong, A. Smola, and R. Williamson. Superkernels. In Neural Information Processing Systems, 2002.

C. S. Ong, M. Xavier, S. Canu, and A. J. Smola. Learning with non-positive kernels. In ICML, 2004.
V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Neural Information Processing Systems 13, pages 981-987, 2000.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1997.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4):253-264, 1982.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall: London, 1986.

F. Topsoe. Some inequalities for information divergence and related measures of discrimination. Journal of Inequalities in Pure and Applied Mathematics, 2(1), 1999.

K. Tsuda, T. Kin, and K. Asai. Marginalized kernels for biological sequences. Bioinformatics, 18(90001):S268-S275, 2002.

V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

S. V. N. Vishawanathan and A. J. Smola. Fast kernels for string and tree matching. In Neural Information Processing Systems 15, 2002.

C. Watkins. Advances in Kernel Methods, chapter Dynamic Alignment Kernels. MIT Press, 2000.