Journal of Machine Learning Research 1 (2000) 1-48    Submitted 4/00; Published 10/00

Learning with Mixtures of Trees

Marina Meila                                        mmp@stat.washington.edu
Department of Statistics
University of Washington
Seattle, WA 98195-4322, USA

Michael I. Jordan                                   jordan@cs.berkeley.edu
Division of Computer Science and Department of Statistics
University of California
Berkeley, CA 94720-1776, USA

Editor: Leslie Pack Kaelbling

(c) 2000 Marina Meila and Michael I. Jordan.

Abstract

This paper describes the mixtures-of-trees model, a probabilistic model for discrete multidimensional domains. Mixtures-of-trees generalize the probabilistic trees of Chow and Liu (1968) in a different and complementary direction to that of Bayesian networks. We present efficient algorithms for learning mixtures-of-trees models in maximum likelihood and Bayesian frameworks. We also discuss additional efficiencies that can be obtained when data are "sparse," and we present data structures and algorithms that exploit such sparseness. Experimental results demonstrate the performance of the model for both density estimation and classification. We also discuss the sense in which tree-based classifiers perform an implicit form of feature selection, and demonstrate a resulting insensitivity to irrelevant attributes.

1. Introduction

Probabilistic inference has become a core technology in AI, largely due to developments in graph-theoretic methods for the representation and manipulation of complex probability distributions (Pearl, 1988). Whether in their guise as directed graphs (Bayesian networks) or as undirected graphs (Markov random fields), probabilistic graphical models have a number of virtues as representations of uncertainty and as inference engines. Graphical models allow a separation between the qualitative, structural aspects of uncertain knowledge and the quantitative, parametric aspects of uncertainty: the former represented via patterns of edges in the graph and the latter represented as numerical values associated with subsets of nodes in the graph. This separation is often found to be natural by domain experts, taming some of the problems associated with structuring, interpreting, and troubleshooting the model. Even more importantly, the graph-theoretic framework has allowed for the development of general inference algorithms, which in many cases provide orders of magnitude speedups over brute-force methods (Cowell, Dawid, Lauritzen, & Spiegelhalter, 1999; Shafer & Shenoy, 1990).

These virtues have not gone unnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems
a&Jordanforclassi cation,predictionanddensityestimation(Bishop,1999;Friedman,Geiger,&Goldszmidt,1997;Heckerman,Geiger,&Chickering,1995;Hinton,Dayan,Frey,&Neal,1995;Friedman,Getoor,Koller,&Pfe er,1996;Monti&Cooper,1998;Saul&Jordan,1999).Indeed,itispossibletoviewawidevarietyofclassicalmachinelearningarchitecturesasinstancesofgraphicalmodels,andthegraphicalmodelframeworkprovidesanaturaldesignprocedureforexploringarchitecturalvariationsonclassicalthemes(Buntine,1996;Smyth,Heckerman,&Jordan,1997).Asinmanymachinelearningproblems,theproblemoflearningagraphicalmodelfromdatacanbedividedintotheproblemofparameterlearningandtheproblemofstructurelearning.Muchprogresshasbeenmadeontheformerproblem,muchofitcastwithintheframeworkoftheexpectation-maximization(EM)algorithm(Lauritzen,1995).TheEMalgorithmessentiallyrunsaprobabilisticinferencealgorithmasasubroutinetocomputethe\expectedsucientstatistics"forthedata,reducingtheparameterlearningproblemtoadecoupledsetoflocalstatisticalestimationproblemsateachnodeofthegraph.Thislinkbetweenprobabilisticinferenceandparameterlearningisanimportantone,allow-ingdevelopmentsinecientinferencetohaveimmediateimpactonresearchonlearningalgorithms.Theproblemoflearningthestructureofagraphfromdataissigni cantlyharder.Inpractice,moststructurelearningmethodsareheuristicmethodsthatperformlocalsearchbystartingwithagivengraphandimprovingitbyaddingordeletingoneedgeatatime(Heck-ermanetal.,1995;Cooper&Herskovits,1992).Thereisanimportantspecialcaseinwhichbothparameterlearningandstructurelearningaretractable,namelythecaseofgraphicalmodelsintheformofatreedistributionAsshownbyChowandLiu(1968),thetreedistributionthatmaximizesthelikelihoodofasetofobservationsonnodes|aswellastheparametersofthetree|canbefoundintimequadraticinthenumberofvariablesinthedomain.ThisalgorithmisknownastheChow-Liualgorithm.Treesalsohavethevirtuethatprobabilisticinferenceisguaranteedtobeecient,andindeedhistoricallytheearliestresearchinAIonecientinferencefocusedontrees(Pearl,1988).Laterresearchextendedthisearlyworkby rstconsideringgeneralsingly-connectedgraphs(Pearl,1988),andthenconsideringgraphswitharbitrary(acylic)patternsofcon-nectivity(Cowelletal.,1999).Thislineofresearchhasprovidedoneuseful\upgradepath"fromtreedistributionstothecomplexBayesianandMarkovnetworkscurrentlybeingstud-ied.Inthispaperweconsideranalternativeupgradepath.Inspiredbythesuccessofmixturemodelsinprovidingsimple,e ectivegeneralizationsofclassicalmethodsinmanysimplerdensityestimationsettings(MacLachlan&Bashford,1988),weconsiderageneralizationoftreedistributionsknownasthemixtures-of-trees(MT)model.AssuggestedinFigure1,theMTmodelinvolvestheprobabilisticmixtureofasetofgraphicalcomponents,eachofwhichisatree.Inthispaperwedescribelikelihood-basedalgorithmsforlearningtheparametersandstructureofsuchmodels.Onecanalsoconsiderprobabilisticmixturesofmoregeneralgraphicalmodels;indeed,thegeneralcaseistheBayesianmultinetintroducedbyGeigerandHeckerman(1996).TheBayesianmultinetisamixturemodelinwhicheachmixturecomponentisanarbitrarygraphicalmodel.TheadvantageofBayesianmultinetsovermoretraditionalgraphicalmod- LearningwithMixturesofTrees e d c a d b c a d b c a z=1z=2 z=3 Figure1:Amixtureoftreesoveradomainconsistingofrandomvariablesa;b;c;d;ewhereisahiddenchoicevariable.Conditionalonthevalueof,thedepen-dencystructureisatree.Adetailedpresentationofthemixture-of-treesmodelisprovidedinSection3.elsistheabilitytorepresentcontext-speci 
models is the ability to represent context-specific independencies: situations in which subsets of variables exhibit certain conditional independencies for some, but not all, values of a conditioning variable. (Further work on context-specific independence has been presented by Boutilier, Friedman, Goldszmidt, & Koller, 1996.) By making context-specific independencies explicit as multiple collections of edges, one can obtain (a) more parsimonious representations of joint probabilities and (b) more efficient inference algorithms.

In the machine learning setting, however, the advantages of the general Bayesian multinet formalism are less apparent. Allowing each mixture component to be a general graphical model forces us to face the difficulties of learning general graphical structure. Moreover, greedy edge-addition and edge-deletion algorithms seem particularly ill-suited to the Bayesian multinet, given that it is the focus on collections of edges rather than single edges that underlies much of the intuitive appeal of this architecture.

We view the mixture of trees as providing a reasonable compromise between the simplicity of tree distributions and the expressive power of the Bayesian multinet, while doing so within a restricted setting that leads to efficient machine learning algorithms. In particular, as we show in this paper, there is a simple generalization of the Chow-Liu algorithm that makes it possible to find (local) maxima of likelihoods (or penalized likelihoods) efficiently in general MT models. This algorithm is an iterative Expectation-Maximization (EM) algorithm, in which the inner loop (the M step) involves invoking the Chow-Liu algorithm to determine the structure and parameters of the individual mixture components. Thus, in a very concrete sense, this algorithm searches in the space of collections of edges.

In summary, the MT model is a multiple network representation that shares many of the basic features of Bayesian and Markov network representations, but brings new features to the fore. We believe that these features expand the scope of graph-theoretic probabilistic representations in useful ways and may be particularly appropriate for machine learning problems.

1.1 Related work

The MT model can be used both in the classification setting and the density estimation setting, and it makes contact with different strands of previous literature in these two guises.

In the classification setting, the MT model builds on the seminal work on tree-based classifiers by Chow and Liu (1968), and on recent extensions due to Friedman et al. (1997) and Friedman, Goldszmidt, and Lee (1998). Chow and Liu proposed to solve multi-way classification problems by fitting a separate tree to the observed variables in each of the classes, and classifying a new data point by choosing the class having maximum class-conditional probability under the corresponding tree model. Friedman et al. took as their point of departure the Naive Bayes model, which can be viewed as a graphical model in which an explicit class node has directed edges to an otherwise disconnected set of nodes representing the input variables (i.e., attributes). Introducing additional edges between the input variables yields the Tree Augmented Naive Bayes (TANB) classifier (Friedman et al., 1997; Geiger, 1992). These authors also considered a less constrained model in which different patterns of edges were allowed for each value of the class node; this is formally identical to the Chow and Liu proposal.

If the choice variable of the MT model is identified with the class label then the MT model is identical to the Chow and Liu approach (in the classification setting). However, we do not necessarily wish to identify the choice variable with the class label, and, indeed, in our experiments on classification we treat the class label as simply another input variable. This yields a more discriminative approach to
classification in which all of the training data are pooled for the purposes of training the model (Section 5; Meila & Jordan, 1998). The choice variable remains hidden, yielding a mixture model for each class. This is similar in spirit to the "mixture discriminant analysis" model of Hastie and Tibshirani (1996), where a mixture of Gaussians is used for each class in a multiway classification problem.

In the setting of density estimation, clustering and compression problems, the MT model makes contact with the large and active literature on mixture modeling. Let us briefly review some of the most salient connections. The Auto-Class model (Cheeseman & Stutz, 1995) is a mixture of factorial distributions (MF), and its excellent cost/performance ratio motivates the MT model in much the same way as the Naive Bayes model motivates the TANB model in the classification setting. (A factorial distribution is a product of factors each of which depends on exactly one variable.) Kontkanen, Myllymaki, and Tirri (1996) study a MF in which a hidden variable is used for classification; this approach was extended by Monti and Cooper (1998). The idea of learning tractable but simple belief networks and superimposing a mixture to account for the remaining dependencies was developed independently of our work by Thiesson, Meek, Chickering, and Heckerman (1997), who studied mixtures of Gaussian belief networks. Their work interleaves EM parameter search with Bayesian model search in a heuristic but general algorithm.

2. Tree distributions

In this section we introduce the tree model and the notation that will be used throughout the paper. Let V denote a set of n discrete random variables of interest. For each random variable v in V let \Omega(v) represent its range, x_v in \Omega(v) a particular value, and r_v the (finite) cardinality of \Omega(v). For each subset A of V, let \Omega(A) denote the product of the ranges of the variables in A and let x_A denote an assignment to the variables in A. To simplify notation, x_V will be denoted by x and \Omega(V) will be denoted simply by \Omega. Sometimes we need to refer to the maximum of r_v over V; we denote this value by r_MAX.

We begin with undirected (Markov random field) representations of tree distributions. Identifying the vertex set of a graph with the set V of random variables, consider a graph G = (V, E), where E is a set of undirected edges. We allow a tree to have multiple connected components (thus our "trees" are generally called forests). Given this definition, the number of edges |E| and the number of connected components p are related by |E| + p = n, implying that adding an edge to a tree reduces the number of connected components by 1. Thus, a tree can have at most n - 1 edges. In this latter case we refer to the tree as a spanning tree.

We parameterize a tree in the following way. For u, v in V and (u, v) in E, let T_uv denote a joint probability distribution on x_u and x_v. We require these distributions to be consistent with respect to marginalization, denoting by T_v(x_v) the marginal of T_uv(x_u, x_v) with respect to x_u for any u. We now assign a distribution T to the graph (V, E) as follows:

    T(x) = \prod_{(u,v)\in E} T_uv(x_u, x_v) / \prod_{v\in V} T_v(x_v)^{\deg v - 1}        (1)

where deg v is the degree of vertex v, i.e., the number of edges incident to v. It can be verified that T is in fact a probability distribution; moreover, the pairwise probabilities T_uv are the marginals of T. A tree distribution T is defined to be any distribution that admits a factorization of the form (1).

Tree distributions can also be represented using directed (Bayesian network) graphical models. Let (V, E') denote a directed tree (possibly a forest), where E' is a set of directed edges and where each node v has (at most) one parent, denoted pa(v). We parameterize this graph as follows:

    T(x) = \prod_{v\in V} T_{v|pa(v)}(x_v | x_{pa(v)})        (2)

where T_{v|pa(v)}(x_v | x_{pa(v)}) is an arbitrary conditional distribution. It can be verified that T indeed defines a probability distribution; moreover, the marginal conditionals of T are given by the conditionals T_{v|pa(v)}. We shall call the representations (1) and (2) the undirected and directed tree representations of the distribution T, respectively.
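To make the directed representation (2) concrete, here is a minimal sketch (not part of the original paper; the container names parent and cpt are our own conventions) of how such a factorization can be evaluated for a complete assignment:

```python
import numpy as np

# Hypothetical containers: parent[v] is the parent index of v (or -1 for a
# root) and cpt[v] is a table with cpt[v][x_pa, x_v] = T_{v|pa(v)}(x_v | x_pa);
# root tables are stored with a single dummy parent row 0.
def tree_log_prob(x, parent, cpt):
    """Log-probability of a complete assignment x under the factorization (2)."""
    logp = 0.0
    for v, pa in enumerate(parent):
        x_pa = 0 if pa < 0 else x[pa]
        logp += np.log(cpt[v][x_pa, x[v]])
    return logp
```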
We can readily convert between these representations; for example, to convert (1) to a directed representation we choose an arbitrary root in each connected component and direct each edge away from the root. For (u, v) in E with u closer to the root than v, let pa(v) = u. Now compute the conditional probabilities corresponding to each directed edge by recursively substituting T_{v|pa(v)} = T_{v,pa(v)} / T_{pa(v)}, starting from the root. Figure 2 illustrates this process on a tree with 5 vertices.

Figure 2: A tree in its undirected (a) and directed (b) representations.

The directed tree representation has the advantage of having independent parameters. The total number of free parameters in either representation is:

    \sum_{(u,v)\in E} (r_u r_v - 1) - \sum_{v\in V} (\deg v - 1)(r_v - 1).        (3)

The right-hand side of (3) shows that each edge (u, v) increases the number of parameters by (r_u - 1)(r_v - 1).

The set of conditional independencies associated with a tree distribution are readily characterized (Lauritzen, Dawid, Larsen, & Leimer, 1990). In particular, two subsets A, B of V are independent given C if C intersects every path (ignoring the direction of edges in the directed case) between u and v, for all u in A and v in B.

2.1 Marginalization, inference and sampling in tree distributions

The basic operations of computing likelihoods, conditioning, marginalization, sampling and inference can be performed efficiently in tree distributions; in particular, each of these operations has time complexity O(n). This is a direct consequence of the factorized representation of tree distributions in equations (1) and (2).

2.2 Representational capabilities

If graphical representations are natural for human intuition, then the subclass of tree models are particularly intuitive. Trees are sparse graphs, having n - 1 or fewer edges. There is at most one path between every pair of variables; thus, independence relationships between subsets of variables, which are not easy to read out in general Bayesian network topologies, are obvious in a tree. In a tree, an edge corresponds to the simple, common-sense notion of direct dependency and is the natural representation for it. However, the very simplicity that makes tree models appealing also limits their modeling power. Note that the number of free parameters in a tree grows linearly with n while the size of the state space \Omega(V) is an exponential function of n. Thus the class of dependency structures representable by trees is a relatively small one.

2.3 Learning of tree distributions

The learning problem is formulated as follows: we are given a set of observations D = {x^1, x^2, ..., x^N} and we are required to find the tree T* that maximizes the log likelihood of the data:

    T* = \arg\max_T \sum_{i=1}^N \log T(x^i),

where x^i is an assignment of values to all variables. Note that the maximum is taken both with respect to the tree structure (the choice of which edges to include) and with respect to the numerical values of the parameters. Here and in the rest of the paper we will assume for simplicity that there are no missing values for the variables in V, or, in other words, that the observations are complete.

Letting P(x) denote the proportion of observations in the training set that are equal to x, we can alternatively express the maximum likelihood problem by summing over configurations x:

    T* = \arg\max_T \sum_x P(x) \log T(x).

In this form we see that the log likelihood criterion function is a (negative) cross-entropy. We will in fact solve the problem in general, letting P(x) be an arbitrary probability distribution. This generality will prove useful in the following section where we consider mixtures of trees.

The solution to the learning problem is an algorithm, due to Chow and Liu (1968), that has quadratic complexity in n (see Figure 3). There are three steps to the algorithm. First, we compute the pairwise marginals P_uv(x_u, x_v). If P is an empirical distribution, as in the present case, computing these marginals requires O(n^2 N) operations. Second, from these marginals we compute the mutual information between each pair of variables in V under the distribution P:

    I_uv = \sum_{x_u, x_v} P_uv(x_u, x_v) \log [ P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v)) ],   for u, v in V, u \neq v,

an operation that requires O(n^2 r_MAX^2) operations.
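The first two steps can be sketched as follows (an illustration under our own conventions, not the authors' code); the weights argument anticipates the weighted target distributions P^k that appear in Section 3 and reduces to 1/N per point in the plain maximum likelihood case:

```python
import numpy as np

def mutual_information(data, weights, u, v, r_u, r_v, eps=1e-12):
    """Empirical mutual information I_uv between discrete columns u and v.

    data is an (N, n) integer array and weights holds the point weights P(x^i);
    for plain maximum likelihood estimation each weight is simply 1/N.
    """
    P_uv = np.zeros((r_u, r_v))
    for x, w in zip(data, weights):
        P_uv[x[u], x[v]] += w                     # step 1: pairwise marginal
    P_u = P_uv.sum(axis=1, keepdims=True)
    P_v = P_uv.sum(axis=0, keepdims=True)
    ratio = P_uv / np.maximum(P_u * P_v, eps)     # step 2: I_uv from the marginals
    return float(np.sum(P_uv * np.log(np.maximum(ratio, eps))))
```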
Third, we run a maximum-weight spanning tree (MWST) algorithm (Cormen, Leiserson, & Rivest, 1990), using I_uv as the weight for edge (u, v), for all u, v in V. Such algorithms, which run in time O(n^2 log n), return a spanning tree that maximizes the total mutual information for edges included in the tree.

Chow and Liu showed that the maximum-weight spanning tree also maximizes the likelihood over tree distributions, and moreover the optimizing parameters T_uv(x_u, x_v), for (u, v) in the tree, are equal to the corresponding marginals P_uv of the distribution P. The algorithm thus attains a global optimum over both structure and parameters.

    Algorithm ChowLiu(P)
    Input: distribution P over domain V;
           procedure MWST(weights) that outputs a maximum weight spanning tree over V
    1. Compute marginal distributions P_v, P_uv for u, v in V
    2. Compute mutual information values I_uv for u, v in V
    3. E_T = MWST({I_uv})
    4. Set T_uv = P_uv for (u, v) in E_T
    Output: T

Figure 3: The Chow and Liu algorithm for maximum likelihood estimation of tree structure and parameters.

3. Mixtures of trees

We define a mixture-of-trees (MT) model to be a distribution of the form:

    Q(x) = \sum_{k=1}^m \lambda_k T^k(x),   with \lambda_k \geq 0, k = 1, ..., m,  and  \sum_{k=1}^m \lambda_k = 1.

The tree distributions T^k are the mixture components and the \lambda_k are called mixture coefficients. A mixture of trees can be viewed as containing an unobserved choice variable z, which takes value k in {1, ..., m} with probability \lambda_k. Conditioned on the value of z, the distribution of the observed variables is a tree. The trees may have different structures and different parameters. In Figure 1, for example, we have a mixture of trees with m = 3 and n = 5.

Note that because of the varying structure of the component trees, a mixture of trees is neither a Bayesian network nor a Markov random field. Let us adopt the notation A \perp_Q B | C for "A is independent of B given C under distribution Q". If for some (or all) k = 1, ..., m we have A \perp_{T^k} B | C, with A, B, C subsets of V, this will not imply that A \perp_Q B | C. On the other hand, a mixture of trees is capable of representing dependency structures that are conditioned on the value of a variable (the choice variable), something that a usual Bayesian network or Markov random field cannot do. Situations where such a model is potentially useful abound in real life. Consider for example bitmaps of handwritten digits. Such images obviously contain many dependencies between pixels; however, the pattern of these dependencies will vary across digits. Or imagine a medical database recording the body weight and other data for each patient. The body weight could be a function of age and height for a healthy person, but it would depend on other conditions if the patient suffered from a disease or were an athlete. If, in a situation like the ones mentioned above, conditioning on one variable produces a dependency structure characterized by sparse, acyclic pairwise dependencies, then a mixture of trees may provide a good model of the domain.

If we constrain all of the trees in the mixture to have the same structure we obtain a mixture of trees with shared structure (MTSS; see Figure 4). In the case of the MTSS, if A \perp_{T^k} B | C for some (hence all) k = 1, ..., m, then A \perp_Q B | C \cup {z}. In addition, a MTSS can be represented as a Bayesian network (Figure 4, a), as a Markov random field (Figure 4, b) and as a chain graph (Figure 5). Chain graphs were introduced by Lauritzen (1996); they represent a superclass of both Bayesian networks and Markov random fields. A chain graph contains both directed and undirected edges.

Figure 4: A mixture of trees with shared structure (MTSS) represented as a Bayes net (a) and as a Markov random field (b).

While we generally consider problems in which the choice variable is hidden (i.e., unobserved), it is also possible to utilize both the MT and the MTSS frameworks in which the choice variable is observed. Such models, which, as we discuss in Section 1.1, have been studied previously by Friedman et al. (1997) and Friedman et al. (1998), will be referred to generically as mixtures with observed choice variable. Unless stated otherwise, it will be assumed that the choice variable is hidden.
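A minimal sketch of evaluating Q(x) under these definitions (our own illustration; tree_log_probs stands for any list of per-component evaluators, such as the tree_log_prob sketch from Section 2):

```python
import numpy as np

def mixture_log_prob(x, lambdas, tree_log_probs):
    """log Q(x) = log sum_k lambda_k T^k(x), evaluated stably in log space.

    tree_log_probs is a list of callables, one per mixture component, each
    returning log T^k(x).
    """
    logs = np.array([np.log(lam) + f(x) for lam, f in zip(lambdas, tree_log_probs)])
    top = logs.max()
    return float(top + np.log(np.exp(logs - top).sum()))
```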
a&Jordan 345Figure5:AMTSSrepresentedasachaingraph.Thedoubleboxesenclosetheundirectedblocksofthechaingraph.3.1Marginalization,inferenceandsamplinginMTmodelsLet)beanMTmodel.Weconsiderthebasicoperationsofmarginal-ization,inferenceandsampling,recallingthattheseoperationshavetimecomplexityforeachofthecomponenttreedistributionsMarginalizationThemarginaldistributionofasubsetisgivenasfollows:Hence,themarginalofisamixtureofthemarginalsofthecomponenttrees.InferenceLetbetheevidence.ThentheprobabilityofthehiddenvariablegivenevidenceisobtainedbyapplyingBayes'ruleasfollows: Inparticular,whenweobserveallbutthechoicevariable,i.e.,and,weobtaintheposteriorprobabilitydistribution Theprobabilitydistributionofagivensubsetofgiventheevidenceis PmkkTkV0(xV0)=mXkz=kjV0=xV0]TkAjV0(xAjxV0): LearningwithMixturesofTreesThustheresultisagainamixtureoftheresultsofinferenceproceduresrunonthecompo-nenttrees.Theprocedureforsamplingfromamixtureoftreesisatwostageprocess: rstonesamplesavalueforthechoicevariablefromitsdistribution(;:::thenavalueissampledfromusingtheprocedureforsamplingfromatreedistribution.Insummary,thebasicoperationsonmixturesoftrees,marginalization,conditioningandsampling,areachievedbyperformingthecorrespondingoperationoneachcomponentofthemixtureandthencombiningtheresults.Therefore,thecomplexityoftheseoperationsscaleslinearlywiththenumberoftreesinthemixture.3.2LearningofMTmodelsTheexpectation-maximization(EM)algorithmprovidesane ectiveapproachtosolvingmanylearningproblems(Dempster,Laird,&Rubin,1977;MacLachlan&Bashford,1988),andhasbeenemployedwithparticularsuccessinthesettingofmixturemodelsandmoregenerallatentvariablemodels(Jordan&Jacobs,1994;Jelinek,1997;Rubin&Thayer,1983).InthissectionweshowthatEMalsoprovidesanaturalapproachtothelearningproblemfortheMTmodel.Animportantfeatureofthesolutionthatwepresentisthatitprovidesestimatesforbothparametricandstructuralaspectsofthemodel.Inparticular,althoughweassume(inthecurrentsection)thatthenumberoftreesis xed,thealgorithmthatwederiveprovidesestimatesforboththepatternofedgesoftheindividualtreesandtheirparameters.Theseestimatesaremaximumlikelihoodestimates,andalthoughtheyaresubjecttoconcernsaboutover tting,theconstrainednatureoftreedistributionshelpstoameliorateover ttingproblems.Itisalsopossibletocontrolthenumberofedgesindirectlyviapriors;wediscussmethodsfordoingthisinSection4.Wearegivenasetofobservations;:::;xandarerequiredto ndthemixtureoftreesthatsatis esargmaxWithintheframeworkoftheEMalgorithm,thislikelihoodfunctionisreferredtoastheincompletelog-likelihoodanditiscontrastedwiththefollowingcompletelog-likelihoodfunc-tion:;:::Nk;zk;z+logwherek;zisequaltooneifisequaltothethvalueofthechoicevariableandzeroother-wise.Thecompletelog-likelihoodwouldbethelog-likelihoodofthedataiftheunobserved;:::;zcouldbeobserved. 
a&JordanTheideaoftheEMalgorithmistoutilizethecompletelog-likelihood,whichisgener-allyeasytomaximize,asasurrogatefortheincompletelog-likelihood,whichisgenerallysomewhatlesseasytomaximizedirectly.Inparticular,thealgorithmgoesuphillintheexpectedvalueofthecompletelog-likelihood,wheretheexpectationistakenwithrespecttotheunobserveddata.Thealgorithmthushastheformofaninteractingpairofsteps:theEstep,inwhichtheexpectationiscomputedgiventhecurrentvalueoftheparameters,andtheMstep,inwhichtheparametersareadjustedsoastomaximizetheexpectedcompletelog-likelihood.Thesetwostepsiterateandareprovedtoconvergetoalocalmaximumofthe(incomplete)log-likelihood(Dempsteretal.,1977).Takingtheexpectationof(5),weseethattheEstepfortheMTmodelreducestotakingtheexpectationofthedeltafunctionk;z,conditionedonthedataatak;zzzi=kjD]=Pr[zi=kjV=xi];andthislatterquantityisrecognizableastheposteriorprobabilityofthehiddenvariablegiventhethobservation(cf.equation(4)).Letusde ne: asthisposteriorprobability.Substituting(6)intotheexpectedvalueofthecompletelog-likelihoodin(5),weobtain:in:lc(x1;:::N)]=+logLetusde nethefollowingquantities:;:::m wherethesums��;N]canbeinterpretedasthetotalnumberofdatapointsthataregeneratedbycomponent.Usingthesede nitionsweobtain:n:lc(x1;:::N;:::N)]=)logItisthisquantitythatwemustmaximizewithrespecttotheparameters.From(7)weseethatatlc]separatesintotermsthatdependondisjointsubsetsofthemodelparametersandthustheMstepdecouplesintoseparatemaximizationsforeachofthevariousparameters.Maximizingthe rsttermof(7)withrespecttotheparameterssubjecttotheconstraint=1,weobtainthefollowingupdateequation: for;:::m: LearningwithMixturesofTrees AlgorithmMixTreeDataset;:::xInitialmodelm;T;:::mProcedureChowLiuuntilconvergence:Estep:Compute)for;:::m;i;:::NMstep:for;:::mChowLiuOutput:Modelm;T;:::m Figure6:TheMixTreealgorithmforlearningMTmodels.Inordertoupdate,weseethatwemustmaximizethenegativecross-entropybetweenand)logThisproblemissolvedbytheChowLiualgorithmfromSection2.3.ThusweseethattheMstepforlearningthemixturecomponentsofMTmodelsreducestoseparaterunsoftheChowLiualgorithm,wherethe\target"distribution)isthenormalizedposteriorprobabilityobtainedfromtheEstep.WesummarizetheresultsofthederivationoftheEMalgorithmformixturesoftreesinFigure6.3.2.1RunningtimeComputingthelikelihoodofadatapointunderatreedistribution(inthedirectedtreerepresentation)takes)multiplications.Hence,theEsteprequires oatingpointoperations.AsfortheMstep,themostcomputationallyexpensivephaseisthecomputationofthemarginalsforthethtree.Thisstephastimecomplexity).AnotherMAX)and)arerequiredateachiterationforthemutualinformationsandfortherunsoftheMWSTalgorithm.FinallyweneedMAX)forcomputingthetreeparametersinthedirectedrepresentation.ThetotalrunningtimeperEMiterationisthusMAXThealgorithmispolynomial(periteration)inthedimensionofthedomain,thenumberofcomponentsandthesizeofthedataset. 
a&JordanThespacecomplexityisalsopolynomial,andisdominatedbyMAX),thespaceneededtostorethepairwisemarginaltables(thetablescanbeoverwrittenbysuccessivevaluesof3.3LearningmixturesoftreeswithsharedstructureItispossibletomodifytheMixTreealgorithmsoastoconstrainthetreestosharethesamestructure,andtherebyestimateMTSSmodels.TheEstepremainsunchanged.Theonlynoveltyisthereestimationofthetreedistri-butionsintheMstep,sincetheyarenowconstrainedtohavethesamestructure.Thus,themaximizationcannotbedecoupledintoseparatetreeestimationsbut,remarkablyenough,itcanstillbeperformedeciently.Itcanbereadilyveri edthatforanygivenstructuretheoptimalparametersofeachtreeedgeareequaltotheparametersofthecorrespondingmarginaldistributionItremainsonlyto ndtheoptimalstructure.Theexpressiontobeoptimizedisthesecondsumintheright-handsideofequation(7).Byreplacingwithanddenotingthemutualinformationbetweenandunderthissumcanbereexpressedasfollows:)loggX(u;v)2EIkuv�Xv2VH(Pkv)]=NX(u;v)2EIuvjz�NXv2VH(vjz):(8)Thenewquantityappearingaboverepresentsthemutualinformationofandconditionedonthehiddenvariable.Itsgeneralde nitionforthreediscretevariablesu;v;zdistributedaccordingtouvz)log Thesecondtermin(8),),representsthesumoftheconditionalentropiesofthevariablesgivenandisindependentofthetreestructure.Hence,theoptimizationofthestructurecanbeachievedbyrunningaMWSTalgorithmwiththeedgeweightsrepresentedby.WesummarizethealgorithminFigure7.4.DecomposablepriorsandMAPestimationformixturesoftreesTheBayesianlearningframeworkcombinesinformationobtainedfromdirectobservationswithpriorknowledgeaboutthemodel,whenthelatterisrepresentedasaprobabilitydistribution.TheobjectofinterestofBayesiananalysisistheposteriordistributionoverthemodelsgiventheobserveddata,ata,QjD],aquantitywhichcanrarelybecalculatedexplicitly.Practicalmethodsforapproximatingtheposteriorincludechoosingasinglemaximumaposteriori(MAP)estimate,replacingthecontinuousspaceofmodelsbya nitesetofhighposteriorprobability(Heckermanetal.,1995),andexpandingtheposteriorarounditsmode(s)(Cheeseman&Stutz,1995). 
4. Decomposable priors and MAP estimation for mixtures of trees

The Bayesian learning framework combines information obtained from direct observations with prior knowledge about the model, when the latter is represented as a probability distribution. The object of interest of Bayesian analysis is the posterior distribution over the models given the observed data, Pr[Q | D], a quantity which can rarely be calculated explicitly. Practical methods for approximating the posterior include choosing a single maximum a posteriori (MAP) estimate, replacing the continuous space of models by a finite set of high posterior probability (Heckerman et al., 1995), and expanding the posterior around its mode(s) (Cheeseman & Stutz, 1995).

Finding the local maxima (modes) of the distribution Pr[Q | D] is a necessary step in all the above methods and is our primary concern in this section. We demonstrate that maximum a posteriori modes can be found as efficiently as maximum likelihood modes, given a particular choice of prior. This has two consequences: First, it makes approximate Bayesian averaging possible. Second, if one uses a non-informative prior, then MAP estimation is equivalent to Bayesian smoothing, and represents a form of regularization. Regularization is particularly useful in the case of small datasets in order to prevent overfitting.

4.1 MAP estimation by the EM algorithm

For a model Q and dataset D the logarithm of the posterior Pr[Q | D] equals

    \log Pr[Q] + \sum_{x \in D} \log Q(x)

plus an additive constant. The EM algorithm can be adapted to maximize the log posterior for every fixed m (Neal & Hinton, 1999). Indeed, by comparing with equation (5) we see that the quantity to be maximized is now:

    E[ \log Pr[Q | x^1, ..., x^N, z^1, ..., z^N] ] = \log Pr[Q] + E[ l_c(x^1, ..., x^N, z^1, ..., z^N) ].        (9)

The prior term does not influence the E step of the EM algorithm, which proceeds exactly as before (cf. equation (6)). To be able to successfully maximize the right-hand side of (9) in the M step we require that \log Pr[Q] decomposes into a sum of independent terms matching the decomposition of E[l_c] in (7). A prior over mixtures of trees that is amenable to this decomposition is called a decomposable prior. It will have the following product form:

    Pr[Q] = Pr[\lambda_1, ..., \lambda_m] \prod_{k=1}^m ( Pr[E_k] \prod_{v \in V} Pr[T^k_{v|pa(v)}] ),

where the first factor inside the product represents the prior of the tree structure while the second factor is the prior for the parameters. Requiring that the prior be decomposable is equivalent to making several independence assumptions: in particular, it means that the parameters of each tree in the mixture are independent of the parameters of all the other trees as well as of the probability of the mixture variable. In the following section, we show that these assumptions are not overly restrictive, by constructing decomposable priors for tree structures and parameters and showing that this class is rich enough to contain members that are of practical importance.

4.2 Decomposable priors for tree structures

The general form of a decomposable prior for the tree structure is one where each edge contributes a constant factor independent of the presence or absence of other edges in E:

    Pr[E] \propto \prod_{(u,v) \in E} \exp(-\beta_uv).

With this prior, the expression to be maximized in the M step of the EM algorithm becomes

    \sum_{k=1}^m \Gamma_k [ \log \lambda_k + \sum_i P^k(x^i) \log T^k(x^i) ] - \sum_{k=1}^m \sum_{(u,v) \in E_k} \beta_uv.

Consequently, each edge weight in tree k is adjusted by the corresponding value \beta_uv divided by the total number of points that tree k is responsible for:

    W^k_uv = I^k_uv - \beta_uv / \Gamma_k.

A negative \beta_uv increases the probability of (u, v) being present in the final solution, whereas a positive value of \beta_uv acts like a penalty on the presence of edge (u, v) in the tree. If the procedure is modified so as to not add negative-weight edges, one can obtain (disconnected) trees having fewer than n - 1 edges. Note that the strength of the prior is inversely proportional to \Gamma_k, the total number of data points assigned to mixture component k. Thus, with equal priors for all trees, trees accounting for fewer data points will be penalized more strongly and therefore will be likely to have fewer edges.
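As a sketch (ours, with hypothetical array names), the change to the M step amounts to a simple adjustment of the edge-weight matrix before the spanning tree is built:

```python
def penalized_edge_weights(I_k, beta, Gamma_k):
    """MAP edge weights for component k: W^k_uv = I^k_uv - beta_uv / Gamma_k.

    I_k and beta are (n, n) arrays.  Edges whose adjusted weight is negative
    should simply not be added, which can leave a forest with fewer than
    n - 1 edges."""
    return I_k - beta / Gamma_k
```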
If one chooses the edge penalties to be proportional to the increase in the number of parameters caused by the addition of edge (u, v) to the tree, for example \beta_uv = (r_u - 1)(r_v - 1) (\log N) / 2, then a Minimum Description Length (MDL) (Rissanen, 1989) type of prior is implemented. In the context of learning Bayesian networks, Heckerman et al. (1995) suggested the following prior:

    Pr[E] \propto \kappa^{\Delta(E, E*)},

where \Delta() is a distance metric between Bayes net structures and E* is the prior network structure. Thus, this prior penalizes deviations from the prior network. This prior is decomposable, entailing an edge penalty \beta_uv that is determined by whether or not edge (u, v) appears in E*.

Decomposable priors on structure can also be used when the structure is common for all trees (MTSS). In this case the effect of the prior is to penalize the weight I_{uv|z} in (8) by \beta_uv / N.

A decomposable prior has the remarkable property that its normalization constant can be computed exactly in closed form. This makes it possible not only to completely define the prior, but also to compute averages under this prior (e.g., to compute a model's evidence). Given that the number of all undirected tree structures over n variables is n^{n-2}, this result (Meila & Jaakkola, 2000) is quite surprising.

4.3 Decomposable priors for tree parameters

The decomposable prior for parameters that we introduce is a Dirichlet prior (Heckerman et al., 1995). The Dirichlet distribution is defined over the domain of (\theta_1, ..., \theta_r), with \theta_i \geq 0 and \sum_i \theta_i = 1, and has the form

    D(\theta_1, ..., \theta_r; N'_1, ..., N'_r) \propto \prod_{i=1}^r \theta_i^{N'_i - 1}.

The numbers N'_1, ..., N'_r > 0 that parametrize D can be interpreted as the sufficient statistics of a "fictitious dataset" of size N' = \sum_i N'_i. Therefore the N'_i are called fictitious counts; N' represents the strength of the prior.

To specify a prior for tree parameters, one must specify a Dirichlet distribution for each of the probability tables T_{v|pa(v)}, for each possible tree structure. This is achieved by means of a set of fictitious counts N'_uv(x_u, x_v), consistent under marginalization, for all u, v in V. With these settings, the prior for the parameters of any tree that contains the directed edge (u -> v) is defined by the corresponding counts N'_uv. This representation of the prior is not only compact (of order n^2 r_MAX^2 parameters) but it is also consistent: two different directed parametrizations of the same tree distribution receive the same prior. The assumptions allowing us to define this prior are explicated by Meila and Jaakkola (2000) and parallel the reasoning of Heckerman et al. (1995) for general Bayes nets.

Denote by P the empirical distribution obtained from a dataset of size N and by P' the distribution defined by the fictitious counts N'. Then, by a property of the Dirichlet distribution (Heckerman et al., 1995) it follows that learning a MAP tree is equivalent to learning an ML tree for the weighted combination of the two "datasets":

    \tilde{P} = (N P + N' P') / (N + N').        (10)

Consequently, the parameters of the optimal tree will be the corresponding marginals of \tilde{P}. For a mixture of trees, maximizing the posterior translates into replacing N and P by \Gamma_k and P^k in equation (10) above. This implies that the M step of the EM algorithm, as well as the E step, is exact and tractable in the case of MAP estimation with decomposable priors.

Finally, note that the posteriors Pr[Q | D] for models with different m are defined up to a constant that depends on m. Hence, one cannot compare posteriors of MTs with different numbers of mixture components. In the experiments that we present, we chose other performance criteria: validation set likelihood in the density estimation experiments and validation set classification accuracy in the classification tasks.
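Equation (10) translates into a one-line adjustment of the pairwise marginals before running ChowLiu; a sketch under our own naming:

```python
def smoothed_pairwise_marginal(P_uv, N, P_prior_uv, N_prior):
    """MAP pairwise marginal under the decomposable Dirichlet prior: the
    weighted combination (N * P_uv + N' * P'_uv) / (N + N') of equation (10).
    For mixture component k, N is replaced by Gamma_k and P_uv by P^k_uv."""
    return (N * P_uv + N_prior * P_prior_uv) / (N + N_prior)
```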
5. Experiments

This section describes the experiments that were run in order to assess the promise of the MT model. The first experiments are structure identification experiments; they examine the ability of the MixTree algorithm to recover the original distribution when the data are generated by a mixture of trees. The next group of experiments studies the performance of the MT model as a density estimator; the data used in these experiments are not generated by mixtures of trees. Finally, we perform classification experiments, studying both the MT model and a single tree model. Comparisons are made with classifiers trained in both supervised and unsupervised mode. The section ends with a discussion of the single tree classifier and its feature selection properties.

In all of the experiments the training algorithm is initialized at random, independently of the data. Unless stated otherwise, the learning algorithm is run until convergence. Log-likelihoods are expressed in bits/example and therefore are sometimes called compression rates. The lower the value of the compression rate, the better the fit to the data.

In the experiments that involve small datasets we use the Bayesian methods that we discussed in Section 4 to impose a penalty on complex models. In order to regularize model structure we use a decomposable prior over tree edges with \beta_uv = \beta > 0. To regularize model parameters we use a Dirichlet prior derived from the pairwise marginal distributions for the dataset. This approach is known as smoothing with the marginal (Friedman et al., 1997; Ney, Essen, & Kneser, 1994). In particular, we set the parameters characterizing the Dirichlet prior for tree k by apportioning a fixed smoothing coefficient \alpha equally between the variables and in an amount that is inversely proportional to \Gamma_k between the components. Intuitively, the effect of this operation is to make the trees more similar to each other, thereby reducing the effective model complexity.

5.1 Structure identification

5.1.1 Random trees, large dataset

For the first structure identification experiment, we generated a mixture of m = 5 trees over 30 variables with each vertex having r = 4 values. The distribution of the choice variable as well as each tree's structure and parameters were sampled at random. The mixture was used to generate 30,000 data points that were used as a training set for a MixTree algorithm. The initial model had m = 5 components but otherwise was random. We compared the structure of the learned model with the generative model and computed the likelihoods of both the learned and the original model on a test dataset consisting of 1000 points.

The algorithm was quite successful at identifying the original trees: out of 10 trials, the algorithm failed to identify correctly only 1 tree in 1 trial. Moreover, this result can be accounted for by sampling noise; the tree that wasn't identified had a mixture coefficient of only 0.002. The difference between the log likelihood of the samples of the generating model and the approximating model was 0.41 bits per example.

5.1.2 Random bars, small dataset

Figure 8: Eight training examples for the bars learning task.

The "bars" problem is a benchmark structure learning problem for unsupervised learning algorithms in the neural network literature (Dayan & Zemel, 1995). The domain is the l x l square of binary variables depicted in Figure 8. The data are generated in the following manner: first, one flips a fair coin to decide whether to generate horizontal or vertical bars; this represents the hidden variable in our model. Then, each of the l bars is turned on independently (black in Figure 8) with probability p_bar. Finally, noise is added by flipping each bit of the image independently with probability p_noise. A learner is shown data generated by this process; the task of the learner is to discover the data generating mechanism. A sketch of the generative process is given after the figure captions below.

Figure 9: The true structure of the probabilistic generative model for the bars data.

A mixture of trees model that approximates the true structure for low noise levels is shown in Figure 10. Note that any tree over the variables forming a bar is an equally good approximation. Thus, we will consider that the structure has been discovered when the model learns a mixture with m = 2, each T^k having l connected components, one for each bar.

Figure 10: A mixture of trees that approximates the generative model for the bars problem. The interconnections between the variables in each "bar" are arbitrary.
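The generative process for the bars data can be sketched as follows (our own illustration; p_bar is not stated explicitly in the text, so the value below is only a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bars(l=5, p_bar=0.25, p_noise=0.02):
    """One l x l binary bars image following the generative story in the text."""
    img = np.zeros((l, l), dtype=int)
    horizontal = rng.random() < 0.5          # fair coin: the hidden orientation
    for b in range(l):
        if rng.random() < p_bar:             # each bar turned on independently
            if horizontal:
                img[b, :] = 1
            else:
                img[:, b] = 1
    flip = rng.random((l, l)) < p_noise      # independent pixel noise
    img = np.where(flip, 1 - img, img)
    return img.ravel(), int(horizontal)      # flattened image, hidden label
```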
Additionally, we shall test the classification accuracy of the learned model by comparing the true value of the hidden variable (i.e., "horizontal" or "vertical") with the value estimated by the model for each data point in a test set. As seen in the first row, third column of Figure 8, some training set examples are ambiguous. We retained these ambiguous examples in the training set. The total training set size was N_train = 400. We trained models with m = 2, 3, ..., and evaluated the models on a validation set of size 100 to choose the final values of m and of the smoothing parameter \alpha. Typical values for l in the literature are around 5; we choose l = 5 following Dayan and Zemel (1995). The other parameter values were p_noise = 0.02 and N_test = 200. To obtain trees with several connected components we used a small edge penalty \beta = 5.

Figure 11: Test set log-likelihood on the bars learning task for different values of the smoothing parameter \alpha and different m. The figure presents averages and standard deviations over 20 trials.

The validation-set log-likelihoods (in bits) for m = 2, 3 are given in Figure 11. Clearly, m = 2 is the best model. For m = 2 we examined the resulting structures: in 19 out of 20 trials, structure recovery was perfect. Interestingly, this result held for the whole range of the smoothing parameter, not simply at the cross-validated value. By way of comparison, Dayan and Zemel (1995) examined two training methods and the structure was recovered in 27 and respectively 69 cases out of 100.

The ability of the learned representation to categorize new examples as coming from one group or the other is referred to as classification performance and is shown in Table 1. The result reported is obtained on a separate test set for the final, cross-validated value of \alpha. Note that, due to the presence of ambiguous examples, no model can achieve perfect classification. The probability of an ambiguous example, P_ambig, yields a minimum achievable error rate of 0.5 P_ambig = 0.125.

    Table 1: Results on the bars learning task.

    Test set                        ambiguous            unambiguous
    l_test [bits/data point]        9.82 +/- 0.95        13.67
    Class accuracy                  0.852 +/- 0.076      0.951

Comparing this lower bound with the value in the corresponding column of Table 1 shows that the model performs quite well, even when trained on ambiguous examples. To further support this conclusion, a second test set of size 200 was generated, this time including only non-ambiguous examples. The classification performance, shown in the corresponding section of Table 1, rose to 0.95. The table also shows the likelihood of the (test) data evaluated on the learned model. For the first, "ambiguous" test set, this is 9.82, that is, 1.67 bits away from the true model likelihood of 8.15 bits/data point. For the "non-ambiguous" test set, the compression rate is significantly worse, which is not surprising given that the distribution of the test set is now different from the distribution the model was trained on.

5.2 Density estimation experiments

In this section we present the results of three experiments that study the mixture of trees in the density estimation setting.

5.2.1 Digits and digit pairs images

Our first density estimation experiment involved a subset of binary vector representations of handwritten digits. The datasets consist of normalized and quantized 8x8 binary images of handwritten digits made available by the US Postal Service Office for Advanced Technology.

Figure 12: An example of a digit pair.
Table 2: Average log-likelihood (bits per digit) for the single digit (Digits) and double digit (Pairs) datasets. Results are averaged over 3 runs.

    m      Digits    Pairs
    16     34.72     79.25
    32     34.48     78.99
    64     34.84     79.70
    128    34.88     81.26

The model was selected on the validation set and using it we calculated the average log-likelihood over the test set (in bits per example). The averages (over 3 runs) are shown in Table 2. In Figure 13 we compare our results (for m = 32) with the results published by Frey et al. (1996). The algorithms plotted in the figure are (1) the completely factored or "Base rate" (BR) model, which assumes that every variable is independent of all the others, (2) the mixture of factorial distributions (MF), (3) the UNIX "gzip" compression program, (4) the Helmholtz Machine, trained by the wake-sleep algorithm (Frey et al., 1996) (HWS), (5) the same Helmholtz Machine where a mean field approximation was used for training (HMF), (6) a fully observed and fully connected sigmoid belief network (FV), and (7) the mixture of trees (MT) model.

Figure 13: Average log-likelihoods (bits per digit) for the single digit (a) and double digit (b) datasets. Notice the difference in scale between the two figures.

As shown in Figure 13, the MT model yields the best density model for the simple digits and the second-best model for pairs of digits. A comparison of particular interest is between the MT model and the mixture of factorial (MF) model. In spite of the structural similarities in these models, the MT model performs better than the MF model, indicating that there is structure in the data that is exploited by the mixture of spanning trees but is not captured by a mixture of independent variable models. Comparing the values of the average likelihood in the MT model for digits and pairs we see that the second is more than twice the first. This suggests that our model (and the MF model) is able to perform good compression of the digit data but is unable to discover the independence in the double digit set.

5.2.2 The ALARM network

Our second set of density estimation experiments features the ALARM network as the data generating mechanism (Heckerman et al., 1995; Cheng, Bell, & Liu, 1997). This Bayesian network was constructed from expert knowledge as a medical diagnostic alarm message system for patient monitoring. The domain has n = 37 discrete variables taking between 2 and 4 values, connected by 46 directed arcs. Note that this network is neither a tree nor a mixture of trees, but the topology of the graph is sparse, suggesting the possibility of approximating the dependency structure by a mixture of trees with a small number of components m.

We generated a training set having N_train = 10000 data points and a separate test set. On these sets we trained and compared the following methods: mixtures of trees (MT), mixtures of factorial (MF) distributions, the true model, and "gzip." For MT and MF the model order m and the degree of smoothing were selected by cross validation on randomly selected subsets of the training set.

Table 3: Density estimation results for the mixtures of trees and other models on the ALARM dataset. Training set size N_train = 10000. Average and standard deviation over 20 trials.

    Model                           Train likelihood       Test likelihood
                                    [bits/data point]      [bits/data point]
    ALARM net                       13.148                 13.264
    Mixture of trees (m = 18)       13.51 +/- 0.04         14.55
    Mixture of factorials (m = 28)  17.11 +/- 0.12         17.64
    Base rate                       30.99                  31.17
    gzip                            40.345                 41.260

The results are presented in Table 3, where we see that the MT model outperforms the MF model as well as gzip and the base rate model. To examine the sensitivity of the algorithms to the size of the dataset we ran the same experiment with a training set of size 1,000. The results are presented in Table 4. Again, the MT model is the closest to the true model. Notice that the degradation in performance for the mixture of trees is relatively mild (about 1 bit), whereas the model complexity is reduced significantly. This indicates the important role played by the tree structures in fitting the data and motivates the advantage of the mixture of trees over the mixture of factorials for this dataset.
Table 4: Density estimation results for the mixtures of trees and other models on a dataset of size 1000 generated from the ALARM network. Average and standard deviation over 20 trials. Recall that \alpha is a smoothing coefficient.

    Model                                          Train likelihood       Test likelihood
                                                   [bits/data point]      [bits/data point]
    ALARM net                                      13.167                 13.264
    Mixture of trees (\alpha = 50)                 14.56 +/- 0.16         15.51
    Mixture of factorials (m = 12, \alpha = 100)   18.20 +/- 0.37         19.99
    Base rate                                      31.23                  31.18
    gzip                                           45.960                 46.072

Table 5: Density estimation results for the mixtures of trees and other models on the FACES dataset. Average and standard deviation over 10 trials.

    Model                                          Train likelihood       Test likelihood
                                                   [bits/data point]      [bits/data point]
    Mixture of trees (m = 10)                      52.77 +/- 0.33         56.29
    Mixture of factorials (m = 24, \alpha = 100)   56.34 +/- 0.48         64.41
    Base rate                                      75.84                  74.27
    gzip                                           --                     103.51

5.2.3 The FACES dataset

For the third density estimation experiment, we used a subset of 576 images from a normalized face images dataset (Philips, Moon, Rauss, & Rizvi, 1997). These images were downsampled to 48 variables (pixels) and 5 gray levels. We divided the data randomly into N_train = 500 and N_test = 76 examples; of the 500 training examples, 50 were left out as a validation set and used to select m and \alpha for the MT and MF models. The results in Table 5 show the mixture of trees as the clear winner. Moreover, the MT achieves this performance with almost 5 times fewer parameters than the second best model, the mixture of 24 factorial distributions.

Note that an essential ingredient of the success of the MT both here and in the digits experiments is that the data are "normalized", i.e., a pixel/variable corresponds approximately to the same location on the underlying digit or face. We do not expect MTs to perform well on randomly chosen image patches.

5.3 Classification with mixtures of trees

5.3.1 Using a mixture of trees as a classifier

A density estimator can be turned into a classifier in two ways, both of them being essentially likelihood ratio methods. Denote the class variable by c and the set of input variables by V. In the first method, adopted in our classification experiments under the name of the MT classifier, an MT model Q is trained on the domain consisting of V and c, treating the class variable like any other variable and pooling all the training data together. In the testing phase, a new instance x_V is classified by picking the most likely value of the class variable given the settings of the other variables:

    c(x_V) = \arg\max_{x_c} Q(x_V, x_c).

Similarly, for the MF classifier (termed "D-SIDE" by Kontkanen et al., 1996), Q above is an MF trained on V and c.

The second method calls for partitioning the training set according to the values of the class variable and for training a tree density estimator on each partition. This is equivalent to training a mixture of trees with observed choice variable, the choice variable being the class (Chow & Liu, 1968; Friedman et al., 1997). In particular, if the trees are forced to have the same structure we obtain the Tree Augmented Naive Bayes (TANB) classifier of Friedman et al. (1997). In either case one turns to Bayes' formula:

    c(x_V) = \arg\max_k Pr[c = k] T^k(x_V)

to classify a new instance. The analog of the MF classifier in this setting is the naive Bayes classifier.
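A sketch of the first method (our own illustration; mixture_log_prob_joint stands for any routine that evaluates log Q of the completed assignment, for instance one assembled from the mixture_log_prob sketch in Section 3):

```python
import numpy as np

def classify_with_mixture(x_inputs, class_values, mixture_log_prob_joint):
    """Pick argmax_c Q(x_V, class = c), the first classification method above.

    mixture_log_prob_joint(x_inputs, c) must return log Q of the assignment
    completed with class value c.
    """
    scores = [mixture_log_prob_joint(x_inputs, c) for c in class_values]
    return class_values[int(np.argmax(scores))]
```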
5.3.2 The AUSTRALIAN dataset

This dataset has 690 examples each consisting of 14 attributes and a binary class variable (Blake & Merz, 1998). In the following we replicated the experimental procedure of Kontkanen et al. (1996) and Michie et al. (1994) as closely as possible. The test and training set sizes were 70 and 620 respectively. For each value of m we ran our algorithm for a fixed number of epochs on the training set and then recorded the performance on the test set. This was repeated 20 times for each m, each time with a random start and with a random split between the test and the training set. Because of the small dataset size we used edge pruning with a positive edge penalty \beta. The best performance of the mixtures of trees is compared to the published results of Kontkanen et al. (1996) and Michie et al. (1994) for the same dataset in Table 6.

Table 6: Performance comparison between the MT model and other classification methods on the AUSTRALIAN dataset (Michie et al., 1994). The results for mixtures of factorial distributions are those reported by Kontkanen et al. (1996).

    Method                              % correct    Method                         % correct
    Mixture of trees (m = 20)                        Backprop                       84.6
    Mixture of factorial distributions  87.2         C4.5                           84.6
    Cal5 (decision tree)                86.9         SMART                          84.2
    ITrule                              86.3         Bayes Trees                    82.9
    Logistic discrimination             85.9         K-nearest neighbor (k = 1)     81.9
    Linear discrimination               85.9         AC2                            81.9
    DIPOL92                             85.9         NewID                          81.9
    Radial basis functions              85.5         LVQ                            80.3
    CART                                85.5         ALLOC80                        79.9
    CASTLE                              85.2         CN2                            79.6
    Naive Bayes                         84.9         Quadratic discrimination       79.3
    IndCART                             84.8         Flexible Bayes                 78.3

5.3.3 The AGARICUS-LEPIOTA dataset

The AGARICUS-LEPIOTA data (Blake & Merz, 1998) comprises 8124 examples, each specifying the 22 discrete attributes of a species of mushroom in the Agaricus and Lepiota families and classifying it as edible or poisonous. The arities of the variables range from 2 to 12. We created a test set of N_test = 1124 examples and a training set of N_train = 7000 examples. Of the latter, 800 examples were kept aside to select m and the rest were used for training. No smoothing was used. The classification results on the test set are presented in Figure 14(a). As the figure suggests, this is a relatively easy classification problem, where seeing enough examples guarantees perfect performance (achieved by the TANB). The MT (with m = 12) achieves nearly optimal performance, making one mistake in one of the 5 trials. The MF and naive Bayes models follow about 0.5% behind.

Figure 14: Classification results for the mixtures of trees and other models: (a) on the AGARICUS-LEPIOTA dataset, where the MT has m = 12 and the MF has m = 30; (b) on the NURSERY dataset, where the MT has m = 30 and the MF has m = 70. TANB and NB are the tree augmented naive Bayes and the naive Bayes classifiers respectively. The plots show the average and standard deviation test set error rate over 5 trials.

5.3.4 The NURSERY dataset

This dataset contains 12,958 entries (Blake & Merz, 1998), consisting of 8 discrete attributes and one class variable taking 4 values.(1) The data were randomly separated into a training set of size N_train = 11000 and a test set of size N_test = 1958. In the case of MTs and MFs the former dataset was further partitioned into 9000 examples used for training the candidate models and 2000 examples used to select the optimal m. The TANB and naive Bayes models were trained on all the 11,000 examples. No smoothing was used since the training set was large. The classification results are shown in Figure 14(b).

(1) The original dataset contains another two data points which correspond to a fifth class value; we eliminated those from the data we used.

5.3.5 The SPLICE dataset: Classification

Figure 15: Comparison of classification performance of the MT and other models on the SPLICE dataset when N_train = 2000, N_test = 1175. Tree represents a mixture of trees with m = 1, MT is a mixture of trees with m = 3. KBNN is the Knowledge based neural net, NN is a neural net.

We also studied the classification performance of the MT model in the domain of DNA SPLICE-junctions. The domain consists of 60 variables, representing a sequence of DNA
Figure 16: Comparison of classification performance of the mixture of trees and other models trained on small subsets of the SPLICE dataset (left panel: N_train = 100, N_test = 1575; right panel: N_train = 200, N_test = 1575). The models tested by DELVE are, from left to right: 1-nearest neighbor, CART, HME (hierarchical mixture of experts)-ensemble learning, HME-early stopping, HME-grown, K-nearest neighbors, LLS (linear least squares), LLS-ensemble learning, ME (mixture of experts)-ensemble learning, ME-early stopping. TANB is the Tree Augmented Naive Bayes classifier, NB is the Naive Bayes classifier, and Tree is the single tree classifier.

bases, and an additional class variable (Rasmussen et al., 1996). The task is to determine if the middle of the sequence is a splice junction and what is its type. Splice junctions are of two types: exon-intron (EI) represents the end of an exon and the beginning of an intron whereas intron-exon (IE) is the place where the intron ends and the next exon, or coding section, begins. Hence, the class variable can take 3 values (EI, IE or no junction) and the other variables take 4 values corresponding to the 4 possible DNA bases (C, A, G, T). The dataset consists of 3,175 labeled examples.(2)

We ran two series of experiments comparing the MT model with competing models. In the first series of experiments, we compared to the results of Noordewier et al. (1991), who used multilayer neural networks and knowledge-based neural networks for the same task. We replicated these authors' choice of training set size (2000) and test set size (1175) and sampled new training/test sets for each trial. We constructed trees (m = 1) and mixtures of trees (m = 3). In fitting the mixture, we used an early-stopping procedure in which N_valid = 300 examples were separated out of the training set and training was stopped when the likelihood on these examples stopped increasing. The results, averaged over 20 trials, are presented in Figure 15 for a variety of values of the smoothing parameter \alpha. It can be seen that the single tree and the MT model perform similarly, with the single tree showing an insignificantly better classification accuracy. Note that in this situation smoothing does not improve performance; this is not unexpected since the dataset is relatively large. With the exception of the "oversmoothed" MT model (the largest value of \alpha), all the single tree or MT models outperform the other models tested on this problem. Note that whereas the tree models contain no prior knowledge about the domain, the other two models do: the neural network model is trained in supervised mode, optimizing for class accuracy, and the KBNN includes detailed domain knowledge.

Based on the strong showing of the single tree model on the SPLICE task, we pursued a second series of experiments in which we compare the tree model with a larger collection of methods from the DELVE repository (Rasmussen et al., 1996). The DELVE benchmark uses subsets of the SPLICE database with 100 and 200 examples for training. Testing is done on 1500 examples in all cases. Figure 16 presents the results for the algorithms tested by DELVE as well as the single trees with different degrees of smoothing. We also show results for naive Bayes (NB) and Tree Augmented Naive Bayes (TANB) models (Friedman et al., 1997). The results from DELVE represent averages over 20 runs with different random initializations on the same training and testing sets; for trees, NB and TANB, whose outputs are not initialization-dependent, we averaged the performance of the models learned for 20 different
splits of the union of the training and testing set. No early stopping or cross-validation was used in this case. The results show that the single tree is quite successful in this domain, yielding an error rate that is less than half of the error rate of the best model tested in DELVE. Moreover, the average error of a single tree trained on 200 examples is 6.9%, which is only 2.3% greater than the average error of the tree trained on 2000 examples. We attempt to explain this striking preservation of accuracy for small training sets in our discussion of feature selection in Section 5.3.7.

(2) We eliminated 15 examples from the original dataset that had ambiguous inputs (Noordewier, Towell, & Shavlik, 1991; Rasmussen et al., 1996).

Figure 17: Cumulative adjacency matrix of 20 trees fit to 2000 examples of the SPLICE dataset with no smoothing. The size of the square at coordinates (i, j) represents the number of trees (out of 20) that have an edge between variables i and j. No square means that this number is 0. Only the lower half of the matrix is shown. The class is variable 0. The group of squares at the bottom of the figure shows the variables that are connected directly to the class. Only these variables are relevant for classification. Not surprisingly, they are all located in the vicinity of the splice junction (which is between 30 and 31). The subdiagonal "chain" shows that the rest of the variables are connected to their immediate neighbors. Its lower-left end is edge 2-1 and its upper-right is edge 60-59.

Figure 18: The encoding of the IE and EI splice junctions as discovered by the tree learning algorithm, compared to the ones given by Watson et al., "Molecular Biology of the Gene" (Watson et al., 1987). Positions in the sequence are consistent with our variable numbering: thus the splice junction is situated between positions 30 and 31. Symbols in boldface indicate bases that are present with probability almost 1, other A, C, G, T symbols indicate bases or groups of bases that have high probability (0.8), and a dash indicates that the position can be occupied by any base with a non-negligible probability.

The Naive Bayes model exhibits behavior that is very similar to the tree model and only slightly less accurate. However, augmenting the Naive Bayes model to a TANB significantly hurts the classification performance.

5.3.6 The SPLICE dataset: Structure identification

Figure 17 presents a summary of the tree structures learned from the N = 2000 dataset in the form of a cumulated adjacency matrix. The adjacency matrices of the 20 graph structures obtained in the experiment have been summed. The size of the black square at coordinates i, j in the figure is proportional to the value of the i, j-th element of the cumulated adjacency matrix. No square means that the respective element is 0. Since the adjacency matrix is symmetric, only half of the matrix is shown.

From Figure 17 we see that the tree structure is very stable over the 20 trials. Variable 0 represents the class variable; the hypothetical splice junction is situated between variables 30 and 31. The figure shows that the splice junction (variable 0) depends only on DNA sites that are in its vicinity. The sites that are remote from the splice junction are dependent on their immediate neighbors. Moreover, examining the tree parameters for the edges adjacent to the class variable, we observe that these variables build certain patterns when the splice junction is present, but are random and almost uniformly distributed in the absence of a splice junction. The patterns extracted from the learned trees are shown in Figure 18. The same figure displays the "true" encodings of the IE and EI junctions as given by Watson et al. (1987). The match between the two encodings is almost perfect. Thus, we can conclude that for this
Figure 19: The cumulated adjacency matrix for 20 trees over the original set of variables (0-60) augmented with 60 "noisy" variables (61-120) that are independent of the original ones. The matrix shows that the tree structure over the original variables is preserved.

Thus, we can conclude that for this domain the tree model not only provides a good classifier but also discovers a model of the physical reality underlying the data. Note that the algorithm arrives at this result in the absence of prior knowledge: (1) it does not know which variable is the class variable, and (2) it does not know that the variables form a sequence (i.e., the same result would be obtained if the indices of the variables were scrambled).

5.3.7 The SPLICE data set: feature selection

Let us examine the single tree classifier that was used for the SPLICE data set more closely. According to the Markov properties of the tree distribution, the probability of the class variable depends only on its neighbors, that is, on the variables to which the class variable is connected by tree edges. Hence, a tree acts as an implicit variable selector for classification: only the variables adjacent to the queried variable (this set of variables is called its Markov blanket; Pearl, 1988) are relevant for determining its probability distribution.

This property also explains the observed preservation of the accuracy of the tree classifier when the size of the training set decreases: out of the 60 variables, only 18 are relevant to the class; moreover, the dependence is parametrized as 18 independent pairwise probability tables T_{class,v}. Such parameters can be fit accurately from relatively few examples. Hence, as long as the training set contains enough data to establish the correct dependency structure, the classification accuracy degrades slowly as the size of the data set decreases.

This explanation also helps us understand the superiority of the tree classifier over the models in DELVE: only a small subset of the variables is relevant for classification, and the tree finds them correctly. A classifier that is not able to perform feature selection reasonably well will be hindered by the remaining irrelevant variables, especially if the training set is small. For a given Markov blanket, the tree classifies in the same way as a naive Bayes model with the Markov blanket variables as inputs. Note also that the naive Bayes model itself has a built-in feature selector: if one of the input variables v is not relevant to the class, the distributions of v conditioned on the class will be roughly the same for all class values. Consequently, in the posterior that serves for classification, the factors corresponding to v approximately cancel and thus have little influence on the classification. This may explain why the naive Bayes model also performs well on the SPLICE classification task. Notice, however, that the variable selection mechanisms implemented by the tree classifier and the naive Bayes classifier are not the same.

To verify that the single tree classifier indeed acts like a feature selector, we performed the following experiment, again using the SPLICE data. We augmented the variable set with another 60 variables, each taking 4 values with randomly and independently assigned probabilities. The rest of the experimental conditions (training set, test set and number of random restarts) were identical to the first SPLICE experiment. We fit a set of models with m = 1 and no smoothing. The structure of the new models, in the form of a cumulative adjacency matrix, is shown in Figure 19. We see that the structure over the original 61 variables is unchanged and stable; the 60 noise variables connect in a random, uniform pattern to the original variables and among each other. As expected after examining the structure, the classification performance of the new trees is not affected by the newly introduced variables: in fact, the average accuracy of the trees over 121 variables is 95.8%, which is 0.1% higher than the accuracy of the original trees.
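To make the implicit feature selection concrete, the sketch below (ours; the function and argument names are illustrative, and the pairwise tables are assumed to have been estimated already) classifies an example using only the variables in the class variable's Markov blanket, exactly as a naive Bayes model restricted to those variables would.

```python
import numpy as np

def classify_with_markov_blanket(x, class_prior, blanket_tables):
    """Posterior over the class using only the Markov blanket of the class.

    x              : dict mapping variable index -> observed value
    class_prior    : array with the marginal distribution of the class
    blanket_tables : dict mapping a blanket variable v -> conditional table
                     P(v | class), shape (n_class_values, n_values_of_v)
    Variables outside the Markov blanket never enter the product, so they
    have no influence on the decision.
    """
    log_post = np.log(class_prior)
    for v, table in blanket_tables.items():        # v ranges over the blanket only
        log_post = log_post + np.log(table[:, x[v]])   # multiply in P(x_v | class)
    log_post -= np.logaddexp.reduce(log_post)      # normalize
    return np.exp(log_post)

# Hypothetical usage with the 18 sites adjacent to the class in the learned tree:
# posterior = classify_with_markov_blanket(example, class_prior, blanket_tables)
# label = posterior.argmax()
```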
6. The accelerated tree learning algorithm

We have argued that the mixture-of-trees approach has significant advantages over general Bayesian networks in terms of its algorithmic complexity. In particular, the M step of the EM algorithm for mixtures of trees is the Chow-Liu algorithm, which scales quadratically in the number of variables n and linearly in the size of the data set N. Given that the E step is linear in n and N for mixtures of trees, we have a situation in which each pass of the EM algorithm is quadratic in n and linear in N.

Although this time complexity recommends the MT approach for large-scale problems, the quadratic scaling in n becomes problematic for particularly large problems. In this section we propose a method for reducing the time complexity of the MT learning algorithm and demonstrate empirically the large performance gains that we are able to obtain with this method.

As a concrete example of the kind of problem that we have in mind, consider the problem of clustering or classification of documents in information retrieval. Here the variables are words from a vocabulary, and the data points are documents. A document is represented as a binary vector with a component equal to 1 for each word that is present in the document and equal to 0 for each word that is not present. In a typical application the number of documents is of the order of 10^3-10^4, as is the vocabulary size. Given such numbers, fitting a single tree to the data requires on the order of n^2 N counting operations. Note, however, that this domain is characterized by a certain sparseness: each document contains only a relatively small number of words, and thus most of the components of its binary vector are 0.

In this section, we show how to take advantage of data sparseness to accelerate the Chow-Liu algorithm. We show that in the sparse regime we can often rank order mutual information values without actually computing these values. We also show how to speed up the computation of the sufficient statistics by exploiting sparseness. Combining these two ideas yields the acCL (accelerated Chow and Liu) algorithm, which provides significant performance gains in both running time and memory.

6.1 The acCL algorithm

We first present the acCL algorithm for the case of binary variables, presenting the extension to general discrete variables in Section 6.2. For binary variables we will say that a variable is "on" when it takes value 1; otherwise we say that it is "off". Without loss of generality we assume that a variable is off more times than it is on in the given data set. The target distribution P is assumed to be derived from a set of observations of size N.

Let us denote by N_v the number of times variable v is on in the data set and by N_uv the number of times variables u and v are simultaneously on; we call each of the latter events a co-occurrence of u and v. The marginals P_v and P_{uv} are given by

P_v(1) = \frac{N_v}{N}, \qquad P_v(0) = 1 - \frac{N_v}{N},

P_{uv}(1,1) = \frac{N_{uv}}{N}, \quad P_{uv}(1,0) = \frac{N_u - N_{uv}}{N}, \quad P_{uv}(0,1) = \frac{N_v - N_{uv}}{N}, \quad P_{uv}(0,0) = 1 - \frac{N_u + N_v - N_{uv}}{N}.

All the information about P that is necessary for fitting the tree is summarized in the counts N, N_v and N_uv, u, v = 1, ..., n, and from now on we will consider P to be represented by these counts. (It is an easy extension to handle non-integer data, such as when the data points are "weighted" by real numbers.)
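As an illustration of this representation, the sketch below (ours; a minimal example rather than the paper's implementation) recovers the pairwise marginal and the empirical mutual information of a pair (u, v) from the counts N, N_u, N_v and N_uv alone.

```python
import numpy as np

def pair_marginal(N, Nu, Nv, Nuv):
    """2x2 empirical marginal P_uv implied by the counts; index order is (x_u, x_v)."""
    return np.array([
        [N - Nu - Nv + Nuv, Nv - Nuv],   # x_u = 0
        [Nu - Nuv,          Nuv],        # x_u = 1
    ], dtype=float) / N

def mutual_information(N, Nu, Nv, Nuv):
    """Empirical mutual information I_uv computed from the four counts only."""
    P = pair_marginal(N, Nu, Nv, Nuv)
    Pu, Pv = P.sum(axis=1), P.sum(axis=0)
    mask = P > 0                                  # use the convention 0 log 0 = 0
    return float((P[mask] * np.log(P[mask] / np.outer(Pu, Pv)[mask])).sum())

# Hypothetical usage: 10,000 documents, word u present in 300 of them,
# word v in 200, and both words together in 50.
# I = mutual_information(N=10_000, Nu=300, Nv=200, Nuv=50)
```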
We now define the notion of sparseness that motivates the acCL algorithm. Let us denote by |x^i| the number of variables that are on in observation x^i, where 0 <= |x^i| <= n, and define the data sparseness s as

s = \max_{i=1,\ldots,N} |x^i|.

If, for example, the data are documents and the variables represent words from a vocabulary, then s represents the maximum number of distinct words in a document. The time and memory requirements of the algorithm that we describe depend on the sparseness: the lower the sparseness, the more efficient the algorithm. Our algorithm realizes its largest performance gains when s << n, N.

Recall that the Chow-Liu algorithm greedily adds edges to a graph by choosing the edge that currently has the maximal value of mutual information. The algorithm that we describe involves an efficient way to rank order mutual information values. There are two key aspects to the algorithm: (1) how to compare mutual information between non-co-occurring variables, and (2) how to compute co-occurrences in a list representation.

6.1.1 Comparing mutual information between non-co-occurring variables

Let us focus on the pairs u, v that do not co-occur, i.e., for which N_uv = 0. For such a pair, the mutual information I_uv is a function of N, N_u and N_v only. Let us analyze the variation of the mutual information with respect to N_v by taking the corresponding partial derivative (reconstructed here from the definitions above):

\frac{\partial I_{uv}}{\partial N_v} = \frac{1}{N} \log \frac{N - N_v}{N - N_u - N_v} \;\ge\; 0.   (11)

This result implies that, for a given variable u and any two variables v, v' for which N_uv = N_uv' = 0, we have

N_v >= N_{v'}  implies  I_uv >= I_{uv'}.

This observation allows us to partially sort the mutual information values for non-co-occurring pairs u, v without computing them. First, we sort all the variables by their number of occurrences N_v and store the result in a list L. This gives a total ordering "≻" for the variables: u ≻ v iff u precedes v in the list L. For each u, we define the list of variables following u in the ordering "≻" and not co-occurring with it:

V(u) = { v ∈ V : u ≻ v, N_uv = 0 }.

This list is sorted by decreasing N_v and therefore, implicitly, by decreasing I_uv. Since the data are sparse, most pairs of variables do not co-occur; therefore, by creating the lists V(u), a large number of mutual information values are partially sorted. Before showing how to use this construction, let us examine an efficient way of computing the counts N_uv when the data are sparse.

Figure 20: The data structure that supplies the next candidate edge. Vertically on the left are the variables u, sorted by decreasing N_u. For a given u, there are two lists: the list of variables co-occurring with u, sorted by decreasing I_uv, and the (virtual) list V(u), sorted by decreasing N_v. The maximum of the first elements of these two lists is inserted into a Fibonacci heap; the overall maximum of I_uv can then be extracted as the maximum of the heap.

6.1.2 Computing co-occurrences in a list representation

Let x^1, ..., x^N be a set of observations over n binary variables. If s << n it is efficient to represent each observation in D as a list of the variables that are on in the respective observation. Thus, data point x^i, i = 1, ..., N, is represented by the list xlist^i = list{ v ∈ V : x^i_v = 1 }. The space required by these lists is no more than sN, which is much smaller than the space nN required by the binary vector representation of the same data. Note, moreover, that the total number of co-occurrences in the data set, N_C = Σ_{u ≻ v} N_uv, is at most N s(s-1)/2, since each data point contributes at most s(s-1)/2 co-occurring pairs.

For the variables that co-occur with u, a set of co-occurrence lists C(u) is created. Each C(u) contains records (v, I_uv, N_uv) for the variables v with u ≻ v and N_uv > 0, and is sorted by decreasing I_uv. To represent the lists V(u) (which contain many elements) we use their "complements"

V̄(u) = { v ∈ V : u ≻ v, N_uv > 0 }.

It can be shown (Meila-Predoviciu, 1999) that the computation of the co-occurrence counts and the construction of the lists C(u) and V̄(u), for all u, take an amount of time proportional to the total number of co-occurrences N_C, up to a logarithmic factor.

Figure 21: The acCL algorithm.
  Algorithm acCL(D)
  Input: variable set V of size n; data set D = {xlist^1, ..., xlist^N}
  Uses: procedure Kruskal (maximum weight spanning tree)
  1. compute N_v for v ∈ V; create L, the list of the variables in V sorted by decreasing N_v
  2. compute the co-occurrences N_uv; create the lists C(u) and V̄(u)
  3. create vheap: for each u ∈ V, find the v maximizing I_uv and insert (u, v, I_uv) in vheap
  4. E = Kruskal(vheap); store N_uv for the edges added to E
  5. for each (u, v) ∈ E compute the probability table T_uv using N_uv, N_u, N_v and N
  Output: the tree with edge set E and tables {T_uv}
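To illustrate the list representation of Section 6.1.2, the following sketch (ours; a simplified illustration rather than the paper's data structures) computes the occurrence counts N_v and the co-occurrence counts N_uv directly from the sparse lists, touching only the at most s(s-1)/2 pairs present in each observation.

```python
from collections import Counter
from itertools import combinations

def sparse_counts(xlists):
    """Occurrence and co-occurrence counts from sparse observations.

    xlists : list of observations, each given as the list (or set) of the
             variables that are 'on' in that observation.
    Returns (N_v, N_uv), where N_v is a Counter over variables and
    N_uv is a Counter over unordered pairs (u, v) with u < v.
    """
    N_v, N_uv = Counter(), Counter()
    for xlist in xlists:
        N_v.update(xlist)
        # each observation contributes at most s*(s-1)/2 co-occurring pairs
        N_uv.update(combinations(sorted(xlist), 2))
    return N_v, N_uv

# Hypothetical usage on three tiny "documents":
# N_v, N_uv = sparse_counts([[2, 5, 7], [5, 7], [1, 2]])
# N_v[5] == 2 ; N_uv[(5, 7)] == 2 ; pairs that never co-occur simply do not appear.
```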
Comparing this cost (time proportional to N_C) with the O(n^2 N) time needed to compute all the counts N_uv in the standard Chow-Liu algorithm, we see that the present method essentially replaces the dimension n of the domain by the sparseness s. The memory required by the lists is also at most proportional to N_C.

6.1.3 The algorithm and its data structures

We have described efficient methods for computing the co-occurrences and for partially sorting the mutual information values. What we aim to create is a mechanism that outputs the edges (u, v) in decreasing order of their mutual information. We set up this mechanism in the form of a Fibonacci heap (Fredman & Tarjan, 1987), called vheap, that contains one element for each u, namely the edge with the highest mutual information among the edges (u, v), u ≻ v, that have not yet been eliminated. The record in vheap is of the form (u, v, I_uv), with u ≻ v and with I_uv being the key used for sorting. Once the maximum is extracted, the used edge has to be replaced by the next largest (in terms of I_uv) edge in u's lists. To perform this latter task we use the data structures shown in Figure 20. Kruskal's algorithm is then used to construct the desired spanning tree. Figure 21 summarizes the resulting algorithm.

Figure 22: The mean (full line), standard deviation and maximum (dotted line) of the number of steps n_K in the Kruskal algorithm over 1000 runs, plotted against the number of variables n, which ranges from 5 to 3000. The edge weights were sampled from a uniform distribution.

6.1.4 Running time

The algorithm requires: time proportional to sN for computing the counts N_v; O(n log n) for sorting the variables; time proportional to N_C (up to a logarithmic factor) for step 2; O(n) for step 3; for the Kruskal algorithm, a cost proportional to the number n_K of edges examined, where each examined edge requires an extraction from vheap (logarithmic in n) plus the occasional skipping of elements of the lists V̄(u) when variables are extracted from the virtual lists V(u); and O(n) for creating the n - 1 probability tables in step 5. Summing these terms, we obtain an upper bound for the running time of the acCL algorithm; if we ignore the logarithmic factors, the bound is a sum of terms proportional to N_C (itself at most N s(s-1)/2), to n, and to n_K. For constant s, this bound is a polynomial of degree one in the three variables n, N and n_K. Because n_K can be as large as n(n-1)/2, the worst-case complexity of the acCL algorithm is quadratic in n. Empirically, however, as we show below, we find that the dependence of n_K on n is generally subquadratic. Moreover, random graph theory implies that if the distribution of the weight values is the same for all edges, then Kruskal's algorithm should take a number of steps proportional to n log n (West, 1996). To verify this latter result, we conducted a set of Monte Carlo experiments in which we ran the Kruskal algorithm on sets of random weights over domains of dimension up to n = 3000. For each n, 1000 runs were performed. Figure 22 plots the average and maximum n_K versus n for these experiments. The curve for the average displays an essentially linear dependence on n.
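The mechanism of Section 6.1.3 can be emulated with any priority queue. The sketch below (ours; it uses Python's binary heap rather than a Fibonacci heap, a plain union-find in place of the paper's data structures, and it enumerates all candidate edges up front instead of refilling the heap lazily from the lists of Figure 20) shows how Kruskal's algorithm consumes edges in decreasing order of mutual information until a spanning tree is formed.

```python
import heapq

def kruskal_from_heap(candidate_edges, n_vars):
    """Build a maximum-weight spanning tree, examining edges best-first.

    candidate_edges : iterable of (weight, u, v) triples; in the acCL setting
                      the weight would be the mutual information I_uv.
    Returns (tree_edges, n_K), where n_K is the number of edges examined.
    """
    heap = [(-w, u, v) for w, u, v in candidate_edges]   # max-heap via negation
    heapq.heapify(heap)
    parent = list(range(n_vars))                          # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]                 # path halving
            a = parent[a]
        return a

    tree, n_K = [], 0
    while heap and len(tree) < n_vars - 1:
        w, u, v = heapq.heappop(heap)
        n_K += 1
        ru, rv = find(u), find(v)
        if ru != rv:                                      # edge joins two components
            parent[ru] = rv
            tree.append((u, v, -w))
    return tree, n_K

# Hypothetical usage:
# tree, n_K = kruskal_from_heap([(0.9, 0, 1), (0.5, 1, 2), (0.2, 0, 2)], n_vars=3)
```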
6.1.5 Memory requirements

Beyond the storage for the data and the results, we need space proportional to the number of co-occurrences N_C to store the counts N_uv and the lists C(u) and V̄(u), plus O(n) space for L, vheap and the Kruskal algorithm. Hence the additional space used by the acCL algorithm is proportional to N_C + n.

6.2 Discrete variables of arbitrary arity

We briefly describe the extension of the acCL algorithm to the case of discrete domains in which the variables can take more than two values. First we extend the definition of data sparseness: we assume that for each variable there exists a special value that appears with higher frequency than all the other values. This value will be denoted by 0, without loss of generality. For example, in a medical domain, the value 0 for a variable would represent the "normal" value, whereas the abnormal values of the variable would be designated by non-zero values. An "occurrence" of variable v is now the event v ≠ 0, and a "co-occurrence" of u and v means that u and v are both non-zero in the same data point. We define |x^i| as the number of non-zero values in observation x^i; the sparseness s is, as before, the maximum of |x^i| over the data set. To exploit the high frequency of the zero values we represent only the occurrences explicitly, creating thereby a compact and efficient data structure. As before, we obtain performance gains by presorting mutual information values for non-co-occurring variables.

6.2.1 Computing co-occurrences

As before, we avoid representing zero values explicitly by replacing each data point x^i with the list xlist^i = list{ (v, x^i_v) : v ∈ V, x^i_v ≠ 0 }. A co-occurrence is represented by the quadruple (u, x_u, v, x_v), with x_u, x_v ≠ 0. Instead of one co-occurrence count N_uv, we now have a two-way contingency table: each entry N_uv^{ij} represents the number of data points where u = i and v = j, with i, j ≠ 0. Counting and storing co-occurrences can be done in the same time as before and with an amount of memory larger by at most a constant factor depending on the maximum number of values a variable can take, necessitated by the additional need to store the (non-zero) variable values.

6.2.2 Presorting mutual information values

Our goal is to presort the mutual information values I_uv for all the variables v that do not co-occur with u. The following theorem shows that this can be done exactly as before.

Theorem. Let u, v, w be discrete variables such that v and w do not co-occur with u (N_uv = N_uw = 0) in a given data set D. Let N_v, N_w be the number of data points for which v ≠ 0 and w ≠ 0, respectively, and let I_uv, I_uw be the corresponding empirical mutual information values based on the sample D. Then

N_v > N_w  implies  I_uv >= I_uw,

with equality only if u is identically 0.

The proof of the theorem is given in the Appendix. The implication of this theorem is that the acCL algorithm can be extended to variables taking more than two values by making only one (minor) modification: the replacement of the scalar counts N_v and N_uv with, respectively, the vectors (N_v^i, i ≠ 0) and the contingency tables (N_uv^{ij}, i, j ≠ 0).
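The theorem can be checked numerically. The sketch below (ours, purely illustrative) builds multi-valued variables u, v, w such that neither v nor w ever co-occurs with u, and verifies that the empirical mutual information with u is ordered by the occurrence counts N_v > N_w, regardless of the actual non-zero values involved.

```python
import numpy as np
from collections import Counter

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two discrete samples."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(c / n * np.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

rng = np.random.default_rng(0)
N = 1000
u = np.zeros(N, dtype=int)
u[:300] = rng.integers(1, 4, size=300)        # u is non-zero in 300 data points

# v and w are non-zero only where u is zero, so they never co-occur with u;
# v occurs more often than w (N_v = 400 > N_w = 150).
v = np.zeros(N, dtype=int)
v[300:700] = rng.integers(1, 5, size=400)
w = np.zeros(N, dtype=int)
w[300:450] = rng.integers(1, 3, size=150)

assert empirical_mi(u, v) >= empirical_mi(u, w)   # ordered by the occurrence counts
```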
Figure 23: Running time for the acCL (full line) and the Chow-Liu (dotted line) algorithms versus the number of vertices n, for different values of the sparseness s (s = 5, 10, 15, 100).

Figure 24: Number of steps n_K of the Kruskal algorithm versus the domain size n, measured for the acCL algorithm, for different values of s (s = 5, 10, 15, 100).

6.3 Experiments

In this section we report on experiments that compare the speed of the acCL algorithm with the standard Chow-Liu method on artificial data. In our experiments the dimension n of the binary domain varies from 50 to 1000. Each data point has a fixed number of variables taking value 1, and the sparseness s takes the values 5, 10, 15 and 100. The data were generated from an artificially constructed non-uniform, non-factored distribution. For each pair (n, s) a set of 10,000 points was created, and for each data set both the Chow-Liu algorithm and the acCL algorithm were used to fit a single tree distribution.

The running times are plotted in Figure 23. The improvements of acCL over the standard version are spectacular: learning a tree over 1000 variables from 10,000 data points takes 4 hours with the standard algorithm and only 4 seconds with the accelerated version when the data are sparse (s <= 15). For the less sparse regime of s = 100 the acCL algorithm takes 2 minutes to complete, improving on the traditional algorithm by a factor of 120.

Note also that the running time of the accelerated algorithm is nearly independent of the dimension n of the domain. Recall, on the other hand, that the number of steps n_K (Figure 24) grows with n. This implies that the bulk of the computation lies in the steps preceding the Kruskal algorithm proper: it is in computing the co-occurrences and organizing the data that most of the time is spent. Figure 23 also confirms that the running time of the traditional Chow-Liu algorithm grows quadratically with n and is independent of s.

This concludes the presentation of the acCL algorithm. The method achieves its performance gains by exploiting characteristics of the data (sparseness) and of the problem (the weights represent mutual information) that are external to the maximum weight spanning tree algorithm proper. The resulting algorithm is quadratic in n in the worst case but typically much faster, which represents a significant asymptotic improvement over the O(n^2 N) of the traditional Chow and Liu algorithm. Moreover, if s is large, the acCL algorithm degrades gracefully to the standard Chow-Liu algorithm. The algorithm extends to non-integer counts, and hence is directly applicable to mixtures of trees.

As we have seen empirically, a very significant part of the running time is spent in computing co-occurrences. This prompts future work on learning statistical models over large domains that focuses on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm (Friedman, 1998; Friedman & Getoor, 1999) as well as A-D trees (Moore & Lee, 1998). The latter are closely related to our representation of the pairwise marginals by counts; in fact, our representation can be viewed as a "reduced" A-D tree that stores only pairwise statistics. Consequently, when an A-D tree representation is already computed, it can be exploited in steps 1 and 2 of the acCL algorithm. Other versions of the acCL algorithm are discussed by Meila-Predoviciu (1999).

7. Conclusions

We have presented the mixture of trees (MT), a probabilistic model in which joint probability distributions are represented as finite mixtures of tree distributions. Tree distributions have a number of virtues (representational, computational and statistical) but have limited expressive power. Bayesian and Markov networks achieve significantly greater expressive power while retaining many of the representational virtues of trees, but incur significantly
higher costs on the computational and statistical fronts. The mixture approach provides an alternative upgrade path. While Bayesian and Markov networks have no distinguished relationships between edges, and statistical model selection procedures for these networks generally involve additions and deletions of single edges, the MT model groups overlapping sets of edges into mixture components, and edges are added and removed via a maximum likelihood algorithm that is constrained to fit tree models in each mixture component. We have also seen that it is straightforward to develop Bayesian methods that allow finer control over the choice of edges and smooth the numerical parameterization of each of the component models.

Chow and Liu (1968) presented the basic maximum likelihood algorithm for fitting tree distributions that provides the M step of our EM algorithm, and also showed how to use ensembles of trees to solve classification problems, where each tree models the class-conditional density of one of the classes. This approach was pursued by Friedman et al. (1997, 1998), who emphasized the connection with the naive Bayes model and presented empirical results demonstrating the performance gains that can be obtained by enhancing naive Bayes to allow connectivity between the attributes. Our work is a further contribution to this general line of research: we treat an ensemble of trees as a mixture distribution. The mixture approach provides additional flexibility in the classification domain, where the "choice variable" need not be the class label, and also allows the architecture to be applied to unsupervised learning problems.

The algorithms for learning and inference that we have presented have relatively benign scaling: inference is linear in the dimension n, and each step of the EM learning algorithm is quadratic in n. Such favorable time complexity is an important virtue of our tree-based approach. In particularly large problems, however, such as those that arise in information retrieval applications, quadratic complexity can become onerous. To allow the use of the MT model in such cases, we have developed the acCL algorithm, which, by exploiting data sparseness and paying attention to data structure issues, significantly reduces the running time; we presented examples in which the speed-up obtained from the acCL algorithm was three orders of magnitude.

Are there other classes of graphical models whose structure can be learned efficiently from data? Consider the class of Bayesian networks for which the topological ordering of the variables is fixed and the number of parents of each node is bounded by a fixed constant. For this class the optimal model structure for a given target distribution can be found in polynomial time by a greedy algorithm. These models share with trees the property of being matroids (West, 1996). The matroid is the unique algebraic structure for which the "maximum weight" problem, in particular the maximum weight spanning tree problem, is solved optimally by a greedy algorithm. Graphical models that are matroids have efficient structure learning algorithms; it is an interesting open problem to find additional examples of such models.

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637).

Appendix A.

In this appendix we prove the following theorem from Section 6.2:

Theorem. Let u, v, w be discrete variables such that v and w do not co-occur with u in a given data set D. Let N_v, N_w be the number of data points for which v ≠ 0 and w ≠ 0, respectively, and let I_uv, I_uw be the corresponding empirical mutual information values based on the sample D. Then

N_v > N_w  implies  I_uv >= I_uw,

with equality only if u is identically 0.
Proof. We use the notation

P_{v0} = P_v(0) = 1 - \frac{N_v}{N}, \qquad P_{u0} = P_u(0) = 1 - \frac{N_u}{N}.

These values represent the (empirical) probabilities of v and u taking the value 0. Entropies will be denoted by H. We aim to show that I_uv is a decreasing function of P_{v0}.

We first note a "chain rule" expression for the entropy of a discrete variable. The entropy of any multivalued discrete variable v can be decomposed as

H_v = H(P_{v0}) + (1 - P_{v0}) \, H_{v \mid v \neq 0},   (12)

where H(P_{v0}) = -P_{v0} \log P_{v0} - (1 - P_{v0}) \log(1 - P_{v0}) is the entropy of the binary variable indicating whether v = 0, and

H_{v \mid v \neq 0} = - \sum_{i \neq 0} \frac{P_v(i)}{1 - P_{v0}} \log \frac{P_v(i)}{1 - P_{v0}}.

Note moreover that the mutual information of two variables u, v can be written as I_uv = H_u - H_{u|v}.

We now expand H_{u|v} using the decomposition in (12): H_{u|v} = P_{v0} H_{u \mid v=0} + (1 - P_{v0}) H_{u \mid v \neq 0}. Because u and v are never non-zero at the same time, all non-zero values of v are paired with zero values of u, so that H_{u \mid v \neq 0} = 0; furthermore, P[u = i \mid u \neq 0, v = 0] = P[u = i \mid u \neq 0], so the conditional entropy of u given u ≠ 0 is unchanged by conditioning on v = 0. The remaining term is the entropy of a binary variable whose probability is P[u = 0 \mid v = 0], and this probability equals

P[u = 0 \mid v = 0] = 1 - \frac{1 - P_{u0}}{P_{v0}}.

Note that in order to obtain a non-negative probability in the above equation one needs P_{u0} + P_{v0} >= 1, a condition that is always satisfied if u and v do not co-occur. Replacing the previous equations in the formula of the mutual information, we get

I_{uv} = -P_{u0} \log P_{u0} - P_{v0} \log P_{v0} + (P_{u0} + P_{v0} - 1) \log (P_{u0} + P_{v0} - 1).

This expression, remarkably, depends only on P_{u0} and P_{v0}. Taking its partial derivative with respect to P_{v0}, we obtain

\frac{\partial I_{uv}}{\partial P_{v0}} = \log \frac{P_{u0} + P_{v0} - 1}{P_{v0}},

a value that is always negative whenever P_{u0} < 1 (i.e., unless u is identically 0, in which case it is 0). This shows that the mutual information increases monotonically with the "occurrence frequency" of v, given by 1 - P_{v0}. Note also that the above expression for the derivative is consistent with the result obtained for binary variables in (11).

References

Bishop, C. M. (1999). Latent variable models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cambridge, MA: MIT Press.

Blake, C., & Merz, C. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/mlearn/MLRepository.html.

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in AI (pp. 64-72). Morgan Kaufmann.

Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 195-210.

Cheeseman, P., & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In U. Fayyad, G. Piatesky-Shapiro, P. Smyth, & Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 153-180). AAAI Press.

Cheng, J., Bell, D. A., & Liu, W. (1997). Learning belief networks from data: An information theory based approach. In Proceedings of the Sixth ACM International Conference on Information and Knowledge Management.

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3), 462-467.

Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 309-347.

Cormen, T. H., Leiserson, C. E., & Rivest, R. R. (1990). Introduction to Algorithms. Cambridge, MA: MIT Press.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., & Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. New York, NY: Springer.

Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7(3), 565-579.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1-38.

Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the Association for Computing Machinery.

Frey, B. J., Hinton, G. E., & Dayan, P. (1996). Does the wake-sleep algorithm produce good density estimators? In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Neural Information Processing Systems (pp. 661-667). Cambridge, MA: MIT Press.

Friedman, N. (1998). The Bayesian structural EM algorithm. In Proceedings of the 14th Conference on Uncertainty in AI (pp. 129-138). San Francisco, CA: Morgan Kaufmann.

Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 131-163.

Friedman, N., & Getoor, L. (1999). Efficient learning using constrained sufficient statistics. In Proceedings of the 7th International Workshop on Artificial Intelligence and Statistics (AISTATS-99).

Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1996). Learning probabilistic relational models. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1300-1307).

Friedman, N., Goldszmidt, M., & Lee, T. (1998). Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In Proceedings of the International Conference on Machine Learning (ICML).

Geiger, D. (1992). An entropy-based learning algorithm of Bayesian conditional trees. In Proceedings of the 8th Conference on Uncertainty in AI (pp. 92-97). Morgan Kaufmann Publishers.

Geiger, D., & Heckerman, D. (1996). Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 45-74.

Hastie, T., & Tibshirani, R. (1996). Discriminant analysis by mixture modeling. Journal of the Royal Statistical Society B, 155-176.

Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.

Hinton, G. E., Dayan, P., Frey, B., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158-1161.
Jelinek, F. (1997). Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 181-214.

Kontkanen, P., Myllymaki, P., & Tirri, H. (1996). Constructing Bayesian finite mixture models by the EM algorithm (Tech. Rep. No. C-1996-9). University of Helsinki, Department of Computer Science.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 191-201.

Lauritzen, S. L. (1996). Graphical Models. Oxford: Clarendon Press.

Lauritzen, S. L., Dawid, A. P., Larsen, B. N., & Leimer, H.-G. (1990). Independence properties of directed Markov fields. Networks, 579-605.

MacLachlan, G. J., & Bashford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. NY: Marcel Dekker.

Meila, M., & Jaakkola, T. (2000). Tractable Bayesian learning of tree distributions. In C. Boutilier & M. Goldszmidt (Eds.), Proceedings of the 16th Conference on Uncertainty in AI (pp. 380-388). San Francisco, CA: Morgan Kaufmann.

Meila, M., & Jordan, M. I. (1998). Estimating dependency structure as a hidden variable. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Neural Information Processing Systems (pp. 584-590). MIT Press.

Meila-Predoviciu, M. (1999). Learning with mixtures of trees. Unpublished doctoral dissertation, Massachusetts Institute of Technology.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood.

Monti, S., & Cooper, G. F. (1998). A Bayesian network classifier that combines a finite mixture model and a naive Bayes model (Tech. Rep. No. ISSP-98-01). University of Pittsburgh.

Moore, A. W., & Lee, M. S. (1998). Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 67-91.

Neal, R. M., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models (pp. 355-368). Cambridge, MA: MIT Press.

Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 1-38.

Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowledge-based neural networks to recognize genes in DNA sequences. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (pp. 530-538). Morgan Kaufmann Publishers.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufman Publishers.

Philips, P., Moon, H., Rauss, P., & Rizvi, S. (1997). The FERET evaluation methodology for face-recognition algorithms. In Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition. San Juan, Puerto Rico.

Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE Manual. http://www.cs.utoronto.ca/delve.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. New Jersey: World Scientific Publishing Company.

Rubin, D. B., & Thayer, D. T. (1983). EM algorithms for ML factor analysis. Psychometrika, 69-76.

Saul, L. K., & Jordan, M. I. (1999). A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan (Ed.), Learning in Graphical Models (pp. 541-554). Cambridge, MA: MIT Press.

Shafer, G., & Shenoy, P. (1990). Probability propagation. Annals of Mathematics and Artificial Intelligence, 327-352.

Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 227-270.

Thiesson, B., Meek, C., Chickering, D. M., & Heckerman, D. (1997). Learning mixtures of Bayes networks (Tech. Rep. No. MSR-POR-97-30). Microsoft Research.

Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A., & Weiner, A. M. (1987). Molecular Biology of the Gene (Vol. I, 4th ed.). Menlo Park, CA: The Benjamin/Cummings Publishing Company.

West, D. B. (1996). Introduction to Graph Theory. Upper Saddle River, NJ: Prentice Hall.