Journal of Machine Learning Research 1 (2000) 1-48. Submitted 4/00; Published 10/00

Learning with Mixtures of Trees

Marina Meila, mmp@stat.washington.edu
Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA

Michael I. Jordan, jordan@cs.berkeley.edu
Division of Computer Science and Department of Statistics, University of California, Berkeley, CA 94720-1776, USA

Editor: Leslie Pack Kaelbling

(c) 2000 Marina Meila and Michael I. Jordan.

Abstract

This paper describes the mixtures-of-trees model, a probabilistic model for discrete multidimensional domains. Mixtures-of-trees generalize the probabilistic trees of Chow and Liu (1968) in a different and complementary direction to that of Bayesian networks. We present efficient algorithms for learning mixtures-of-trees models in maximum likelihood and Bayesian frameworks. We also discuss additional efficiencies that can be obtained when data are "sparse," and we present data structures and algorithms that exploit such sparseness. Experimental results demonstrate the performance of the model for both density estimation and classification. We also discuss the sense in which tree-based classifiers perform an implicit form of feature selection, and demonstrate a resulting insensitivity to irrelevant attributes.

1. Introduction

Probabilistic inference has become a core technology in AI, largely due to developments in graph-theoretic methods for the representation and manipulation of complex probability distributions (Pearl, 1988). Whether in their guise as directed graphs (Bayesian networks) or as undirected graphs (Markov random fields), probabilistic graphical models have a number of virtues as representations of uncertainty and as inference engines. Graphical models allow a separation between the qualitative, structural aspects of uncertain knowledge and the quantitative, parametric aspects of uncertainty: the former are represented via patterns of edges in the graph, the latter as numerical values associated with subsets of nodes in the graph. This separation is often found to be natural by domain experts, taming some of the problems associated with structuring, interpreting, and troubleshooting the model. Even more importantly, the graph-theoretic framework has allowed for the development of general inference algorithms, which in many cases provide orders-of-magnitude speedups over brute-force methods (Cowell, Dawid, Lauritzen, & Spiegelhalter, 1999; Shafer & Shenoy, 1990).

These virtues have not gone unnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems
for classification, prediction and density estimation (Bishop, 1999; Friedman, Geiger, & Goldszmidt, 1997; Heckerman, Geiger, & Chickering, 1995; Hinton, Dayan, Frey, & Neal, 1995; Friedman, Getoor, Koller, & Pfeffer, 1996; Monti & Cooper, 1998; Saul & Jordan, 1999). Indeed, it is possible to view a wide variety of classical machine learning architectures as instances of graphical models, and the graphical model framework provides a natural design procedure for exploring architectural variations on classical themes (Buntine, 1996; Smyth, Heckerman, & Jordan, 1997).

As in many machine learning problems, the problem of learning a graphical model from data can be divided into the problem of parameter learning and the problem of structure learning. Much progress has been made on the former problem, much of it cast within the framework of the expectation-maximization (EM) algorithm (Lauritzen, 1995). The EM algorithm essentially runs a probabilistic inference algorithm as a subroutine to compute the "expected sufficient statistics" for the data, reducing the parameter learning problem to a decoupled set of local statistical estimation problems at each node of the graph. This link between probabilistic inference and parameter learning is an important one, allowing developments in efficient inference to have immediate impact on research on learning algorithms. The problem of learning the structure of a graph from data is significantly harder. In practice, most structure learning methods are heuristic methods that perform local search by starting with a given graph and improving it by adding or deleting one edge at a time (Heckerman et al., 1995; Cooper & Herskovits, 1992).

There is an important special case in which both parameter learning and structure learning are tractable, namely the case of graphical models in the form of a tree distribution. As shown by Chow and Liu (1968), the tree distribution that maximizes the likelihood of a set of observations, as well as the parameters of the tree, can be found in time quadratic in the number of variables in the domain. This algorithm is known as the Chow-Liu algorithm. Trees also have the virtue that probabilistic inference is guaranteed to be efficient, and indeed historically the earliest research in AI on efficient inference focused on trees (Pearl, 1988). Later research extended this early work by first considering general singly-connected graphs (Pearl, 1988), and then considering graphs with arbitrary (acyclic) patterns of connectivity (Cowell et al., 1999). This line of research has provided one useful "upgrade path" from tree distributions to the complex Bayesian and Markov networks currently being studied.

In this paper we consider an alternative upgrade path. Inspired by the success of mixture models in providing simple, effective generalizations of classical methods in many simpler density estimation settings (MacLachlan & Bashford, 1988), we consider a generalization of tree distributions known as the mixtures-of-trees (MT) model. As suggested in Figure 1, the MT model involves the probabilistic mixture of a set of graphical components, each of which is a tree. In this paper we describe likelihood-based algorithms for learning the parameters and structure of such models.

One can also consider probabilistic mixtures of more general graphical models; indeed, the general case is the Bayesian multinet introduced by Geiger and Heckerman (1996). The Bayesian multinet is a mixture model in which each mixture component is an arbitrary graphical model.
Figure 1: A mixture of trees over a domain consisting of random variables a, b, c, d, e, where z is a hidden choice variable. Conditional on the value of z (z = 1, 2, 3), the dependency structure is a tree. A detailed presentation of the mixture-of-trees model is provided in Section 3.

The advantage of Bayesian multinets over more traditional graphical models is the ability to represent context-specific independencies: situations in which subsets of variables exhibit certain conditional independencies for some, but not all, values of a conditioning variable. (Further work on context-specific independence has been presented by Boutilier, Friedman, Goldszmidt, & Koller, 1996.) By making context-specific independencies explicit as multiple collections of edges, one can obtain (a) more parsimonious representations of joint probabilities and (b) more efficient inference algorithms.

In the machine learning setting, however, the advantages of the general Bayesian multinet formalism are less apparent. Allowing each mixture component to be a general graphical model forces us to face the difficulties of learning general graphical structure. Moreover, greedy edge-addition and edge-deletion algorithms seem particularly ill-suited to the Bayesian multinet, given that it is the focus on collections of edges rather than single edges that underlies much of the intuitive appeal of this architecture.

We view the mixture of trees as providing a reasonable compromise between the simplicity of tree distributions and the expressive power of the Bayesian multinet, while doing so within a restricted setting that leads to efficient machine learning algorithms. In particular, as we show in this paper, there is a simple generalization of the Chow-Liu algorithm that makes it possible to find (local) maxima of likelihoods (or penalized likelihoods) efficiently in general MT models. This algorithm is an iterative Expectation-Maximization (EM) algorithm, in which the inner loop (the M step) involves invoking the Chow-Liu algorithm to determine the structure and parameters of the individual mixture components. Thus, in a very concrete sense, this algorithm searches in the space of collections of edges.

In summary, the MT model is a multiple network representation that shares many of the basic features of Bayesian and Markov network representations, but brings new features to the fore. We believe that these features expand the scope of graph-theoretic probabilistic representations in useful ways and may be particularly appropriate for machine learning problems.
1.1 Related work

The MT model can be used both in the classification setting and the density estimation setting, and it makes contact with different strands of previous literature in these two guises.

In the classification setting, the MT model builds on the seminal work on tree-based classifiers by Chow and Liu (1968), and on recent extensions due to Friedman et al. (1997) and Friedman, Goldszmidt, and Lee (1998). Chow and Liu proposed to solve multi-way classification problems by fitting a separate tree to the observed variables in each of the classes, and classifying a new data point by choosing the class having maximum class-conditional probability under the corresponding tree model. Friedman et al. took as their point of departure the Naive Bayes model, which can be viewed as a graphical model in which an explicit class node has directed edges to an otherwise disconnected set of nodes representing the input variables (i.e., attributes). Introducing additional edges between the input variables yields the Tree Augmented Naive Bayes (TANB) classifier (Friedman et al., 1997; Geiger, 1992). These authors also considered a less constrained model in which different patterns of edges were allowed for each value of the class node; this is formally identical to the Chow and Liu proposal.

If the choice variable of the MT model is identified with the class label, then the MT model is identical to the Chow and Liu approach (in the classification setting). However, we do not necessarily wish to identify the choice variable with the class label, and, indeed, in our experiments on classification we treat the class label as simply another input variable. This yields a more discriminative approach to classification in which all of the training data are pooled for the purposes of training the model (Section 5; Meila & Jordan, 1998). The choice variable remains hidden, yielding a mixture model for each class. This is similar in spirit to the "mixture discriminant analysis" model of Hastie and Tibshirani (1996), where a mixture of Gaussians is used for each class in a multiway classification problem.

In the setting of density estimation, clustering and compression problems, the MT model makes contact with the large and active literature on mixture modeling. Let us briefly review some of the most salient connections. The Auto-Class model (Cheeseman & Stutz, 1995) is a mixture of factorial distributions (MF), and its excellent cost/performance ratio motivates the MT model in much the same way as the Naive Bayes model motivates the TANB model in the classification setting. (A factorial distribution is a product of factors, each of which depends on exactly one variable.) Kontkanen, Myllymaki, and Tirri (1996) study an MF in which a hidden variable is used for classification; this approach was extended by Monti and Cooper (1998). The idea of learning tractable but simple belief networks and superimposing a mixture to account for the remaining dependencies was developed independently of our work by Thiesson, Meek, Chickering, and Heckerman (1997), who studied mixtures of Gaussian belief networks. Their work interleaves EM parameter search with Bayesian model search in a heuristic but general algorithm.
2. Tree distributions

In this section we introduce the tree model and the notation that will be used throughout the paper. Let V denote a set of n discrete random variables of interest. For each random variable v in V, let Omega(v) represent its range, x_v a particular value in that range, and r_v the (finite) cardinality of Omega(v). For each subset A of V, let Omega(A) denote the product of the ranges of the variables in A and let x_A denote an assignment to the variables in A. To simplify notation, x_V will be denoted by x and Omega(V) will be denoted simply by Omega. Sometimes we need to refer to the maximum of r_v over V; we denote this value by r_MAX.

We begin with undirected (Markov random field) representations of tree distributions. Identifying the vertex set of a graph with the set V of random variables, consider a graph (V, E), where E is a set of undirected edges. We allow a tree to have multiple connected components (thus our "trees" are generally called forests). Given this definition, the number of edges and the number of connected components are related as n = |E| + (number of connected components), implying that adding an edge to a tree reduces the number of connected components by 1. Thus, a tree can have at most n - 1 edges; in this latter case we refer to the tree as a spanning tree.

We parameterize a tree in the following way. For u, v in V and (u, v) in E, let T_uv denote a joint probability distribution on x_u and x_v. We require these distributions to be consistent with respect to marginalization, denoting by T_v(x_v) the marginal of T_uv(x_u, x_v) with respect to x_u for any u. We now assign a distribution T to the graph (V, E) as follows:

    T(x) = \prod_{(u,v)\in E} T_{uv}(x_u, x_v) \Big/ \prod_{v\in V} T_v(x_v)^{\deg v - 1}        (1)

where deg v is the degree of vertex v, i.e., the number of edges incident to v. It can be verified that T is in fact a probability distribution; moreover, the pairwise probabilities T_uv are the marginals of T. A tree distribution T is defined to be any distribution that admits a factorization of the form (1).

Tree distributions can also be represented using directed (Bayesian network) graphical models. Let (V, E) denote a directed tree (possibly a forest), where E is a set of directed edges and where each node v has (at most) one parent, denoted pa(v). We parameterize this graph as follows:

    T(x) = \prod_{v\in V} T_{v|pa(v)}(x_v \mid x_{pa(v)})        (2)

where T_{v|pa(v)} is an arbitrary conditional distribution. It can be verified that T indeed defines a probability distribution; moreover, the marginals and conditionals of T are given by the corresponding T_uv and T_{v|pa(v)}. We shall call the representations (1) and (2) the undirected and directed tree representations of the distribution T, respectively.
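To make the directed representation (2) concrete, the following is a minimal sketch (an illustrative assumption of this rewrite, not code from the paper) of how such a tree might be stored and how the log-likelihood of an assignment is evaluated with one factor per variable; the variable names and toy conditional tables are made up for the example.

```python
import math

# A toy directed tree (forest) over three binary variables, stored as:
#   parent[v] : parent of v, or None for a root
#   cpt[v]    : conditional table; cpt[v][(x_v, x_pa)] = T(x_v | x_pa),
#               with x_pa = None when v is a root.
parent = {"a": None, "b": "a", "c": "a"}
cpt = {
    "a": {(0, None): 0.6, (1, None): 0.4},
    "b": {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8},
    "c": {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.5},
}

def tree_log_prob(x):
    """log T(x) as a product of one conditional per node (equation 2): O(n) work."""
    logp = 0.0
    for v, pa in parent.items():
        x_pa = x[pa] if pa is not None else None
        logp += math.log(cpt[v][(x[v], x_pa)])
    return logp

print(tree_log_prob({"a": 1, "b": 1, "c": 0}))  # log(0.4) + log(0.8) + log(0.5)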
We can readily convert between these representations. For example, to convert (1) to a directed representation we choose an arbitrary root in each connected component and direct each edge away from the root. For (u, v) in E with u closer to the root than v, let pa(v) = u. Now compute the conditional probabilities corresponding to each directed edge by recursively substituting T_{v|pa(v)} = T_{v,pa(v)} / T_{pa(v)}, starting from the root. Figure 2 illustrates this process on a tree with 5 vertices.

Figure 2: A tree in its undirected (a) and directed (b) representations.

The directed tree representation has the advantage of having independent parameters. The total number of free parameters in either representation is

    \sum_{v\in V}(r_v - 1) + \sum_{(u,v)\in E}(r_u - 1)(r_v - 1).        (3)

The right-hand side of (3) shows that each edge (u, v) increases the number of parameters by (r_u - 1)(r_v - 1).

The set of conditional independencies associated with a tree distribution is readily characterized (Lauritzen, Dawid, Larsen, & Leimer, 1990). In particular, two subsets A, B of V are independent given C if C intersects every path (ignoring the direction of edges in the directed case) between u and v, for all u in A and v in B.

2.1 Marginalization, inference and sampling in tree distributions

The basic operations of computing likelihoods, conditioning, marginalization, sampling and inference can be performed efficiently in tree distributions; in particular, each of these operations has time complexity linear in n. This is a direct consequence of the factorized representation of tree distributions in equations (1) and (2).

2.2 Representational capabilities

If graphical representations are natural for human intuition, then the subclass of tree models is particularly intuitive. Trees are sparse graphs, having n - 1 or fewer edges. There is at most one path between every pair of variables; thus, independence relationships between subsets of variables, which are not easy to read out in general Bayesian network topologies, are obvious in a tree. In a tree, an edge corresponds to the simple, common-sense notion of direct dependency and is the natural representation for it. However, the very simplicity that makes tree models appealing also limits their modeling power. Note that the number of free parameters in a tree grows linearly with n, while the size of the state space Omega(V) is an exponential function of n. Thus the class of dependency structures representable by trees is a relatively small one.

2.3 Learning of tree distributions

The learning problem is formulated as follows: we are given a set of observations D = {x^1, ..., x^N} and we are required to find the tree that maximizes the log-likelihood of the data:

    T^{opt} = \arg\max_T \sum_{i=1}^{N} \log T(x^i),

where x^i is an assignment of values to all variables. Note that the maximum is taken both with respect to the tree structure (the choice of which edges to include) and with respect to the numerical values of the parameters. Here and in the rest of the paper we will assume for simplicity that there are no missing values for the variables in V, or, in other words, that the observations are complete.

Letting P(x) denote the proportion of observations in the training set that are equal to x, we can alternatively express the maximum likelihood problem by summing over configurations:

    T^{opt} = \arg\max_T \sum_{x} P(x) \log T(x).

In this form we see that the log-likelihood criterion function is a (negative) cross-entropy. We will in fact solve the problem in general, letting P(x) be an arbitrary probability distribution. This generality will prove useful in the following section where we consider mixtures of trees.

The solution to the learning problem is an algorithm, due to Chow and Liu (1968), that has quadratic complexity in n (see Figure 3). There are three steps to the algorithm. First, we compute the pairwise marginals P_uv; if P is an empirical distribution, as in the present case, computing these marginals requires O(n^2 N) operations. Second, from these marginals we compute the mutual information between each pair of variables in V under the distribution P,

    I_{uv} = \sum_{x_u, x_v} P_{uv}(x_u, x_v) \log \frac{P_{uv}(x_u, x_v)}{P_u(x_u)\, P_v(x_v)},   for u, v \in V, u \neq v,
an operation that requires O(n^2 r_MAX^2) operations. Third, we run a maximum-weight spanning tree (MWST) algorithm (Cormen, Leiserson, & Rivest, 1990), using I_uv as the weight for edge (u, v), for all u, v. Such algorithms, which run in time quadratic in n (up to a logarithmic factor), return a spanning tree that maximizes the total mutual information for edges included in the tree.

Chow and Liu showed that the maximum-weight spanning tree also maximizes the likelihood over tree distributions, and moreover the optimizing parameters T_uv, for (u, v) in E, are equal to the corresponding marginals P_uv of the distribution P. The algorithm thus attains a global optimum over both structure and parameters.

Algorithm ChowLiu(P)
  Input:  distribution P over domain V
          procedure MWST(weights) that outputs a maximum-weight spanning tree over V
  1. Compute the marginal distributions P_v, P_uv for u, v in V
  2. Compute the mutual information values I_uv for u, v in V
  3. E = MWST({I_uv})
  4. Set T_uv = P_uv for (u, v) in E
  Output: T

Figure 3: The Chow and Liu algorithm for maximum likelihood estimation of tree structure and parameters.
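As a concrete illustration of Figure 3, here is a minimal sketch (an assumption of this rewrite, not the authors' code) of the Chow-Liu procedure for discrete data stored as an integer matrix: empirical pairwise tables, mutual informations, and a Kruskal-style maximum-weight spanning tree.

```python
import numpy as np
from itertools import combinations

def mutual_information(counts_uv, N):
    """I(u;v) from a joint count table (r_u x r_v) over N samples."""
    P = counts_uv / N
    Pu, Pv = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log(P[nz] / (Pu @ Pv)[nz])).sum())

def chow_liu(data, arities):
    """Steps 1-3 of Figure 3: marginals, mutual informations, MWST edge set."""
    N, n = data.shape
    weight = {}
    for u, v in combinations(range(n), 2):
        counts = np.zeros((arities[u], arities[v]))
        np.add.at(counts, (data[:, u], data[:, v]), 1.0)
        weight[(u, v)] = mutual_information(counts, N)
    # Kruskal's algorithm with a union-find yields the maximum-weight spanning tree.
    root = list(range(n))
    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a
    edges = []
    for (u, v), w in sorted(weight.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            root[ru] = rv
            edges.append((u, v))
    return edges

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=500)
b = a ^ (rng.random(500) < 0.1)          # b: noisy copy of a
c = rng.integers(0, 2, size=500)         # c: independent noise
data = np.column_stack([a, b, c]).astype(int)
print(chow_liu(data, arities=[2, 2, 2]))  # the strong edge (0, 1) is picked first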
3. Mixtures of trees

We define a mixture-of-trees (MT) model to be a distribution of the form

    Q(x) = \sum_{k=1}^{m} \lambda_k T_k(x),   with \lambda_k \geq 0 for k = 1, ..., m and \sum_{k=1}^{m} \lambda_k = 1.

The tree distributions T_k are the mixture components and the \lambda_k are called mixture coefficients. A mixture of trees can be viewed as containing an unobserved choice variable z, which takes value k in {1, ..., m} with probability \lambda_k. Conditioned on the value of z, the distribution of the observed variables is a tree. The trees may have different structures and different parameters. In Figure 1, for example, we have a mixture of trees with m = 3 and n = 5.

Note that because of the varying structure of the component trees, a mixture of trees is neither a Bayesian network nor a Markov random field. Let us adopt the notation "A is independent of B given C under distribution Q". If for some (or all) k = 1, ..., m we have A independent of B given C under T_k, with A, B, C subsets of V, this will not imply that A is independent of B given C and z under Q. On the other hand, a mixture of trees is capable of representing dependency structures that are conditioned on the value of a variable (the choice variable), something that a usual Bayesian network or Markov random field cannot do. Situations where such a model is potentially useful abound in real life. Consider for example bitmaps of handwritten digits. Such images obviously contain many dependencies between pixels; however, the pattern of these dependencies will vary across digits. Or imagine a medical database recording the body weight and other data for each patient. The body weight could be a function of age and height for a healthy person, but it would depend on other conditions if the patient suffered from a disease or were an athlete. If, in a situation like the ones mentioned above, conditioning on one variable produces a dependency structure characterized by sparse, acyclic pairwise dependencies, then a mixture of trees may provide a good model of the domain.

If we constrain all of the trees in the mixture to have the same structure we obtain a mixture of trees with shared structure (MTSS; see Figure 4).

Figure 4: A mixture of trees with shared structure (MTSS) represented as a Bayes net (a) and as a Markov random field (b).

In the case of the MTSS, if for some (or all) k = 1, ..., m the subsets A and B are independent given C under T_k, then they are independent given C and z under Q. In addition, an MTSS can be represented as a Bayesian network (Figure 4, a), as a Markov random field (Figure 4, b) and as a chain graph (Figure 5). Chain graphs were introduced by Lauritzen (1996); they represent a superclass of both Bayesian networks and Markov random fields. A chain graph contains both directed and undirected edges.

While we generally consider problems in which the choice variable is hidden (i.e., unobserved), it is also possible to utilize both the MT and the MTSS frameworks in settings in which the choice variable is observed. Such models, which, as we discuss in Section 1.1, have been studied previously by Friedman et al. (1997) and Friedman et al. (1998), will be referred to generically as mixtures with observed choice variable. Unless stated otherwise, it will be assumed that the choice variable is hidden.

Figure 5: An MTSS represented as a chain graph. The double boxes enclose the undirected blocks of the chain graph.

3.1 Marginalization, inference and sampling in MT models

Let Q(x) = \sum_k \lambda_k T_k(x) be an MT model. We consider the basic operations of marginalization, inference and sampling, recalling that these operations have time complexity linear in n for each of the component tree distributions T_k.

Marginalization. The marginal distribution of a subset A of V is given as follows:

    Q_A(x_A) = \sum_{k=1}^{m} \lambda_k T_{k,A}(x_A).

Hence, the marginal of Q is a mixture of the marginals of the component trees.

Inference. Let x_{V'}, with V' a subset of V, be the evidence. Then the probability of the hidden variable given the evidence is obtained by applying Bayes' rule as follows:

    \Pr[z = k \mid V' = x_{V'}] = \frac{\lambda_k T_{k,V'}(x_{V'})}{\sum_{k'} \lambda_{k'} T_{k',V'}(x_{V'})}.

In particular, when we observe all but the choice variable, i.e., V' = V and x_{V'} = x, we obtain the posterior probability distribution

    \Pr[z = k \mid x] = \frac{\lambda_k T_k(x)}{\sum_{k'} \lambda_{k'} T_{k'}(x)}.        (4)

The probability distribution of a given subset A of V given the evidence is

    Q_{A|V'}(x_A \mid x_{V'}) = \sum_{k=1}^{m} \Pr[z = k \mid V' = x_{V'}]\; T_{k,A|V'}(x_A \mid x_{V'}).

Thus the result is again a mixture of the results of inference procedures run on the component trees.

Sampling. The procedure for sampling from a mixture of trees is a two-stage process: first one samples a value k for the choice variable from its distribution (\lambda_1, ..., \lambda_m); then a value x is sampled from T_k using the procedure for sampling from a tree distribution.

In summary, the basic operations on mixtures of trees (marginalization, conditioning and sampling) are achieved by performing the corresponding operation on each component of the mixture and then combining the results. Therefore, the complexity of these operations scales linearly with the number of trees in the mixture.

3.2 Learning of MT models

The expectation-maximization (EM) algorithm provides an effective approach to solving many learning problems (Dempster, Laird, & Rubin, 1977; MacLachlan & Bashford, 1988), and has been employed with particular success in the setting of mixture models and more general latent variable models (Jordan & Jacobs, 1994; Jelinek, 1997; Rubin & Thayer, 1983). In this section we show that EM also provides a natural approach to the learning problem for the MT model.

An important feature of the solution that we present is that it provides estimates for both parametric and structural aspects of the model. In particular, although we assume (in the current section) that the number of trees m is fixed, the algorithm that we derive provides estimates for both the pattern of edges of the individual trees and their parameters. These estimates are maximum likelihood estimates, and although they are subject to concerns about overfitting, the constrained nature of tree distributions helps to ameliorate overfitting problems. It is also possible to control the number of edges indirectly via priors; we discuss methods for doing this in Section 4.

We are given a set of observations D = {x^1, ..., x^N} and are required to find the mixture of trees Q that satisfies

    Q^{opt} = \arg\max_Q \sum_{i=1}^{N} \log Q(x^i).

Within the framework of the EM algorithm, this likelihood function is referred to as the incomplete log-likelihood and it is contrasted with the following complete log-likelihood function:

    l_c(x^{1,...,N}, z^{1,...,N}) = \sum_{i=1}^{N} \sum_{k=1}^{m} \delta_{k, z^i} \left( \log \lambda_k + \log T_k(x^i) \right),        (5)

where \delta_{k, z^i} is equal to one if z^i is equal to the kth value of the choice variable and zero otherwise. The complete log-likelihood would be the log-likelihood of the data if the unobserved z^1, ..., z^N could be observed.
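Before turning to the EM derivation, a small sketch (an illustrative assumption, not the paper's code) of how the mixture density Q(x) and the posterior of equation (4) might be evaluated numerically, given per-component log-likelihood functions; the log-sum-exp trick is used for numerical stability.

```python
import math

def log_mixture_prob(lambdas, tree_log_probs, x):
    """log Q(x) for Q(x) = sum_k lambda_k T_k(x), via log-sum-exp."""
    terms = [math.log(l) + lp(x) for l, lp in zip(lambdas, tree_log_probs)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def choice_posterior(lambdas, tree_log_probs, x):
    """Pr[z = k | x] by Bayes' rule (equation 4)."""
    terms = [math.log(l) + lp(x) for l, lp in zip(lambdas, tree_log_probs)]
    m = max(terms)
    w = [math.exp(t - m) for t in terms]
    s = sum(w)
    return [wk / s for wk in w]

# Two toy "tree" components over a pair of binary variables.
t1 = lambda x: math.log(0.45 if x[0] == x[1] else 0.05)   # strongly correlated pair
t2 = lambda x: math.log(0.25)                              # independent, uniform pair
print(choice_posterior([0.5, 0.5], [t1, t2], (1, 1)))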
The idea of the EM algorithm is to utilize the complete log-likelihood, which is generally easy to maximize, as a surrogate for the incomplete log-likelihood, which is generally somewhat less easy to maximize directly. In particular, the algorithm goes uphill in the expected value of the complete log-likelihood, where the expectation is taken with respect to the unobserved data. The algorithm thus has the form of an interacting pair of steps: the E step, in which the expectation is computed given the current value of the parameters, and the M step, in which the parameters are adjusted so as to maximize the expected complete log-likelihood. These two steps iterate and are proved to converge to a local maximum of the (incomplete) log-likelihood (Dempster et al., 1977).

Taking the expectation of (5), we see that the E step for the MT model reduces to taking the expectation of the delta function \delta_{k, z^i}, conditioned on the data:

    E[\delta_{k, z^i} \mid D] = \Pr[z^i = k \mid V = x^i],        (6)

and this latter quantity is recognizable as the posterior probability of the hidden variable given the ith observation (cf. equation (4)). Let us define

    \gamma_k(i) = \Pr[z^i = k \mid V = x^i]

as this posterior probability. Substituting (6) into the expected value of the complete log-likelihood in (5), we obtain:

    E[l_c(x^{1,...,N}, z^{1,...,N})] = \sum_{k=1}^{m} \sum_{i=1}^{N} \gamma_k(i) \left( \log \lambda_k + \log T_k(x^i) \right).

Let us define the following quantities:

    \Gamma_k = \sum_{i=1}^{N} \gamma_k(i),        P_k(x^i) = \gamma_k(i) / \Gamma_k,        k = 1, ..., m,

where the sums \Gamma_k \in [0, N] can be interpreted as the total number of data points that are generated by component k. Using these definitions we obtain:

    E[l_c(x^{1,...,N}, z^{1,...,N})] = \sum_{k=1}^{m} \Gamma_k \Big( \log \lambda_k + \sum_{i=1}^{N} P_k(x^i) \log T_k(x^i) \Big).        (7)

It is this quantity that we must maximize with respect to the parameters. From (7) we see that E[l_c] separates into terms that depend on disjoint subsets of the model parameters, and thus the M step decouples into separate maximizations for each of the various parameters. Maximizing the first term of (7) with respect to the parameters \lambda_k, subject to the constraint \sum_k \lambda_k = 1, we obtain the following update equation:

    \lambda_k = \Gamma_k / N,   for k = 1, ..., m.

Algorithm MixTree(D, Q^0)
  Input:  dataset D = {x^1, ..., x^N}; initial model Q^0 = (m, T_1, ..., T_m, \lambda_1, ..., \lambda_m)
          procedure ChowLiu(P)
  until convergence:
    E step: compute \gamma_k(i) and P_k(x^i) for k = 1, ..., m and i = 1, ..., N
    M step: for k = 1, ..., m
              \lambda_k = \Gamma_k / N
              T_k = ChowLiu(P_k)
  Output: model Q = (m, T_1, ..., T_m, \lambda_1, ..., \lambda_m)

Figure 6: The MixTree algorithm for learning MT models.

In order to update T_k, we see that we must maximize the negative cross-entropy between P_k and T_k,

    \sum_{i=1}^{N} P_k(x^i) \log T_k(x^i).

This problem is solved by the ChowLiu algorithm from Section 2.3. Thus we see that the M step for learning the mixture components of MT models reduces to separate runs of the ChowLiu algorithm, where the "target" distribution P_k is the normalized posterior probability obtained from the E step. We summarize the results of the derivation of the EM algorithm for mixtures of trees in Figure 6.
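The following is a compact, self-contained sketch of the E/M loop of Figure 6 for binary variables; it is an illustrative assumption of this rewrite rather than the authors' code. The M step refits each component by a weighted Chow-Liu step (weighted pairwise marginals, mutual-information edge weights, Kruskal MWST), and each tree's likelihood is evaluated through the undirected factorization (1); a tiny additive constant guards against log(0) in this toy setting.

```python
import numpy as np
from itertools import combinations

def fit_weighted_tree(data, w, eps=1e-6):
    """Weighted Chow-Liu step: returns (edges, pairwise tables, single marginals)."""
    N, n = data.shape
    W = w.sum()
    Pv = np.stack([(np.bincount(data[:, v], weights=w, minlength=2) + eps) / (W + 2 * eps)
                   for v in range(n)])
    Puv, mi = {}, {}
    for u, v in combinations(range(n), 2):
        tab = np.full((2, 2), eps)
        np.add.at(tab, (data[:, u], data[:, v]), w)
        tab /= tab.sum()
        Puv[(u, v)] = tab
        mi[(u, v)] = float((tab * np.log(tab / np.outer(Pv[u], Pv[v]))).sum())
    root = list(range(n))                      # Kruskal MWST via union-find
    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]; a = root[a]
        return a
    edges = []
    for (u, v), _ in sorted(mi.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            root[ru] = rv; edges.append((u, v))
    return edges, Puv, Pv

def tree_log_prob(x, edges, Puv, Pv):
    """log T(x) via the undirected factorization (equation 1)."""
    deg = np.zeros(len(Pv), dtype=int)
    lp = 0.0
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
        lp += np.log(Puv[(u, v)][x[u], x[v]])
    for v, d in enumerate(deg):
        lp -= (d - 1) * np.log(Pv[v][x[v]])
    return lp

def mixtree_em(data, m, iters=10, seed=0):
    """MixTree sketch: E step gives gamma_k(i); M step sets lambda_k and refits T_k."""
    rng = np.random.default_rng(seed)
    N, _ = data.shape
    gamma = rng.dirichlet(np.ones(m), size=N)        # random soft initialization
    for _ in range(iters):
        lam = gamma.mean(axis=0)                     # M step: lambda_k = Gamma_k / N
        trees = [fit_weighted_tree(data, gamma[:, k]) for k in range(m)]
        logp = np.array([[np.log(lam[k]) + tree_log_prob(x, *trees[k])
                          for k in range(m)] for x in data])   # E step
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return lam, trees, gamma
```

The undirected factorization (1) is used for the likelihood here because it avoids rooting each tree; a directed parametrization as in (2) would give identical values.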
3.2.1 Running time

Computing the likelihood of a data point under a tree distribution (in the directed tree representation) takes O(n) multiplications. Hence, the E step requires O(mnN) floating point operations. As for the M step, the most computationally expensive phase is the computation of the marginals P^k_uv for the kth tree; this step has time complexity O(mn^2 N). Additional work of order O(mn^2 r_MAX^2) is required at each iteration for the mutual informations and for the runs of the MWST algorithm, and O(mn r_MAX^2) for computing the tree parameters in the directed representation. The total running time per EM iteration is thus dominated by the O(mn^2 N) marginal computation plus the O(mn^2 r_MAX^2) term. The algorithm is polynomial (per iteration) in the dimension of the domain, the number of components and the size of the dataset.

The space complexity is also polynomial, and is dominated by O(n^2 r_MAX^2), the space needed to store the pairwise marginal tables (the tables can be overwritten by successive values of k).

3.3 Learning mixtures of trees with shared structure

It is possible to modify the MixTree algorithm so as to constrain the trees to share the same structure, and thereby estimate MTSS models. The E step remains unchanged. The only novelty is the reestimation of the tree distributions in the M step, since they are now constrained to have the same structure. Thus, the maximization cannot be decoupled into separate tree estimations but, remarkably enough, it can still be performed efficiently. It can be readily verified that for any given structure the optimal parameters of each tree edge are equal to the parameters of the corresponding marginal distribution P^k_uv. It remains only to find the optimal structure. The expression to be optimized is the second sum in the right-hand side of equation (7). By replacing T^k_uv with P^k_uv and denoting by I^k_uv the mutual information between u and v under P_k, this sum can be reexpressed as follows:

    \sum_{k} \Gamma_k \Big[ \sum_{(u,v)\in E} I^k_{uv} - \sum_{v\in V} H(P^k_v) \Big]
      = N \sum_{(u,v)\in E} I_{uv|z} - N \sum_{v\in V} H(v \mid z).        (8)

The new quantity I_{uv|z} appearing above represents the mutual information of u and v conditioned on the hidden variable z. Its general definition for three discrete variables u, v, z distributed according to P_{uvz} is

    I_{uv|z} = \sum_{x_u, x_v, x_z} P_{uvz}(x_u, x_v, x_z) \log \frac{P_{uv|z}(x_u, x_v \mid x_z)}{P_{u|z}(x_u \mid x_z)\, P_{v|z}(x_v \mid x_z)}.

The second term in (8), the sum of the conditional entropies of the variables given z, is independent of the tree structure. Hence, the optimization of the structure can be achieved by running a MWST algorithm with the edge weights represented by I_{uv|z}. We summarize the algorithm in Figure 7.

Algorithm MixTreeSS(D, Q^0)
  Input:  dataset D = {x^1, ..., x^N}; initial model Q^0 = (m, T_1, ..., T_m, \lambda_1, ..., \lambda_m)
          procedure MWST(weights) that outputs a maximum-weight spanning tree
  until convergence:
    E step: compute \gamma_k(i) and P_k(x^i) for k = 1, ..., m and i = 1, ..., N
    M step: compute the marginals P^k_v, P^k_uv for u, v in V
            compute the conditional mutual information values I_{uv|z} for u, v in V
            E = MWST({I_{uv|z}})
            set T^k_uv = P^k_uv for (u, v) in E, and \lambda_k = \Gamma_k / N, for k = 1, ..., m
  Output: model Q = (m, T_1, ..., T_m, \lambda_1, ..., \lambda_m)

Figure 7: The MixTreeSS algorithm for learning MTSS models.
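In practice the conditional mutual information in (8) can be accumulated directly from the per-component quantities of the E step; the short sketch below (an assumption of this rewrite, not the paper's code) computes the shared-structure edge weights I_{uv|z} as a responsibility-weighted average of the per-component mutual informations, which is how equation (8) decomposes.

```python
import numpy as np

def shared_structure_weights(per_tree_mi, gamma):
    """I_{uv|z} as sum_k (Gamma_k / N) * I^k_uv, cf. equation (8).

    per_tree_mi: list of m dicts {(u, v): I^k_uv} computed under each P_k
    gamma      : (N, m) posterior responsibilities from the E step
    """
    N, m = gamma.shape
    coef = gamma.sum(axis=0) / N              # Gamma_k / N
    weights = {}
    for k in range(m):
        for edge, mi in per_tree_mi[k].items():
            weights[edge] = weights.get(edge, 0.0) + coef[k] * mi
    return weights                            # feed these to the MWST routine

# toy check: two components, three variables
gamma = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
mi = [{(0, 1): 0.6, (0, 2): 0.1, (1, 2): 0.0},
      {(0, 1): 0.1, (0, 2): 0.5, (1, 2): 0.0}]
print(shared_structure_weights(mi, gamma))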
4. Decomposable priors and MAP estimation for mixtures of trees

The Bayesian learning framework combines information obtained from direct observations with prior knowledge about the model, when the latter is represented as a probability distribution. The object of interest of Bayesian analysis is the posterior distribution over the models given the observed data, Pr[Q | D], a quantity which can rarely be calculated explicitly. Practical methods for approximating the posterior include choosing a single maximum a posteriori (MAP) estimate, replacing the continuous space of models by a finite set of high posterior probability (Heckerman et al., 1995), and expanding the posterior around its mode(s) (Cheeseman & Stutz, 1995).

Finding the local maxima (modes) of the distribution Pr[Q | D] is a necessary step in all the above methods and is our primary concern in this section. We demonstrate that maximum a posteriori modes can be found as efficiently as maximum likelihood modes, given a particular choice of prior. This has two consequences. First, it makes approximate Bayesian averaging possible. Second, if one uses a non-informative prior, then MAP estimation is equivalent to Bayesian smoothing, and represents a form of regularization. Regularization is particularly useful in the case of small datasets in order to prevent overfitting.

4.1 MAP estimation by the EM algorithm

For a model Q and dataset D the logarithm of the posterior Pr[Q | D] equals

    \log \Pr[Q] + \sum_{x \in D} \log Q(x)

plus an additive constant. The EM algorithm can be adapted to maximize the log posterior for every fixed m (Neal & Hinton, 1999). Indeed, by comparing with equation (5) we see that the quantity to be maximized is now

    E[\log \Pr[Q \mid x^{1,...,N}, z^{1,...,N}]] = \log \Pr[Q] + E[l_c(x^{1,...,N}, z^{1,...,N})].        (9)

The prior term does not influence the E step of the EM algorithm, which proceeds exactly as before (cf. equation (6)). To be able to successfully maximize the right-hand side of (9) in the M step we require that log Pr[Q] decompose into a sum of independent terms matching the decomposition of E[l_c] in (7). A prior over mixtures of trees that is amenable to this decomposition is called a decomposable prior. It has the following product form:

    \Pr[Q] = \Pr[\lambda_1, ..., \lambda_m] \prod_{k=1}^{m} \Pr[E_k]\; \Pr[\text{parameters of } T_k \mid E_k],

where the parameter factor itself factors over the conditional tables T^k_{v|pa(v)}. The first factor inside the product represents the prior over the tree structure while the second factor is the prior over the parameters. Requiring that the prior be decomposable is equivalent to making several independence assumptions: in particular, it means that the parameters of each tree in the mixture are independent of the parameters of all the other trees as well as of the probability of the mixture variable. In the following sections we show that these assumptions are not overly restrictive, by constructing decomposable priors for tree structures and parameters and showing that this class is rich enough to contain members that are of practical importance.

4.2 Decomposable priors for tree structures

The general form of a decomposable prior for the tree structure is one where each edge contributes a constant factor independent of the presence or absence of other edges in E_k:

    \Pr[E_k] \propto \prod_{(u,v)\in E_k} \exp(-\beta_{uv}).

With this prior, each edge weight in tree k is adjusted by the corresponding \beta value divided by the total number of points that tree k is responsible for:

    W^k_{uv} = I^k_{uv} - \beta_{uv} / \Gamma_k.

A negative \beta_{uv} increases the probability of (u, v) being present in the final solution, whereas a positive value of \beta_{uv} acts like a penalty on the presence of edge (u, v) in the tree. If the procedure is modified so as to not add negative-weight edges, one can obtain (disconnected) trees having fewer than n - 1 edges. Note that the strength of the prior is inversely proportional to \Gamma_k, the total number of data points assigned to mixture component k. Thus, with equal priors for all trees, trees accounting for fewer data points will be penalized more strongly and therefore will be likely to have fewer edges. If one chooses the edge penalties to be proportional to the increase in the number of parameters caused by the addition of edge (u, v) to the tree,

    \beta_{uv} = \frac{(r_u - 1)(r_v - 1)}{2} \log N,

then a Minimum Description Length (MDL) (Rissanen, 1989) type of prior is implemented.
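The edge-penalty adjustment is easy to fold into the spanning-tree step; the snippet below (a sketch under the assumptions of this rewrite) penalizes each mutual-information weight by beta/Gamma_k and drops edges whose adjusted weight becomes non-positive, which is how disconnected trees with fewer than n - 1 edges arise.

```python
def penalized_weights(mi, beta, gamma_k):
    """W_uv = I_uv - beta_uv / Gamma_k; edges with W_uv <= 0 are simply not offered
    to the maximum-weight spanning tree routine."""
    adjusted = {e: w - beta.get(e, 0.0) / gamma_k for e, w in mi.items()}
    return {e: w for e, w in adjusted.items() if w > 0}

mi = {(0, 1): 0.40, (0, 2): 0.02, (1, 2): 0.01}
beta = {e: 0.5 for e in mi}          # a uniform penalty, e.g. an MDL-style choice
print(penalized_weights(mi, beta, gamma_k=25.0))   # weak edges disappear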
In the context of learning Bayesian networks, Heckerman et al. (1995) suggested the prior

    \Pr[E] \propto \kappa^{\Delta(E, E^*)},   with 0 < \kappa < 1,

where \Delta(.,.) is a distance metric between Bayes net structures and E^* is the prior network structure. Thus, this prior penalizes deviations from the prior network. This prior is decomposable, entailing an edge penalty \beta_{uv} whose sign depends on whether or not (u, v) appears in E^*.

Decomposable priors on structure can also be used when the structure is common for all trees (MTSS). In this case the effect of the prior is to penalize the corresponding edge weight in (8) by \beta_{uv}/N.

A decomposable prior has the remarkable property that its normalization constant can be computed exactly in closed form. This makes it possible not only to completely define the prior, but also to compute averages under this prior (e.g., to compute a model's evidence). Given that the number of all undirected tree structures over n variables is n^{n-2}, this result (Meila & Jaakkola, 2000) is quite surprising.

4.3 Decomposable priors for tree parameters

The decomposable prior for parameters that we introduce is a Dirichlet prior (Heckerman et al., 1995). The Dirichlet distribution is defined over the domain of probability vectors (\theta_1, ..., \theta_r) and has the form

    D(\theta_1, ..., \theta_r;\, N'_1, ..., N'_r) \propto \prod_{j=1}^{r} \theta_j^{N'_j - 1}.

The numbers N'_1, ..., N'_r > 0 that parametrize D can be interpreted as the sufficient statistics of a "fictitious dataset" of size N' = \sum_j N'_j. Therefore they are called fictitious counts, and N' represents the strength of the prior. To specify a prior for tree parameters, one must specify a Dirichlet distribution for each of the probability tables T_{v|pa(v)}, for each possible tree structure. This is achieved by means of a set of fictitious pairwise counts N'_{uv}(x_u, x_v), required to be consistent under marginalization (for all u, v, summing N'_{uv}(x_u, x_v) over x_v yields the same single-variable counts N'_u(x_u)). With these settings, the prior for the parameters of any tree that contains the directed edge from u to v is defined by the counts N'_{uv}. This representation of the prior is not only compact (order n^2 r_MAX^2 parameters) but it is also consistent: two different directed parametrizations of the same tree distribution receive the same prior. The assumptions allowing us to define this prior are explicated by Meila and Jaakkola (2000) and parallel the reasoning of Heckerman et al. (1995) for general Bayes nets.

Denote by \hat{P} the empirical distribution obtained from a dataset of size N and by P' the distribution defined by the fictitious counts. Then, by a property of the Dirichlet distribution (Heckerman et al., 1995), it follows that learning a MAP tree is equivalent to learning an ML tree for the weighted combination of the two "datasets":

    \tilde{P} = \frac{N \hat{P} + N' P'}{N + N'}.        (10)
Consequently, the parameters of the optimal tree will be the corresponding pairwise marginals of \tilde{P}. For a mixture of trees, maximizing the posterior translates into replacing \hat{P} and N by P_k and \Gamma_k in equation (10) above. This implies that the M step of the EM algorithm, as well as the E step, is exact and tractable in the case of MAP estimation with decomposable priors.

Finally, note that the posteriors Pr[Q | D] for models with different m are defined up to a constant that depends on m. Hence, one cannot compare posteriors of MTs with different numbers of mixture components. In the experiments that we present, we chose other performance criteria: validation set likelihood in the density estimation experiments and validation set classification accuracy in the classification tasks.

5. Experiments

This section describes the experiments that were run in order to assess the promise of the MT model. The first experiments are structure identification experiments; they examine the ability of the MixTree algorithm to recover the original distribution when the data are generated by a mixture of trees. The next group of experiments studies the performance of the MT model as a density estimator; the data used in these experiments are not generated by mixtures of trees. Finally, we perform classification experiments, studying both the MT model and a single tree model. Comparisons are made with classifiers trained in both supervised and unsupervised mode. The section ends with a discussion of the single tree classifier and its feature selection properties.

In all of the experiments the training algorithm is initialized at random, independently of the data. Unless stated otherwise, the learning algorithm is run until convergence. Log-likelihoods are expressed in bits/example and therefore are sometimes called compression rates; the lower the value of the compression rate, the better the fit to the data.

In the experiments that involve small datasets we use the Bayesian methods that we discussed in Section 4 to impose a penalty on complex models. In order to regularize model structure we use a decomposable prior over tree edges with a common penalty beta > 0. To regularize model parameters we use a Dirichlet prior derived from the pairwise marginal distributions of the dataset. This approach is known as smoothing with the marginal (Friedman et al., 1997; Ney, Essen, & Kneser, 1994). In particular, we set the fictitious counts characterizing the Dirichlet prior for tree k by apportioning a fixed total smoothing coefficient equally between the variables and, between the components, in an amount that is inversely proportional to the number of data points each component accounts for. Intuitively, the effect of this operation is to make the trees more similar to each other, thereby reducing the effective model complexity.
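The smoothing with the marginal just described is the weighted combination of equation (10), applied per component with (P_k, Gamma_k) in place of (P_hat, N) and with the fictitious distribution taken to be a dataset-wide pairwise marginal; a minimal sketch follows (the variable names and numbers are assumptions for illustration, not the paper's code).

```python
import numpy as np

def smooth_with_marginal(P_k_uv, gamma_k, P_all_uv, alpha_k):
    """Equation (10) with (P_hat, N) replaced by (P_k, Gamma_k): the component's
    pairwise table is shrunk toward the marginal table of the whole dataset."""
    return (gamma_k * P_k_uv + alpha_k * P_all_uv) / (gamma_k + alpha_k)

P_k   = np.array([[0.70, 0.00], [0.00, 0.30]])   # one component's pairwise table (sharp)
P_all = np.array([[0.40, 0.10], [0.10, 0.40]])   # dataset-wide marginal table
print(smooth_with_marginal(P_k, 20.0, P_all, 4.0))
# Running the Chow-Liu step on tables smoothed this way yields the MAP tree.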
5.1 Structure identification

5.1.1 Random trees, large dataset

For the first structure identification experiment, we generated a mixture of m = 5 trees over 30 variables with each vertex having 4 values. The distribution of the choice variable as well as each tree's structure and parameters were sampled at random. The mixture was used to generate 30,000 data points that were used as a training set for the MixTree algorithm. The initial model had m = 5 components but otherwise was random. We compared the structure of the learned model with the generative model and computed the likelihoods of both the learned and the original model on a test dataset consisting of 1,000 points.

The algorithm was quite successful at identifying the original trees: out of 10 trials, the algorithm failed to identify correctly only 1 tree in 1 trial. Moreover, this result can be accounted for by sampling noise; the tree that wasn't identified had a mixture coefficient of only 0.002. The difference between the log-likelihood of the samples under the generating model and the approximating model was 0.41 bits per example.

5.1.2 Random bars, small dataset

The "bars" problem is a benchmark structure learning problem for unsupervised learning algorithms in the neural network literature (Dayan & Zemel, 1995). The domain is the l x l square of binary variables depicted in Figure 8. The data are generated in the following manner: first, one flips a fair coin to decide whether to generate horizontal or vertical bars; this represents the hidden variable in our model. Then, each of the l bars is turned on independently (black in Figure 8) with a fixed probability. Finally, noise is added by flipping each bit of the image independently with a small probability. A learner is shown data generated by this process; the task of the learner is to discover the data generating mechanism.

Figure 8: Eight training examples for the bars learning task.

A mixture of trees model that approximates the true structure for low noise levels is shown in Figure 10. Note that any tree over the variables forming a bar is an equally good approximation. Thus, we will consider that the structure has been discovered when the model learns a mixture with m = 2, each component having l connected components, one for each bar. Additionally, we shall test the classification accuracy of the learned model by comparing the true value of the hidden variable (i.e., "horizontal" or "vertical") with the value estimated by the model for each data point in a test set.

Figure 9: The true structure of the probabilistic generative model for the bars data.

Figure 10: A mixture of trees that approximates the generative model for the bars problem. The interconnections between the variables in each "bar" are arbitrary.
Figure 11: Test set log-likelihood on the bars learning task for different values of the smoothing parameter and different m. The figure presents averages and standard deviations over 20 trials.

As seen in the first row, third column of Figure 8, some training set examples are ambiguous. We retained these ambiguous examples in the training set. The total training set size was N_train = 400. We trained models with m = 2, 3, ... and evaluated the models on a validation set of size 100 to choose the final values of m and of the smoothing parameter. Typical values for l in the literature are around 5; we chose l = 5 following Dayan and Zemel (1995). The other parameter values were 0.02 for the noise and N_test = 200. To obtain trees with several connected components we used a small edge penalty, beta = 5.

The validation-set log-likelihoods (in bits) for m = 2 and m = 3 are given in Figure 11. Clearly, m = 2 is the best model. For m = 2 we examined the resulting structures: in 19 out of 20 trials, structure recovery was perfect. Interestingly, this result held for the whole range of the smoothing parameter, not simply at the cross-validated value. By way of comparison, Dayan and Zemel (1995) examined two training methods, and the structure was recovered in 27 and respectively 69 cases out of 100.

The ability of the learned representation to categorize new examples as coming from one group or the other is referred to as classification performance and is shown in Table 1. The result reported is obtained on a separate test set for the final, cross-validated value of the smoothing parameter. Note that, due to the presence of ambiguous examples, no model can achieve perfect classification; the probability of an ambiguous example yields a minimum achievable error rate of about 0.125. Comparing this lower bound with the value in the corresponding column of Table 1 shows that the model performs quite well, even when trained on ambiguous examples.

Table 1: Results on the bars learning task.

  Test set                        ambiguous    unambiguous
  l_test [bits/data point]        9.82         13.67
  Class accuracy                  0.852        0.951

To further support this conclusion, a second test set of size 200 was generated, this time including only non-ambiguous examples. The classification performance, shown in the corresponding section of Table 1, rose to 0.95. The table also shows the likelihood of the test data evaluated on the learned model. For the first, "ambiguous" test set, this is 9.82 bits/data point, 1.67 bits away from the true model likelihood of 8.15 bits/data point. For the "non-ambiguous" test set, the compression rate is significantly worse, which is not surprising given that the distribution of the test set is now different from the distribution the model was trained on.

5.2 Density estimation experiments

In this section we present the results of three experiments that study the mixture of trees in the density estimation setting.

5.2.1 Digits and digit pairs images

Our first density estimation experiment involved a subset of binary vector representations of handwritten digits. The datasets consist of normalized and quantized 8x8 binary images of handwritten digits made available by the US Postal Service Office for Advanced Technology.

Figure 12: An example of a digit pair.

Table 2: Average log-likelihood (bits per digit) for the single digit (Digits) and double digit (Pairs) datasets. Results are averaged over 3 runs.
  m      Digits    Pairs
  16     34.72     79.25
  32     34.48     78.99
  64     34.84     79.70
  128    34.88     81.26

For each m, the degree of smoothing was chosen on the validation set, and using it we calculated the average log-likelihood over the test set (in bits per example). The averages (over 3 runs) are shown in Table 2. In Figure 13 we compare our results (for m = 32) with the results published by Frey et al. (1996). The algorithms plotted in the figure are (1) the completely factored or "Base rate" (BR) model, which assumes that every variable is independent of all the others, (2) the mixture of factorial distributions (MF), (3) the UNIX "gzip" compression program, (4) the Helmholtz Machine trained by the wake-sleep algorithm (HWS) (Frey et al., 1996), (5) the same Helmholtz Machine where a mean field approximation was used for training (HMF), (6) a fully observed and fully connected sigmoid belief network (FV), and (7) the mixture of trees (MT) model.

Figure 13: Average log-likelihoods (bits per digit) for the single digit (a) and double digit (b) datasets. Notice the difference in scale between the two figures.

As shown in Figure 13, the MT model yields the best density model for the simple digits and the second-best model for pairs of digits. A comparison of particular interest is between the MT model and the mixture of factorials (MF) model. In spite of the structural similarities of these models, the MT model performs better than the MF model, indicating that there is structure in the data that is exploited by the mixture of spanning trees but is not captured by a mixture of independent-variable models. Comparing the values of the average likelihood in the MT model for digits and pairs, we see that the second is more than twice the first. This suggests that our model (and the MF model) is able to perform good compression of the digit data but is unable to discover the independence in the double digit set.

5.2.2 The ALARM network

Our second set of density estimation experiments features the ALARM network as the data generating mechanism (Heckerman et al., 1995; Cheng, Bell, & Liu, 1997). This Bayesian network was constructed from expert knowledge as a medical diagnostic alarm message system for patient monitoring. The domain has 37 discrete variables taking between 2 and 4 values, connected by 46 directed arcs. This network is not a tree or a mixture of trees, but the topology of the graph is sparse, suggesting the possibility of approximating the dependency structure by a mixture of trees with a small number of components m.

We generated a training set having N_train = 10,000 data points and a separate test set. On these sets we trained and compared the following methods: mixtures of trees (MT), mixtures of factorial (MF) distributions, the true model, and "gzip." For MT and MF the model order m and the degree of smoothing were selected by cross validation on randomly selected subsets of the training set.

Table 3: Density estimation results for the mixtures of trees and other models on the ALARM dataset. Training set size N_train = 10,000. Average and standard deviation over 20 trials.

  Model                               Train likelihood       Test likelihood
                                      [bits/data point]      [bits/data point]
  ALARM net                           13.148                 13.264
  Mixture of trees, m = 18            13.51 +/- 0.04         14.55
  Mixture of factorials, m = 28       17.11 +/- 0.12         17.64
  Base rate                           30.99                  31.17
  gzip                                40.345                 41.260

The results are presented in Table 3, where we see that the MT model outperforms the MF model as well as gzip and the base rate model. To examine the sensitivity of the algorithms to the size of the dataset, we ran the same experiment with a training set of size 1,000. The results are presented in Table 4. Again, the MT model is the closest to the true model. Notice that the degradation in performance for the mixture of trees is relatively mild (about 1 bit), whereas the model complexity is reduced significantly. This indicates the important role played by the tree structures in fitting the data and motivates the advantage of the mixture of trees over the mixture of factorials for this dataset.
Table 4: Density estimation results for the mixtures of trees and other models on a dataset of size 1,000 generated from the ALARM network. Average and standard deviation over 20 trials. The second model parameter, where given, is a smoothing coefficient.

  Model                                              Train likelihood      Test likelihood
                                                     [bits/data point]     [bits/data point]
  ALARM net                                          13.167                13.264
  Mixture of trees, smoothing = 50                   14.56 +/- 0.16        15.51
  Mixture of factorials, m = 12, smoothing = 100     18.20 +/- 0.37        19.99
  Base rate                                          31.23                 31.18
  gzip                                               45.960                46.072

Table 5: Density estimation results for the mixtures of trees and other models on the FACES dataset. Average and standard deviation over 10 trials.

  Model                                              Train likelihood      Test likelihood
                                                     [bits/data point]     [bits/data point]
  Mixture of trees, m = 10                           52.77 +/- 0.33        56.29
  Mixture of factorials, m = 24, smoothing = 100     56.34 +/- 0.48        64.41
  Base rate                                          75.84                 74.27
  gzip                                               --                    103.51

5.2.3 The FACES dataset

For the third density estimation experiment, we used a subset of 576 images from a normalized face images dataset (Philips, Moon, Rauss, & Rizvi, 1997). These images were downsampled to 48 variables (pixels) and 5 gray levels. We divided the data randomly into N_train = 500 and N_test = 76 examples; of the 500 training examples, 50 were left out as a validation set and used to select m and the degree of smoothing for the MT and MF models. The results in Table 5 show the mixture of trees as the clear winner. Moreover, the MT achieves this performance with almost 5 times fewer parameters than the second-best model, the mixture of 24 factorial distributions.

Note that an essential ingredient of the success of the MT, both here and in the digits experiments, is that the data are "normalized", i.e., a pixel/variable corresponds approximately to the same location on the underlying digit or face. We do not expect MTs to perform well on randomly chosen image patches.

5.3 Classification with mixtures of trees

5.3.1 Using a mixture of trees as a classifier

A density estimator can be turned into a classifier in two ways, both of them being essentially likelihood ratio methods. Denote the class variable by c and the set of input variables by V. In the first method, adopted in our classification experiments, an MT model Q is trained on the domain consisting of c together with V, treating the class variable like any other variable and pooling all the training data together. In the testing phase, a new instance x_V is classified by picking the most likely value of the class variable given the settings of the other variables:

    c(x_V) = \arg\max_{x_c} Q(x_c, x_V).

Similarly, for the MF classifier (termed "D-SIDE" by Kontkanen et al., 1996), Q above is an MF trained on the same joint domain.

The second method calls for partitioning the training set according to the values of the class variable and for training a tree density estimator on each partition. This is equivalent to training a mixture of trees with observed choice variable, the choice variable being the class (Chow & Liu, 1968; Friedman et al., 1997). In particular, if the trees are forced to have the same structure we obtain the Tree Augmented Naive Bayes (TANB) classifier of Friedman et al. (1997). In either case one turns to Bayes' formula,

    c(x_V) = \arg\max_{k} \Pr[c = k]\; T_k(x_V),

to classify a new instance. The analog of the MF classifier in this setting is the naive Bayes classifier.
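The first method amounts to one extra sweep over the values of the class variable at test time; the fragment below (an illustrative assumption, reusing the mixture-evaluation sketch given earlier) shows the argmax in code.

```python
import math

def classify(class_values, log_joint, x_inputs):
    """First method of Section 5.3.1: pick argmax_c Q(c, x_V).

    log_joint(c, x_inputs) should return log Q(c, x_V) under the trained model,
    e.g. the log_mixture_prob sketch above applied to the joint domain.
    """
    scores = {c: log_joint(c, x_inputs) for c in class_values}
    return max(scores, key=scores.get)

# toy model over (class, one binary input): Q(c, x) given as a table
table = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.45}
log_joint = lambda c, x: math.log(table[(c, x)])
print(classify([0, 1], log_joint, 1))   # -> 1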
5.3.2 The AUSTRALIAN dataset

This dataset has 690 examples, each consisting of 14 attributes and a binary class variable (Blake & Merz, 1998). In the following we replicated the experimental procedure of Kontkanen et al. (1996) and Michie et al. (1994) as closely as possible. The test and training set sizes were 70 and 620 respectively. For each value of m we ran our algorithm for a fixed number of epochs on the training set and then recorded the performance on the test set. This was repeated 20 times for each m, each time with a random start and with a random split between the test and the training set. Because of the small dataset size we used edge pruning with a positive edge penalty. The best performance of the mixtures of trees is compared to the published results of Kontkanen et al. (1996) and Michie et al. (1994) for the same dataset in Table 6.

Table 6: Performance comparison between the MT model and other classification methods on the AUSTRALIAN dataset (Michie et al., 1994). The results for mixtures of factorial distributions are those reported by Kontkanen et al. (1996).

  Method                                % correct    Method                         % correct
  Mixture of trees, m = 20              --           Backprop                       84.6
  Mixture of factorial distributions    87.2         C4.5                           84.6
  Cal5 (decision tree)                  86.9         SMART                          84.2
  ITrule                                86.3         Bayes Trees                    82.9
  Logistic discrimination               85.9         K-nearest neighbor, k = 1      81.9
  Linear discrimination                 85.9         AC2                            81.9
  DIPOL92                               85.9         NewID                          81.9
  Radial basis functions                85.5         LVQ                            80.3
  CART                                  85.5         ALLOC80                        79.9
  CASTLE                                85.2         CN2                            79.6
  Naive Bayes                           84.9         Quadratic discrimination       79.3
  IndCART                               84.8         Flexible Bayes                 78.3

5.3.3 The AGARICUS-LEPIOTA dataset

The AGARICUS-LEPIOTA data (Blake & Merz, 1998) comprise 8,124 examples, each specifying the 22 discrete attributes of a species of mushroom in the Agaricus and Lepiota families and classifying it as edible or poisonous. The arities of the variables range from 2 to 12. We created a test set of N_test = 1,124 examples and a training set of N_train = 7,000 examples. Of the latter, 800 examples were kept aside to select m and the rest were used for training. No smoothing was used. The classification results on the test set are presented in Figure 14(a). As the figure suggests, this is a relatively easy classification problem, where seeing enough examples guarantees perfect performance (achieved by the TANB). The MT (with m = 12) achieves nearly optimal performance, making one mistake in one of the 5 trials. The MF and naive Bayes models follow about 0.5% behind.

Figure 14: Classification results for the mixtures of trees and other models: (a) on the AGARICUS-LEPIOTA dataset (the MT has m = 12, and the MF has m = 30); (b) on the NURSERY dataset (the MT has m = 30, the MF has m = 70). TANB and NB are the tree augmented naive Bayes and the naive Bayes classifiers respectively. The plots show the average and standard deviation of the test set error rate over 5 trials.

5.3.4 The NURSERY dataset

This dataset contains 12,958 entries (Blake & Merz, 1998), consisting of 8 discrete attributes and one class variable taking 4 values. (The original dataset contains another two data points which correspond to a fifth class value; we eliminated those from the data we used.) The data were randomly separated into a training set of size N_train = 11,000 and a test set of size N_test = 1,958. In the case of MTs and MFs the former dataset was further partitioned into 9,000 examples used for training the candidate models and 2,000 examples used to select the optimal m. The TANB and naive Bayes models were trained on all the 11,000 examples. No smoothing was used since the training set was large. The classification results are shown in Figure 14(b).

5.3.5 The SPLICE dataset: Classification

We also studied the classification performance of the MT model in the domain of DNA SPLICE-junctions.

Figure 15: Comparison of classification performance of the MT and other models on the SPLICE dataset when N_train = 2,000 and N_test = 1,175, for several values of the smoothing parameter. "Tree" represents a mixture of trees with m = 1; MT is a mixture of trees with m = 3. KBNN is the knowledge-based neural net, NN is a neural net.

Figure 16: Comparison of classification performance of the mixture of trees and other models trained on small subsets of the SPLICE dataset (N_train = 100 and N_train = 200, N_test = 1,575). The models tested by DELVE are, from left to right: 1-nearest neighbor, CART, HME (hierarchical mixture of experts)-ensemble learning, HME-early stopping, HME-grown, K-nearest neighbors, LLS (linear least squares), LLS-ensemble learning, ME (mixture of experts)-ensemble learning, ME-early stopping. TANB is the Tree Augmented Naive Bayes classifier, NB is the Naive Bayes classifier, and Tree is the single tree classifier.
The domain consists of 60 variables, representing a sequence of DNA bases, and an additional class variable (Rasmussen et al., 1996). The task is to determine if the middle of the sequence is a splice junction and what its type is. Splice junctions are of two types: exon-intron (EI) represents the end of an exon and the beginning of an intron, whereas intron-exon (IE) is the place where the intron ends and the next exon, or coding section, begins. Hence, the class variable can take 3 values (EI, IE or no junction) and the other variables take 4 values corresponding to the 4 possible DNA bases (C, A, G, T). The dataset consists of 3,175 labeled examples. (We eliminated 15 examples from the original dataset that had ambiguous inputs; Noordewier, Towell, & Shavlik, 1991; Rasmussen et al., 1996.)

We ran two series of experiments comparing the MT model with competing models. In the first series of experiments, we compared to the results of Noordewier et al. (1991), who used multilayer neural networks and knowledge-based neural networks for the same task. We replicated these authors' choice of training set size (2,000) and test set size (1,175) and sampled new training/test sets for each trial. We constructed trees (m = 1) and mixtures of trees (m = 3). In fitting the mixture, we used an early-stopping procedure in which N_valid = 300 examples were separated out of the training set and training was stopped when the likelihood on these examples stopped increasing. The results, averaged over 20 trials, are presented in Figure 15 for a variety of values of the smoothing parameter. It can be seen that the single tree and the MT model perform similarly, with the single tree showing an insignificantly better classification accuracy. Note that in this situation smoothing does not improve performance; this is not unexpected since the dataset is relatively large. With the exception of the "oversmoothed" MT model (smoothing = 100), all the single tree or MT models outperform the other models tested on this problem. Note that whereas the tree models contain no prior knowledge about the domain, the other two models do: the neural network model is trained in supervised mode, optimizing for class accuracy, and the KBNN includes detailed domain knowledge.

Based on the strong showing of the single tree model on the SPLICE task, we pursued a second series of experiments in which we compare the tree model with a larger collection of methods from the DELVE repository (Rasmussen et al., 1996). The DELVE benchmark uses subsets of the SPLICE database with 100 and 200 examples for training. Testing is done on 1,500 examples in all cases. Figure 16 presents the results for the algorithms tested by DELVE as well as the single trees with different degrees of smoothing. We also show results for naive Bayes (NB) and Tree Augmented Naive Bayes (TANB) models (Friedman et al., 1997). The results from DELVE represent averages over 20 runs with different random initializations on the same training and testing sets; for trees, NB and TANB, whose outputs are not initialization-dependent, we averaged the performance of the models learned for 20 different splits of the union of the training and testing set. No early stopping or cross-validation was used in this case.

The results show that the single tree is quite successful in this domain, yielding an error rate that is less than half of the error rate of the best model tested in DELVE. Moreover, the average error of a single tree trained on 200 examples is 6.9%, which is only 2.3% greater than the average error of the tree trained on 2,000 examples. We attempt to explain this striking preservation of accuracy for small training sets in our discussion of feature selection in Section 5.3.7.
Figure 17: Cumulative adjacency matrix of 20 trees fit to 2,000 examples of the SPLICE dataset with no smoothing. The size of the square at coordinates i, j represents the number of trees (out of 20) that have an edge between variables i and j; no square means that this number is 0. Only the lower half of the matrix is shown. The class is variable 0. The group of squares at the bottom of the figure shows the variables that are connected directly to the class; only these variables are relevant for classification. Not surprisingly, they are all located in the vicinity of the splice junction (which is between positions 30 and 31). The subdiagonal "chain" shows that the rest of the variables are connected to their immediate neighbors; its lower-left end is edge 2-1 and its upper-right end is edge 60-59.

Figure 18: The encoding of the IE and EI splice junctions as discovered by the tree learning algorithm, compared to the ones given in Watson et al., "Molecular Biology of the Gene" (Watson et al., 1987). Positions in the sequence are consistent with our variable numbering: thus the splice junction is situated between positions 30 and 31. Symbols in boldface indicate bases that are present with probability almost 1, other A, C, G, T symbols indicate bases or groups of bases that have high probability (above 0.8), and a "-" indicates that the position can be occupied by any base with non-negligible probability.

The Naive Bayes model exhibits behavior that is very similar to the tree model and only slightly less accurate. However, augmenting the Naive Bayes model to a TANB significantly hurts the classification performance.

5.3.6 The SPLICE dataset: Structure identification

Figure 17 presents a summary of the tree structures learned from the N = 2,000 dataset in the form of a cumulated adjacency matrix: the adjacency matrices of the 20 graph structures obtained in the experiment have been summed. The size of the black square at coordinates i, j in the figure is proportional to the value of the (i, j)-th element of the cumulated adjacency matrix; no square means that the respective element is 0. Since the adjacency matrix is symmetric, only half of the matrix is shown.

From Figure 17 we see that the tree structure is very stable over the 20 trials. Variable 0 represents the class variable; the hypothetical splice junction is situated between variables 30 and 31. The figure shows that the splice junction (variable 0) depends only on DNA sites that are in its vicinity. The sites that are remote from the splice junction are dependent on their immediate neighbors. Moreover, examining the tree parameters for the edges adjacent to the class variable, we observe that these variables build certain patterns when a splice junction is present, but are random and almost uniformly distributed in the absence of a splice junction. The patterns extracted from the learned trees are shown in Figure 18. The same figure displays the "true" encodings of the IE and EI junctions as given by Watson et al. (1987). The match between the two encodings is almost perfect.

Figure 19: The cumulated adjacency matrix for 20 trees over the original set of variables (0-60) augmented with 60 "noisy" variables (61-120) that are independent of the original ones. The matrix shows that the tree structure over the original variables is preserved.
Thus, we can conclude that for this domain the tree model not only provides a good classifier but also discovers a model of the physical reality underlying the data. Note that the algorithm arrives at this result in the absence of prior knowledge: (1) it does not know which variable is the class variable, and (2) it does not know that the variables are in a sequence (i.e., the same result would be obtained if the indices of the variables were scrambled).

5.3.7 The SPLICE dataset: Feature selection

Let us examine the single tree classifier that was used for the SPLICE dataset more closely. According to the Markov properties of the tree distribution, the probability of the class variable depends only on its neighbors, that is, on the variables to which the class variable is connected by tree edges. Hence, a tree acts as an implicit variable selector for classification: only the variables adjacent to the queried variable (this set of variables is called the Markov blanket; Pearl, 1988) are relevant for determining its probability distribution.

This property also explains the observed preservation of the accuracy of the tree classifier when the size of the training set decreases: out of the 60 variables, only 18 are relevant to the class; moreover, the dependence is parametrized as 18 independent pairwise probability tables T_{class,v}. Such parameters can be fit accurately from relatively few examples. Hence, as long as the training set contains enough data to establish the correct dependency structure, the classification accuracy will degrade slowly with the decrease in the size of the dataset.

This explanation helps us to understand the superiority of the tree classifier over the models in DELVE: only a small subset of variables are relevant for classification, and the tree finds them correctly. A classifier that is not able to perform feature selection reasonably well will be hindered by the remaining irrelevant variables, especially if the training set is small. For a given Markov blanket, the tree classifies in the same way as a naive Bayes model with the Markov blanket variables as inputs. Note also that the naive Bayes model itself has a built-in feature selector: if one of the input variables is not relevant to the class, its class-conditional distributions will be roughly the same for all values of the class. Consequently, in the posterior that serves for classification, the factors corresponding to that variable will cancel and thus will have little influence on the classification. This may explain why the naive Bayes model also performs well on the SPLICE classification task. Notice, however, that the variable selection mechanisms implemented by the tree classifier and the naive Bayes classifier are not the same.

To verify that the single tree classifier indeed acts like a feature selector, we performed the following experiment, again using the SPLICE data. We augmented the variable set with another 60 variables, each taking 4 values with randomly and independently assigned probabilities. The rest of the experimental conditions (training set, test set and number of random restarts) were identical to the first SPLICE experiment. We fit a set of models with m = 1, a small edge penalty and no smoothing. The structure of the new models, in the form of a cumulative adjacency matrix, is shown in Figure 19. We see that the structure over the original 61 variables is unchanged and stable; the 60 noise variables connect in a random, uniform pattern to the original variables and among each other. As expected after examining the structure, the classification performance of the new trees is not affected by the newly introduced variables: in fact the average accuracy of the trees over 121 variables is 95.8%, 0.1% higher than the accuracy of the original trees.
6. The accelerated tree learning algorithm

We have argued that the mixture of trees approach has significant advantages over general Bayesian networks in terms of its algorithmic complexity. In particular, the M step of the EM algorithm for mixtures of trees is the Chow-Liu algorithm, which scales quadratically in the number of variables $n$ and linearly in the size of the dataset $N$. Given that the E step is linear in $n$ and $N$ for mixtures of trees, we have a situation in which each pass of the EM algorithm is quadratic in $n$ and linear in $N$. Although this time complexity recommends the MT approach for large-scale problems, the quadratic scaling in $n$ becomes problematic for particularly large problems. In this section we propose a method for reducing the time complexity of the MT learning algorithm and demonstrate empirically the large performance gains that we are able to obtain with this method.

As a concrete example of the kind of problem that we have in mind, consider the clustering or classification of documents in information retrieval. Here the variables are words from a vocabulary, and the data points are documents. A document is represented as a binary vector with a component equal to 1 for each word that is present in the document and equal to 0 for each word that is not present. In a typical application both the number of documents and the vocabulary size are large, often in the tens of thousands. Given such numbers, fitting a single tree to the data requires on the order of $n^2 N$ counting operations. Note, however, that this domain is characterized by a certain sparseness: each document contains only a relatively small number of words, and thus most of the components of its binary vector are 0.

In this section we show how to take advantage of data sparseness to accelerate the Chow-Liu algorithm. We show that in the sparse regime we can often rank order mutual information values without actually computing these values. We also show how to speed up the computation of the sufficient statistics by exploiting sparseness. Combining these two ideas yields an algorithm -- the acCL (accelerated Chow and Liu) algorithm -- that provides significant performance gains in both running time and memory.

6.1 The acCL algorithm

We first present the acCL algorithm for the case of binary variables, presenting the extension to general discrete variables in Section 6.2. For binary variables we will say that a variable is "on" when it takes value 1; otherwise we say that it is "off". Without loss of generality we assume that a variable is off more times than it is on in the given dataset. The target distribution $P$ is assumed to be derived from a set of observations of size $N$.

Let us denote by $N_v$ the number of times variable $v$ is on in the dataset and by $N_{uv}$ the number of times variables $u$ and $v$ are simultaneously on. We call each of the latter events a co-occurrence of $u$ and $v$. The marginals $P_v$ and $P_{uv}$ are given by

$P_v(1) = \frac{N_v}{N}, \qquad P_v(0) = \frac{N - N_v}{N},$

$P_{uv}(1,1) = \frac{N_{uv}}{N}, \quad P_{uv}(1,0) = \frac{N_u - N_{uv}}{N}, \quad P_{uv}(0,1) = \frac{N_v - N_{uv}}{N}, \quad P_{uv}(0,0) = \frac{N - N_u - N_v + N_{uv}}{N}.$
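As a small illustration of these counts (and, anticipating Section 6.1.2, of the list representation of sparse binary data), the sketch below computes $N_v$ and $N_{uv}$ from toy "documents". It is illustrative only, not code from the paper.

```python
from collections import Counter
from itertools import combinations

def sparse_lists_and_counts(binary_vectors):
    """Sparse list representation plus the counts N_v and N_uv.

    Each observation is stored as the list of variables that are 'on';
    N_v counts how often v is on, N_uv how often u and v are on together
    (a 'co-occurrence').  For sparse data only s items per row are touched.
    """
    xlists = [[v for v, bit in enumerate(row) if bit] for row in binary_vectors]
    Nv, Nuv = Counter(), Counter()
    for xlist in xlists:
        Nv.update(xlist)
        Nuv.update(combinations(xlist, 2))   # pairs (u, v) with u < v
    return xlists, Nv, Nuv

# Example: three "documents" over a 6-word vocabulary.
docs = [[0, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0, 0]]
xlists, Nv, Nuv = sparse_lists_and_counts(docs)
print(xlists)             # [[1, 4], [0], [1, 2]]
print(Nv[1], Nuv[1, 4])   # 2 1  (here the sparseness s = 2)
```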
a&JordanAlltheinformationaboutthatisnecessaryforttingthetreeissummarizedinthecountsand;u;v;:::;n,andfromnowonwewillconsidertoberepresentedbythesecounts.(Itisaneasyextensiontohandlenon-integerdata,suchaswhenthedatapointsare\weighted"byrealnumbers).WenowdenethenotionofsparsenessthatmotivatestheacCLalgorithm.Letusdenotebythenumberofvariablesthatareoninobservation,where0Dene,thedatasparsenessmaxIf,forexample,thedataaredocumentsandthevariablesrepresentwordsfromavocabulary,thenrepresentsthemaximumnumberofdistinctwordsinadocument.Thetimeandmemoryrequirementsofthealgorithmthatwedescribedependonthesparseness;thelowerthesparseness,themoreecientthealgorithm.Ouralgorithmwillrealizeitslargestperformancegainswhenn;NRecallthattheChowLiualgorithmgreedilyaddsedgestoagraphbychoosingtheedgethatcurrentlyhasthemaximalvalueofmutualinformation.Thealgorithmthatwedescribeinvolvesanecientwaytorankordermutualinformation.Therearetwokeyaspectstothealgorithm:(1)howtocomparemutualinformationbetweennon-co-occurringvariables,and(2)howtocomputeco-occurrencesinalistrepresentation.6.1.1Comparingmutualinformationbetweennon-co-occurringvariablesLetusfocusonthepairsu;vthatdonotco-occur,i.e.,forwhich=0.Forsuchapair,themutualinformationisafunctionofand.Letusanalyzethevariationofthemutualinformationwithrespecttobytakingthecorrespondingpartialderivative: =log 0(11)Thisresultimpliesthatforagivenvariableandanytwovariablesv;vforwhich=0wehave:impliesthatThisobservationallowsustopartiallysortthemutualinformationvaluesfornon-co-occurringpairsu;v,withoutcomputingthem.First,wehavetosortallthevariablesbytheirnumberofoccurrences.Westoretheresultinalist.Thisgivesatotalordering"forthevariablesinpreceedsinlistForeach,wedenethelistofvariablesfollowingintheordering\"andnotco-occurringwithit:V;vu;NThislistissortedbydecreasingandtherefore,implicitly,bydecreasing.Sincethedataaresparse,mostpairsofvariablesdonotco-occur.Therefore,bycreatingthelists),alargenumberofvaluesofthemutualinformationarepartiallysorted.Beforeshowinghowtousethisconstruction,letusexamineanecientwayofcomputingthecountswhenthedataaresparse. 
Figure 20: The data structure that supplies the next candidate edge. Vertically on the left are the variables, sorted by decreasing $N_u$. For a given $u$, there are two lists: the co-occurrence list, with records $(v, I_{uv}, N_{uv})$ sorted by decreasing $I_{uv}$, and the virtual list $\bar{V}_u$, with records $(v, N_v)$ sorted by decreasing $N_v$. The maximum of the two first elements of these lists is inserted into a Fibonacci heap. The overall maximum of $I_{uv}$ can then be extracted as the maximum of the Fibonacci heap.

6.1.2 Computing co-occurrences in a list representation

Let $x^1, \ldots, x^N$ be a set of observations over $n$ binary variables. If the data are sparse, it is efficient to represent each observation as a list of the variables that are on in the respective observation. Thus, data point $x^i$, $i = 1, \ldots, N$, will be represented by the list $xlist^i$. The space required by the lists is no more than $sN$, and is much smaller than the $nN$ space required by the binary vector representation of the same data. Note, moreover, that the total number of co-occurrence events in the dataset is at most $N s(s-1)/2$.

For the variables $v$ that co-occur with $u$, a co-occurrence list $C_u$ is created for each $u$. Each $C_u$ contains records $(v, I_{uv}, N_{uv})$ with $N_{uv} > 0$ and is sorted by decreasing $I_{uv}$. To represent the lists $\bar{V}_u$ (which contain many elements) we use their "complements"

$V_u \;=\; \{ v \mid u \succ v,\; N_{uv} > 0 \}.$

It can be shown (Meilă-Predoviciu, 1999) that the computation of the co-occurrence counts and the construction of the lists $C_u$ and $V_u$ for all $u$ take an amount of time proportional to the number of co-occurrences, up to a logarithmic factor (of order $\log(N/n)$). Comparing this value with $O(n^2 N)$, which is the time to compute all the mutual informations in the Chow-Liu algorithm, we see that the present method replaces the dimension of the domain $n$ by the sparseness $s$. The memory requirements for the lists are also at most proportional to the number of co-occurrences.

Algorithm acCL
Input: variable set $V$ of size $n$; dataset $\{ xlist^i,\; i = 1, \ldots, N \}$. Uses: procedure Kruskal (maximum weight spanning tree).
1. compute $N_v$ for $v \in V$; create $L$, the list of variables in $V$ sorted by decreasing $N_v$
2. compute the co-occurrences $N_{uv}$; create the lists $C_u$ and $V_u$ for all $u$
3. create vheap: for each $u \in V$, find $v = \arg\max I_{uv}$ and insert $(I_{uv}; u, v)$ in vheap
4. $E \leftarrow$ Kruskal(vheap); store $N_{uv}$ for the edges $(u, v)$ added to $E$
5. for $(u, v) \in E$, compute the probability table $P_{uv}$ using $N_u$, $N_v$, $N_{uv}$ and $N$
Output: the tree $T = (E, \{P_{uv}\})$

Figure 21: The acCL algorithm.

6.1.3 The algorithm and its data structures

We have described efficient methods for computing the co-occurrences and for partially sorting the mutual information values. What we aim to create is a mechanism that outputs the edges $(u, v)$ in decreasing order of their mutual information. We set up this mechanism in the form of a Fibonacci heap (Fredman & Tarjan, 1987) called vheap that contains an element for each $u$, represented by the edge with the highest mutual information among the edges $(u, v)$, with $u \succ v$, that are not yet eliminated. The record in vheap is of the form $(I_{uv}; u, v)$, with $u \succ v$ and with $I_{uv}$ being the key used for sorting. Once the maximum is extracted, the used edge has to be replaced by the next largest (in terms of $I_{uv}$) edge in $u$'s lists. To perform this latter task we use the data structures shown in Figure 20. Kruskal's algorithm is now used to construct the desired spanning tree. Figure 21 summarizes the resulting algorithm.
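Step 4 of the algorithm is an ordinary maximum weight spanning tree construction. The sketch below (not code from the paper) shows a plain Kruskal implementation over explicitly sorted edges; it replaces the Fibonacci-heap edge supplier of Figure 20 with a simple sort, so it illustrates only the spanning-tree step, not the acCL accelerations.

```python
def kruskal_max_spanning_tree(n, weighted_edges):
    """Kruskal's algorithm for the maximum weight spanning tree.

    weighted_edges: iterable of (weight, u, v); in our setting the weights
    would be the mutual information values I_uv.
    Returns the list of accepted edges (u, v).
    """
    parent = list(range(n))

    def find(a):                        # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                    # u and v not yet connected: accept edge
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == n - 1:
                break
    return tree

# e.g. edges = [(I_uv, u, v) for all pairs u < v] computed from the counts
```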
Figure 22: The mean (full line), standard deviation and maximum (dotted line) of the number of steps $n_K$ in the Kruskal algorithm over 1000 runs, plotted against $n$, which ranges from 5 to 3000. The edge weights were sampled from a uniform distribution.

6.1.4 Running time

The algorithm requires: $O(sN)$ time for computing the counts $N_v$; $O(n \log n)$ for sorting the variables; time proportional to the number of co-occurrences, up to a $\log(N/n)$-type factor, for step 2; $O(n)$ for step 3; for the Kruskal algorithm, $n_K$ steps, where $n_K$ is the number of edges examined by the algorithm and each step costs an $O(\log n)$ extraction from vheap plus an occasional skip over elements of the lists $V_u$ when we extract variables from the virtual lists $\bar{V}_u$; and $O(n)$ for creating the $n - 1$ probability tables in step 5. Summing these terms gives an upper bound for the running time of the acCL algorithm; ignoring logarithmic factors, the bound, for constant $s$, is a polynomial of degree 1 in the three variables $n$, $N$ and $n_K$.

Because $n_K$ can range up to the total number of variable pairs, the worst case complexity of the acCL algorithm is quadratic in $n$. Empirically, however, as we show below, we find that the dependence of $n_K$ on $n$ is generally subquadratic. Moreover, random graph theory implies that if the distribution of the weight values is the same for all edges, then Kruskal's algorithm should take a number of steps roughly proportional to $n \log n$ (West, 1996). To verify this latter result, we conducted a set of Monte Carlo experiments in which we ran the Kruskal algorithm on sets of random weights over domains of dimension up to $n = 3000$. For each $n$, 1000 runs were performed. Figure 22 plots the average and maximum $n_K$ versus $n$ for these experiments. The curve for the average displays an essentially linear dependence on $n$.

6.1.5 Memory requirements

Beyond the storage for the data and the results, we need space proportional to the number of co-occurrences to store the co-occurrence counts and the lists $C_u$ and $V_u$, plus $O(n)$ space for $L$, vheap and the Kruskal algorithm. This is the additional space used by the acCL algorithm.

6.2 Discrete variables of arbitrary arity

We briefly describe the extension of the acCL algorithm to the case of discrete domains in which the variables can take more than two values. First we extend the definition of data sparseness: we assume that for each variable there exists a special value that appears with higher frequency than all the other values. This value will be denoted by 0, without loss of generality. For example, in a medical domain, the value 0 for a variable would represent the "normal" value, whereas the abnormal values of the variable would be designated by non-zero values. An "occurrence" for variable $v$ will be the event $v \neq 0$, and a "co-occurrence" of $u$ and $v$ means that $u$ and $v$ are both non-zero for the same data point. We define $|x^i|$ as the number of non-zero values in observation $x^i$. The sparseness $s$ is, as before, the maximum of $|x^i|$ over the dataset. To exploit the high frequency of the zero values we represent only the occurrences explicitly, thereby creating a compact and efficient data structure. We obtain performance gains by presorting the mutual information values for non-co-occurring variables.

6.2.1 Computing co-occurrences

As before, we avoid representing zero values explicitly by replacing each data point $x^i$ with the list $xlist^i$, where $xlist^i = \{ (v, x_v^i) \mid v \in V,\; x_v^i \neq 0 \}$. A co-occurrence is represented by the quadruple $(u, x_u, v, x_v)$ with $x_u, x_v \neq 0$. Instead of one co-occurrence count $N_{uv}$, we now have a two-way contingency table: each $N_{uv}^{ij}$ represents the number of data points where $x_u = i$ and $x_v = j$, with $i, j \neq 0$. Counting and storing the co-occurrences can be done in the same time as before and with a larger amount of memory (by a factor that depends on the number of values the variables can take), necessitated by the additional need to store the (non-zero) variable values.

6.2.2 Presorting mutual information values

Our goal is to presort the mutual information values $I_{uv}$ for all $v$ that do not co-occur with $u$. The following theorem shows that this can be done exactly as before.

Theorem. Let $u, v, w$ be discrete variables such that $v$ and $w$ do not co-occur with $u$ in a given dataset of size $N$. Let $N_v$ and $N_w$ be the number of data points for which $v \neq 0$ and $w \neq 0$ respectively, and let $I_{uv}$ and $I_{uw}$ be the respective empirical mutual information values based on the sample. Then

$N_v > N_w \;\Rightarrow\; I_{uv} \ge I_{uw},$

with equality only if $u$ is identically 0.
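As a quick numerical illustration of the theorem (a sketch with made-up data, not an experiment from the paper): the variables below are multivalued, $v$ and $w$ never co-occur with $u$, and $v$ is non-zero more often than $w$, so the empirical mutual information with $u$ is larger for $v$ than for $w$.

```python
import math
from collections import Counter

def mi_from_samples(xs, ys):
    """Empirical mutual information (natural log) of two aligned samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(c / n * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# u takes values {0,1,2}; v and w are non-zero only where u == 0, so neither
# ever co-occurs with u.  v is non-zero more often than w (N_v = 400 > N_w = 100).
u = [1] * 100 + [2] * 50 + [0] * 850
v = [0] * 150 + [3] * 400 + [0] * 450
w = [0] * 150 + [1] * 100 + [0] * 750
print(mi_from_samples(u, v), ">", mi_from_samples(u, w))
```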
The proof of the theorem is given in the Appendix. The implication of this theorem is that the acCL algorithm can be extended to variables taking more than two values by making only one (minor) modification: the replacement of the scalar counts $N_v$ and $N_{uv}$ by the vectors $(N_v^i,\; i \neq 0)$ and, respectively, the contingency tables $(N_{uv}^{ij},\; i, j \neq 0)$.

Figure 23: Running time for the acCL (full line) and Chow-Liu (dotted line) algorithms versus the number of vertices $n$, for different values of the sparseness $s$ ($s$ = 5, 10, 15 and 100).

Figure 24: Number of steps $n_K$ of the Kruskal algorithm versus the domain size $n$, measured for the acCL algorithm for different values of $s$ ($s$ = 5, 10, 15 and 100).

6.3 Experiments

In this section we report on the results of experiments that compare the speed of the acCL algorithm with the standard Chow-Liu method on artificial data. In our experiments the binary domain dimension $n$ varies from 50 to 1000. Each data point has a fixed number $s$ of variables taking value 1. The sparseness $s$ takes the values 5, 10, 15 and 100. The data were generated from an artificially constructed non-uniform, non-factored distribution. For each pair $(n, s)$ a set of 10,000 points was created.

For each dataset both the Chow-Liu algorithm and the acCL algorithm were used to fit a single tree distribution. The running times are plotted in Figure 23. The improvements of acCL over the standard version are spectacular: learning a tree on 1000 variables from 10,000 data points takes 4 hours with the standard algorithm and only 4 seconds with the accelerated version when the data are sparse ($s \le 15$). For the less sparse regime of $s = 100$ the acCL algorithm takes 2 minutes to complete, improving on the traditional algorithm by a factor of 120.

Note also that the running time of the accelerated algorithm appears to be nearly independent of the dimension $n$ of the domain. Recall, on the other hand, that the number of steps $n_K$ (Figure 24) grows with $n$. This implies that the bulk of the computation lies in the steps preceding the Kruskal algorithm proper: it is in computing the co-occurrences and organizing the data that most of the time is spent. Figure 23 also confirms that the running time of the traditional Chow-Liu algorithm grows quadratically with $n$ and is independent of $s$.

This concludes the presentation of the acCL algorithm. The method achieves its performance gains by exploiting characteristics of the data (sparseness) and of the problem (the weights represent mutual information) that are external to the maximum weight spanning tree algorithm proper. The resulting algorithm is quadratic in $n$ in the worst case but typically much faster, which represents a significant asymptotic improvement over the $O(n^2 N)$ cost of the traditional Chow and Liu algorithm. Moreover, if $s$ is large, the acCL algorithm (gracefully) degrades to the standard Chow-Liu algorithm. The algorithm extends to non-integer counts, hence it is directly applicable to mixtures of trees.

As we have seen empirically, a very significant part of the running time is spent computing co-occurrences. This prompts future work on learning statistical models over large domains that focuses on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm (Friedman, 1998; Friedman & Getoor, 1999) as well as A-D trees (Moore & Lee, 1998). The latter are closely related to our representation of the pairwise marginals by counts. In fact, our representation can be viewed as a "reduced" A-D tree that stores only pairwise statistics. Consequently, when an A-D tree representation has already been computed, it can be exploited in steps 1 and 2 of the acCL algorithm. Other versions of the acCL algorithm are discussed by Meilă-Predoviciu (1999).

7. Conclusions

We have presented the mixture of trees (MT), a probabilistic model in which joint probability distributions are represented as finite mixtures of tree distributions. Tree distributions have a number of virtues -- representational, computational and statistical -- but have limited expressive power. Bayesian and Markov networks achieve significantly greater expressive power
while retaining many of the representational virtues of trees, but incur significantly higher costs on the computational and statistical fronts. The mixture approach provides an alternative upgrade path. While Bayesian and Markov networks have no distinguished relationships between edges, and statistical model selection procedures for these networks generally involve additions and deletions of single edges, the MT model groups overlapping sets of edges into mixture components, and edges are added and removed via a maximum likelihood algorithm that is constrained to fit tree models in each mixture component. We have also seen that it is straightforward to develop Bayesian methods that allow finer control over the choice of edges and smooth the numerical parameterization of each of the component models.

Chow and Liu (1968) presented the basic maximum likelihood algorithm for fitting tree distributions that provides the M step of our EM algorithm, and also showed how to use ensembles of trees to solve classification problems, where each tree models the class-conditional density of one of the classes. This approach was pursued by Friedman et al. (1997, 1998), who emphasized the connection with the naive Bayes model and presented empirical results demonstrating the performance gains that could be obtained by enhancing naive Bayes to allow connectivity between the attributes. Our work is a further contribution to this general line of research -- we treat an ensemble of trees as a mixture distribution. The mixture approach provides additional flexibility in the classification domain, where the "choice variable" need not be the class label, and also allows the architecture to be applied to unsupervised learning problems.

The algorithms for learning and inference that we have presented have relatively benign scaling: inference is linear in the dimensionality $n$, and each step of the EM learning algorithm is quadratic in $n$. Such favorable time complexity is an important virtue of our tree-based approach. In particularly large problems, however, such as those that arise in information retrieval applications, quadratic complexity can become onerous. To allow the use of the MT model in such cases, we have developed the acCL algorithm, whereby exploiting data sparseness and paying attention to data structure issues significantly reduce the run time: we presented examples in which the speed-up obtained from the acCL algorithm was three orders of magnitude.

Are there other classes of graphical models whose structure can be learned efficiently from data? Consider the class of Bayesian networks for which the topological ordering of the variables is fixed and the number of parents of each node is bounded by a fixed constant. For this class the optimal model structure for a given target distribution can be found in polynomial time by a greedy algorithm. These models share with trees the property of being matroids (West, 1996). The matroid is the unique algebraic structure for which the "maximum weight" problem, in particular the maximum weight spanning tree problem, is solved optimally by a greedy algorithm. Graphical models that are matroids have efficient structure learning algorithms; it is an interesting open problem to find additional examples of such models.

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637).

Appendix A.

In this appendix we prove the following theorem from Section 6.2:

Theorem. Let $u, v, w$ be discrete variables such that $v$ and $w$ do not co-occur with $u$ in a given dataset of size $N$. Let $N_v$ and $N_w$ be the number of data points for which $v \neq 0$ and $w \neq 0$ respectively, and let $I_{uv}$ and $I_{uw}$ be the respective empirical mutual information values based on the sample. Then

$N_v > N_w \;\Rightarrow\; I_{uv} \ge I_{uw},$

with equality only if $u$ is identically 0.
Proof. We use the notation

$P_{u0} = P[u = 0], \qquad P_{v0} = P[v = 0] = 1 - \frac{N_v}{N}.$

These values represent the (empirical) probabilities of $u$ and $v$ taking the value 0. Entropies will be denoted by $H$. We aim to show that $I_{uv}$ is a monotonically decreasing function of $P_{v0}$, i.e. a monotonically increasing function of $N_v$.

We first note a "chain rule" expression for the entropy of a discrete variable. In particular, the entropy of any multivalued discrete variable $v$ can be decomposed in the following way:

$H_v \;=\; -P_{v0}\log P_{v0} - (1 - P_{v0})\log(1 - P_{v0}) + (1 - P_{v0})\,H_{v \mid v \neq 0}, \qquad (12)$

where $H_{v \mid v \neq 0}$ denotes the entropy of the distribution of $v$ conditioned on $v$ being non-zero. Note, moreover, that the mutual information of two non-co-occurring variables is $I_{uv} = H_u - H_{u \mid v}$. The second term, the conditional entropy of $u$ given $v$, is

$H_{u \mid v} \;=\; P_{v0}\, H_{u \mid v = 0},$

because $u$ is identically 0 whenever $v \neq 0$, so that $H_{u \mid v \neq 0} = 0$.

We now expand $H_{u \mid v = 0}$ using the decomposition in (12). Because $u$ and $v$ are never non-zero at the same time, all non-zero values of $u$ are paired with zero values of $v$. Consequently $P[u = i \mid u \neq 0, v = 0] = P[u = i \mid u \neq 0]$, and therefore

$H_{u \mid v = 0} \;=\; H\!\left(P[u = 0 \mid v = 0]\right) + \left(1 - P[u = 0 \mid v = 0]\right) H_{u \mid u \neq 0},$

where $H(p)$ denotes the entropy of a binary variable with probability $p$. The probability in question equals

$P[u = 0 \mid v = 0] \;=\; 1 - \frac{1 - P_{u0}}{P_{v0}}.$

Note that in order to obtain a non-negative probability in the above equation one needs $P_{u0} + P_{v0} \ge 1$, a condition that is always satisfied if $u$ and $v$ do not co-occur. Replacing the previous three equations in the formula of the mutual information, we get

$I_{uv} \;=\; -(1 - P_{u0})\log P_{v0} \;-\; P_{u0}\log P_{u0} \;+\; (P_{u0} + P_{v0} - 1)\log\frac{P_{u0} + P_{v0} - 1}{P_{v0}}.$

This expression, remarkably, depends only on $P_{u0}$ and $P_{v0}$. Taking its partial derivative with respect to $P_{v0}$ yields

$\frac{\partial I_{uv}}{\partial P_{v0}} \;=\; \log\frac{P_{u0} + P_{v0} - 1}{P_{v0}},$

a value that is always non-positive, independently of the distribution of $u$ over its non-zero values, and strictly negative unless $P_{u0} = 1$, i.e. unless $u$ is identically 0. This shows that the mutual information increases monotonically with the "occurrence frequency" of $v$, given by $1 - P_{v0}$. Note also that the above expression for the derivative is consistent with the result obtained for binary variables in (11).

References

Bishop, C. M. (1999). Latent variable models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cambridge, MA: MIT Press.

Blake, C., & Merz, C. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/mlearn/MLRepository.html.

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in AI (pp. 64-72). Morgan Kaufmann.

Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE Transactions on Knowledge and Data Engineering, 195-210.

Cheeseman, P., & Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In U. Fayyad, G. Piatesky-Shapiro, P. Smyth, & Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 153-180). AAAI Press.

Cheng, J., Bell, D. A., & Liu, W. (1997). Learning belief networks from data: An information theory based approach. In Proceedings of the Sixth ACM International Conference on Information and Knowledge Management.

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3), 462-467.

Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 309-347.

Cormen, T. H., Leiserson, C. E., & Rivest, R. R. (1990). Introduction to Algorithms. Cambridge, MA: MIT Press.
a&JordanCowell,R.G.,Dawid,A.P.,Lauritzen,S.L.,&Spiegelhalter,D.J.(1999).ProbabilisticNetworksandExpertSystems.NewYork,NY:Springer.Dayan,P.,&Zemel,R.S.(1995).Competitionandmultiplecausemodels.NeuralCom-(3),565{579.Dempster,A.P.,Laird,N.M.,&Rubin,D.B.(1977).MaximumlikelihoodfromincompletedataviatheEMalgorithm.JournaloftheRoyalStatisticalSociety,B,1{38.Fredman,M.L.,&Tarjan,R.E.(1987).Fibonacciheapsandtheirusesinimprovednetworkoptimizationalgorithms.JournaloftheAssociationforComputingMachineryFrey,B.J.,Hinton,G.E.,&Dayan,P.(1996).Doesthewake-sleepalgorithmproducegooddensityestimators?InD.Touretzky,M.Mozer,&M.Hasselmo(Eds.),NeuralInformationProcessingSystems(pp.661{667).Cambridge,MA:MITPress.Friedman,N.(1998).TheBayesianstructuralEMalgorithm.InProceedingsofthe14thConferenceonUncertaintyinAI(pp.129{138).SanFrancisco,CA:MorganKauf-mann.Friedman,N.,Geiger,D.,&Goldszmidt,M.(1997).Bayesiannetworkclassiers.MachineLearning,131{163.Friedman,N.,&Getoor,L.(1999).Ecientlearningusingconstrainedsucientstatis-tics.InProceedingsofthe7thInternationalWorkshoponArticialIntelligenceandStatistics(AISTATS-99).Friedman,N.,Getoor,L.,Koller,D.,&Pfeer,A.(1996).Learningprobabilisticrela-tionalmodels.InProceedingsofthe16thInternationalJointConferenceonArticialIntelligence(IJCAI)(pp.1300{1307).Friedman,N.,Goldszmidt,M.,&Lee,T.(1998).Bayesiannetworkclassicationwithcontinousattributes:Gettingthebestofbothdiscretizationandparametrictting.ProceedingsoftheInternationalConferenceonMachineLearning(ICML).Geiger,D.(1992).Anentropy-basedlearningalgorithmofBayesianconditionaltrees.Proceedingsofthe8thConferenceonUncertaintyinAI(pp.92{97).MorganKaufmannPublishers.Geiger,D.,&Heckerman,D.(1996).KnowledgerepresentationandinferenceinsimilaritynetworksandBayesianmultinets.ArticialIntelligence,45{74.Hastie,T.,&Tibshirani,R.(1996).Discriminantanalysisbymixturemodeling.JournaloftheRoyalStatisticalSocietyB,155{176.Heckerman,D.,Geiger,D.,&Chickering,D.M.(1995).LearningBayesiannetworks:thecombinationofknowledgeandstatisticaldata.MachineLearning(3),197{243.Hinton,G.E.,Dayan,P.,Frey,B.,&Neal,R.M.(1995).Thewake-sleepalgorithmforunsupervisedneuralnetworks.Science,1158{1161. 
Jelinek, F. (1997). Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 181-214.

Kontkanen, P., Myllymaki, P., & Tirri, H. (1996). Constructing Bayesian finite mixture models by the EM algorithm (Tech. Rep. No. C-1996-9). University of Helsinki, Department of Computer Science.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 191-201.

Lauritzen, S. L. (1996). Graphical Models. Oxford: Clarendon Press.

Lauritzen, S. L., Dawid, A. P., Larsen, B. N., & Leimer, H.-G. (1990). Independence properties of directed Markov fields. Networks, 579-605.

MacLachlan, G. J., & Bashford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. NY: Marcel Dekker.

Meilă, M., & Jaakkola, T. (2000). Tractable Bayesian learning of tree distributions. In C. Boutilier & M. Goldszmidt (Eds.), Proceedings of the 16th Conference on Uncertainty in AI (pp. 380-388). San Francisco, CA: Morgan Kaufmann.

Meilă, M., & Jordan, M. I. (1998). Estimating dependency structure as a hidden variable. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Neural Information Processing Systems (pp. 584-590). MIT Press.

Meilă-Predoviciu, M. (1999). Learning with mixtures of trees. Unpublished doctoral dissertation, Massachusetts Institute of Technology.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood.

Monti, S., & Cooper, G. F. (1998). A Bayesian network classifier that combines a finite mixture model and a naive Bayes model (Tech. Rep. No. ISSP-98-01). University of Pittsburgh.

Moore, A. W., & Lee, M. S. (1998). Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 67-91.

Neal, R. M., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models (pp. 355-368). Cambridge, MA: MIT Press.

Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 1-38.

Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowledge-based neural networks to recognize genes in DNA sequences. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (pp. 530-538). Morgan Kaufmann Publishers.
a&JordanPearl,J.(1988).ProbabilisticReasoninginIntelligentSystems:NetworksofPlausibleInference.SanMateo,CA:MorganKaufmanPublishers.Philips,P.,Moon,H.,Rauss,P.,&Rizvi,S.(1997).TheFERETevaluationmethodologyforface-recognitionalgorithms.InProceedingsofthe1997ConferenceonComputerVisionandPatternRecognition.SanJuan,PuertoRico.Rasmussen,C.E.,Neal,R.M.,Hinton,G.E.,Camp,D.van,Revow,M.,Ghahramani,Z.,Kustra,R.,&Tibshrani,R.(1996).TheDELVEManual.http://www.cs.utoronto.ca/delve.Rissanen,J.(1989).StochasticComplexityinStatisticalInquiry.NewJersey:WorldScienticPublishingCompany.Rubin,D.B.,&Thayer,D.T.(1983).EMalgorithmsforMLfactoranalysis.Psychome-,69{76.Saul,L.K.,&Jordan,M.I.(1999).Ameaneldlearningalgorithmforunsupervisedneuralnetworks.InM.I.Jordan(Ed.),LearninginGraphicalModels(pp.541{554).Cambridge,MA:MITPress.Shafer,G.,&Shenoy,P.(1990).Probabilitypropagation.AnnalsofMathematicsandArticialIntelligence,327{352.Smyth,P.,Heckerman,D.,&Jordan,M.I.(1997).ProbabilisticindependencenetworksforhiddenMarkovprobabilitymodels.NeuralComputation,227{270.Thiesson,B.,Meek,C.,Chickering,D.M.,&Heckerman,D.(1997).LearningmixturesofBayesnetworks(Tech.Rep.Nos.MSR{POR{97{30).MicrosoftResearch.Watson,J.D.,Hopkins,N.H.,Roberts,J.W.,Steitz,J.A.,&Weiner,A.M.(1987).Molec-ularBiologyoftheGene(Vol.I,4ed.).MenloPark,CA:TheBenjamin/CummingsPublishingCompany.West,D.B.(1996).IntroductiontoGraphTheory.UpperSaddleRiver,NJ:PrenticeHall.