
Journal of Machine Learning Research 12 (2011) 1771-1812. Submitted 9/10; Revised 2/11; Published 5/11.

Learning Latent Tree Graphical Models

Myung Jin Choi, MYUNGJIN@MIT.EDU
Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Vincent Y. F. Tan, VTAN@WISC.EDU
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706

Animashree Anandkumar, A.ANANDKUMAR@UCI.EDU
Center for Pervasive Communications and Computing, Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697

Alan S. Willsky, WILLSKY@MIT.EDU
Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Editor: Marina Meilă

Abstract

We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables, and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set.

Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

©2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky.
1. Introduction

The inclusion of latent variables in modeling complex phenomena and data is a well-recognized and valuable construct in a variety of applications, including bio-informatics and computer vision, and the investigation of machine-learning methods for models with latent variables is a substantial and continuing direction of research.

There are three challenging problems in learning a model with latent variables: learning the number of latent variables; inferring the structure of how these latent variables relate to each other and to the observed variables; and estimating the parameters characterizing those relationships. Issues that one must consider in developing a new learning algorithm include developing tractable methods; incorporating the trade-off between fidelity to the given data and generalizability; deriving theoretical results on the performance of such algorithms; and studying applications that provide clear motivation and contexts for the models so learned.

One class of models that has received considerable attention in the literature is the class of latent tree models, that is, graphical models Markov on trees, in which variables at some nodes represent the original (observed) variables of interest while others represent the latent variables. The appeal of such models for computational tractability is clear: with a tree-structured model describing the statistical relationships, inference (processing noisy observations of some or all of the original variables to compute the estimates of all variables) is straightforward and scalable. Although the class of tree-structured models, with or without latent variables, is a constrained one, there are interesting applications that provide strong motivation for the work presented here. In particular, a very active avenue of research in computer vision is the use of context, for example, the nature of a scene to aid the reliable recognition of objects (and at the same time to allow the recognition of particular objects to assist in recognizing the scene). For example, if one knows that an image is that of an office, then one might expect to find a desk, a monitor on that desk, and perhaps a computer mouse. Hence, if one builds a model with a latent variable representing that context ("office") and uses simple, noisy detectors for different object types, one would expect that the detection of a desk would support the likelihood that one is looking at an office and, through that, enhance the reliability of detecting smaller objects (monitors, keyboards, mice, etc.). Work along these lines, including by some of the authors of this paper (Parikh and Chen, 2007; Choi et al., 2010), shows the promise of using tree-based models of context.

This paper considers the problem of learning tree-structured latent models. If all variables are observed in the tree under consideration, then the well-known algorithm of Chow and Liu (1968) provides a tractable algorithm for performing maximum likelihood (ML) estimation of the tree structure. However, if not all variables are observed, that is, for latent tree models, then ML estimation is NP-hard (Roch, 2006). This has motivated a number of investigations of other tractable methods for learning such trees, as well as theoretical guarantees on performance. Our work represents a contribution to this area of investigation.

There are three main contributions in our paper. Firstly, by adopting a statistical distance-based framework, we develop two new algorithms for the learning of latent trees: recursive grouping and CLGrouping, which apply equally well to discrete and Gaussian models. Secondly, we provide consistency guarantees (both structural and parametric) as well as very favorable computational and sample complexity characterizations for both of our algorithms. Thirdly, through extensive numerical experiments on both synthetic and real-world data, we demonstrate the superiority of our approach for a wide variety of models, ranging from ones with very large tree diameters (e.g., hidden Markov models (HMMs)) to star models and complete trees.[1]
Our first algorithm, which we refer to as recursive grouping, constructs a latent tree in a bottom-up fashion, grouping nodes into sibling groups that share the same parent node, recursively at each level of the resulting hierarchy (and allowing for some of the observed variables to play roles at arbitrary levels in the resulting hierarchy). Our second algorithm, CLGrouping, first implements a global construction step, namely producing the Chow-Liu tree for the observed variables without any hidden nodes. This global step then provides guidance for groups of observed nodes that are likely to be topologically close to each other in the latent tree, thereby guiding subsequent recursive grouping or neighbor-joining (Saitou and Nei, 1987) computations. Each of these algorithms is consistent and has excellent sample and computational complexity.[2]

As Pearl (1988) points out, the identification of latent tree models has some built-in ambiguity, as there is an entire equivalence class of models in the sense that, when all latent variables are marginalized out, each model in this class yields the same joint distribution over the observed variables. For example, we can take any such latent model and add another hidden variable as a leaf node connected to only one other (hidden or observed) node. Hence, much as one finds in fields such as state space dynamic systems (e.g., Luenberger, 1979, Section 8), there is a notion of minimality that is required here, and our results are stated in terms of consistent learning of such minimal latent models.

1.1 Related Work

The relevant literature on learning latent models is vast, and in this section we summarize the main lines of research in this area.

The classical latent cluster models (LCM) consider multivariate distributions in which there exists only one latent variable and each state of that variable corresponds to a cluster in the data (Lazarsfeld and Henry, 1968). Hierarchical latent class (HLC) models (Zhang and Kocka, 2004; Zhang, 2004; Chen et al., 2008) generalize these models by allowing multiple latent variables. HLC allows latent variables to have different numbers of states, but assumes that all observed nodes are at the leaves of the tree. Their learning algorithm is based on a greedy approach of making one local move at a time (e.g., introducing one hidden node, or replacing an edge), which is computationally expensive and does not have consistency guarantees. A greedy learning algorithm for HLC called BIN is proposed in Harmeling and Williams (2010), which is computationally more efficient. In addition, Silva et al. (2006) considered the learning of directed latent models using so-called tetrad constraints, and there have also been attempts to tailor the learning of latent tree models in order to perform approximate inference accurately and efficiently downstream (Wang et al., 2008). In all these works, the latent variables can have different state spaces, but the observed nodes are required to be leaves of the tree. In contrast, we fix the state space of each hidden node, but allow the possibility that some observed nodes are internal nodes (non-leaves). This assumption leads to an identifiable model, and we provide algorithms with consistency guarantees which can recover the correct structure under mild conditions.

1. A tree is called a complete k-ary tree (or k-complete tree) if all its internal nodes have degree k and there exists one node (commonly referred to as the root node) that has exactly the same distance to all leaf nodes.
2. As we will see, depending on the true latent tree model, one or the other of these may be more efficient. Roughly speaking, for smaller diameter graphs (such as the star), recursive grouping is faster, and for larger diameter graphs (such as an HMM), CLGrouping is more efficient.
In contrast, the works in Zhang and Kocka (2004), Zhang (2004), Chen et al. (2008) and Harmeling and Williams (2010) do not provide such consistency guarantees.

Many authors also propose reconstructing latent trees using the expectation maximization (EM) algorithm (Elidan and Friedman, 2005; Kemp and Tenenbaum, 2008). However, as with all other EM-based methods, these approaches depend on the initialization and suffer from the possibility of being trapped in local optima, and thus no consistency guarantees can be provided. At each iteration, a large number of candidate structures needs to be evaluated, so these methods assume that all observed nodes are leaves of the tree to reduce the number of candidate structures. Algorithms have been proposed (Hsu et al., 2009) with sample complexity guarantees for learning HMMs under the condition that the joint distributions of the observed variables generated by distinct hidden states are distinct.

Another related line of research is that of (hierarchical) clustering. See Jain et al. (1999), Balcan and Gupta (2010) and the references therein for extensive discussions. The primary objective of hierarchical clustering is to build a tree consisting of nested partitions of the observed data, where the leaves (typically) consist of single data points while the internal nodes represent coarser partitions. The difference from our work is that hierarchical clustering does not assume a probabilistic graphical model (Markov random field) on the data, but imposes constraints on the data points via a similarity matrix. We are interested in learning tree-structured graphical models with hidden variables.

The reconstruction of latent trees has been studied extensively by the phylogenetic community, where sequences of extant species are available and the unknown phylogenetic tree is to be inferred from these sequences. See Durbin et al. (1999) for a thorough overview. Efficient algorithms with provable performance guarantees are available (Erdős et al., 1999; Daskalakis et al., 2006). However, the works in this area mostly assume that only the leaves are observed and each internal node (which is hidden) has the same degree except for the root. The most popular algorithm for constructing phylogenetic trees is the neighbor-joining (NJ) method by Saitou and Nei (1987). Like our recursive grouping algorithm, the input to the algorithm is a set of statistical distances between observed variables. The algorithm proceeds by recursively pairing two nodes that are the closest neighbors in the true latent tree and introducing a hidden node as the parent of the two nodes. For more details on NJ, the reader is referred to Durbin et al. (1999, Section 7.3).

Another popular class of reconstruction methods used in the phylogenetic community is the family of quartet-based distance methods (Bandelt and Dress, 1986; Erdős et al., 1999; Jiang et al., 2001).[3] Quartet-based methods first construct a set of quartets for all subsets of four observed nodes. Subsequently, these quartets are combined to form a latent tree. However, when we only have access to the samples at the observed nodes, it is not straightforward to construct a latent tree from a set of quartets, since the quartets may not be consistent.[4] In fact, it is known that the problem of determining a latent tree that agrees with the maximum number of quartets is NP-hard (Steel, 1992), but many heuristics have been proposed (Farris, 1972; Sattath and Tversky, 1977). Also, in practice, quartet-based methods are usually much less accurate than NJ (St. John et al., 2003), and hence we only compare our proposed algorithms to NJ in our experiments. For further comparisons (of the sample complexity and other aspects) between the quartet methods and NJ, the reader is referred to Csűrös (2000) and St. John et al. (2003).
3. A quartet is simply an unrooted binary tree on a set of four observed nodes.
4. The term consistent here is not the same as the estimation-theoretic one. Here, we say that a set of quartets is consistent if there exists a latent tree such that all quartets agree with the tree.

Another distance-based algorithm was proposed in Pearl (1988, Section 8.3.3). This algorithm is very similar in spirit to quartet-based methods, but instead of finding quartets for all subsets of four observed nodes, it finds just enough quartets to determine the location of each observed node in the tree. Although the algorithm is consistent, it performs poorly when only the samples of observed nodes are available (Pearl, 1988, Section 8.3.5).

The learning of phylogenetic trees is related to the emerging field of network tomography (Castro et al., 2004), in which one seeks to learn characteristics (such as structure) from data which are only available at the end points (e.g., sources and sinks) of the network. However, again, observations are only available at the leaf nodes, and usually the objective is to estimate the delay distributions corresponding to nodes linked by an edge (Tsang et al., 2003; Bhamidi et al., 2009). The modeling of the delay distributions is different from the learning of latent tree graphical models discussed in this paper.

1.2 Paper Organization

The rest of the paper is organized as follows. In Section 2, we introduce the notations and terminologies used in the paper. In Section 3, we introduce the notion of information distances, which are used to reconstruct tree models. In the subsequent two sections, we make two assumptions: firstly, that the true distribution is a latent tree and, secondly, that perfect knowledge of the information distances between observed variables is available. We introduce recursive grouping in Section 4. This is followed by our second algorithm, CLGrouping, in Section 5. In Section 6, we relax the assumption that the information distances are known, develop sample-based algorithms, and at the same time provide sample complexity guarantees for recursive grouping and CLGrouping. We also discuss extensions of our algorithms for the case when the underlying model is not a tree and our goal is to learn an approximation to it using a latent tree model. We demonstrate the empirical performance of our algorithms in Section 7 and conclude the paper in Section 8. The appendix includes proofs for the theorems presented in the paper.

2. Latent Tree Graphical Models

In this section, we provide some background and introduce the notions of minimal tree extensions and consistency.

2.1 Undirected Graphs

Let $G=(W,E)$ be an undirected graph with vertex (or node) set $W=\{1,\ldots,M\}$ and edge set $E \subset \binom{W}{2}$. Let $\mathrm{nbd}(i;G)$ and $\mathrm{nbd}[i;G]$ be the set of neighbors of node $i$ and the closed neighborhood of $i$ respectively, that is, $\mathrm{nbd}[i;G] = \mathrm{nbd}(i;G) \cup \{i\}$. If an undirected graph does not include any loops, it is called a tree. A collection of disconnected trees is called a forest.[5] For a tree $T=(W,E)$, the set of leaf nodes (nodes with degree 1), the maximum degree, and the diameter are denoted by $\mathrm{Leaf}(T)$, $\Delta(T)$, and $\mathrm{diam}(T)$ respectively. The path between two nodes $i$ and $j$ in a tree $T=(W,E)$, which is unique, is the set of edges connecting $i$ and $j$ and is denoted as $\mathrm{Path}((i,j);E)$. The distance between any two nodes $i$ and $j$ is the number of edges in $\mathrm{Path}((i,j);E)$.

5. Strictly speaking, a graph with no loops is called a forest, and it is called a tree only if every node is connected to each other.
In an undirected tree, we can choose a root node arbitrarily and define the parent-child relationships with respect to the root: for a pair of neighboring nodes $i$ and $j$, if $i$ is closer to the root than $j$ is, then $i$ is called the parent of $j$, and $j$ is called the child of $i$. Note that the root node does not have any parent, and for all other nodes in the tree, there exists exactly one parent. We use $C(i)$ to denote the set of child nodes of $i$. A set of nodes that share the same parent is called a sibling group. A family is the union of the siblings and the associated parent.

A latent tree is a tree with node set $W = V \cup H$, the union of a set of observed nodes $V$ (with $m=|V|$) and a set of latent (or hidden) nodes $H$. The effective depth $\delta(T;V)$ (with respect to $V$) is the maximum distance of a hidden node to its closest observed node, that is,

$$\delta(T;V) = \max_{i \in H}\, \min_{j \in V}\, |\mathrm{Path}((i,j);T)|. \qquad (1)$$

2.2 Graphical Models

An undirected graphical model (Lauritzen, 1996) is a family of multivariate probability distributions that factorize according to a graph $G=(W,E)$. More precisely, let $X=(X_1,\ldots,X_M)$ be a random vector, where each random variable $X_i$, which takes on values in an alphabet $\mathcal{X}$, corresponds to the variable at node $i \in W$. The set of edges $E$ encodes the set of conditional independencies in the model. The random vector $X$ is said to be Markov on $G$ if, for every $i$, the random variable $X_i$ is conditionally independent of all other variables given its neighbors, that is, if $p$ is the joint distribution[6] of $X$, then

$$p(x_i \mid x_{\mathrm{nbd}(i;G)}) = p(x_i \mid x_{\setminus i}), \qquad (2)$$

where $x_{\setminus i}$ denotes the set of all variables[7] excluding $x_i$. Equation (2) is known as the local Markov property.

In this paper, we consider both discrete and Gaussian graphical models. For discrete models, the alphabet $\mathcal{X}=\{1,\ldots,K\}$ is a finite set. For Gaussian graphical models, $\mathcal{X}=\mathbb{R}$ and, furthermore, without loss of generality, we assume that the mean is known to be the zero vector, and hence the joint distribution

$$p(x) = \frac{1}{\det(2\pi\Sigma)^{1/2}} \exp\left(-\frac{1}{2}\, x^T \Sigma^{-1} x\right)$$

depends only on the covariance matrix $\Sigma$.

An important and tractable class of graphical models is the set of tree-structured graphical models, that is, multivariate probability distributions that are Markov on an undirected tree $T=(W,E)$. It is known from junction tree theory (Cowell et al., 1999) that the joint distribution $p$ for such a model factorizes as

$$p(x_1,\ldots,x_M) = \prod_{i \in W} p(x_i) \prod_{(i,j) \in E} \frac{p(x_i,x_j)}{p(x_i)\,p(x_j)}. \qquad (3)$$

That is, the sets of marginals $\{p(x_i) : i \in W\}$ and pairwise joints on the edges $\{p(x_i,x_j) : (i,j) \in E\}$ fully characterize the joint distribution of a tree-structured graphical model.

6. We abuse the term distribution to mean a probability mass function in the discrete case (density with respect to the counting measure) and a probability density function (density with respect to the Lebesgue measure) in the continuous case.
7. We will use the terms node, vertex and variable interchangeably in the sequel.
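The factorization in (3) is easy to sanity-check numerically. The following snippet is our own sketch, not part of the paper; all probabilities are illustrative. It builds a three-node binary Markov chain and verifies that the product of node marginals and edge ratios reproduces the joint exactly:

```python
import numpy as np

# Minimal check of (3) on the chain X1 - X2 - X3 with binary variables.
p1 = np.array([0.6, 0.4])                  # marginal of X1 (illustrative)
T12 = np.array([[0.8, 0.2], [0.3, 0.7]])   # p(x2 | x1), rows indexed by x1
T23 = np.array([[0.9, 0.1], [0.4, 0.6]])   # p(x3 | x2), rows indexed by x2

# Joint via the Markov chain: p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2).
joint = np.einsum('i,ij,jk->ijk', p1, T12, T23)

# Marginals and pairwise joints on the two edges.
p_x1 = joint.sum(axis=(1, 2))
p_x2 = joint.sum(axis=(0, 2))
p_x3 = joint.sum(axis=(0, 1))
p_12 = joint.sum(axis=2)
p_23 = joint.sum(axis=0)

# Factorization (3): product of node marginals times edge ratios.
recon = (p_x1[:, None, None] * p_x2[None, :, None] * p_x3[None, None, :]
         * (p_12 / (p_x1[:, None] * p_x2[None, :]))[:, :, None]
         * (p_23 / (p_x2[:, None] * p_x3[None, :]))[None, :, :])

assert np.allclose(joint, recon)   # (3) holds exactly on a tree
```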
A special class of discrete tree-structured graphical models is the set of symmetric discrete distributions. This class of models is characterized by the fact that the pairs of variables $(X_i,X_j)$ on all the edges $(i,j) \in E$ follow the conditional probability law

$$p(x_i \mid x_j) = \begin{cases} 1-(K-1)\,q_{ij}, & \text{if } x_i = x_j,\\ q_{ij}, & \text{otherwise,} \end{cases} \qquad (4)$$

and the marginal distribution of every variable in the tree is uniform, that is, $p(x_i)=1/K$ for all $x_i \in \mathcal{X}$ and for all $i \in V \cup H$. The parameter $q_{ij} \in (0,1/K)$ in (4), which does not depend on the state values $x_i,x_j \in \mathcal{X}$ (but can be different for different pairs $(i,j) \in E$), is known as the crossover probability.
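As a quick illustration of (4), the snippet below (our own sketch; $K$ and $q$ are arbitrary choices within the stated range) builds the $K \times K$ conditional law of one edge and checks that it is a valid channel that preserves the uniform marginals:

```python
import numpy as np

# The symmetric edge model (4): 1 - (K-1) q on the diagonal, q elsewhere.
K, q = 4, 0.1                                      # illustrative, q in (0, 1/K)
cond = np.full((K, K), q) + (1.0 - K * q) * np.eye(K)

assert np.allclose(cond.sum(axis=0), 1.0)          # a valid conditional law
uniform = np.full(K, 1.0 / K)
assert np.allclose(cond @ uniform, uniform)        # uniform marginals propagate
```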
Let $x^n = \{x^{(1)},\ldots,x^{(n)}\}$ be a set of $n$ i.i.d. samples drawn from a graphical model (distribution) $p$, Markov on a latent tree $T_p=(W,E_p)$, where $W=V \cup H$. Each sample $x^{(l)} \in \mathcal{X}^M$ is a length-$M$ vector. In our setup, the learner only has access to samples drawn from the observed node set $V$, and we denote this set of sub-vectors containing only the elements in $V$ as $x_V^n = \{x_V^{(1)},\ldots,x_V^{(n)}\}$, where each observed sample $x_V^{(l)} \in \mathcal{X}^m$ is a length-$m$ vector. Our algorithms learn latent tree structures using the information distances (defined in Section 3) between pairs of observed variables, which can be estimated from samples.

We now comment on the above model assumptions. Note that we assume that the hidden variables have the same domain as the observed ones (all of which also have a common domain). We do not view this as a serious modeling restriction, since we develop efficient algorithms with strong theoretical guarantees, and these algorithms have very good performance on real-world data (see Section 7). Nonetheless, it may be possible to develop a unified framework to incorporate variables with different state spaces (i.e., both continuous and discrete) under a reproducing kernel Hilbert space (RKHS) framework along the lines of Song et al. (2010). We defer this to future work.

2.3 Minimal Tree Extensions

Our ultimate goal is to recover the graphical model $p$, that is, the latent tree structure and its parameters, given $n$ i.i.d. samples of the observed variables $x_V^n$. However, in general, there can be multiple latent tree models which result in the same observed statistics, that is, the same joint distribution $p_V$ of the observed variables. We consider the class of tree models where it is possible to recover the latent tree model uniquely, and provide necessary conditions for structure identifiability, that is, the identifiability of the edge set $E$.

Firstly, we limit ourselves to the scenario where all the random variables (both observed and latent) take values on a common alphabet $\mathcal{X}$. Thus, in the Gaussian case, each hidden and observed variable is a univariate Gaussian. In the discrete case, each variable takes on values in the same finite alphabet $\mathcal{X}$. Note that the model may not be identifiable if some of the hidden variables are allowed to have arbitrary alphabets. As an example, consider a discrete latent tree model with binary observed variables ($K=2$). A latent tree with the simplest structure (fewest number of nodes) is a tree in which all $m$ observed binary variables are connected to one hidden variable. If we allow the hidden variable to take on $2^m$ states, then the tree can describe all possible statistics among the $m$ observed variables, that is, the joint distribution $p_V$ can be arbitrary.[8]

A probability distribution $p_V(x_V)$ is said to be tree-decomposable if it is the marginal (of the variables in $V$) of a tree-structured graphical model $p(x_V,x_H)$. In this case, $p$ (over variables in $W$) is said to be a tree extension of $p_V$ (Pearl, 1988). A distribution $p$ is said to have a redundant hidden node $h \in H$ if we can remove $h$ and the marginal on the set of visible nodes $V$ remains as $p_V$.

8. This follows from an elementary parameter counting argument.

Figure 1: Examples of minimal latent trees. Shaded nodes are observed and unshaded nodes are hidden. (a) An identifiable tree. (b) A non-identifiable tree, because $h_4$ and $h_5$ have degrees less than 3.

The following conditions ensure that a latent tree does not include a redundant hidden node (Pearl, 1988):

(C1) Each hidden variable has at least three neighbors (which can be either hidden or observed). Note that this ensures that all leaf nodes are observed (although not all observed nodes need to be leaves).

(C2) Any two variables connected by an edge in the tree model are neither perfectly dependent nor independent.

Figure 1(a) shows an example of a tree satisfying (C1). If (C2), which is a condition on parameters, is also satisfied, then the tree in Figure 1(a) is identifiable. The tree shown in Figure 1(b) does not satisfy (C1), because $h_4$ and $h_5$ have degrees less than 3. In fact, if we marginalize out the hidden variables $h_4$ and $h_5$, then the resulting model has the same tree structure as in Figure 1(a). We assume throughout the paper that (C2) is satisfied for all probability distributions.

Let $\mathcal{T}_{\geq 3}$ be the set of (latent) trees satisfying (C1). We refer to $\mathcal{T}_{\geq 3}$ as the set of minimal (or identifiable) latent trees. Minimal latent trees do not contain redundant hidden nodes. The distribution $p$ (over $W$, and Markov on some tree in $\mathcal{T}_{\geq 3}$) is said to be a minimal tree extension of $p_V$. As illustrated in Figure 1, using marginalization operations, any non-minimal latent tree distribution can be reduced to a minimal latent tree model.

Proposition 1 (Minimal Tree Extensions) (Pearl, 1988, Section 8.3)
(i) For every tree-decomposable distribution $p_V$, there exists a minimal tree extension $p$ Markov on a tree $T \in \mathcal{T}_{\geq 3}$, which is unique up to the renaming of the variables or their values.
(ii) For Gaussian and binary distributions, if $p_V$ is known exactly, then the minimal tree extension $p$ can be recovered.
(iii) The structure of $T$ is uniquely determined by the pairwise distributions of the observed variables $p(x_i,x_j)$ for all $i,j \in V$.

2.4 Consistency

We now define the notion of consistency. In Section 6, we show that our latent tree learning algorithms are consistent.

Definition 2 (Consistency) A latent tree reconstruction algorithm $\mathcal{A}$ is a map from the observed samples $x_V^n$ to an estimated tree $\widehat{T}^n$ and an estimated tree-structured graphical model $\widehat{p}^n$. We say that a latent tree reconstruction algorithm $\mathcal{A}$ is structurally consistent if there exists a graph homomorphism[9] $h$ such that

$$\lim_{n\to\infty} \Pr\left(h(\widehat{T}^n) \neq T_p\right) = 0. \qquad (5)$$

Furthermore, we say that $\mathcal{A}$ is risk consistent if, for every $\varepsilon > 0$,

$$\lim_{n\to\infty} \Pr\left(D(p\,\|\,\widehat{p}^n) > \varepsilon\right) = 0, \qquad (6)$$

where $D(p\,\|\,\widehat{p}^n)$ is the KL-divergence (Cover and Thomas, 2006) between the true distribution $p$ and the estimated distribution $\widehat{p}^n$.

9. A graph homomorphism is a mapping between graphs that respects their structure. More precisely, a graph homomorphism $h$ from a graph $G=(V,E)$ to a graph $G'=(V',E')$, written $h: G \to G'$, is a mapping $h: V \to V'$ such that $(i,j) \in E$ implies that $(h(i),h(j)) \in E'$.

In the following sections, we design structurally and risk consistent algorithms for (minimal) Gaussian and symmetric discrete latent tree models, defined in (4). Our algorithms use pairwise distributions between the observed nodes. However, for general discrete models, pairwise distributions between observed nodes are, in general, not sufficient to recover the parameters (Chang and Hartigan, 1991). Therefore, we only prove structural consistency, as defined in (5), for general discrete latent tree models. For such distributions, we consider a two-step procedure for structure and parameter estimation: firstly, we estimate the structure of the latent tree using the algorithms suggested in this paper; subsequently, we use the Expectation Maximization (EM) algorithm (Dempster et al., 1977) to infer the parameters. Note that, as mentioned previously, risk consistency will not be guaranteed in this case.

3. Information Distances

The proposed algorithms in this paper receive as inputs the set of so-called (exact or estimated) information distances, which are functions of the pairwise distributions. These quantities are defined in Section 3.1 for the two classes of tree-structured graphical models discussed in this paper, namely the Gaussian and discrete graphical models. We also show that the information distances have a particularly simple form for symmetric discrete distributions. In Section 3.2, we use the information distances to infer the relationships between the observed variables, such as whether $i$ is a child of $j$ or whether $i$ and $j$ are siblings.

3.1 Definitions of Information Distances

We define information distances for Gaussian and discrete distributions and show that these distances are additive for tree-structured graphical models. Recall that, for two random variables $X_i$ and $X_j$, the correlation coefficient is defined as

$$\rho_{ij} = \frac{\mathrm{Cov}(X_i,X_j)}{\sqrt{\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}}. \qquad (7)$$

For Gaussian graphical models, the information distance associated with the pair of variables $X_i$ and $X_j$ is defined as

$$d_{ij} = -\log|\rho_{ij}|. \qquad (8)$$

Intuitively, if the information distance $d_{ij}$ is large, then $X_i$ and $X_j$ are weakly correlated, and vice versa.

For discrete random variables, let $J^{ij}$ denote the joint probability matrix between $X_i$ and $X_j$ (i.e., $J^{ij}_{ab} = p(x_i=a, x_j=b)$, $a,b \in \mathcal{X}$). Also let $M^i$ be the diagonal marginal probability matrix of $X_i$ (i.e., $M^i_{aa} = p(x_i=a)$). For discrete graphical models, the information distance associated with the pair of variables $X_i$ and $X_j$ is defined as (Lake, 1994)

$$d_{ij} = -\log \frac{|\det J^{ij}|}{\sqrt{\det M^i \det M^j}}. \qquad (9)$$

Note that for binary variables, that is, $K=2$, the value of $d_{ij}$ in (9) reduces to the expression in (8), that is, the information distance is a function of the correlation coefficient, defined in (7), just as in the Gaussian case.

For symmetric discrete distributions defined in (4), the information distance defined for discrete graphical models in (9) reduces to

$$d_{ij} = -(K-1)\log(1-K q_{ij}). \qquad (10)$$

Note that there is a one-to-one correspondence between the information distances $d_{ij}$ and the model parameters for Gaussian distributions (parametrized by the correlation coefficient $\rho_{ij}$) in (8) and for symmetric discrete distributions (parametrized by the crossover probability $q_{ij}$) in (10). Thus, these two classes of distributions are completely characterized by the information distances $d_{ij}$. On the other hand, this does not hold for general discrete distributions. Moreover, if the underlying distribution is a symmetric discrete model or a Gaussian model, the information distance $d_{ij}$ and the mutual information $I(X_i;X_j)$ (Cover and Thomas, 2006) are monotonically related, and we will exploit this result in Section 5. For general distributions, this is not valid. See Section 5.5 for further discussions.

Equipped with these definitions of information distances, assumption (C2) in Section 2.3 can be rewritten as the following: there exist constants $0 < l \leq u < \infty$ such that

(C2) $\quad l \leq d_{ij} \leq u, \quad \forall\, (i,j) \in E_p. \qquad (11)$

Proposition 3 (Additivity of Information Distances) The information distances $d_{ij}$ defined in (8), (9), and (10) are additive tree metrics (Erdős et al., 1999). In other words, if the joint probability distribution $p(x)$ is a tree-structured graphical model Markov on the tree $T_p=(W,E_p)$, then the information distances are additive on $T_p$:

$$d_{kl} = \sum_{(i,j) \in \mathrm{Path}((k,l);E_p)} d_{ij}, \quad \forall\, k,l \in W. \qquad (12)$$

The property in (12) implies that if each pair of vertices $i,j \in W$ is assigned the weight $d_{ij}$, then $T_p$ is a minimum spanning tree on $W$, denoted as $\mathrm{MST}(W;D)$, where $D$ is the information distance matrix with elements $d_{ij}$.

It is straightforward to show that the information distances are additive for the Gaussian and symmetric discrete cases using the local Markov property of graphical models. For general discrete distributions with information distance as in (9), see Lake (1994) for the proof. In the rest of the paper, we map the parameters of Gaussian and discrete distributions to an information distance matrix $D=[d_{ij}]$ to unify the analyses for both cases.
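Proposition 3 can be checked numerically on a three-node chain. The sketch below is ours (the correlation coefficients and crossover probabilities are illustrative); it uses the facts that, on a Gaussian tree, correlations multiply along a path, and that, for the symmetric model (4), composing two edge channels multiplies the factors $1-Kq$:

```python
import numpy as np

# (a) Gaussian case: rho_13 = rho_12 * rho_23 along the chain, so (8) is additive.
rho12, rho23 = 0.8, -0.5
d_g = lambda rho: -np.log(abs(rho))
assert np.isclose(d_g(rho12 * rho23), d_g(rho12) + d_g(rho23))      # (8) and (12)

# (b) Symmetric discrete case: composing two channels of the form (4)
# multiplies the factors (1 - K q), so (10) is additive as well.
K, q12, q23 = 3, 0.1, 0.2
chan = lambda q: (1 - K * q) * np.eye(K) + q * np.ones((K, K))
q13 = (chan(q12) @ chan(q23))[0, 1]        # composed channel is again symmetric
d_s = lambda q: -(K - 1) * np.log(1 - K * q)
assert np.isclose(d_s(q13), d_s(q12) + d_s(q23))                    # (10) and (12)
```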
Figure 2: Examples for each case in TestNodeRelationships. For each edge, $e_i$ represents the information distance associated with the edge. (a) Case 1: $\Phi_{ijk} = e_8 = d_{ij}$ for all $k \in V \setminus \{i,j\}$. (b) Case 2: $\Phi_{ijk} = e_6 - e_7$ and $d_{ij} = e_6 + e_7$ for all $k \in V \setminus \{i,j\}$. (c) Case 3a: $\Phi_{ijk} = e_4 + e_2 + e_3 - e_7 \neq \Phi_{ijk'} = e_4 - e_2 - e_3 - e_7$. (d) Case 3b: $\Phi_{ijk} = e_4 + e_5 \neq \Phi_{ijk'} = e_4 - e_5$. (e) Case 3c: $\Phi_{ijk} = e_5 \neq \Phi_{ijk'} = -e_5$.

3.2 Testing Inter-Node Relationships

In this section, we use Proposition 3 to ascertain child-parent and sibling (cf. Section 2.1) relationships between the variables in a latent tree-structured graphical model. To do so, for any three variables $i,j,k \in V$, we define $\Phi_{ijk} = d_{ik} - d_{jk}$ to be the difference between the information distances $d_{ik}$ and $d_{jk}$. The following lemma suggests a simple procedure to identify the set of relationships between the nodes.

Lemma 4 (Sibling Grouping) For distances $d_{ij}$ for all $i,j \in V$ on a tree $T \in \mathcal{T}_{\geq 3}$, the following two properties on $\Phi_{ijk} = d_{ik} - d_{jk}$ hold:
(i) $\Phi_{ijk} = d_{ij}$ for all $k \in V \setminus \{i,j\}$ if and only if $i$ is a leaf node and $j$ is its parent. Similarly, $\Phi_{ijk} = -d_{ij}$ for all $k \in V \setminus \{i,j\}$ if and only if $j$ is a leaf node and $i$ is its parent.
(ii) $-d_{ij} < \Phi_{ijk} = \Phi_{ijk'} < d_{ij}$ for all $k,k' \in V \setminus \{i,j\}$ if and only if both $i$ and $j$ are leaf nodes and they have the same parent, that is, they belong to the same sibling group.

The proof of the lemma uses Proposition 3 and is provided in Appendix A.1. Given Lemma 4, we can first determine all the values of $\Phi_{ijk}$ for triples $i,j,k \in V$. We can then determine the relationship between nodes $i$ and $j$ as follows: fix the pair of nodes $i,j \in V$ and consider all the other nodes $k \in V \setminus \{i,j\}$. Then, there are three cases for the set $\{\Phi_{ijk} : k \in V \setminus \{i,j\}\}$:

1. $\Phi_{ijk} = d_{ij}$ for all $k \in V \setminus \{i,j\}$. Then, $i$ is a leaf node and $j$ is the parent of $i$. Similarly, if $\Phi_{ijk} = -d_{ij}$ for all $k \in V \setminus \{i,j\}$, then $j$ is a leaf node and $i$ is the parent of $j$.

2. $\Phi_{ijk}$ is constant for all $k \in V \setminus \{i,j\}$ but not equal to either $d_{ij}$ or $-d_{ij}$. Then $i$ and $j$ are leaf nodes and they are siblings.

3. $\Phi_{ijk}$ is not equal for all $k \in V \setminus \{i,j\}$. Then, there are three cases: either (a) nodes $i$ and $j$ are neither siblings nor have a parent-child relationship, or (b) nodes $i$ and $j$ are siblings but at least one of them is not a leaf, or (c) nodes $i$ and $j$ have a parent-child relationship but the child is not a leaf.

Thus, we have a simple test to determine the relationship between $i$ and $j$ and to ascertain whether $i$ and $j$ are leaf nodes. We call the above test TestNodeRelationships. See Figure 2 for examples. By running this test for all $i$ and $j$, we can determine all the relationships among all pairs of observed variables.
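The test is straightforward to implement when exact distances are available. The sketch below is our own (the star tree with an observed centre and its edge weights are illustrative); it classifies a pair of observed nodes by checking the three cases on $\Phi_{ijk}$:

```python
import numpy as np
from itertools import combinations

# Exact path distances on a star tree: observed centre 0, observed leaves 1-3.
e = {1: 1.0, 2: 1.5, 3: 0.7}                       # illustrative edge weights
V = [0, 1, 2, 3]
dist = {}
for i in (1, 2, 3):
    dist[(0, i)] = dist[(i, 0)] = e[i]
for i, j in combinations((1, 2, 3), 2):
    dist[(i, j)] = dist[(j, i)] = e[i] + e[j]      # paths go through node 0

def classify(i, j):
    """Classify pair (i, j) via Phi_ijk = d_ik - d_jk over all other k (Lemma 4)."""
    phis = [dist[(i, k)] - dist[(j, k)] for k in V if k not in (i, j)]
    if all(np.isclose(p, dist[(i, j)]) for p in phis):
        return 'i is a leaf, j is its parent'      # case 1
    if all(np.isclose(p, -dist[(i, j)]) for p in phis):
        return 'j is a leaf, i is its parent'      # case 1, reversed
    if np.isclose(max(phis), min(phis)):
        return 'leaf siblings'                     # case 2
    return 'none of the above'                     # case 3

print(classify(1, 0))   # -> 'i is a leaf, j is its parent'
print(classify(1, 2))   # -> 'leaf siblings'
```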
In the following section, we describe a recursive algorithm that is based on the above TestNodeRelationships procedure to reconstruct the entire latent tree model, assuming that the true model is a latent tree and that the true distance matrix $D=[d_{ij}]$ is known. In Section 5, we provide improved algorithms for the learning of latent trees, again assuming that $D$ is known. Subsequently, in Section 6, we develop algorithms for the consistent reconstruction of latent trees when the information distances are unknown and we have to estimate them from the samples $x_V^n$. In addition, in Section 6.6 we discuss how to extend these algorithms to the case when $p_V$ is not necessarily tree-decomposable, that is, the original graphical model is not assumed to be a latent tree.

4. Recursive Grouping Algorithm Given Information Distances

This section is devoted to the development of the first algorithm for reconstructing latent tree models, recursive grouping (RG). At a high level, RG is a recursive procedure in which, at each step, TestNodeRelationships is used to identify nodes that belong to the same family. Subsequently, RG introduces a parent node if a family of nodes (i.e., a sibling group) does not contain an observed parent. This newly introduced parent node corresponds to a hidden node in the original unknown latent tree. Once such a parent (i.e., hidden) node $h$ is introduced, the information distances from $h$ to all other observed nodes can be computed.

The inputs to RG are the vertex set $V$ and the matrix of information distances $D$ corresponding to a latent tree. The algorithm proceeds by recursively grouping nodes and adding hidden variables. In each iteration, the algorithm acts on a so-called active set of nodes $Y$, and in the process constructs a new active set $Y_{\mathrm{new}}$ for the next iteration.[10] The steps are as follows:

1. Initialize by setting $Y = V$ to be the set of observed variables.

2. Compute $\Phi_{ijk} = d_{ik} - d_{jk}$ for all $i,j,k \in Y$.

3. Using the TestNodeRelationships procedure, define $\{P_l\}_{l=1}^{L}$ to be the coarsest partition[11] of $Y$ such that, for every subset $P_l$ (with $|P_l| \geq 2$), any two nodes in $P_l$ are either siblings which are leaf nodes or they have a parent-child relationship[12] in which the child is a leaf.[13] Note that some $P_l$ may consist of a single node. Begin to construct the new active set by adding the nodes in these single-node partitions: $Y_{\mathrm{new}} \leftarrow \bigcup_{l : |P_l|=1} P_l$.

4. For each $l=1,\ldots,L$ with $|P_l| \geq 2$: if $P_l$ contains a parent node $u$, update $Y_{\mathrm{new}} \leftarrow Y_{\mathrm{new}} \cup \{u\}$; otherwise, introduce a new hidden node $h$, connect $h$ (as a parent) to every node in $P_l$, and set $Y_{\mathrm{new}} \leftarrow Y_{\mathrm{new}} \cup \{h\}$.

5. Update the active set: $Y_{\mathrm{old}} \leftarrow Y$ and $Y \leftarrow Y_{\mathrm{new}}$.

10. Note that the current active set is also used (in Step 6) after the new active set has been defined. For clarity, we also introduce the quantity $Y_{\mathrm{old}}$ in Steps 5 and 6.
11. Recall that a partition $P$ of a set $Y$ is a collection of nonempty subsets $\{P_l \subset Y\}_{l=1}^{L}$ such that $\bigcup_{l=1}^{L} P_l = Y$ and $P_l \cap P_{l'} = \emptyset$ for all $l \neq l'$. A partition $P$ is said to be coarser than another partition $P'$ if every element of $P'$ is a subset of some element of $P$.
6. For each new hidden node $h \in Y$, compute the information distances $d_{hl}$ for all $l \in Y$ using (13) and (14) described below.

7. If $|Y| \geq 3$, return to Step 2. Otherwise, if $|Y|=2$, connect the two remaining nodes in $Y$ with an edge and stop. If instead $|Y|=1$, do nothing and stop.

We now describe how to compute the information distances in Step 6 for each new hidden node $h \in Y$ and all other active nodes $l \in Y$. Let $i,j \in C(h)$ be two children of $h$, and let $k \in Y_{\mathrm{old}} \setminus \{i,j\}$ be any other node in the previous active set. From Lemma 4 and Proposition 3, we have that $d_{ih} - d_{jh} = d_{ik} - d_{jk} = \Phi_{ijk}$ and $d_{ih} + d_{jh} = d_{ij}$, from which we can recover the information distance between a previously active node $i \in Y_{\mathrm{old}}$ and its new hidden parent $h \in Y$ as follows:

$$d_{ih} = \frac{1}{2}\left(d_{ij} + \Phi_{ijk}\right). \qquad (13)$$

For any other active node $l \in Y$, we can compute $d_{hl}$ using a child node $i \in C(h)$ as follows:

$$d_{hl} = \begin{cases} d_{il} - d_{ih}, & \text{if } l \in Y_{\mathrm{old}},\\ d_{ik} - d_{ih} - d_{lk}, & \text{otherwise, where } k \in C(l). \end{cases} \qquad (14)$$

Using Equations (13) and (14), we can infer all the information distances $d_{hl}$ between a newly introduced hidden node $h$ and all other active nodes $l \in Y$. Consequently, we have all the distances between all pairs of nodes in the active set $Y$. It can be shown that this algorithm recovers all minimal latent trees. The proof of the following theorem is provided in Appendix A.2.

Theorem 5 (Correctness and Computational Complexity of RG) If $T_p \in \mathcal{T}_{\geq 3}$ and the matrix of information distances $D$ (between nodes in $V$) is available, then RG outputs the true latent tree $T_p$ correctly in time $O(\mathrm{diam}(T_p)\, m^3)$.

We now use a concrete example to illustrate the steps involved in RG. In Figure 3(a), the original unknown latent tree is shown. In this tree, nodes $1,\ldots,6$ are the observed nodes and $h_1,h_2,h_3$ are the hidden nodes. We start by considering the set of observed nodes as active nodes, $Y = V = \{1,\ldots,6\}$. Once the $\Phi_{ijk}$ are computed from the given distances $d_{ij}$, TestNodeRelationships is used to determine that $Y$ is partitioned into four subsets: $P_1=\{1\}$, $P_2=\{2,4\}$, $P_3=\{5,6\}$, $P_4=\{3\}$. The subsets $P_1$ and $P_4$ contain only one node. The subset $P_3$ contains two siblings that are leaf nodes. The subset $P_2$ contains a parent node 2 and a child node 4, which is a leaf node. Since $P_3$ does not contain a parent, we introduce a new hidden node $h_1$ and connect $h_1$ to 5 and 6, as shown in Figure 3(b). The information distances $d_{5h_1}$ and $d_{6h_1}$ can be computed using (13), for example, $d_{5h_1} = \frac{1}{2}(d_{56} + \Phi_{561})$.

12. In an undirected tree, the parent-child relationships can be defined with respect to a root node. In this case, the node in the final active set in Step 7 before the algorithm terminates (or one of the two final nodes if $|Y|=2$) is selected as the root node.
13. Note that, since we use the active set $Y$ in the TestNodeRelationships procedure, the leaf nodes are defined with respect to $Y$, that is, a node is considered as a leaf node if it has only one neighbor in $Y$ or in the set of nodes that have not yet been in an active set.

Figure 3: An illustrative example of RG. Solid nodes indicate the active set $Y$ for each iteration. (a) Original latent tree. (b) Output after the first iteration of RG. Red dotted lines indicate the subsets $P_l$ in the partition of $Y$. (c) Output after the second iteration of RG. Note that $h_1$, which was introduced in the first iteration, is an active node for the second iteration. Nodes 4, 5, and 6 do not belong to the current active set and are represented in grey. (d) Output after the third iteration of RG, which is the same as the original latent tree.
The new active set is the union of all nodes in the single-node subsets, a parent node, and a new hidden node: $Y_{\mathrm{new}} = \{1,2,3,h_1\}$. Distances among the pairs of nodes in $Y_{\mathrm{new}}$ can be computed using (14) (e.g., $d_{1h_1} = d_{15} - d_{5h_1}$). In the second iteration, we again use TestNodeRelationships to ascertain that $Y$ can be partitioned into $P_1=\{1,2\}$ and $P_2=\{h_1,3\}$. These two subsets do not have parents, so $h_2$ and $h_3$ are added to $P_1$ and $P_2$ respectively. Parent nodes $h_2$ and $h_3$ are connected to their children in $P_1$ and $P_2$, as shown in Figure 3(c). Finally, we are left with the active set $Y=\{h_2,h_3\}$, and the algorithm terminates after $h_2$ and $h_3$ are connected by an edge. The hitherto unknown latent tree is fully reconstructed, as shown in Figure 3(d).

A potential drawback of RG is that it involves multiple local operations, which may result in a high computational complexity. Indeed, from Theorem 5, the worst-case complexity is $O(m^4)$, which occurs when $T_p$, the true latent tree, is a hidden Markov model (HMM). This may be computationally prohibitive if $m$ is large. In Section 5, we design an algorithm which uses a global pre-processing step to reduce the overall complexity substantially, especially for trees with large diameters (of which HMMs are extreme examples).
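To make the recursive step concrete, here is a compressed, self-contained sketch of our own (with illustrative edge weights) of a single RG pass on a quartet. It covers only the leaf-sibling case of Step 4 and the distance updates (13)-(14), not the full algorithm:

```python
import numpy as np
from itertools import combinations

# True tree: 1 - h1 - 2 and 3 - h2 - 4 with internal edge (h1, h2).
# Edge weights a, b, c, d_, t are illustrative; D holds exact path distances.
a, b, c, d_, t = 1.0, 0.8, 1.2, 0.9, 0.6
D = {}
def put(i, j, v): D[(i, j)] = D[(j, i)] = v
put(1, 2, a + b); put(3, 4, c + d_)
put(1, 3, a + t + c); put(1, 4, a + t + d_)
put(2, 3, b + t + c); put(2, 4, b + t + d_)

Y = [1, 2, 3, 4]
def phi(i, j, k): return D[(i, k)] - D[(j, k)]

# Step 3: group pairs whose Phi_ijk is constant in k (leaf siblings, Lemma 4(ii)).
groups = []
for i, j in combinations(Y, 2):
    phis = [phi(i, j, k) for k in Y if k not in (i, j)]
    if np.isclose(max(phis), min(phis)) and not np.isclose(abs(phis[0]), D[(i, j)]):
        groups.append((i, j))
print(groups)   # -> [(1, 2), (3, 4)]

# Step 4 and (13): introduce a hidden parent per group, with child distances.
for h, (i, j) in zip(('h1', 'h2'), groups):
    k = next(k for k in Y if k not in (i, j))
    put(i, h, 0.5 * (D[(i, j)] + phi(i, j, k)))
    put(j, h, D[(i, j)] - D[(i, h)])

# (14), second case: distance between the two new hidden nodes via children.
put('h1', 'h2', D[(1, 3)] - D[(1, 'h1')] - D[(3, 'h2')])
print(D[(1, 'h1')], D[(3, 'h2')], D[('h1', 'h2')])   # -> 1.0 1.2 0.6, i.e., a, c, t
```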
5. CLGrouping Algorithm Given Information Distances

In this section, we present CLGrouping, an algorithm for reconstructing latent trees more efficiently than RG. As in Section 4, we assume in this section that $D$ is known exactly; the extension to inexact knowledge of $D$ is discussed in Section 6.5. CLGrouping is a two-step procedure, the first of which is a global pre-processing step that involves the construction of a so-called Chow-Liu tree (Chow and Liu, 1968) over the set of observed nodes $V$. This step identifies nodes that do not belong to the same sibling group. In the second step, we complete the recovery of the latent tree by applying a distance-based latent tree reconstruction algorithm (such as RG or NJ) repeatedly on smaller subsets of nodes. We review the Chow-Liu algorithm in Section 5.1, relate the Chow-Liu tree to the true latent tree in Section 5.2, derive a simple transformation of the Chow-Liu tree to obtain the latent tree in Section 5.3, and propose CLGrouping in Section 5.4. For simplicity, we focus on the Gaussian distributions and the symmetric discrete distributions first, and discuss the extension to general discrete models in Section 5.5.

5.1 A Review of the Chow-Liu Algorithm

In this section, we review the Chow-Liu tree reconstruction procedure. To do so, define $\mathcal{T}(V)$ to be the set of trees with vertex set $V$ and $\mathcal{P}(\mathcal{T}(V))$ to be the set of tree-structured graphical models whose graph has vertex set $V$, that is, every $q \in \mathcal{P}(\mathcal{T}(V))$ factorizes as in (3).

Given an arbitrary multivariate distribution $p_V(x_V)$, Chow and Liu (1968) considered the following KL-divergence minimization problem:

$$p_{\mathrm{CL}} = \operatorname*{argmin}_{q \in \mathcal{P}(\mathcal{T}(V))} D(p_V \,\|\, q). \qquad (15)$$

That is, among all the tree-structured graphical models with vertex set $V$, the distribution $p_{\mathrm{CL}}$ is the closest one to $p_V$ in terms of the KL-divergence. By using the factorization property in (3), we can easily verify that $p_{\mathrm{CL}}$ is Markov on the Chow-Liu tree $T_{\mathrm{CL}}=(V,E_{\mathrm{CL}})$, which is given by the optimization problem[14]

$$T_{\mathrm{CL}} = \operatorname*{argmax}_{T \in \mathcal{T}(V)} \sum_{(i,j) \in T} I(X_i;X_j). \qquad (16)$$

In (16), $I(X_i;X_j) = D\big(p(x_i,x_j)\,\|\,p(x_i)\,p(x_j)\big)$ is the mutual information (Cover and Thomas, 2006) between random variables $X_i$ and $X_j$. The optimization in (16) is a max-weight spanning tree problem (Cormen et al., 2003), which can be solved efficiently in time $O(m^2 \log m)$ using either Kruskal's algorithm (Kruskal, 1956) or Prim's algorithm (Prim, 1957). The edge weights for the max-weight spanning tree are precisely the mutual information quantities between random variables. Note that, once the optimal tree $T_{\mathrm{CL}}$ is formed, the parameters of $p_{\mathrm{CL}}$ in (15) are found by setting the pairwise distributions $p_{\mathrm{CL}}(x_i,x_j)$ on the edges to $p_V(x_i,x_j)$, that is, $p_{\mathrm{CL}}(x_i,x_j) = p_V(x_i,x_j)$ for all $(i,j) \in E_{\mathrm{CL}}$.

We now relate the Chow-Liu tree on the observed nodes to the information distance matrix $D$.

Lemma 6 (Correspondence between $T_{\mathrm{CL}}$ and MST) If $p_V$ is a Gaussian distribution or a symmetric discrete distribution, then the Chow-Liu tree in (16) reduces to the minimum spanning tree (MST) where the edge weights are the information distances $d_{ij}$, that is,

$$T_{\mathrm{CL}} = \mathrm{MST}(V;D) = \operatorname*{argmin}_{T \in \mathcal{T}(V)} \sum_{(i,j) \in T} d_{ij}. \qquad (17)$$

Lemma 6, whose proof is omitted, follows because, for Gaussian and symmetric discrete models, the mutual information[15] $I(X_i;X_j)$ is a monotonically decreasing function of the information distance $d_{ij}$.[16] For other graphical models (e.g., non-symmetric discrete distributions), this relationship is not necessarily true. See Section 5.5 for a discussion. Note that, when all nodes are observed (i.e., $W=V$), Lemma 6 reduces to Proposition 3.

14. In (16) and the rest of the paper, we adopt the following simplifying notation: if $T=(V,E)$ and $(i,j) \in E$, we will also say that $(i,j) \in T$.
15. Note that, unlike the information distances $d_{ij}$, the mutual information quantities $I(X_i;X_j)$ do not form an additive metric on $T_p$.
16. For example, in the case of Gaussians, $I(X_i;X_j) = -\frac{1}{2}\log(1-\rho_{ij}^2)$ (Cover and Thomas, 2006).
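For concreteness, the following snippet (ours; the $4 \times 4$ distance matrix is an illustrative additive metric) computes $\mathrm{MST}(V;D)$ in (17) with a bare-bones Prim's algorithm:

```python
import numpy as np

# An additive distance matrix for the path 0 - 1 - 2 - 3 (edge weights .5, .8, .5).
D = np.array([[0.0, 0.5, 1.3, 1.8],
              [0.5, 0.0, 0.8, 1.3],
              [1.3, 0.8, 0.0, 0.5],
              [1.8, 1.3, 0.5, 0.0]])

def mst_edges(D):
    m = D.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        # Greedily add the lightest edge leaving the current tree (Prim).
        i, j = min(((i, j) for i in in_tree for j in range(m) if j not in in_tree),
                   key=lambda e: D[e])
        in_tree.add(j)
        edges.append((i, j))
    return edges

print(mst_edges(D))   # -> [(0, 1), (1, 2), (2, 3)] : the path is recovered
```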
5.2 Relationship between the Latent Tree and the Chow-Liu Tree (MST)

In this section, we relate $\mathrm{MST}(V;D)$ in (17) to the original latent tree $T_p$. To relate the two trees, $\mathrm{MST}(V;D)$ and $T_p$, we first introduce the notion of a surrogate node.

Definition 7 (Surrogate Node) Given the latent tree $T_p=(W,E_p)$ and any node $i \in W$, the surrogate node of $i$ with respect to $V$ is defined as

$$\mathrm{Sg}(i;T_p,V) = \operatorname*{argmin}_{j \in V} d_{ij}.$$

Intuitively, the surrogate node of a hidden node $h \in H$ is an observed node $j \in V$ that is most strongly correlated to $h$. In other words, the information distance between $h$ and $j$ is the smallest. Note that, if $i \in V$, then $\mathrm{Sg}(i;T_p,V)=i$ since $d_{ii}=0$. The map $\mathrm{Sg}(i;T_p,V)$ is a many-to-one function, that is, several nodes may have the same surrogate node, and its inverse is the inverse surrogate set of $i$, denoted as

$$\mathrm{Sg}^{-1}(i;T_p,V) = \{h \in W : \mathrm{Sg}(h;T_p,V) = i\}.$$

When the tree $T_p$ and the observed vertex set $V$ are understood from context, the surrogate node of $h$ and the inverse surrogate set of $i$ are abbreviated as $\mathrm{Sg}(h)$ and $\mathrm{Sg}^{-1}(i)$ respectively. We now relate the original latent tree $T_p=(W,E_p)$ to the Chow-Liu tree (also termed the MST) $\mathrm{MST}(V;D)$ formed using the distance matrix $D$.

Lemma 8 (Properties of the MST) The MST in (17) and surrogate nodes satisfy the following properties:
(i) The surrogate nodes of any two neighboring nodes in $E_p$ are neighbors in the MST, that is, for all $i,j \in W$ with $\mathrm{Sg}(i) \neq \mathrm{Sg}(j)$,

$$(i,j) \in E_p \;\Rightarrow\; (\mathrm{Sg}(i),\mathrm{Sg}(j)) \in \mathrm{MST}(V;D). \qquad (18)$$

(ii) If $j \in V$ and $h \in \mathrm{Sg}^{-1}(j)$, then every node along the path connecting $j$ and $h$ belongs to the inverse surrogate set $\mathrm{Sg}^{-1}(j)$.
(iii) The maximum degree of the MST satisfies

$$\Delta(\mathrm{MST}(V;D)) \leq \Delta(T_p)^{\,1+\frac{u}{l}\delta(T_p;V)}, \qquad (19)$$

where $\delta(T_p;V)$ is the effective depth defined in (1) and $l,u$ are the bounds on the information distances on edges in $T_p$ defined in (11).

The proof of this result can be found in Appendix A.3. As a result of Lemma 8, the properties of $\mathrm{MST}(V;D)$ can be expressed in terms of the original latent tree $T_p$. For example, in Figure 5(a), a latent tree is shown with its corresponding surrogacy relationships, and Figure 5(b) shows the corresponding MST over the observed nodes.

The properties in Lemma 8(i)-(ii) can also be regarded as edge-contraction operations (Robinson and Foulds, 1981) in the original latent tree to obtain the MST. More precisely, an edge-contraction operation on an edge $(j,h) \in V \times H$ in the latent tree $T_p$ is defined as the "shrinking" of $(j,h)$ to a single node whose label is the observed node $j$. Thus, the edge $(j,h)$ is "contracted" to a single node. By using Lemma 8(i)-(ii), we observe that the Chow-Liu tree $\mathrm{MST}(V;D)$ is formed by applying edge-contraction operations to each $(j,h)$ pair for all $h \in \mathrm{Sg}^{-1}(j) \cap H$ sequentially, until all pairs have been contracted to a single node. For example, the MST in Figure 5(b) is obtained by contracting the edges $(3,h_3)$, $(5,h_2)$, and then $(5,h_1)$ in the latent tree in Figure 5(a).

The properties in Lemma 8 can be used to design efficient algorithms based on transforming the MST to obtain the latent tree $T_p$. Note that the maximum degree of the MST, $\Delta(\mathrm{MST}(V;D))$, is bounded in terms of the maximum degree of the original latent tree. The quantity $\Delta(\mathrm{MST}(V;D))$ determines the computational complexity of one of our proposed algorithms (CLGrouping), and it is small if the depth of the latent tree $\delta(T_p;V)$ is small (e.g., HMMs) and the information distances $d_{ij}$ satisfy tight bounds (i.e., $u/l$ is close to unity). The latter condition holds for (almost) homogeneous models in which all the information distances $d_{ij}$ on the edges are almost equal.
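The surrogate map of Definition 7 is simple to compute when the full distance matrix of the latent tree is available (which it is not in practice; surrogacy is an analysis device). Below is a toy example of our own, on a small latent tree with hand-filled path distances:

```python
# Latent tree (ours, illustrative): hidden h with neighbors {1, 2, g}, hidden g
# with neighbors {3, 4}; edge weights: (h,1)=0.4, (h,2)=0.9, (h,g)=0.3,
# (g,3)=0.5, (g,4)=0.7. Path distances from the hidden nodes, filled by hand:
dist = {
    ('h', 1): 0.4, ('h', 2): 0.9, ('h', 3): 0.8, ('h', 4): 1.0,
    ('g', 1): 0.7, ('g', 2): 1.2, ('g', 3): 0.5, ('g', 4): 0.7,
}
V = [1, 2, 3, 4]
sg = {u: min(V, key=lambda j: dist[(u, j)]) for u in ('h', 'g')}
print(sg)   # -> {'h': 1, 'g': 3}; consistent with Lemma 8(i), the latent edge
            # (h, g) maps to the edge (Sg(h), Sg(g)) = (1, 3) of MST(V; D)
```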
Figure 4: An illustration of CLBlind. The shaded nodes are the observed nodes and the rest are hidden nodes. The dotted lines denote surrogate mappings for the hidden nodes. (a) Original latent tree, which belongs to the class of blind latent graphical models, (b) Chow-Liu tree over the observed nodes, (c) node 3 is the input to the blind transformation, (d) output after the blind transformation, (e) node 2 is the input to the blind transformation, (f) output after the blind transformation, which is the same as the original latent tree.

5.3 Chow-Liu Blind Algorithm for a Subclass of Latent Trees

In this section, we present a simple and intuitive transformation of the Chow-Liu tree that produces the original latent tree. However, this algorithm, called Chow-Liu Blind (or CLBlind), is applicable only to a subset of latent trees called blind latent tree-structured graphical models $\mathcal{P}(\mathcal{T}_{\mathrm{blind}})$. Equipped with the intuition from CLBlind, we generalize it in Section 5.4 to design the CLGrouping algorithm, which produces the correct latent tree structure from the MST for all minimal latent tree models.

If $p \in \mathcal{P}(\mathcal{T}_{\mathrm{blind}})$, then its structure $T_p=(W,E_p)$ and the distance matrix $D$ satisfy the following properties:

(i) The true latent tree $T_p \in \mathcal{T}_{\geq 3}$ and all the internal nodes[17] are hidden, that is, $V = \mathrm{Leaf}(T_p)$.

(ii) The surrogate node of (i.e., the observed node with the strongest correlation with) each hidden node is one of its children, that is, $\mathrm{Sg}(h) \in C(h)$ for all $h \in H$.

17. Recall that an internal node is one whose degree is greater than or equal to 2, that is, a non-leaf.
We now describe the CLBlind algorithm, which involves two main steps. Firstly, $\mathrm{MST}(V;D)$ is constructed using the distance matrix $D$. Secondly, we apply the blind transformation of the Chow-Liu tree, BlindTransform($\mathrm{MST}(V;D)$), which proceeds as follows:

1. Identify the set of internal nodes in $\mathrm{MST}(V;D)$. We perform an operation for each internal node as follows:

2. For internal node $i$, add a hidden node $h$ to the tree.

3. Connect an edge between $h$ and $i$ (which now becomes a leaf node), and also connect edges between $h$ and the neighbors of $i$ in the current tree model.

4. Repeat Steps 2 and 3 until all internal nodes have been operated on.

See Figure 4 for an illustration of CLBlind. We use the adjective blind to describe the transformation BlindTransform($\mathrm{MST}(V;D)$) since it does not depend on the distance matrix $D$, but uses only the structure of the MST. The following theorem, whose proof can be found in Appendix A.4, states the correctness result for CLBlind.

Theorem 9 (Correctness and Computational Complexity of CLBlind) If the distribution $p \in \mathcal{P}(\mathcal{T}_{\mathrm{blind}})$ is a blind tree-structured graphical model Markov on $T_p$ and the matrix of distances $D$ is known, then CLBlind outputs the true latent tree $T_p$ correctly in time $O(m^2 \log m)$.

The first condition on $\mathcal{P}(\mathcal{T}_{\mathrm{blind}})$, that all internal nodes are hidden, is not uncommon in applications. For example, in phylogenetics, (DNA or amino acid) sequences of extant species at the leaves are observed, while the sequences of the extinct species are hidden (corresponding to the internal nodes), and the evolutionary (phylogenetic) tree is to be reconstructed. However, the second condition is more restrictive,[18] since it implies that each hidden node is directly connected to at least one observed node and that it is closer (i.e., more correlated) to one of its observed children compared to any other observed node. If the first constraint is satisfied but not the second, then the blind transformation BlindTransform($\mathrm{MST}(V;D)$) does not overestimate the number of hidden variables in the latent tree (the proof follows from Lemma 8 and is omitted).

Since the computational complexity of constructing the MST is $O(m^2 \log m)$, where $m=|V|$, and the blind transformation is at most linear in $m$, the overall computational complexity is $O(m^2 \log m)$. Thus, CLBlind is a computationally efficient procedure compared to RG, described in Section 4.

18. The second condition on $\mathcal{P}(\mathcal{T}_{\mathrm{blind}})$ holds when the tree is (almost) homogeneous.
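The blind transformation depends only on the structure of the MST, so it can be written in a few lines. The sketch below (ours) applies Steps 1-4 to an illustrative path-shaped MST; the result is a tree with a hidden parent for every internal node:

```python
from collections import defaultdict

# Illustrative path MST 1 - 2 - 3 - 4; internal nodes are 2 and 3.
adj = defaultdict(set)
for u, v in [(1, 2), (2, 3), (3, 4)]:
    adj[u].add(v); adj[v].add(u)

for i in [u for u in list(adj) if len(adj[u]) >= 2]:   # Step 1: internal nodes
    h = f'h{i}'                                        # Step 2: new hidden node
    for nb in list(adj[i]):                            # Step 3: h inherits i's edges
        adj[nb].discard(i); adj[nb].add(h); adj[h].add(nb)
    adj[i] = {h}                                       # i becomes a leaf child of h
    adj[h].add(i)

print({u: sorted(map(str, vs)) for u, vs in adj.items()})
# -> observed 1, 2 hang off h2; observed 3, 4 hang off h3; edge (h2, h3) inside
```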
5.4 Chow-Liu Grouping Algorithm

Even though CLBlind is computationally efficient, it only succeeds in recovering latent trees for a restricted subclass of minimal latent trees. In this section, we propose an efficient algorithm, called CLGrouping, that reconstructs all minimal latent trees. We also illustrate CLGrouping using an example. CLGrouping uses the properties of the MST as described in Lemma 8.

At a high level, CLGrouping involves two distinct steps: firstly, we construct the Chow-Liu tree $\mathrm{MST}(V;D)$ over the set of observed nodes $V$; secondly, we apply RG or NJ to reconstruct a latent subtree over the closed neighborhoods of every internal node in $\mathrm{MST}(V;D)$. If RG (respectively NJ) is used, we term the algorithm CLRG (respectively CLNJ). In the rest of the section, we only describe CLRG for concreteness, since CLNJ proceeds along similar lines. Formally, CLRG proceeds as follows:

1. Construct the Chow-Liu tree $\mathrm{MST}(V;D)$ as in (17). Set $T = \mathrm{MST}(V;D)$.

2. Identify the set of internal nodes in $\mathrm{MST}(V;D)$.

3. For each internal node $i$, let $\mathrm{nbd}[i;T]$ be its closed neighborhood in $T$, and let $S = \mathrm{RG}(\mathrm{nbd}[i;T],D)$ be the output of RG with $\mathrm{nbd}[i;T]$ as the set of input nodes.

4. Replace the subtree over node set $\mathrm{nbd}[i;T]$ in $T$ with $S$. Denote the new tree as $T$.

5. Repeat Steps 3 and 4 until all internal nodes have been operated on.

Note that the only difference between the algorithm we just described and CLNJ is Step 3, in which the subroutine NJ replaces RG. Also, observe in Step 3 that RG is only applied to a small subset of nodes which have been identified in Step 1 as possible neighbors in the true latent tree. This reduces the computational complexity of CLRG compared to RG, as seen in the following theorem, whose proof is provided in Appendix A.5. Let $|J| = |V \setminus \mathrm{Leaf}(\mathrm{MST}(V;D))| \leq m$ be the number of internal nodes in the MST.

Theorem 10 (Correctness and Computational Complexity of CLRG) If the distribution $T_p \in \mathcal{T}_{\geq 3}$ is a minimal latent tree and the matrix of information distances $D$ is available, then CLRG outputs the true latent tree $T_p$ correctly in time $O(m^2 \log m + |J|\, \Delta^3(\mathrm{MST}(V;D)))$.

Thus, the computational complexity of CLRG is low when the latent tree $T_p$ has a small maximum degree and a small effective depth (such as the HMM), because (19) implies that $\Delta(\mathrm{MST}(V;D))$ is also small. Indeed, we demonstrate in Section 7 that there is a significant speedup compared to applying RG over the entire observed node set $V$.

We now illustrate CLRG using the example shown in Figure 5. The original minimal latent tree $T_p=(W,E)$ is shown in Figure 5(a) with $W=\{1,2,\ldots,6,h_1,h_2,h_3\}$. The set of observed nodes is $V=\{1,\ldots,6\}$ and the set of hidden nodes is $H=\{h_1,h_2,h_3\}$. The Chow-Liu tree $T_{\mathrm{CL}} = \mathrm{MST}(V;D)$ formed using the information distance matrix $D$ is shown in Figure 5(b). Since nodes 3 and 5 are the only internal nodes in $\mathrm{MST}(V;D)$, two RG operations will be executed on the closed neighborhoods of each of these two nodes. In the first iteration, the closed neighborhood of node 5 is the input to RG. This is shown in Figure 5(c), where $\mathrm{nbd}[5;\mathrm{MST}(V;D)] = \{1,3,4,5\}$, which is then replaced by the output of RG to obtain the tree shown in Figure 5(d). In the next iteration, RG is applied to the closed neighborhood of node 3 in the current tree, $\mathrm{nbd}[3;T] = \{2,3,6,h_1\}$, as shown in Figure 5(e). Note that $\mathrm{nbd}[3;T]$ includes $h_1 \in H$, which was introduced by RG in the previous iteration. The distance from $h_1$ to other nodes in $\mathrm{nbd}[3;T]$ can be computed using the distance between $h_1$ and its surrogate node 5, which is part of the output of RG, for example, $d_{2h_1} = d_{25} - d_{5h_1}$. The closed neighborhood $\mathrm{nbd}[3;T]$ is then replaced by the output of the second RG operation, and the original latent tree $T_p$ is obtained, as shown in Figure 5(f).

Figure 5: Illustration of CLRG. The shaded nodes are the observed nodes and the rest are hidden nodes. The dotted lines denote surrogate mappings for the hidden nodes, so, for example, node 3 is the surrogate of $h_3$. (a) The original latent tree, (b) the Chow-Liu tree (MST) over the observed nodes $V$, (c) the closed neighborhood of node 5 is the input to RG, (d) output after the first RG procedure, (e) the closed neighborhood of node 3 is the input to the second iteration of RG, (f) output after the second RG procedure, which is the same as the original latent tree.

Observe that the trees obtained at each iteration of CLRG can be related to the original latent tree in terms of edge-contraction operations (Robinson and Foulds, 1981), which were defined in Section 5.2. For example, the Chow-Liu tree in Figure 5(b) is obtained from the latent tree $T_p$ in Figure 5(a) by sequentially contracting all edges connecting an observed node to its inverse surrogate set (cf. Lemma 8(ii)). Upon performing an iteration of RG, these contraction operations are inverted and new hidden nodes are introduced. For example, in Figure 5(d), the hidden nodes $h_1,h_2$ are introduced after performing RG on the closed neighborhood of node 5 in $\mathrm{MST}(V;D)$. These newly introduced hidden nodes, in fact, turn out to be the inverse surrogate set of node 5, that is, $\mathrm{Sg}^{-1}(5)=\{5,h_1,h_2\}$. This is not merely a coincidence, and we formally prove in Appendix A.5 that, at each iteration, the set of hidden nodes introduced corresponds exactly to the inverse surrogate set of the internal node.

We conclude this section by emphasizing that CLGrouping (i.e., CLRG or CLNJ) has two primary advantages. Firstly, as demonstrated in Theorem 10, the structure of all minimal tree-structured graphical models can be recovered by CLGrouping, in contrast to CLBlind. Secondly, it typically has much lower computational complexity compared to RG.
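The bookkeeping in Steps 1-3 is illustrated by the runnable fragment below (ours), which uses the MST of Figure 5(b), whose edges are (5,1), (5,4), (5,3), (3,2), (3,6). It identifies the internal nodes and the closed neighborhoods handed to RG; in the full algorithm, each neighborhood's subtree is then replaced by RG's output (Step 4):

```python
from collections import defaultdict

# The MST of Figure 5(b): node 5 neighbors {1, 4, 3}; node 3 neighbors {5, 2, 6}.
adj = defaultdict(set)
for u, v in [(5, 1), (5, 4), (5, 3), (3, 2), (3, 6)]:
    adj[u].add(v); adj[v].add(u)

internal = [i for i in adj if len(adj[i]) >= 2]        # Step 2
for i in internal:                                     # Step 3 inputs, nbd[i; T]
    print(i, sorted(adj[i] | {i}))
# -> 5 [1, 3, 4, 5] and 3 [2, 3, 5, 6], matching the example in the text
```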
5.5 Extension to General Discrete Models

For general (i.e., not symmetric) discrete models, the mutual information $I(X_i;X_j)$ is, in general, not monotonic in the information distance $d_{ij}$ defined in (9).[19] As a result, Lemma 6 does not hold, that is, the Chow-Liu tree $T_{\mathrm{CL}}$ is not necessarily the same as $\mathrm{MST}(V;D)$. However, Lemma 8 does hold for all minimal latent tree models. Therefore, for general (non-symmetric) discrete models, we compute $\mathrm{MST}(V;D)$ (instead of the Chow-Liu tree $T_{\mathrm{CL}}$ with edge weights $I(X_i;X_j)$) and apply RG or NJ to each internal node and its neighbors. This algorithm guarantees that the structure learned using CLGrouping is the same as $T_p$ if the distance matrix $D$ is available. These observations are summarized in Table 1. Note that, in all cases, the latent structure is recovered consistently.

19. The mutual information, however, is monotonic in $d_{ij}$ for asymmetric binary discrete models.

    Latent variables | Distribution       | MST(V;D) = T_CL? | Structure | Parameter
    -----------------+--------------------+------------------+-----------+----------
    Non-latent       | Gaussian           |        ✓         |     ✓     |     ✓
    Non-latent       | Symmetric discrete |        ✓         |     ✓     |     ✓
    Non-latent       | General discrete   |        ✗         |     ✓     |     ✗
    Latent           | Gaussian           |        ✓         |     ✓     |     ✓
    Latent           | Symmetric discrete |        ✓         |     ✓     |     ✓
    Latent           | General discrete   |        ✗         |     ✓     |     ✗

Table 1: Comparison between various classes of distributions. In the last two columns, we state whether CLGrouping is consistent for learning either the structure or the parameters of the model, namely whether CLGrouping is structurally consistent or risk consistent respectively (cf. Definition 2). Note that the first two cases reduce exactly to the algorithm proposed by Chow and Liu (1968), in which the edge weights are the mutual information quantities.

6. Sample-Based Algorithms for Learning Latent Tree Structures

In Sections 4 and 5, we designed algorithms for the exact reconstruction of latent trees, assuming that $p_V$ is a tree-decomposable distribution and the matrix of information distances $D$ is available. In most (if not all) machine learning problems, the pairwise distributions $p(x_i,x_j)$ are unavailable. Consequently, $D$ is also unavailable, so RG, NJ and CLGrouping as stated in Sections 4 and 5 are not directly applicable. In this section, we consider extending RG, NJ and CLGrouping to the case when only samples $x_V^n$ are available. We show how to modify the previously proposed algorithms to accommodate ML estimated distances, and we also provide sample complexity results for relaxed versions of RG and CLGrouping.

6.1 ML Estimation of Information Distances

The canonical method for deterministic parameter estimation is via maximum likelihood (ML) (Serfling, 1980). We focus on Gaussian and symmetric discrete distributions in this section; the generalization to general discrete models is straightforward. For Gaussian graphical models, we use ML to estimate the entries of the covariance matrix,[20] that is,

$$\widehat{\Sigma}_{ij} = \frac{1}{n}\sum_{k=1}^{n} x_i^{(k)} x_j^{(k)}, \quad \forall\, i,j \in V. \qquad (20)$$

The ML estimate of the correlation coefficient is defined as $\widehat{\rho}_{ij} = \widehat{\Sigma}_{ij}/(\widehat{\Sigma}_{ii}\widehat{\Sigma}_{jj})^{1/2}$. The estimated information distance is then given by the analog of (8), that is, $\widehat{d}_{ij} = -\log|\widehat{\rho}_{ij}|$.

For symmetric discrete distributions, we estimate the crossover probability $q_{ij}$ via ML as[21]

$$\widehat{q}_{ij} = \frac{1}{n}\sum_{k=1}^{n} \mathbb{I}\left\{x_i^{(k)} \neq x_j^{(k)}\right\}, \quad \forall\, i,j \in V.$$

The estimated information distance is given by the analogue of (10), that is, $\widehat{d}_{ij} = -(K-1)\log(1-K\widehat{q}_{ij})$. For both classes of models, it can easily be verified from the Central Limit Theorem and continuity arguments (Serfling, 1980) that $\widehat{d}_{ij} - d_{ij} = O_p(n^{-1/2})$, where $n$ is the number of samples. This means that the estimates of the information distances are consistent, with the rate of convergence being $n^{-1/2}$. The $m \times m$ matrix of estimated information distances is denoted as $\widehat{D} = [\widehat{d}_{ij}]$.

20. Recall that we assume that the mean of the true random vector $X$ is known and equals the zero vector, so we do not need to subtract the empirical mean in (20).
21. We use $\mathbb{I}\{\cdot\}$ to denote the indicator function.
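These estimators are one-liners in practice. The sketch below is ours (the synthetic data and the true correlation 0.8 are illustrative); it forms (20) and the Gaussian distance estimate on a two-variable example:

```python
import numpy as np

# Synthetic zero-mean Gaussian samples with true correlation 0.8 (illustrative).
rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + np.sqrt(1 - 0.8**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])                      # n x m sample matrix

S = X.T @ X / n                                    # (20); mean known to be zero
rho = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
D_hat = -np.log(np.abs(rho))                       # plug-in estimate of (8)
print(D_hat[0, 1], -np.log(0.8))                   # close for large n
```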
CHOI,TAN,ANANDKUMARANDILLSKYdistancesonedges.Afterlearningthelatenttree,ifwendthatthereexistsanedge(;h)2WHwiththeestimateddistancebdih,then(;h)iscontractedtoasinglenodewhoselabelis,thatis,thehiddennodehisremovedandmergedwithnode.Thisedgecontractionoperationremovesahiddennodeifitistoocloseininformationdistancestoanothernode.ForGaussianandbinaryvariables,bdih=logjbrihj,soinourexperiments,weuse=log0:9tocontractanedge(;h)ifthecorrelationbetweenthetwonodesishigherthan0:9.6.3RelaxedRecursiveGrouping(RG)GivenSamplesWenowshowhowtorelaxthecanonicalRGalgorithmdescribedinSection4tohandlethecasewhenonlybDisavailable.RecallthatRGcallstheTestNodeRelationshipsprocedurerecursivelytoascertainchild-parentandsiblingrelationshipsviaequalitytestsFijk=dikdjk(cf.,Section3.2).Theseequalityconstraintsare,ingeneral,notsatisedwiththeestimateddifferencesbFijk=bdikbdjk,whicharecomputedbasedontheestimateddistanceinbD.Besides,notallestimateddistancesareequallyaccurate.Longerdistanceestimates(i.e.,lowercorrelationestimates)arelessaccurateforagivennumberofsamples.22Assuch,notallestimateddistancescanbeusedfortestinginter-noderelationshipsreliably.TheseobservationsmotivatethefollowingthreemodicationstotheRGalgorithm:1.ConsiderusingasmallersubsetofnodestotestwhetherbFijkisconstant(acrossk).2.Applyathreshold(inequality)testtothebFijkvalues.3.Improveontherobustnessoftheestimateddistancesbdihin(13)and(14)byaveraging.Wenowdescribeeachofthesemodicationsingreaterdetail.Firstly,intherelaxedRGalgorithm,weonlycomputebFijkforthoseestimateddistancesbdijbdikandbdjkthatarebelowaprescribedthresholdt&#x-0.9;畵0sincelongerdistanceestimatesareunreliable.Assuch,foreachpairofnodes(;)suchthatbdijt,associatethesetKij=nk2Vnf;g:maxfbdik;bdjkgto:(21)ThisisthesubsetofnodesinVwhoseestimateddistancestoandarelessthant.ComputebFijkforallk2Kijonly.Secondly,insteadofusingequalitytestsinTestNodeRelationshipstodeterminetherelationshipbetweennodesand,werelaxthistestandconsiderthestatisticbLij=maxk2ijbFijkmink2ijbFijk(22)Intuitively,ifbLijin(22)isclosetozero,thennodesandarelikelytobeinthesamefamily.Thus,declarethatnodes;2VareinthesamefamilyifbLije;(23) 22.Infact,byusingalargedeviationresultinShen(2007,Theorem1),wecanformallyshowthatalargernumberofsamplesisrequiredtogetagoodapproximationofikifitissmallcomparedtowhenikislarge.1792 LEARNINGLATENTTREEGRAPHICALMODELSforanotherthresholde�0.Similarly,anobservednodekisidentiedasaparentnodeifjbdik+bdkjbdijjeforallandinthesamefamily.Ifsuchanobservednodedoesnotexistsforagroupoffamily,thenanewhiddennodeisintroducedastheparentnodeforthegroup.Thirdly,inordertofurtherimproveonthequalityofthedistanceestimatebdihofanewlyintro-ducedhiddennodetoobservednodes,wecomputebdihusing(13)withdifferentpairsof2C(h)andk2Kij,andtaketheaverageasfollows:bdih=1 2(jC(h)j1) åj2C(h)bdij+1 jKijjåk2ijbFijk!:(24)Similarly,foranyothernodek=2C(h),wecomputebdkhusingallchildnodesinC(h)andC(k)(ifC(k)=0)asfollows:bdkh=(1 jC(h)jåi2C(h)(bdikbdih);ifk2V;1 
The following theorem shows that relaxed RG is consistent and that, with appropriately chosen thresholds $\varepsilon$ and $\tau$, it has sample complexity logarithmic in the number of observed variables. The proof follows from standard Chernoff bounds and is provided in Appendix A.6.

Theorem 11 (Consistency and Sample Complexity of Relaxed RG) (i) Relaxed RG is structurally consistent for all $T_p \in \mathcal{T}_{\ge 3}$. In addition, it is risk consistent for Gaussian and symmetric discrete distributions. (ii) Assume that the effective depth is $\delta(T_p;V) = O(1)$ (i.e., constant in $m$) and that relaxed RG is used to reconstruct the tree given $\widehat{D}$. For every $\eta > 0$, there exist thresholds $\varepsilon, \tau > 0$ such that if

$$n > C \log\bigl(m/\sqrt[3]{\eta}\bigr) \qquad (26)$$

for some constant $C > 0$, then the error probability for structure reconstruction in (5) is bounded above by $\eta$. If, in addition, $p$ is a Gaussian or symmetric discrete distribution and $n > C' \log(m/\sqrt[3]{\eta})$, then the error probability for distribution reconstruction in (6) is also bounded above by $\eta$.

Thus, the sample complexity of relaxed RG, that is, the number of samples required to achieve a desired level of accuracy, is logarithmic in $m$, the number of observed variables. As we observe from (26), the sample complexity of RG is logarithmic in $m$ for shallow trees (i.e., trees whose effective depth is constant). This is in contrast to NJ, for which the sample complexity is super-polynomial in the number of observed nodes for the HMM (St. John et al., 2003; Lacey and Chang, 2006).

6.3.1 RG with k-Means Clustering

In practice, if the number of samples is limited, the distance estimates $\widehat{d}_{ij}$ are noisy, and it is difficult to select the threshold $\varepsilon$ in Theorem 11 so that sibling nodes are identified reliably. In our experiments, we therefore employ a modified version of the k-means clustering algorithm that clusters a set of nodes with small $\widehat{\Lambda}_{ij}$, defined in (22), as a group of siblings. Recall that (23) tests each $\widehat{\Lambda}_{ij}$ locally with a fixed threshold $\varepsilon$. In contrast, the k-means algorithm provides a global scheme and circumvents the need to select the threshold $\varepsilon$. We adopt the silhouette method (Rousseeuw, 1987) with dissimilarity measure $\widehat{\Lambda}_{ij}$ to select the optimal number of clusters $k$.
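The modified k-means procedure itself is not spelled out here, so, purely as an illustration, the sketch below pairs the silhouette criterion with off-the-shelf average-linkage hierarchical clustering on the dissimilarities $\widehat{\Lambda}_{ij}$, as a stand-in for the authors' clustering step. The function name, the hierarchical-clustering choice, and the handling of untestable (infinite) entries are all our own assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def silhouette_num_clusters(Lambda, k_max):
    """Choose the number of sibling groups by maximizing the average
    silhouette width (Rousseeuw, 1987) over clusterings of Lambda."""
    m = Lambda.shape[0]
    big = 2.0 * Lambda[np.isfinite(Lambda)].max()   # stand-in value for untestable pairs
    L = np.where(np.isfinite(Lambda), Lambda, big)
    np.fill_diagonal(L, 0.0)
    Z = linkage(squareform(L, checks=False), method="average")
    best_k, best_score = 2, -np.inf
    for k in range(2, min(k_max, m - 1) + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        scores = []
        for i in range(m):
            own = (labels == labels[i])
            own[i] = False
            if not own.any():
                continue                            # singleton cluster: skip
            a = L[i, own].mean()                    # mean within-cluster dissimilarity
            b = min(L[i, labels == c].mean()        # nearest other cluster
                    for c in set(labels) if c != labels[i])
            scores.append((b - a) / max(a, b))
        if scores and np.mean(scores) > best_score:
            best_k, best_score = k, np.mean(scores)
    return best_k
```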
6.4 Relaxed Neighbor-Joining Given Samples

In this section, we describe how NJ can be relaxed when the true distances are unavailable: we simply use the ML estimates $\widehat{d}_{ij}$ in place of the unavailable distances $d_{ij}$. NJ typically assumes that all observed nodes are leaves of the latent tree, so after learning the latent tree we perform the post-processing step described in Section 6.2 to identify internal nodes that are observed (the processing, i.e., contraction, of the internal nodes can be done in any order). The sample complexity of NJ is known to be $O(\exp(\mathrm{diam}(T_p)) \log m)$ (St. John et al., 2003) and thus does not scale well when the latent tree $T_p$ has a large diameter. Comparisons between the sample complexities of other closely related latent tree learning algorithms are discussed in Atteson (1999), Erdős et al. (1999), Csűrös (2000) and St. John et al. (2003).

6.5 Relaxed CLGrouping Given Samples

In this section, we discuss how to modify CLGrouping (CLRG and CLNJ) when we only have access to the estimated information distances $\widehat{D}$. The relaxed version of CLGrouping differs from CLGrouping in two main aspects. Firstly, we replace the edge weights in the construction of the MST in (17) with the estimated information distances $\widehat{d}_{ij}$, that is,

$$\widehat{T}_{\mathrm{CL}} = \mathrm{MST}(V;\widehat{D}) = \operatorname*{argmin}_{T \in \mathcal{T}(V)} \sum_{(i,j) \in T} \widehat{d}_{ij}. \qquad (27)$$

The procedure in (27) can be shown to be equivalent to learning the ML tree structure given the samples $\mathbf{x}^n_V$ if $p_V$ is a Gaussian or symmetric discrete distribution; this follows from the observation that the ML search for the optimal structure is equivalent to the KL-divergence minimization problem in (15) with $p_V$ replaced by $\widehat{p}_V$, the empirical distribution of $\mathbf{x}^n_V$. It has also been shown that the error probability of structure learning $\Pr(\widehat{T}_{\mathrm{CL}} \neq T_{\mathrm{CL}})$ converges to zero exponentially fast in the number of samples $n$ for both discrete and Gaussian data (Tan et al., 2010, 2011). Secondly, for CLRG (respectively CLNJ), we replace RG (respectively NJ) with its relaxed version. The sample complexity result for CLRG is similar to Theorem 11, and its proof is provided in Appendix A.7.

Theorem 12 (Consistency and Sample Complexity of Relaxed CLRG) (i) Relaxed CLRG is structurally consistent for all $T_p \in \mathcal{T}_{\ge 3}$. In addition, it is risk consistent for Gaussian and symmetric discrete distributions. (ii) Assume that the effective depth is $\delta(T_p;V) = O(1)$ (i.e., constant in $m$). Then the sample complexity of relaxed CLRG is logarithmic in $m$.
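Returning to the first step of relaxed CLGrouping, the following is a minimal sketch of (27) using SciPy's minimum-spanning-tree routine on the estimated distance matrix; the edge-list extraction at the end is our own convenience, and we assume all off-diagonal entries of the input are strictly positive (zeros are treated as absent edges by this routine).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(D_hat):
    """Return the edges of MST(V; D_hat) as in Eq. (27).

    For Gaussian and symmetric discrete models this coincides with the
    Chow-Liu (maximum-likelihood) tree over the observed variables."""
    mst = minimum_spanning_tree(D_hat)        # sparse matrix holding the tree edges
    rows, cols = mst.nonzero()
    return sorted((min(i, j), max(i, j)) for i, j in zip(rows, cols))
```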
6.6 Regularized CLGrouping for Learning Latent Tree Approximations

For many practical applications, it is of interest to learn a latent tree that approximates a given empirical distribution. In general, introducing more hidden variables enables a better fit to the empirical distribution, but it also increases the model complexity and may lead to overfitting. The Bayesian Information Criterion (Schwarz, 1978) provides a trade-off between model fit and model complexity, and is defined as follows:

$$\mathrm{BIC}(\widehat{T}) = \log p(\mathbf{x}^n_V; \widehat{T}) - \frac{k(\widehat{T})}{2} \log n, \qquad (28)$$

where $\widehat{T}$ is a latent tree structure and $k(\widehat{T})$ is the number of free parameters, which grows linearly with the number of hidden variables because $\widehat{T}$ is a tree. Here, we describe regularized CLGrouping, in which we use the BIC in (28) as a stopping criterion on the number of hidden variables added.

For each internal node and its neighbors in the Chow-Liu tree, we use relaxed NJ or RG to learn a latent subtree. Unlike in regular CLGrouping, before we integrate this subtree into our model, we compute its BIC score. Computing the BIC score requires estimating the maximum-likelihood parameters of the model, so for general discrete distributions we run the EM algorithm on the subtree to estimate the parameters (for Gaussian and symmetric discrete distributions, the model parameters can be recovered directly from the information distances using (8) or (10)). After we compute the BIC scores of the subtrees corresponding to all internal nodes in the Chow-Liu tree, we choose the subtree with the highest BIC score and incorporate it into the current tree model. The BIC score can be computed efficiently on a tree model with a few hidden variables. Thus, for computational efficiency, each time a set of hidden nodes is added to the model, we generate samples of the hidden nodes conditioned on the samples of the observed nodes, and use these augmented samples to compute the BIC score approximately when we evaluate the next subtree to be integrated into the model. If none of the subtrees increases the BIC score (i.e., the current tree has the highest BIC score), the procedure stops and outputs the estimated latent tree.

Alternatively, if we wish to learn a latent tree with a given number of hidden nodes, we can use the BIC-based procedure described above to learn subtrees until the desired number of hidden nodes has been introduced. Depending on whether we use NJ or RG as the subroutine, we denote the specific regularized CLGrouping algorithm as regCLNJ or regCLRG.

This approach of using an approximation of the BIC score is commonly used to learn graphical models with hidden variables (Elidan and Friedman, 2005; Zhang and Kocka, 2004). However, for those algorithms, the BIC score needs to be evaluated for a large collection of candidate node subsets, whereas in CLGrouping the Chow-Liu tree among the observed variables prunes out many subsets, so we need to evaluate BIC scores only for a small number of candidates (the number of internal nodes in the Chow-Liu tree).
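The following sketch shows the BIC computation of (28) together with the greedy acceptance rule of regularized CLGrouping. The log-likelihood values are assumed to be supplied by whatever inference the model class supports (e.g., exact Gaussian tree likelihoods, or an EM-based estimate for general discrete models); the function names are our own.

```python
import numpy as np

def bic_score(loglik, num_params, n):
    """Eq. (28): data log-likelihood penalized by model complexity."""
    return loglik - 0.5 * num_params * np.log(n)

def accept_candidate(cur_loglik, cur_params, cand_loglik, cand_params, n):
    """Greedy rule: keep the candidate subtree only if it improves
    the overall BIC score of the current tree model."""
    return bic_score(cand_loglik, cand_params, n) > bic_score(cur_loglik, cur_params, n)
```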
7. Experimental Results

In this section, we compare the performance of various latent tree learning algorithms. We first show simulation results on synthetic data sets with known latent tree structures to demonstrate the consistency of our algorithms, and we analyze how their performance changes with the underlying latent tree structure. Then, we show that our algorithms can approximate arbitrary multivariate probability distributions with latent trees by applying them to two real-world data sets: a monthly stock returns example and the 20 newsgroups data set.

7.1 Simulations Using Synthetic Data Sets

To analyze the performance of different tree reconstruction algorithms, we generate samples from known latent tree structures with varying sample sizes and apply the reconstruction algorithms. We compare the neighbor-joining method (NJ) (Saitou and Nei, 1987) with recursive grouping (RG), Chow-Liu Neighbor-Joining (CLNJ), and Chow-Liu Recursive Grouping (CLRG). Since the algorithms are given only samples of the observed variables, we use the sample-based algorithms described in Section 6. For all our experiments, we use the same edge-contraction threshold $-\log 0.9$ (see Sections 6.4 and 6.5), and we set $\tau$ in Section 6.3 to grow logarithmically with the number of samples.

[Figure 6: Latent tree structures used in our simulations: (a) double star, (b) HMM, (c) 5-complete tree.]

Figure 6 shows the three latent tree structures used in our simulations. The double star has 2 hidden and 80 observed nodes, the HMM has 78 hidden and 80 observed nodes, and the 5-complete tree has 25 hidden and 81 observed nodes, including the root node. For simplicity, we present simulation results only for Gaussian models, but we note that the behavior for discrete models is similar. All correlation coefficients on the edges, $\rho_{ij}$, were drawn independently from a uniform distribution supported on $[0.2, 0.8]$. The performance of each method is measured by averaging over 200 independent runs with different parameters.

We use the following performance metrics to quantify the performance of each algorithm in Figure 7:

(i) Structure recovery error rate: the proportion of runs in which the algorithm fails to recover the true latent tree structure. Note that this is a very strict measure, since even a single wrong hidden node or misplaced edge counts as an error for the entire structure.

(ii) Robinson-Foulds metric (Robinson and Foulds, 1981): this popular phylogenetic tree-distortion metric counts the number of graph transformations (edge contractions or expansions) needed to transform the estimated graph into the correct structure; it quantifies the difference between the structures of the estimated and true models (see the sketch following this list).

(iii) Error in the number of hidden variables: we compute the average number of hidden variables introduced by each method and plot the absolute difference between this average and the number of hidden variables in the true structure.

(iv) KL-divergence $D(p_V \,\|\, \widehat{p}^n_V)$: a measure of the distance between the estimated and true models over the set of observed nodes $V$. (This is not the same quantity as in (6): if the number of hidden variables is estimated incorrectly, $D(p \,\|\, \widehat{p}^n)$ is infinite, so we plot $D(p_V \,\|\, \widehat{p}^n_V)$ instead. For Gaussian and symmetric discrete distributions, however, $D(p \,\|\, \widehat{p}^n)$ converges to zero in probability, since the number of hidden variables is estimated correctly asymptotically.)
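As a reference for metric (ii), the sketch below computes the Robinson-Foulds distance between two trees given as NetworkX graphs whose observed nodes share labels. Each edge induces a bipartition (split) of the observed nodes, and the metric equals the size of the symmetric difference of the two trees' split sets, which matches the contraction/expansion count; the treatment of hidden-node labels as arbitrary is our own convention here.

```python
import networkx as nx

def observed_splits(tree, observed):
    """Bipartitions of the observed nodes induced by removing each tree edge."""
    splits = set()
    for u, v in tree.edges():
        g = tree.copy()
        g.remove_edge(u, v)
        side = frozenset(nx.node_connected_component(g, u)) & observed
        other = observed - side
        if side and other:                      # ignore trivial (one-sided) splits
            splits.add(frozenset({side, other}))
    return splits

def robinson_foulds(t1, t2, observed):
    """RF distance: symmetric difference of the split sets of the two trees."""
    observed = frozenset(observed)
    s1, s2 = observed_splits(t1, observed), observed_splits(t2, observed)
    return len(s1 ^ s2)
```

For example, two identical trees give distance 0, and contracting a single internal edge removes exactly one nontrivial split, increasing the distance by one.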
[Figure 7: Performance of RG, NJ, CLRG, and CLNJ for the latent trees shown in Figure 6. Each panel plots one of the four performance metrics against sample sizes ranging from 100 to 200,000 for the double star, the HMM, and the 5-complete tree.]

We first note from the structural error rate plots that the double star is the easiest structure to recover and the 5-complete tree is the hardest. In general, given the same number of observed variables, a latent tree with more hidden variables or a larger effective depth (see Section 2) is more difficult to recover.

For the double star, RG clearly outperforms all other methods: with only 1,000 samples, it recovers the true structure exactly in all 200 runs. On the other hand, CLGrouping performs significantly better than RG for the HMM. There are two reasons for these performance differences. Firstly, for Gaussian distributions it was shown (Tan et al., 2010) that, given the same number of variables and samples, the Chow-Liu algorithm is most accurate for a chain and least accurate for a star. Since the Chow-Liu tree of a latent double star is close to a star, while the Chow-Liu tree of a latent HMM is close to a chain, the Chow-Liu tree tends to be more accurate for the HMM than for the double star. Secondly, the internal nodes in the Chow-Liu tree of the HMM tend to have small degrees, so we can apply RG or NJ to a very small neighborhood, which results in a significant improvement in both accuracy and computational complexity. Note that NJ is particularly poor at recovering the HMM structure. In fact, it has been shown that even a number of samples growing polynomially in the number of observed variables (i.e., $n = O(m^B)$ for any $B > 0$) is insufficient for NJ to recover HMM structures (Lacey and Chang, 2006).

The 5-complete tree has two layers of hidden nodes, making it very difficult to recover the exact structure with any method. CLNJ has the best structure recovery error rate and KL-divergence, while CLRG has the smallest Robinson-Foulds metric.

Table 2 shows the running time of each algorithm averaged over 200 runs and all sample sizes. All algorithms are implemented in MATLAB. As expected, CLRG is significantly faster than RG for the HMM and 5-complete graphs. NJ is fastest, but CLNJ is also very efficient and leads to much more accurate reconstruction of latent trees.

Table 2: Average running time of each algorithm in seconds.

                 RG      NJ    CLRG   CLNJ
  HMM          10.16    0.02    0.10   0.05
  5-complete    7.91    0.02    0.26   0.06
  Double star   1.43    0.01    0.76   0.20

Based on the simulation results, we conclude that for a latent tree with a few hidden variables, RG is most accurate, while for a latent tree with a large diameter, CLNJ performs best. A latent tree with multiple layers of hidden variables is more difficult to recover correctly with any method, and in this regime CLNJ and CLRG outperform NJ and RG.

7.2 Monthly Stock Returns

In this and the next section, we test our algorithms on real-world data sets. The probability distributions that govern these data sets of course do not satisfy the assumptions required for consistent learning of latent tree models. Nonetheless, the experiments here demonstrate that our algorithms are also useful for approximating complex probability distributions by latent models in which the hidden variables have the same domain as the observed ones.
We apply our latent tree learning algorithms to model the dependency structure of the monthly stock returns of 84 companies in the S&P 100 stock index (we disregard the 16 companies that were listed on the S&P 100 only after 1990), using the samples of the monthly returns from 1990 to 2007. As shown in Table 3 and Figure 8, CLNJ achieves the highest log-likelihood and BIC scores. NJ introduces more hidden variables than CLNJ and has a lower log-likelihood, which implies that starting from a Chow-Liu tree helps to obtain a better latent tree approximation.

Figure 11 shows the latent tree structure learned using the CLNJ method. Each observed node is labeled with the ticker symbol of the company. Note that related companies are located close to one another in the tree, and many hidden nodes can be interpreted as industries or divisions. For example, h1 has Verizon, Sprint, and T-Mobile as descendants and can be interpreted as the telecom industry, while h3 corresponds to the technology division, with companies such as Microsoft, Apple, and IBM as descendants. Nodes h26 and h27 group commercial banks together, and h25 has all retail stores as child nodes.

Table 3: Comparison of the log-likelihood, BIC, number of hidden variables introduced, number of parameters, and running time for the monthly stock returns example.

          Log-Likelihood      BIC     #Hidden  #Parameters  Time (secs)
  CL         -13,321       -13,547       0         84          0.15
  NJ         -12,400       -12,747      45        129          0.02
  RG         -14,042       -14,300      12         96         21.15
  CLNJ       -11,990       -12,294      29        113          0.24
  CLRG       -12,879       -13,174      26        110          0.40

[Figure 8: Plot of BIC scores for the monthly stock returns example.]

7.3 20 Newsgroups with 100 Words

For our last experiment, we apply our latent tree learning algorithms to the 20 Newsgroups data set with 100 words (available at http://cs.nyu.edu/~roweis/data/20news_w100.mat). The data set consists of 16,242 binary samples of 100 words, each indicating whether the word appears in a given posting or not. In addition to the Chow-Liu tree (CL), NJ, RG, CLNJ, and CLRG, we also compare performance with regCLNJ and regCLRG (described in Section 6.6), the latent cluster model (LCM) (Lazarsfeld and Henry, 1968), and BIN, a greedy algorithm for learning latent trees (Harmeling and Williams, 2010).

Table 4 shows the performance of the different algorithms, and Figure 9 plots the BIC scores. To run LCM and BIN, we use the MATLAB code (a small part of which is implemented in C) provided by Harmeling and Williams (2010), available at http://people.kyb.tuebingen.mpg.de/harmeling/code/ltt-1.3.tar. Note that although LCM has only one hidden node, that node has 16 states, resulting in many parameters. We also tried to run the algorithm of Chen et al. (2008), but their JAVA implementation did not terminate on this data set even after several days. For NJ, RG, CLNJ, and CLRG, we learned the structures using only the information distances (defined in (9)) and then used the EM algorithm to fit the parameters. For regCLNJ and regCLRG, the model parameters are learned during the structure learning procedure by running the EM algorithm locally, and once structure learning is complete, we refine the parameters by running the EM algorithm over the entire latent tree. All methods are implemented in MATLAB except the E-step of the EM algorithm, which is implemented in C++.

Table 4: Comparison between the various algorithms on the newsgroup data set.

            Log-Likelihood      BIC     #Hidden  #Params   Time (s): Total  Structure       EM
  CL          -238,713      -239,677       0       199             8.9          -           -
  LCM         -223,096      -230,925       1     1,615         8,835.9          -           -
  BIN         -232,042      -233,952      98       394         3,022.6          -           -
  NJ          -230,575      -232,257      74       347         1,611.2         3.3    1,608.2
  RG          -239,619      -240,875      30       259           927.1        30.8      896.4
  CLNJ        -230,858      -232,540      74       347         1,479.6         2.7    1,476.8
  CLRG        -231,279      -232,738      51       301         1,224.6         3.1    1,224.6
  regCLNJ     -235,326      -236,553      27       253           630.8       449.7      181.1
  regCLRG     -234,012      -235,229      26       251           606.9       493.0      113.9
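As a sanity check on Table 4, the BIC column follows from (28) with $n = 16{,}242$; the small sketch below reproduces the CL row to within the rounding of the tabulated log-likelihood.

```python
import math

n = 16242                            # number of samples in the 20 newsgroups data set
loglik, num_params = -238713, 199    # CL row of Table 4
bic = loglik - 0.5 * num_params * math.log(n)   # Eq. (28)
print(round(bic))                    # approx -239678, matching Table 4 up to rounding
```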
[Figure 9: The BIC scores of the various algorithms on the newsgroup data set.]

Despite having many parameters, the model learned by LCM has the best BIC score. However, it does not reveal any interesting structure and is computationally more expensive to learn. In addition, it may result in overfitting. To show this, we split the data set randomly, using one half as the training set and the other half as the test set. Table 5 shows the performance obtained when the latent trees learned from the training set are applied to the test set, and Figure 10 shows the log-likelihoods on the training and test sets. For LCM, the test log-likelihood drops significantly compared to the training log-likelihood, indicating that LCM is overfitting the training data. NJ, CLNJ, and CLRG achieve high log-likelihood scores on the test set. Although regCLNJ and regCLRG do not result in a better BIC score, they introduce fewer hidden variables, which is desirable if we wish to learn a latent tree with low computational complexity, or if we wish to discover a few hidden variables that are meaningful in explaining the dependencies among the observed variables.

Figure 12 shows the latent tree structure learned from the entire data set using regCLRG. Many hidden variables in the tree can be roughly interpreted as topics: h5 as sports, h9 as computer technology, h13 as medical, and so on. Note that some words have multiple meanings and appear under different topics; for example, "program" can be used in the phrase "space program" as well as "computer program", and "win" may refer to the Windows operating system or to winning a sports game.

Table 5: Comparison between the various algorithms on the newsgroup data set with a train/test split.

                  Train                   Test           Hidden  Params   Time (s): Total  Struct      EM
            Log-Like      BIC      Log-Like      BIC
  CL        -119,013  -119,909    -120,107  -121,003        0      199            3.0        -         -
  LCM       -112,746  -117,288    -116,884  -120,949        1    1,009        3,197.7        -         -
  BIN       -117,172  -118,675    -117,957  -119,460       78      334        1,331.3        -         -
  NJ        -115,319  -116,908    -116,011  -117,600       77      353          802.8      1.3     801.5
  RG        -118,280  -119,248    -119,181  -120,149        8      215          137.6      7.6     130.0
  CLNJ      -115,372  -116,987    -116,036  -117,652       80      359          648.0      1.5     646.5
  CLRG      -115,565  -116,920    -116,199  -117,554       51      301          506.0      1.7     504.3
  regCLNJ   -117,723  -118,924    -118,606  -119,808       34      267          425.5    251.3     174.2
  regCLRG   -116,980  -118,119    -117,652  -118,791       27      253          285.7    236.5      49.2
[Figure 10: Training and test log-likelihood scores of the various algorithms on the newsgroup data set with a train/test split.]

8. Discussion and Conclusion

In this paper, we proposed algorithms for learning a latent tree model from the information distances of observed variables. Our first algorithm, recursive grouping (RG), identifies sibling and parent-child relationships and introduces hidden nodes recursively. Our second algorithm, CLGrouping, maintains a tree at each iteration and adds hidden variables by locally applying latent-tree learning procedures such as recursive grouping. These algorithms are structurally consistent (and risk consistent as well in the case of Gaussian and symmetric discrete distributions) and have sample complexity logarithmic in the number of observed variables for constant-depth trees.

Using simulations on synthetic data sets, we showed that RG performs well when the number of hidden variables is small, while CLGrouping performs significantly better than other algorithms when there are many hidden variables in the latent tree. We compared our algorithms to other EM-based approaches and to the neighbor-joining method on real-world data sets, under both Gaussian and discrete data modeling. Our proposed algorithms show superior results in both accuracy (measured by KL-divergence and graph distance) and computational efficiency. In addition, we introduced regularized CLGrouping, which can learn a latent tree approximation by trading off model complexity (the number of hidden nodes) against data fidelity. This is very relevant for practical implementation on real-world data sets.

[Figure 11: Tree structure learned from monthly stock returns using CLNJ. Observed nodes are labeled with company ticker symbols; hidden nodes h1 through h29 group related companies, such as telecom carriers under h1, technology companies under h3, commercial banks under h26 and h27, and retail stores under h25.]

[Figure 12: Tree structure learned from the 20 newsgroups data set using regCLRG. Observed nodes are words; many hidden nodes correspond roughly to topics, for example h5 (sports), h9 (computer technology), and h13 (medical).]
In the future, we plan to develop a unified framework for learning latent trees in which each random variable (node) may be either continuous or discrete. The MATLAB implementation of our algorithms can be downloaded from the project web page: http://people.csail.mit.edu/myungjin/latentTree.html.

Acknowledgments

This research was supported in part by Shell International Exploration and Production, Inc. and in part by the Air Force Office of Scientific Research under Award No. FA9550-06-1-0324. This work was also supported in part by AFOSR under Grant FA9550-08-1-1080 and in part by MURI under AFOSR Grant FA9550-06-1-0324. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the Air Force. Vincent Tan and Animashree Anandkumar are supported by A*STAR, Singapore, and by the setup funds at U.C. Irvine, respectively.

Appendix A. Proofs

In this appendix, we provide proofs of the theorems presented in the paper.

A.1 Proof of Lemma 4: Sibling Grouping

We prove statement (i) in Lemma 4 using (12) in Proposition 3. Statement (ii) follows along similar lines, and its proof is omitted for brevity.

If: From the additive property of information distances in (12), if $i$ is a leaf node and $j$ is its parent, then $d_{ik} = d_{ij} + d_{jk}$, and thus $\Phi_{ijk} = d_{ij}$ for all $k \neq i,j$.

Only if: Now assume that $\Phi_{ijk} = d_{ij}$ for all $k \in V \setminus \{i,j\}$. To prove that $i$ is a leaf node and $j$ is its parent, assume, to the contrary, that $i$ and $j$ are not connected by an edge. Then there exists a node $u \neq i,j$ on the path connecting $i$ and $j$. If $u \in V$, let $k = u$; otherwise, let $k$ be an observed node in the subtree rooted at $u$ away from $i$ and $j$ (see Figure 13(a)), which exists since $T_p \in \mathcal{T}_{\ge 3}$. By the additive property of information distances in (12) and the assumption that all distances are positive,

$$d_{ij} = d_{iu} + d_{uj} > d_{iu} - d_{uj} = d_{ik} - d_{kj} = \Phi_{ijk},$$

which is a contradiction. If $i$ is not a leaf node in $T_p$, then there exists a node $u \neq i,j$ such that $(i,u) \in E_p$. Let $k = u$ if $u \in V$; otherwise, let $k$ be an observed node in the subtree rooted at $u$ away from $i$ and $j$ (see Figure 13(b)). Then

$$\Phi_{ijk} = d_{ik} - d_{jk} = -d_{ij} \neq d_{ij},$$

which is again a contradiction. Therefore, $(i,j) \in E_p$ and $i$ is a leaf node.
A.2 Proof of Theorem 5: Correctness and Computational Complexity of RG

The correctness of RG follows from two observations. Firstly, from Proposition 3, for all $i,j$ in the active set $Y$, the information distances $d_{ij}$ can be computed exactly using Equations (13) and (14). Secondly, at each iteration of RG, the sibling groups within $Y$ are identified correctly from the information distances, by Lemma 4. Since each new parent node added to a partition that does not contain an observed parent corresponds to a hidden node in the original latent tree, a subforest of $T_p$ is recovered at each iteration, and when $|Y| \le 2$, the entire latent tree has been recovered.

[Figure 13: Shaded nodes indicate observed nodes and the rest indicate hidden nodes. (a), (b) Figures for the proof of Lemma 4; the dashed lines represent the subtrees away from $i$ and $j$. (c) Figure for the proof of Lemma 8(i). (d) Figure for the proof of Lemma 8(ii).]

The computational complexity follows from the fact that there are at most $O(m^3)$ differences $\Phi_{ijk} = d_{ik} - d_{jk}$ to compute at each iteration of RG. Furthermore, there are at most $\mathrm{diam}(T_p)$ subsets in the coarsest partition (cf. step 3) of $Y$ at the first iteration, and the number of subsets reduces by at least 2 from one iteration to the next, owing to the assumption that $T_p \in \mathcal{T}_{\ge 3}$. This proves the claim that the computational complexity is upper bounded by $O(\mathrm{diam}(T_p)\, m^3)$.

A.3 Proof of Lemma 8: Properties of the MST

(i) For an edge $(i,j) \in E_p$ such that $\mathrm{Sg}(i) \neq \mathrm{Sg}(j)$, let $V_{i\setminus j} \subset V$ and $V_{j\setminus i} \subset V$ denote the observed nodes in the two subtrees obtained by removing the edge $(i,j)$, where the former includes node $i$ and excludes node $j$, and vice versa (see Figure 13(c)). Using part (ii) of the lemma and the fact that $\mathrm{Sg}(i) \neq \mathrm{Sg}(j)$, it can be shown that $\mathrm{Sg}(i) \in V_{i\setminus j}$ and $\mathrm{Sg}(j) \in V_{j\setminus i}$. Since $(i,j)$ lies on the unique path from $k$ to $l$ in $T_p$ for all observed nodes $k \in V_{i\setminus j}$, $l \in V_{j\setminus i}$, we have

$$d_{kl} = d_{ki} + d_{ij} + d_{jl} \ge d_{\mathrm{Sg}(i),i} + d_{ij} + d_{\mathrm{Sg}(j),j} = d_{\mathrm{Sg}(i),\mathrm{Sg}(j)},$$

where the inequality follows from the definition of surrogacy and the final equality uses the fact that $\mathrm{Sg}(i) \neq \mathrm{Sg}(j)$. Using the property of the MST that $(\mathrm{Sg}(i), \mathrm{Sg}(j))$ is the shortest edge from $V_{i\setminus j}$ to $V_{j\setminus i}$, we obtain (18).

(ii) First assume that we have a tie-breaking rule that is consistent across all hidden nodes, so that if $d_{uh} = d_{vh} = \min_{i \in V} d_{ih}$ and $d_{uh'} = d_{vh'} = \min_{i \in V} d_{ih'}$, then both $h$ and $h'$ choose the same surrogate node. Let $j \in V$, $h \in \mathrm{Sg}^{-1}(j)$, and let $u$ be a node on the path connecting $h$ and $j$ (see Figure 13(d)). Assume that $\mathrm{Sg}(u) = k \neq j$. If $d_{uj} > d_{uk}$, then

$$d_{hj} = d_{hu} + d_{uj} > d_{hu} + d_{uk} \ge d_{hk},$$

which is a contradiction since $j = \mathrm{Sg}(h)$. If $d_{uj} = d_{uk}$, then $d_{hj} = d_{hk}$, which again contradicts the consistent tie-breaking rule. Thus, the surrogate node of $u$ is $j$.

(iii) First we claim that

$$|\mathrm{Sg}^{-1}(i)| \le \Delta(T_p)^{\,u\,\delta(T_p;V)/\ell}. \qquad (29)$$
A.4ProofofTheorem9:CorrectnessandComputationalComplexityofCLBlindItsufcestoshowthattheChow-LiutreeMST(Vd)isatransformationofthetruelatenttreeTp(withparameterssuchthatp2P(Tblind))asfollows:contracttheedgeconnectingeachhiddenvariablehwithitssurrogatenodeSg(h)(oneofitschildrenandaleafbyassumption).NotethattheblindtransformationontheMSTismerelytheinversemappingoftheabove.From(18),allthechildrenofahiddennodeh,exceptitssurrogateSg(h),areneighborsofitssurrogatenodeSg(h)inMST(Vd).Moreover,thesechildrenofhwhicharenotsurrogatesofanyhiddennodesareleafnodesintheMST.Similarlyfortwohiddennodesh1;h22Hsuchthat(h1;h2)2Ep(Sg(h1);Sg(h2))2MST(Vd)fromLemma8(i).Hence,CLBlindoutputsthecorrecttreestruc-tureTp.Thecomputationalcomplexityfollowsfromthefactthattheblindtransformationislinearinthenumberofinternalnodes,whichislessthanthenumberofobservednodes,andthatlearningtheChow-LiutreetakesO(m2logm)operations. A.5ProofofTheorem10:CorrectnessandComputationalComplexityofCLRGWerstdenesomenewnotations.Notation:LetI=VnLeaf(MST(Vd))bethesetofinternalnodes.Letvr2Ibetheinternalnodevisitedatiterationr,andletHrbeallhiddennodesintheinversesurrogatesetSg1(vr)thatis,Hr=Sg1(vr)nfvrg.LetAr=nbd[vrTr1],andhenceAristhenodesetinputtotherecursivegroupingroutineatiterationr,andletRG(Ar;d)betheoutputlatenttreelearnedbyrecursivegrouping.DeneTrasthetreeoutputattheendofriterationsofCLGrouping.LetVr=fvr+1;vr+2;:::;vjIjgbethesetofinternalnodesthathavenotyetbeenvisitedbyCLGrouping 30.Themaximumsizeoftheinversesurrogatesetin(30)isattainedbyaD(Tp)-arycompletetree.1806 LEARNINGLATENTTREEGRAPHICALMODELS (a) h1 1 2 3 4 5 6 (b)(c) 2 1 3 4 6 5 3 h1 2 1 4 5 6 h2 5 T1T2 A1A2 v1v2 RG(A1,d)RG(A2,d) T0T1 2 1 3 4 5 6 h1 h2 2 1 3 4 6 5 3 h1 2 1 4 5 6 h2 h1 h3 h2 2 1 3 4 5 6 EC(Tp, V2) v1v2 S2 v2H2 S1 EC(Tp, V1) v1H1 )EC(Tðððð 2 1 3 4 5 6 h1 h2 v1H1 h1 h3 h2 2 1 3 4 5 6 v2H2 Figure14:FigureforProofofTheorem10.(a)Originallatenttree.(b)IllustrationofCLGrouping.(c)Illustrationofthetreesconstructedusingedgecontractions.attheendofriterations.LetEC(Tp;Vr)bethetreeconstructedusingedgecontractionsasfollows:inthelatenttreeTp,wecontractedgescorrespondingtoeachnodeu2VrandallhiddennodesinitsinversesurrogatesetSg1(u).LetSrbeasubtreeofEC(Tp;Vr)spanningvrHrandtheirneighbors.Forexample,inFigure14,theoriginallatenttreeTpisshowninFigure14(a),andT0T1T2areshowninFigure14(b).ThesetofinternalnodesisI=f3;5g.Intherstiteration,v1=5,A1=f1;3;4;5gandH1=fh1;h2g.Intheseconditeration,v2=3,A2=f2;3;6;h1gandH1=fh3gV0=f3;5gV1=f3g,andV2=0,andinFigure14(c),weshowEC(Tp;V0)EC(Tp;V1),andEC(Tp;V2).InEC(Tp;V1)S1isthesubtreespanning5;h1;h2andtheirneighbors,thatis,f1;3;4;5;h1;h2g.InEC(Tp;V2)S2isthesubtreespanning3;h3andtheirneighbors,thatis,f2;3;6;h1;h3g.NotethatT0=EC(Tp;V0)T1=EC(Tp;V1),andT2=EC(Tp;V2);weshowbelowthatthisholdsforallCLGroupingiterationsingeneral.Weprovethetheorembyinductionontheiterationsr=1;:::;jIjoftheCLGroupingalgorithm.InductionHypothesis:AttheendofkiterationsofCLGrouping,thetreeobtainedisTk=EC(Tp;Vk);8k=0;1;:::;jIj:(33)Inwords,thelatenttreeafterkiterationsofCLGroupingcanbeconstructedbycontractingeachsurrogatenodeinTpthathasnotbeenvisitedbyCLGroupingwithitsinversesurrogateset.NotethatVjIj=0andEC(Tp;VjIj)isequivalenttotheoriginallatenttreeTp.Thus,iftheaboveinductionin(33)holds,thentheoutputofCLGroupingTjIjistheoriginallatenttree.BaseStepr=0:Theclaimin(33)holdssinceV0=IandtheinputtotheCLGroupingpro-cedureistheChow-LiutreeMST(VD),whichisobtainedbycontractingallsurrogatenodesandtheirinversesurrogatesets(seeSection5.2).InductionStep:Assume(33)istruefork=1;:::;r1.Nowconsiderk=r1807 
CHOI,TAN,ANANDKUMARANDILLSKYWerstcomparethetwolatenttreesEC(Tp;Vr)andEC(Tp;Vr1).BythedenitionofEC,ifwecontractedgeswithvrandthehiddennodesinitsinversesurrogatesetHronthetreeEC(Tp;Vr)thenweobtainEC(Tp;Vr1),whichisequivalenttoTr1bytheinductionassumption.NotethatasshowninFigure14,thistransformationislocaltothesubtreeSr:contractingvrwithHronEC(Tp;Vr)transformsSrintoastargraphwithvratitscenterandthehiddennodesHrremoved(contractedwithvr).RecallthattheCLGroupingprocedurereplacestheinducedsubtreeofArinTr1(whichispreciselythestargraphmentionedabovebytheinductionhypothesis)withRG(Ar;d)toobtainTrThus,toprovethatTr=EC(Tp;Vr),weonlyneedtoshowthatRGreversestheedge-contractionoperationsonvrandHr,thatis,thesubtreeSr=RG(Ar;d).WerstshowthatSr2T3,thatis,itisidentiable(minimal)whenAristhesetofvisiblenodes.Thisisbecauseanedgecontractionoperationdoesnotdecreasethedegreeofanyexistingnodes.SinceTp2T3,allhiddennodesinEC(Tp;Vr)havedegreesequaltoorgreaterthan3,andsinceweareincludingallneighborsofHrinthesubtreeSr,wehaveSr2T3.ByTheorem5,RGreconstructsalllatenttreesinT3andhence,Sr=RG(Ar;d)Thecomputationalcomplexityfollowsfromthecorrespondingresultinrecursivegrouping.TheChow-LiutreecanbeconstructedwithO(m2logm)complexity.TherecursivegroupingprocedurehascomplexitymaxrjArj3andmaxrjArjD(MST(Vbd)) A.6ProofofTheorem11:ConsistencyandSampleComplexityofRelaxedRG(i)StructuralconsistencyfollowsfromTheorem5andthefactthattheMLestimatesofinformationdistancesbdijapproachdij(inprobability)forall;2Vasthenumberofsamplestendstoinnity.RiskconsistencyforGaussianandsymmetricdiscretedistributionsfollowsfromstructuralcon-sistency.Ifthestructureiscorrectlyrecovered,wecanusetheequationsin(13)and(14)toinfertheinformationdistances.Sincethedistancesareinone-to-onecorrespondencetothecorrelationcoef-cientsandthecrossoverprobabilityforGaussianandsymmetricdiscretedistributionrespectively,theparametersarealsoconsistent.ThisimpliesthattheKL-divergencebetweenpandbpntendstozero(inprobability)asthenumberofsamplesntendstoinnity.Thiscompletestheproof.(ii)Thetheoremfollowsbyusingtheassumptionthattheeffectivedepthd=d(TpV)iscon-stant.Recallthatt�0isthethresholdusedinrelaxedRG(see(21)inSection6.3).Letthesetoftriples(;;k)whosepairwiseinformationdistancesarelessthantapartbe,thatis,(;;k)2ifandonlyifmaxfdij;djk;dkigt.Sinceweassumethatthetrueinformationdistancesareuni-formlybounded,thereexistt&#x-0.9;畵0andsomesufcientlysmalll&#x-0.9;畵0sothatifjbFijkFijkjlforall(;;k)2,thenRGrecoversthecorrectlatentstructure.DenetheerroreventEijk=fjbFijkFijkj&#x-0.9;畵lg.WenotethattheprobabilityoftheeventEijkdecaysexponentiallyfast,thatis,thereexistsJijk&#x-0.9;畵0suchthatforalln2NPr(Eijk)exp(nJijk):(34)1808 LEARNINGLATENTTREEGRAPHICALMODELSTheproofof(34)followsreadilyforChernoffbounds(Hoeffding,1958)andisomitted.Theerrorprobabilityassociatedtostructurelearningcanbeboundedasfollows:Prh(bTn)=Tp(a)Pr0@[(i;j;k)2JEijk1A(b)å(i;j;k)2JPr(Eijk)m3max(i;j;k)2JPr(Eijk)(c)exp(3logm)expnmin(i;j;k)2JJijk;where(a)followsfromthefactthatiftheeventfh(bTn)=Tpgoccurs,thenthereisatleastonesiblingorparent-childrelationshipthatisincorrect,whichcorrespondstotheunionoftheeventsEijk,thatis,thereexistsatriple(;;k)2issuchthatbFijkdiffersfromFijkbymorethanlInequality(b)followsfromtheunionboundand(c)followsfrom(34).Becausetheinformationdistancesareuniformlybounded,therealsoexistsaconstantJmin�0(independentofm)suchthatmin(i;j;k)2JJijkJminforallm2N.Henceforeveryh�0,ifthenumberofsamplessatisesn�3(log(m=3p 
h))=Jmin,theerrorprobabilityisboundedabovebyhLetC=3=Jmintocompletetheproofofthesamplecomplexityresultin(26).TheproofforthelogarithmicsamplecomplexityofdistributionreconstructionforGaussianandsymmetricdiscretemodelsfollowsfromthelogarithmicsamplecomplexityresultforstructurelearningandthefactthattheinformationdistancesareinaone-to-onecorrespondencewiththecorrelationcoefcients(forGaussianmodels)orcrossoverprobabilities(forsymmetricdiscretemodels).A.7ProofofTheorem12:ConsistencyandSampleComplexityofRelaxedCLRG(i)StructuralconsistencyofCLGroupingfollowsfromstructuralconsistencyofRG(orNJ)andtheconsistencyoftheChow-Liualgorithm.RiskconsistencyofCLGroupingforGaussianorsym-metricdistributionsfollowsfromthestructuralconsistency,andtheproofissimilartotheproofofTheorem11(i).(ii)TheinputtotheCLGroupingprocedurebTCListheChow-LiutreeandhasO(logm)samplecomplexity(Tanetal.,2010,2011),wheremisthesizeofthetree.ThisistrueforbothdiscreteandGaussiandata.FromTheorem11,therecursivegroupingprocedurehasO(logm)samplecom-plexity(forappropriatelychosenthresholds)whentheinputinformationdistancesareuniformlybounded.InanyiterationoftheCLGrouping,theinformationdistancessatisfydijgu,wheregdenedin(30),istheworst-casegraphdistanceofanyhiddennodefromitssurrogate.Sincegsatises(32),diju2d=.Iftheeffectivedepthd=O(1)(asassumed),thedistancesdij=O(1)andthesamplecomplexityisO(logm) ReferencesK.Atteson.Theperformanceofneighbor-joiningmethodsofphylogeneticreconstruction.Algo-rithmica,25(2):251–278,1999.M.F.BalcanandP.Gupta.Robusthierarchicalclustering.InIntl.Conf.onLearningTheory(COLT),2010.H.-J.BandelthandA.Dress.Reconstructingtheshapeofatreefromobserveddissimilaritydata.Adv.Appl.Math,7:309–43,1986.1809 CHOI,TAN,ANANDKUMARANDILLSKYS.Bhamidi,R.Rajagopal,andS.Roch.Networkdelayinferencefromadditivemetrics.ToappearinRandomStructuresandAlgorithms,Arxivpreprintmath/0604367,2009.R.Castro,M.Coates,G.Liang,R.Nowak,andB.Yu.Networktomography:Recentdevelopments.Stat.Science,19:499–517,2004.J.T.ChangandJ.A.Hartigan.Reconstructionofevolutionarytreesfrompairwisedistributionsoncurrentspecies.InComputingScienceandStatistics:Proceedingsofthe23rdSymposiumontheInterface,pages254–257,1991.T.Chen,N.L.Zhang,andY.Wang.Efcientmodelevaluationinthesearchbasedapproachtolatentstructurediscovery.In4thEuropeanWorkshoponProbabilisticGraphicalModels,2008.M.J.Choi,J.J.Lim,A.Torralba,andA.S.Willsky.Exploitinghierarchicalcontextonalargedatabaseofobjectcategories.InIEEEConferenceonComputerVisionandPatternRecognition(CVPR),SanFrancisco,CA,June2010.C.K.ChowandC.N.Liu.Approximatingdiscreteprobabilitydistributionswithdependencetrees.IEEETrans.onInformationTheory,3:462–467,1968.T.Cormen,C.Leiserson,R.Rivest,andC.Stein.IntroductiontoAlgorithms.McGraw-HillSci-ence/Engineering/Math,2ndedition,2003.T.M.CoverandJ.A.Thomas.ElementsofInformationTheory.Wiley-Interscience,2ndedition,2006.R.G.Cowell,A.P.Dawid,S.L.Lauritzen,andD.J.Spiegelhalter.ProbabilisticNetworksandExpertSystems.StatisticsforEngineeringandInformationScience.Springer-Verlag,NewYork,1999.M.Csur¨os.ReconstructingPhylogeniesinMarkovModelsofSequenceEvolution.PhDthesis,YaleUniversity,2000.C.Daskalakis,E.Mossel,andS.Roch.Optimalphylogeneticreconstruction.InSTOC'06:Pro-ceedingsoftheThirty-eighthAnnualACMSymposiumonTheoryofComputing,pages159–168,2006.A.P.Dempster,N.M.Laird,andD.B.Rubin.Maximum-likelihoodfromincompletedataviatheEMalgorithm.JournaloftheRoyalStatisticalSociety,39:1–38,1977.R.Durbin,S.R.Eddy,A.Krogh,andG.Mitchison.BiologicalSequenceAnalysis:ProbabilisticModelsofProteinsandNucleicAcids.CambridgeUniv.Press,1999.G.E
G. Elidan and N. Friedman. Learning hidden variable networks: The information bottleneck approach. Journal of Machine Learning Research, 6:81–127, 2005.

P. L. Erdős, L. A. Székely, M. A. Steel, and T. J. Warnow. A few logs suffice to build (almost) all trees: Part II. Theoretical Computer Science, 221:153–184, 1999.

J. Farris. Estimating phylogenetic trees from distance matrices. American Naturalist, 106(951):645–668, 1972.

S. Harmeling and C. K. I. Williams. Greedy learning of binary latent trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Intl. Conf. on Learning Theory (COLT), 2009.

A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 1999.

T. Jiang, P. E. Kearney, and M. Li. A polynomial-time approximation scheme for inferring evolutionary trees from quartet topologies and its application. SIAM J. Comput., 30(6):1942–1961, 2001.

C. Kemp and J. B. Tenenbaum. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687–10692, 2008.

J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), Feb 1956.

M. R. Lacey and J. T. Chang. A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences. Mathematical Biosciences, 199:188–215, 2006.

J. A. Lake. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proceedings of the National Academy of Sciences, 91:1455–1459, 1994.

S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

P. F. Lazarsfeld and N. W. Henry. Latent Structure Analysis. Houghton Mifflin, Boston, 1968.

D. G. Luenberger. Introduction to Dynamic Systems: Theory, Models, and Applications. Wiley, 1979.

D. Parikh and T. H. Chen. Hierarchical Semantics of Objects (hSOs). In ICCV, pages 1–8, 2007.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

R. C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36, 1957.

D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53:131–147, 1981.

S. Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3(1), 2006.

P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, Jul 1987.

S. Sattath and A. Tversky. Additive similarity trees. Psychometrika, 42:319–345, 1977.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley-Interscience, Nov 1980.

S. Shen. Large deviation for the empirical correlation coefficient of two Gaussian random variables. Acta Mathematica Scientia, 27(4):821–828, Oct 2007.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, Feb 2006.

L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. In Proc. of Intl. Conf. on Machine Learning, 2010.

K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic methods: (Unweighted) quartet methods and neighbor-joining. J. Algorithms, 48(1):173–193, 2003.

M. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91–116, 1992.

V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Transactions on Signal Processing, 58(5):2701–2714, May 2010.

V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning high-dimensional Markov forest distributions: Analysis of error rates. Journal of Machine Learning Research (In Press), 2011.

Y. Tsang, M. Coates, and R. D. Nowak. Network delay tomography. IEEE Trans. Signal Processing, 51:2125–2136, 2003.

Y. Wang, N. L. Zhang, and T. Chen. Latent tree models and approximate inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879–900, Aug 2008.

N. L. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697–723, 2004.

N. L. Zhang and T. Kocka. Efficient learning of hierarchical latent class models. In ICTAI, 2004.