Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors

David Andrzejewski†  andrzeje@cs.wisc.edu
Xiaojin Zhu  jerryzhu@cs.wisc.edu
Mark Craven†  craven@biostat.wisc.edu
Department of Computer Sciences, †Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model's ability to follow and generalize beyond user-specified domain knowledge.

1. Introduction

Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words "termination", "disassembly" and "release" appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people's wishes, inspect the resulting topics, and notice that "into, college" and "cure, cancer" all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, re-run topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.

In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomials p(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.
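To make the analyst's interactive preference concrete before the primitives are formally defined in Section 3.1: the preference that two word sets should not appear together decomposes into one pairwise primitive per cross pair. A minimal sketch of that compilation (illustrative only; the helper name `cross_cannot_links` is ours, not the paper's):

```python
def cross_cannot_links(set1, set2):
    """Compile 'these two word sets should not share a topic'
    into one Cannot-Link primitive per cross pair."""
    return [(u, v) for u in set1 for v in set2]

# The analyst's preference from the wish example:
wish_links = cross_cannot_links(["into", "college"], ["cure", "cancer"])
# -> [("into", "cure"), ("into", "cancer"),
#     ("college", "cure"), ("college", "cancer")]
```

With Must-Links collapsing each word set to a single node (Section 3.4), a single Cannot-Link between the collapsed nodes would express the same preference.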
2. Related Work

We review LDA using the notation of Griffiths and Steyvers (2004). Let there be T topics. Let w = w_1...w_n represent a corpus of D documents, with a total of n words. We use d_i to denote the document of word w_i, and z_i the hidden topic from which w_i is generated. Let φ_w^(j) = p(w | z = j), and θ_j^(d) = p(z = j) for document d. The LDA generative model is then:

  θ ~ Dirichlet(α)    (1)
  z_i | θ^(d_i) ~ Multinomial(θ^(d_i))    (2)
  φ ~ Dirichlet(β)    (3)
  w_i | z_i, φ ~ Multinomial(φ^(z_i))    (4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β, but asymmetric hyperparameters are also possible.

Previous work has modeled correlations in the LDA document-topic mixtures θ using the logistic normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) employs domain knowledge through special "concept" topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on words (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.

3. Topic Modeling with Dirichlet Forest

Our proposed model differs from LDA in the way φ is generated. Instead of (3), we have

  q ~ DirichletForest(β, η)
  φ ~ DirichletTree(q)

where q specifies a Dirichlet tree distribution, β plays a role analogous to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the "strength parameter" of the domain knowledge. Before discussing DirichletForest(β, η) and DirichletTree(q), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.

3.1. Must-Links and Cannot-Links

Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore propose the following definitions:

Must-Link(u, v): Two words u, v have similar probability within any topic, i.e., φ_u^(j) ≈ φ_v^(j) for j = 1...T. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link(termination, disassembly).

Cannot-Link(u, v): Two words u, v should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link(college, cure).

Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments. We can Split two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can Merge two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can Isolate them by placing Must-Links within the common set, and then placing Cannot-Links between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are preferences instead of hard constraints.

3.2. Encoding Must-Links

It is well known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except for the normalization constraint (Minka, 1999). However, for Must-Link(u, v) it is crucial to control the two words u, v differently than other words. The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^(k) be the Dirichlet tree edge weight leading into node k. Let C(k) be the immediate children of node k in the tree, L the leaves of the tree, I the internal nodes, and L(k) the
leaves in the subtree under k. To generate a sample φ ~ DirichletTree(γ), one first draws a multinomial at each internal node s ∈ I from Dirichlet(γ^(C(s))), i.e., using the weights from s to its children as the Dirichlet parameters. One can think of this as re-distributing the probability mass reaching s by this multinomial (initially, the mass is 1 at the root). The probability φ^(k) of a word k ∈ L is then simply the product of the multinomial parameters on the edges from k to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives

  p(φ | γ) = ( ∏_{k∈L} (φ^(k))^{γ^(k) − 1} ) ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / ∏_{k∈C(s)} Γ(γ^(k)) ] ( ∑_{k∈L(s)} φ^(k) )^{Δ(s)}

where Γ(·) is the standard gamma function. The function Δ(s) ≡ γ^(s) − ∑_{k∈C(s)} γ^(k) is the difference between the in-degree and out-degree of internal node s. When this difference Δ(s) = 0 for all internal nodes s ∈ I, the Dirichlet tree reduces to a Dirichlet distribution. Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out φ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:

  p(w | γ) = ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / Γ(∑_{k∈C(s)} (γ^(k) + n^(k))) ] ∏_{k∈C(s)} Γ(γ^(k) + n^(k)) / Γ(γ^(k))    (5)

Here n^(k) is the number of word tokens in w that appear in L(k).

We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link(u, v) and Must-Link(v, w) imply Must-Link(u, w). We thus first compute the transitive closures of the expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a subtree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to these internal nodes s with weight |L(s)|β, where |·| represents set size. In addition, the root directly connects to the other words not in any closure, with weight β. For example, the transitive closure for a Must-Link(A, B) on vocabulary {A, B, C} is simply {A, B}, corresponding to the Dirichlet tree in Figure 1(a).

To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest, η = 1. Then in-degree equals out-degree for any internal node s (both are |L(s)|β), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the re-distribution of probability mass at s (governed by a Dirichlet under s) has increasing concentration η|L(s)|β but the same uniform base measure. This tends to redistribute the mass evenly within the transitive closure represented by s. Therefore, the Must-Links are turned on when η ≫ 1. Furthermore, the mass reaching s is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in all topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that p(A) ≈ p(B), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50, 50, 1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d)).

3.3. Encoding Cannot-Links

Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link(A, B) and Cannot-Link(B, C) do not entail Cannot-Link(A, C). We define a Cannot-Link-graph where the nodes are words[1], and the edges correspond to the Cannot-Links. Then the connected components of this graph are independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A, B) and (B, C) form the graph in Figure 1(e) with a single connected component {A, B, C}. Consider the subgraph on connected component r. We define its complement graph by "flipping" the edges (on to off, off to on), as shown in Figure 1(f). Let there be Q(r) maximal cliques M_{r1}...M_{rQ(r)} in this complement graph. In the following, we simply call them "cliques", but it is important to remember that they are maximal cliques of the complement graph, not of the original Cannot-Link-graph. In our example, Q(r) = 2 and M_{r1} = {A, C}, M_{r2} = {B}. These cliques have the following interpretation: each clique (e.g., M_{r1} = {A, C}) is a maximal subset of words in the connected component that can "occur together".

[1] When there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph.
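The construction above (connected components of the Cannot-Link-graph, then maximal cliques of each component's complement) can be sketched directly. This is an illustrative implementation using a plain Bron-Kerbosch clique search, not the authors' code:

```python
def connected_components(nodes, edges):
    """Union-find over the Cannot-Link-graph."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for u, v in edges:
        parent[find(u)] = find(v)
    comps = {}
    for v in nodes:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

def maximal_cliques(nodes, neighbors):
    """Basic Bron-Kerbosch enumeration of all maximal cliques."""
    cliques = []
    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)
        for v in list(P):
            expand(R | {v}, P & neighbors[v], X & neighbors[v])
            P = P - {v}
            X = X | {v}
    expand(set(), set(nodes), set())
    return cliques

def cannot_link_cliques(nodes, cannot_links):
    """For each connected component r of the Cannot-Link-graph, return
    the maximal cliques M_r1 .. M_rQ(r) of that component's complement."""
    linked = {frozenset(e) for e in cannot_links}
    result = []
    for comp in connected_components(nodes, cannot_links):
        nbrs = {v: {u for u in comp
                    if u != v and frozenset((u, v)) not in linked}
                for v in comp}
        result.append(maximal_cliques(comp, nbrs))
    return result

per_component = cannot_link_cliques(["A", "B", "C"], [("A", "B"), ("B", "C")])
# one component {A, B, C}; its complement's maximal cliques: {A, C} and {B}
```

On the running example this recovers exactly M_{r1} = {A, C} and M_{r2} = {B}. The worst-case exponential clique count noted at the end of Section 3.3 applies to this enumeration as well.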
[Figure 1 appears here: panels (a)-(i).]

Figure 1. Encoding Must-Links and Cannot-Links with a Dirichlet Forest. (a) A Dirichlet tree encoding Must-Link(A, B) with β = 1, η = 50 on vocabulary {A, B, C}. (b) A sample from this Dirichlet tree, φ = (0.53, 0.38, 0.09). (c) A large set of samples from the Dirichlet tree, plotted on the 3-simplex. Note p(A) ≈ p(B), yet they remain flexible in actual value, which is desirable for a Must-Link. (d) In contrast, samples from a standard Dirichlet with comparable parameters (50, 50, 1) force p(A) ≈ p(B) ≈ 0.5, and cannot encode a Must-Link. (e) The Cannot-Link-graph for Cannot-Link(A, B) and Cannot-Link(B, C). (f) The complementary graph, with two maximal cliques {A, C} and {B}. (g) The Dirichlet subtree for clique {A, C}. (h) The Dirichlet subtree for clique {B}. (i) Samples from the mixture model on (g, h), encoding both Cannot-Links, again with β = 1, η = 50.

That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., "B") to also have a large probability will violate at least 1 Cannot-Link (in this example 2).

We discuss the encoding for this single connected component r now, deferring discussion of the complete encoding to Section 3.4. We create a mixture model of Q(r) Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

  p(q) ∝ |M_{rq}|,  q = 1...Q(r).    (6)

Conceptually, the selected subtree indexed by q tends to redistribute nearly all probability mass to the words within M_{rq}. Since there is no mass left for other cliques, it is impossible for a word outside clique M_{rq} to have a large probability. Therefore, no Cannot-Link will be violated. In reality, the subtrees are soft rather than hard, because Cannot-Links are only preferences. The Dirichlet subtree for M_{rq} is structured as follows. The subtree's root connects to an internal node s with weight η|M_{rq}|β. The node s connects to the words in M_{rq}, each with weight β. The subtree's root also directly connects to the words not in M_{rq} (but in the connected component r), each with weight β. This will send most probability mass down to s, and then flexibly redistribute it among the words in M_{rq}. For example, Figures 1(g, h) show the Dirichlet subtrees for M_{r1} = {A, C} and M_{r2} = {B}, respectively. Samples from this mixture model are shown in Figure 1(i), representing multinomials in which no Cannot-Link is violated. Such behavior is not achievable by a Dirichlet distribution, or a single Dirichlet tree.[2]

Finally, we mention that although in the worst case the number of maximal cliques Q(r) in a connected component of size |r| can grow exponentially as O(3^{|r|/3}) (Griggs et al., 1988), in our experiments Q(r) is no larger than 3, due in part to Must-Linked words "collapsing" to single nodes in the Cannot-Link-graph.

[2] Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1, 0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak: the "pseudo counts" in them are too small because of the small concentration. The posterior would be dominated by the data, and we would lose any encoded domain knowledge.

3.4. The Dirichlet Forest Prior

In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of the Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be "consistent", in that no pair of words is simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through Must-Link transitive closure). Let R be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of ∏_{r=1}^{R} Q(r) Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has R branches beneath the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the r-th branch, there are Q(r) possible Dirichlet subtrees, corresponding to cliques M_{r1}...M_{rQ(r)}. Therefore, a tree in the forest is uniquely identified by an index vector q = (q^(1)...q^(R)), where q^(r) ∈ {1...Q(r)}.

[Figure 2 appears here.]
Figure 2. Template of Dirichlet trees in the Dirichlet Forest.

To draw a Dirichlet tree q from the prior DirichletForest(β, η), we select the subtrees independently, because the R connected components are independent with respect to Cannot-Links: p(q) = ∏_{r=1}^{R} p(q^(r)). Each q^(r) is sampled according to (6), and corresponds to choosing a solid box for the r-th branch in Figure 2. The structure of the subtree within the solid box has been defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading to most nodes k is γ^(k) = |L(k)|β, where L(k) is the set of leaves under k. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, the weights are multiplied by the strength parameter η. These edges are marked by "*" in Figure 2.

We now define the complete Dirichlet Forest model, integrating out ("collapsing") θ and φ. Let n_j^(d) be the number of word tokens in document d that are assigned to topic j. z is generated the same as in LDA:

  p(z | α) = ( Γ(Tα) / Γ(α)^T )^D ∏_{d=1}^{D} ∏_{j=1}^{T} Γ(n_j^(d) + α) / Γ(n^(d) + Tα).

There is one Dirichlet tree q_j per topic j = 1...T, sampled from the Dirichlet Forest prior p(q_j) = ∏_{r=1}^{R} p(q_j^(r)). Each Dirichlet tree q_j implicitly defines its tree edge weights γ_j^(·) using β, η, and its tree structure L_j, I_j, C_j(·). Let n_j^(k) be the number of word tokens in the corpus assigned to topic j that appear under the node k in the Dirichlet tree q_j. The probability of generating the corpus w, given the trees q_{1:T} ≡ q_1...q_T and the topic assignment z, can be derived using (5):

  p(w | q_{1:T}, z, β, η) = ∏_{j=1}^{T} ∏_{s∈I_j} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)).

Finally, the complete generative model is

  p(w, z, q_{1:T} | α, β, η) = p(w | q_{1:T}, z, β, η) p(z | α) ∏_{j=1}^{T} p(q_j).

4. Inference for Dirichlet Forest

Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov Chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels z and the tree indices q_{1:T}. An MCMC iteration in our case consists of a sweep through both z and q_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.

(Sampling z_i): Let n_{-i,j}^(d) be the number of word tokens in document d assigned to topic j, excluding the word at position i. Similarly, let n_{-i,j}^(k) be the number of word tokens in the corpus that are under node k in topic j's Dirichlet tree, excluding the word at position i. For candidate topic labels v = 1...T, we have

  p(z_i = v | z_{-i}, q_{1:T}, w) ∝ (n_{-i,v}^(d) + α) ∏_{s∈I_v(↑i)} ( γ_v^(C_v(s↓i)) + n_{-i,v}^(C_v(s↓i)) ) / ( ∑_{k∈C_v(s)} (γ_v^(k) + n_{-i,v}^(k)) ),

where I_v(↑i) denotes the subset of internal nodes in topic v's Dirichlet tree that are ancestors of leaf w_i, and C_v(s↓i) is the unique node that is s's immediate child and an ancestor of w_i (including w_i itself).

(Sampling q_j^(r)): Since the connected components are independent, sampling the tree q_j factors into sampling the cliques for each connected component q_j^(r). For candidate cliques q' = 1...Q(r), we have

  p(q_j^(r) = q' | z, q_{-j}, q_j^(-r), w) ∝ |M_{rq'}| ∏_{s∈I_j^{(r=q')}} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)),

where I_j^{(r=q')} denotes the internal nodes below the r-th branch of tree q_j, when clique M_{rq'} is selected.

(Estimating φ and θ): After running MCMC for sufficient iterations, we follow standard practice (e.g., Griffiths & Steyvers, 2004) and use the last sample (z, q_{1:T}) to estimate φ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the j-th topic is γ_{post,j}^(k) = γ_j^(k) + n_j^(k), where the counts n_j^(k) are collected from z, q_{1:T}, w. We estimate φ_j by the first moment under this posterior (Minka, 1999):

  φ̂_j^(w) = ∏_{s∈I_j(↑w)} γ_{post,j}^(C_j(s↓w)) ( ∑_{k∈C_j(s)} γ_{post,j}^(k) )^{-1}.    (7)

The parameter θ is estimated the same way as in standard LDA: θ̂_j^(d) = (n_j^(d) + α) / (n^(d) + Tα).

5. Experiments

Synthetic Corpora: We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of Griffiths & Steyvers, 2004). Previous studies often take the last MCMC sample (z and q_{1:T}), and discuss the topics φ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well-mixed, as we observe all expected modes, and samples with "label switching" (i.e., equivalent up to label permutation) occur with near equal frequency. For each sample, we derive its topics φ_{1:T} with (7) and then greedily align the φ's from different samples, permuting the T topic labels to remove the label-switching effect. Within a dataset, we perform PCA on the baseline (η = 1) and project all samples into the resulting space to obtain a common visualization (each row in Figure 3; points are dithered to show overlap).

[Figure 3 appears here: rows of scatter plots, one row per experiment, with panels at increasing η, e.g., CL(A,B) at η = 1, 500, 1500; Isolate(B) at η = 1, 500, 1000; Split(AB,CD) at η = 1, 100, 500.]

Figure 3. PCA projections of permutation-aligned samples for the four synthetic data experiments.

Must-Link(B,C): The corpus consists of six documents over a vocabulary of five "words." The documents are: ABAB, CDCD, and EEEE, each represented twice. We let T = 2 and β = 0.01. LDA produces three kinds of φ_{1:T}: roughly a third of the time the topics are around [A/2 B/2 | C/4 D/4 E/2], which is shorthand for φ_1 = (1/2, 1/2, 0, 0, 0), φ_2 = (0, 0, 1/4, 1/4, 1/2) on the vocabulary ABCDE. Another third are around [A/4 B/4 E/2 | C/2 D/2], and the final third around [A/4 B/4 C/4 D/4 | E]. They correspond to clusters 1, 2 and 3 respectively in the upper-left panel of Figure 3. We add a single Must-Link(B,C). When η = 10, the data still override our Must-Link somewhat, because clusters 1 and 2 do not disappear completely. As η increases to 50, the Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the last sample is very likely to obtain the [A/4 B/4 C/4 D/4 | E] topics. This is what we want: B and C are present or absent together in the topics, and they also "pull" A, D along, even though A, D are not in the knowledge we added.

Cannot-Link(A,B): The corpus has four documents: ABCCABCC, ABDDABDD, twice each; T = 3, α = 1, β = 0.01. LDA produces six kinds of φ_{1:T} evenly: [B/2 D/2 | A | C], [A/2 B/2 | C | D], [A/2 D/2 | B | C], [B/2 C/2 | A | D], [A/2 C/2 | B | D], [C/2 D/2 | A | B], corresponding to clusters 1-5 and the "lines". We add a single Cannot-Link(A,B). As η increases, cluster 2 [A/2 B/2 | C | D] disappears, because it involves a topic A/2 B/2 that violates the Cannot-Link. The other clusters become uniformly more likely.

Isolate(B): The corpus has four documents, all of which are ABC; T = 2, α = 1, β = 0.01. LDA produces three clusters evenly: [A/2 C/2 | B], [A/2 B/2 | C], [B/2 C/2 | A]. We add Isolate(B), which is compiled into Cannot-Link(B,A) and Cannot-Link(B,C). The DF's samples concentrate to cluster 1: [A/2 C/2 | B], which indeed isolates B into its own topic.

Split(AB,CD): The corpus has six documents: ABCDEEEE, ABCDFFFF, each present three times, with β = 0.01. LDA with T = 3 produces a large portion of topics around [A/4 B/4 C/4 D/4 | E | F] (not shown). We add Split(AB,CD), which is compiled into Must-Link(A,B), Must-Link(C,D), Cannot-Link(B,C), and increase T to 4. However, DF with η = 1 (i.e., LDA with T = 4) produces a large variety of topics: e.g., cluster 1 is [A/4 3B/8 3D/8 | A/8 7F/8 | C | E], cluster 2 is [C/8 7D/8 | 3A/8 3B/8 C/4 | E | F], and cluster 7 is [A/2 B/2 | C/2 D/2 | E | F]. That is, simply adding one more topic does not clearly separate AB and CD. On the other hand, with increasing η, DF eventually concentrates on cluster 7, which satisfies the Split operation.

Wish Corpus: We now consider interactive topic modeling with DF. The corpus we use is a collection of 89,574 New Year's wishes submitted to The Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stopword removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the final sample. The domain knowledge in DF is accumulated along the steps.

Step 1: We run LDA with T = 15. Many of the most probable words in the topics are conventional ("to, and") or corpus-specific ("wish, 2008") stopwords, which obscure the meaning of the topics.

Step 2: We manually create a 50-word stopword list, and issue an Isolate preference. This is compiled into Must-Links among this set and Cannot-Links between this set and all other words in the top 50 for all topics. T is increased to 16. After running DF, we end up with two stopword topics. Importantly, with the stopwords explained by these two topics, the top words for the other topics become much more meaningful.

Step 3: We notice that one topic conflates two concepts: enter college and cure disease (top 8 words: "go school cancer into well free cure college"). We issue Split("go, school, into, college", "cancer, free, cure, well") to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. T is increased to 18. After running DF, one of the topics clearly takes on the "college" concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the "cure" concept (many wishes are like "mom stays cancer free"). Other topics have minor changes.

Step 4: We then notice that two topics correspond to romance concepts. We apply Merge("love, forever, marry, together, loves", "meet, boyfriend, married, girlfriend, wedding"), which is compiled into Must-Links between these words. T is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic ("lose", "weight" were in one of them, and remain so). Other previous topics survive with only minor changes. Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics, and color-code the words explicitly specified in the domain knowledge.

Table 1. Wish topics from interactive topic modeling.

  Op      | Topic   | Top words sorted by φ = p(word|topic)
  Merge   | love    | lose weight together forever marry meet
          | success | health happiness family good friends prosperity
          | life    | life happy best live time long wishes every years
          | -       | as do not what someone so like don much he
          | money   | out make money up house work able pay own lots
          | people  | no people stop less day every each other another
          | iraq    | home safe end troops iraq bring war return
          | joy     | love true peace happiness dreams joy everyone
          | family  | happy healthy family baby safe prosperous
          | vote    | better hope president paul ron than person bush
  Isolate | -       | and to for a the year in new all my
          | god     | god bless jesus everyone loved know heart christ
          | peace   | peace world earth win lottery around save
          | spam    | com call if u 4 www 23 visit 1
  Isolate | -       | i to wish my for and a be that the
  Split   | job     | go great school into good college hope move
  Split   | mom     | hope cancer free husband son well dad cure

Yeast Corpus: Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts transcription, translation and replication characterize three important processes that are carried out at the molecular level. The concepts initiation, elongation and termination describe phases of the three aforementioned processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is translational elongation, and GO:0006352 is transcription initiation). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) transcription, translation and replication to be represented in separate topics, and (ii) initiation, elongation and termination to be represented in separate topics. We do not set any preferences between the "process" topics and the "phase" topics, however.
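The link structure just described can be compiled mechanically from the concept seed word lists (the seed words below are taken from Table 2): Must-Links within each concept's seed set, Cannot-Links between seed sets of different concepts in the same group, and no links across the process/phase groups. A sketch under those assumptions (the compilation scheme and helper names are ours, not the authors' code):

```python
from itertools import combinations

# Concept seed words, grouped as in Table 2 (process vs. phase concepts).
processes = {
    "transcription": ["transcription", "transcriptional", "template"],
    "translation":   ["translation", "translational", "tRNA"],
    "replication":   ["replication", "cycle", "division"],
}
phases = {
    "initiation":  ["initiation", "start", "assembly"],
    "elongation":  ["elongation"],
    "termination": ["termination", "disassembly", "release", "stop"],
}

def compile_links(concept_groups):
    """Must-Links within each concept's seed set; Cannot-Links between
    seed sets of different concepts in the same group. No links are
    placed across groups, matching the stated preferences."""
    must, cannot = [], []
    for group in concept_groups:
        for words in group.values():
            must += list(combinations(words, 2))
        for c1, c2 in combinations(group, 2):
            cannot += [(u, v) for u in group[c1] for v in group[c2]]
    return must, cannot

must_links, cannot_links = compile_links([processes, phases])
```

Note that, e.g., "transcription" and "initiation" end up with no link between them, so composite topics such as transcription initiation remain possible.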
Table 2. Yeast topics. The left column shows the seed words in the DF model. The middle columns indicate the topics in which at least 2 seed words are among the 50 highest-probability words for LDA; the "o" column gives the number of other topics (not shared by another word). The right columns show the same topic-word relationships for the DF model.

[Table body: a seed-word-by-topic incidence matrix (LDA topics 1-8 plus "o"; DF topics 1-10) over the seed words transcription, transcriptional, template; translation, translational, tRNA; replication, cycle, division; initiation, start, assembly; elongation; termination, disassembly, release, stop. The individual cell marks did not survive extraction.]

The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set T = 100 and η = 5000. For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics. First, each concept is represented by a small number of topics, and the Must-Link words for each concept all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics combine the process and phase concepts compositionally. For example, DF Topic 4 represents transcription initiation and DF Topic 8 represents replication initiation. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For example, the top words in DF Topic 4 include "TATA", "TFIID", "promoter", and "recruitment", which are all specifically germane to the composite concept of transcription initiation. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as "cycle" and "division", often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.

Acknowledgments: This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and the Wisconsin Alumni Research Foundation.

References

Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (2008). Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC.

Blei, D., & Lafferty, J. (2006). Correlated topic models. In Advances in Neural Information Processing Systems 18, 147-154. Cambridge, MA: MIT Press.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. (2008). Modeling documents by combining semantic concepts with unsupervised statistical learning. Intl. Semantic Web Conf. (pp. 229-244). Springer.

Dennis III, S. Y. (1991). On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics - Theory and Methods, 20, 4069-4081.

Goldberg, A., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., & Zhu, X. (2009). May all your wishes come true: A study of wishes and how to recognize them. Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics. ACL Press.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proc. of the Nat. Academy of Sciences of the United States of America, 101, 5228-5235.

Griggs, J. R., Grinstead, C. M., & Guichard, D. R. (1988). The number of maximal independent sets in a connected graph. Discrete Math., 68, 211-220.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. Proc. of the 23rd Intl. Conf. on Machine Learning (pp. 577-584). ACM Press.

Minka, T. P. (1999). The Dirichlet-tree distribution (Technical Report). http://research.microsoft.com/minka/papers/dirichlet/minka-dirtree.pdf.

Tam, Y.-C., & Schultz, T. (2007). Correlated latent semantic model for unsupervised LM adaptation. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (pp. 41-44).

The Gene Ontology Consortium (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.