Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors

David Andrzejewski (andrzeje@cs.wisc.edu), Xiaojin Zhu (jerryzhu@cs.wisc.edu), Mark Craven (craven@biostat.wisc.edu)
Department of Computer Sciences, and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model's ability to follow and generalize beyond user-specified domain knowledge.

1. Introduction

Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words "termination", "disassembly" and "release" appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people's wishes, inspect the resulting topics, and notice that "into, college" and "cure, cancer" all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, re-run topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.

In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial p(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.
2. Related Work

We review LDA using the notation of Griffiths and Steyvers (2004). Let there be T topics. Let w = w_1 ... w_n represent a corpus of D documents, with a total of n words. We use d_i to denote the document of word w_i, and z_i the hidden topic from which w_i is generated. Let φ_j^(w) = p(w|z=j), and θ_j^(d) = p(z=j) for document d. The LDA generative model is then:

θ ~ Dirichlet(α)   (1)
z_i | θ^(d_i) ~ Multinomial(θ^(d_i))   (2)
φ ~ Dirichlet(β)   (3)
w_i | z_i, φ ~ Multinomial(φ^(z_i))   (4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β, but asymmetric hyperparameters are also possible.

Previous work has modeled correlations in the LDA document-topic mixtures using the logistic Normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet Tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) employs domain knowledge through special "concept" topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on words (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.

3. Topic Modeling with Dirichlet Forest

Our proposed model differs from LDA in the way φ is generated. Instead of (3), we have

q ~ DirichletForest(β, η)
φ ~ DirichletTree(q)

where q specifies a Dirichlet tree distribution, β plays a role analogous to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the "strength parameter" of the domain knowledge. Before discussing DirichletForest(β, η) and DirichletTree(q), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.

3.1. Must-Links and Cannot-Links

Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore propose the following definitions:

Must-Link(u, v): Two words u, v have similar probability within any topic, i.e., φ_j^(u) ≈ φ_j^(v) for j = 1 ... T. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link(termination, disassembly).

Cannot-Link(u, v): Two words u, v should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link(college, cure).

Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments: we can Split two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can Merge two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can Isolate them by placing Must-Links within the common set, and then placing Cannot-Links between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are preferences instead of hard constraints.

3.2. Encoding Must-Links

It is well-known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except for the normalization constraint (Minka, 1999). However, for Must-Link(u, v) it is crucial to control the two words u, v differently than other words. The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^(k) be the Dirichlet tree edge weight leading into node k. Let C(k) be the immediate children of node k in the tree, L the leaves of the tree, I the internal nodes, and L(k) the leaves in the subtree under k. To generate a sample φ ~ DirichletTree(γ), one first draws a multinomial at each internal node s ∈ I from Dirichlet(γ^(C(s))), i.e., using the weights from s to its children as the Dirichlet parameters. One can think of it as re-distributing the probability mass reaching s by this multinomial (initially, the mass is 1 at the root). The probability φ^(k) of a word k ∈ L is then simply the product of the multinomial parameters on the edges from k to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives φ ~ DirichletTree(γ):

p(φ|γ) = ( ∏_{k∈L} (φ^(k))^{γ^(k)−1} ) ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / ∏_{k∈C(s)} Γ(γ^(k)) ] ( ∑_{k∈L(s)} φ^(k) )^{Δ(s)}

where Γ(·) is the standard gamma function, and the notation ∏_{k∈L} means the product over k ∈ L. The function Δ(s) ≡ γ^(s) − ∑_{k∈C(s)} γ^(k) is the difference between the in-degree and out-degree of internal node s. When this difference Δ(s) = 0 for all internal nodes s ∈ I, the Dirichlet tree reduces to a Dirichlet distribution. Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out φ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:

p(w|γ) = ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / Γ(∑_{k∈C(s)} (γ^(k) + n^(k))) ] ∏_{k∈C(s)} [ Γ(γ^(k) + n^(k)) / Γ(γ^(k)) ]   (5)

Here n^(k) is the number of word tokens in w that appear in L(k).

We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link(u, v) and Must-Link(v, w) imply Must-Link(u, w). We thus first compute the transitive closures of expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a subtree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to these internal nodes s with weight |L(s)|β, where |·| represents the set size. In addition, the root directly connects to other words not in any closure, with weight β. For example, the transitive closure for a Must-Link(A, B) on vocabulary {A, B, C} is simply {A, B}, corresponding to the Dirichlet tree in Figure 1(a).

To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest, η = 1. Then in-degree equals out-degree for any internal node s (both are |L(s)|β), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the re-distribution of probability mass at s (governed by a Dirichlet under s) has increasing concentration η|L(s)|β but the same uniform base-measure. This tends to redistribute the mass evenly in the transitive closure represented by s. Therefore, the Must-Links are turned on when η > 1. Furthermore, the mass reaching s is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in all topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that p(A) ≈ p(B), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50, 50, 1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d)).

3.3. Encoding Cannot-Links

Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link(A, B) and Cannot-Link(B, C) do not entail Cannot-Link(A, C). We define a Cannot-Link-graph where the nodes are words[1], and the edges correspond to the Cannot-Links. Then the connected components of this graph are independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A, B) and (B, C) form the graph in Figure 1(e) with a single connected component {A, B, C}.

Consider the subgraph on connected component r. We define its complement graph by flipping the edges (on to off, off to on), as shown in Figure 1(f). Let there be Q(r) maximal cliques M_r1 ... M_rQ(r) in this complement graph. In the following, we simply call them "cliques", but it is important to remember that they are maximal cliques of the complement graph, not the original Cannot-Link-graph. In our example, Q(r) = 2 and M_r1 = {A, C}, M_r2 = {B}. These cliques have the following interpretation: each clique (e.g., M_r1 = {A, C}) is a maximal subset of words in the connected component that can "occur together".

[1] When there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph.
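The compilation just described can be sketched in code. The following is a minimal sketch, not the authors' implementation: Must-Link transitive closures are the connected components of the Must-Link graph; closures then collapse to single nodes in the Cannot-Link-graph, and each connected component's complement graph is handed to a small Bron-Kerbosch routine to enumerate its maximal cliques M_r1 ... M_rQ(r). All function and variable names here are illustrative.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Connected components; used both for Must-Link transitive
    closures and for the Cannot-Link-graph components."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def maximal_cliques(nodes, adj):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(nodes), set())
    return cliques

def compile_links(vocab, must_links, cannot_links):
    """Return Must-Link closures and, per Cannot-Link-graph component r,
    the maximal cliques of its complement graph."""
    closures = [c for c in connected_components(vocab, must_links) if len(c) > 1]
    # Each Must-Link closure collapses to a single node (footnote 1).
    rep = {w: w for w in vocab}
    for c in closures:
        for w in c:
            rep[w] = min(c)  # arbitrary representative
    cl_edges = {(rep[u], rep[v]) for u, v in cannot_links}
    cl_nodes = {rep[w] for w in vocab}
    per_component = []
    for comp in connected_components(cl_nodes, cl_edges):
        if len(comp) == 1:
            continue  # no Cannot-Link touches this node
        # Complement graph restricted to this component.
        adj = {u: {v for v in comp if v != u
                   and (u, v) not in cl_edges and (v, u) not in cl_edges}
               for u in comp}
        per_component.append(maximal_cliques(comp, adj))
    return closures, per_component

# The running example: Cannot-Link(A,B) and Cannot-Link(B,C) yield one
# component whose complement has maximal cliques {A,C} and {B}.
closures, comps = compile_links(
    {"A", "B", "C"}, must_links=[], cannot_links=[("A", "B"), ("B", "C")])
```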
[Figure 1. Encoding Must-Links and Cannot-Links with a Dirichlet Forest. (a) A Dirichlet tree encoding Must-Link(A,B) with β = 1, η = 50 on vocabulary {A, B, C}. (b) A sample from this Dirichlet tree. (c) A large set of samples from the Dirichlet tree, plotted on the 3-simplex. Note p(A) ≈ p(B), yet they remain flexible in actual value, which is desirable for a Must-Link. (d) In contrast, samples from a standard Dirichlet with comparable parameters (50, 50, 1) force p(A) ≈ p(B) ≈ 0.5, and cannot encode a Must-Link. (e) The Cannot-Link-graph for Cannot-Link(A,B) and Cannot-Link(B,C). (f) The complementary graph, with two maximal cliques {A, C} and {B}. (g) The Dirichlet subtree for clique {A, C}. (h) The Dirichlet subtree for clique {B}. (i) Samples from the mixture model on (g, h), encoding both Cannot-Links, again with β = 1, η = 50.]

That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., "B") to also have a large probability will violate at least 1 Cannot-Link (in this example 2).

We discuss the encoding for this single connected component r now, deferring discussion of the complete encoding to Section 3.4. We create a mixture model of Q(r) Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

p(q) ∝ |M_rq|,  q = 1 ... Q(r).   (6)

Conceptually, the selected subtree indexed by q tends to redistribute nearly all probability mass to the words within M_rq. Since there is no mass left for other cliques, it is impossible for a word outside clique M_rq to have a large probability. Therefore, no Cannot-Link will be violated. In reality, the subtrees are soft rather than hard, because Cannot-Links are only preferences. The Dirichlet subtree for M_rq is structured as follows. The subtree's root connects to an internal node s with weight η|M_rq|β. The node s connects to words in M_rq with weight β. The subtree's root also directly connects to words not in M_rq (but in the connected component r) with weight β. This will send most probability mass down to s, and then flexibly redistribute it among words in M_rq. For example, Figures 1(g, h) show the Dirichlet subtrees for M_r1 = {A, C} and M_r2 = {B} respectively. Samples from this mixture model are shown in Figure 1(i), representing multinomials in which no Cannot-Link is violated. Such behavior is not achievable by a Dirichlet distribution, or a single Dirichlet tree[2].

Finally, we mention that although in the worst case the number of maximal cliques Q(r) in a connected component of size |r| can grow exponentially as O(3^(|r|/3)) (Griggs et al., 1988), in our experiments Q(r) is no larger than 3, due in part to Must-Linked words "collapsing" to single nodes in the Cannot-Link graph.

[2] Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1, 0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak: the "pseudo counts" in them are too small because of the small concentration. The posterior will be dominated by the data, and we would lose any encoded domain knowledge.

3.4. The Dirichlet Forest Prior

In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be "consistent" in that no pairs of words are simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through Must-Link transitive closure). Let R be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of ∏_{r=1}^R Q(r) Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has R branches beneath the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the r-th branch, there are Q(r) possible Dirichlet subtrees, corresponding to cliques M_r1 ... M_rQ(r). Therefore, a tree in the forest is uniquely identified by an index vector q = (q^(1) ... q^(R)), where q^(r) ∈ {1 ... Q(r)}.

[Figure 2. Template of Dirichlet trees in the Dirichlet Forest.]

To draw a Dirichlet tree q from the prior DirichletForest(β, η), we select the subtrees independently because the R connected components are independent with respect to Cannot-Links: p(q) = ∏_{r=1}^R p(q^(r)). Each q^(r) is sampled according to (6), and corresponds to choosing a solid box for the r-th branch in Figure 2. The structure of the subtree within the solid box has been defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading to most nodes k is γ^(k) = |L(k)|β, where L(k) is the set of leaves under k. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, their weights are multiplied by the strength parameter η. These edges are marked by "*" in Figure 2.

We now define the complete Dirichlet Forest model, integrating out ("collapsing") θ and φ. Let n_j^(d) be the number of word tokens in document d that are assigned to topic j. z is generated the same as in LDA:

p(z|α) = ( Γ(Tα) / Γ(α)^T )^D ∏_{d=1}^D [ ∏_{j=1}^T Γ(n_j^(d) + α) ] / Γ(n^(d) + Tα).

There is one Dirichlet tree q_j per topic j = 1 ... T, sampled from the Dirichlet Forest prior p(q_j) = ∏_{r=1}^R p(q_j^(r)). Each Dirichlet tree q_j implicitly defines its tree edge weights γ_j^(·) using β, η, and its tree structure L_j, I_j, C_j(·). Let n_j^(k) be the number of word tokens in the corpus assigned to topic j that appear under the node k in the Dirichlet tree q_j. The probability of generating the corpus w, given the trees q_{1:T} ≡ q_1 ... q_T and the topic assignment z, can be derived using (5):

p(w|q_{1:T}, z, β, η) = ∏_{j=1}^T ∏_{s∈I_j} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} [ Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)) ].

Finally, the complete generative model is

p(w, z, q_{1:T} | α, β, η) = p(w|q_{1:T}, z, β, η) p(z|α) ∏_{j=1}^T p(q_j).

4. Inference for Dirichlet Forest

Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov Chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels z and the tree indices q_{1:T}. An MCMC iteration in our case consists of a sweep through both z and q_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.

(Sampling z_i): Let n_{-i,j}^(d) be the number of word tokens in document d assigned to topic j, excluding the word at position i. Similarly, let n_{-i,j}^(k) be the number of word tokens in the corpus that are under node k in topic j's Dirichlet tree, excluding the word at position i. For candidate topic labels v = 1 ... T, we have

p(z_i = v | z_{-i}, q_{1:T}, w) ∝ (n_{-i,v}^(d) + α) ∏_{s∈I_v(↑i)} [ γ_v^(C_v(s↓i)) + n_{-i,v}^(C_v(s↓i)) ] / [ ∑_{k∈C_v(s)} (γ_v^(k) + n_{-i,v}^(k)) ],

where I_v(↑i) denotes the subset of internal nodes in topic v's Dirichlet tree that are ancestors of leaf w_i, and C_v(s↓i) is the unique node that is s's immediate child and an ancestor of w_i (including w_i itself).

(Sampling q_j^(r)): Since the connected components are independent, sampling the tree q_j factors into sampling the cliques for each connected component q_j^(r). For candidate cliques q' = 1 ... Q(r), we have

p(q_j^(r) = q' | z, q_{-j}, q_j^(-r), w) ∝ |M_rq'| ∏_{s∈I_{j,r}^{q'}} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} [ Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)) ],

where I_{j,r}^{q'} denotes the internal nodes below the r-th branch of tree q_j, when clique M_rq' is selected.
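The (Sampling z_i) conditional above walks the path from the root of topic v's Dirichlet tree down to the leaf w_i, multiplying one ratio per ancestor. The following is a minimal sketch of that computation under hypothetical data structures (a tree is a dict mapping each internal node to its child-edge weights γ; `counts` holds the n_{-i,v}^(k) per node); it is not the authors' code, and names are illustrative.

```python
def is_ancestor(tree, node, leaf):
    """True if `leaf` lies in the subtree under `node` (or equals it)."""
    if node == leaf:
        return True
    return any(is_ancestor(tree, k, leaf) for k in tree.get(node, {}))

def topic_weight(tree, root, counts, word, n_doc_topic, alpha):
    """Unnormalized p(z_i = v | z_-i, q_1:T, w) for one topic v:
    (n^(d)_{-i,v} + alpha) times, for each ancestor s of leaf `word`,
    (gamma + count) on the child edge toward `word`, divided by the
    sum of (gamma + count) over all children of s."""
    weight = n_doc_topic + alpha
    node = root
    while node != word:
        children = tree[node]  # dict: child -> gamma edge weight into child
        c = next(k for k in children if is_ancestor(tree, k, word))
        num = children[c] + counts.get(c, 0)
        den = sum(g + counts.get(k, 0) for k, g in children.items())
        weight *= num / den
        node = c
    return weight

# Must-Link(A,B) tree of Figure 1(a) with beta = 1, eta = 50:
# root -> closure node "s" (weight |L(s)|*beta = 2) and leaf "C" (weight 1);
# "s" -> leaves "A", "B" (weight eta*beta = 50 each).
tree = {"root": {"s": 2.0, "C": 1.0}, "s": {"A": 50.0, "B": 50.0}}
# Example counts n_{-i,v}^(k) for one topic, excluding the resampled token.
counts = {"A": 3, "B": 2, "s": 5, "C": 1}
w = topic_weight(tree, "root", counts, "A", n_doc_topic=4, alpha=0.5)
# w = (4 + 0.5) * (2+5)/((2+5)+(1+1)) * (50+3)/((50+3)+(50+2))
```

The full sampler would compute this weight for every topic v and sample v proportionally; normalization is unnecessary inside the product, which is why the collapsed conditional stays cheap per token.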
(Estimating φ and θ): After running MCMC for sufficient iterations, we follow standard practice (e.g., (Griffiths & Steyvers, 2004)) and use the last sample (z, q_{1:T}) to estimate φ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the j-th topic is γ_j^{post,(k)} = γ_j^(k) + n_j^(k), where the counts n_j^(k) are collected from z, q_{1:T}, w. We estimate φ_j by the first moment under this posterior (Minka, 1999):

φ̂_j^(w) = ∏_{s∈I_j(↑w)} [ γ_j^{post,(C_j(s↓w))} / ∑_{k∈C_j(s)} γ_j^{post,(k)} ].   (7)

The parameter θ is estimated the same way as in standard LDA: θ̂_j^(d) = (n_j^(d) + α) / (n^(d) + Tα).

5. Experiments

Synthetic Corpora: We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of (Griffiths & Steyvers, 2004)). Previous studies often take the last MCMC sample (z and q_{1:T}), and discuss the topics φ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well-mixed, as we observe all expected modes, and that samples with "label switching" (i.e., equivalent up to label permutation) occur with near equal frequency. For each sample, we derive its topics φ_{1:T} with (7) and then greedily align the φ's from different samples, permuting the T topic labels to remove the label switching effect. Within a dataset, we perform PCA on the baseline (η = 1) and project all samples into the resulting space to obtain a common visualization (each row in Figure 3; points are dithered to show overlap).

Must-Link(B,C): The corpus consists of six documents over a vocabulary of five "words." The documents are: ABAB, CDCD, and EEEE, each represented twice. We let T = 2, α = 0.5, β = 0.01. LDA produces three kinds of φ_{1:T}: roughly a third of the time the topics are around [A/2 B/2 | C/4 D/4 E/2], which is shorthand for φ_1 = (1/2, 1/2, 0, 0, 0), φ_2 = (0, 0, 1/4, 1/4, 1/2) on the vocabulary ABCDE. Another third are around [A/4 B/4 E/2 | C/2 D/2], and the final third around [A/4 B/4 C/4 D/4 | E]. They correspond to clusters 1, 2 and 3 respectively in the upper-left panel of Figure 3.

[Figure 3. PCA projections of permutation-aligned samples for the four synthetic data experiments.]

We add a single Must-Link(B,C). When η = 10, the data still override our Must-Link somewhat, because clusters 1 and 2 do not disappear completely. As η increases to 50, the Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the last sample is very likely to obtain the [A/4 B/4 C/4 D/4 | E] topics. This is what we want: B and C are present or absent together in the topics, and they also "pull" A, D along, even though A, D are not in the knowledge we added.

Cannot-Link(A,B): The corpus has four documents: ABCCABCC, ABDDABDD, twice each; T = 3, α = 1, β = 0.01. LDA produces six kinds of φ_{1:T} evenly: [B/2 D/2 | A | C], [A/2 B/2 | C | D], [A/2 D/2 | B | C], [B/2 C/2 | A | D], [A/2 C/2 | B | D], [C/2 D/2 | A | B], corresponding to clusters 1-5 and the "lines". We add a single Cannot-Link(A,B). As the DF η increases, cluster 2 [A/2 B/2 | C | D] disappears, because it involves a topic A/2 B/2 that violates the Cannot-Link. The other clusters become uniformly more likely.

Isolate(B): The corpus has four documents, all of which are ABC; T = 2, α = 1, β = 0.01. LDA produces three clusters evenly: [A/2 C/2 | B], [A/2 B/2 | C], [B/2 C/2 | A]. We add Isolate(B), which is compiled into Cannot-Link(B,A) and Cannot-Link(B,C). The DF's samples concentrate on cluster 1: [A/2 C/2 | B], which indeed isolates B into its own topic.

Split(AB,CD): The corpus has six documents: ABCDEEEE, ABCDFFFF, each present three times; α = 0.5, β = 0.01. LDA with T = 3 produces a large portion of topics around [A/4 B/4 C/4 D/4 | E | F] (not shown). We add Split(AB,CD), which is compiled into Must-Link(A,B), Must-Link(C,D), Cannot-Link(B,C), and increase T to 4. However, DF with η = 1 (i.e., LDA with T = 4) produces a large variety of topics: e.g., clusters 1 and 2 mix A, B, C, D across topics without cleanly separating AB from CD, while cluster 7 is [A/2 B/2 | C/2 D/2 | E | F]. That is, simply adding one more topic does not clearly separate AB and CD. On the other hand, with increasing η, DF eventually concentrates on cluster 7, which satisfies the Split operation.

Wish Corpus: We now consider interactive topic modeling with DF. The corpus we use is a collection of 89,574 New Year's wishes submitted to The Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stopword removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the final sample. The domain knowledge in DF is accumulative along the steps.

Step 1: We run LDA with T = 15. Many of the most probable words in the topics are conventional ("to, and") or corpus-specific ("wish, 2008") stopwords, which obscure the meaning of the topics.

Step 2: We manually create a 50-word stopword list, and issue an Isolate preference. This is compiled into Must-Links among this set and Cannot-Links between this set and all other words in the top 50 for all topics. T is increased to 16. After running DF, we end up with two stopword topics. Importantly, with the stopwords explained by these two topics, the top words for the other topics become much more meaningful.

Step 3: We notice that one topic conflates two concepts: enter college and cure disease (top 8 words: "go school cancer into well free cure college"). We issue Split("go, school, into, college", "cancer, free, cure, well") to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. T is increased to 18. After running DF, one of the topics clearly takes on the "college" concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the "cure" concept (many wishes are like "mom stays cancer free"). Other topics have minor changes.

Table 1. Wish topics from interactive topic modeling. (DF operations are shown next to the most affected topics; words explicitly specified in the domain knowledge were color-coded in the original.)

Op      | Topic   | Top words sorted by φ = p(word|topic)
Merge   | love    | lose weight together forever marry meet
        | success | health happiness family good friends prosperity
        | life    | life happy best live time long wishes every years
        | -       | as do not what someone so like don much he
        | money   | out make money up house work able pay own lots
        | people  | no people stop less day every each other another
        | iraq    | home safe end troops iraq bring war return
        | joy     | love true peace happiness dreams joy everyone
        | family  | happy healthy family baby safe prosperous
        | vote    | better hope president paul ron than person bush
Isolate |         | and to for a the year in new all my
        | god     | god bless jesus everyone loved know heart christ
        | peace   | peace world earth win lottery around save
        | spam    | com call if u 4 www 23 visit 1
Isolate |         | i to wish my for and a be that the
Split   | job     | go great school into good college hope move
Split   | mom     | hope cancer free husband son well dad cure
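The compilation performed in Step 2 above, turning one Isolate preference into Must-Link and Cannot-Link primitives, can be sketched as follows. This is a simplified sketch: the word sets are illustrative stand-ins, not the paper's actual 50-word stopword list, and it relies on the fact that a Must-Link closure collapses to a single node, so one Cannot-Link from a representative word suffices per outside word.

```python
from itertools import combinations

def isolate(isolated_words, other_top_words):
    """Compile Isolate(S): pairwise Must-Links within S, plus a
    Cannot-Link between the S-closure and every other word that
    appears among the high-probability words of some topic."""
    isolated = sorted(isolated_words)
    must = list(combinations(isolated, 2))
    rep = isolated[0]  # representative of the Must-Link closure
    cannot = [(rep, w) for w in sorted(other_top_words)
              if w not in isolated_words]
    return must, cannot

# Tiny illustrative stopword set vs. two content words.
must, cannot = isolate({"to", "and", "the"}, {"wish", "love", "the"})
# must   -> [('and', 'the'), ('and', 'to'), ('the', 'to')]
# cannot -> [('and', 'love'), ('and', 'wish')]
```

The Split and Merge operations of Steps 3 and 4 compile analogously: Must-Links within each word set, and (for Split) a single Cannot-Link between representatives of the two closures.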
Step 4: We then notice that two topics correspond to romance concepts. We apply Merge("love, forever, marry, together, loves", "meet, boyfriend, married, girlfriend, wedding"), which is compiled into Must-Links between these words. T is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic ("lose", "weight" were in one of them, and remain so). Other previous topics survive with only minor changes. Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics, and color-code the words explicitly specified in the domain knowledge.

Yeast Corpus: Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts transcription, translation and replication characterize three important processes that are carried out at the molecular level. The concepts initiation, elongation and termination describe phases of the three aforementioned processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is translational elongation, and GO:0006352 is transcription initiation). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) transcription, translation and replication to be represented in separate topics, and (ii) initiation, elongation and termination to be represented in separate topics. We do not set any preferences between the "process" topics and the "phase" topics, however.
Table 2. Yeast topics. The left column shows the seed words in the DF model: transcription, transcriptional, template (transcription); translation, translational, tRNA (translation); replication, cycle, division (replication); initiation, start, assembly (initiation); elongation (elongation); termination, disassembly, release, stop (termination). The middle columns indicate the topics in which at least 2 seed words are among the 50 highest-probability words for LDA; the "o" column gives the number of other topics (not shared by another word). The right columns show the same topic-word relationships for the DF model. [The per-cell indicator marks of the original table are not recoverable here.]

The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set T = 100 and η = 5000.

For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics. First, each concept is represented by a small number of topics, and the Must-Link words for each topic all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics use the process and phase concepts compositionally. For example, DF Topic 4 represents transcription initiation and DF Topic 8 represents replication initiation. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For example, the top words in DF Topic 4 include "TATA", "TFIID", "promoter", and "recruitment", which are all specifically germane to the composite concept of transcription initiation. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as "cycle" and "division", often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.

Acknowledgments: This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and the Wisconsin Alumni Research Foundation.

References

Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (2008). Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC.

Blei, D., & Lafferty, J. (2006). Correlated topic models. In Advances in Neural Information Processing Systems 18, 147-154. Cambridge, MA: MIT Press.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. (2008). Modeling documents by combining semantic concepts with unsupervised statistical learning. Intl. Semantic Web Conf. (pp. 229-244). Springer.

Dennis III, S. Y. (1991). On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics - Theory and Methods, 20, 4069-4081.

Goldberg, A., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., & Zhu, X. (2009). May all your wishes come true: A study of wishes and how to recognize them. Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics. ACL Press.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proc. of the Nat. Academy of Sciences of the United States of America, 101, 5228-5235.

Griggs, J. R., Grinstead, C. M., & Guichard, D. R. (1988). The number of maximal independent sets in a connected graph. Discrete Math., 68, 211-220.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. Proc. of the 23rd Intl. Conf. on Machine Learning (pp. 577-584). ACM Press.

Minka, T. P. (1999). The Dirichlet-tree distribution (Technical Report). http://research.microsoft.com/minka/papers/dirichlet/minka-dirtree.pdf.

Tam, Y.-C., & Schultz, T. (2007). Correlated latent semantic model for unsupervised LM adaptation. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (pp. 41-44).

The Gene Ontology Consortium (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.
