On Smoothing and Inference for Topic Models

Arthur Asuncion, Max Welling, Padhraic Smyth
Department of Computer Science, University of California, Irvine
Irvine, CA, USA
{asuncion, welling, smyth}@ics.uci.edu

Yee Whye Teh
Gatsby Computational Neuroscience Unit, University College London
London, UK
ywteh@gatsby.ucl.ac.uk

Abstract

Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.

1 INTRODUCTION

Latent Dirichlet Allocation (LDA) [Blei et al., 2003] and Probabilistic Latent Semantic Analysis (PLSA) [Hofmann, 2001] are well-known latent variable models for high-dimensional count data, such as text data in the bag-of-words representation or images represented through feature counts. Various inference techniques have been proposed, including collapsed Gibbs sampling (CGS) [Griffiths and Steyvers, 2004], variational Bayesian inference (VB) [Blei et al., 2003], collapsed variational Bayesian inference (CVB) [Teh et al., 2007], maximum likelihood estimation (ML) [Hofmann, 2001], and maximum a posteriori estimation (MAP) [Chien and Wu, 2008].

Among these algorithms, substantial performance differences have been observed in practice. For instance, Blei et al. [2003] have shown that the VB algorithm for LDA outperforms ML estimation for PLSA. Furthermore, Teh et al. [2007] have found that CVB is significantly more accurate than VB. But can these differences in performance really be attributed to the type of inference algorithm?

In this paper, we provide convincing empirical evidence that points in a different direction, namely that the claimed differences can be explained away by the different settings of two smoothing parameters (or hyperparameters). In fact, our empirical results suggest that these inference algorithms have relatively similar predictive performance when the hyperparameters for each method are selected in an optimal fashion. With hindsight, this phenomenon should not surprise us. Topic models operate in extremely high-dimensional spaces (with typically more than 10,000 dimensions) and, as a consequence, the "curse of dimensionality" is lurking around the corner; thus, hyperparameter settings have the potential to significantly affect the results.

We show that the potential perplexity gains by careful treatment of hyperparameters are on the order of (if not greater than) the differences between different inference algorithms. These results caution against using generic hyperparameter settings when comparing results across algorithms. This in turn raises the question as to whether newly introduced models and approximate inference algorithms have real merit, or whether the observed difference in predictive performance is attributable to suboptimal settings of hyperparameters for the algorithms being compared.

In performing this study, we discovered that an algorithm which suggests itself in thinking about inference algorithms in a unified way (but was never proposed by itself before) performs best, albeit marginally so. More importantly, it happens to be the most computationally efficient algorithm as well.

In the following section, we highlight the similarities between each of the algorithms. We then discuss the importance of hyperparameter settings. We show accuracy results, using perplexity and precision/recall metrics, for each algorithm over various text datasets. We then focus on computational efficiency and provide timing results across algorithms. Finally, we discuss related work and conclude with future directions.
[Figure 1: Graphical model for Latent Dirichlet Allocation. Boxes denote parameters, and shaded/unshaded circles denote observed/hidden variables.]

2 INFERENCE TECHNIQUES FOR LDA

LDA has roots in earlier statistical decomposition techniques, such as Latent Semantic Analysis (LSA) [Deerwester et al., 1990] and Probabilistic Latent Semantic Analysis (PLSA) [Hofmann, 2001]. Proposed as a generalization of PLSA, LDA was cast within the generative Bayesian framework to avoid some of the overfitting issues that were observed with PLSA [Blei et al., 2003]. A review of the similarities between LSA, PLSA, LDA, and other models can be found in Buntine and Jakulin [2006].

We describe the LDA model and begin with general notation. LDA assumes the standard bag-of-words representation, where D documents are each represented as a vector of counts with W components, where W is the number of words in the vocabulary. Each document j in the corpus is modeled as a mixture over K topics, and each topic k is a distribution over the vocabulary of W words. Each topic, φ_k, is drawn from a Dirichlet with parameter η, while each document's mixture, θ_j, is sampled from a Dirichlet with parameter α.¹ For each token i in the corpus, a topic assignment z_i is sampled from θ_{d_i}, and the specific word x_i is drawn from φ_{z_i}. The generative process is below:

\theta_{k,j} \sim \mathcal{D}[\alpha] \qquad \phi_{w,k} \sim \mathcal{D}[\eta] \qquad z_i \sim \theta_{k,d_i} \qquad x_i \sim \phi_{w,z_i}

In Figure 1, the graphical model for LDA is presented in a slightly unconventional fashion, as a Bayesian network where θ_{kj} and φ_{wk} are conditional probability tables and i runs over all tokens in the corpus. Each token's document index d_i is explicitly shown as an observed variable, in order to show LDA's correspondence to the PLSA model.

Exact inference (i.e. computing the posterior over the hidden variables) for this model is intractable [Blei et al., 2003], and so a variety of approximate algorithms have been developed. If we ignore α and η and treat θ_{kj} and φ_{wk} as parameters, we obtain the PLSA model, and maximum likelihood (ML) estimation over θ_{kj} and φ_{wk} directly corresponds to PLSA's EM algorithm. Adding the hyperparameters α and η back in leads to MAP estimation.

¹ We use symmetric Dirichlet priors for simplicity in this paper.
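As an illustration of this generative process, the short sketch below samples a tiny synthetic corpus. It is not code from the paper; the corpus sizes, the Poisson document-length distribution, and the variable names are arbitrary choices made for the example.

```python
import numpy as np

# Minimal sketch of the LDA generative process described above.
# D, W, K, alpha, eta, and the Poisson document lengths are illustrative choices only.
rng = np.random.default_rng(0)
D, W, K = 5, 20, 3            # documents, vocabulary size, topics
alpha, eta = 0.5, 0.5         # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(W, eta), size=K)      # phi[k]: topic k's distribution over words
theta = rng.dirichlet(np.full(K, alpha), size=D)  # theta[j]: document j's mixture over topics

corpus = []
for j in range(D):
    n_j = rng.poisson(30)                               # document length (not specified by LDA itself)
    z = rng.choice(K, size=n_j, p=theta[j])             # topic assignment for each token
    x = np.array([rng.choice(W, p=phi[k]) for k in z])  # word drawn from the assigned topic
    corpus.append(x)

print(corpus[0])  # word indices of the first synthetic document
```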
2.1 ML ESTIMATION

The PLSA algorithm described in Hofmann [2001] can be understood as an expectation maximization algorithm for the model depicted in Figure 1. We start the derivation by writing the log-likelihood as

\ell = \sum_i \log \sum_{z_i} P(x_i | z_i, \phi) \, P(z_i | d_i, \theta)

from which we derive, via a standard EM derivation, the updates (where we have left out explicit normalizations):

P(z_i | x_i, d_i) \propto P(x_i | z_i, \phi) \, P(z_i | d_i, \theta)    (1)

\phi_{w,k} \propto \sum_i I[x_i = w, z_i = k] \, P(z_i = k | x_i, d_i)    (2)

\theta_{k,j} \propto \sum_i I[z_i = k, d_i = j] \, P(z_i = k | x_i, d_i)    (3)

These updates can be rewritten by defining γ_wjk = P(z = k | x = w, d = j), N_wj the number of observations for word type w in document j, N_wk = Σ_j N_wj γ_wjk, N_kj = Σ_w N_wj γ_wjk, N_k = Σ_w N_wk, and N_j = Σ_k N_kj, giving

\phi_{w,k} = N_{wk} / N_k \qquad \theta_{k,j} = N_{kj} / N_j .

Plugging these expressions back into the expression for the posterior (1), we arrive at the update

\gamma_{wjk} \propto \frac{N_{wk} \, N_{kj}}{N_k}    (4)

where the constant N_j is absorbed into the normalization. Hofmann [2001] regularizes the PLSA updates by raising the right-hand side of (4) to a power β > 0 and searching for the best value of β on a validation set.

2.2 MAP ESTIMATION

We treat θ, φ as random variables from now on. We add Dirichlet priors with strengths η for φ and α for θ, respectively. This extension was introduced as "latent Dirichlet allocation" in Blei et al. [2003].

It is possible to optimize for the MAP estimate of θ, φ. The derivation is very similar to the ML derivation in the previous section, except that we now have terms corresponding to the log of the Dirichlet priors, which are equal to Σ_wk (η − 1) log φ_wk and Σ_kj (α − 1) log θ_kj. After working through the math, we derive the following update (de Freitas and Barnard [2001], Chien and Wu [2008]):

\gamma_{wjk} \propto \frac{(N_{wk} + \eta - 1)(N_{kj} + \alpha - 1)}{N_k + W\eta - W}    (5)

where α, η > 1. Upon convergence, MAP estimates are obtained:

\hat{\phi}_{wk} = \frac{N_{wk} + \eta - 1}{N_k + W\eta - W} \qquad \hat{\theta}_{kj} = \frac{N_{kj} + \alpha - 1}{N_j + K\alpha - K}    (6)
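To make the shared structure of updates (4) and (5) concrete, the following sketch performs one full EM sweep of the responsibilities and expected counts on a dense word-document count matrix. This is our own illustrative reimplementation under the notation above (not code from the paper); setting α = η = 1 recovers the unregularized ML/PLSA update (4).

```python
import numpy as np

def map_em_sweep(Nwj, gamma, alpha, eta):
    """One EM sweep of the MAP update (5); alpha = eta = 1 gives the ML/PLSA update (4).

    Nwj   : (W, D) observed word-document counts
    gamma : (W, D, K) responsibilities gamma[w, j, k] = P(z = k | x = w, d = j)
    """
    W, D, K = gamma.shape
    # Expected counts N_wk, N_kj, N_k under the current responsibilities
    Nwk = np.einsum('wj,wjk->wk', Nwj, gamma)
    Nkj = np.einsum('wj,wjk->kj', Nwj, gamma)
    Nk = Nwk.sum(axis=0)
    # Update (5): gamma_wjk ~ (N_wk + eta - 1)(N_kj + alpha - 1) / (N_k + W*eta - W)
    gamma = (Nwk + eta - 1.0)[:, None, :] * (Nkj + alpha - 1.0).T[None, :, :]
    gamma /= (Nk + W * eta - W)[None, None, :]
    gamma /= gamma.sum(axis=2, keepdims=True)        # normalize over topics
    return gamma, Nwk, Nkj, Nk

# Toy usage with random counts and a random initialization
rng = np.random.default_rng(0)
W, D, K = 50, 10, 5
Nwj = rng.poisson(1.0, size=(W, D)).astype(float)
gamma = rng.dirichlet(np.ones(K), size=(W, D))
for _ in range(20):
    gamma, Nwk, Nkj, Nk = map_em_sweep(Nwj, gamma, alpha=1.1, eta=1.1)
phi_hat = (Nwk + 0.1) / (Nk + W * 0.1)               # MAP estimate (6), with eta - 1 = 0.1
```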
2.3 VARIATIONAL BAYES

The variational Bayesian approximation (VB) to LDA follows the standard variational EM framework [Attias, 2000, Ghahramani and Beal, 2000]. We introduce a factorized (and hence approximate) variational posterior distribution:

Q(\phi, \theta, z) = \prod_k q(\phi_k) \prod_j q(\theta_j) \prod_i q(z_i) .

Using this assumption in the variational formulation of the EM algorithm [Neal and Hinton, 1998], we readily derive the VB updates analogous to the ML updates of Eqns. 2, 3, and 1:

q(\phi_k) = \mathcal{D}[\eta + N_{\cdot k}], \quad N_{wk} = \sum_i q(z_i = k) \, \delta(x_i, w)    (7)

q(\theta_j) = \mathcal{D}[\alpha + N_{\cdot j}], \quad N_{kj} = \sum_i q(z_i = k) \, \delta(d_i, j)    (8)

q(z_i) \propto \exp\big( E[\log \theta_{z_i, d_i}]_{q(\theta)} + E[\log \phi_{x_i, z_i}]_{q(\phi)} \big)    (9)

We can insert the expression for q(φ) at (7) and q(θ) at (8) into the update for q(z) in (9) and use the fact that E[log X_i]_{D(X)} = Ψ(X_i) − Ψ(Σ_j X_j), with Ψ(·) being the "digamma" function. As a final observation, note that there is nothing in the free energy that would render differently the distributions q(z_i) for tokens that correspond to the same word type w in the same document j. Hence, we can simplify and update a single prototype of that equivalence class, denoted as γ_wjk ≜ q(z_i = k) for tokens with x_i = w and d_i = j, as follows:

\gamma_{wjk} \propto \frac{\exp(\Psi(N_{wk} + \eta))}{\exp(\Psi(N_k + W\eta))} \, \exp(\Psi(N_{kj} + \alpha))    (10)

We note that exp(Ψ(n)) ≈ n − 0.5 for n > 1. Since N_wk, N_kj, and N_k are aggregations of expected counts, we expect many of these counts to be greater than 1. Thus, the VB update can be approximated as follows:

\gamma_{wjk} \approx \frac{N_{wk} + \eta - 0.5}{N_k + W\eta - 0.5} \, (N_{kj} + \alpha - 0.5)    (11)

which exposes the relation to the MAP update in (5).

In closing this section, we mention that the original VB algorithm derived in Blei et al. [2003] was a hybrid version between what we call VB and ML here. Although they did estimate variational posteriors q(θ), the φ were treated as parameters and were estimated through ML.

2.4 COLLAPSED VARIATIONAL BAYES

It is possible to marginalize out the random variables θ_kj and φ_wk from the joint probability distribution. Following a variational treatment, we can introduce variational posteriors over the z variables, which is once again assumed to be factorized: Q(z) = Π_i q(z_i). This collapsed variational free energy represents a strictly better bound on the (negative) evidence than the original VB [Teh et al., 2007]. The derivation of the update equation for the q(z_i) is slightly more complicated and involves approximations to compute intractable summations. The update is given below²:

\gamma_{ijk} \propto \frac{N_{wk}^{\neg ij} + \eta}{N_k^{\neg ij} + W\eta} \, (N_{kj}^{\neg ij} + \alpha) \, \exp\left( -\frac{V_{kj}^{\neg ij}}{2(N_{kj}^{\neg ij} + \alpha)^2} - \frac{V_{wk}^{\neg ij}}{2(N_{wk}^{\neg ij} + \eta)^2} + \frac{V_{k}^{\neg ij}}{2(N_{k}^{\neg ij} + W\eta)^2} \right)    (12)

N_kj^¬ij denotes the expected number of tokens in document j assigned to topic k (excluding the current token), and can be calculated as follows: N_kj^¬ij = Σ_{i'≠i} γ_{i'jk}. For CVB, there is also a variance associated with each count: V_kj^¬ij = Σ_{i'≠i} γ_{i'jk}(1 − γ_{i'jk}). For further details we refer to Teh et al. [2007].

The update in (12) makes use of a second-order Taylor expansion as an approximation. A further approximation can be made by using only the zeroth-order information³:

\gamma_{ijk} \propto \frac{N_{wk}^{\neg ij} + \eta}{N_k^{\neg ij} + W\eta} \, (N_{kj}^{\neg ij} + \alpha)    (13)

We refer to this approximate algorithm as CVB0.

2.5 COLLAPSED GIBBS SAMPLING

MCMC techniques are available for LDA as well. In collapsed Gibbs sampling (CGS) [Griffiths and Steyvers, 2004], θ_kj and φ_wk are integrated out (as in CVB) and sampling of the topic assignments is performed sequentially in the following manner:

P(z_{ij} = k \,|\, z^{\neg ij}, x_{ij} = w) \propto \frac{N_{wk}^{\neg ij} + \eta}{N_k^{\neg ij} + W\eta} \, (N_{kj}^{\neg ij} + \alpha)    (14)

N_wk denotes the number of word tokens of type w assigned to topic k, N_kj is the number of tokens in document j assigned to topic k, and N_k = Σ_w N_wk. N^¬ij denotes the count with token ij removed. Note that standard non-collapsed Gibbs sampling over φ, θ, and z can also be performed, but we have observed that CGS mixes more quickly in practice.

² For convenience, we switch back to the conventional indexing scheme for LDA where i runs over tokens in document j.
³ The first-order information becomes zero in this case.
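As a concrete illustration of Eq. (14), the sketch below runs one sweep of collapsed Gibbs sampling over a corpus stored as flat arrays of token-level document and word indices. The data layout and function name are our own choices, not from the paper; replacing the sampling step by retaining the full normalized distribution (and adding the fractional γ values to the counts) would give the deterministic CVB0 update (13) instead.

```python
import numpy as np

def cgs_sweep(docs, words, z, Nwk, Nkj, Nk, alpha, eta, rng):
    """One pass of collapsed Gibbs sampling, Eq. (14). Counts and z are updated in place.

    docs, words : (N,) document index d and word index x for every token
    z           : (N,) current topic assignments
    Nwk, Nkj, Nk: count arrays consistent with z
    """
    W = Nwk.shape[0]
    for i in range(len(z)):
        w, j, k_old = words[i], docs[i], z[i]
        # Remove the current token's contribution from the counts (the "neg ij" counts)
        Nwk[w, k_old] -= 1; Nkj[k_old, j] -= 1; Nk[k_old] -= 1
        # Eq. (14): p(z = k | rest) ~ (N_wk + eta) / (N_k + W*eta) * (N_kj + alpha)
        p = (Nwk[w] + eta) / (Nk + W * eta) * (Nkj[:, j] + alpha)
        k_new = rng.choice(len(p), p=p / p.sum())
        # Add the token back under its newly sampled assignment
        Nwk[w, k_new] += 1; Nkj[k_new, j] += 1; Nk[k_new] += 1
        z[i] = k_new

# Toy usage: random corpus, random initialization, a few sweeps
rng = np.random.default_rng(0)
W, D, K, N = 30, 8, 4, 500
words, docs = rng.integers(0, W, size=N), rng.integers(0, D, size=N)
z = rng.integers(0, K, size=N)
Nwk, Nkj, Nk = np.zeros((W, K), int), np.zeros((K, D), int), np.zeros(K, int)
np.add.at(Nwk, (words, z), 1); np.add.at(Nkj, (z, docs), 1); np.add.at(Nk, z, 1)
for _ in range(50):
    cgs_sweep(docs, words, z, Nwk, Nkj, Nk, alpha=0.5, eta=0.5, rng=rng)
```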
2.6 COMPARISON OF ALGORITHMS

A comparison of update equations (5), (11), (12), (13), (14) reveals the similarities between these algorithms. All of these updates consist of a product of terms featuring N_wk and N_kj, as well as a denominator featuring N_k. These updates resemble the Callen equations [Teh et al., 2007], which the true posterior distribution must satisfy (with Z as the normalization constant):

P(z_{ij} = k \,|\, x) = E_{p(z^{\neg ij} | x)} \left[ \frac{1}{Z} \, \frac{N_{wk}^{\neg ij} + \eta}{N_k^{\neg ij} + W\eta} \, (N_{kj}^{\neg ij} + \alpha) \right]

We highlight the striking connections between the algorithms. Interestingly, the probabilities for CGS (14) and CVB0 (13) are exactly the same. The only difference is that CGS samples each topic assignment while CVB0 deterministically updates a discrete distribution over topics for each token. Another way to view this connection is to imagine that CGS can sample each topic assignment z_ij R times using (14) and maintain a distribution over these samples with which it can update the counts. As R → ∞, this distribution will be exactly (13) and this algorithm will be CVB0. The fact that algorithms like CVB0 are able to propagate the entire uncertainty in the topic distribution during each update suggests that deterministic algorithms should converge more quickly than CGS.

CVB0 and CVB are almost identical as well, the distinction being the inclusion of second-order information for CVB.

The conditional distributions used in VB (11) and MAP (5) are also very similar to those used for CGS, with the main difference being the presence of offsets of up to 0.5 and 1 in the numerator terms for VB and MAP, respectively. Through the setting of the hyperparameters α and η, these extra offsets in the numerator can be eliminated, which suggests that these algorithms can be made to perform similarly with appropriate hyperparameter settings.

Another intuition which sheds light on this phenomenon is as follows. Variational methods like VB are known to underestimate posterior variance [Wang and Titterington, 2004]. In the case of LDA this is reflected in the offset of -0.5: typical values of φ and θ in the variational posterior tend to concentrate more mass on the high-probability words and topics, respectively. We can counteract this by incrementing the hyperparameters by 0.5, which encourages more probability mass to be smoothed to all words and topics. Similarly, MAP offsets by -1, concentrating even more mass on high-probability words and topics, and requiring even more smoothing by incrementing α and η by 1.

Other subtle differences between the algorithms exist. For instance, VB subtracts only 0.5 from the denominator while MAP removes W, which suggests that VB applies more smoothing to the denominator. Since N_k is usually large, we do not expect this difference to play a large role in learning. For the collapsed algorithms (CGS, CVB), the counts N_wk, N_kj, N_k are updated after each token update. Meanwhile, the standard formulations of VB and MAP update these counts only after sweeping through all the tokens. This update schedule may affect the rate of convergence [Neal and Hinton, 1998]. Another difference is that the collapsed algorithms remove the count for the current token ij.

As we will see in the experimental results section, the performance differences among these algorithms that were observed in previous work can be substantially reduced when the hyperparameters are optimized for each algorithm.
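A schematic way to see this comparison is to write each algorithm's unnormalized topic weight for a single (word, document) pair as the same product with different offsets. The sketch below is our own illustration of that view (it ignores the removal of the current token used by the collapsed algorithms and CVB's second-order variance terms), and the toy counts are arbitrary.

```python
import numpy as np
from scipy.special import digamma

def topic_weights(Nwk_w, Nkj_j, Nk, alpha, eta, W, method):
    """Unnormalized topic weights for one (word, document) pair under each update rule."""
    if method == "vb":        # Eq. (10): exp(digamma(.)) terms, roughly an offset of -0.5
        return np.exp(digamma(Nwk_w + eta) - digamma(Nk + W * eta) + digamma(Nkj_j + alpha))
    if method == "map":       # Eq. (5): offset of -1 in the numerators
        return (Nwk_w + eta - 1) * (Nkj_j + alpha - 1) / (Nk + W * eta - W)
    if method == "cgs/cvb0":  # Eqs. (13)/(14): no offset
        return (Nwk_w + eta) * (Nkj_j + alpha) / (Nk + W * eta)
    raise ValueError(method)

# With hyperparameters shifted by the corresponding offsets (0.5 for VB, 1 for MAP),
# the three rules produce similar normalized topic distributions once the counts are large.
Nwk_w = np.array([40.0, 5.0, 1.0])     # word-topic counts for this word
Nkj_j = np.array([12.0, 3.0, 2.0])     # document-topic counts for this document
Nk = np.array([900.0, 400.0, 300.0])   # total topic counts
W = 1000
for method, a, e in [("cgs/cvb0", 0.5, 0.5), ("vb", 1.0, 1.0), ("map", 1.5, 1.5)]:
    wts = topic_weights(Nwk_w, Nkj_j, Nk, a, e, W, method)
    print(f"{method:9s}", np.round(wts / wts.sum(), 3))
```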
3 THE ROLE OF HYPERPARAMETERS

The similarities between the update equations for these algorithms shed light on the important role that hyperparameters play. Since the amount of smoothing in the updates differentiates the algorithms, it is important to have good hyperparameter settings. In previous results [Teh et al., 2007, Welling et al., 2008b, Mukherjee and Blei, 2009], hyperparameters for VB were set to small values like α = 0.1, η = 0.1, and consequently the performance of VB was observed to be significantly suboptimal in comparison to CVB and CGS. Since VB effectively adds a discount of up to 0.5 in the updates, greater values for α and η are necessary for VB to perform well. We discuss hyperparameter learning and the role of hyperparameters in prediction.

3.1 HYPERPARAMETER LEARNING

It is possible to learn the hyperparameters during training. One approach is to place Gamma priors on the hyperparameters (η ~ G[a, b], α ~ G[c, d]) and use Minka's fixed-point iterations [Minka, 2000], e.g.:

\alpha \leftarrow \frac{c - 1 + \hat{\alpha} \sum_j \sum_k [\Psi(N_{kj} + \hat{\alpha}) - \Psi(\hat{\alpha})]}{d + K \sum_j [\Psi(N_j + K\hat{\alpha}) - \Psi(K\hat{\alpha})]}

Other ways for learning hyperparameters include Newton-Raphson and other fixed-point techniques [Wallach, 2008], as well as sampling techniques [Teh et al., 2006]. Another approach is to use a validation set and to explore various settings of α, η through grid search. We explore several of these approaches later in the paper.
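A minimal sketch of this fixed-point update for a symmetric α, assuming the Gamma prior is parameterized by shape c and rate d as above (c = 1, d = 0 reduces to the unregularized update); the synthetic-data check at the end is our own illustration, not an experiment from the paper.

```python
import numpy as np
from scipy.special import digamma

def minka_alpha_update(Nkj, alpha, c=1.0, d=0.0):
    """One fixed-point step for a symmetric alpha (Section 3.1).

    Nkj : (K, D) topic-document counts; c, d are the Gamma prior parameters.
    """
    K, D = Nkj.shape
    Nj = Nkj.sum(axis=0)                                              # tokens per document
    num = c - 1.0 + alpha * (digamma(Nkj + alpha) - digamma(alpha)).sum()
    den = d + K * (digamma(Nj + K * alpha) - digamma(K * alpha)).sum()
    return num / den

# Sanity check on synthetic counts drawn from a Dirichlet-multinomial with known alpha
rng = np.random.default_rng(0)
K, D, true_alpha = 20, 200, 0.5
theta = rng.dirichlet(np.full(K, true_alpha), size=D)
Nkj = np.stack([rng.multinomial(100, theta[j]) for j in range(D)], axis=1).astype(float)
alpha = 2.0
for _ in range(100):
    alpha = minka_alpha_update(Nkj, alpha)
print(alpha)   # typically lands near true_alpha
```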
3.2 PREDICTION

Hyperparameters play a role in prediction as well. Consider the update for MAP in (5) and the estimates for φ_wk and θ_kj in (6), and note that the terms used in learning are the same as those used in prediction. Essentially the same can be said for the collapsed algorithms, since the following Rao-Blackwellized estimates are used, which bear resemblance to terms in the updates (14), (13):

\hat{\phi}_{wk} = \frac{N_{wk} + \eta}{N_k + W\eta} \qquad \hat{\theta}_{kj} = \frac{N_{kj} + \alpha}{N_j + K\alpha}    (15)

In the case of VB, the expected values of the posterior Dirichlets in (7) and (8) are used in prediction, leading to estimates for φ and θ of the same form as (15). However, for VB, an offset of 0.5 is found in update equation (10) while it is not found in the estimates used for prediction. The knowledge that VB's update equation contains an effective offset of up to 0.5 suggests the use of an alternative estimate for prediction:

\hat{\phi}_{wk} \propto \frac{\exp(\Psi(N_{wk} + \eta))}{\exp(\Psi(N_k + W\eta))} \qquad \hat{\theta}_{kj} \propto \frac{\exp(\Psi(N_{kj} + \alpha))}{\exp(\Psi(N_j + K\alpha))}    (16)

Note the similarity that these estimates bear to the VB update (10). Essentially, the 0.5 offset is introduced into these estimates just as it is found in the update. Another way to mimic this behavior is to use α + 0.5 and η + 0.5 during learning and then use α and η for prediction, using the estimates in (15). We find that correcting this "mismatch" between the update and the estimate reduces the performance gap between VB and the other algorithms. Perhaps this phenomenon bears relationships to the observation of Wainwright [2006], who shows that in certain cases it is beneficial to use the same algorithm for both learning and prediction, even if that algorithm is approximate rather than exact.

4 EXPERIMENTS

Seven different text datasets are used to evaluate the performance of these algorithms: Cranfield-subset (CRAN), Kos (KOS), Medline-subset (MED), NIPS (NIPS), 20 Newsgroups (NEWS), NYT-subset (NYT), and Patent (PAT). Several of these datasets are available online at the UCI ML Repository [Asuncion and Newman, 2007]. The characteristics of these datasets are summarized in Table 1.

Table 1: Datasets used in experiments

NAME   D       W       Ntrain       Dtest
CRAN   979     3,763   81,773       210
KOS    3,000   6,906   410,595      215
MED    9,300   5,995   886,306      169
NIPS   1,500   12,419  1,932,365    92
NEWS   19,500  27,059  2,057,207    249
NYT    6,800   16,253  3,768,969    139
PAT    6,500   19,447  14,328,094   106

Each dataset is separated into a training set and a test set. We learn the model on the training set, and then we measure the performance of the algorithms on the test set. We also have a separate validation set of the same size as the test set that can be used to tune the hyperparameters.

To evaluate accuracy, we use perplexity, a widely-used metric in the topic modeling community. While perplexity is a somewhat indirect measure of predictive performance, it is nonetheless a useful characterization of the predictive quality of a language model, and it has been shown to be well-correlated with other measures of performance such as word-error rate in speech recognition [Klakow and Peters, 2002]. We also report precision/recall statistics.

We describe how perplexity is computed. For each of our algorithms, we perform runs lasting 500 iterations and we obtain the estimate φ̂_wk at the end of each of those runs. To obtain θ̂_kj, one must learn the topic assignments on the first half of each document in the test set while holding φ̂_wk fixed. For this fold-in procedure, we use the same learning algorithm that we used for training. Perplexity is evaluated on the second half of each document in the test set, given φ̂_wk and θ̂_kj. For CGS, one can average over multiple samples (where S is the number of samples to average over):

\log p(x^{test}) = \sum_{jw} N_{jw} \log \frac{1}{S} \sum_s \sum_k \hat{\theta}_{kj}^{s} \hat{\phi}_{wk}^{s}

In our experiments we don't perform averaging over samples for CGS (other than in Figure 7, where we explicitly investigate averaging), both for computational reasons and to provide a fair comparison to the other algorithms. Using a single sample from CGS is consistent with its use as an efficient stochastic "mode-finder" to find a set of interpretable topics for a document set.

For each experiment, we perform three different runs using different initializations, and report the average of these perplexities. Usually these perplexities are similar to each other across different initializations (e.g. 10 or less).
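A minimal sketch of this perplexity computation for a single sample (S = 1). The paper folds in with the same algorithm used for training; for brevity this sketch folds in with a few simple EM-style updates of θ with φ̂ held fixed, and the function name and data layout are our own.

```python
import numpy as np

def perplexity(phi_hat, test_docs, alpha, n_fold_iters=50):
    """Fold in on the first half of each test document, evaluate on the second half.

    phi_hat   : (W, K) topic-word estimates from training
    test_docs : list of 1-D arrays of word indices
    """
    total_log_prob, total_tokens = 0.0, 0
    for doc in test_docs:
        half = len(doc) // 2
        fold, held_out = doc[:half], doc[half:]
        K = phi_hat.shape[1]
        theta = np.full(K, 1.0 / K)
        for _ in range(n_fold_iters):                 # EM-style fold-in with phi_hat fixed
            resp = phi_hat[fold] * theta              # (len(fold), K) unnormalized responsibilities
            resp /= resp.sum(axis=1, keepdims=True)
            theta = resp.sum(axis=0) + alpha          # smoothed document-topic counts
            theta /= theta.sum()
        total_log_prob += np.log(phi_hat[held_out] @ theta).sum()
        total_tokens += len(held_out)
    return np.exp(-total_log_prob / total_tokens)

# Toy usage with a random "trained" model and random test documents
rng = np.random.default_rng(0)
W, K = 100, 5
phi_hat = rng.dirichlet(np.full(W, 0.1), size=K).T    # (W, K)
test_docs = [rng.integers(0, W, size=60) for _ in range(10)]
print(perplexity(phi_hat, test_docs, alpha=0.5))
```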
4.1 PERPLEXITY RESULTS

In our first set of experiments, we investigate the effects of learning the hyperparameters during training using Minka's fixed-point updates. We compare CGS, VB, CVB, and CVB0 in this set of experiments, and leave out MAP since Minka's update does not apply to MAP. For each run, we initialize the hyperparameters to α = 0.5, η = 0.5 and turn on Minka's updates after 15 iterations to prevent numerical instabilities. Every other iteration, we compute perplexity on the validation set to allow us to perform early stopping if necessary. Figure 2 shows the test perplexity as a function of iteration for each algorithm on the MED dataset. These perplexity results suggest that CVB and CVB0 outperform VB when Minka's updates are used. The reason is that the learned hyperparameters for VB are too small and do not correct for the effective 0.5 offset found in the VB update equations. Also, CGS converges more slowly than the deterministic algorithms.

[Figure 2: Convergence plot showing perplexities on MED, K = 40; hyperparameters learned through Minka's update.]

[Figure 3: Perplexities achieved with hyperparameter learning through Minka's update, on various datasets, K = 40.]

In Figure 3, we show the final perplexities achieved with hyperparameter learning (through Minka's update) on each dataset. VB performs worse on several of the datasets compared to the other algorithms. We also found that CVB0 usually learns the highest level of smoothing, followed by CVB, while Minka's updates for VB learn small values for α, η.

In our experiments thus far, we fixed the number of topics at K = 40. In Figure 4, we vary the number of topics from 10 to 160. In this experiment, CGS/CVB/CVB0 perform similarly, while VB learns less accurate solutions. Interestingly, the CVB0 perplexity at K = 160 is higher than the perplexity at K = 80. This is due to the fact that a high value for η was learned for CVB0. When we set η = 0.13 (to the K = 80 level), the CVB0 perplexity is 1464, matching CGS. These results suggest that learning hyperparameters during training (using Minka's updates) does not necessarily lead to the optimal solution in terms of test perplexity.

[Figure 4: Perplexity as a function of number of topics, on NIPS, with Minka's update enabled.]

In the next set of experiments, we use a grid of hyperparameters for each of α and η, [0.01, 0.1, 0.25, 0.5, 0.75, 1], and we run the algorithms for each combination of hyperparameters. We include MAP in this set of experiments, and we shift the grid to the right by 1 for MAP (since hyperparameters less than 1 cause MAP to have negative probabilities). We perform grid search on the validation set, find the best hyperparameter settings (according to validation set perplexity), and use the corresponding estimates for prediction. For VB, we report both the standard perplexity calculation and the alternative calculation that was detailed previously in (16).
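A sketch of this grid search. Training a full LDA model for every grid point is beyond the scope of a snippet, so `validation_perplexity` below is a dummy stand-in for "train with (α, η), fold in on the validation set, and return its perplexity"; only the grid values and the MAP shift come from the paper.

```python
import itertools
import numpy as np

# Dummy stand-in: in the actual experiments this would train a model with (alpha, eta)
# and return the validation-set perplexity obtained via the fold-in procedure.
def validation_perplexity(alpha, eta):
    return 1500.0 + 200.0 * (np.log(alpha / 0.5) ** 2 + np.log(eta / 0.5) ** 2)

grid = [0.01, 0.1, 0.25, 0.5, 0.75, 1.0]        # grid used in the paper
map_grid = [g + 1.0 for g in grid]              # shifted by 1 for MAP

best = min(itertools.product(grid, grid), key=lambda ae: validation_perplexity(*ae))
print("best (alpha, eta):", best, "-> perplexity", round(validation_perplexity(*best), 1))
```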
[Figure 5: Perplexities achieved through grid search, K = 40.]

In Figure 5, we report the results achieved through performing grid search. The differences between VB (with the alternative calculation) and CVB have largely vanished. This is due to the fact that we are using larger values for the hyperparameters, which allows VB to reach parity with the other algorithms. The alternative prediction scheme for VB also helps to reduce the perplexity gap, especially for the NEWS dataset. Interestingly, CVB0 appears to perform slightly better than the other algorithms.

[Figure 6: TOP: KOS, K = 40. BOTTOM: MED, K = 40. Perplexity as a function of η. We fixed α to 0.5 (1.5 for MAP). Relative to the other curves, the VB and MAP curves are in effect shifted right by approximately 0.5 and 1.]

Figure 6 shows the test perplexity of each method as a function of η. It is visually apparent that VB and MAP perform better when their hyperparameter values are offset by 0.5 and 1, respectively, relative to the other methods. While this picture is not as clear-cut for every dataset (since the approximate VB update holds only when n > 1), we have consistently observed that the minimum perplexities achieved by VB are at hyperparameter values that are higher than the ones used by the other algorithms.

In the previous experiments, we used one sample for CGS to compute perplexity. With enough samples, CGS should be the most accurate algorithm. In Figure 7, we show the effects of averaging over 10 different samples for CGS, taken over 10 different runs, and find that CGS gains substantially from averaging samples. It is also possible for other methods like CVB0 to average over their local posterior "modes", but we found the resulting gain is not as great.

[Figure 7: The effect of averaging over 10 samples/modes on KOS, K = 40.]

We also tested whether the algorithms would perform similarly in cases where the training set size is very small or the number of topics is very high. We ran VB and CVB with grid search on half of CRAN and achieved virtually the same perplexities. We also ran VB and CVB on CRAN with K = 100, and only found a 16-point perplexity gap.

To summarize our perplexity results, we juxtapose three different ways of setting hyperparameters in Figure 8, for NIPS, K = 40. The first way is to use the same arbitrary values across all algorithms (e.g. α = 0.1, η = 0.1). The second way is to learn the hyperparameters through Minka's update. The third way is to find the hyperparameters by grid search. For the third way, we also show the VB perplexity achieved by the alternative estimates.

[Figure 8: POINT: when α = 0.1, η = 0.1, VB performs substantially worse than other methods. MINKA: when using Minka's updates, the differences are less prominent. GRID: when grid search is performed, differences diminish even more, especially with the alternative estimates.]

4.2 PRECISION/RECALL RESULTS

We also calculated precision/recall statistics on the NEWS dataset. Since each document in NEWS is associated with one of twenty newsgroups, one can label each document by its corresponding newsgroup. It is possible to use the topic model for classification and to compute precision/recall statistics. In Figure 9, we show the mean area under the ROC curve (AUC) achieved by CGS, VB, CVB, and CVB0 with hyperparameter learning through Minka's update. We also performed grid search over α, η and found that each method was able to achieve similar statistics. For instance, on NEWS, K = 10, each algorithm achieved the same area under the ROC curve (0.90) and mean average precision (0.14). These results are consistent with the perplexity results in the previous section.

[Figure 9: Mean AUC achieved on NEWS, K = 40, with Minka's update.]

5 COMPUTATIONAL EFFICIENCY

While the algorithms can give similarly accurate solutions, some of these algorithms are more efficient than others. VB contains digamma functions which are computationally expensive, while CVB requires the maintenance of variance counts. Meanwhile, the stochastic nature of CGS causes it to converge more slowly than the deterministic algorithms.

In practice, we advocate using CVB0 since: 1) it is faster than VB/CVB given that there are no calls to digamma or variance counts to maintain; 2) it converges more quickly than CGS since it is deterministic; 3) it does not have MAP's -1 offset issue. Furthermore, our empirical results suggest that CVB0 learns models that are as good or better (predictively) than those learned by the other algorithms.

These algorithms can be parallelized over multiple processors as well. The updates in MAP estimation can be performed in parallel without affecting the fixed point, since MAP is an EM algorithm [Neal and Hinton, 1998]. Since the other algorithms are very closely related to MAP, there is confidence that performing parallel updates over tokens for the other algorithms would lead to good results as well.
Table 2: Timing results (in seconds)

                MED    KOS   NIPS
VB              151.6  73.8  126.0
CVB             25.1   9.0   21.7
CGS             18.2   3.8   10.0
CVB0            9.5    4.0   8.4
Parallel-CVB0   2.4    1.5   3.0

While non-collapsed algorithms such as MAP and VB can be readily parallelized, the collapsed algorithms are sequential, and thus there has not been a theoretical basis for parallelizing CVB or CGS (although good empirical results have been achieved for approximate parallel CGS [Newman et al., 2008]). We expect that a version of CVB0 that parallelizes over tokens would converge to the same quality of solution as sequential CVB0, since CVB0 is essentially MAP but without the -1 offset.⁴

⁴ If one wants convergence guarantees, one should also not remove the current token ij.

In Table 2, we show timing results for VB, CVB, CGS, and CVB0 on MED, KOS, and NIPS, with K = 10. We record the amount of time it takes for each algorithm to pass a fixed perplexity threshold (the same threshold for each algorithm). Since VB contains many calls to the digamma function, it is slower than the other algorithms. Meanwhile, CGS needs more iterations before it can reach the same perplexity, since it is stochastic. We see that CVB0 is computationally the fastest approach among these algorithms. We also parallelized CVB0 on a machine with 8 cores and find that a topic model with coherent topics can be learned in 1.5 seconds for KOS. These results suggest that it is feasible to learn topic models in near real-time for small corpora.

6 RELATED WORK & CONCLUSIONS

Some of these algorithms have been compared to each other in previous work. Teh et al. [2007] formulate the CVB algorithm and empirically compare it to VB, while Mukherjee and Blei [2009] theoretically analyze the differences between VB and CVB and give cases for when CVB should perform better than VB. Welling et al. [2008b] also compare the algorithms and introduce a hybrid CGS/VB algorithm. In all these studies, low values of η and α were used for each algorithm, including VB. Our insights suggest that VB requires more smoothing in order to match the performance of the other algorithms.

The similarities between PLSA and LDA have been noted in the past [Girolami and Kaban, 2003]. Others have unified similar deterministic latent variable models [Welling et al., 2008a] and matrix factorization techniques [Singh and Gordon, 2008]. In this work, we highlight the similarities between various learning algorithms for LDA.
While we focused on LDA and PLSA in this paper, we believe that the insights gained are relevant to learning in general directed graphical models with Dirichlet priors, and generalizing these results to other models is an interesting avenue to pursue in the future.

In conclusion, we have found that the update equations for these algorithms are closely connected, and that using the appropriate hyperparameters causes the performance differences between these algorithms to largely disappear. These insights suggest that hyperparameters play a large role in learning accurate topic models. Our comparative study also showed that there exist accurate and efficient learning algorithms for LDA and that these algorithms can be parallelized, allowing us to learn accurate models over thousands of documents in a matter of seconds.

Acknowledgements

This work is supported in part by NSF Awards IIS-0083489 (PS, AA), IIS-0447903 and IIS-0535278 (MW), and an NSF graduate fellowship (AA), as well as ONR grants 00014-06-1-073 (MW) and N00014-08-1-1015 (PS). PS is also supported by a Google Research Award. YWT is supported by the Gatsby Charitable Foundation.

References

A. Asuncion and D. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/mlearn/MLRepository.html.
H. Attias. A variational Bayesian framework for graphical models. In NIPS 12, pages 209–215. MIT Press, 2000.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
W. Buntine and A. Jakulin. Discrete component analysis. Lecture Notes in Computer Science, 3940:1, 2006.
J.-T. Chien and M.-S. Wu. Adaptive Bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):198–207, 2008.
N. de Freitas and K. Barnard. Bayesian latent semantic analysis of multimedia databases. Technical Report TR-2001-15, University of British Columbia, 2001.
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In NIPS 12, pages 449–455. MIT Press, 2000.
M. Girolami and A. Kaban. On an equivalence between PLSI and LDA. In SIGIR '03, pages 433–434. ACM, New York, NY, USA, 2003.
T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228–5235, 2004.
T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.
D. Klakow and J. Peters. Testing the correlation of word error rate and perplexity. Speech Communication, 38(1-2):19–28, 2002.
T. Minka. Estimating a Dirichlet distribution. 2000. URL http://research.microsoft.com/minka/papers/dirichlet/.
I. Mukherjee and D. M. Blei. Relative performance guarantees for approximate inference in latent Dirichlet allocation. In NIPS 21, pages 1129–1136, 2009.
R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89:355–368, 1998.
D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS 20, pages 1081–1088. MIT Press, 2008.
A. Singh and G. Gordon. A unified view of matrix factorization models. In ECML PKDD, pages 358–373. Springer, 2008.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS 19, pages 1353–1360, 2007.
M. J. Wainwright. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. JMLR, 7:1829–1859, 2006.
H. M. Wallach. Structured Topic Models for Language. PhD thesis, University of Cambridge, 2008.
B. Wang and D. Titterington. Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values. In UAI, pages 577–584, 2004.
M. Welling, C. Chemudugunta, and N. Sutter. Deterministic latent variable models and their pitfalls. In SIAM International Conference on Data Mining, 2008a.
M. Welling, Y. W. Teh, and B. Kappen. Hybrid variational/MCMC inference in Bayesian networks. In UAI, volume 24, 2008b.
