They have been applied to a vast variety of data sets, contexts, and tasks, with varying degrees of success. However, to date there is almost no formal theory explicating the LDA's behavior, and despite its familiarity there has been very little systematic analysis of the factors that determine its performance.
Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

topic-model friendly? Why did the LDA fail on my data? How many documents do I need to learn 100 topics? It is difficult even for experts to provide quick and satisfactory answers to those questions. As noted above, some behaviors of the LDA are known intuitively in the machine learning folklore but have never been theoretically justified, such as the LDA's deficiency in handling short documents. Other situations remain quite mysterious, such as how the LDA performs on a few lengthy documents, and what happens when it overfits with a larger number of topics than actually present in the data. We will show in this paper that, like Liebig's barrel, the performance of the LDA is limited by the length of the shortest stave, making it challenging to judge its effectiveness without thorough empirical experiments.

In this paper we aim to provide a systematic analysis of the behavior of the LDA in various settings. We identify several limiting factors whose interactions play the crucial role in determining the LDA's performance. These factors include the number of documents, the length of individual documents, the number of topics, and the Dirichlet (hyper)parameters. The main contributions include: (i) theoretical results explicating the convergence behavior of the posterior distribution of latent topics as the amount of training data increases; as shown in Section 2, the convergence behavior is found to depend on the interactions of the limiting factors mentioned above; (ii) a thorough empirical study, reported in Section 3, that provides support for the theory by varying the settings of the limiting factors on synthetic and real data sets; (iii) these findings are translated into a number of concrete guidelines regarding the practical use of the LDA model; these are discussed in detail in Section 4.

2. Posterior Contraction Analysis of Topic Polytope in the LDA

The latent Dirichlet allocation (LDA) model explains the generation of text documents, which can be viewed as samples from a mixture of multinomial distributions over a vocabulary of words. Each multinomial mixture component is called a topic. We refer the reader to Blei et al. (2003) and Pritchard et al. (2000) for a background on the model. Here we provide a geometric reformulation of the LDA model and present several theoretical results that explain the behavior of the posterior distribution of the latent topic variables as the amount of training data increases.

2.1. Latent Topic Polytope in the LDA

Let $V$ denote the vocabulary (set of words). Each topic $\theta_k$ is a vector in $\Delta^{|V|-1}$, the $(|V|-1)$-dimensional probability simplex. We assume that the documents are generated by $K$ topics $\theta = (\theta_1, \ldots, \theta_K)$. Each document in the corpus $d \in \{1, \ldots, D\}$ is then associated with a topic proportion vector $\pi_d = (\pi_{d,1}, \ldots, \pi_{d,K}) \in \Delta^{K-1}$, where $\pi_{d,k}$ is the probability that each word in document $d$ is assigned to topic $k$. Equivalently, document $d$ uniquely corresponds to a word probability vector $\eta_d = \sum_{k=1}^{K} \pi_{d,k}\,\theta_k \in \Delta^{|V|-1}$. The words of document $d$, i.e. $W_{d[N_d]} := (w_{dn})_{n=1}^{N_d}$, are independent and identically distributed samples from a multinomial distribution parameterized by this probability vector $\eta_d$. To simplify the analysis, we assume all documents have the same number of words, $N_d = N$. The joint distribution of the full data set $W_{[D][N]} := (W_{d[N]})_{d=1}^{D}$, denoted by $P^{D}_{W[N]}$, is the product distribution of all single-document distributions: $P^{D}_{W[N]}(W_{[D][N]}) := \prod_{d=1}^{D} P_{W[N]}(W_{d[N]})$.

Our primary interest is the inference of the topic parameters $\theta$ on the basis of the sampled $D \times N$ words $W_{[D][N]}$, and how this inference is affected by the way that the words form into the documents. Note that compared to the original definition of Blei et al. (2003), this alternative representation does not involve the latent assignment variables $z_{dn}$; they are simply marginalized out.

In a Bayesian estimation setting of the LDA model, the document-topic and topic-word proportion vectors (the $\pi_d$'s and $\theta_k$'s) are assumed to be random and endowed with Dirichlet prior distributions, parameterized by hyperparameters $\alpha = (\alpha_1, \ldots, \alpha_K)$ and $\beta = (\beta_1, \ldots, \beta_{|V|})$, respectively. Accordingly, one is interested in the behavior of the posterior distribution of the topic parameters given the observed documents. In particular, we want to understand the convergence behavior of the posterior topic distribution when the total amount of data $D \times N$ increases to infinity. It is expected that the posterior distribution of the topic variables should contract toward their true values as one has more data. A natural question to ask is the rate at which this posterior contraction phenomenon occurs. In order to establish such an analysis, we shall introduce a metric which describes the (contracting) neighborhood centered at the true topic values, on which the posterior distribution will be shown to place most of its probability mass. The faster the contraction, the more efficient the statistical inference. Although it would be ideal to examine the convergence for each individual topic parameter, this is challenging due to the issue of identifiability: one relatively minor problem is the label-switching issue, which means that one can only identify the collection of topics up to a permutation. A more difficult identification problem is that any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze. To address these theoretical difficulties, instead of investigating the convergence behavior of individual topics, we study the convergence of the latent topic structure through its convex hull:

$G(\theta) = \mathrm{conv}(\theta_1, \ldots, \theta_K),$

which is referred to as the topic polytope (cf. Nguyen, 2012).

Figure 1. Scenario I - Fixed $N$ and increasing $D$. Panels: (a) $K = K^*$, $\beta = 0.01$; (b) $K > K^*$, $\beta = 0.01$; (c) $K = K^*$, $\beta = 1$; (d) $K > K^*$, $\beta = 1$. In the exact-fitted case, the error convergence rate well matches the result of Theorem 1. The over-fitted case leads to a much worse rate.
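The marginalized generative process described in Section 2.1 can be sketched in a few lines of code. This is an illustrative sketch only, not the paper's experimental code: the function and variable names are hypothetical, and the stdlib Gamma-normalization trick is used to draw Dirichlet samples.

```python
import random

def sample_dirichlet(alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(a, 1) draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_corpus(D, N, topics, alpha):
    """Marginalized LDA generative process (per-word assignments z integrated out).

    Each document's word distribution eta_d is a convex combination of the
    topics, i.e. a point inside the topic polytope G = conv(theta_1..theta_K).
    """
    K = len(topics)
    V = len(topics[0])
    corpus = []
    for _ in range(D):
        pi = sample_dirichlet(alpha)  # topic proportions pi_d
        # eta_d = sum_k pi_{d,k} * theta_k  (word distribution of document d)
        eta = [sum(pi[k] * topics[k][v] for k in range(K)) for v in range(V)]
        # N i.i.d. words drawn from the multinomial parameterized by eta_d
        words = random.choices(range(V), weights=eta, k=N)
        corpus.append(words)
    return corpus
```

Note that every generated document corresponds to a point in the topic polytope $G(\theta)$; it is the posterior over this polytope, rather than over individual topics, whose contraction the analysis studies.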
(a) $K = K^*$, $\beta = 0.01$ (b) $K > K^*$, $\beta = 0.01$ (c) $K = K^*$, $\beta = 1$ (d) $K > K^*$, $\beta = 1$
Figure 2. Scenario II - Fixed $D$ and increasing $N$. In the over-fitted case, the error fails to vanish.

slowly due to the $\frac{1}{2(K-1)}$ exponent, or gradually becomes flattened as the constant $\log N / N$ term in the bound becomes the bottleneck.

2. Compare plot (a) with (c), and (b) with (d) (same $K$ but different $\beta$): when $\beta$ is larger, the error curves decay faster when less data is available. A possible explanation is that the topics are clustered and thus easier to identify at a coarser level. As more data becomes available, the error decreases more slowly due to the confusion between topics at a finer level of inference. Interestingly, the error rate flattens out, even as the amount of data increases. By contrast, when $\beta$ is small, topics are more word-sparse and distinguishable, resulting in a more efficient learning rate.

3. When the number of topics is exactly fitted, the error rate seems to match the function $(\log D / D)^{1/2}$ quite well. In the over-fitted case, however, the empirical rate is slower. Later we will analyze the exact exponent of this rate.

3.1.2. Scenario II: Fixed D and Increasing N

We next fix the number of documents $D$ to 1,000 and let the document length $N$ range from 10 to 1,400. We consider the exact-fitted ($K = K^* = 3$) and the over-fitted case ($K = 5$), using the settings $\beta = 0.01$ (word-sparse topics) and $\beta = 1$ (word-diffuse topics). The errors of the topics learned by the LDA are reported in Fig. 2, in which we compare against the varying term $(\log N / N)^{1/2}$. The behavior of the LDA in this scenario is very similar to Scenario I, as predicted by both theorems. In particular, in the over-fitted case, the error fails to vanish even as $N$ becomes large, possibly due to the presence of the constant term $\log D / D$ in the upper bound.

3.1.3. Scenario III: N = D, Both Increasing

We next apply the LDA to synthetic data generated with $D = N$, both allowed to increase simultaneously from 10 to 1,300. As before, both the exact-fitted ($K = K^* = 3$) and an over-fitted case ($K = 5$) with $\beta \in \{0.01, 1\}$ are considered. The results are reported in Fig. 3, where the error is plotted against $\log N / N = \log D / D$. Several observations can be made:

1. As in previous scenarios, the LDA is effective in the exact-fitted setting and when the word distributions of the topics are sparse (i.e., $\beta$ is small). When both of these conditions fail, the error rate fails to converge to zero, even as the data sizes $D = N$ increase (see Fig. 3).

2. Interestingly, when both $D$ and $N$ increase simultaneously, the empirical error decays at a faster rate than indicated by the upper bound $(\log D / D)^{1/2}$ from Thm. 1. As in the subsequent plots, a rough estimate could be on the order of $1/D$, which actually matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in Nguyen (2012)). This suggests that the upper bound given in Thm. 1 could be improved.

Table 1. Statistics of the Real Data Sets

  Data Set   | #documents | #training documents | average document length | vocabulary size (training)
  Wikipedia  |  2,395,616 | 10,000              | 300.7                   | 109,611
  NYT        |    299,752 | 10,000              | 313.8                   |  52,847
  Twitter    |     81,553 | 10,000              | 417.3                   | 122,035

For each data set, the entire set of documents is used for calculating PMI.

We use point-wise mutual information (PMI) to measure the quality of the learned topics (Newman et al., 2011). We have also done the evaluation using perplexity on held-out data. The findings are consistent with those observed with PMI. Due to the space limitation, we omit these results.

Fig. 5 reports the empirical performance of the LDA on the above real-world data sets via 10-fold cross-validations. Rows in Fig. 5 correspond to the three data sets, and columns correspond to different parameter configurations. The results are consistent with the theory and the empirical analysis on synthetic data. When the data presents extreme properties (e.g., very short or very few documents) or when the hyperparameters are not appropriately set, the LDA's performance suffers. A turning point and diminishing returns in performance can be observed when increasing $N$ or $D$, suggesting favorable ranges of values of the parameters. A benign range of the hyperparameters can also be observed, e.g., a small $\alpha$ and either a small $\beta$ (Wikipedia) or a large $\beta$ (NYT articles, Twitter) on different data sets.

4. Discussion

Related work. Our main focus is to understand the limiting factors of the LDA through studying the posterior distribution
of the latent topic structure, as the amount of data increases. Our results are based on mild geometric assumptions and are independent of inference algorithms (however, see a discussion of the limitations below). We note some recent papers on the recoverability of the parameters of the LDA (Arora et al., 2012; Anandkumar et al., 2012). Both works rely on arguably stronger separability/sparsity conditions on the true topics and are specific to certain computationally efficient matrix factorization algorithms.

The performance of the LDA has also been studied empirically for various tasks and data sets using different evaluation measures (Newman et al., 2010; 2011). The conclusions are generally consistent with our findings. Meanwhile, Wallach et al. (2009) studied the role of asymmetric Dirichlet priors. Mukherjee & Blei (2009) presented an analysis of the difference between two inference algorithms, collapsed variational inference and mean-field variational inference, and provided a relative performance guarantee for the approximate inference. These studies serve as a nice complement to our findings.

There are also interesting explorations of how to configure the data to improve the performance of the LDA in practice. For example, Hong & Davison (2010) showed that the LDA can obtain better topics when trained on pseudo-documents of aggregated tweets than on individual tweets. Mimno & McCallum (2007) proposed to break digital books into pages to obtain documents that are shorter and more concentrated on a few topics. These practices reconfirm our findings about the limiting factors. Our work provides a theoretical justification and a thorough simulation-based validation of these heuristic procedures.

Implications and guidelines for lay users of the LDA. We have presented the theory and a supporting empirical study of LDA model-based posterior inference. These findings are translated into the following guidance on the use of the LDA in practice:

(1) The number of documents plays perhaps the most important role; it is theoretically impossible to guarantee identification of topics from a small number of documents, no matter how long they are. Once there are sufficiently many documents, further increasing their number may not significantly improve the performance, unless the document length is also suitably increased. In practice, the LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.

(2) The length of documents also plays a crucial role: poor performance of the LDA is expected when documents are too short, even if there is a very large number of them. Ideally, the documents need to be sufficiently long, but they need not be too long: in practice, for very long documents, one can sample a fraction of each document and the LDA still yields comparable topics.

(3) When a much larger number of topics than needed is used to fit the LDA, the statistical inference may become inescapably inefficient. In theory, the convergence rate deteriorates quickly to a nonparametric rate, depending on the number of topics used to fit the LDA. This implies, in practice, that the user needs to exercise extra caution to avoid selecting an overly large number of topics for the model.

(4) The LDA performs well when the underlying topics are well separated in the sense of the Euclidean metric; e.g., this is the case if the topics are concentrated on a small number of words. Another favorable scenario is concerned with the distribution of documents within the topic polytope: when individual documents are associated mostly with small subsets of topics, so that they are geometrically concentrated