Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis

Is my data topic-model "friendly"? Why did the LDA fail on my data? How many documents do I need to learn 100 topics? It is difficult even for experts to provide quick and satisfactory answers to those questions. As noted above, some behaviors of the LDA are known intuitively in the machine learning folklore but never theoretically justified, such as the LDA's deficiency in handling short documents. Other situations remain quite mysterious, such as how the LDA performs on a few lengthy documents, and what happens when it overfits with a larger number of topics than actually present in the data. We will show in this paper that, like Liebig's barrel, the performance of the LDA is limited by the length of the shortest stave, making it challenging to judge its effectiveness without thorough empirical experiments.

In this paper we aim to provide a systematic analysis of the behavior of the LDA in various settings. We identify several limiting factors whose interactions play the crucial role in determining the LDA's performance. These factors include the number of documents, the length of individual documents, the number of topics, and the Dirichlet (hyper)parameters. The main contributions include: (i) theoretical results explicating the convergence behavior of the posterior distribution of latent topics as the amount of training data increases; as shown in Section 2, the convergence behavior is found to depend on the interactions of the limiting factors mentioned above; (ii) a thorough empirical study, reported in Section 3, that provides support for the theory by varying the settings of the limiting factors on synthetic and real data sets; (iii) these findings are translated into a number of concrete guidelines regarding the practical use of the LDA model, discussed in detail in Section 4.

2. Posterior Contraction Analysis of the Topic Polytope in the LDA

The latent Dirichlet allocation (LDA) model explains the generation of text documents, which can be viewed as samples from a mixture of multinomial distributions over a vocabulary of words. Each multinomial mixture component is called a "topic". We refer the reader to Blei et al. (2003) and Pritchard et al. (2000) for a background on the model. Here we provide a geometric reformulation of the LDA model and present several theoretical results that explain the behavior of the posterior distribution of the latent topic variables as the number of training data increases.

2.1. Latent Topic Polytope in the LDA

Let V denote the vocabulary (set of words). Each topic θ_k is a vector in Δ^(|V|-1), the (|V|-1)-dimensional probability simplex. We assume that the documents are generated by K topics θ = (θ_1, …, θ_K). Each document d ∈ {1, …, D} in the corpus is then associated with a topic proportion vector π_d = (π_{d,1}, …, π_{d,K}) ∈ Δ^(K-1), where π_{d,k} is the probability that each word in document d is assigned to topic k. Equivalently, document d uniquely corresponds to a word probability vector

  η_d = Σ_{k=1}^K π_{d,k} θ_k ∈ Δ^(|V|-1).

The words of document d, i.e. W_{d[N_d]} := (w_{dn})_{n=1}^{N_d}, are independent and identically distributed samples from a multinomial distribution parameterized by this probability vector η_d. To simplify the analysis, we assume all documents have the same number of words, N_d = N. The joint distribution of the full data set W_{[D][N]} := (W_{d[N]})_{d=1}^D, denoted by P^D_{W[N]}, is the product of the single-document distributions:

  P^D_{W[N]}(W_{[D][N]}) := Π_{d=1}^D P_{W[N]}(W_{d[N]}).

Our primary interest is the inference of the topic parameters θ on the basis of the sampled D·N words W_{[D][N]}, and how this inference is affected by the way the words form into documents. Note that, compared to the original definition of Blei et al. (2003), this alternative representation does not involve the latent assignment variables z_{dn}; they are simply marginalized out.

In a Bayesian estimation setting of the LDA model, the document-topic and topic-word proportion vectors (the π_d and θ_k) are assumed to be random and endowed with Dirichlet prior distributions, parameterized by hyperparameters α = (α_1, …, α_K) and β = (β_1, …, β_{|V|}), respectively. Accordingly, one is interested in the behavior of the posterior distribution of the topic parameters given the observed documents. In particular, we want to understand the convergence behavior of the posterior topic distribution when the total amount of data D·N increases to infinity. It is expected that the posterior distribution of the topic variables should contract toward their true values as one has more data. A natural question to ask is the rate at which this posterior contraction phenomenon occurs.

In order to establish such an analysis, we shall introduce a metric which describes the (contracting) neighborhood centered at the true topic values, on which the posterior distribution will be shown to place most of its probability mass. The faster the contraction, the more efficient the statistical inference. Although it would be ideal to examine the convergence of each individual topic parameter, this is challenging due to the issue of identifiability: one relatively minor problem is the "label-switching" issue, which means that one can only identify the collection of topics up to a permutation. A more difficult identification problem is that any vector that can be expressed as a convex combination of the topic parameters would be hard to identify and analyze. To address these theoretical difficulties, instead of investigating the convergence behavior of individual topics, we study the convergence of the latent topic structure through its convex hull

  G(θ) = conv(θ_1, …, θ_K),

which is referred to as the topic polytope (cf. Nguyen (2012)).

Figure 1. Scenario I: fixed N and increasing D. (a) exact-fitted, β = 0.01; (b) over-fitted, β = 0.01; (c) exact-fitted, β = 1; (d) over-fitted, β = 1. In the exact-fitted case, the error convergence rate matches the result of Theorem 1 well. The over-fitted case leads to a much worse rate.
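As a concrete illustration of the generative process reformulated in Section 2.1 (topics θ_k from a Dirichlet prior with parameter β, per-document proportions π_d from a Dirichlet with parameter α, and N i.i.d. words per document from the mixed distribution η_d = Σ_k π_{d,k} θ_k, with the assignments z_{dn} marginalized out), the following sketch simulates a synthetic corpus of the kind used in Section 3. It is our own minimal illustration with symmetric hyperparameters, not the paper's experimental code, and all function names are ours:

```python
import random

def dirichlet(alphas, rng=random):
    # Sample from a Dirichlet distribution via normalized Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(D, N, K, V, alpha=0.1, beta=0.01, seed=0):
    """Generate D documents of N words each from the LDA model with the
    topic assignments marginalized out: document d draws its words
    i.i.d. from eta_d = sum_k pi_{d,k} * theta_k."""
    random.seed(seed)
    # K topics theta_k, each a point in the (V-1)-dimensional simplex.
    topics = [dirichlet([beta] * V) for _ in range(K)]
    corpus = []
    for _ in range(D):
        pi = dirichlet([alpha] * K)  # topic proportions pi_d
        eta = [sum(pi[k] * topics[k][v] for k in range(K)) for v in range(V)]
        corpus.append(random.choices(range(V), weights=eta, k=N))
    return topics, corpus
```

Varying D, N, K, and the hyperparameters in such a simulation is exactly the kind of controlled experiment reported in Section 3.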
Figure 2. Scenario II: fixed D and increasing N. (a) exact-fitted, β = 0.01; (b) over-fitted, β = 0.01; (c) exact-fitted, β = 1; (d) over-fitted, β = 1. In the over-fitted case, the error fails to vanish.

1. …slowly due to the 1/(2(K-1)) exponent, or gradually become flattened as the constant (log N)/N term in the bound becomes the bottleneck.

2. Compare plot (a) with (c), and (b) with (d) (same K but different β): when β is larger, the error curves decay faster when less data is available. A possible explanation is that the topics are clustered and thus easier to identify at a coarser level. As more data becomes available, the error decreases more slowly due to the confusion between topics at a finer level of inference. Interestingly, the error rate flattens out even as the amount of data increases. By contrast, when β is small, topics are more word-sparse and distinguishable, resulting in a more efficient learning rate.

3. When the number of topics is exactly fitted, the error rate seems to match the function (log D / D)^(1/2) quite well. In the over-fitted case, however, the empirical rate is slower. Later we will analyze the exact exponent of this rate.

3.1.2. Scenario II: Fixed D and Increasing N

We next fix the number of documents at D = 1,000 and let the document length N range from 10 to 1,400. We consider the exact-fitted case (K = 3, the true number of topics) and the over-fitted case (K = 5), using the settings β = 0.01 (word-sparse topics) and β = 1 (word-diffuse topics). The errors of the topics learned by the LDA are reported in Fig. 2, in which we compare against the varying term (log N / N)^(1/2). The behavior of the LDA in this scenario is very similar to Scenario I, as predicted by both theorems. In particular, in the over-fitted case, the error fails to vanish even as N becomes large, possibly due to the presence of the constant term (log D)/D in the upper bound.

3.1.3. Scenario III: N = D, Both Increasing

We next apply the LDA to synthetic data generated with D = N, allowed to increase simultaneously from 10 to 1,300. As before, both the exact-fitted case (K = 3) and an over-fitted case (K = 5) with β ∈ {0.01, 1} are considered. The results are reported in Fig. 3, where the error is plotted against (log N)/N = (log D)/D. Several observations can be made:

1. As in the previous scenarios, the LDA is effective in the exact-fitted setting and when the word distributions of the topics are sparse (i.e., β is small). When both of these conditions fail, the error rate fails to converge to zero, even as the data sizes D = N increase (see Fig. 3).

2. Interestingly, when both D and N increase simultaneously, the empirical error decays at a faster rate than indicated by the upper bound (log D / D)^(1/2) from Thm. 1. As in the subsequent plots, a rough estimate could be of order 1/D, which actually matches the theoretical lower bound of the error contraction rate (cf. Thm. 3 in Nguyen (2012)). This suggests that the upper bound given in Thm. 1 could be improved.

Table 1. Statistics of the real data sets.

  Data set    #documents   #training documents   avg. document length   vocabulary size (training)
  Wikipedia    2,395,616    10,000                300.7                  109,611
  NYT            299,752    10,000                313.8                   52,847
  Twitter         81,553    10,000                417.3                  122,035

For each data set, the entire set of documents is used for calculating PMI.

We use point-wise mutual information (PMI) to measure the quality of the learned topics (Newman et al., 2011). We have also done the evaluation using perplexity on held-out data; the findings are consistent with those observed with PMI. Due to the space limitation, we omit these results.

Fig. 5 reports the empirical performance of the LDA on the above real-world data sets via 10-fold cross-validation. Rows in Fig. 5 correspond to the three data sets, and columns correspond to different parameter configurations. The results are consistent with the theory and the empirical analysis on synthetic data. When the data presents extreme properties (e.g., very short or very few documents) or when the hyperparameters are not appropriately set, the LDA's performance suffers. A turning point and a diminishing return of the performance can be observed when increasing N or D, suggesting favorable ranges of values of the parameters. A benign range of the hyperparameters can also be observed, e.g., a small β and either a small α (Wikipedia) or a large α
(NYT articles, Twitter) on different data sets.

4. Discussion

Related work. Our main focus is to understand the limiting factors of the LDA by studying the posterior distribution of the latent topic structure as the amount of data increases. Our results are based on mild geometric assumptions and are independent of inference algorithms (however, see a discussion of the limitations below). We note some recent papers on the recoverability of the parameters of the LDA (Arora et al., 2012; Anandkumar et al., 2012). Both works rely on arguably stronger separability/sparsity conditions on the true topics and are specific to certain computationally efficient matrix factorization algorithms.

The performance of the LDA has also been studied empirically for various tasks and data sets using different evaluation measures (Newman et al., 2010; 2011). The conclusions are generally consistent with our findings. Meanwhile, Wallach et al. (2009) studied the role of asymmetric Dirichlet priors. Mukherjee & Blei (2009) presented an analysis of the difference between two inference algorithms, collapsed variational inference and mean-field variational inference, and provided a relative performance guarantee for the approximate inference. These studies serve as a nice complement to our findings.

There are also interesting explorations of how to configure the data to improve the performance of the LDA in practice. For example, Hong & Davison (2010) showed that the LDA can obtain better topics when trained on "pseudo-documents" of aggregated tweets than on individual tweets. Mimno & McCallum (2007) proposed to break digital books into pages to obtain documents that are shorter and more concentrated on a few topics. These practices reconfirm our findings about the limiting factors. Our work provides a theoretical justification and a thorough simulation-based validation of these heuristic procedures.

Implications and guidelines for lay users of the LDA. We have presented the theory and a supporting empirical study of LDA model-based posterior inference. These findings are translated into the following guidance on the use of the LDA in practice:

(1) The number of documents plays perhaps the most important role; it is theoretically impossible to guarantee identification of topics from a small number of documents, no matter how long they are. Once there are sufficiently many documents, further increasing their number may not significantly improve the performance, unless the document length is also suitably increased. In practice, the LDA achieves comparable results even if only thousands of documents are sampled from a much larger collection.

(2) The length of documents also plays a crucial role: poor performance of the LDA is expected when documents are too short, even if there is a very large number of them. Ideally, the documents need to be sufficiently long, but they need not be too long: in practice, for very long documents, one can sample a fraction of each document and the LDA still yields comparable topics.

(3) When a much larger number of topics than needed is used to fit the LDA, the statistical inference may become inescapably inefficient. In theory, the convergence rate deteriorates quickly to a nonparametric rate that depends on the number of topics used to fit the LDA. This implies that, in practice, the user needs to exercise extra caution to avoid selecting an overly large number of topics for the model.

(4) The LDA performs well when the underlying topics are well-separated in the sense of the Euclidean metric; e.g., this is the case if the topics are concentrated on a small number of words. Another favorable scenario concerns the distribution of documents within the topic polytope: when individual documents are associated mostly with small subsets of topics, so that they are geometrically concentrated
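The pseudo-document heuristic noted in the related work (Hong & Davison, 2010) connects directly to guideline (2): pooling many very short documents that share a key, such as the author of a tweet, yields fewer but longer documents, a regime in which posterior contraction is faster. A minimal sketch of that preprocessing step (our own illustration, not code from either paper; the dict-based corpus format is assumed for the example):

```python
from collections import defaultdict

def pool_pseudo_documents(docs, key_fn):
    """Pool short documents that share a key (e.g. the author of a
    tweet) into longer pseudo-documents before fitting the LDA."""
    pooled = defaultdict(list)
    for doc in docs:
        pooled[key_fn(doc)].extend(doc["words"])
    return dict(pooled)
```

For example, pooling tweets by user turns a corpus of two-word documents into one document per user, trading document count for document length.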