Sparse Additive Generative Models of Text

...ponent vector $\eta_k$, and $i$ indexes into the vocabulary. A zero-mean Laplace prior has the same effect as placing an $L_1$ regularizer on $\eta_{ki}$, inducing sparsity while at the same time permitting more extreme deviations from the mean. The Laplace distribution $\mathcal{L}(\eta; m, \sigma)$ is equivalent to a compound model, $\int \mathcal{N}(\eta; m, \tau)\, \mathcal{E}(\tau; \sigma)\, d\tau$, where $\mathcal{E}(\tau; \sigma)$ indicates the Exponential distribution (Lange & Sinsheimer, 1993; Figueiredo, 2003). This identity is the cornerstone of our inference, which treats the variance $\tau$ as a latent variable.

We now present a generative story for the incorporation of SAGE in a naive Bayes classifier:

- Draw the background $m$ from an uninformative prior
- For each class $k$:
  - For each term $i$:
    - Draw $\tau_{k,i} \sim \mathcal{E}(\gamma)$
    - Draw $\eta_{k,i} \sim \mathcal{N}(0, \tau_{k,i})$
  - Set $\beta_k \propto \exp(\eta_k + m)$
- For each document $d$:
  - Draw a class $y_d$ from a uniform distribution
  - For each word $n$, draw $w_n^{(d)} \sim \beta_{y_d}$

In general we work in a Bayesian setting, but for the components $\eta$ we take maximum a posteriori point estimates. Bayesian uncertainty is problematic due to the logistic transformation: even if the expectation $\langle \eta_{ki} \rangle = 0$, any posterior variance over $\eta_{ki}$ would make $\langle \exp(\eta_{ki} + m_i) \rangle \neq \langle \exp m_i \rangle$. We resort to a combination of MAP estimation over $\eta$ and Bayesian inference over all other latent variables; this is similar to the treatment of the topics in the original formulation of latent Dirichlet allocation (Blei et al., 2003). The background word distribution $m$ is assumed to be fixed, and we fit a variational distribution over the remaining latent variables, optimizing the bound,

$$\ell = \sum_d \sum_n^{N_d} \log P(w_n^{(d)} \mid \eta, m, y_d) + \sum_k \big\langle \log P(\eta_k \mid 0, \tau_k) \big\rangle + \sum_k \big\langle \log P(\tau_k \mid \gamma) \big\rangle - \sum_k \big\langle \log Q(\tau_k) \big\rangle, \quad (2)$$

where $N_d$ is the number of tokens in document $d$.

3. Estimation

We now describe how SAGE components can be efficiently estimated using a Newton optimization.

3.1. Component means

First we address learning the component vectors $\eta$. Letting $c_d$ represent the vector of term counts for document $d$, and $C_d = \sum_i c_{di}$, we compute the relevant parts of the bound,

$$\ell(\eta_k) = \sum_{d : y_d = k} \Big[ c_d^T \eta_k - C_d \log \sum_i \exp(\eta_{ki} + m_i) \Big] - \frac{1}{2}\, \eta_k^T\, \mathrm{diag}\big\langle \tau_k^{-1} \big\rangle\, \eta_k \quad (3)$$

$$\frac{\partial \ell}{\partial \eta_k} = c_k - C_k \frac{\exp(\eta_k + m)}{\sum_i \exp(\eta_{ki} + m_i)} - \mathrm{diag}\big\langle \tau_k^{-1} \big\rangle\, \eta_k = c_k - C_k \beta_k - \mathrm{diag}\big\langle \tau_k^{-1} \big\rangle\, \eta_k, \quad (4)$$

abusing notation so that $c_k = \sum_{d : y_d = k} c_d$ and $C_k = \sum_i c_{ki}$. Note that the fraction $\exp(\eta_k + m) / \sum_i \exp(\eta_{ki} + m_i)$ is equal to the term frequency vector $\beta_k$. The gradient has an intuitive interpretation as the difference of the true counts $c_k$ from their expectation $C_k \beta_k$, minus the divergence of $\eta$ from its prior mean 0, scaled by the expected inverse-variance.

We will use Newton's method to optimize $\eta$, so we first derive the Hessian,

$$\frac{d^2 \ell}{d \eta_{ki}^2} = C_k\, \beta_{ki} (\beta_{ki} - 1) - \big\langle \tau_{ki}^{-1} \big\rangle, \qquad \frac{d^2 \ell}{d \eta_{ki}\, d \eta_{ki'}} = C_k\, \beta_{ki}\, \beta_{ki'},$$
$$H(\eta_k) = C_k\, \beta_k \beta_k^T - \mathrm{diag}\big( C_k \beta_k + \big\langle \tau_k^{-1} \big\rangle \big). \quad (5)$$

The Hessian matrix $H$ is rank-one plus diagonal, so it can be efficiently inverted using the Sherman-Morrison formula. For notational simplicity, we elide the class index $k$, and define the convenience variable $A = \mathrm{diag}\big( C \beta + \langle \tau^{-1} \rangle \big)^{-1}$. We can now derive a Newton optimization step for $\eta$, using the gradient $g(\eta) = \partial \ell / \partial \eta$ from Equation 4:

$$H^{-1}(\eta) = -A + \frac{A \beta \beta^T A\, C}{-1 + C\, \beta^T A \beta}, \qquad H^{-1}(\eta)\, g(\eta) = -A\, g(\eta) + \frac{C A \beta}{-1 + C\, \beta^T A \beta} \Big[ \beta^T \big( A\, g(\eta) \big) \Big], \quad (6)$$

where the parenthesization defines an order of operations that avoids forming any non-diagonal matrices. Thus, the complexity of each Newton step is linear in the size of the vocabulary.
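To make the linear-time update concrete, the following NumPy sketch implements Equations 4-6 for a single component. The paper specifies the math, not this code: the function name, conventions, and the omission of step-size control are our own illustration.

```python
import numpy as np

def sage_newton_step(eta, m, c, inv_tau):
    """One Newton update for a single SAGE component (Eqs. 4-6).

    eta     -- current deviation vector eta_k, shape (V,)
    m       -- background log-frequencies, shape (V,)
    c       -- term counts c_k for this class, shape (V,)
    inv_tau -- expected inverse variances <tau_k^{-1}>, shape (V,)
    """
    C = c.sum()

    # beta_k is the softmax of (eta + m): the implied term-frequency vector
    s = eta + m
    s -= s.max()                        # numerical stability
    beta = np.exp(s)
    beta /= beta.sum()

    # Gradient, Eq. 4: true counts minus expected counts, minus the prior pull
    g = c - C * beta - inv_tau * eta

    # Sherman-Morrison inverse of the rank-one-plus-diagonal Hessian, Eq. 6.
    # A is diagonal, so it is stored as a vector; no V x V matrix is formed.
    A = 1.0 / (C * beta + inv_tau)
    Ag = A * g
    denom = -1.0 + C * np.dot(beta, A * beta)
    Hinv_g = -Ag + (C * A * beta / denom) * np.dot(beta, Ag)

    return eta - Hinv_g                 # Newton step: eta <- eta - H^{-1} g
```

Every operation touches only length-$V$ vectors, so the cost per step is linear in the vocabulary size, as claimed above.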
3.2. Variances

Next we consider the variance; recall that we have a random vector $\tau_k$ for every component $k$. Unlike the components $\eta$, we are Bayesian with respect to $\tau$, and construct a fully-factored variational distribution $Q_k(\tau_k) = \prod_i Q_{ki}(\tau_{ki})$. We set the form of $Q_{ki}$ to be a Gamma distribution with parameters $\langle a, b \rangle$,

$$Q(\tau) = \mathcal{G}(\tau; a, b) = \frac{\tau^{a-1} \exp(-\tau / b)}{\Gamma(a)\, b^a},$$

so that $\langle \tau \rangle = ab$, $\langle \tau^{-1} \rangle = \big( (a-1)\, b \big)^{-1}$, and $\langle \log \tau \rangle = \psi(a) + \log b$. The prior on $\tau$ is an Exponential distribution.

[...]

...the distribution for a given token is

$$P(w_n^{(d)} \mid z_n^{(d)}, \eta, m) \propto \exp\big( \eta_{z_n^{(d)}} + m \big).$$

We can combine the mean-field variational inference for latent Dirichlet allocation (LDA) with the variational treatment of $\tau$, optimizing the bound,

$$\ell = \sum_d \Big[ \big\langle \log P(\theta_d \mid \alpha) \big\rangle + \sum_n^{N_d} \big\langle \log P(w_n^{(d)} \mid \eta, m, z_n^{(d)}) \big\rangle + \big\langle \log P(z_n^{(d)} \mid \theta_d) \big\rangle \Big] + \sum_k \big\langle \log P(\eta_k \mid 0, \tau_k) \big\rangle + \sum_k \big\langle \log P(\tau_k \mid \gamma) \big\rangle - \big\langle \log Q(\theta, z, \tau) \big\rangle. \quad (8)$$

The updates for $Q(z)$ and $Q(\theta)$ are identical to standard LDA; the updates for $Q(\tau)$ remain as in Section 3.2. However, the presence of latent variables slightly changes the MAP estimation for $\eta$:

$$\ell(\eta_k) = \sum_d \sum_n^{N_d} Q_{z_n^{(d)}}(k) \Big( \eta_{k, w_n^{(d)}} - \log \sum_i \exp(\eta_{ki} + m_i) \Big) - \frac{1}{2}\, \eta_k^T\, \mathrm{diag}\big\langle \tau_k^{-1} \big\rangle\, \eta_k$$

$$\frac{\partial \ell}{\partial \eta_k} = \langle c_k \rangle - \langle C_k \rangle \frac{\exp(\eta_k + m)}{\sum_i \exp(\eta_{ki} + m_i)} - \mathrm{diag}\big\langle \tau_k^{-1} \big\rangle\, \eta_k,$$

where $\langle c_{ki} \rangle = \sum_d \sum_n Q_{z_n^{(d)}}(k)\, \delta(w_n^{(d)} = i)$ and $\langle C_k \rangle = \sum_i \langle c_{ki} \rangle$. Thus, the exact counts $c_k$ are replaced with their expectations under $Q(z)$.

We define an EM procedure in which the M-step consists of iteratively fitting the parameters $\eta$ and $Q(\tau)$. It is tempting to perform a "warm start" by initializing $\eta$ with the values from a previous iteration of the outer EM loop. However, these parameters are tightly coupled: as the component mean $\eta_{ki}$ goes to zero, the expected variance $\langle \tau_{ki} \rangle$ is also driven to zero; once $\langle \tau_{ki} \rangle$ is very small, $\eta_{ki}$ cannot move away from zero regardless of the expected counts $c_k$. This means that a warm start risks locking in a sparsity pattern during the early stages of EM which may be far from the global optimum. There are two solutions: either we abandon the warm start (thus expending more computation), or we do not iterate to convergence in each M-step (thus obtaining noisier and less sparse solutions, initially). Fortunately, pilot experiments showed that good results can be obtained by performing just one iteration in each M-step, while using the warm start technique.
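This one-iteration, warm-started M-step can be sketched as follows, reusing the `sage_newton_step` function from the sketch in Section 3.1. The names `expected_counts` and `m_step` are our own, and the refit of $Q(\tau)$ inside the M-step is omitted for brevity.

```python
import numpy as np

def expected_counts(q_z, words, K, V):
    """E-step accumulation of <c_k>: expected term counts under Q(z).

    q_z   -- list over documents of (N_d, K) arrays, Q(z_n = k) per token
    words -- list over documents of (N_d,) int arrays of vocabulary indices
    """
    c = np.zeros((K, V))
    for q_d, w_d in zip(q_z, words):
        for k in range(K):
            # scatter-add token responsibilities into topic k's vocab row
            np.add.at(c[k], w_d, q_d[:, k])
    return c

def m_step(eta, m, q_z, words, inv_tau):
    """M-step sweep: a single Newton update per topic, starting from the
    warm-started eta, matching the one-iteration finding above."""
    K, V = eta.shape
    c = expected_counts(q_z, words, K, V)
    for k in range(K):
        eta[k] = sage_newton_step(eta[k], m, c[k], inv_tau[k])
    return eta
```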
Application 2: Sparse topic models

Our second evaluation applies the SAGE topic model to the benchmark NIPS dataset.[2] Following the evaluation of Wang and Blei (2009), we subsample to 10% of the tokens in each document, and hold out 20% of the documents as a "test set" on which to evaluate predictive likelihood. We limit the vocabulary to the 10,000 terms that appear in the greatest number of documents; no stopword pruning is applied. Overall this leaves 1986 training documents with 237,691 tokens, and a test set of 497 documents and 57,427 tokens. We evaluate perplexity using the Chib-style estimation procedure of Wallach et al. (2009b). For comparison, we implement variational latent Dirichlet allocation, making maximum-likelihood updates to a symmetric Dirichlet prior on the topic-term distributions.

[2] http://www.cs.nyu.edu/~roweis/data.html

Results are shown in Figure 3, using box plots over five paired random initializations for each method. SAGE outperforms standard latent Dirichlet allocation as the number of topics increases; with both 25 and 50 topics, every SAGE run outperformed its counterpart Dirichlet-multinomial. As in the classification task, SAGE controls sparsity adaptively: as the number of topics increases from 10 to 50, the proportion of non-zero weights decreased five-fold (from 5% to 1%), holding the total model complexity constant.

[Figure 3: Perplexity results for SAGE and latent Dirichlet allocation on the NIPS dataset.]

[Figure 4: Proportion of total variation committed to words at each frequency decile. Dirichlet-multinomial LDA makes large distinctions in the topic-term frequencies of very rare words, while SAGE only distinguishes the topic-term frequencies of words with robust counts.]

[...] ...teraction components $\eta^{(I)}_{jk}$ depend on exactly one label distribution $\eta^{(A)}_j$ and one topic $\eta^{(T)}_k$, so we can use $C_{jk}\, \beta_{jk}$ directly without computing any sums.
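As a concrete illustration of this additive structure, the sketch below assumes the multi-faceted model's token distribution is a softmax over the sum of the background, one topic deviation, one label deviation, and their interaction, as the surrounding discussion describes; the array layout and names are our own assumptions.

```python
import numpy as np

def combined_term_distribution(m, eta_topic, eta_label, eta_inter, k, j):
    """Term distribution beta_{jk} for topic k under label j: background plus
    topic, label, and topic-label interaction deviations, then a softmax."""
    s = m + eta_topic[k] + eta_label[j] + eta_inter[j, k]
    s -= s.max()                 # numerical stability
    p = np.exp(s)
    return p / p.sum()
```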
Application 3: Topic and ideology

We first evaluate on a publicly-available dataset of political blogs describing the 2008 U.S. presidential election (Eisenstein & Xing, 2010). There are a total of six blogs, three from the right and three from the left, comprising 20,827 documents, 5.1 million word tokens, and a vocabulary of 8284 items. The task is to predict the ideological perspective of two unlabeled blogs, using the remaining four as a training set. We strictly follow the experimental procedure of Ahmed & Xing (2010), allowing us to compare with their reported results directly.[4]

[4] The sole exception is that we learn the prior $\gamma$ from data, while Ahmed & Xing set it manually.

Ahmed and Xing considered three ideology-prediction tasks, and found that the six-blog task was the most difficult: their Multiview LDA model achieves accuracy between 65% and 69.1% depending on the number of topics. They find a comparable result of 69% using support vector machines; the alternative latent variable models Discriminative LDA (Lacoste-Julien et al., 2008) and Supervised LDA (Blei & McAuliffe, 2007) do worse. Our results are shown in Figure 6, reporting the median across five random initializations at each setting of $K$. Our best median result is 69.6%, at $K = 30$, equalling the state of the art; our best individual run achieves 71.9%. Our model obtains sparsity of 93% for topics, 82% for labels, and 99.3% for topic-label interactions; nonetheless, pilot experiments show that the absence of topic-label interactions reduces performance substantially.

[Figure 6: SAGE's accuracy on the ideological perspective task; the state of the art is 69.1% (Ahmed & Xing, 2010).]

Application 4: Geolocation from text

We now consider the setting in which the label is itself a latent variable, generating both the text (as described above) and some additional metadata. This is the setting for the Geographic Topic Model, in which a latent "region" helps to select the distributions that generate both text and observed GPS locations (Eisenstein et al., 2010). By training on labeled examples in which both text and geolocation are observed, the model is able to make predictions about the GPS location of unlabeled authors.

The Geographic Topic Model induces region-specific versions of each topic by chaining together log-Normal distributions. This is equivalent to an additive model in which both the topic and the topic-region interaction exert a zero-mean Gaussian deviation from a background language model. SAGE differs by allowing effects that are region-specific but topic-neutral, and by inducing sparsity. We follow the tuning procedures of Eisenstein et al. (2010) exactly: the number of latent regions is determined by running a Dirichlet process mixture model on the location data alone, and the number of topics is tuned against a development set. We also present more recent results from Wing & Baldridge (2011), who use a nearly identical dataset, but include a larger vocabulary. As shown in Table 1, SAGE achieves the best mean error of any system on this task, though Wing & Baldridge (2011) have the best median error.

Table 1: Prediction error for Twitter geolocation, in kilometers.

                               Median    Mean
  (Eisenstein et al., 2010)       494     900
  (Wing & Baldridge, 2011)        479     967
  SAGE                            501     845

6. Related work

Sparse learning (Tibshirani, 1996; Tipping, 2001) typically focuses on supervised settings, learning a sparse set of weights that minimizes a loss on the training labels. Two recent papers apply sparsity to topic models. Williamson et al. (2010) induce sparsity in the topic proportions by using the Indian Buffet Process to represent the presence or absence of topics in a document. More closely related is the SparseTM of Wang & Blei (2009), which induces sparsity in the topics themselves using a spike-and-slab distribution. However, the notion of sparsity is different: in the SparseTM, the topic-term probability distributions can contain zeros, while in SAGE, each topic is a set of sparse deviations from a background distribution. Inference in the SparseTM requires computing a combinatorial sum over all sparsity patterns, while in our case a relatively simple coordinate ascent is possible.

Sparse dictionary learning provides an alternative approach to modeling document content with sparse bases (Jenatton et al., 2010). In general, such ap-