Sparse Additive Generative Models of Text

[...] component vector $\eta_k$, and $i$ indexes into the vocabulary. A zero-mean Laplace prior has the same effect as placing an $L_1$ regularizer on $\eta_{ki}$, inducing sparsity while at the same time permitting more extreme deviations from the mean. The Laplace distribution $\mathcal{L}(\eta; m, \sigma)$ is equivalent to a compound model, $\int \mathcal{N}(\eta; m, \tau)\,\mathcal{E}(\tau; \sigma)\,d\tau$, where $\mathcal{E}(\tau; \sigma)$ indicates the Exponential distribution (Lange & Sinsheimer, 1993; Figueiredo, 2003). This identity is the cornerstone of our inference, which treats the variance $\tau$ as a latent variable. We now present a generative story for the incorporation of SAGE in a naive Bayes classifier:

- Draw the background $m$ from an uninformative prior
- For each class $k$:
  - For each term $i$:
    - Draw $\tau_{k,i} \sim \mathcal{E}(\gamma)$
    - Draw $\eta_{k,i} \sim \mathcal{N}(0, \tau_{k,i})$
  - Set $\beta_k \propto \exp(\eta_k + m)$
- For each document $d$:
  - Draw a class $y_d$ from a uniform distribution
  - For each word $n$, draw $w_n^{(d)} \sim \beta_{y_d}$

In general we work in a Bayesian setting, but for the components $\eta$ we take maximum a posteriori point estimates. Bayesian uncertainty is problematic due to the logistic transformation: even if the expectation $\langle \eta_{ki} \rangle = 0$, any posterior variance over $\eta_{ki}$ would make $\langle \exp(\eta_{ki} + m_i) \rangle > \langle \exp m_i \rangle$. We resort to a combination of MAP estimation over $\eta$ and Bayesian inference over all other latent variables; this is similar to the treatment of the topics $\beta$ in the original formulation of latent Dirichlet allocation (Blei et al., 2003). The background word distribution $m$ is assumed to be fixed, and we fit a variational distribution over the remaining latent variables, optimizing the bound

\ell = \sum_d \sum_n^{N_d} \log P(w_n^{(d)} \mid \eta, m, y_d) + \sum_k \langle \log P(\eta_k \mid 0, \tau_k) \rangle + \sum_k \langle \log P(\tau_k \mid \gamma) \rangle - \sum_k \langle \log Q(\tau_k) \rangle,   (2)

where $N_d$ is the number of tokens in document $d$.

3. Estimation

We now describe how SAGE components can be efficiently estimated using a Newton optimization.

3.1. Component means

First we address learning the component vectors $\eta$. Letting $c_d$ represent the vector of term counts for document $d$, and $C_d = \sum_i c_{di}$, we compute the relevant parts of the bound,

\ell(\eta_k) = \sum_{d: y_d = k} c_d^T \eta_k - C_d \log \sum_i \exp(\eta_{ki} + m_i) - \eta_k^T \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k / 2   (3)

\frac{\partial \ell}{\partial \eta_k} = c_k - C_k \frac{\exp(\eta_k + m)}{\sum_i \exp(\eta_{ki} + m_i)} - \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k = c_k - C_k \beta_k - \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k,   (4)

abusing notation so that $c_k = \sum_{d: y_d = k} c_d$ and $C_k = \sum_i c_{ki}$. Note that the fraction $\exp(\eta_k + m) / \sum_i \exp(\eta_{ki} + m_i)$ is equal to the term frequency vector $\beta_k$. The gradient has an intuitive interpretation as the difference of the true counts $c_k$ from their expectation $C_k \beta_k$, minus the divergence of $\eta$ from its prior mean 0, scaled by the expected inverse-variance. We will use Newton's method to optimize $\eta$, so we first derive the Hessian,

\frac{d^2 \ell}{d \eta_{ki}^2} = C_k \beta_{ki}(\beta_{ki} - 1) - \langle \tau_{ki}^{-1} \rangle, \qquad \frac{d^2 \ell}{d \eta_{ki}\, d \eta_{ki'}} = C_k \beta_{ki} \beta_{ki'}

H(\eta_k) = C_k \beta_k \beta_k^T - \mathrm{diag}(C_k \beta_k + \langle \tau_k^{-1} \rangle).   (5)

The Hessian matrix $H$ is rank-one plus diagonal, so it can be efficiently inverted using the Sherman-Morrison formula. For notational simplicity, we elide the class index $k$ and define the convenience variable $A = \mathrm{diag}(-(C\beta + \langle \tau^{-1} \rangle))^{-1}$. We can now derive a Newton optimization step for $\eta$, using the gradient $g(\eta) = \partial \ell / \partial \eta$ from Equation 4:

H^{-1}(\eta) = A - \frac{C A \beta \beta^T A}{1 + C \beta^T A \beta}

\delta\eta = H^{-1}(\eta)\, g(\eta) = A\, g(\eta) - \frac{C A \beta}{1 + C \beta^T A \beta} \left[ \beta^T (A\, g(\eta)) \right],   (6)

where the parenthesization defines an order of operations that avoids forming any non-diagonal matrices. Thus, the complexity of each Newton step is linear in the size of the vocabulary.
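This linear-time Newton step is the computational core of SAGE estimation. Below is a minimal NumPy sketch of Equations 4-6 for a single component; the variable names (eta, m, c_k, inv_tau) are our own, step-size control and the outer loop over components are omitted, and it should be read as an illustration of the rank-one-plus-diagonal trick rather than the authors' implementation.

```python
import numpy as np

def sage_newton_step(eta, m, c_k, inv_tau):
    """One Newton update for a single SAGE component (sketch of Eqs. 4-6).

    eta     : current component deviations eta_k, shape (V,)
    m       : background log-frequencies, shape (V,)
    c_k     : term counts for class k, shape (V,)
    inv_tau : expected inverse variances <tau_k^{-1}>, shape (V,)
    """
    C_k = c_k.sum()                                   # total token count for class k
    z = eta + m
    beta = np.exp(z - z.max())
    beta /= beta.sum()                                # term-frequency vector beta_k

    # Gradient (Eq. 4): observed counts minus expected counts, minus the prior pull.
    g = c_k - C_k * beta - inv_tau * eta

    # Hessian (Eq. 5): H = C_k beta beta^T - diag(C_k beta + inv_tau),
    # i.e. rank-one plus diagonal, so H^{-1} g follows from Sherman-Morrison
    # without ever forming a V x V matrix (Eq. 6).
    A = -1.0 / (C_k * beta + inv_tau)                 # diagonal of A, kept as a vector
    Ag = A * g
    Abeta = A * beta
    denom = 1.0 + C_k * (beta @ Abeta)
    delta = Ag - (C_k * Abeta) * (beta @ Ag) / denom  # delta = H^{-1}(eta) g(eta)

    return eta - delta                                # ascent update; H is negative definite
```

Each call costs O(V) time and memory, which is what makes per-class (and, later, per-topic) Newton optimization practical at vocabulary scale.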
3.2. Variances

Next we consider the variance $\tau$; recall that we have a random vector $\tau_k$ for every component $k$. Unlike the components $\eta$, we are Bayesian with respect to $\tau$, and construct a fully-factored variational distribution $Q_k(\tau_k) = \prod_i Q_{ki}(\tau_{ki})$. We set the form $Q_{ki}$ to be a Gamma distribution with parameters $\langle a, b \rangle$:

Q(\tau) = \mathcal{G}(\tau; a, b) = \frac{\tau^{a-1} \exp(-\tau / b)}{\Gamma(a)\, b^a},

so that $\langle \tau \rangle = ab$, $\langle \tau^{-1} \rangle = ((a-1)b)^{-1}$, and $\langle \log \tau \rangle = \psi(a) + \log(b)$. The prior on $\tau$ is an Exponential [...]

[...] The distribution for a given token is $P(w_n^{(d)} \mid z_n^{(d)}, \eta, m) \propto \exp(\eta_{z_n^{(d)}} + m)$. We can combine the mean field variational inference for latent Dirichlet allocation (LDA) with the variational treatment of $\tau$, optimizing the bound

\ell = \sum_d \left[ \langle \log P(\theta_d \mid \alpha) \rangle + \sum_n^{N_d} \langle \log P(w_n^{(d)} \mid \eta, m, z_n^{(d)}) \rangle + \langle \log P(z_n^{(d)} \mid \theta_d) \rangle \right] + \sum_k \langle \log P(\eta_k \mid 0, \tau_k) \rangle + \sum_k \langle \log P(\tau_k \mid \gamma) \rangle - \langle \log Q(\theta, z, \tau) \rangle.   (8)

The updates for $Q(z)$ and $Q(\theta)$ are identical to standard LDA; the updates for $Q(\tau)$ remain as in Section 3.2. However, the presence of latent variables slightly changes the MAP estimation for $\eta$:

\ell(\eta_k) = \sum_d \sum_n^{N_d} Q_{z_n^{(d)}}(k) \left( \eta_{k, w_n^{(d)}} - \log \sum_i \exp(\eta_{ki} + m_i) \right) - \eta_k^T \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k / 2

\frac{\partial \ell}{\partial \eta_k} = \langle c_k \rangle - \langle C_k \rangle \frac{\exp(\eta_k + m)}{\sum_i \exp(\eta_{ki} + m_i)} - \mathrm{diag}(\langle \tau_k^{-1} \rangle)\, \eta_k,

where $\langle c_{ki} \rangle = \sum_d \sum_n Q_{z_n^{(d)}}(k)\, \delta(w_n^{(d)} = i)$ and $\langle C_k \rangle = \sum_i \langle c_{ki} \rangle$. Thus, the exact counts $c_k$ are replaced with their expectations under $Q(z)$.

We define an EM procedure in which the M-step consists of iteratively fitting the parameters $\eta$ and $Q(\tau)$. It is tempting to perform a "warm start" by initializing with the values from a previous iteration of the outer EM loop. However, these parameters are tightly coupled: as the component mean $\eta_{ki}$ goes to zero, the expected variance $\langle \tau_{ki} \rangle$ is also driven to zero; once $\langle \tau_{ki} \rangle$ is very small, $\eta_{ki}$ cannot move away from zero regardless of the expected counts $c_k$. This means that a warm start risks locking in a sparsity pattern during the early stages of EM which may be far from the global optimum. There are two solutions: either we abandon the warm start (thus expending more computation), or we do not iterate to convergence in each M-step (thus obtaining noisier and less sparse solutions, initially). Fortunately, pilot experiments showed that good results can be obtained by performing just one iteration in each M-step, while using the warm start technique.
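In this setting the only data-side change relative to the naive Bayes case is that observed counts are replaced by their expectations under $Q(z)$. The sketch below accumulates those expected counts from per-token responsibilities; the names (docs, q_z) and the list-of-arrays layout are our own assumptions, with the responsibilities taken to come from a standard LDA E-step.

```python
import numpy as np

def expected_counts(docs, q_z, K, V):
    """Expected topic-term counts <c_k> under Q(z).

    docs : list of integer arrays; docs[d][n] is the vocabulary index of token n
    q_z  : list of arrays; q_z[d][n, k] = Q_{z_n^{(d)}}(k) from the LDA E-step
    K, V : number of topics and vocabulary size
    """
    c = np.zeros((K, V))
    for w_d, q_d in zip(docs, q_z):
        for k in range(K):
            # <c_{ki}> += sum_n Q_{z_n}(k) * [w_n = i]
            np.add.at(c[k], w_d, q_d[:, k])
    return c
```

Each row of the result then plays the role of $c_k$ in the Newton update sketched earlier, with $\langle C_k \rangle$ given by the corresponding row sum.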
Application 2: Sparse topic models

Our second evaluation applies the SAGE Topic Model to the benchmark NIPS dataset.² Following the evaluation of Wang and Blei (2009), we subsample to 10% of the tokens in each document, and hold out 20% of the documents as a "test set" on which to evaluate predictive likelihood. We limit the vocabulary to the 10,000 terms that appear in the greatest number of documents; no stopword pruning is applied. Overall this leaves 1986 training documents with 237,691 tokens, and a test set of 497 documents and 57,427 tokens. We evaluate perplexity using the Chib-style estimation procedure of Wallach et al. (2009b). For comparison, we implement variational latent Dirichlet allocation, making maximum-likelihood updates to a symmetric Dirichlet prior on the topic-term distributions.

² http://www.cs.nyu.edu/roweis/data.html

[Figure 3. Perplexity results for SAGE and latent Dirichlet allocation on the NIPS dataset.]

[Figure 4. Proportion of total variation committed to words at each frequency decile. Dirichlet-multinomial LDA makes large distinctions in the topic-term frequencies of very rare words, while SAGE only distinguishes the topic-term frequencies of words with robust counts.]

Results are shown in Figure 3, using boxplots over five paired random initializations for each method. SAGE outperforms standard latent Dirichlet allocation as the number of topics increases; with both 25 and 50 topics, every SAGE run outperformed its counterpart Dirichlet-multinomial. As in the classification task, SAGE controls sparsity adaptively: as the number of topics increases from 10 to 50, the proportion of non-zero weights decreased five-fold (from 5% to 1%), holding the total model complexity constant.

[Figure 6. SAGE's accuracy on the ideological perspective task; the state-of-the-art is 69.1% (Ahmed & Xing, 2010).]

[...] interaction components $\eta^{(I)}_{jk}$ depend on exactly one label distribution $\eta^{(A)}_j$ and one topic $\eta^{(T)}_k$, so we can use $C_{jk} \beta_{jk}$ directly without computing any sums.

Application 3: Topic and ideology

We first evaluate on a publicly-available dataset of political blogs describing the 2008 U.S. presidential election (Eisenstein & Xing, 2010). There are a total of six blogs, three from the right and three from the left, comprising 20,827 documents, 5.1 million word tokens, and a vocabulary of 8284 items. The task is to predict the ideological perspective of two unlabeled blogs, using the remaining four as a training set. We strictly follow the experimental procedure of Ahmed & Xing (2010), allowing us to compare with their reported results directly.⁴

⁴ The sole exception is that we learn the prior $\gamma$ from data, while Ahmed & Xing set it manually.

Ahmed and Xing considered three ideology-prediction tasks, and found that the six-blog task was the most difficult: their Multiview LDA model achieves accuracy between 65% and 69.1% depending on the number of topics. They find a comparable result of 69% using support vector machines; the alternative latent variable models Discriminative LDA (Lacoste-Julien et al., 2008) and Supervised LDA (Blei & McAuliffe, 2007) do worse. Our results are shown in Figure 6, reporting the median across five random initializations at each setting of $K$. Our best median result is 69.6%, at $K = 30$, equalling the state of the art; our best individual run achieves 71.9%. Our model obtains sparsity of 93% for topics, 82% for labels, and 99.3% for topic-label interactions; nonetheless, pilot experiments show that the absence of topic-label interactions reduces performance substantially.

Application 4: Geolocation from text

We now consider the setting in which the label is itself a latent variable, generating both the text (as described above) and some additional metadata. This is the setting for the Geographical Topic Model, in which a latent "region" helps to select the distributions that generate both text and observed GPS locations (Eisenstein et al., 2010). By training on labeled examples in which both text and geolocation are observed, the model is able to make predictions about the GPS location of unlabeled authors.

The Geographic Topic Model induces region-specific versions of each topic by chaining together log-Normal distributions. This is equivalent to an additive model in which both the topic and the topic-region interaction exert a zero-mean Gaussian deviation from a background language model. SAGE differs by allowing effects that are region-specific but topic-neutral, and by inducing sparsity. We follow the tuning procedures from Eisenstein et al. (2010) exactly: the number of latent regions is determined by running a Dirichlet process mixture model on the location data alone, and the number of topics is tuned against a development set. We also present more recent results from Wing & Baldridge (2011), who use a nearly identical dataset, but include a larger vocabulary. As shown in Table 1, SAGE achieves the best mean error of any system on this task, though Wing & Baldridge (2011) have the best median error.

Table 1. Prediction error for Twitter geolocation (error in kilometers).

                                median    mean
  (Eisenstein et al., 2010)       494      900
  (Wing & Baldridge, 2011)        479      967
  SAGE                            501      845

6. Related work

Sparse learning (Tibshirani, 1996; Tipping, 2001) typically focuses on supervised settings, learning a sparse set of weights that minimize a loss on the training labels. Two recent papers apply sparsity to topic models. Williamson et al. (2010) induce sparsity in the topic proportions by using the Indian Buffet Process to represent the presence or absence of topics in a document. More closely related is the SparseTM of Wang & Blei (2009), which induces sparsity in the topics themselves using a spike-and-slab distribution.
However, the notion of sparsity is different: in the SparseTM, the topic-term probability distributions can contain zeros, while in SAGE, each topic is a set of sparse deviations from a background distribution. Inference in the SparseTM requires computing a combinatorial sum over all sparsity patterns, while in our case a relatively simple coordinate ascent is possible.

Sparse dictionary learning provides an alternative approach to modeling document content with sparse bases (Jenatton et al., 2010). In general, such ap- [...]