Doubly Stochastic Variational Bayes for non-Conjugate Inference

Michalis K. Titsias, Department of Informatics, Athens University of Economics and Business, Greece
Miguel Lázaro-Gredilla, Dpt. Signal Processing & Communications, Universidad Carlos III de Madrid, Spain


...experimentally this doubly stochastic scheme in large-scale Bayesian logistic regression. Independently from our work, Kingma & Welling (2013) and Rezende et al. (2014) also derived doubly stochastic variational inference algorithms by utilising gradients from the joint probability density. Our work provides an additional perspective and it also specialises on different types of applications such as variable selection and Gaussian process hyperparameter learning.

2. Theory

Consider a random vector $z \in \mathbb{R}^D$ that follows a distribution with a continuous density function $\pi(z)$. We shall assume $\pi(z)$ to exist in standard form, so that any mean vector parameter is set to zero and scale parameters are set to one. For instance, $\pi(z)$ could be the standard normal distribution, the standard t distribution, a product of standard logistic distributions, etc. We often refer to $\pi(z)$ as the standard distribution and, for the time being, we shall leave it unspecified and develop the theory in a general setting. A second assumption we make about $\pi(z)$ is that it permits straightforward simulation of independent samples. We aim to utilise the standard distribution as a building block for constructing correlated variational distributions. While $\pi(z)$ currently has no structure, we can add correlation by applying an invertible transformation, $\theta = Cz + \mu$, where the scale matrix $C$ is taken to be a lower triangular positive definite matrix (i.e. its diagonal elements are strictly positive) and $\mu$ is a real vector. Given that the Jacobian of the inverse transformation is $1/|C|$, the distribution over $\theta$ takes the form

$$q(\theta|\mu, C) = \frac{1}{|C|}\, \pi\!\left(C^{-1}(\theta - \mu)\right), \qquad (1)$$

which is a multivariate and generally correlated distribution having as adjustable parameters the mean vector $\mu$ and the scale matrix $C$. We wish to employ $q(\theta|\mu, C)$ as a variational distribution for approximating the exact Bayesian posterior in the general setting where we have a non-conjugate model. More precisely, we consider a probabilistic model with the joint density

$$g(\theta) = p(y|\theta)\, p(\theta), \qquad (2)$$

where $y$ are data and $\theta \in \mathbb{R}^D$ are all unobserved random variables, which can include both latent variables and parameters. Following the standard variational Bayes inference method (Jordan et al., 1999; Neal & Hinton, 1999; Wainwright & Jordan, 2008), we seek to minimise the KL divergence $\mathrm{KL}[q(\theta|\mu,C)\,||\,p(\theta|y)]$ between the variational and the true posterior distribution. This can equivalently be formulated as the maximisation of the following lower bound on the log marginal likelihood,

$$F(\mu, C) = \int q(\theta|\mu, C) \log \frac{g(\theta)}{q(\theta|\mu, C)}\, d\theta, \qquad (3)$$

where $q(\theta|\mu, C)$ is given by (1).
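To make the construction concrete, here is a minimal Python sketch (not the authors' code; `sample_q`, `log_q` and `bound_estimate` are hypothetical helper names) that draws samples from $q(\theta|\mu, C)$ via $\theta = Cz + \mu$ and forms a naive Monte Carlo estimate of the bound (3), assuming for concreteness that $\pi(z)$ is the standard normal and that a user-supplied function `log_g` computes $\log g(\theta)$.

```python
import numpy as np

def sample_q(mu, C, rng):
    """Draw one sample from q(theta | mu, C) by transforming a standard draw."""
    z = rng.standard_normal(mu.shape[0])   # z ~ pi(z); here pi is N(0, I)
    return C @ z + mu                      # theta = C z + mu

def log_q(theta, mu, C):
    """Log density of q(theta | mu, C) in eq. (1), for the Gaussian choice of pi."""
    z = np.linalg.solve(C, theta - mu)             # z = C^{-1}(theta - mu)
    log_pi = -0.5 * z @ z - 0.5 * len(z) * np.log(2.0 * np.pi)
    log_det_C = np.sum(np.log(np.diag(C)))         # log|C| from the triangular diagonal
    return log_pi - log_det_C

def bound_estimate(log_g, mu, C, rng, num_samples=200):
    """Naive Monte Carlo estimate of the lower bound F(mu, C) in eq. (3)."""
    total = 0.0
    for _ in range(num_samples):
        theta = sample_q(mu, C, rng)
        total += log_g(theta) - log_q(theta, mu, C)
    return total / num_samples
```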
By changing variables according to $z = C^{-1}(\theta - \mu)$, the bound in (3) can be written as

$$F(\mu, C) = \int \pi(z) \log \frac{g(Cz + \mu)\,|C|}{\pi(z)}\, dz = \mathbb{E}_{\pi(z)}[\log g(Cz + \mu)] + \log|C| + \mathcal{H}_\pi, \qquad (4)$$

where $\log|C| = \sum_{d=1}^{D} \log C_{dd}$ and $\mathcal{H}_\pi$ denotes the entropy of $\pi(z)$, which is constant with respect to the variational parameters $(\mu, C)$ and therefore can be ignored when maximising the bound. Also notice that the above requires integration over the distribution $\pi(z)$, which exists in standard form and therefore does not depend on the variational parameters $(\mu, C)$. These parameters have instead been transferred inside the logarithm of the joint density. Further, it is worth noticing that when the logarithm of the joint model density, i.e. $\log g(\theta)$, is concave with respect to $\theta$, the lower bound in (4) is also concave with respect to the variational parameters $(\mu, C)$, and this holds for any standard distribution $\pi(z)$; see the supplementary material for a formal statement and proof. This generalises the result of Challis & Barber (2011; 2013), who proved it for the variational Gaussian approximation, and it is similar to the generalisation presented in (Staines & Barber, 2012).

To fit the variational distribution to the true posterior, we need to maximise the bound (4), and therefore we consider the gradients over $\mu$ and $C$,

$$\nabla_\mu F(\mu, C) = \mathbb{E}_{\pi(z)}[\nabla_\mu \log g(Cz + \mu)], \qquad (5)$$
$$\nabla_C F(\mu, C) = \mathbb{E}_{\pi(z)}[\nabla_C \log g(Cz + \mu)] + \Lambda_C, \qquad (6)$$

where $\Lambda_C$ denotes the diagonal matrix with elements $(1/C_{11}, \ldots, 1/C_{DD})$ on the diagonal. Also, the term $\nabla_C \log g(Cz + \mu)$ in eq. (6) should be understood as the partial derivatives w.r.t. $C$ stored in a lower triangular matrix, so that there is one-to-one correspondence with the elements in $C$. A first observation about the gradients above is that they involve taking derivatives of the logarithm of the joint density by adding randomness through $z$ and then averaging out. To gain more intuition, we can equivalently express them in the original space of $\theta$ by changing variables in the reverse direction according to $\theta = Cz + \mu$. Using the chain rule we have that $\nabla_\mu \log g(Cz + \mu) = \nabla_{Cz+\mu} \log g(Cz + \mu)$ and similarly $\nabla_C \log g(Cz + \mu) = \nabla_{Cz+\mu} \log g(Cz + \mu)\, z^T$, where again $\nabla_{Cz+\mu} \log g(Cz + \mu)\, z^T$ should be understood as taking the lower triangular part after performing the outer vector product.

Algorithm 1: Doubly stochastic variational inference
  Input: $\pi(z)$, data $y$, learning rates $\{\rho_t\}$, $\nabla_\theta \log g$.
  Initialise $\mu^{(0)}$, $C^{(0)}$, $t = 0$.
  repeat
    $t = t + 1$;
    $z \sim \pi(z)$;
    $\theta^{(t-1)} = C^{(t-1)} z + \mu^{(t-1)}$;
    $\mu^{(t)} = \mu^{(t-1)} + \rho_t \nabla_\theta \log g(\theta^{(t-1)})$;
    $C^{(t)} = C^{(t-1)} + \rho_t \left( \nabla_\theta \log g(\theta^{(t-1)})\, z^T + \Lambda_{C^{(t-1)}} \right)$;
  until convergence criterion is met.
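As an illustration only (not the authors' reference implementation; the function name `dsvi` and the fixed learning rate are assumptions for the sketch), Algorithm 1 with a Gaussian $\pi(z)$ could look as follows in Python, given a user-supplied `grad_log_g` computing $\nabla_\theta \log g(\theta)$:

```python
import numpy as np

def dsvi(grad_log_g, D, num_iters=10000, rho=1e-3, seed=0):
    """Doubly stochastic variational inference (Algorithm 1) with a full scale matrix C.

    grad_log_g: function returning the gradient of log g(theta) at a point theta.
    Returns the fitted mean vector mu and lower-triangular scale matrix C.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(D)
    C = np.eye(D)                                   # lower triangular, positive diagonal

    for t in range(num_iters):
        z = rng.standard_normal(D)                  # z ~ pi(z), here the standard normal
        theta = C @ z + mu                          # theta = C z + mu
        g = grad_log_g(theta)

        mu = mu + rho * g                           # stochastic version of eq. (5)
        dC = np.outer(g, z)                         # stochastic version of eq. (6) ...
        dC[np.triu_indices(D, k=1)] = 0.0           # ... kept lower triangular
        dC[np.diag_indices(D)] += 1.0 / np.diag(C)  # Lambda_C term
        C = C + rho * dC                            # a small rho keeps diag(C) positive in practice
    return mu, C

# Quick sanity check: for log g(theta) = -0.5 * ||theta - 2||^2 (a Gaussian target),
# dsvi(lambda th: -(th - 2.0), D=5) should return mu close to the vector of twos.
```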
The chain-rule observations above allow us to transform (5) and (6) into the original space of $\theta$ as follows,

$$\nabla_\mu F(\mu, C) = \mathbb{E}_{q(\theta|\mu,C)}[\nabla_\theta \log g(\theta)], \qquad (7)$$
$$\nabla_C F(\mu, C) = \mathbb{E}_{q(\theta|\mu,C)}\!\left[\nabla_\theta \log g(\theta)\,(\theta - \mu)^T C^{-T}\right] + \Lambda_C. \qquad (8)$$

Eq. (7) is particularly intuitive, as it says that the gradient over $\mu$ is simply the gradient of the logarithm of the joint density with respect to the parameters $\theta$, averaged over the variational distribution.

We would now like to optimise the variational lower bound over $(\mu, C)$ using a stochastic approximation procedure. To this end, we need to provide stochastic gradients having as expectations the exact quantities. Based on the expressions (5)-(6), or their equivalent counterparts (7)-(8), we can proceed by firstly drawing $\theta^{(s)} \sim q(\theta|\mu, C)$, and then using $\nabla_\theta \log g(\theta^{(s)})$ as the stochastic direction for updating $\mu$ and $\nabla_\theta \log g(\theta^{(s)})(\theta^{(s)} - \mu)^T C^{-T}$ as the direction for updating $C$. To draw $\theta^{(s)}$, we need first to sample $z$ from $\pi(z)$ (which by assumption is possible) and then deterministically obtain $\theta^{(s)} = Cz + \mu$. Based on the latter, $(\theta^{(s)} - \mu)^T C^{-T}$ is just $z^T$, therefore the computationally efficient way to implement the whole stochastic approximation scheme is as summarised in Algorithm 1.

Based on the theory of stochastic approximation (Robbins & Monro, 1951), using a schedule of the learning rates $\{\rho_t\}$ such that $\sum_{t=1}^{\infty} \rho_t = \infty$ and $\sum_{t=1}^{\infty} \rho_t^2 < \infty$, the iteration in Algorithm 1 will converge to a local maximum of the bound in (3), or to the global maximum when this bound is concave. For notational simplicity we have assumed common learning rate sequences for $\mu$ and $C$; however, in practice we can use different sequences and the algorithm remains valid.

We will refer to the above stochastic approximation algorithm as doubly stochastic variational inference (DSVI), because it introduces stochasticity in a different direction than the standard stochastic variational inference proposed by (Hoffman et al., 2010; 2013). The latter is based on subsampling the training data and performing online parameter updates by using each time a single data point or a small "mini-batch", which is analogous to other online learning algorithms (Bottou, 1998; Bottou & Bousquet, 2008). Instead, our algorithm introduces stochasticity by sampling from the variational distribution. Notice that the latter type of stochasticity was first introduced by (Paisley et al., 2012), who proposed a different stochastic gradient for variational parameters that we compare against in Section 2.2. For joint probability models with a factorised likelihood, the two types of stochasticity can be combined so that in each iteration the stochastic gradients are computed by both sampling from the variational distribution and using a mini-batch of $n \ll N$ data points. It is straightforward to see that such doubly stochastic gradients are unbiased estimates of the true gradients, and therefore the whole scheme is valid. In the experiments, we demonstrate the double stochasticity for learning from very large datasets in Bayesian logistic regression. However, for simplicity, in the remainder of our presentation we will not analyse further the mini-batch type of stochasticity.

Finally, for inference problems where the dimensionality of $\theta$ is very large, and it is therefore impractical to optimise over a full scale matrix $C$, we can consider a diagonal matrix, in which case the scales can be stored in a $D$-dimensional strictly positive vector $c$. Then, the update over $C$ in Algorithm 1 is replaced by

$$c_d^{(t)} = c_d^{(t-1)} + \rho_t \left( \frac{\partial \log g(\theta^{(t-1)})}{\partial \theta_d}\, z_d + \frac{1}{c_d^{(t-1)}} \right), \qquad (9)$$

where $d = 1, \ldots, D$.
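A minimal sketch of this diagonal variant (illustrative only; `dsvi_diag` is a hypothetical name, and using a mini-batch gradient inside it is one way to realise the combined stochasticity mentioned above):

```python
import numpy as np

def dsvi_diag(grad_log_g, D, num_iters=10000, rho=1e-3, seed=0):
    """DSVI with a diagonal scale matrix (update (9)), suited to very large D.

    grad_log_g: gradient of log g(theta); for factorised likelihoods this can
    itself be a mini-batch estimate, giving the doubly stochastic scheme.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(D)
    c = np.ones(D)                      # strictly positive per-dimension scales

    for t in range(num_iters):
        z = rng.standard_normal(D)      # z ~ pi(z)
        theta = c * z + mu              # diagonal C acts elementwise
        g = grad_log_g(theta)
        mu = mu + rho * g               # same mean update as the full-matrix case
        c = c + rho * (g * z + 1.0 / c) # eq. (9) for all dimensions at once
    return mu, c
```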
Notice that when the initial standard distribution $\pi(z)$ is fully factorised, this diagonal scheme leads to a fully factorised variational approximation $q(\theta|\mu, c) = \prod_{d=1}^{D} q_d(\theta_d|\mu_d, c_d)$. While such an approach can have lower accuracy than using the full scale matrix $C$, it has the great advantage that it can scale up to thousands or millions of parameters. In Section 3.2, we use this scheme for variable selection in logistic regression and we also introduce a novel variational objective function for sparse inference following the idea of automatic relevance determination.

In the following two sections we elaborate more on the properties of DSVI by drawing connections with the Gaussian approximation (Section 2.1) and analysing convergence properties (Section 2.2).

2.1. Connection with the Gaussian approximation

The Gaussian approximation, see (Barber & Bishop, 1998; Seeger, 1999) and more recently (Opper & Archambeau, 2009), assumes a multivariate Gaussian distribution $N(\theta|\mu, \Sigma)$ as a variational distribution to approximate the exact model posterior, which leads to the maximisation of the lower bound

$$F(\mu, \Sigma) = \int N(\theta|\mu, \Sigma) \log \frac{g(\theta)}{N(\theta|\mu, \Sigma)}\, d\theta. \qquad (10)$$

The maximisation relies on analytical integration (or, in the worst case, on one-dimensional numerical integration) for computing $\int N(\theta|\mu, \Sigma) \log g(\theta)\, d\theta$, which subsequently allows the variational parameters $(\mu, \Sigma)$ to be tuned using gradient optimisation methods (Opper & Archambeau, 2009; Honkela et al., 2011). More recently, Challis & Barber (2011; 2013) use this framework with the parametrisation $\Sigma = CC^T$ and show that when $\log g(\theta)$ is concave, the bound is concave w.r.t. $(\mu, C)$. The limitation of these approaches is that they rely on $\log g(\theta)$ having a simple form so that the integral $\int N(\theta|\mu, \Sigma) \log g(\theta)\, d\theta$ is analytically tractable. Unfortunately, this constraint excludes many interesting Bayesian inference problems, such as inference over kernel hyperparameters in Gaussian process models, inference over weights in neural networks, and others. In contrast, our stochastic variational framework only relies on $\log g(\theta)$ being a differentiable function of $\theta$.

If we specify the distribution $\pi(z)$ to be the standard normal $N(z|0, I)$, the lower bound in (4) becomes the Gaussian approximation bound in (10) with the parametrisation $\Sigma = CC^T$. Subsequently, if we apply the DSVI iteration according to Algorithm 1 with the specialisation that $z$ is drawn from $N(z|0, I)$, the algorithm will stochastically maximise the Gaussian approximation bound. Therefore, DSVI allows the Gaussian approximation to be applied to a much wider range of models.

A different direction of flexibility in the DSVI framework concerns the choice of the standard distribution $\pi(z)$. Clearly, if we choose a non-Gaussian form we obtain non-Gaussian variational approximations. For instance, when this distribution is the standard t with $\nu$ degrees of freedom, i.e. $\pi(z) = \mathrm{St}(z; \nu, 0, I)$, the variational distribution $q(\theta|\mu, C)$ becomes the general t distribution with $\nu$ degrees of freedom, i.e. $q(\theta|\mu, C) = \mathrm{St}(\theta; \nu, \mu, CC^T)$. A flexible way to define a standard distribution is to assume a fully factorised form $\pi(z) = \prod_{d=1}^{D} \pi_d(z_d)$ and then select the univariate marginals $\pi_d(z)$, $d = 1, \ldots, D$, from a family of univariate distributions for which exact simulation is possible. While in such cases the resulting $q(\theta|\mu, C)$ can be of non-standard form, simulating exact samples from this distribution is always straightforward, since $\theta^{(s)} = Cz + \mu$, $z \sim \pi(z)$, is by construction an independent sample from $q(\theta|\mu, C)$. In the current experimental study presented in Section 3 we only consider the DSVI algorithm for stochastic maximisation of the Gaussian approximation bound. We defer experimentation with other forms of the $\pi(z)$ distribution to future work.
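To illustrate this flexibility (a sketch under the assumption that only the sampler for $\pi(z)$ needs to change inside the DSVI loop; `make_standard_sampler` and its options are hypothetical), the same transformation $\theta = Cz + \mu$ produces different variational families depending on how $z$ is drawn:

```python
import numpy as np

def make_standard_sampler(kind, D, nu=4.0, seed=0):
    """Return a zero-argument sampler for the 'standard' distribution pi(z).

    kind='normal'    -> q(theta | mu, C) is the multivariate Gaussian N(mu, C C^T)
    kind='student_t' -> q(theta | mu, C) is a multivariate t with nu degrees of freedom
    kind='logistic'  -> pi is a product of standard logistics; q is non-standard but easy to sample
    """
    rng = np.random.default_rng(seed)
    if kind == "normal":
        return lambda: rng.standard_normal(D)
    if kind == "student_t":
        def sample_t():
            # Multivariate standard t as a scale mixture: one shared chi-square across dimensions.
            s = rng.chisquare(nu) / nu
            return rng.standard_normal(D) / np.sqrt(s)
        return sample_t
    if kind == "logistic":
        return lambda: rng.logistic(loc=0.0, scale=1.0, size=D)
    raise ValueError("unknown standard distribution: " + kind)

# Inside the DSVI loop one would simply use: z = sample_z(); theta = C @ z + mu
```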
2.2. Illustrative convergence analysis

In this section, we informally analyse the convergence behaviour of DSVI, i.e. its ability to locate a local maximum of the variational lower bound. We will use an illustrative example where $g(\theta)$ is proportional to a multivariate Gaussian, and we will compare our method with an alternative doubly stochastic approximation approach proposed in (Paisley et al., 2012). For simplicity, in what follows we will be using the notation $f(\theta) = \log g(\theta)$.

Figure 1. The evolution of the lower bound (optimal value is zero) obtained by the two stochastic approximation methods employing two alternative stochastic gradients for fitting a 10-dimensional Gaussian distribution $N(\theta|m, I)$, where $m$ was set to the vector of twos. The variational mean was initialised to the zero vector. For each method two realisations are shown (one with small and one with large learning rate). Blue solid lines correspond to DSVI, while green and red lines to the alternative algorithm.

Recall eq. (7), which gives the gradient over the variational mean and which we repeat here for convenience,

$$\int N(\theta|\mu, \Sigma)\, \nabla_\theta f(\theta)\, d\theta, \qquad (11)$$

where we have also specified the variational distribution $q(\theta|\mu, C)$ to be the Gaussian $N(\theta|\mu, \Sigma)$ with $\Sigma = CC^T$. Based on the above, the implied single-sample stochastic approximation of the gradient is $\nabla_{\theta^{(s)}} f(\theta^{(s)})$, where $\theta^{(s)} \sim N(\theta|\mu, \Sigma)$, which is precisely what DSVI uses. The question that arises now is whether there exists an alternative way to write the exact gradient over $\mu$ that can give rise to a different stochastic gradient and, more importantly, how the different stochastic gradients compare with one another in terms of convergence. It turns out that an alternative expression for the gradient over $\mu$ is obtained by directly differentiating the initial bound in (10), which gives

$$\int N(\theta|\mu, \Sigma)\, f(\theta)\, \Sigma^{-1}(\theta - \mu)\, d\theta. \qquad (12)$$

This form can also be obtained by the general method in (Paisley et al., 2012), according to which the gradient over some variational parameter $\xi$ in a variational distribution $q(\theta|\xi)$ is computed based on $\int f(\theta)\, q(\theta|\xi)\, \nabla_\xi[\log q(\theta|\xi)]\, d\theta$, which in the case of the Gaussian $q(\theta|\mu, \Sigma)$ and $\xi = \mu$ reduces to (12), since $\nabla_\mu \log N(\theta|\mu, \Sigma) = \Sigma^{-1}(\theta - \mu)$.
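The contrast illustrated in Figure 1 can be reproduced with a short experiment along these lines (a sketch, not the authors' code; the fixed learning rate and iteration count are illustrative assumptions). Both updates below are unbiased single-sample estimates of the same exact gradient over $\mu$, but the DSVI estimator from (11) is typically far less noisy than the estimator from (12):

```python
import numpy as np

# Toy target: f(theta) = log g(theta) with g proportional to N(theta | m, I).
D = 10
m = 2.0 * np.ones(D)
f = lambda th: -0.5 * np.sum((th - m) ** 2)   # known only up to an additive constant
grad_f = lambda th: -(th - m)

rng = np.random.default_rng(0)
Sigma_inv = np.eye(D)                          # keep Sigma = I fixed; adapt only the mean
mu_dsvi = np.zeros(D)
mu_alt = np.zeros(D)
rho = 0.01

for t in range(5000):
    # One sample from q(theta | mu, Sigma) for each method.
    th1 = mu_dsvi + rng.standard_normal(D)
    th2 = mu_alt + rng.standard_normal(D)
    # DSVI stochastic gradient, from eq. (11): gradient of f at the sample.
    mu_dsvi = mu_dsvi + rho * grad_f(th1)
    # Alternative stochastic gradient, from eq. (12): f times the score of q.
    mu_alt = mu_alt + rho * f(th2) * (Sigma_inv @ (th2 - mu_alt))

print("DSVI     distance to optimum:", np.linalg.norm(mu_dsvi - m))
print("Eq. (12) distance to optimum:", np.linalg.norm(mu_alt - m))
```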
Table 1. Size and number of features of each cancer dataset.

  Dataset    #Train   #Test   D
  Colon      42       20      2,000
  Leukemia   38       34      7,129
  Breast     38       4       7,129

Table 2. Train and test errors for the three cancer datasets and for each method: CONCAV is the original DSVI algorithm with a fixed prior, whereas ARD is the feature-selection version.

  Problem             Train Error   Test Error
  Colon (ARD)         0/42          1/20
  Colon (CONCAV)      0/42          0/20
  Leukemia (ARD)      0/38          3/34
  Leukemia (CONCAV)   0/38          12/34
  Breast (ARD)        0/38          2/4
  Breast (CONCAV)     0/38          0/4

Table 3. Size and sparsity level of each large-scale dataset.

  Dataset   #Train    #Test     D        #Nonzeros
  a9a       32,561    16,281    123      451,592
  rcv1      20,242    677,399   47,236   49,556,258
  Epsilon   400,000   100,000   2,000    800,000,000

Table 4. Test error rates for DSVI-ARD and $\ell_1$-logistic regression on three large-scale datasets.

  Dataset   DSVI-ARD   $\ell_1$ Log. Reg.   $\lambda$
  a9a       0.1507     0.1500               2
  rcv1      0.0414     0.0420               4
  Epsilon   0.1014     0.1011               0.5

Table 5. Performance measures of GP regression where hyperparameters are selected by ML-II, DSVI or MCMC.

  Dataset             ML-II     DSVI      MCMC
  Boston    (smse)    0.0743    0.0709    0.0699
            (nlpd)    0.1783    0.1425    0.1317
  Bodyfat   (smse)    0.1992    0.0726    0.0726
            (nlpd)    -0.1284   -2.0750   -2.0746
  Pendulum  (smse)    0.2727    0.2807    0.2801
            (nlpd)    0.4537    0.4465    0.4462

3.3. Large-scale datasets

In order to demonstrate the scalability of the proposed method, we run it on three well-known large-scale binary classification datasets, a9a, rcv1, and Epsilon, whose details are listed in Table 3. Dataset a9a is derived from "Adult" in the UCI repository, rcv1 is an archive of manually categorised news stories from Reuters (we use the original train/test split), and Epsilon is an artificial dataset from PASCAL's large-scale learning challenge 2008.

We use again the Bayesian logistic regression model with variable selection and we applied the DSVI-ARD algorithm described previously. For all problems, mini-batches of size 500 are used, so this process does not ever require the whole dataset to be loaded in memory. We contrast our results with standard $\ell_1$-logistic regression, which exactly minimises the convex functional $L(w) = \lambda ||w||_1 - \sum_{n=1}^{N} \log s(y_n x_n^T w)$. Both methods are run on the exact same splits. The value of $\lambda$ was selected using 5-fold cross-validation. Results are reported in Table 4 and show the compromises made between both approaches. The proposed approach scales well to very large datasets, but it does not outperform $\ell_1$-logistic regression in these examples. This is expected, since the number of data points is so high that there is little benefit from using a Bayesian approach here. Note, however, the slight advantage obtained for rcv1, where there is a huge number of dimensions. Another benefit of DSVI-ARD is the low memory requirements (we needed a 32 GB RAM computer to run the $\ell_1$-logistic regression, whereas a 4 GB one was enough for DSVI-ARD). In contrast, logistic regression was more than 100 times faster in achieving convergence (using the highly optimised LIBLINEAR software).
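For intuition about how the log-joint gradient for these experiments might be formed (a sketch only: a fixed-variance Gaussian prior is used in place of the ARD construction of Section 3.2, and all helper names are hypothetical), a mini-batch estimate of $\nabla_w \log g(w)$ for Bayesian logistic regression can be plugged straight into the diagonal DSVI update sketched earlier:

```python
import numpy as np

def make_grad_log_g(X, y, prior_var=1.0, batch_size=500, seed=0):
    """Mini-batch estimate of the gradient of log g(w) = log p(y | X, w) + log p(w).

    X: (N, D) feature matrix; y: (N,) labels in {-1, +1}.
    The likelihood term is rescaled by N / batch_size, so the estimate stays unbiased.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]

    def grad_log_g(w):
        idx = rng.choice(N, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        s = 1.0 / (1.0 + np.exp(yb * (Xb @ w)))      # s(-y x^T w), the logistic function
        grad_lik = (N / batch_size) * (Xb * (yb * s)[:, None]).sum(axis=0)
        grad_prior = -w / prior_var                   # zero-mean Gaussian prior on the weights
        return grad_lik + grad_prior

    return grad_log_g

# Usage with the diagonal DSVI sketch shown earlier:
# mu, c = dsvi_diag(make_grad_log_g(X_train, y_train), D=X_train.shape[1])
```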
3.4. Gaussian process hyperparameters

Gaussian processes (GPs) are non-parametric Bayesian models widely used to solve regression tasks. In a typical setting, a regression dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$ is modelled as $y_n = f(x_n) + \varepsilon_n$, where $\varepsilon_n \sim N(0, \sigma^2)$ and $f(x) \sim GP(0, k(x, x'; \theta))$, for some kernel hyperparameters $\theta$ and noise variance $\sigma^2$. Point estimates for the hyperparameters are typically obtained by optimising the marginal likelihood of the GP using some gradient ascent procedure (Rasmussen & Williams, 2006). Here, we suggest replacing this procedure with stochastic gradient ascent optimisation of the lower bound, which provides a posterior distribution over the hyperparameters. While the stochastic nature of the proposed method will probably imply that more marginal likelihood evaluations are required for convergence, this additional computational cost will make the model more resistant to overfitting and provide a posterior over the hyperparameters at a fraction of the cost of full MCMC.

Using a GP with kernel $k(x, x') = \sigma_f^2 e^{-\frac{1}{2}\sum_{d=1}^{D}(x_d - x'_d)^2/\ell_d^2}$, we place vague independent normal priors over the hyperparameters in log space and compute the posterior and predictive densities for three datasets: Boston, Bodyfat, and Pendulum. Obviously, for this model, no stochasticity over the dataset is used. Boston is a UCI dataset related to housing values in Boston, Bodyfat requires predicting the percentage of body fat from several body measurements, and in Pendulum the change in angular velocity of a simulated mechanical pendulum must be predicted.

Figure 3. Top: Rolling-window average (see supplementary material) of the instantaneous lower bound values. Bottom: Final value of the approximate mean vectors. First column corresponds to Colon, second to Leukemia and third to the Breast dataset.

Figure 4. Marginal variational Gaussian distributions for some hyperparameters in the Boston dataset (shown as dashed red lines). The black solid lines show the ground-truth empirical estimates for these marginals obtained by MCMC.

Figure 4 displays variational posterior marginal distributions for three of the hyperparameters in the Boston housing dataset, together with the corresponding empirical marginals obtained by long MCMC runs. Clearly, the variational marginals match the MCMC estimates very closely; see the supplementary material for a complete set of such figures for all hyperparameters in all three regression datasets. Furthermore, negative log-predictive densities (nlpd) as well as standardised mean square errors (smse) on test data are shown in Table 5 for maximum marginal likelihood model selection (ML-II, the standard for GPs), DSVI and MCMC. As the table shows, ML-II, which is the most widely used method for hyperparameter selection in GPs, overfits the Bodyfat dataset. DSVI and MCMC do not show this problem, yielding much better test performance. To provide an intuition of the computational effort associated with each of these methods, note that in these experiments, on average, ML-II took 40 seconds, DSVI 30 minutes and MCMC 20 hours. Further details on all the above GP regression experiments, including the learning rates used, are given in the supplementary material.

4. Discussion and future work

We have presented a stochastic variational inference algorithm that utilises gradients of the joint probability density and is based on double stochasticity (by both subsampling training data and simulating from the variational density) to deal with non-conjugate models and big datasets. We have shown that the method can be applied to a number of diverse cases, achieving competitive results. Further work should be concerned with speeding up the stochastic approximation algorithm as well as with fitting more complex variational distributions, such as mixture models.

Acknowledgments

MKT greatly acknowledges support from "Research Funding at AUEB for Excellence and Extroversion, Action 1: 2012-2014". MLG gratefully acknowledges support from Spanish CICYT TIN2011-24533.
References

Barber, D. and Bishop, C. M. Ensemble learning in Bayesian neural networks. In Jordan, M., Kearns, M., and Solla, S. (eds.), Neural Networks and Machine Learning, pp. 215-237, Berlin, 1998.

Bottou, Léon. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.

Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In NIPS, volume 20, pp. 161-168, 2008.

Challis, Edward and Barber, David. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In AISTATS, pp. 199-207, 2011.

Challis, Edward and Barber, David. Gaussian Kullback-Leibler approximate inference. Journal of Machine Learning Research, 14(1):2239-2286, January 2013.

Hoffman, Matthew D., Blei, David M., and Bach, Francis R. Online learning for latent Dirichlet allocation. In NIPS, pp. 856-864, 2010.

Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John William. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235-3268, 2011.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, November 1999.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

Neal, Radford M. and Hinton, Geoffrey E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, Michael I. (ed.), Learning in Graphical Models, pp. 355-368, 1999.

Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.

Paisley, John William, Blei, David M., and Jordan, Michael I. Variational Bayesian inference with stochastic search. In ICML, 2012.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 814-822, 2014.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods. Springer-Verlag, 1st edition, August 1999.

Seeger, Matthias. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In NIPS 12, pp. 603-609, 1999.

Shevade, Shirish Krishnaj and Keerthi, S. Sathiya. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246-2253, 2003.

Staines, Joe and Barber, David. Variational optimization. Technical report, 2012.

Tipping, Michael E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.

Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, January 2008.