Doubly Stochastic Variational Bayes for non-Conjugate Inference

Michalis K. Titsias (MTITSIAS@AUEB.GR)
Department of Informatics, Athens University of Economics and Business, Greece

Miguel Lázaro-Gredilla (MIGUEL@TSC.UC3M.ES)
Dpt. Signal Processing & Communications, Universidad Carlos III de Madrid, Spain

Abstract. We propose a simple a…
tally this doubly stochastic scheme in large-scale Bayesian logistic regression. Independently from our work, Kingma & Welling (2013) and Rezende et al. (2014) also derived doubly stochastic variational inference algorithms by utilising gradients from the joint probability density. Our work provides an additional perspective and it also specialises in different types of applications, such as variable selection and Gaussian process hyperparameter learning.

2. Theory

Consider a random vector z ∈ R^D that follows a distribution with a continuous density function π(z). We shall assume π(z) to exist in standard form, so that any mean parameter vector is set to zero and scale parameters are set to one. For instance, π(z) could be the standard normal distribution, the standard t distribution, a product of standard logistic distributions, etc. We often refer to π(z) as the standard distribution and, for the time being, we shall leave it unspecified and develop the theory in a general setting. A second assumption we make about π(z) is that it permits straightforward simulation of independent samples.

We aim to utilise the standard distribution as a building block for constructing correlated variational distributions. While π(z) currently has no structure, we can add correlation by applying an invertible transformation, θ = Cz + μ, where the scale matrix C is taken to be a lower triangular positive definite matrix (i.e. its diagonal elements are strictly positive) and μ is a real vector. Given that the Jacobian of the inverse transformation is 1/|C|, the distribution over θ takes the form

  q(θ|μ, C) = (1/|C|) π(C⁻¹(θ − μ)),   (1)

which is a multivariate and generally correlated distribution having as adjustable parameters the mean vector μ and the scale matrix C. We wish to employ q(θ|μ, C) as a variational distribution for approximating the exact Bayesian posterior in the general setting where we have a non-conjugate model. More precisely, we consider a probabilistic model with the joint density

  g(θ) = p(y|θ) p(θ),   (2)

where y are data and θ ∈ R^D are all unobserved random variables, which can include both latent variables and parameters. Following the standard variational Bayes inference method (Jordan et al., 1999; Neal & Hinton, 1999; Wainwright & Jordan, 2008), we seek to minimise the KL divergence KL[q(θ|μ, C) || p(θ|y)] between the variational and the true posterior distribution. This can equivalently be formulated as the maximisation of the following lower bound on the log marginal likelihood,

  F(μ, C) = ∫ q(θ|μ, C) log [ g(θ) / q(θ|μ, C) ] dθ,   (3)

where q(θ|μ, C) is given by (1). By changing variables according to z = C⁻¹(θ − μ), the above is written as

  F(μ, C) = ∫ π(z) log [ g(Cz + μ)|C| / π(z) ] dz
          = E_{π(z)}[log g(Cz + μ)] + log|C| + H_π,   (4)

where log|C| = Σ_{d=1}^D log C_dd and H_π denotes the entropy of π(z), which is constant with respect to the variational parameters (μ, C) and can therefore be ignored when maximising the bound. Also notice that the above requires integration over the distribution π(z), which exists in standard form and therefore does not depend on the variational parameters (μ, C). These parameters have instead been transferred inside the logarithm of the joint density. Further, it is worth noticing that when the logarithm of the joint model density, i.e. log g(θ), is concave with respect to θ, the lower bound in (4) is also concave with respect to the variational parameters (μ, C), and this holds for any standard distribution π(z); see the supplementary material for a formal statement and proof. This generalises the result of Challis & Barber (2011; 2013), who proved it for the variational Gaussian approximation, and it is similar to the generalisation presented in (Staines & Barber, 2012).

To fit the variational distribution to the true posterior, we need to maximise the bound (4) and therefore we consider the gradients over μ and C,

  ∇_μ F(μ, C) = E_{π(z)}[∇_μ log g(Cz + μ)],   (5)
  ∇_C F(μ, C) = E_{π(z)}[∇_C log g(Cz + μ)] + Δ,   (6)

where Δ denotes the diagonal matrix with elements (1/C₁₁, …, 1/C_DD) on the diagonal. Also, the term ∇_C log g(Cz + μ) in eq. (6) should be understood as the partial derivatives w.r.t. C stored in a lower triangular matrix, so that there is a one-to-one correspondence with the elements in C. A first observation about the gradients above is that they involve taking derivatives of the logarithm of the joint density by adding randomness through z and then averaging out. To gain more intuition, we can equivalently express them in the original space of θ by changing variables in the reverse direction according to θ = Cz + μ. Using the chain rule we have that ∇_μ log g(Cz + μ) = ∇_{Cz+μ} log g(Cz + μ) and similarly ∇_C log g(Cz + μ) = ∇_{Cz+μ} log g(Cz + μ) zᵀ, where again ∇_{Cz+μ} log g(Cz + μ) zᵀ should be understood as taking the lower triangular part after performing the outer vector product.

Algorithm 1: Doubly stochastic variational inference
  Input: π(z), data y, g(θ), ∇_θ log g(θ).
  Initialise μ^(0), C^(0), t = 0.
  repeat
    t = t + 1
    z ∼ π(z)
    θ^(t−1) = C^(t−1) z + μ^(t−1)
    μ^(t) = μ^(t−1) + ρ_t ∇_θ log g(θ^(t−1))
    C^(t) = C^(t−1) + ρ_t [ ∇_θ log g(θ^(t−1)) zᵀ + Δ^(t−1) ]
  until convergence criterion is met.

These observations allow us to transform (5) and (6) in the original space of θ as follows,

  ∇_μ F(μ, C) = E_{q(θ|μ,C)}[∇_θ log g(θ)],   (7)
  ∇_C F(μ, C) = E_{q(θ|μ,C)}[∇_θ log g(θ)(θ − μ)ᵀ C⁻ᵀ] + Δ.   (8)

Eq. (7) is particularly intuitive, as it says that the gradient over μ is simply the gradient of the logarithm of the joint density with respect to the parameters, averaged over the variational distribution.

We would now like to optimise the variational lower bound over (μ, C) using a stochastic approximation procedure. To this end, we need to provide stochastic gradients having as expectations the exact quantities. Based on the expressions (5)-(6), or their equivalent counterparts (7)-(8), we can proceed by firstly drawing θ^(s) ∼ q(θ|μ, C), and then using ∇_θ log g(θ^(s)) as the stochastic direction for updating μ and ∇_θ log g(θ^(s))(θ^(s) − μ)ᵀ C⁻ᵀ as the direction for updating C. To draw θ^(s), we first need to sample z from π(z) (which by assumption is possible) and then deterministically obtain θ^(s) = Cz + μ. Based on the latter, (θ^(s) − μ)ᵀ C⁻ᵀ is just zᵀ, therefore the computationally efficient way to implement the whole stochastic approximation scheme is as summarised in Algorithm 1. Based on the theory of stochastic approximation (Robbins & Monro, 1951), using a schedule of the learning rates {ρ_t} such that Σ_t ρ_t = ∞ and Σ_t ρ_t² < ∞, the iteration in Algorithm 1 will converge to a local maximum of the bound in (3), or to the global maximum when this bound is concave. For notational simplicity we have assumed a common learning rate sequence for μ and C; however, in practice we can use different sequences and the algorithm remains valid.

We will refer to the above stochastic approximation algorithm as doubly stochastic variational inference (DSVI), because it introduces stochasticity in a different direction than the standard stochastic variational inference proposed by (Hoffman et al., 2010; 2013). The latter is based on subsampling the training data and performing online parameter updates by using each time a single data point or a small mini-batch, which is analogous to other online learning algorithms (Bottou, 1998; Bottou & Bousquet, 2008). Instead, our algorithm introduces stochasticity by sampling from the variational distribution. Notice that the latter type of stochasticity was first introduced by (Paisley et al., 2012), who proposed a different stochastic gradient for variational parameters that we compare against in Section 2.2. For joint probability models with a factorised likelihood, the two types of stochasticity can be combined so that in each iteration the stochastic gradients are computed by both sampling from the variational distribution and using a mini-batch of n ≪ N data points. It is straightforward to see that such doubly stochastic gradients are unbiased estimates of the true gradients and therefore the whole scheme is valid. In the experiments, we demonstrate the double stochasticity for learning from very large datasets in Bayesian logistic regression. However, for simplicity, in the remainder of our presentation we will not analyse the mini-batch type of stochasticity further.

Finally, for inference problems where the dimensionality of θ is very large, and it is therefore impractical to optimise over a full scale matrix C, we can consider a diagonal matrix, in which case the scales can be stored in a D-dimensional strictly positive vector c. Then, the update over C in Algorithm 1 is replaced by

  c_d^(t) = c_d^(t−1) + ρ_t ( ∂ log g(θ^(t−1))/∂θ_d · z_d + 1/c_d^(t−1) ),   (9)

where d = 1, …, D. Notice that when the initial standard distribution π(z) is fully factorised, the above leads to a fully factorised variational approximation q(θ|μ, c) = Π_{d=1}^D q_d(θ_d|μ_d, c_d). While such an approach can have lower accuracy than using the full scale matrix C, it has the great advantage that it can scale up to thousands or millions of parameters. In Section 3.2, we use this scheme for variable selection in logistic regression and we also introduce a novel variational objective function for sparse inference following the idea of automatic relevance determination.

In the following two sections we elaborate more on the properties of DSVI by drawing connections with the Gaussian approximation (Section 2.1) and analysing convergence properties (Section 2.2).

2.1. Connection with the Gaussian approximation

The Gaussian approximation, see (Barber & Bishop, 1998; Seeger, 1999) and more recently (Opper & Archambeau, 2009), assumes a multivariate Gaussian distribution N(θ|μ, Σ) as a variational distribution to approximate the exact model posterior, which leads to the maximisation of the lower bound

  F(μ, Σ) = ∫ N(θ|μ, Σ) log [ g(θ) / N(θ|μ, Σ) ] dθ.   (10)
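As a concrete illustration of Algorithm 1 with π(z) taken to be the standard normal (i.e. stochastic maximisation of the Gaussian approximation bound just described), the following NumPy sketch fits a toy Gaussian target. The quadratic log g(θ) below is our own illustrative choice, not an example from the paper; for a Gaussian target the optimal q is the target itself, so μ should approach m and CCᵀ should approach S.

```python
import numpy as np

def dsvi(grad_log_g, D, iters=20000, rho=0.005, seed=0):
    """Algorithm 1: doubly stochastic variational inference with a
    standard normal pi(z) and a full lower-triangular scale matrix C."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(D)
    C = np.eye(D)                      # lower triangular, positive diagonal
    for t in range(iters):
        z = rng.standard_normal(D)     # z ~ pi(z)
        theta = C @ z + mu             # theta ~ q(theta | mu, C)
        g = grad_log_g(theta)          # gradient of log joint at theta
        mu = mu + rho * g
        # lower-triangular part of the outer product g z^T, plus the
        # term diag(1/C_dd) coming from the log|C| part of the bound
        C = C + rho * (np.tril(np.outer(g, z)) + np.diag(1.0 / np.diag(C)))
    return mu, C

# Toy joint: log g(theta) = log N(theta | m, S) up to a constant.
m = np.array([2.0, 2.0])
S = np.array([[1.0, 0.6], [0.6, 1.0]])
S_inv = np.linalg.inv(S)
mu, C = dsvi(lambda th: S_inv @ (m - th), D=2)
print(mu)        # close to m
print(C @ C.T)   # close to S
```

With a fixed (rather than Robbins-Monro) learning rate, as here, the iterates fluctuate around the optimum with noise of order √ρ; a decaying schedule ρ_t would remove this residual noise.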
The maximisation relies on analytical integration (or, in the worst case, on one-dimensional numerical integration) for computing ∫ N(θ|μ, Σ) log g(θ) dθ, which subsequently allows tuning the variational parameters (μ, Σ) using gradient optimisation methods (Opper & Archambeau, 2009; Honkela et al., 2011). More recently, Challis & Barber (2011; 2013) use this framework with the parametrisation Σ = CCᵀ and show that when log g(θ) is concave, the bound is concave w.r.t. (μ, C). The limitation of these approaches is that they rely on log g(θ) having a simple form, so that the integral ∫ N(θ|μ, Σ) log g(θ) dθ is analytically tractable. Unfortunately, this constraint excludes many interesting Bayesian inference problems, such as inference over kernel hyperparameters in Gaussian process models, inference over weights in neural networks, and others.

In contrast, our stochastic variational framework only relies on log g(θ) being a differentiable function of θ. If we specify the distribution π(z) to be the standard normal N(z|0, I), the lower bound in (4) becomes the Gaussian approximation bound in (10) with the parametrisation Σ = CCᵀ. Subsequently, if we apply the DSVI iteration according to Algorithm 1 with the specialisation that z is drawn from N(z|0, I), the algorithm will stochastically maximise the Gaussian approximation bound. Therefore, DSVI allows the Gaussian approximation to be applied to a much wider range of models.

A different direction of flexibility in the DSVI framework is concerned with the choice of the standard distribution π(z). Clearly, if we choose a non-Gaussian form we obtain non-Gaussian variational approximations. For instance, when this distribution is the standard t with ν degrees of freedom, i.e. π(z) = St(z; ν, 0, I), the variational distribution q(θ|μ, C) becomes the general t distribution with ν degrees of freedom, i.e. q(θ|μ, C) = St(θ; ν, μ, CCᵀ). A flexible way to define a standard distribution is to assume a fully factorised form, π(z) = Π_{d=1}^D π_d(z_d), and then select the univariate marginals π_d(z), d = 1, …, D, from a family of univariate distributions for which exact simulation is possible. While in such cases the resulting q(θ|μ, C) can be of non-standard form, simulating exact samples from this distribution is always straightforward, since θ^(s) = Cz + μ, z ∼ π(z), is by construction an independent sample from q(θ|μ, C). In the current experimental study presented in Section 3 we only consider the DSVI algorithm for stochastic maximisation of the Gaussian approximation bound. We defer the experimentation with other forms of the π(z) distribution to future work.

2.2. Illustrative convergence analysis

In this section, we informally analyse the convergence behaviour of DSVI, i.e., its ability to locate a local maximum of the variational lower bound. We will use an illustrative example where g(θ) is proportional to a multivariate Gaussian, and we will compare our method with an alternative doubly stochastic approximation approach proposed in (Paisley et al., 2012). For simplicity, in the following we will be using the notation f(θ) = log g(θ).

Figure 1. The evolution of the lower bound (optimal value is zero) obtained by the two stochastic approximation methods employing two alternative stochastic gradients for fitting a 10-dimensional Gaussian distribution N(θ|m, I), where m was set to the vector of twos. The variational mean was initialised to the zero vector. For each method two realisations are shown (one with small and one with large learning rate). Blue solid lines correspond to DSVI, while green and red lines to the alternative algorithm.

Recall eq. (7), which gives the gradient over the variational mean and which we repeat here for convenience,

  ∫ N(θ|μ, Σ) ∇_θ f(θ) dθ,   (11)

where we have also specified the variational distribution q(θ|μ, C) to be the Gaussian N(θ|μ, Σ) with Σ = CCᵀ. Based on the above, the implied single-sample stochastic approximation of the gradient is ∇_{θ^(s)} f(θ^(s)), where θ^(s) ∼ N(θ|μ, Σ), which is precisely what DSVI uses. The question that arises now is whether there exists an alternative way to write the exact gradient over μ that can give rise to a different stochastic gradient and, more importantly, how the different stochastic gradients compare with one another in terms of convergence. It turns out that an alternative expression for the gradient over μ is obtained by directly differentiating the initial bound in (10), which gives

  ∫ N(θ|μ, Σ) f(θ) Σ⁻¹(θ − μ) dθ.   (12)

This form can also be obtained by the general method in (Paisley et al., 2012), according to which the gradient over some variational parameter ξ in a variational distribution q(θ|ξ) is computed based on ∫ f(θ) q(θ|ξ) ∇_ξ[log q(θ|ξ)] dθ, which in the case ξ = μ and q(θ|ξ) = N(θ|μ, Σ) gives ∇_μ log q(θ|μ, Σ) = Σ⁻¹(θ − μ) and thus reduces exactly to (12).
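The contrast between the two single-sample gradients can be checked numerically. The sketch below is our own illustrative code, not from the paper: it uses q = N(μ, I) with a simple quadratic f, draws many single-sample estimates of both forms, and compares their empirical variances (both are unbiased for the exact gradient, but the score-function form has much larger variance).

```python
import numpy as np

# Quadratic target f(theta) = -0.5 ||theta - m||^2, so grad f = m - theta,
# and q = N(mu, I); dimensions and values are illustrative.
rng = np.random.default_rng(1)
D = 10
m = 2.0 * np.ones(D)
mu = np.zeros(D)

S = 10000
thetas = mu + rng.standard_normal((S, D))        # theta^(s) ~ N(mu, I)

# (i) DSVI / eq. (11): single-sample estimate is grad f(theta^(s)).
g_dsvi = m - thetas

# (ii) Score-function form / eq. (12): f(theta^(s)) * Sigma^{-1} (theta^(s) - mu).
f = -0.5 * np.sum((thetas - m) ** 2, axis=1)
g_score = f[:, None] * (thetas - mu)

# Both sample means approximate the exact gradient m - mu (a vector of twos),
# but the per-sample variances differ dramatically.
print(g_dsvi.mean(axis=0))
print(g_score.mean(axis=0))
print(g_dsvi.var(axis=0).mean(), g_score.var(axis=0).mean())
```

The unbiasedness of (ii) follows from Stein's identity, E[f(θ)(θ − μ)] = E[∇f(θ)] for θ ∼ N(μ, I), while its extra variance grows with the magnitude of f, which is why variance-reduction tricks (control variates, Rao-Blackwellisation) are needed to make it practical.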
Table 1. Size and number of features of each cancer dataset.

  Dataset    #Train  #Test  D
  Colon      42      20     2,000
  Leukemia   38      34     7,129
  Breast     38      4      7,129

Table 2. Train and test errors for the three cancer datasets and for each method: CONCAV is the original DSVI algorithm with a fixed prior, whereas ARD is the feature-selection version.

  Problem             Train Error  Test Error
  Colon (ARD)         0/42         1/20
  Colon (CONCAV)      0/42         0/20
  Leukemia (ARD)      0/38         3/34
  Leukemia (CONCAV)   0/38         12/34
  Breast (ARD)        0/38         2/4
  Breast (CONCAV)     0/38         0/4

Table 3. Size and sparsity level of each large-scale dataset.

  Dataset  #Train   #Test    D       #Nonzeros
  a9a      32,561   16,281   123     451,592
  rcv1     20,242   677,399  47,236  49,556,258
  Epsilon  400,000  100,000  2,000   800,000,000

Table 4. Test error rates for DSVI-ARD and ℓ1-logistic regression on three large-scale datasets.

  Dataset  DSVI-ARD  ℓ1-Log. Reg.  λ
  a9a      0.1507    0.1500        2
  rcv1     0.0414    0.0420        4
  Epsilon  0.1014    0.1011        0.5

Table 5. Performance measures of GP regression where hyperparameters are selected by ML-II, DSVI or MCMC.

  Dataset             ML-II     DSVI      MCMC
  Boston    (smse)    0.0743    0.0709    0.0699
            (nlpd)    0.1783    0.1425    0.1317
  Bodyfat   (smse)    0.1992    0.0726    0.0726
            (nlpd)    -0.1284   -2.0750   -2.0746
  Pendulum  (smse)    0.2727    0.2807    0.2801
            (nlpd)    0.4537    0.4465    0.4462
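The mini-batch form of the doubly stochastic gradient described in Section 2, of the kind used for the large-scale Bayesian logistic regression experiments below, can be sketched as follows. This is a minimal illustration with our own toy data and a standard normal prior; for brevity it only updates the variational mean, holding the scales c fixed rather than applying the full update (9).

```python
import numpy as np

def doubly_stochastic_grad(w, X, y, rng, batch=50):
    """Unbiased mini-batch estimate of grad_w log g(w) for Bayesian
    logistic regression with a N(0, I) prior: the likelihood term is
    rescaled by N/n so its expectation equals the full-data gradient."""
    N = X.shape[0]
    idx = rng.choice(N, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]                      # labels y in {-1, +1}
    u = yb * (Xb @ w)
    s = np.exp(-np.logaddexp(0.0, u))            # sigmoid(-u), numerically stable
    grad_lik = (N / batch) * (Xb * (yb * s)[:, None]).sum(axis=0)
    return grad_lik - w                          # -w is the prior gradient

# Toy data (our own): labels from a noisy linear rule.
rng = np.random.default_rng(0)
N, D = 500, 5
w_true = rng.standard_normal(D)
X = rng.standard_normal((N, D))
y = np.where(X @ w_true + 0.1 * rng.standard_normal(N) > 0, 1.0, -1.0)

# DSVI update for the variational mean, sampling w from a factorised q
# with fixed scales c = 0.1; each step is doubly stochastic (sample + mini-batch).
mu, c, rho = np.zeros(D), 0.1, 5e-4
for t in range(2000):
    z = rng.standard_normal(D)
    w = c * z + mu                               # w ~ q(w | mu, c)
    mu = mu + rho * doubly_stochastic_grad(w, X, y, rng)
print(np.corrcoef(mu, w_true)[0, 1])             # mu aligns with w_true
```

Because the mini-batch index set is drawn independently of z, the product of the two sources of randomness still yields an unbiased gradient, which is what makes the combined scheme valid.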
3.3. Large-scale datasets

In order to demonstrate the scalability of the proposed method, we run it on three well-known large-scale binary classification datasets, a9a, rcv1, and Epsilon, whose details are listed in Table 3. Dataset a9a is derived from Adult in the UCI repository, rcv1 is an archive of manually categorised news stories from Reuters (we use the original train/test split), and Epsilon is an artificial dataset from PASCAL's large-scale learning challenge 2008.

We again use the Bayesian logistic regression model with variable selection and we applied the DSVI-ARD algorithm described previously. For all problems, mini-batches of size 500 are used, so this process never requires the whole dataset to be loaded in memory. We contrast our results with standard ℓ1-logistic regression, which exactly minimises the convex functional L(w) = λ||w||₁ − Σ_{n=1}^N log s(y_n x_nᵀ w). Both methods are run on the exact same splits. The value of λ was selected using 5-fold cross-validation. Results are reported in Table 4 and show the compromises made between both approaches. The proposed approach scales well to very large datasets, but it does not outperform ℓ1-logistic regression in these examples. This is expected, since the number of data points is so high that there is little benefit from using a Bayesian approach here. Note, however, the slight advantage obtained for rcv1, where there is a huge number of dimensions. Another benefit of DSVI-ARD is its low memory requirements (we needed a 32 GB RAM computer to run the ℓ1-logistic regression, whereas a 4 GB one was enough for DSVI-ARD). In contrast, logistic regression was more than 100 times faster in achieving convergence (using the highly optimised LIBLINEAR software).

3.4. Gaussian process hyperparameters

Gaussian processes (GPs) are non-parametric Bayesian models widely used to solve regression tasks. In a typical setting, a regression dataset D = {x_n, y_n}_{n=1}^N with x_n ∈ R^D and y_n ∈ R is modelled as y_n = f(x_n) + ε_n, where ε_n ∼ N(0, σ²) and f(x) ∼ GP(0, k(x, x′; λ)), for some kernel hyperparameters λ and noise variance σ². Point estimates for the hyperparameters are typically obtained by optimising the marginal likelihood of the GP using some gradient ascent procedure (Rasmussen & Williams, 2006). Here, we suggest replacing this procedure with stochastic gradient ascent optimisation of the lower bound, which provides a posterior distribution over the hyperparameters. While the stochastic nature of the proposed method will probably imply that more marginal likelihood evaluations are required for convergence, this additional computational cost will make the model more resistant to overfitting and provide a posterior over the hyperparameters at a fraction of the cost of full MCMC.

Using a GP with kernel k(x, x′) = σ_f² exp( −(1/2) Σ_{d=1}^D (x_d − x′_d)²/ℓ_d² ), we place vague independent normal priors over the hyperparameters in log space and compute the posterior and predictive densities for three datasets: Boston, Bodyfat, and Pendulum. Obviously, for this model, no stochasticity over the dataset is used. Boston is a UCI dataset related to housing values in Boston, Bodyfat requires predicting the percentage of body fat from several body measurements, and in Pendulum the change in angular velocity of a simulated mechanical pendulum must be predicted.

Figure 3. Top: Rolling-window average (see supplementary material) of the instantaneous lower bound values. Bottom: Final value of the approximate mean vectors. First column corresponds to Colon, second to Leukemia and third to the Breast dataset.

Figure 4. Marginal variational Gaussian distributions for some hyperparameters in the Boston dataset (shown as dashed red lines). The black solid lines show the ground-truth empirical estimates for these marginals obtained by MCMC.

Figure 4 displays variational posterior marginal distributions for three of the hyperparameters in the Boston housing dataset, together with the corresponding empirical marginals obtained by long MCMC runs. Clearly, the variational marginals match the MCMC estimates very closely; see the supplementary material for a complete set of such figures for all hyperparameters in all three regression datasets. Furthermore, negative log-predictive densities (nlpd) as well as standardised mean square errors (smse) on test data are shown in Table 5 for maximum marginal likelihood model selection (ML-II, the standard for GPs), DSVI and MCMC. As the table shows, ML-II, which is the most widely used method for hyperparameter selection in GPs, overfits the Bodyfat dataset. DSVI and MCMC do not show this problem, yielding much better test performance. To provide an intuition of the computational effort associated with each of these methods, note that in these experiments, on average, ML-II took 40 seconds, DSVI 30 minutes and MCMC 20 hours. Further details on all the above GP regression experiments, including the learning rates used, are given in the supplementary material.

4. Discussion and future work

We have presented a stochastic variational inference algorithm that utilises gradients of the joint probability density and is based on double stochasticity (by both subsampling training data and simulating from the variational density) to deal with non-conjugate models and big datasets. We have shown that the method can be applied to a number of diverse cases, achieving competitive results. Further work should be concerned with speeding up the stochastic approximation algorithm, as well as fitting more complex variational distributions such as mixture models.

Acknowledgments

MKT greatly acknowledges support from Research Funding at AUEB for Excellence and Extroversion, Action 1: 2012-2014. MLG gratefully acknowledges support from Spanish CICYT TIN2011-24533.
References

Barber, D. and Bishop, C. M. Ensemble learning in Bayesian neural networks. In Jordan, M., Kearns, M., and Solla, S. (eds.), Neural Networks and Machine Learning, pp. 215-237, Berlin, 1998.

Bottou, Léon. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.

Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In NIPS, volume 20, pp. 161-168, 2008.

Challis, Edward and Barber, David. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In AISTATS, pp. 199-207, 2011.

Challis, Edward and Barber, David. Gaussian Kullback-Leibler approximate inference. J. Mach. Learn. Res., 14(1):2239-2286, January 2013.

Hoffman, Matthew D., Blei, David M., and Bach, Francis R. Online learning for latent Dirichlet allocation. In NIPS, pp. 856-864, 2010.

Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John William. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235-3268, 2010.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183-233, November 1999.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

Neal, Radford M. and Hinton, Geoffrey E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, Michael I. (ed.), Learning in Graphical Models, pp. 355-368. 1999.

Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.

Paisley, John William, Blei, David M., and Jordan, Michael I. Variational Bayesian inference with stochastic search. In ICML, 2012.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 814-822, 2014.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In The 31st International Conference on Machine Learning (ICML 2014), 2014.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods. Springer-Verlag, 1st edition, August 1999.

Seeger, Matthias. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In NIPS 12, pp. 603-609, 1999.

Shevade, Shirish Krishnaj and Keerthi, S. Sathiya. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246-2253, 2003.

Staines, Joe and Barber, David. Variational optimization. Technical report, 2012.

Tipping, Michael E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.

Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1-305, January 2008.