
1. Approx. Inference via Sampling (Contd): MCMC with Gradients, Recent Advances
CS772A: Probabilistic Machine Learning
Piyush Rai

2. Plan for Today
- Some other aspects of MCMC
- MCMC with gradients
- Some other recent advances

3. Sampling Methods: The Label Switching Issue
- Suppose we are given samples θ^(1), ..., θ^(S) from the posterior. We can't always simply "average" them to get the "posterior mean".
- Why: non-identifiability of latent variables in models whose posterior has multiple equivalent modes; one sample may come from near one of the modes and another sample from near a different mode.
- Example: in clustering via a GMM, the likelihood is invariant to how we label the clusters, so what we call cluster 1 in one sample may be cluster 2 in the next sample. Say the samples (μ1, μ2) = (a, b) and (μ1, μ2) = (b, a) both imply the same posterior value; averaging them gives ((a+b)/2, (a+b)/2), which is incorrect.
- Quantities not affected by permutations of the dimensions of θ can be safely averaged, e.g.:
  - The probability that two points belong to the same cluster (e.g., in a GMM).
  - The predicted mean of an entry in matrix factorization: changes in the order of entries of the latent vectors across different samples don't affect their inner product.
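To make the label-switching point concrete, here is a minimal sketch (the cluster locations -3 and +3 and the nearest-mean assignment rule are my illustrative choices, not from the slides): naively averaging two permuted samples of GMM means gives a nonsensical answer, while a permutation-invariant quantity, the co-clustering indicator for two points, averages safely.

```python
import numpy as np

# Two MCMC samples of cluster means (mu1, mu2) from a 2-component GMM.
# They describe the SAME solution, up to a relabeling of the clusters.
sample_a = np.array([-3.0, 3.0])   # cluster 1 at -3, cluster 2 at +3
sample_b = np.array([3.0, -3.0])   # same solution, labels swapped

# Naive "posterior mean" of the cluster means collapses both clusters to 0.
naive_mean = (sample_a + sample_b) / 2
print(naive_mean)  # [0. 0.] -- incorrect: no cluster actually sits at 0

def same_cluster(x, y, means):
    """1.0 if x and y are assigned (by nearest mean) to the same cluster."""
    zx = np.argmin(np.abs(means - x))
    zy = np.argmin(np.abs(means - y))
    return float(zx == zy)

# A permutation-invariant quantity: do these two points share a cluster?
x, y = -2.9, -3.1  # both points sit near the cluster at -3
cocluster_prob = np.mean([same_cluster(x, y, m) for m in (sample_a, sample_b)])
print(cocluster_prob)  # 1.0 -- unaffected by how labels are permuted
```

Averaging `cocluster_prob` across samples is safe precisely because relabeling the clusters leaves it unchanged.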

4. MCMC: Some Practical Aspects
- The choice of proposal distribution is important. For MH sampling, a Gaussian proposal q(θ*|θ^(t)) = N(θ*|θ^(t), Σ) is popular when θ is continuous; Σ may be changed at each iteration, e.g., set using the Hessian at the MAP of the target distribution. Other options: mixtures of proposal distributions, data-driven or adaptive proposals.
- Autocorrelation: Monte Carlo averaging assumes uncorrelated samples, but MCMC samples are correlated. With ρ_ℓ denoting the autocorrelation function (ACF) at lag ℓ, one can show that when approximating an expectation using S MCMC samples, the effective sample size is S_eff = S / (1 + 2 Σ_{ℓ≥1} ρ_ℓ). Lower autocorrelation is better; the ratio S_eff/S basically measures what fraction of the total samples are uncorrelated, and we want it to be close to 1.
- Multiple chains: run multiple chains and take the union of the generated samples.
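The ACF and effective-sample-size idea can be sketched as follows (the AR(1) test chain, the 0.05 truncation rule, and the function names are my choices): estimate the lag-ℓ autocorrelations of a chain and compute S_eff = S / (1 + 2 Σ ρ_ℓ).

```python
import numpy as np

def acf(chain, lag):
    """Sample autocorrelation of a 1-D chain at a given lag."""
    x = chain - chain.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def effective_sample_size(chain, max_lag=200):
    """S_eff = S / (1 + 2 * sum of positive-lag autocorrelations)."""
    rhos = []
    for lag in range(1, max_lag + 1):
        rho = acf(chain, lag)
        if rho < 0.05:        # truncate the sum once correlation dies out
            break
        rhos.append(rho)
    return len(chain) / (1 + 2 * sum(rhos))

rng = np.random.default_rng(0)
# AR(1) chain: consecutive samples are strongly correlated (coefficient 0.9).
s = np.empty(20000)
s[0] = 0.0
for t in range(1, len(s)):
    s[t] = 0.9 * s[t - 1] + rng.normal()

ess = effective_sample_size(s)
print(ess)  # far fewer "useful" samples than the 20000 raw ones
```

For this chain the ACF decays roughly as 0.9^ℓ, so only a small fraction of the raw samples count as effectively independent.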

5. Coming Up Next
- Avoiding the random-walk behavior of MCMC
- Using gradient information of the posterior
- Scalable MCMC methods

6. Using Gradients in MCMC: Langevin Dynamics
(We will use θ to denote all the unknowns.)
- Standard MCMC uses a random-walk proposal to generate the next sample, e.g., θ* = θ^(t-1) + ε_t with ε_t ~ N(0, η I), and then accepts/rejects (MH).
- Langevin dynamics instead uses the (unnormalized) posterior's gradient in the proposal: θ* = θ^(t-1) + (η/2) ∇ log p̃(θ^(t-1)|D) + ε_t, ε_t ~ N(0, η I), and then accepts/rejects (MH). This is known as the Metropolis-Adjusted Langevin Algorithm (MALA); the step size η is typically set so that the acceptance rate is around 0.6.
- Note that the above is equivalent to θ* = θ^(t-1) + (η/2)[∇ log p(D|θ^(t-1)) + ∇ log p(θ^(t-1))] + ε_t. The likelihood and prior gradients move us towards the mode of the posterior (like finding the MAP estimate), and the injected noise ensures we aren't stuck at the MAP solution. Using gradient information in the proposal helps us move faster towards high-probability regions; automatic differentiation methods can be used to compute the gradients.
- After some waiting (burn-in) period, the iterates are MCMC samples from p(θ|D).
- If the gradient is pre-multiplied by a preconditioner matrix G, we get Simplified Manifold MALA. One option is to base G on the second derivative of the unnormalized posterior, which also helps incorporate the curvature information of the posterior.
Reference: "Bayesian Learning via Stochastic Gradient Langevin Dynamics" by Welling and Teh (2011)
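A minimal MALA sketch for a 1-D standard-normal target (step size, burn-in, and seed are illustrative choices of mine): the proposal takes a gradient step plus Gaussian noise, and the MH test includes the proposal densities in both directions because the proposal is asymmetric.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(theta):        # unnormalized log target: N(0, 1)
    return -0.5 * theta**2

def grad_log_p(theta):
    return -theta

def mala(n_samples, eta=0.5, theta0=3.0):
    """Metropolis-Adjusted Langevin Algorithm on a 1-D target."""
    def log_q(to, frm):  # log density of the Langevin proposal (const dropped)
        mean = frm + 0.5 * eta * grad_log_p(frm)
        return -0.5 * (to - mean)**2 / eta
    theta, out = theta0, []
    for _ in range(n_samples):
        prop = theta + 0.5 * eta * grad_log_p(theta) + np.sqrt(eta) * rng.normal()
        # MH ratio must include proposal densities: the proposal is asymmetric
        log_alpha = (log_p(prop) + log_q(theta, prop)
                     - log_p(theta) - log_q(prop, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        out.append(theta)
    return np.array(out)

samples = mala(20000)[2000:]          # drop burn-in
print(samples.mean(), samples.var())  # close to the target's 0 and 1
```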

7. Langevin Dynamics: A Closer Look
- Is generating MCMC samples really as easy as computing the MAP? Recall the form of the Langevin dynamics update: θ^(t) = θ^(t-1) + (η/2) ∇ log p̃(θ^(t-1)|D) + ε_t, ε_t ~ N(0, η I), and then accept/reject (MH).
- It is equivalent to a discretization of a continuous-time stochastic differential equation (SDE), dθ = (1/2) ∇ log p̃(θ|D) dt + dB(t), where B(t) is Brownian motion whose increments are i.i.d. Gaussian r.v.s. The equilibrium distribution of this SDE is the same as our target posterior; the update above is its discretization.
- Discretization introduces some error, which is corrected by the MH accept/reject step. Note: as the learning rate decreases, the discretization error also decreases and the rejection rate tends to zero.
- Note: the gradient computation requires all the data (thus slow). Solution: use stochastic gradients, giving Stochastic Gradient Langevin Dynamics (SGLD).

8. Stochastic Gradient Langevin Dynamics (SGLD)
- An "online" MCMC method: Langevin dynamics with minibatches used to compute the gradients.
- Given a minibatch of B examples out of N total, the (stochastic) Langevin dynamics update is θ^(t) = θ^(t-1) + (η_t/2)[∇ log p(θ^(t-1)) + (N/B) Σ_{i in minibatch} ∇ log p(x_i|θ^(t-1))] + ε_t, ε_t ~ N(0, η_t I). No accept/reject (MH) step is needed, so it is almost as fast as doing SGD updates.
- The choice of the learning rate is important. For convergence, we need Σ_t η_t = ∞ and Σ_t η_t² < ∞. Switching to a constant learning rate (after a few iterations) often helps convergence.
- As η_t becomes very small, the acceptance probability becomes close to 1, which is why the MH step can be skipped.
- Recent flurry of work on this topic; see "Bayesian Learning via Stochastic Gradient Langevin Dynamics" by Welling and Teh (2011) and follow-up works.
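A minimal SGLD sketch under assumptions of mine (unit-variance Gaussian likelihood with unknown mean, an N(0, 10) prior, and a polynomially decaying step size): the minibatch gradient is rescaled by N/B, noise with variance η_t is injected, and no MH step is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: N observations from N(true_mu, 1); we sample the posterior over mu.
true_mu, N, B = 2.0, 10000, 100
data = rng.normal(true_mu, 1.0, size=N)

def sgld(n_iters=5000, a=1e-4, b=1.0, gamma=0.55):
    """SGLD with decaying step size eta_t = a * (b + t)^(-gamma)."""
    mu, samples = 0.0, []
    for t in range(n_iters):
        eta = a * (b + t) ** (-gamma)
        batch = rng.choice(data, size=B, replace=False)
        grad_prior = -mu / 10.0                  # from the N(0, 10) prior on mu
        grad_lik = (N / B) * np.sum(batch - mu)  # rescaled minibatch gradient
        # gradient step + injected Gaussian noise with variance eta
        mu += 0.5 * eta * (grad_prior + grad_lik) + np.sqrt(eta) * rng.normal()
        samples.append(mu)
    return np.array(samples)

samples = sgld()[1000:]  # drop burn-in
print(samples.mean())    # close to the posterior mean (near the data mean)
```

The step-size schedule satisfies the Σ η_t = ∞, Σ η_t² < ∞ conditions from the slide.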

9. Improvements to SGLD
The basic SGLD, although fairly simple, has several limitations, e.g.:
- It exhibits slow convergence and mixing, and uses the same learning rate in all dimensions of θ.
- It doesn't apply to models where θ is constrained (e.g., non-negative, or a probability vector).
- It needs the model to be differentiable (since it needs the gradient of the log posterior).
A lot of recent work has improved the basic SGLD to handle such limitations:
- Introducing curvature information in the gradients, e.g., Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring (Ahn et al., 2012) and Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks (Li et al., 2016). These methods use a preconditioner matrix in the learning rate to improve convergence; this also allows different amounts of update in different dimensions.
- SGLD in Riemannian space to handle constrained variables, e.g., Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex (Patterson and Teh, 2013), based on reparametrizing the constrained variables to make them unconstrained.
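The preconditioning idea can be sketched as follows (the 2-D toy target and the choice of preconditioner are mine; practical methods such as Li et al.'s estimate the preconditioner from gradient statistics): a diagonal matrix G rescales both the gradient step and the injected noise per dimension, so badly scaled dimensions mix at comparable rates. The MH step is omitted, as in SGLD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: independent Gaussian with very different scales per dimension,
# p(theta) = N(0, diag(100, 0.01)); a single step size serves both badly.
var = np.array([100.0, 0.01])

def grad_log_p(theta):
    return -theta / var

def preconditioned_langevin(n_iters=20000, eta=0.05):
    """Langevin updates with a diagonal preconditioner G. The same G scales
    both the gradient step and the injected noise (hypothetical choice of G:
    the target's per-dimension variance, i.e., inverse curvature)."""
    G = var
    theta = np.zeros(2)
    out = np.empty((n_iters, 2))
    for t in range(n_iters):
        noise = rng.normal(size=2) * np.sqrt(eta * G)
        theta = theta + 0.5 * eta * G * grad_log_p(theta) + noise
        out[t] = theta
    return out

samples = preconditioned_langevin()[2000:]
print(samples.var(axis=0))  # roughly recovers (100, 0.01) in both dimensions
```

Without G, a step size small enough for the 0.01-variance dimension would make the 100-variance dimension mix extremely slowly.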

10. Applications of SGLD
- Popular for Bayesian neural networks and other complex Bayesian models.
- Reason: SGLD = backprop-based updates + Gaussian noise.

11. Other Recent "SGD-inspired" Sampling Algorithms
- Run SGD and use the SGD iterates to construct a Gaussian approximation of the posterior.
- Recently, Maddox et al. (2019) proposed such an idea using stochastic weight averaging (SWA); the approach is known as SWA-Gaussian (SWAG) ("A Simple Baseline for Bayesian Uncertainty in Deep Learning", Maddox et al., 2019). If we want a full covariance, we can use a low-rank approximation of the covariance matrix (see Maddox et al. for details).
- Reason it works: SGD is asymptotically Normal under certain conditions.
- For a more detailed theory connecting SGD and MCMC, you may also refer to this very nice paper: Stochastic Gradient Descent as Approximate Bayesian Inference (Mandt et al., 2017).
- Such algorithms can give not-too-accurate but very fast posterior approximations for complex models.
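A diagonal-covariance version of the SWAG idea can be sketched as follows (the quadratic toy objective and all hyperparameters are my stand-ins): run constant-step SGD, track running first and second moments of the iterates, and use N(mean, diag(second moment minus squared mean)) as the posterior approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(w, noise_scale=0.5):
    """Stochastic gradient of the toy loss 0.5 * ||w - 1||^2 (the noise
    mimics minibatch variability)."""
    return (w - 1.0) + noise_scale * rng.normal(size=w.shape)

def swag_diagonal(n_iters=5000, lr=0.1, collect_from=1000):
    """Constant-step SGD; fit a diagonal Gaussian to the later iterates."""
    w = np.zeros(2)
    m1 = np.zeros(2)   # running mean of iterates (the SWA mean)
    m2 = np.zeros(2)   # running mean of squared iterates
    n = 0
    for t in range(n_iters):
        w = w - lr * grad_loss(w)
        if t >= collect_from:
            n += 1
            m1 += (w - m1) / n
            m2 += (w**2 - m2) / n
    mean, var = m1, np.maximum(m2 - m1**2, 1e-12)
    return mean, var   # posterior approximation: N(mean, diag(var))

mean, var = swag_diagonal()
print(mean, var)  # mean near the optimum (1, 1); var from SGD's noise
# Drawing approximate posterior samples is then trivial:
posterior_samples = mean + np.sqrt(var) * rng.normal(size=(100, 2))
```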

12. Hamiltonian/Hybrid Monte Carlo (HMC)
- HMC (Neal, 1996) is an "auxiliary variable sampler" that incorporates gradient information, using the idea of simulating the Hamiltonian dynamics of a physical system.
- Consider the target posterior written as p(θ|D) ∝ exp(-U(θ)). Think of θ as the "position"; then U(θ) = -log p̃(θ|D) is like the potential energy.
- Let's introduce an auxiliary variable r, the momentum of the system, with kinetic energy K(r). We can now define a joint distribution over the position and momentum as p(θ, r) ∝ exp(-U(θ) - K(r)).
- The total energy (potential + kinetic), or the Hamiltonian of the system, is H(θ, r) = U(θ) + K(r); under Hamiltonian dynamics it is constant w.r.t. time.
- Given a sample (θ, r) from p(θ, r), ignoring r, θ will be a sample from p(θ|D).

13. Generating Samples in HMC
- Given an initial (θ, r), Hamiltonian dynamics defines how (θ, r) changes w.r.t. time via the equations dθ/dt = ∂H/∂r and dr/dt = -∂H/∂θ.
- Getting analytical solutions of these equations requires integrals that are in general intractable, so we use them to update (θ, r) by discretizing time.
- For s = 1, ..., S, sample θ^(s) as follows:
  - Initialize the momentum r (e.g., r ~ N(0, I)).
  - Do L "leapfrog" steps, with suitable step sizes for θ and r, for ℓ = 1, ..., L.
  - Perform an MH accept/reject test on the resulting (θ, r); if accepted, set θ^(s) to the new θ.
- A single sample is thus generated by taking L steps; L is usually set to 5, and the step size is tuned to make the acceptance rate around 90%.
- The momentum forces the sampler to explore different regions instead of getting driven to the region where the MAP solution is.
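The leapfrog recipe above can be sketched for a 1-D standard-normal target (L = 5 as in the slide; the step size, burn-in, and seed are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def U(theta):        # potential energy = negative log (unnormalized) posterior
    return 0.5 * theta**2

def grad_U(theta):
    return theta

def hmc(n_samples, L=5, eta=0.3, theta0=3.0):
    """HMC with standard-normal momentum, i.e., kinetic energy K(r) = r^2/2."""
    theta, out = theta0, []
    for _ in range(n_samples):
        r = rng.normal()                 # resample the momentum
        th, rr = theta, r
        for _ in range(L):               # L leapfrog steps
            rr -= 0.5 * eta * grad_U(th)
            th += eta * rr
            rr -= 0.5 * eta * grad_U(th)
        # MH test on the change in total energy H = U + K
        dH = (U(th) + 0.5 * rr**2) - (U(theta) + 0.5 * r**2)
        if np.log(rng.uniform()) < -dH:
            theta = th
        out.append(theta)
    return np.array(out)

samples = hmc(10000)[1000:]           # drop burn-in
print(samples.mean(), samples.var())  # close to the target's 0 and 1
```

Because the leapfrog integrator nearly conserves H, dH is small and the acceptance rate is very high.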

14. HMC in Practice
- HMC typically has a very low rejection rate (and that too primarily due to discretization error).
- Performance can be sensitive to L (the number of leapfrog steps) and the step sizes, and tuning them is hard. There has been a lot of renewed interest in HMC; you may check out NUTS, the No-U-Turn Sampler, which doesn't require setting L.
- Probabilistic programming packages, e.g., TensorFlow Probability, Stan, etc., contain implementations of HMC.
- Can also do HMC on minibatches (Stochastic Gradient HMC; Chen et al., 2014).
- [Figure: an illustration of SGHMC vs other methods on MNIST classification with a Bayesian neural network.]

15. Parallel/Distributed MCMC
- Suppose our goal is to compute the posterior p(θ|D), where the dataset D is very large.
- Suppose we have M machines, with the data partitioned as D = {D_1, ..., D_M}.
- Let's assume the posterior factorizes as p(θ|D) ∝ Π_{m=1}^M p(θ|D_m). Here p(θ|D_m) is known as the m-th "subset posterior".
- Assume the m-th machine generates MCMC samples from its subset posterior p(θ|D_m).
- We need a way to combine these subset posteriors using a "consensus".

16. Parallel/Distributed MCMC (Contd)
- There are many ways to compute the consensus samples; let's look at two of them.
- Approach 1 (weighted average): combine the per-machine samples as θ^(s) = Σ_{m=1}^M W_m θ_m^(s), where the weights W_m can be learned; assuming a Gaussian likelihood and a Gaussian prior, the optimal weights have a closed form.
- Approach 2: fit a Gaussian to the samples of each subset posterior p(θ|D_m) and take the product of these M Gaussians (a product of Gaussians is again a Gaussian).
- For detailed proofs and other approaches, you may refer to Patterns of Scalable Bayesian Inference (Angelino et al., 2016).
- These approaches can also be used to make VI parallel/distributed.
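Both consensus rules can be sketched on a 1-D toy problem (the subset-posterior means/variances and the inverse-variance weighting are my illustrative choices): combine per-machine draws by weighted averaging, or fit a Gaussian per machine and multiply the fits; for Gaussian subset posteriors the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated subset-posterior samples from M = 3 machines (1-D parameter).
# Each subset posterior is Gaussian with its own mean and variance.
sub_means = np.array([1.8, 2.1, 2.4])
sub_vars = np.array([0.30, 0.20, 0.25])
S = 50000
subset_samples = rng.normal(sub_means, np.sqrt(sub_vars), size=(S, 3))

# Approach 1: weighted average of per-machine draws, inverse-variance weights
W = 1.0 / sub_vars
consensus = subset_samples @ W / W.sum()   # one consensus sample per row

# Approach 2: fit a Gaussian per machine, then take the product of Gaussians
prec = 1.0 / subset_samples.var(axis=0)    # fitted precisions
mu = subset_samples.mean(axis=0)           # fitted means
prod_var = 1.0 / prec.sum()                # precisions add in a product
prod_mean = prod_var * (prec * mu).sum()   # precision-weighted mean

print(consensus.mean(), consensus.var())   # matches the product Gaussian
print(prod_mean, prod_var)
```

With inverse-variance weights, the weighted-average samples have exactly the product-Gaussian mean and variance, which is why the two approaches coincide in this Gaussian setting.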

17. Approximate Inference: VI vs Sampling
- VI approximates a posterior distribution p(θ|D) by another distribution q(θ); sampling uses samples θ^(1), ..., θ^(S) to approximate p(θ|D).
- Sampling can also be used within VI (e.g., Monte-Carlo approximation of the ELBO).
- In terms of a "comparison" between VI and sampling, a few things to note:
  - Convergence: VI only has local convergence; sampling (in theory) can give the exact posterior.
  - Storage: a sampling-based approximation needs to store all the samples; VI only needs the variational parameters.
  - Prediction cost: with sampling, the posterior predictive (PPD) always requires Monte-Carlo averaging, p(y*|x*, D) ≈ (1/S) Σ_s p(y*|x*, θ^(s)); with VI, the PPD is ∫ p(y*|x*, θ) q(θ) dθ, which sometimes has a closed form.
- There is some work on "compressing" sampling-based approximations into something more compact, e.g., "Compact approximations to Bayesian predictive distributions" (Snelson and Ghahramani, 2005) and "Bayesian Dark Knowledge" (Korattikara et al., 2015).
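The sampling-based PPD can be contrasted with a closed form on a conjugate toy model where both are available (the numbers are mine): for x ~ N(μ, 1) with prior μ ~ N(0, 1) and one observation x = 2, the exact posterior is N(1, 0.5) and the exact PPD is N(1, 1.5); Monte-Carlo averaging over posterior samples recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate model: x ~ N(mu, 1), mu ~ N(0, 1), one observation x = 2.
# Exact posterior: mu | x ~ N(1, 0.5). Exact PPD: x* | x ~ N(1, 1.5)
# (predictive variance = posterior variance + observation-noise variance).
post_mean, post_var = 1.0, 0.5

# Sampling-based PPD: draw mu from the posterior, then x* given each mu,
# which is exactly Monte-Carlo averaging of p(x*|mu) over posterior samples.
mu_samples = rng.normal(post_mean, np.sqrt(post_var), size=100000)
xstar_draws = rng.normal(mu_samples, 1.0)

print(xstar_draws.mean(), xstar_draws.var())  # close to (1.0, 1.5)
```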

18. Inference Methods: Summary
- MLE/MAP: straightforward for differentiable models (can even use automatic differentiation).
- Conjugate models with one "main" parameter: straightforward posterior updates.
- MLE-II/MAP-II: often useful for estimating the hyperparameters.
- EM: when we want to do MLE/MAP for models with latent variables. A very general algorithm that can also be made online; used when we want point estimates for some unknowns and a posterior over the others; can be used for hyperparameter estimation as well; often better than using direct gradient methods.
- VI and sampling methods can be used to get the full posterior for complex models. Quite easy if we have local conjugacy (VI has closed-form updates, and a Gibbs sampler is easy to derive). In other cases, we have general VI with Monte-Carlo gradients, or MH sampling. MCMC can also make use of gradient information (LD/SGLD).
- For large-scale problems: online/distributed VI/MCMC, or SGD-based posterior approximations.