Bayesian Optimization (BO)
Javad Azimi
Fall 2010
http://web.engr.oregonstate.edu/~azimi/
Slide 2: Outline
Formal Definition
Application
Bayesian Optimization Steps
Surrogate Function (Gaussian Process)
Acquisition Function
PMAX
IEMAX
MPI
MEI
UCB
GPHedge
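Slide 3: Formal Definition
Input: an unknown function f that is expensive to evaluate.
Goal: find the maximizer of f using as few evaluations as possible.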
Slide 4: Fuel Cell Application
[Figure: This is how an MFC (microbial fuel cell) works. Labels: anode, cathode, bacteria, fuel (organic matter), oxidation products (CO2), e-, O2, H2O, H+.]
[Figure: SEM image of bacteria sp. on Ni nanoparticle enhanced carbon fibers.]
The nanostructure of the anode significantly impacts electricity production.
We should optimize the anode nanostructure to maximize power by selecting a set of experiments.
Slide 5: Big Picture
Because running an experiment is very expensive, we use BO.
Select one experiment to run at a time, based on the results of previous experiments.
[Figure: the BO loop. Current Experiments -> Our Current Model -> Select Single Experiment -> Run Experiment -> back to Current Experiments.]
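The loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `run_experiment` is a hypothetical stand-in for the expensive experiment, the model is a zero-mean, noise-free Gaussian process with a squared exponential kernel, and the next experiment is chosen by an upper confidence bound with a user-chosen `k`. None of these choices come from the slides.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-8):
    """Posterior mean and standard deviation of a zero-mean, noise-free GP."""
    K = sq_exp_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = sq_exp_kernel(x_new, x_obs)
    weights = np.linalg.solve(K, K_s.T).T          # rows equal K_s K^{-1}
    mean = weights @ y_obs
    var = np.maximum(1.0 - np.sum(weights * K_s, axis=1), 0.0)
    return mean, np.sqrt(var)

def run_experiment(x):
    """Hypothetical stand-in for one expensive experiment (assumption)."""
    return float(np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-(x + 1.0) ** 2))

def bayesian_optimization(candidates, n_iterations=10, k=2.0):
    """Run one experiment at a time, each chosen from the current model."""
    chosen = [0]                                   # start from the first candidate
    y_obs = [run_experiment(candidates[0])]
    for _ in range(n_iterations):
        # Our current model, built from the current experiments.
        mean, std = gp_posterior(candidates[chosen], np.array(y_obs), candidates)
        # Select a single experiment (never repeat one already run).
        score = mean + k * std
        score[chosen] = -np.inf
        nxt = int(np.argmax(score))
        # Run the experiment and add it to the current experiments.
        chosen.append(nxt)
        y_obs.append(run_experiment(candidates[nxt]))
    return candidates[chosen], np.array(y_obs)

candidates = np.linspace(-4.0, 4.0, 81)
x_history, y_history = bayesian_optimization(candidates)
best_x = x_history[np.argmax(y_history)]
best_y = y_history.max()
```

Each pass through the loop costs one call to `run_experiment`, so the total number of experiments is `n_iterations + 1`.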
Slide 6: BO Main Steps
Surrogate Function (Response Surface, Model)
Makes a posterior over unobserved points based on the prior.
Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
Acquisition Criterion (Function)
Decides which sample should be selected next.
Slide 7: Surrogate Function
Simulates the distribution of the unknown function based on the prior.
Deterministic (classical linear regression, ...)
There is a deterministic prediction for each point x in the input space.
Stochastic (Bayesian regression, Gaussian process, ...)
There is a distribution over the prediction for each point x in the input space (e.g., a normal distribution).
Example
Deterministic: f(x1) = y1, f(x2) = y2
Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
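Slide 8: Gaussian Process (GP)
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
This requires a consistency requirement, also called the marginalization property.
Marginalization property: if the GP specifies $(y_1, y_2) \sim \mathcal{N}(\mu, \Sigma)$, then it must also specify $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the corresponding submatrix of $\Sigma$.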
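Slide 9: Gaussian Process (GP)
Formal prediction: for an unobserved point $z$, given observed inputs $X$ with values $y$ (assuming a zero prior mean and noise-free observations), $f(z) \mid X, y \sim \mathcal{N}\left(k_z^\top K^{-1} y,\; k(z, z) - k_z^\top K^{-1} k_z\right)$, where $K_{ij} = k(x_i, x_j)$ and $(k_z)_i = k(z, x_i)$.
Interesting points:
The squared exponential covariance function corresponds to Bayesian linear regression with an infinite number of basis functions.
The variance is independent of the observed values; it depends only on the input locations.
The mean is a linear combination of the observations.
If the covariance function specifies the entries of the covariance matrix, marginalization is satisfied!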
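A minimal sketch of the GP prediction, assuming a zero prior mean, noise-free observations, and a squared exponential kernel with unit signal variance, so that k(z, z) = 1. The function names and the toy data are illustrative assumptions. The code makes two of the points above visible: the mean is `weights @ y_obs`, a linear combination of the observations, and the variance never touches `y_obs`.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, length_scale=1.0, jitter=1e-10):
    """Noise-free GP posterior mean and variance at x_new (zero prior mean)."""
    K = sq_exp_kernel(x_obs, x_obs, length_scale) + jitter * np.eye(len(x_obs))
    K_s = sq_exp_kernel(x_new, x_obs, length_scale)
    # Rows of `weights` equal K_s K^{-1}; they depend only on input locations.
    weights = np.linalg.solve(K, K_s.T).T
    mean = weights @ y_obs                      # linear combination of observations
    var = 1.0 - np.sum(weights * K_s, axis=1)   # k(z, z) = 1 for this kernel
    return mean, np.maximum(var, 0.0)

# Toy data (assumption, not from the slides).
x_obs = np.array([0.0, 1.0, 3.0])
y_obs = np.array([0.5, 1.2, -0.3])
x_new = np.array([0.0, 0.5, 2.0, 6.0])
mean, var = gp_posterior(x_obs, y_obs, x_new)
```

At `x_new[0]`, which coincides with an observed input, the mean reproduces the observed value and the variance is essentially zero; at `x_new[3]`, far from all data, the variance returns close to the prior value of 1.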
Slide 10: Gaussian Process (GP)
A Gaussian process is:
An exact interpolating regression method.
It predicts the training data perfectly (not true in classical regression).
A natural generalization of linear regression.
A nonlinear regression approach!
A simple example of a GP can be obtained from Bayesian regression.
The results are identical.
It specifies a distribution over functions.
Slide 11: Gaussian Process (2): Distribution over Functions
[Figure: 95% confidence interval for each point x, with three sampled functions.]
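A picture like this one can be reproduced by drawing functions from a GP prior. The sketch below assumes a zero-mean prior with a squared exponential kernel on a hypothetical grid of 50 points; the seed, grid, and jitter value are arbitrary choices, not taken from the slides.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 50)
K = sq_exp_kernel(x, x)

# A small jitter keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Each row is one function sampled from the prior, evaluated on the grid.
samples = (L @ rng.standard_normal((len(x), 3))).T

# Pointwise 95% interval of the prior: mean 0, plus or minus 1.96 std.
std = np.sqrt(np.diag(K))
lower, upper = -1.96 * std, 1.96 * std
```

Plotting each row of `samples` against `x`, together with the band between `lower` and `upper`, gives the three curves and the shaded interval.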
Slide 12: Gaussian Process (2): GP vs. Bayesian Regression
Bayesian regression:
Distribution over weights
The prior is defined over the weights.
Gaussian process:
Distribution over functions
The prior is defined over the function space.
These are the same, but seen from different views.
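The equivalence of the two views can be checked numerically. The sketch below assumes a hypothetical basis phi(x) = [1, x, x^2], a prior w ~ N(0, I) over the weights, and Gaussian observation noise with variance `noise_var`. The function-space view uses the kernel k(x, x') = phi(x)^T phi(x') induced by that prior. All names and data are illustrative.

```python
import numpy as np

def features(x):
    """Hypothetical basis functions: phi(x) = [1, x, x^2]."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

def weight_space_predict(x_obs, y_obs, x_new, noise_var):
    """Bayesian regression: prior over weights, posterior over weights."""
    Phi, Phi_s = features(x_obs), features(x_new)
    A = Phi.T @ Phi / noise_var + np.eye(Phi.shape[1])  # posterior precision
    w_mean = np.linalg.solve(A, Phi.T @ y_obs) / noise_var
    mean = Phi_s @ w_mean
    var = np.sum(Phi_s * np.linalg.solve(A, Phi_s.T).T, axis=1)
    return mean, var

def function_space_predict(x_obs, y_obs, x_new, noise_var):
    """GP regression with the kernel induced by the same weight prior."""
    Phi, Phi_s = features(x_obs), features(x_new)
    K = Phi @ Phi.T + noise_var * np.eye(len(x_obs))
    K_s = Phi_s @ Phi.T
    mean = K_s @ np.linalg.solve(K, y_obs)
    prior_var = np.sum(Phi_s * Phi_s, axis=1)
    var = prior_var - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mean, var

x_obs = np.array([-1.0, 0.0, 0.5, 2.0])
y_obs = np.array([0.3, -0.1, 0.4, 1.5])
x_new = np.array([-0.5, 1.0, 3.0])
m_w, v_w = weight_space_predict(x_obs, y_obs, x_new, noise_var=0.1)
m_f, v_f = function_space_predict(x_obs, y_obs, x_new, noise_var=0.1)
```

Both functions return the predictive mean and variance of the latent function value, and the two pairs of results agree up to floating-point error.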
Slide 13: Short Summary
Given any unobserved point z, we can define a normal distribution over its predicted value such that:
Its mean is a linear combination of the observed values.
Its variance is related to its distance from the observed points (the closer to observed data, the smaller the variance).
Slide 14: BO Main Steps
Surrogate Function (Response Surface, Model)
Makes a posterior over unobserved points based on the prior.
Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
Acquisition Criterion (Function)
Decides which sample should be selected next.
Slide 15: Bayesian Optimization (Acquisition Criterion)
Remember what we are looking for:
Input: a set of observed data, and a set of points with their corresponding mean and variance.
Goal: decide which point should be selected next to reach the maximizer of the function faster.
There are different acquisition criteria (also called acquisition functions or policies).
Slide 16: Policies
Maximum Mean (MM).
Maximum Upper Interval (MUI).
Maximum Probability of Improvement (MPI).
Maximum Expected Improvement (MEI).
Slide 17: Policies: Maximum Mean (MM)
Returns the point with the highest expected value.
Advantage: if the model is stable and has been learned very well, it performs very well.
Disadvantage: there is a high chance of getting stuck in a local optimum (it only exploits).
Can it eventually converge to the global optimum? No.
Slide 18: Policies: Maximum Upper Interval (MUI)
Returns the point with the highest 95% upper interval.
Advantage: combines mean and variance (exploitation and exploration).
Disadvantage: it is dominated by the variance and mainly explores the input space.
Can it eventually converge to the global optimum? Yes, but it needs an almost infinite number of samples.
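MM and MUI differ only in whether the uncertainty enters the score. In this sketch the 95% upper interval is taken to be mean + 1.96 * std, which is an assumption about how the interval is computed; the candidate means and standard deviations are made up for illustration.

```python
def maximum_mean(means):
    """MM: index of the candidate with the highest predicted mean."""
    return max(range(len(means)), key=lambda i: means[i])

def maximum_upper_interval(means, stds, z=1.96):
    """MUI: index of the candidate with the highest mean + z * std."""
    return max(range(len(means)), key=lambda i: means[i] + z * stds[i])

# Three hypothetical candidates: two well-explored, one very uncertain.
means = [1.0, 1.2, 0.4]
stds = [0.1, 0.05, 0.9]

mm_choice = maximum_mean(means)                   # pure exploitation
mui_choice = maximum_upper_interval(means, stds)  # variance can dominate
```

Here MM picks the candidate with the best mean (index 1), while MUI picks the uncertain candidate (index 2) even though its mean is the lowest.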
Slide 19: Policies: Maximum Probability of Improvement (MPI)
Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m.
Slide 20: Policies: Maximum Probability of Improvement (MPI)
Advantage: considers the mean, the variance, and ymax in the policy (smarter than MUI).
Disadvantage: the ad hoc parameter m.
A large value of m? Exploration.
A small value of m? Exploitation.
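A minimal sketch of MPI, assuming a normal predictive distribution at each candidate and an additive margin, so the score is P(f(x) > ymax + m). The slides do not show whether the margin is additive or relative, so the additive form is an assumption, as are the example numbers.

```python
import math

def normal_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def probability_of_improvement(mean, std, y_max, margin):
    """P(f(x) > y_max + margin) for f(x) ~ N(mean, std^2)."""
    if std == 0.0:
        return 1.0 if mean > y_max + margin else 0.0
    return normal_cdf((mean - y_max - margin) / std)

def mpi_select(means, stds, y_max, margin):
    """Index of the candidate with the highest probability of improvement."""
    scores = [probability_of_improvement(mu, s, y_max, margin)
              for mu, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)

# Candidate 0: near the current best and well known.
# Candidate 1: lower mean but very uncertain.
means = [1.0, 0.6]
stds = [0.1, 1.0]
y_max = 1.0

small_margin_choice = mpi_select(means, stds, y_max, margin=0.01)
large_margin_choice = mpi_select(means, stds, y_max, margin=0.5)
```

With the small margin the well-known candidate wins (exploitation); with the large margin only the uncertain candidate has a realistic chance of clearing the bar (exploration).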
Slide 21: Policies: Maximum Expected Improvement (MEI)
Selects the sample with the maximum expected improvement.
Question: the expectation is over which variable? Answer: the margin m.
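The link to the margin m can be checked numerically: for a normal predictive distribution, the expected improvement E[max(0, f(x) - ymax)] equals the probability of improvement integrated over all margins m >= 0. The sketch below compares the standard closed form with a simple midpoint-rule integration. The function names, the integration range, and the example numbers are illustrative assumptions.

```python
import math

def normal_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mean, std, y_max):
    """Closed form of E[max(0, f - y_max)] for f ~ N(mean, std^2)."""
    if std == 0.0:
        return max(mean - y_max, 0.0)
    u = (mean - y_max) / std
    return (mean - y_max) * normal_cdf(u) + std * normal_pdf(u)

def expected_improvement_by_integration(mean, std, y_max, steps=20000):
    """Integrate P(f > y_max + m) over m in [0, 10 * std], midpoint rule."""
    width = 10.0 * std / steps
    total = 0.0
    for i in range(steps):
        m = (i + 0.5) * width
        total += normal_cdf((mean - y_max - m) / std) * width
    return total

ei_closed = expected_improvement(mean=0.6, std=1.0, y_max=1.0)
ei_integrated = expected_improvement_by_integration(mean=0.6, std=1.0, y_max=1.0)
```

Unlike MPI, this criterion has no margin parameter left to tune, because m has been integrated out.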
Slide 22: Policies: Upper Confidence Bounds
Selects based on the variance and mean of each point.
The selection of k is left to the user.
Recently, a principled approach to selecting this parameter has been proposed.
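A minimal sketch of the upper confidence bound rule, assuming the common form mean + k * std with a fixed, user-chosen k. The principled schedule for this parameter mentioned above is not reproduced here, and the example numbers are made up.

```python
def ucb_select(means, stds, k):
    """Index of the candidate with the highest mean + k * std."""
    return max(range(len(means)), key=lambda i: means[i] + k * stds[i])

# Three hypothetical candidates: two well-explored, one very uncertain.
means = [1.0, 1.2, 0.4]
stds = [0.1, 0.05, 0.9]

cautious_choice = ucb_select(means, stds, k=0.5)  # weights the mean more
curious_choice = ucb_select(means, stds, k=2.0)   # weights the variance more
```

A small k behaves like Maximum Mean, and a large k favors uncertain points, so k sets the balance between exploitation and exploration.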
Slide 23: Summary
We introduced several approaches, each of which has advantages and disadvantages.
MM
MUI
MPI
MEI
GPUCB
Which one should be selected for an unknown model?
Slide 24: GPHedge
GPHedge (2010) selects one of the baseline policies based on theoretical results from the multi-armed bandit problem, although the objective is a bit different!
The authors show that it can perform better than (or as well as) the best baseline policy in some settings.
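The bandit idea can be sketched with a Hedge-style rule: keep a cumulative gain per baseline policy, pick a policy with probability proportional to exp(eta * gain), then update the gains. This is an illustration of the general mechanism only. The parameter `eta`, the reward values, and the function names are assumptions, not the published algorithm's exact details.

```python
import math
import random

def hedge_probabilities(gains, eta=1.0):
    """Softmax over cumulative gains (shifted by the max for stability)."""
    top = max(gains)
    weights = [math.exp(eta * (g - top)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

def choose_policy(gains, rng, eta=1.0):
    """Sample the index of the policy whose suggestion will be evaluated."""
    probs = hedge_probabilities(gains, eta)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]

def update_gains(gains, rewards):
    """Add each policy's reward for this round to its cumulative gain."""
    return [g + r for g, r in zip(gains, rewards)]

policies = ["MPI", "MEI", "UCB"]
gains = [0.0, 0.0, 0.0]
rng = random.Random(0)

initial_probs = hedge_probabilities(gains)
# Hypothetical rewards for one round, e.g. how promising each policy's
# suggested point looks under the updated model.
gains = update_gains(gains, [0.2, 1.0, 0.5])
updated_probs = hedge_probabilities(gains)
picked = policies[choose_policy(gains, rng)]
```

Policies that keep earning higher rewards are chosen more often, while the others are still sampled occasionally.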
Slide 25: Future Work
Method selection smarter than GPHedge, with theoretical analysis.
Batch Bayesian optimization.
Scheduling Bayesian optimization.