Slide1
Based on an identically named article by Jun S. Liu, and on his book “Monte Carlo Strategies in Scientific Computing”
Nir Keret
Sequential Monte Carlo Methods for Dynamic Systems
Slide2
Importance Sampling
The basic idea: Suppose that X ~ F, with density f, and we wish to estimate μ = E_F[h(X)]. A straightforward way to go about it is to sample x_1, ..., x_n from F, and to calculate μ̂ = (1/n) Σ_i h(x_i). From the WLLN we know that μ̂ → μ as n → ∞.
However, sometimes sampling directly from F is hard/costly. Instead, we can sample x_1, ..., x_n from a distribution G, with density g, and calculate μ̂ = (1/n) Σ_i h(x_i) f(x_i)/g(x_i). The density ratio f(x)/g(x) is called the “weight” and we will refer to it as w(x).
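As a concrete sketch (my own illustration, not from the slides): estimate μ = E[h(X)] with h(x) = x² for X ~ N(0, 1), while sampling from a wider importance distribution G, a normal with standard deviation 2, so the true answer is 1.

```python
import math
import random

random.seed(0)

def f(x):
    """Target density: standard normal."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g(x):
    """Importance density: normal with standard deviation 2."""
    return math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))

def h(x):
    return x * x

n = 100_000
draws = [random.gauss(0, 2) for _ in range(n)]          # sample from G
weights = [f(x) / g(x) for x in draws]                  # w(x) = f(x)/g(x)
estimate = sum(h(x) * w for x, w in zip(draws, weights)) / n
print(estimate)  # close to E[X^2] = 1 under F
```

The weight f(x)/g(x) down-weights draws that G produces too often and up-weights draws it produces too rarely.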
Slide3
Importance Sampling
Conditions on the distribution G:
It should be easy to sample from.
g(x) > 0 wherever f(x) > 0, so that the weight w(x) = f(x)/g(x) is well defined.
g(x) should be as proportional as possible to h(x)f(x), in order to minimize the variance. (So, when g(x) ∝ h(x)f(x), the variance of the estimate is zero.)
Slide4
Importance Sampling
When f is known only up to a normalizing constant (as is often the case in Bayesian statistics), we can standardize the weights so that they sum to 1: W_i = w(x_i) / Σ_j w(x_j), and estimate μ̂ = Σ_i W_i h(x_i).
The normalizing constant then cancels out in the numerator and the denominator. This introduces a bias of O(1/n), but when the sample size is large it can be overlooked.
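A sketch of the self-normalized version (my own illustration): the target density is supplied without its normalizing constant, and the constant cancels in the standardized weights W_i.

```python
import math
import random

random.seed(1)

def f_unnorm(x):
    """Target known only up to a constant: N(0,1) without its 1/sqrt(2*pi)."""
    return math.exp(-x * x / 2)

def g(x):
    """Importance density: normal with standard deviation 2."""
    return math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))

n = 100_000
draws = [random.gauss(0, 2) for _ in range(n)]
raw_w = [f_unnorm(x) / g(x) for x in draws]
total = sum(raw_w)
W = [w / total for w in raw_w]                 # standardized weights, sum to 1
estimate = sum(Wi * x * x for Wi, x in zip(W, draws))
print(estimate)  # close to 1 despite the unknown normalizing constant
```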
Slide5
Importance Sampling
Weight Diagnostics: how to check that we chose a good importance distribution (G):
E_G[w(X)] = 1, so we know that some weights will be larger than 1, or even w(x) >> 1. When that happens, one or a few observations might “dominate” our estimate, as we are calculating μ̂ = Σ_i W_i h(x_i). We should balance the w(x) and h(x) values: we would like to assign low weights to high values of h(x) and vice versa. Therefore we could plot w(x) against h(x) and check whether this seems to be the case (we should expect a monotone-decreasing curve).
A badly chosen G might result in a large or even unbounded variance. (This may happen when we choose a light-tailed distribution, such as the normal, as the importance distribution for a heavy-tailed one, such as the logistic distribution.)
Slide6
Importance Sampling
Weight Diagnostics: In general, having a few weights with very high values is unfavorable, as it drives the other weights towards zero when we standardize them, and our estimate will be concentrated on a few samples.
Effective sample size: how many unweighted samples are our weighted samples worth? Consider a hypothetical unweighted sample of size Z: its mean has variance Var[h(X)]/Z. Setting this equal to the variance of the weighted estimate while solving for Z yields the effective sample size ESS = n / (1 + cv²(w)), where cv²(w) is the squared coefficient of variation of the weights. With standardized weights W_i, this is commonly estimated by ESS ≈ 1 / Σ_i W_i².
Slide7
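With standardized weights W_i, the common estimate ESS ≈ 1/ΣW_i² is a one-liner; a small sketch (mine):

```python
def effective_sample_size(weights):
    """ESS of a set of (possibly unstandardized) importance weights:
    standardize them, then return 1 / sum(W_i^2)."""
    total = sum(weights)
    W = [w / total for w in weights]
    return 1.0 / sum(Wi * Wi for Wi in W)

print(effective_sample_size([1.0] * 10))                 # 10.0: equal weights
print(effective_sample_size([100.0, 0.01, 0.01, 0.01]))  # ~1: one weight dominates
```

Equal weights give ESS = n; a single dominant weight drives ESS toward 1.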
Importance Resampling
When we use the standardized weights, we can essentially think of our estimate as the expectation of the discrete distribution which puts mass W_i on each observed x_i coming from G. Therefore, in a way, we are approximating f by this discrete distribution. Perhaps we can sample from it in order to get an approximate sample from f?
We can sample with replacement from the observed sample x_1, ..., x_n with probabilities W_1, ..., W_n.
*If we want to sample m values in this way, then we need n to be large and m << n.
Theorem: A random variable Y drawn in this manner has a distribution that converges to f as n → ∞.
Slide8
Importance Resampling
Proof: Let w(x) = f(x)/g(x) denote the original (unstandardized) weight. Consider a set A:
P(Y ∈ A | x_1, ..., x_n) = Σ_i 1{x_i ∈ A} W_i = [(1/n) Σ_i 1{x_i ∈ A} w(x_i)] / [(1/n) Σ_j w(x_j)].
By the WLLN, the numerator converges to E_G[1{X ∈ A} w(X)] = ∫_A f(x) dx, the denominator converges to E_G[w(X)] = 1, and by Lebesgue's dominated convergence theorem P(Y ∈ A) → ∫_A f(x) dx.
Slide9
The Dynamic Model
Definition: A sequence of evolving probability distributions π_t(X_t), indexed by discrete time t = 0, 1, 2, ..., is called a probabilistic dynamic system.
The state variable X_t has an increasing dimension as a function of t. That is, X_t = (X_{t-1}, x_t), where x_t can be a multidimensional component. X_t represents the entire history of the stochastic process up to time t.
The difference between π_t and π_{t+1} is caused by the incorporation of new information in the analysis. Suppose that we wish to estimate E_{π_t}[h(X_t)] with respect to π_t.
Slide10
The Dynamic Model Sequential Importance Sampling
One way to estimate E_{π_t}[h(X_t)] is to draw a sample of X_t sequences from π_t and calculate the average. We can utilize advanced sampling algorithms such as the Gibbs sampler in order to draw such sequences.
However, this solution requires us to start the process from scratch in every time index t. It is especially problematic when we want to do on-line estimation.
At time t, it would be better to update previous inferences than to act as if we had no previous information. In our approach, we will use sequential importance sampling: we will append the simulated x_t to the previously simulated X_{t-1} and adjust the importance weights accordingly.
Slide11
The Dynamic Model Sequential Importance Sampling
We can express our target density in the following manner:
π_t(X_t) = π_t(x_1) π_t(x_2 | x_1) · · · π_t(x_t | X_{t-1}).
Suppose we adopt the same form for the importance distribution:
g_t(X_t) = g_1(x_1) g_2(x_2 | x_1) · · · g_t(x_t | X_{t-1}).
Then the importance weights take the form:
w_t(X_t) = π_t(X_t) / g_t(X_t).
Slide12
The Dynamic Model Sequential Importance Sampling
And we can notice the recursive relation:
w_t(X_t) = w_{t-1}(X_{t-1}) · π_t(X_t) / [π_{t-1}(X_{t-1}) g_t(x_t | X_{t-1})].
Therefore, the algorithm for obtaining a sequence X_t and a corresponding weight w_t is given in the following steps:
1. Sample x_1 ~ g_1. Let w_1 = π_1(x_1) / g_1(x_1). Set t = 2.
2. Sample x_t ~ g_t(x_t | X_{t-1}).
3. Append x_t to X_{t-1}, obtaining X_t.
4. Let u_t = π_t(X_t) / [π_{t-1}(X_{t-1}) g_t(x_t | X_{t-1})].
5. Let w_t = w_{t-1} · u_t. At the current time, w_t is the importance weight for X_t.
6. Increment t and return to step 2.
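The six steps can be sketched on a toy instance of my own choosing (target π_t a product of standard normals, proposal g_t a normal with standard deviation 2 for each increment, so the increment is u_t = φ(x_t)/g(x_t) and E_π[x_T²] = 1):

```python
import math
import random

random.seed(3)

def phi(x, sd=1.0):
    """Normal density with mean 0."""
    return math.exp(-x * x / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Toy instance: pi_t(x_1..t) = prod_s N(x_s; 0, 1), g_t = N(0, sd=2),
# so the weight increment is u_t = phi(x_t) / phi(x_t, 2).
n, T = 10_000, 5
particles = [[] for _ in range(n)]
logw = [0.0] * n                      # log-weights, for numerical stability

for t in range(T):
    for i in range(n):
        x = random.gauss(0, 2)        # step 2: sample x_t ~ g_t
        particles[i].append(x)        # step 3: append, obtaining X_t
        u = phi(x) / phi(x, 2.0)      # step 4: weight adjustment factor u_t
        logw[i] += math.log(u)        # step 5: w_t = w_{t-1} * u_t

# Standardize the weights and estimate E_pi[x_T^2] (equals 1 here).
mx = max(logw)
w = [math.exp(lw - mx) for lw in logw]
total = sum(w)
W = [wi / total for wi in w]
est = sum(Wi * p[-1] ** 2 for Wi, p in zip(W, particles))
print(est)
```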
Slide13
The Dynamic Model Sequential Importance Sampling
In this way we can break a difficult task, such as directly sampling a whole sequence from the distribution π_t, into smaller, easier-to-handle bits.
A simple example: Consider the following sequence of probability densities: π_t(X_t) = Z_t⁻¹ exp{−||X_t||³ / 3}, where Z_t is a normalizing constant and ||·|| is the Euclidean norm.
There is no easy way to sample from this density directly. We can employ a standard normal distribution as the importance distribution.
The conditional density can be expressed as: π_t(x_t | X_{t-1}) ∝ exp{−(||X_{t-1}||² + x_t²)^{3/2} / 3}.
Slide14
The Dynamic Model Sequential Importance Sampling
1. Let t = 1. Sample n points x_1^(i) from a standard normal distribution.
2. Calculate the initial weights as w_1^(i) = exp{−|x_1^(i)|³ / 3} / φ(x_1^(i)), where φ is the standard normal density. The constants Z_t cancel out when the weights are standardized.
3. For t > 1, sample n points x_t^(i) from a standard normal distribution, and append them to X_{t-1}^(i), yielding X_t^(i).
Slide15
The Dynamic Model Sequential Importance Sampling
4. Calculate the weight adjustment factors u_t^(i) = exp{−(||X_t^(i)||³ − ||X_{t-1}^(i)||³) / 3} / φ(x_t^(i)), set w_t^(i) = w_{t-1}^(i) · u_t^(i), and standardize.
5. Increment t and return to step 3.
At each time t, we use our sample of n sequences in order to estimate the desired expectation, using IS.
What if instead we sampled a multivariate normal directly at each time t?
Slide16
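A sketch of this SIS scheme for the exp{−||X_t||³/3} target (my own implementation; n and T are arbitrary, and only the running squared norm of each sequence is stored, since that is all the weights need):

```python
import math
import random

random.seed(4)

def phi(x):
    """Standard normal density (the importance distribution)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

n, T = 5_000, 5
sq_norms = [0.0] * n          # ||X_t||^2 per sequence
logw = [0.0] * n              # log-weights, for numerical stability

for t in range(T):
    for i in range(n):
        x = random.gauss(0, 1)            # sample from the importance dist.
        old = sq_norms[i] ** 1.5          # ||X_{t-1}||^3
        sq_norms[i] += x * x
        new = sq_norms[i] ** 1.5          # ||X_t||^3
        # u_t = exp{-(||X_t||^3 - ||X_{t-1}||^3)/3} / phi(x_t)
        logw[i] += -(new - old) / 3.0 - math.log(phi(x))

mx = max(logw)
w = [math.exp(lw - mx) for lw in logw]
total = sum(w)
W = [wi / total for wi in w]              # standardized weights
ess = 1.0 / sum(Wi * Wi for Wi in W)
print(ess)  # effective sample size after T steps
```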
Weight Degeneracy and Rejuvenation
One major problem with sequential IS is that the weights become increasingly degenerate as t grows. Each time an unlikely new component is appended to a sequence, the weight for that entire sequence decreases proportionally. Eventually, since we are dealing with long time-sequences, only a few sequences will have avoided this phenomenon, and the weight will become more and more concentrated on a few sequences. Degenerate weights reduce the effective sample size and degrade estimation performance.
Slide17
Weight Degeneracy and Rejuvenation
Therefore, in practice, we have to perform “rejuvenation” steps every once in a while in order to keep the effective sample size large enough. We can perform those rejuvenation steps at pre-defined checkpoints (e.g., at times t = 5, 10, 15, ...), or instead we can constantly monitor our effective sample size and perform rejuvenation whenever it falls below a pre-defined threshold. This is a stochastic rejuvenation schedule. But how do we rejuvenate?
Slide18
Weight Degeneracy and Rejuvenation
First option: discard any sequence whose weight falls beneath a threshold and sample a new sequence instead. Problematic?
Second option: resampling. Out of the current n sequences, we can resample (with replacement) n new sequences. After resampling, we reset the weights of all sequences to 1/n. It should be noted that resampling does not improve inference at the current time, but rather improves the prospects for later times. Hence, inference at the current time should be done before resampling takes place.
Slide19
Weight Degeneracy and Rejuvenation
Resampling schemes:
Simple random sampling: resample n sequences from the current ones according to their standardized weights (reminiscent of importance resampling).
Residual resampling: retain k_i = ⌊n·W_i⌋ copies of each sequence X_t^(i), where W_i is its standardized weight. Let m = n − Σ_i k_i. Obtain m iid draws from the sample of sequences at time t, with probabilities proportional to the residuals n·W_i − k_i. Reset the weights to 1/n.
Slide20
Weight Degeneracy and Rejuvenation
Residual Resampling (example): n = 5.
i                      1       2       3       4       5
W_i                    0.9/5   2.1/5   0.25/5  1.25/5  0.5/5
n·W_i                  0.9     2.1     0.25    1.25    0.5
k_i = ⌊n·W_i⌋          0       2       0       1       0
residual n·W_i − k_i   0.9     0.1     0.25    0.25    0.5
residual probability   0.45    0.05    0.125   0.125   0.25
Retain 2 copies of sequence 2 and 1 copy of sequence 4 (Σ_i k_i = 3), and sample m = 2 additional sequences according to the residual probabilities.
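The example's bookkeeping can be reproduced with a short residual-resampling routine (my own sketch; random.choices handles the residual draws):

```python
import math
import random

random.seed(5)

def residual_resample(weights, rng=random):
    """Residual resampling: keep floor(n*W_i) copies of item i
    deterministically, then draw the remaining m items according to the
    residuals n*W_i - floor(n*W_i). Returns the copy count of each item."""
    n = len(weights)
    total = sum(weights)
    W = [w / total for w in weights]
    counts = [math.floor(n * Wi) for Wi in W]
    residuals = [n * Wi - k for Wi, k in zip(W, counts)]
    m = n - sum(counts)                   # number of residual draws left
    extra = rng.choices(range(n), weights=residuals, k=m) if m > 0 else []
    for j in extra:
        counts[j] += 1
    return counts

# The slide's example: n = 5, standardized weights (0.9, 2.1, 0.25, 1.25, 0.5)/5,
# so k = (0, 2, 0, 1, 0) and m = 2 residual draws remain.
counts = residual_resample([0.9, 2.1, 0.25, 1.25, 0.5])
print(counts)
```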
Slide21
Weight Degeneracy and Rejuvenation
Resampling deals with the weight-degeneracy problem, but creates a new one: now the different sequences are no longer statistically independent, and high-weight sequences are merely replicated rather than diversified.
Particle filters: particle filters employ additional measures in order to diversify the sequences at each resampling step (such as resampling from a mixture of smooth densities centered at some or all of the current particles). The simplest filter is the bootstrap filter, which simply performs simple random resampling at each time t. It is proper only for Markovian models.
Slide22
SISR – Examples – simulation of chain polymers
The self-avoiding random walk (SAW) on a two- or three-dimensional lattice is often used as a simple model for chain polymers such as polyester. A realization of a chain polymer of length N is fully characterized by the positions of all of its molecules (x_0, x_1, ..., x_N), where x_i is a point in the 2-D lattice. The distance between two adjacent points has to be 1, and x_i ≠ x_j for all i ≠ j. The line connecting two adjacent points is called a covalent bond.
Slide23
Slide23
SISR – Examples – simulation of chain polymers
One possible distribution for the chain polymer is the uniform distribution. More formally, π_t(X_t) = 1/Z_t, where Z_t is the total number of possible SAW chains of length t.
A basic method for sampling a SAW chain: at each step, the “walker” chooses with equal probabilities one of the 3 allowed neighbouring points (it cannot go backwards). If the chosen position has already been visited, the walker goes back to (0,0) and starts again. Problem? (The probability of completing a long chain without self-intersection is vanishingly small, so almost all attempts are wasted.)
A more efficient method is the one-step-look-ahead method. In this method, at each step the walker checks which neighbours are unoccupied and chooses one of them with equal probabilities. However, this introduces a bias, as can be demonstrated by the following example:
Slide24
SISR – Examples – simulation of chain polymers
Look at the following two instances of a 5-atom chain.
Using the one-step-look-ahead method, what are the probabilities of generating each of those? (Each chain is generated with probability equal to the product of 1/n_t over its steps, where n_t is the number of unoccupied neighbours at step t; this product depends on the shape of the chain, so the two probabilities differ.)
Hence, chains generated by this method do not comply with the uniform distribution assumption.
Slide25
SISR – Examples – simulation of chain polymers
We can correct for this bias using the IS framework. Essentially, we are using an importance distribution g_t(x_t | X_{t-1}) = 1/n_{t-1}, where n_{t-1} denotes the number of unoccupied neighbours of point x_{t-1}.
Since π_t(X_t) ∝ 1, the weight increment is u_t = n_{t-1}, and w_t = ∏_{s<t} n_s.
When t is large, two problems occur: 1. weight degeneracy; 2. the one-step-look-ahead method can become ineffective (the walker often gets trapped).
How would we implement a two-step-look-ahead method? (Here N(x_t) denotes the unoccupied neighbourhood of x_t.)
Advantages and disadvantages of bigger k-step-look-ahead methods?
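A sketch (mine) of growing one SAW with the one-step-look-ahead rule, accumulating the weight ∏ n_t and returning weight 0 when the walker is trapped:

```python
import random

random.seed(6)

def grow_saw(length, rng=random):
    """Grow one self-avoiding walk on the 2-D lattice with the
    one-step-look-ahead rule. Returns (chain, weight): the weight is the
    product of the numbers of unoccupied neighbours n_t seen at each step,
    and 0.0 if the walker gets trapped before reaching the full length."""
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(length):
        x, y = chain[-1]
        free = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) not in occupied]
        if not free:              # trapped: a dead (zero-weight) sequence
            return chain, 0.0
        weight *= len(free)       # weight increment u_t = n_{t-1}
        step = rng.choice(free)
        chain.append(step)
        occupied.add(step)
    return chain, weight

chain, w = grow_saw(15)
print(len(chain) - 1, w)  # number of bonds grown, and the chain's weight
```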
Slide26
Slide27
SISR – Examples – Terrain Navigation (HMM)
Consider a Markov sequence of unobservable random variables X_0, X_1, ..., indexed by time t, such that the distribution of X_t depends only on X_{t-1}.
Consider another sequence of observable random variables Y_0, Y_1, ..., such that Y_t depends only on the current state X_t, and the observations are conditionally independent given the first process. Thus, we have the model:
X_t | X_{t-1} ~ f_t(x_t | x_{t-1}),    Y_t | X_t ~ q_t(y_t | x_t),
where f_t and q_t are density functions. We wish to use the observed y_0, ..., y_t as data to estimate the states of the hidden Markov process. In the importance sampling framework, the target distribution is π_t(X_t) = p(x_0, ..., x_t | y_0, ..., y_t).
The following example is taken from the book “Computational Statistics” by Givens and Hoeting.
Slide28
SISR – Examples – Terrain Navigation (HMM)
However, note that there is a recursive relationship between π_t and π_{t-1}:
π_t(X_t) ∝ π_{t-1}(X_{t-1}) f_t(x_t | x_{t-1}) q_t(y_t | x_t).
We can decompose π_t like before:
π_t(X_t) = π_t(x_0) π_t(x_1 | x_0) · · · π_t(x_t | X_{t-1}).
But it isn't very helpful, since each conditional depends on all the observations up to time t.
Slide29
SISR – Examples – Terrain Navigation (HMM)
The importance distribution can be decomposed similarly:
g_t(X_t) = g_t(x_0) g_t(x_1 | x_0) · · · g_t(x_t | X_{t-1}).
However, we can choose g to be independent of the observations; specifically, it is convenient to choose g_t(x_t | X_{t-1}) = f_t(x_t | x_{t-1}) (Markovian).
Slide30
SISR – Examples – Terrain Navigation (HMM)
Then the weight increment at time t is:
u_t = π_t(X_t) / [π_{t-1}(X_{t-1}) g_t(x_t | X_{t-1})] ∝ q_t(y_t | x_t).
And now for the concrete example.
Slide31
SISR – Examples – Terrain Navigation (HMM)
An airplane flying over uneven terrain can use information about the ground elevation beneath it to estimate its current location. Simultaneously, an inertial navigation system provides an estimated travel
direction and distance. At each time point the previously estimated location of the plane is updated using both types of new information.
Interest in such problems arises in, for example, military applications where the approach could serve as an alternative or backup to global satellite systems.
Let the two-dimensional variable X_t = (X_{1t}, X_{2t}) denote the true location of the plane at time t, and let d_t denote the drift in the plane's location as measured by the inertial navigation system.
The key data for terrain navigation come from a map database that contains the true elevation m(x) at any location x.
Slide32
SISR – Examples – Terrain Navigation (HMM)
The hidden Markov model for terrain navigation is:
X_t = X_{t-1} + d_t + ε_t,    Y_t = m(X_t) + η_t,
where Y_t is the observed elevation, ε_t is the location error and η_t is the elevation-measurement error.
Let us suppose that the plane is following a circular arc specified by 101 angles θ_t (for t = 0, ..., 100), equally spaced along the arc, with the true location at time t determined by θ_t. Suppose that the random error ε_t is bivariate normal, as specified next.
Slide33
SISR – Examples – Terrain Navigation (HMM)
Slide34
SISR – Examples – Terrain Navigation (HMM)
This distribution effectively constitutes the importance distribution. This complicated specification is more simply described by saying that ε_t has a bivariate normal distribution with standard deviations q and kq, rotated so that the major axis of the density contours is parallel to the tangent of the flight path at the current location.
A standard bivariate normal distribution would be an alternative choice, but ours simulates the situation where uncertainty about the distance flown during the time increment is greater than uncertainty about deviations orthogonal to the direction of flight.
Slide35
SISR – Examples – Terrain Navigation (HMM)
The sequential importance sampling algorithm for this problem proceeds as follows:
1. Initialize at t = 0. Draw n starting points X_0^(i), for i = 1, ..., n. In real life, one could imagine that the initialization point corresponds to the departure airport or to some position update provided by occasional detection stations along the flight path, which provide highly accurate location data allowing the current location to be “reinitialized.”
2. Receive the observed elevation data y_t.
Slide36
SISR – Examples – Terrain Navigation (HMM)
3. Calculate the weight update factor u_t^(i) = q(y_t | X_t^(i)), the likelihood of the observed elevation at the particle's location.
4. Update the weights according to w_t^(i) = w_{t-1}^(i) · u_t^(i). If t = 0, then w_0^(i) = u_0^(i).
5. Standardize the weights so that they sum to 1.
6. Estimate the current location by the weighted average X̂_t = Σ_i W_t^(i) X_t^(i).
7. Check for weight degeneracy and rejuvenate if needed.
8. Sample a set of location errors ε_{t+1}^(i) and advance the set of locations according to X_{t+1}^(i) = X_t^(i) + d_{t+1} + ε_{t+1}^(i).
9. Increment t and return to step 2.
Slide37
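The nine steps can be sketched on a simplified 1-D stand-in of my own (a random-walk state observed directly with noise, instead of through a terrain map; all parameters are arbitrary), using simple random resampling at every step:

```python
import math
import random

random.seed(7)

def norm_pdf(y, mu, sd):
    """Normal density, used as the observation likelihood q(y | x)."""
    return math.exp(-((y - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

# Toy model: X_t = X_{t-1} + drift + eps_t,  Y_t = X_t + eta_t.
drift, eps_sd, eta_sd = 1.0, 0.5, 0.3
n, T = 2_000, 20

# Simulate a "true" path and its noisy observations.
truth, ys, x = [], [], 0.0
for t in range(T):
    x += drift + random.gauss(0, eps_sd)
    truth.append(x)
    ys.append(x + random.gauss(0, eta_sd))

particles = [0.0] * n
estimates = []
for t in range(T):
    # Step 8: advance each particle with the state equation.
    particles = [p + drift + random.gauss(0, eps_sd) for p in particles]
    # Steps 2-4: receive y_t and weight by the observation likelihood.
    w = [norm_pdf(ys[t], p, eta_sd) for p in particles]
    total = sum(w)
    W = [wi / total for wi in w]                                   # step 5
    estimates.append(sum(Wi * p for Wi, p in zip(W, particles)))   # step 6
    # Step 7: rejuvenate by simple random resampling (bootstrap filter).
    particles = random.choices(particles, weights=W, k=n)

err = max(abs(e - xt) for e, xt in zip(estimates, truth))
print(err)  # the filtered estimates track the true path
```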
Thanks!