Batch Reinforcement Learning
Alan Fern
* Based in part on slides by Ronald Parr
Overview
What is batch reinforcement learning?
Least Squares Policy Iteration
Fitted Q-Iteration
Batch DQN
Online versus Batch RL
Online RL: integrates data collection and optimization
  Select actions in the environment and at the same time update parameters based on each observed (s,a,s',r)
Batch RL: decouples data collection and optimization
  First generate/collect experience in the environment, giving a data set of state-action-reward-next-state tuples {(s_i, a_i, r_i, s_i')}
  We may not even know where the data came from
  Use the fixed set of experience to optimize/learn a policy
Online vs. Batch:
Batch algorithms are often more “data efficient” and stable
Batch algorithms ignore the exploration-exploitation problem, and do their best with the data they have
Batch RL Motivation
There are many applications that naturally fit the batch RL model
Medical Treatment Optimization:
  Input: collection of treatment episodes for an ailment giving a sequence of observations and actions, including outcomes
  Output: a treatment policy, ideally better than current practice
Emergency Response Optimization:
  Input: collection of emergency response episodes giving movement of emergency resources before, during, and after 911 calls
  Output: an emergency response policy
Batch RL Motivation
Online Education Optimization:
  Input: collection of episodes of students interacting with an educational system that gives information and questions in order to teach a topic
    Actions correspond to giving the student some information or giving them a question of a particular difficulty and topic
  Output: a teaching policy that is tuned to the student based on what is known about the student
Least Squares Policy Iteration (LSPI)
LSPI is a model-free batch RL algorithm
Learns a linear approximation of the Q-function
  Stable and efficient
  Never diverges or gives meaningless answers
LSPI can be applied to a dataset regardless of how it was collected
  But garbage in, garbage out.
Least-Squares Policy Iteration, Michail Lagoudakis and Ronald Parr, Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149.
Least Squares Policy Iteration
No time to cover details of the derivation
Details are in the appendix of these slides
LSPI is a wrapper around an algorithm called LSTDQ
LSTDQ: learns a Q-function for the current policy given the batch of data
  Can learn a Q-function for a policy from any (reasonable) set of samples---sometimes called an off-policy method
No need to collect samples from current policy
Disconnects policy evaluation from data collection
Permits reuse of data across iterations!
Truly a batch method.
Implementing LSTDQ
LSTDQ uses a linear Q-function with features φ(s,a) and weights w: Q_w(s,a) = w · φ(s,a)
Q_w defines a greedy policy: π_w(s) = argmax_a Q_w(s,a)
For each (s,a,r,s') sample in the data set, apply an incremental update (a sketch follows below)
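For concreteness, here is a minimal sketch of an LSTDQ computation over such a batch, assuming a user-supplied feature function phi(s, a) that returns a NumPy vector and a greedy_action helper encoding the current policy; these names, the ridge regularizer, and the final solve are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def lstdq(batch, phi, greedy_action, k, gamma=0.9, reg=1e-6):
    """Estimate Q-function weights for the policy implied by greedy_action.

    batch         : list of (s, a, r, s_next) tuples
    phi           : feature function phi(s, a) -> np.ndarray of length k
    greedy_action : current policy, maps s_next -> action (e.g. argmax of old Q)
    """
    B = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in batch:
        f = phi(s, a)
        f_next = phi(s_next, greedy_action(s_next))
        B += np.outer(f, f - gamma * f_next)   # accumulate k x k matrix
        b += f * r                             # accumulate k-vector
    # Small ridge term keeps the solve well-posed on limited data (an assumption).
    w = np.linalg.solve(B + reg * np.eye(k), b)
    return w
```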
Running LSPI
There is a Matlab implementation available!
Collect a database of (s,a,r,s') experiences (this is the magic step)
Start with random weights (= random policy)
Repeat
  Evaluate current policy against database
    Run LSTDQ to generate a new set of weights
    New weights imply a new Q-function and hence a new policy
  Replace current weights with new weights
Until convergence
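A sketch of this outer LSPI loop, wrapped around the lstdq routine sketched earlier; the weight-change convergence test and the explicit finite action set are assumptions made for illustration.

```python
import numpy as np

def lspi(batch, phi, actions, k, gamma=0.9, tol=1e-4, max_iters=20):
    """Batch LSPI: repeat LSTDQ until the weights stop changing."""
    w = np.random.randn(k)                      # random weights = random policy
    for _ in range(max_iters):
        # Greedy policy implied by the current weights.
        def greedy_action(s, w=w):
            return max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(batch, phi, greedy_action, k, gamma)
        if np.linalg.norm(w_new - w) < tol:     # convergence check (an assumption)
            return w_new
        w = w_new
    return w
```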
Results: Bicycle Riding
Watch a random controller operate the simulated bike
Collect ~40,000 (s,a,r,s') samples
Pick 20 simple feature functions (5 actions)
Make 5-10 passes over the data (PI steps)
Reward was based on distance to goal + goal achievement
Result: a controller that balances and rides to the goal
Bicycle Trajectories
What about Q-learning?
Ran Q-learning with the same features
Used experience replay for data efficiency
Q-learning Results
LSPI Robustness
Some key points
LSPI is a batch RL algorithm
Can generate trajectory data any way you want
Induces a policy based on global optimization over the full dataset
Very stable with no parameters that need tweaking
So, what's the bad news?
LSPI does not address the exploration problem
  It decouples data collection from policy optimization
  This is often not a major issue, but can be in some cases
k² can sometimes be big
  Lots of storage
  Matrix inversion can be expensive
The bicycle needed "shaping" rewards
Still haven't solved feature selection (an issue for all machine learning, but RL seems even more sensitive)
Fitted Q-Iteration
LSPI is limited to linear functions over a given set of features
Fitted Q-Iteration allows us to use any type of function approximator for the Q-function
  Random Forests have been popular
  Deep Networks
Fitted Q-Iteration is a very straightforward batch version of Q-learning
Damien Ernst, Pierre Geurts, Louis Wehenkel. (2005). Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6(Apr): 503-556.
Fitted Q-Iteration
1. Let D = {(s_i, a_i, r_i, s_i')} be our batch of transitions
2. Initialize the approximate Q-function Q̂ (perhaps the weights of a deep network)
3. Initialize the training set T = {}
4. For each (s_i, a_i, r_i, s_i') in D:
     q_i = r_i + γ max_a Q̂(s_i', a)    // new estimate of Q(s_i, a_i)
     Add training example ((s_i, a_i), q_i) to T
5. Learn a new Q̂ from the training data T
6. Goto 3
Step 5 could use any regression algorithm: neural network, random forests, support vector regression, Gaussian Process
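As a concrete illustration of the loop above, here is a sketch of Fitted Q-Iteration with a random-forest regressor in the spirit of Ernst et al.; the encode(s, a) feature helper, the all-zero initial Q̂, and the choice of scikit-learn's RandomForestRegressor are assumptions for illustration, and any regressor could be substituted at step 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(batch, actions, encode, n_iters=50, gamma=0.99):
    """batch   : list of (s, a, r, s_next) transitions
    actions : finite action set
    encode  : maps (s, a) -> feature vector (assumed helper, not from the slides)"""
    X = np.array([encode(s, a) for s, a, _, _ in batch])
    q_model = None
    for _ in range(n_iters):
        targets = []
        for s, a, r, s_next in batch:
            if q_model is None:
                q_next = 0.0                       # initial Q-hat taken to be zero (assumption)
            else:
                q_next = max(q_model.predict([encode(s_next, b)])[0]
                             for b in actions)
            targets.append(r + gamma * q_next)     # new estimate of Q(s, a)
        q_model = RandomForestRegressor(n_estimators=50)
        q_model.fit(X, np.array(targets))          # step 5: any regressor works here
    return q_model
```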
DQN
DQN was developed by DeepMind, originally for online learning of Atari games
However, the algorithm can be used effectively as-is for batch RL. I haven't seen this done, but it is straightforward.
DQN for Batch RL
1. Let D = {(s_i, a_i, r_i, s_i')} be our batch of transitions
2. Initialize the neural network parameter values to θ_0
3. Randomly sample a mini-batch of transitions from D
4. Perform a TD update for each parameter based on the mini-batch
5. Goto 3
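A sketch of what this batch DQN loop might look like in PyTorch, assuming a discrete action space and a user-defined q_net with a no-argument constructor; the target-network copy and the Adam optimizer are common DQN choices but are assumptions here, not details given on the slide.

```python
import random
import torch
import torch.nn.functional as F

def batch_dqn(batch, q_net, n_updates=10000, batch_size=32, gamma=0.99, lr=1e-4):
    """batch: list of (s, a, r, s_next) with s, s_next as float tensors, a as int."""
    target_net = type(q_net)()                    # assumed: same architecture, fresh copy
    target_net.load_state_dict(q_net.state_dict())
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for step in range(n_updates):
        minibatch = random.sample(batch, batch_size)
        s = torch.stack([t[0] for t in minibatch])
        a = torch.tensor([t[1] for t in minibatch])
        r = torch.tensor([t[2] for t in minibatch], dtype=torch.float32)
        s_next = torch.stack([t[3] for t in minibatch])
        with torch.no_grad():                     # TD target from the target network
            target = r + gamma * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_sa, target)           # TD update on the mini-batch
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 1000 == 0:                      # periodically refresh the target net
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```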
Appendix
Projection Approach to Approximation
Recall the standard Bellman equation, or equivalently V* = T[V*], where T[.] is the Bellman operator
Recall from value iteration that the sub-optimality of a value function can be bounded in terms of the Bellman error (see below)
This motivates trying to find an approximate value function with a small Bellman error
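The slide's equations reconstructed in their standard form; the sub-optimality bound below is one common statement of the Bellman-error bound, stated for the greedy policy with respect to V, and the exact form on the original slide may differ.

```latex
% Bellman optimality equation and operator form
V^*(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big],
\qquad V^* = T[V^*]
% Sub-optimality of the greedy policy \pi_V w.r.t. V, bounded by the Bellman error
\lVert V^{\pi_V} - V^* \rVert_\infty \;\le\; \frac{2\gamma}{1-\gamma}\, \lVert V - T[V] \rVert_\infty
```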
Projection Approach to Approximation
Suppose that we have a space of representable value functions
  E.g. the space of linear functions over given features
Let Π be a projection operator for that space
  Projects any value function (in or outside of the space) to the "closest" value function in the space
"Fixed Point" Bellman equation with approximation: V̂ = Π T[V̂]
  Depending on the space, this will have a small Bellman error
LSPI will attempt to arrive at such a value function
  Assumes a linear approximation and least-squares projection
Projected Value Iteration
Naïve idea: try computing the projected fixed point using VI
Exact VI (iterate Bellman backups): V_{i+1} = T[V_i]
Projected VI (iterate projected Bellman backups): V_{i+1} = Π T[V_i]
Here T[V_i] is the exact Bellman backup (the produced value function), and Π projects the exact Bellman backup to the closest function in our restricted function space
Example: Projected Bellman Backup
Restrict the space to linear functions over a single feature f
Suppose just two states s1 and s2 with: f(s1) = 1, f(s2) = 2
Suppose the exact backup of V_i gives some new values at s1 and s2
Can we represent this exact backup in our linear space? No
Example: Projected Bellman Backup
Restrict the space to linear functions over a single feature f
Suppose just two states s1 and s2 with: f(s1) = 1, f(s2) = 2
The exact backup of V_i can't be represented via our linear function
The projected backup is just the least-squares fit to the exact backup
Problem: Stability
Exact value iteration stability is ensured by the contraction property of Bellman backups: ||T[V] - T[V']||_∞ ≤ γ ||V - V'||_∞
Is the "projected" Bellman backup Π T[.] a contraction?
Example: Stability Problem [Bertsekas & Tsitsiklis 1996]
Problem: most projections lead to backups that are not contractions and are unstable
Two states s1 and s2; rewards all zero, single action, γ = 0.9, so V* = 0
Consider a linear approximation with a single feature f and weight w
The optimal weight is w = 0 since V* = 0
Example: Stability Problem
From V_i, perform the projected backup for each state
It can't be represented in our space, so find w_{i+1} that gives the least-squares approximation to the exact backup
After some math we can get: w_{i+1} = 1.2 w_i
What does this mean?
(Figure: states s1 and s2 with f(s1) = 1, V_i(s1) = w_i and f(s2) = 2, V_i(s2) = 2 w_i, where w_i is the weight value at iteration i)
Example: Stability Problem
(Figure: V(x) over the states S at iterations 1 and 2)
Each iteration of the Bellman backup makes the approximation worse!
Even for this trivial problem, "projected" VI diverges.
Understanding the Problem
What went wrong?
Exact Bellman backups reduce error in the max-norm
Least squares (= projection) is non-expansive in the L2 norm
  But it may increase max-norm distance!
Conclusion: alternating Bellman backups and projection is risky business
OK, What's LSTD?
Approximates the value function V^π of a policy π given trajectories of π
Assumes a linear approximation of V^π, denoted V̂
The f_k are arbitrary feature functions of states
Some vector notation follows.
Deriving LSTD
V̂ is a linear function in the column space of f_1 … f_k, that is, V̂ = F w, where F is the matrix with K basis-function columns and one row per state:
F = [ f_1(s1)  f_2(s1)  ...
      f_1(s2)  f_2(s2)  ...
      ...                  ]
F w assigns a value to every state.
Suppose we know the true value of the policy
We would like the following: F w ≈ V^π
The least-squares weights minimize the squared error
The least-squares projection is then given by the textbook least-squares projection operator, sometimes called the pseudoinverse
But we don't know V…
Recall the fixed-point equation for policies: V^π = R + γ P V^π
We will solve a projected fixed-point equation, substituting the least-squares projection into it
Solving for w gives the LSTD weights (as shown below)
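Reconstructing these steps in the matrix notation above (F for the feature matrix, R for rewards, P for the policy's transition matrix); this is the standard LSTD derivation, so the exact symbols may differ from the original slides.

```latex
% Least-squares weights and projection onto the column space of F
w = (F^\top F)^{-1} F^\top V^\pi, \qquad \Pi = F (F^\top F)^{-1} F^\top
% Projected fixed-point equation and its solution
F w = \Pi \big( R + \gamma P F w \big)
\;\Longrightarrow\;
w = \big( F^\top (F - \gamma P F) \big)^{-1} F^\top R
```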
Almost there…
The matrix to invert is only K x K
But…
Expensive to construct the matrix (e.g. P is |S| x |S|)
Presumably we are using LSPI because |S| is enormous
We don't know P
We don't know R
Using Samples for F
Idea: replace the enumeration of states with sampled states
Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
F = [ f_1(s1)  f_2(s1)  ...
      f_1(s2)  f_2(s2)  ...
      ...                  ]
with K basis-function columns and one row per sampled state.
Using Samples for R
Idea: replace the enumeration of rewards with sampled rewards
Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
R = [ r_1
      r_2
      ...  ]
with one entry per sample.
Using Samples for PF
Idea: replace the expectation over next states with sampled next states (s' from each (s,a,r,s') sample)
PF = [ f_1(s1')  f_2(s1')  ...
       f_1(s2')  f_2(s2')  ...
       ...                    ]
with K basis-function columns and one row per sampled next state.
Putting it Together
LSTD needs to compute w from the sampled quantities above
The hard part is B, the k x k matrix
Both B and b can be computed incrementally for each (s,a,r,s') sample (initialize both to zero), using the sampled F, R, and PF from the previous slides
LSTD Algorithm
Collect data by executing trajectories of the current policy
For each (s,a,r,s') sample, apply the incremental updates to B and b (as sketched below)
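The per-sample updates referred to here, reconstructed in the standard sample-based form; f(s) denotes the feature vector (f_1(s), ..., f_k(s))^T, and the exact notation on the original slide may differ.

```latex
% Incremental accumulation over samples (B and b initialized to zero)
B \leftarrow B + f(s)\,\big( f(s) - \gamma f(s') \big)^\top,
\qquad
b \leftarrow b + f(s)\, r
% After all samples, solve for the weights
w = B^{-1} b
```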
LSTD Summary
Does O(k²) work per datum
Linear in the amount of data
Approaches the model-based answer in the limit
Finding the fixed point requires inverting a matrix
The fixed point almost always exists
Stable; efficient
Approximate Policy Iteration with LSTD
Policy Iteration: iterates between policy improvement and policy evaluation
Idea: use LSTD for approximate policy evaluation in PI
Start with random weights w (i.e. a random value function)
Repeat until convergence:
  Evaluate the current policy π using LSTD
    Generate sample trajectories of π
    Use LSTD to produce new weights w (w gives an approximate value function of π)
  π = greedy(V̂_w)   // policy improvement
What Breaks?
No way to execute the greedy policy without a model
The approximation is biased by the current policy
  We only approximate values of states we see when executing the current policy
  LSTD is a weighted approximation toward those states
Can result in a learn-forget cycle of policy iteration
  Drive off the road; learn that it's bad
  New policy never does this; forgets that it's bad
Not truly a batch method
  Data must be collected from the current policy for LSTD
LSPI
LSPI is similar to the previous loop but replaces LSTD with a new algorithm, LSTDQ
LSTD: produces a value function
  Requires samples from the policy under consideration
LSTDQ: produces a Q-function for the current policy
  Can learn a Q-function for a policy from any (reasonable) set of samples---sometimes called an off-policy method
  No need to collect samples from the current policy
  Disconnects policy evaluation from data collection
  Permits reuse of data across iterations!
  Truly a batch method.