Presentation Transcript

Slide 1: Batch Reinforcement Learning
Alan Fern
* Based in part on slides by Ronald Parr

Slide 2: Overview
What is batch reinforcement learning?
Least Squares Policy Iteration
Fitted Q-Iteration
Batch DQN

Slide 3: Online versus Batch RL
Online RL: integrates data collection and optimization
Select actions in the environment and at the same time update parameters based on each observed (s, a, r, s')
Batch RL: decouples data collection and optimization
First generate/collect experience in the environment, giving a data set of state-action-reward-state tuples {(s_i, a_i, r_i, s_i')}
We may not even know where the data came from
Use the fixed set of experience to optimize/learn a policy
Online vs. Batch:
Batch algorithms are often more "data efficient" and stable
Batch algorithms ignore the exploration-exploitation problem, and do their best with the data they have

Slide 4: Batch RL Motivation
There are many applications that naturally fit the batch RL model.
Medical Treatment Optimization:
Input: a collection of treatment episodes for an ailment, giving sequences of observations and actions, including outcomes
Output: a treatment policy, ideally better than current practice
Emergency Response Optimization:
Input: a collection of emergency response episodes giving the movement of emergency resources before, during, and after 911 calls
Output: an emergency response policy

Slide 5: Batch RL Motivation (continued)
Online Education Optimization:
Input: a collection of episodes of students interacting with an educational system that gives information and questions in order to teach a topic
Actions correspond to giving the student some information or giving them a question of a particular difficulty and topic
Output: a teaching policy that is tuned to the student based on what is known about the student

Slide 6: Least Squares Policy Iteration (LSPI)
LSPI is a model-free batch RL algorithm
Learns a linear approximation of the Q-function
Stable and efficient: never diverges or gives meaningless answers
LSPI can be applied to a dataset regardless of how it was collected
But garbage in, garbage out.
"Least-Squares Policy Iteration", Michail Lagoudakis and Ronald Parr, Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149.

Slide 7: Least Squares Policy Iteration
No time to cover details of the derivation
Details are in the appendix of these slides
LSPI is a wrapper around an algorithm called LSTDQ
LSTDQ: learns a Q-function for the current policy given the batch of data
Can learn the Q-function for a policy from any (reasonable) set of samples (sometimes called an off-policy method)
No need to collect samples from the current policy
Disconnects policy evaluation from data collection
Permits reuse of data across iterations!
Truly a batch method.

Slide 8: Implementing LSTDQ
LSTDQ uses a linear Q-function with features f_k(s, a) and weights w_k:
Q̂(s, a; w) = Σ_k w_k f_k(s, a)
The weight vector w defines a greedy policy: π_w(s) = argmax_a Q̂(s, a; w)
For each (s, a, r, s') sample in the data set, accumulate (following the Lagoudakis & Parr formulation):
B ← B + f(s, a) (f(s, a) − γ f(s', π_w(s')))^T
b ← b + f(s, a) · r
Then solve for the new weights w' = B^{-1} b.
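To make this concrete, here is a minimal Python/NumPy sketch of one LSTDQ solve in the spirit of the accumulation above. The feature function phi(s, a), the finite action list, and the small ridge term added to B are illustrative assumptions, not details from the slides.

import numpy as np

def lstdq(data, phi, actions, weights, gamma=0.99, reg=1e-6):
    """One LSTDQ solve: estimate Q-weights for the greedy policy implied
    by `weights`, from a fixed batch of (s, a, r, s') samples.

    data    : iterable of (s, a, r, s_next) tuples
    phi     : feature function phi(s, a) -> np.ndarray of length k (assumed)
    actions : list of possible actions
    weights : current weight vector (defines the greedy policy)
    """
    k = len(weights)
    B = reg * np.eye(k)          # small ridge term keeps B invertible (my addition)
    b = np.zeros(k)
    for s, a, r, s_next in data:
        f = phi(s, a)
        # greedy action at s' under the current weights
        a_next = max(actions, key=lambda ap: phi(s_next, ap) @ weights)
        f_next = phi(s_next, a_next)
        B += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(B, b)  # new weights w' = B^{-1} b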

Slide 9: Running LSPI
There is a Matlab implementation available!
Collect a database of (s, a, r, s') experiences (this is the magic step)
Start with random weights (= random policy)
Repeat:
  Evaluate the current policy against the database: run LSTDQ to generate a new set of weights
  New weights imply a new Q-function and hence a new policy
  Replace the current weights with the new weights
Until convergence
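A sketch of this outer loop in Python, reusing the hypothetical lstdq function from the previous sketch; the convergence test on the change in weights and the iteration cap are my assumptions, since the slide only says to repeat until convergence.

import numpy as np

def lspi(data, phi, actions, k, gamma=0.99, tol=1e-4, max_iters=50):
    """Batch LSPI: repeatedly run LSTDQ against the same fixed dataset."""
    w = 0.01 * np.random.randn(k)            # random weights = random initial policy
    for _ in range(max_iters):
        w_new = lstdq(data, phi, actions, w, gamma)
        if np.linalg.norm(w_new - w) < tol:  # weights stopped changing
            return w_new
        w = w_new                            # new weights -> new Q-function -> new policy
    return w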

Slide 10: Results: Bicycle Riding
Watch a random controller operate a simulated bike
Collect ~40,000 (s, a, r, s') samples
Pick 20 simple feature functions (5 actions)
Make 5-10 passes over the data (PI steps)
Reward was based on distance to goal + goal achievement
Result: a controller that balances and rides to the goal

Slide 11: Bicycle Trajectories

Slide 12: What about Q-learning?
Ran Q-learning with the same features
Used experience replay for data efficiency

Slide 13: Q-learning Results

Slide 14: LSPI Robustness

Slide 15: Some Key Points
LSPI is a batch RL algorithm
Can generate trajectory data any way you want
Induces a policy based on global optimization over the full dataset
Very stable, with no parameters that need tweaking

Slide 16: So, What's the Bad News?
LSPI does not address the exploration problem
It decouples data collection from policy optimization
This is often not a major issue, but can be in some cases
k^2 can sometimes be big
Lots of storage; matrix inversion can be expensive
Bicycle needed "shaping" rewards
Still haven't solved feature selection (an issue for all machine learning, but RL seems even more sensitive)

Slide 17: Fitted Q-Iteration
LSPI is limited to linear functions over a given set of features
Fitted Q-Iteration allows us to use any type of function approximator for the Q-function
Random forests have been popular; deep networks as well
Fitted Q-Iteration is a very straightforward batch version of Q-learning
Damien Ernst, Pierre Geurts, Louis Wehenkel (2005). "Tree-Based Batch Mode Reinforcement Learning", Journal of Machine Learning Research, 6(Apr): 503-556.

Slide 18: Fitted Q-Iteration
1. Let {(s_i, a_i, r_i, s_i')} be our batch of transitions
2. Initialize an approximate Q-function Q̂ (perhaps the weights of a deep network)
3. Initialize the training set T = {}
4. For each (s_i, a_i, r_i, s_i'):
   q_i = r_i + γ max_a Q̂(s_i', a)   // new estimate of Q(s_i, a_i)
   Add the training example ((s_i, a_i), q_i) to T
5. Learn a new Q̂ from the training data T
6. Goto 3
Step 5 could use any regression algorithm: neural network, random forests, support vector regression, Gaussian Process regression
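A minimal Python sketch of this loop, using a scikit-learn random forest as the step-5 regressor. The encode(s, a) feature encoding, the implicit Q̂_0 = 0 initialization, and the fixed number of outer iterations are illustrative assumptions rather than details from the slide.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(batch, actions, encode, gamma=0.99, n_iters=20):
    """batch  : list of (s, a, r, s_next) transitions
    actions: finite list of actions
    encode : function mapping (s, a) -> 1-D feature vector (assumed)
    Returns the final regressor approximating Q(s, a)."""
    q = None                                      # treat Q_0 as 0 everywhere
    for _ in range(n_iters):
        X, y = [], []
        for s, a, r, s_next in batch:
            if q is None:
                target = r
            else:
                # new estimate of Q(s, a): r + gamma * max_a' Q(s', a')
                next_vals = q.predict(np.array([encode(s_next, ap) for ap in actions]))
                target = r + gamma * np.max(next_vals)
            X.append(encode(s, a))
            y.append(target)
        q = RandomForestRegressor(n_estimators=50).fit(np.array(X), np.array(y))
    return q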

Slide 19: DQN
DQN was developed by DeepMind, originally for online learning of Atari games
However, the algorithm can be used effectively as-is for batch RL. I haven't seen this done, but it is straightforward.

Slide 20: DQN for Batch RL
1. Let D = {(s_i, a_i, r_i, s_i')} be our batch of transitions
2. Initialize the neural network parameter values to θ_0
3. Randomly sample a mini-batch of transitions from D
4. Perform a TD update of the parameters based on the mini-batch
5. Goto 3
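A rough PyTorch sketch of this loop. The network architecture, optimizer, periodically-copied target network, and omission of terminal-state handling are my assumptions; the slide only specifies sampling mini-batches from the fixed batch and performing TD updates.

import random
import torch
import torch.nn as nn

def batch_dqn(batch, state_dim, n_actions, gamma=0.99,
              steps=10000, minibatch=32, target_sync=500):
    """Batch-mode DQN sketch: batch is a fixed list of (s, a, r, s_next)
    transitions, with s and s_next given as float feature vectors."""
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                               nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    for step in range(steps):
        # 3. Randomly sample a mini-batch of transitions from the fixed batch
        s, a, r, s2 = zip(*random.sample(batch, minibatch))
        s = torch.tensor(s, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64)
        r = torch.tensor(r, dtype=torch.float32)
        s2 = torch.tensor(s2, dtype=torch.float32)

        # 4. TD update of the parameters based on the mini-batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * target_net(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q_sa, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Periodically refresh the frozen target network (an assumption, not on the slide)
        if step % target_sync == 0:
            target_net.load_state_dict(q_net.state_dict())
    return q_net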

Slide 21: Appendix

Slide 22: Projection Approach to Approximation
Recall the standard Bellman equation:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
or equivalently V* = T[V*], where T[·] is the Bellman operator
Recall from value iteration that the sub-optimality of a value function can be bounded in terms of the Bellman error:
||V − V*||_∞ ≤ ||T[V] − V||_∞ / (1 − γ)
This motivates trying to find an approximate value function with small Bellman error

Slide 23: Projection Approach to Approximation
Suppose that we have a space of representable value functions
E.g. the space of linear functions over given features
Let Π be a projection operator for that space
Π projects any value function (in or outside of the space) to the "closest" value function in the space
"Fixed point" Bellman equation with approximation:
V̂ = Π T[V̂]
Depending on the space, this will have a small Bellman error
LSPI will attempt to arrive at such a value function
Assumes linear approximation and least-squares projection

Slide 25: Projected Value Iteration
Naïve idea: try computing the projected fixed point using VI
Exact VI (iterate Bellman backups):
V_{i+1} = T[V_i]
Projected VI (iterate projected Bellman backups):
V̂_{i+1} = Π T[V̂_i]
Here T[V̂_i] is the exact Bellman backup (it produces a value function), and Π projects that exact Bellman backup to the closest function in our restricted function space
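The following NumPy snippet illustrates a single projected backup Π T[V̂_i] for a made-up two-state chain; the transition matrix, rewards, and feature values are arbitrary choices for demonstration and are not the specific example analyzed on the later slides.

import numpy as np

# Hypothetical 2-state Markov chain under a fixed policy (illustrative numbers)
P = np.array([[0.0, 1.0],      # s1 -> s2
              [0.3, 0.7]])     # s2 -> s1 or s2
R = np.array([1.0, 0.0])
gamma = 0.9

# Restricted space: linear functions of a single feature, V_hat = F @ w
F = np.array([[1.0],           # f(s1) = 1
              [3.0]])          # f(s2) = 3
w = np.array([0.5])            # current weights, V_hat_i = F @ w
V_hat = F @ w

# Exact Bellman backup T[V_hat_i] (a value over every state)
backup = R + gamma * P @ V_hat

# Projected backup: least-squares fit of the restricted space to the exact backup
w_next, *_ = np.linalg.lstsq(F, backup, rcond=None)
print("exact backup:", backup, " projected backup:", F @ w_next)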

Slide 26: Example: Projected Bellman Backup
Restrict the space to linear functions over a single feature f
Suppose just two states s1 and s2 with f(s1) = 1 and f(s2) = 2
Suppose the exact backup of V_i gives values at s1 and s2 that do not have the form w·f(s) for any single weight w
Can we represent this exact backup in our linear space? No

Slide 27: Example: Projected Bellman Backup (continued)
Same setup: linear functions over a single feature f, two states s1 and s2 with f(s1) = 1 and f(s2) = 2
The backup can't be represented via our linear function, so the projected backup is just the least-squares fit to the exact backup

Slide 28: Problem: Stability
Exact value iteration stability is ensured by the contraction property of Bellman backups:
||T[V] − T[V']||_∞ ≤ γ ||V − V'||_∞
Is the "projected" Bellman backup a contraction:
||Π T[V] − Π T[V']||_∞ ≤ γ ||V − V'||_∞ ?

Slide 29: Example: Stability Problem [Bertsekas & Tsitsiklis 1996]
Problem: most projections lead to backups that are not contractions and are unstable
Two states s1 and s2; rewards all zero, single action, γ = 0.9, so V* = 0
Consider a linear approximation with a single feature f and weight w
The optimal weight is w = 0, since V* = 0

Slide 30: Example: Stability Problem (continued)
From V_i, perform the projected backup for each state
With f(s1) = 1 and f(s2) = 2, we have V_i(s1) = w_i and V_i(s2) = 2·w_i, where w_i is the weight value at iteration i
The exact backup can't be represented in our space, so find the w_{i+1} that gives the least-squares approximation to the exact backup
After some math we can get: w_{i+1} = 1.2 · w_i
What does this mean?

Slide 31: Example: Stability Problem (continued)
(Figure: approximate value function V plotted over the states at iterations 1 and 2)
Each iteration of Bellman backup makes the approximation worse!
Even for this trivial problem, "projected" VI diverges.

Slide 32: Understanding the Problem
What went wrong?
Exact Bellman backups reduce error in the max-norm
Least squares (= projection) is non-expansive in the L2 norm
But it may increase max-norm distance!
Conclusion: alternating Bellman backups and projection is risky business

Slide 33: OK, What's LSTD?
Approximates the value function V^π of a policy π, given trajectories of π
Assumes a linear approximation of V^π, denoted V̂^π(s) = Σ_k w_k f_k(s)
The f_k are arbitrary feature functions of states
Some vector notation: treat value functions and rewards as vectors with one entry per state, and the weights w as a K-vector

Slide 34: Deriving LSTD
V̂^π is a linear function in the column space of f_1, ..., f_K; that is, V̂^π = F w
F is a (# states) x (K basis functions) matrix:
F = [ f_1(s1)  f_2(s1)  ...
      f_1(s2)  f_2(s2)  ...
      ...                  ]
F w assigns a value to every state

Slide 35: Suppose We Know the True Value of the Policy
We would like the following: F w ≈ V^π
The least-squares weights minimize the squared error:
w = (F^T F)^{-1} F^T V^π
The least-squares projection is then:
V̂^π = F w = F (F^T F)^{-1} F^T V^π
F (F^T F)^{-1} F^T is the textbook least-squares projection operator; (F^T F)^{-1} F^T is sometimes called the pseudoinverse
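A quick NumPy sanity check of this projection, using an arbitrary two-state feature matrix and target vector (the numbers are illustrative only):

import numpy as np

F = np.array([[1.0], [2.0]])                     # feature matrix, one row per state
V = np.array([3.0, 5.0])                         # some target value vector (made up)

w = np.linalg.pinv(F) @ V                        # pseudoinverse: (F^T F)^{-1} F^T V
proj = F @ (np.linalg.inv(F.T @ F) @ F.T @ V)    # textbook projection operator applied to V
assert np.allclose(F @ w, proj)                  # both give the least-squares fit F w
print(w, proj)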

Slide 36: But We Don't Know V...
Recall the fixed-point equation for policies (here P is the policy's transition matrix and R its reward vector):
V^π = R + γ P V^π
We will solve a projected fixed-point equation instead. Substituting the least-squares projection into it gives:
F w = F (F^T F)^{-1} F^T (R + γ P F w)
Solving for w:
w = (F^T (F − γ P F))^{-1} F^T R

Slide 37: Almost There...
The matrix to invert is only K x K
But...
It is expensive to construct the matrix (e.g. P is |S| x |S|)
Presumably we are using LSPI because |S| is enormous
We don't know P
We don't know R

Slide 38: Using Samples for F
Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
Idea: replace the enumeration of states with the sampled states
The sampled F becomes a (# samples) x (K basis functions) matrix whose rows are the feature vectors of the sampled states:
F ≈ [ f_1(s_1)  f_2(s_1)  ...
      f_1(s_2)  f_2(s_2)  ...
      ...                    ]

Slide 39: Using Samples for R
Suppose we have state transition samples of the policy running in the MDP: {(s_i, a_i, r_i, s_i')}
Idea: replace the enumeration of rewards with the sampled rewards
R ≈ [ r_1, r_2, ... ]^T   (one entry per sample)


Slide 41: Using Samples for PF
Idea: replace the expectation over next states with the sampled next states, taking s' from each (s, a, r, s') sample
The sampled PF becomes a (# samples) x (K basis functions) matrix whose rows are the feature vectors of the sampled next states:
PF ≈ [ f_1(s_1')  f_2(s_1')  ...
       f_1(s_2')  f_2(s_2')  ...
       ...                      ]

Slide 42: Putting It Together
LSTD needs to compute:
w = B^{-1} b, where B = F^T (F − γ P F) and b = F^T R
The hard part is B, the k x k matrix
Both B and b can be computed incrementally for each (s, a, r, s') sample (initialize both to zero):
B ← B + f(s) (f(s) − γ f(s'))^T
b ← b + f(s) · r
using the sampled F, R, and PF from the previous slides

Slide 43: LSTD Algorithm
Collect data by executing trajectories of the current policy
For each (s, a, r, s') sample:
B ← B + f(s) (f(s) − γ f(s'))^T
b ← b + f(s) · r
Set w = B^{-1} b
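A compact NumPy sketch of this per-sample accumulation; the state feature function f is a hypothetical placeholder for whatever features are chosen, and the small ridge term added to B is my own safeguard for invertibility.

import numpy as np

def lstd(samples, f, k, gamma=0.99, reg=1e-6):
    """samples: iterable of (s, a, r, s_next) from the current policy
    f      : state feature function f(s) -> np.ndarray of length k (assumed)"""
    B = reg * np.eye(k)              # small ridge term keeps B invertible
    b = np.zeros(k)
    for s, _a, r, s_next in samples:
        fs, fs2 = f(s), f(s_next)
        B += np.outer(fs, fs - gamma * fs2)
        b += fs * r
    return np.linalg.solve(B, b)     # w = B^{-1} b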

Slide 44: LSTD Summary
Does O(k^2) work per datum
Linear in the amount of data
Approaches the model-based answer in the limit
Finding the fixed point requires inverting a matrix
The fixed point almost always exists
Stable; efficient

Slide 45: Approximate Policy Iteration with LSTD
Start with random weights w (i.e. a random value function)
Repeat until convergence:
  Evaluate the current policy π using LSTD:
    Generate sample trajectories of π
    Use LSTD to produce new weights w (w gives an approximate value function V̂_w of π)
  π = greedy(V̂_w)   // policy improvement
Policy iteration: iterates between policy improvement and policy evaluation
Idea: use LSTD for approximate policy evaluation in PI

Slide 46: What Breaks?
There is no way to execute the greedy policy without a model
The approximation is biased by the current policy
We only approximate values of states we see when executing the current policy
LSTD is a weighted approximation toward those states
Can result in a learn-forget cycle of policy iteration:
  Drive off the road; learn that it's bad
  New policy never does this; forgets that it's bad
Not truly a batch method: data must be collected from the current policy for LSTD

Slide 47: LSPI
LSPI is similar to the previous loop but replaces LSTD with a new algorithm, LSTDQ
LSTD: produces a value function; requires samples from the policy under consideration
LSTDQ: produces a Q-function for the current policy
Can learn the Q-function for a policy from any (reasonable) set of samples (sometimes called an off-policy method)
No need to collect samples from the current policy
Disconnects policy evaluation from data collection
Permits reuse of data across iterations!
Truly a batch method.