
Slide1

Factored Approaches for MDP & RL

(Some slides taken from Alan Fern's course)

Slide2

Factored MDP/RL

Representations

States made of features

Boolean vs. Continuous

Actions modify the features (probabilistically)

Representations include Probabilistic STRIPS, 2-Time-slice Dynamic Bayes Nets, etc.

Reward and Value functions

Representations include ADDs, linear weighted sums of features, etc.

Advantages

Specification: is far easier

Inference:

Novel lifted versions of the Value and Policy iterations possible

Bellman backup directly in terms of ADDs

Policy gradient approach where you do direct search in the policy space

Learning: Generalization possibilities

Q-learning etc. will now directly update the factored representations (e.g. weights of the features)

Thus giving implicit generalization

Approaches such as FF-HOP can recognize and reuse common substructure

Slide3

Problems with transition systems

Transition systems are a great conceptual tool to understand the differences between the various planning problems

…However direct manipulation of transition systems tends to be too cumbersome

The size of the explicit graph corresponding to a transition system is often very large

The remedy is to provide “compact” representations for transition systems

Start by explicating the structure of the “states”

e.g. states specified in terms of state variables

Represent actions not as incidence matrices but rather as functions specified directly in terms of the state variables

An action will work in any state where some state variables have certain values. When it works, it will change the values of certain (other) state variables

Slide4

State Variable Models

World is made up of states which are defined in terms of state variables

Can be Boolean (or multi-valued, or continuous)

States are complete assignments over state variables

So, k Boolean state variables can represent how many states? (2^k)

Actions change the values of the state variables

Applicability conditions of actions are also specified in terms of partial assignments over state variables

Slide5

Blocks world

State variables:

Ontable(x), On(x,y), Clear(x), hand-empty, holding(x)

Stack(x,y)

Prec: holding(x), clear(y)

eff: on(x,y), ~clear(y), ~holding(x), hand-empty

Unstack(x,y)

Prec: on(x,y), hand-empty, clear(x)

eff: holding(x), ~clear(x), clear(y), ~hand-empty

Pickup(x)

Prec: hand-empty, clear(x), ontable(x)

eff: holding(x), ~ontable(x), ~hand-empty, ~clear(x)

Putdown(x)

Prec: holding(x)

eff: ontable(x), hand-empty, clear(x), ~holding(x)

Initial state:

Complete specification of T/F values to state variables

--By convention, variables with F values are omitted

Goal state: A partial specification of the desired state variable/value combinations --desired values can be both positive and negative

Init: Ontable(A), Ontable(B), Clear(A), Clear(B), hand-empty

Goal: ~clear(B), hand-empty

All the actions here have only positive preconditions; but this is not necessary

STRIPS ASSUMPTION: If an action changes a state variable, this must be explicitly mentioned in its effects
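To make the state-variable reading concrete, here is a minimal sketch (not part of the original slides; the helper names are illustrative) of a STRIPS operator stored as precondition/add/delete sets of ground fluents and applied to the initial state above:

# Illustrative sketch: a state is the set of fluents that are currently true (false fluents are omitted).

def applicable(state, prec):
    return prec <= state                      # all preconditions hold in the state

def progress(state, add, delete):
    return (state - delete) | add             # STRIPS assumption: everything else persists

# Ground instance Pickup(A) of the operator from the slide
prec = {"hand-empty", "clear(A)", "ontable(A)"}
add = {"holding(A)"}
delete = {"ontable(A)", "hand-empty", "clear(A)"}

init = {"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"}
if applicable(init, prec):
    print(progress(init, add, delete))        # {'ontable(B)', 'clear(B)', 'holding(A)'}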

Slide6

Why is the STRIPS representation compact (compared to explicit transition systems)?

In explicit transition systems, actions are represented as state-to-state transitions, where each action is represented by an incidence matrix of size |S|x|S|

In the state-variable model, actions are represented only in terms of the state variables whose values they care about and whose values they affect.

Consider a state space of 1024 states. It can be represented by log2(1024) = 10 state variables. If an action needs variable v1 to be true and makes v7 false, it can be represented by just 2 bits (instead of a 1024x1024 matrix)

Of course, if the action has a complicated mapping from states to states, in the worst case the action rep will be just as large

The assumption being made here is that the actions will have effects on a small number of state variables.

[Diagram: spectrum of action representations, from Situation Calculus (first-order) to the STRIPS representation (relational/propositional) to the explicit transition representation (atomic)]

Slide7

Factored Representations for MDPs: Actions

Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too!

Write a Bayes network relating the values of fluents at the state before and after the action

Bayes networks representing fluents at different time points are called "Dynamic Bayes Networks"

We look at 2-TBNs (2-time-slice dynamic Bayes nets)

Go further by using the STRIPS assumption

Fluents not affected by the action are not represented explicitly in the model

This is called the Probabilistic STRIPS Operator (PSO) model
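As a rough illustration of the PSO idea (a sketch with made-up probabilities, not taken from the slides), an action can be stored as a precondition plus a small set of probabilistic outcomes, each an add/delete pair; unmentioned fluents persist, as in STRIPS:

import random

# Illustrative Probabilistic STRIPS Operator: Pickup(A) with a hypothetical 10% chance of slipping.
pso_pickup_A = {
    "prec": {"hand-empty", "clear(A)", "ontable(A)"},
    "outcomes": [  # (probability, add set, delete set)
        (0.9, {"holding(A)"}, {"ontable(A)", "hand-empty", "clear(A)"}),
        (0.1, set(), set()),      # the gripper slips and nothing changes
    ],
}

def sample_next_state(state, op):
    if not op["prec"] <= state:
        return state              # not applicable: state unchanged
    r, acc = random.random(), 0.0
    for prob, add, delete in op["outcomes"]:
        acc += prob
        if r <= acc:
            return (state - delete) | add
    return state

s0 = {"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"}
print(sample_next_state(s0, pso_pickup_A))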

Slide8

Action CLK

Slide9
Slide10
Slide11

Factored Representations: Reward, Value and Policy Functions

Reward functions can be represented in factored form too. Possible representations include

Decision trees (made up of fluents)

ADDs (Algebraic decision diagrams)
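As a small illustration of the first option just listed, a reward function can be stored as a decision tree over Boolean fluents instead of a table over all 2^n states; the fluent names and reward values below are hypothetical, not from the slides:

# Illustrative decision-tree reward: an internal node is (fluent, true-branch, false-branch), a leaf is a reward value.
# Only the fluents the reward actually depends on appear in the tree.

reward_tree = ("goal-achieved",
               ("hand-empty", 10.0, 8.0),    # goal achieved; slightly less reward if still holding a block
               0.0)                          # goal not achieved

def reward(tree, state):
    while not isinstance(tree, float):
        fluent, if_true, if_false = tree
        tree = if_true if fluent in state else if_false
    return tree

print(reward(reward_tree, {"goal-achieved", "hand-empty"}))   # 10.0
print(reward(reward_tree, {"hand-empty"}))                    # 0.0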

Value functions are like reward functions (so they too can be represented similarly)

The Bellman update can then be done directly using factored representations

Slide12
Slide13

SPUDD's use of ADDs

Slide14

Direct manipulation of ADDs in SPUDD

Slide15

Ideas for Efficient Algorithms..

Use heuristic search (and reachability information)

LAO*, RTDP

Use execution and/or Simulation

“Actual Execution” Reinforcement learning (Main motivation for RL is to “learn” the model)

“Simulation”: simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)

Use “factored” representations

Factored representations for Actions, Reward Functions, Values and Policies

Directly manipulating factored representations during the Bellman update

Slide16

Probabilistic Planning

--The competition (IPPC)

--The Action language..

PPDDL was based on PSO; a new standard, RDDL, is based on 2-TBN

Slide17
Slide18

Not ergodic

Slide19
Slide20
Slide21

Reducing Heuristic Computation Cost by exploiting factored representations

The heuristics computed for a state might give us an idea about the heuristic value of other “similar” states

Similarity is possible to determine in terms of the state structure

Exploit overlapping structure of heuristics for different states

E.g. the SAG idea for McLUG

E.g. the triangle tables idea for plans (c.f. Kolobov)

Slide22

A Plan is a Terrible Thing to Waste

Suppose we have a plan

s0—a0—s1—a1—s2—a2—s3 … an—sG

We realize that this tells us not just the estimated value of s0, but also of s1, s2, …, sn

So we don't need to compute the heuristic for them again

Is that all?

If we have states and actions in factored representation, then we can explain exactly what aspects of si are relevant for the plan's success.

The “explanation” is a proof of correctness of the plan

It can be based on regression (if the plan is a sequence) or a causal proof (if the plan is partially ordered)

The explanation will typically be just a subset of the literals making up the state

That means the plan suffix from si may actually be relevant in many more states, namely those consistent with that explanation

Slide23

Triangle Table Memoization

Use triangle tables / memoization

[Figure: a blocks-world problem in which a tower of blocks A, B, C must be rearranged into a different tower]

If the above problem is solved, then we don’t need to call FF again for the below:

[Figure: a smaller problem involving only blocks A and B]

Slide24

Explanation-based Generalization

(of Successes and Failures)

Suppose we have a plan P that solves a problem [S, G].

We can first find out what aspects of S this plan actually depends on

Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof

Now you can memoize this plan for just that subset of S

Slide25

Relaxations for Stochastic Planning

Determinizations can also be used as a basis for heuristics to initialize the V for value iteration [mGPT; GOTH, etc.]

Heuristics come from relaxation

We can relax along two separate dimensions:

Relax -ve interactions: consider +ve interactions alone using relaxed planning graphs

Relax uncertainty: consider determinizations

Or a combination of both!

Slide26

Solving Determinizations

If we relax -ve interactions:

Then compute a relaxed plan

Admissible if the optimal relaxed plan is computed; inadmissible otherwise

If we keep -ve interactions:

Then use a deterministic planner (e.g. FF/LPG)

Inadmissible unless the underlying planner is optimal

Slide27

Dimensions of Relaxation

[Diagram: relaxation along two axes, Uncertainty and Negative Interactions, with increasing consideration of each: (1) Relaxed Plan Heuristic, (2) McLUG, (3) FF/LPG, (4) Limited width stochastic planning]

Reducing Uncertainty: bound the number of stochastic outcomes → stochastic “width”

Slide28

Dimensions of Relaxation

[Table: approaches classified by how much Uncertainty and how many -ve Interactions they consider (None / Some / Full): Relaxed Plan, McLUG, FF/LPG, Limited width stochastic planning]

Slide29

Expressiveness v. Cost

[Figure: expressiveness vs. cost spectrum (h = 0, FF-Replan, McLUG, FF, limited width stochastic planning), trading node expansions against heuristic computation cost]

Slide30
Slide31

--Factored TD and Q-learning

--Policy search (has to be factored..)

Slide32


Large State Spaces

When a problem has a large state space we can no longer represent the V or Q functions as explicit tables

Even if we had enough memory

Never enough training data!

Learning takes too long

What to do??

[Slides from Alan Fern]

Slide33


Function Approximation

Never enough training data!

Must generalize what is learned from one situation to other “similar” new situations

Idea: instead of using a large table to represent V or Q, use a parameterized function

The number of parameters should be small compared to number of states (generally exponentially fewer parameters)

Learn parameters from experience

When we update the parameters based on observations in one state, then our V or Q estimate will also change for other similar states

I.e. the parameterization facilitates generalization of experience

Slide34


Linear Function Approximation

Define a set of state features f1(s), …, fn(s)

The features are used as our representation of states

States with similar feature values will be considered to be similar

A common approximation is to represent V(s) as a weighted sum of the features (i.e. a linear approximation)

The approximation accuracy is fundamentally limited by the information provided by the features

Can we always define features that allow for a perfect linear approximation?

Yes. Assign each state an indicator feature (i.e. the i'th feature is 1 iff the i'th state is present, and θi represents the value of the i'th state)

Of course, this requires far too many features and gives no generalization.
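A minimal sketch of the idea (using the grid example that follows on the next slide; the helper names are not from the course): V is a weighted sum of features, so only a handful of parameters are stored instead of one value per state.

# Illustrative linear value approximation: V_theta(s) = theta_0*1 + theta_1*f1(s) + theta_2*f2(s)

def features(s):
    x, y = s                                   # a state of the upcoming grid example
    return [1.0, float(x), float(y)]           # constant (bias) feature plus the two coordinates

def v_approx(theta, s):
    return sum(t * f for t, f in zip(theta, features(s)))

theta = [10.0, -1.0, -1.0]                     # the weights found on the next slide
print(v_approx(theta, (2, 3)))                 # 10 - 2 - 3 = 5.0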

Slide35

Example

Consider a grid problem with no obstacles and deterministic actions U/D/L/R (49 states)

Features for state s=(x,y): f1(s)=x, f2(s)=y (just 2 features)

V(s) = θ0 + θ1·x + θ2·y

Is there a good linear approximation?

Yes.

θ0 = 10, θ1 = -1, θ2 = -1 (note upper right is origin)

V(s) = 10 - x - y, which subtracts the Manhattan distance from the goal reward

[Figure: the 7x7 grid, with reward 10 at the goal corner (the origin) and 0 at the other corners]

Slide36


But What If We Change Reward …

V(s) = θ0 + θ1·x + θ2·y

Is there a good linear approximation?

No.

[Figure: the same grid with the reward location changed]

Slide37

Slide38


But What If…

V(s) = θ0 + θ1·x + θ2·y + θ3·z

Include new feature z

z= |3-x| + |3-y|

z is dist. to goal location

Does this allow a good linear approximation?

θ0 = 10, θ1 = θ2 = 0, θ3 = -1

[Figure: the grid with reward 10 at the goal location (3,3) and 0 at the corners]

Feature Engineering….

Slide39
Slide40
Slide41


Linear Function Approximation

Define a set of features f1(s), …, fn(s)

The features are used as our representation of states

States with similar feature values will be treated similarly

More complex functions require more complex features

Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well

How can we do this?

Use TD-based RL and somehow update parameters based on each experience.

Slide42


TD-based RL for Linear Approximators

Start with initial parameter values

Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE)

Update estimated model

Perform TD update for each parameter

Goto 2

What is a “TD update” for a parameter?

Slide43

Aside: Gradient Descent

Given a function f(θ1, …, θn) of n real values θ = (θ1, …, θn), suppose we want to minimize f with respect to θ

A common approach to doing this is gradient descent

The gradient of f at point θ, denoted by ∇θ f(θ), is an n-dimensional vector that points in the direction where f increases most steeply at point θ

Vector calculus tells us that ∇θ f(θ) is just the vector of partial derivatives (∂f(θ)/∂θ1, …, ∂f(θ)/∂θn)

We can decrease f by moving in the negative gradient direction: θ ← θ − α ∇θ f(θ)

This will be used again with graphical model learning

Slide44


Aside: Gradient Descent for Squared Error

Suppose that we have a sequence of states and target values for each state

E.g. produced by the TD-based RL loop

Our goal is to minimize the sum of squared errors between our estimated function and each target value:

After seeing the j'th state, the gradient descent rule tells us that we can decrease the squared error of example j by updating the parameters by:

θi ← θi + α (vj − Vθ(sj)) ∂Vθ(sj)/∂θi

where Vθ(sj) is our estimated value for the j'th state, vj is the target value for the j'th state, and α is the learning rate

Slide45


Aside: continued

For a linear approximation function, Vθ(s) = θ0 + θ1 f1(s) + … + θn fn(s), we have ∂Vθ(s)/∂θi = fi(s) (this is the part of the update that depends on the form of the approximator)

Thus the update becomes: θi ← θi + α (vj − Vθ(sj)) fi(sj)

For linear functions this update is guaranteed to converge to the best approximation, for a suitable learning rate schedule

Slide46


TD-based RL for Linear Approximators

Start with initial parameter values

Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE)

Transition from s to s’

Update estimated model

Perform TD update for each parameter

Goto 2

What should we use for “target value” v(s)?

Use the TD prediction based on the next state s': v(s) = R(s) + γ Vθ(s')

this is the same as previous TD method only with approximation

Note that we are generalizing w.r.t. possibly faulty data.. (the neighbor's value may not be correct yet..)
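A minimal sketch of one such TD update (assumed helper names, not the course's code), using the target v(s) = R(s) + γ Vθ(s') and the linear gradient ∂Vθ(s)/∂θi = fi(s):

# Illustrative TD update for a linear approximator V_theta(s) = sum_i theta_i * f_i(s).

def v_approx(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def td_update(theta, feats_s, feats_next, reward, alpha=0.1, gamma=0.9):
    target = reward + gamma * v_approx(theta, feats_next)   # TD prediction from the next state
    error = target - v_approx(theta, feats_s)               # v(s) - V_theta(s)
    return [t + alpha * error * f for t, f in zip(theta, feats_s)]

theta = [0.0, 0.0, 0.0]
theta = td_update(theta, feats_s=[1.0, 2.0, 3.0], feats_next=[1.0, 2.0, 2.0], reward=0.0)
print(theta)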

Slide47

TD-based RL for Linear Approximators

Start with initial parameter values

Take action according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE)

Update estimated model

Perform TD update for each parameter

Goto 2

Step 2 requires a model to select greedy action

For applications such as Backgammon it is easy to get a simulation-based model

For others it is difficult to get a good model

But we can do the same thing for model-free Q-learning

Slide48


Q-learning with Linear Approximators

Start with initial parameter values

Take action a according to an explore/exploit policy (should converge to greedy policy, i.e. GLIE), transitioning from s to s'

Perform TD update for each parameter

Goto 2

For both Q and V, these algorithms converge to the closest linear approximation to the optimal Q or V.

Features are a function of states and actions.
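A minimal sketch of the corresponding Q-learning update (assumed names, toy features, not the course's code): the features now take both the state and the action, and the target uses a max over next actions.

# Illustrative Q-learning update for Q_theta(s,a) = sum_i theta_i * f_i(s,a).

def q_approx(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def q_update(theta, feat_fn, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(q_approx(theta, feat_fn(s_next, a2)) for a2 in actions)
    error = r + gamma * best_next - q_approx(theta, feat_fn(s, a))
    return [t + alpha * error * f for t, f in zip(theta, feat_fn(s, a))]

# Hypothetical features for a toy chain problem: bias, position, and a "moved right" flag
feat_fn = lambda s, a: [1.0, float(s), 1.0 if a == "right" else 0.0]
theta = [0.0, 0.0, 0.0]
theta = q_update(theta, feat_fn, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
print(theta)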

Slide49

Example: Tactical Battles in Wargus

Wargus is a real-time strategy (RTS) game

Tactical battles are a key aspect of the game

RL Task: learn a policy to control n friendly agents in a battle against m enemy agents

Policy should be applicable to tasks with different sets and numbers of agents

5 vs. 5

10 vs. 10

Slide50
Slide51
Slide52
Slide53
Slide54
Slide55
Slide56


Policy Gradient Ascent

Let η(θ) be the expected value of policy πθ

η(θ) is just the expected discounted total reward for a trajectory of πθ

For simplicity, assume each trajectory starts at a single initial state

Our objective is to find a θ that maximizes η(θ)

Policy gradient ascent tells us to iteratively update the parameters via θ ← θ + α ∇θ η(θ)

Problem: η(θ) is generally very complex and it is rare that we can compute a closed form for the gradient of η(θ)

We will instead estimate the gradient based on experience

Slide57


Gradient Estimation

Concern: Computing or estimating the gradient of discontinuous functions can be problematic.

For our example parametric policy, is η(θ) continuous?

No.

There are values of θ where arbitrarily small changes cause the policy to change.

Since different policies can have different values, this means that changing θ can cause a discontinuous jump in η(θ).

Slide58


Example: Discontinuous η(θ)

Consider a problem with initial state s and two actions a1 and a2

a1 leads to a very large terminal reward R1

a2 leads to a very small terminal reward R2

Fixing θ2 to a constant, we can plot the ranking assigned to each action by Q and the corresponding value η(θ)

[Figure: plot of η(θ) against θ1, jumping between R2 and R1]

Discontinuity in η(θ) when the ordering of a1 and a2 changes

Slide59


Probabilistic Policies

We would like to avoid policies that drastically change with small parameter changes, leading to discontinuities

A probabilistic policy πθ takes a state as input and returns a distribution over actions

Given a state s, πθ(s,a) returns the probability that πθ selects action a in s

Note that η(θ) is still well defined for probabilistic policies

Now the uncertainty of trajectories comes from both the environment and the policy

Importantly, if πθ(s,a) is continuous relative to changing θ, then η(θ) is also continuous relative to changing θ

A common form for probabilistic policies is the softmax function or Boltzmann exploration function

Aka mixed policy (not needed for optimality…)
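A minimal sketch of such a softmax (Boltzmann) policy over a linear score θ·f(s,a) (the helper names are assumptions, not from the course); unlike a greedy argmax, the action probabilities vary continuously with θ.

import math

# Illustrative softmax policy: pi_theta(s,a) proportional to exp(theta . f(s,a) / temperature).
def softmax_policy(theta, feat_fn, s, actions, temperature=1.0):
    scores = [sum(t * f for t, f in zip(theta, feat_fn(s, a))) / temperature for a in actions]
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

# Toy two-action example with indicator features
feat_fn = lambda s, a: [1.0 if a == "a1" else 0.0, 1.0 if a == "a2" else 0.0]
print(softmax_policy([2.0, 1.0], feat_fn, s=None, actions=["a1", "a2"]))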

Slide60

Empirical Gradient Estimation

Our first approach to estimating ∇θ η(θ) is to simply compute empirical gradient estimates

Recall that θ = (θ1, …, θn) and ∇θ η(θ) = (∂η(θ)/∂θ1, …, ∂η(θ)/∂θn), so we can compute the gradient by empirically estimating each partial derivative

So for a small ε we can estimate the partial derivatives by ∂η(θ)/∂θi ≈ (η(θ + ε·ei) − η(θ)) / ε, where ei is the i'th unit vector

This requires estimating n+1 values: η(θ) and η(θ + ε·ei) for i = 1, …, n

Slide61


Empirical Gradient Estimation

How do we estimate the quantities η(θ) and η(θ + ε·ei)?

For each set of parameters, simply execute the policy for N trials/episodes and average the values achieved across the trials

This requires a total of N(n+1) episodes to get gradient estimate

For stochastic environments and policies the value of N must be relatively large to get good estimates of the true value

Often we want to use a relatively large number of parameters

Often it is expensive to run episodes of the policy

So while this can work well in many situations, it is often not a practical approach computationally

Better approaches try to use the fact that the stochastic policy is differentiable.

Can get the gradient by just running the current policy multiple times

Doable without permanent damage if there is a simulator
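A minimal sketch of this empirical estimate (the rollout function stands in for a hypothetical simulator call that returns one episode's total reward); note it needs N·(n+1) episodes per gradient estimate.

import random

# Illustrative finite-difference estimate of the gradient of eta(theta).
def estimate_eta(rollout, theta, n_trials=100):
    return sum(rollout(theta) for _ in range(n_trials)) / n_trials

def empirical_gradient(rollout, theta, eps=0.01, n_trials=100):
    base = estimate_eta(rollout, theta, n_trials)              # eta(theta)
    grad = []
    for i in range(len(theta)):                                # n perturbed estimates eta(theta + eps*e_i)
        perturbed = list(theta)
        perturbed[i] += eps
        grad.append((estimate_eta(rollout, perturbed, n_trials) - base) / eps)
    return grad

# Toy stand-in for a noisy simulator
rollout = lambda th: -(th[0] - 1.0) ** 2 - (th[1] + 2.0) ** 2 + random.gauss(0.0, 0.1)
print(empirical_gradient(rollout, [0.0, 0.0]))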

Slide62

Applications of Policy Gradient Search

Policy gradient techniques have been used to create controllers for difficult helicopter maneuvers

For example, inverted helicopter flight.

A planner called FPG also “won” the 2006 International Planning Competition

If you don't count FF-Replan

Slide63
Slide64


Policy Gradient Recap

When policies have much simpler representations than the corresponding value functions, direct search in policy space can be a good idea

Allows us to design complex parametric controllers and optimize details of parameter settings

For baseline algorithm the gradient estimates are unbiased (i.e. they will converge to the right value) but have high variance

Can require a large N to get reliable estimates

OLPOMDP can trade off bias and variance via its discount parameter

[Baxter & Bartlett, 2000]

Can be prone to finding local maxima

Many ways of dealing with this, e.g. random restarts.

Slide65


Gradient Estimation: Single Step Problems

For stochastic policies it is possible to estimate ∇θ η(θ) directly from trajectories of just the current policy πθ

Idea: take advantage of the fact that we know the functional form of the policy

First consider the simplified case where all trials have length 1

For simplicity assume each trajectory starts at a single initial state and the reward only depends on the action choice

η(θ) is then just the expected reward of the action selected by πθ

η(θ) = Σa πθ(s0, a) R(a), where s0 is the initial state and R(a) is the reward of action a

The gradient of this becomes ∇θ η(θ) = Σa ∇θ πθ(s0, a) R(a)

How can we estimate this by just observing the execution of πθ?

Slide66


Rewriting

The gradient is just the expected value of g(s0, a) R(a) over execution trials of πθ

We can estimate it by executing πθ for N trials and averaging the samples: ∇θ η(θ) ≈ (1/N) Σj g(s0, aj) R(aj)

aj is the action selected by the policy on the j'th episode

This only requires executing πθ for a number of trials that need not depend on the number of parameters

We can get a closed form for g(s0, a)
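A minimal sketch of that closed form for a softmax policy with linear features (helper names assumed, not from the course), taking the standard likelihood-ratio form g(s,a) = ∇θ log πθ(s,a) = f(s,a) − Σa' πθ(s,a') f(s,a') and averaging it against the sampled rewards:

import math, random

# Illustrative g(s,a) = grad_theta log pi_theta(s,a) for a softmax policy with linear features.
def softmax_probs(theta, feat_fn, s, actions):
    scores = [sum(t * f for t, f in zip(theta, feat_fn(s, a))) for a in actions]
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, feat_fn, s, a, actions):
    probs = softmax_probs(theta, feat_fn, s, actions)
    expected = [sum(p * feat_fn(s, b)[i] for p, b in zip(probs, actions)) for i in range(len(theta))]
    return [fa - e for fa, e in zip(feat_fn(s, a), expected)]

def single_step_gradient(theta, feat_fn, s0, actions, reward_fn, n_trials=1000):
    grad = [0.0] * len(theta)
    for _ in range(n_trials):
        probs = softmax_probs(theta, feat_fn, s0, actions)
        a = random.choices(actions, weights=probs)[0]          # execute the current policy once
        g = grad_log_pi(theta, feat_fn, s0, a, actions)
        grad = [gi + gl * reward_fn(a) / n_trials for gi, gl in zip(grad, g)]
    return grad

feat_fn = lambda s, a: [1.0 if a == "a1" else 0.0, 1.0 if a == "a2" else 0.0]
print(single_step_gradient([0.0, 0.0], feat_fn, None, ["a1", "a2"], {"a1": 10.0, "a2": 1.0}.get))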

Slide67

Gradient Estimation: General Case

So for the case of length-1 trajectories we got: ∇θ η(θ) ≈ (1/N) Σj g(s0, aj) R(aj)

For the general case, where trajectories have length greater than one and the reward depends on the state, we can do some work and get:

∇θ η(θ) ≈ (1/N) Σj Σt ∇θ log πθ(sjt, ajt) Rj(sjt)

where sjt is the t'th state of the j'th episode, ajt is the t'th action of episode j, Rj(sjt) is the observed total reward in trajectory j from step t to the end, the inner sum runs over the length of trajectory j, and N is the number of trajectories of the current policy

The derivation of this is straightforward but messy.

Slide68


How to interpret gradient expression?

So the overall gradient is a reward weighted combination of individual gradient directions

For large Rj(sjt), the update will increase the probability of ajt in sjt

For negative Rj(sjt), the update will decrease the probability of ajt in sjt

Intuitively this increases the probability of taking actions that are typically followed by good reward sequences

In the gradient expression, ∇θ log πθ(sjt, ajt) is the direction to move the parameters in order to increase the probability that the policy selects ajt in state sjt, and Rj(sjt) is the total reward observed after taking ajt in state sjt

Slide69


Basic Policy Gradient Algorithm

Repeat until stopping condition

Execute πθ for N trajectories while storing the state, action, reward sequences, then update the parameters using the resulting gradient estimate (see the sketch below)
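A minimal sketch of that batch update step (names assumed; grad_log_pi(θ, s, a) computes ∇θ log πθ(s,a), e.g. the softmax form sketched earlier):

# Illustrative batch policy-gradient update; a trajectory is a list of (state, action, reward) triples.
def policy_gradient_step(theta, grad_log_pi, trajectories, alpha=0.01):
    grad = [0.0] * len(theta)
    N = len(trajectories)
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        for t, (s, a, _) in enumerate(traj):
            ret = sum(rewards[t:])                       # total reward from step t to the end
            g = grad_log_pi(theta, s, a)
            grad = [gi + gl * ret / N for gi, gl in zip(grad, g)]
    return [th + alpha * gi for th, gi in zip(theta, grad)]

# Tiny demo with a dummy one-parameter policy gradient
dummy_grad = lambda theta, s, a: [1.0 if a == "right" else -1.0]
trajs = [[("s0", "right", 1.0), ("s1", "right", 1.0)], [("s0", "left", 0.0)]]
print(policy_gradient_step([0.0], dummy_grad, trajs))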

One disadvantage of this approach is the small number of updates per amount of experience

Also requires a notion of trajectory rather than an infinite sequence of experience

Online policy gradient algorithms perform updates after each step in the environment (often learn faster)

Slide70

Online Policy Gradient (OLPOMDP)

Repeat forever

Observe state s

Draw action a according to distribution πθ(s)

Execute a and observe reward r

e ← β e + ∇θ log πθ(s, a)   ;; discounted sum of gradient directions

θ ← θ + α r e

Performs policy update at each time step and executes indefinitely

This is the OLPOMDP algorithm [Baxter & Bartlett, 2000]
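A minimal sketch of the per-step update pattern described above (names assumed; grad_log_pi as in the earlier sketches); this only illustrates the eligibility-trace bookkeeping, not Baxter & Bartlett's implementation.

# Illustrative online update with an eligibility trace e (a discounted sum of gradient directions).
def online_step(theta, e, s, a, r, grad_log_pi, alpha=0.01, beta=0.9):
    g = grad_log_pi(theta, s, a)
    e = [beta * ei + gi for ei, gi in zip(e, g)]                # step 4: update the trace
    theta = [ti + alpha * r * ei for ti, ei in zip(theta, e)]   # step 5: reward-weighted parameter update
    return theta, e

dummy_grad = lambda th, s, a: [1.0 if a == "right" else -1.0]
theta, e = online_step([0.0], [0.0], "s0", "right", 1.0, dummy_grad)
print(theta, e)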

Slide71

Interpretation

Repeat forever

Observe state s

Draw action a according to distribution πθ(s)

Execute a and observe reward r

e ← β e + ∇θ log πθ(s, a)   ;; discounted sum of gradient directions

θ ← θ + α r e

Step 4 computes an “eligibility trace” e, a discounted sum of gradients over previous state-action pairs

It points in the direction of parameter space that increases the probability of taking more recent actions in more recent states

For positive rewards, step 5 will increase the probability of recent actions, and decrease it for negative rewards.

Slide72


Computing the Gradient of Policy

Both algorithms require computation of ∇θ log πθ(s, a)

For the Boltzmann distribution with linear approximation we have:

πθ(s, a) = exp(Σi θi fi(s, a)) / Σa' exp(Σi θi fi(s, a'))

Here the partial derivatives needed for g(s, a) are:

∂ log πθ(s, a) / ∂θi = fi(s, a) − Σa' πθ(s, a') fi(s, a')