
Slide1

Efficient Approaches for Solving Large-scale MDPs

Slides on LRTDP and UCT are courtesy of Mausam / Kolobov

Slide2

Ideas for Efficient Algorithms..

Use heuristic search (and reachability information): LAO*, RTDP
Use execution and/or simulation
  "Actual execution": Reinforcement learning (the main motivation for RL is to "learn" the model)
  "Simulation": simulate the given model to sample possible futures; policy rollout, hindsight optimization, etc.

Use “factored” representations

Factored representations for Actions, Reward Functions, Values and Policies

Directly manipulating factored representations during the Bellman update

Slide3

Real Time Dynamic Programming [Barto et al. 95]

Original motivation: agent acting in the real world
Trial: simulate the greedy policy starting from the start state; perform Bellman backups on visited states; stop when you hit the goal
RTDP: repeat trials forever
Converges in the limit as #trials → ∞

3

We will do the discussion in terms of SSP MDPs --Recall they subsume infinite horizon MDPs

Slides 4-11

Trial

[Figure, animated across slides 4-11: an SSP with start state s0, intermediate states s1…s8, and goal state g; heuristic values h on frontier states are replaced by backed-up values V as the trial proceeds]

start at the start state
repeat
  perform a Bellman backup
  simulate the greedy action
until you hit the goal

Backup all states on the trajectory

RTDP: repeat trials forever

Slide12

Real Time Dynamic Programming [Barto et al. 95]

Original motivation: agent acting in the real world
Trial: simulate the greedy policy starting from the start state; perform Bellman backups on visited states; stop when you hit the goal
RTDP: repeat trials forever
Converges in the limit as #trials → ∞

12

No termination condition!

Slide13

RTDP Family of Algorithms

repeat
  s ← s0
  repeat  // trials
    REVISE s; identify a_greedy
    FIND: pick s' s.t. T(s, a_greedy, s') > 0
    s ← s'
  until s ∈ G
until termination test

13
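The pseudocode above maps almost directly onto code. Below is a minimal Python sketch of the RTDP trial loop over a hypothetical SSP interface (`initial_state`, `actions(s)`, `transitions(s, a)` returning (probability, successor) pairs, `cost(s, a)`, `is_goal(s)`) and an admissible heuristic `h`; it is an illustration, not the original implementation.

```python
import random

def rtdp(ssp, h, num_trials=1000):
    """RTDP trials on an SSP. `ssp` is an assumed interface with
    initial_state, actions(s), transitions(s, a) -> [(prob, s'), ...],
    cost(s, a), and is_goal(s); h is an admissible heuristic."""
    V = {}                                     # values of visited states

    def value(s):                              # unvisited states fall back to the heuristic
        return V.get(s, h(s))

    def q(s, a):                               # Q(s,a) = c(s,a) + sum_s' T(s,a,s') V(s')
        return ssp.cost(s, a) + sum(p * value(t) for p, t in ssp.transitions(s, a))

    def bellman_backup(s):                     # REVISE: V(s) <- min_a Q(s,a); return greedy action
        a_greedy = min(ssp.actions(s), key=lambda a: q(s, a))
        V[s] = q(s, a_greedy)
        return a_greedy

    for _ in range(num_trials):                # "repeat trials forever" (here: a fixed budget)
        s = ssp.initial_state
        while not ssp.is_goal(s):
            a = bellman_backup(s)              # back up the visited state, act greedily
            probs, succs = zip(*ssp.transitions(s, a))
            s = random.choices(succs, weights=probs, k=1)[0]   # simulate the greedy action
    return V
```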

Slide14

Labeling

Admissible heuristic & monotonicity
⇒ V(s) ≤ V*(s)
⇒ Q(s,a) ≤ Q*(s,a)
Label a state s as solved if V(s) has converged

[Figure: a state s whose best action leads toward the goal sg; its other actions have high Q costs]

ResV(s) < ε ⇒ V(s) won't change! Label s as solved

Slide15

Labeling (contd)

15

[Figure: state s, its best action leading to an already-solved successor s', and the goal sg; other actions have high Q costs]

ResV(s) < ε and s' already solved ⇒ V(s) won't change! Label s as solved

Slide16

Labeling (contd)

16

[Figure: as before, but now s and its greedy successor s' are checked together on the way to sg]

ResV(s) < ε and s' already solved ⇒ V(s) won't change! Label s as solved
ResV(s) < ε and ResV(s') < ε ⇒ V(s), V(s') won't change! Label s, s' as solved

Slide17

Labeled RTDP [Bonet&Geffner 03b]

repeat
  s ← s0
  label all goal states as solved
  repeat  // trials
    REVISE s; identify a_greedy
    FIND: sample s' from T(s, a_greedy, s')
    s ← s'
  until s is solved
  for all states s in the trial
    try to label s as solved
until s0 is solved

17
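A sketch of the labeling step, roughly following the CheckSolved procedure of Bonet & Geffner: expand the greedy envelope from s, and if every state in it has residual below ε, label the whole envelope solved; otherwise back the visited states up in reverse order. The `ssp` interface and heuristic `h` are the same hypothetical ones as in the RTDP sketch above.

```python
def q_value(ssp, V, h, s, a):
    """Q(s,a) under the current lower bound V (heuristic h for unseen states)."""
    val = lambda t: V.get(t, h(t))
    return ssp.cost(s, a) + sum(p * val(t) for p, t in ssp.transitions(s, a))

def greedy_action(ssp, V, h, s):
    return min(ssp.actions(s), key=lambda a: q_value(ssp, V, h, s, a))

def residual(ssp, V, h, s):
    """How much a Bellman backup would change V(s)."""
    return abs(V.get(s, h(s)) - q_value(ssp, V, h, s, greedy_action(ssp, V, h, s)))

def check_solved(ssp, V, h, solved, s, eps=1e-3):
    """Try to label s (and its greedy envelope) as solved."""
    rv, open_list, closed = True, [s], []
    while open_list:
        s = open_list.pop()
        closed.append(s)
        if ssp.is_goal(s) or solved.get(s, False):
            continue
        if residual(ssp, V, h, s) > eps:        # value of s can still change
            rv = False
            continue                            # don't expand past an unconverged state
        a = greedy_action(ssp, V, h, s)
        for p, t in ssp.transitions(s, a):      # expand greedy successors
            if p > 0 and t not in closed and t not in open_list:
                open_list.append(t)
    if rv:
        for t in closed:                        # whole greedy envelope has converged
            solved[t] = True
    else:
        for t in reversed(closed):              # otherwise refresh values bottom-up
            if not ssp.is_goal(t):
                V[t] = q_value(ssp, V, h, t, greedy_action(ssp, V, h, t))
    return rv
```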

Slide18

Terminates in finite time, due to the labeling procedure
Anytime: focuses attention on more probable states
Fast convergence: focuses attention on unconverged states

18

LRTDP

Slide19

Picking a Successor Take 2

Labeled RTDP / RTDP: sample s' ∝ T(s, a_greedy, s')
  Adv: more probable states are explored first
Labeling
  Adv: no time wasted on converged states
  Disadv: labeling is a hard constraint
Disadv: sampling ignores the "amount" of convergence
If we knew how much V(s) is expected to change?
  Sample s' ∝ expected change

19

Slide20

Upper Bounds in SSPs

RTDP / LAO* maintain lower bounds: call it Vl
Additionally associate an upper bound with s: Vu(s) ≥ V*(s)
Define gap(s) = Vu(s) – Vl(s)
  Low gap(s): the state is more converged
  High gap(s): more expected change in its value

20

Slide21

Backups on Bounds

Recall monotonicity
Backups on the lower bound continue to be lower bounds
Backups on the upper bound continue to be upper bounds
Intuitively
  Vl will increase to converge to V*
  Vu will decrease to converge to V*

21

Slide22

Bounded RTDP [McMahan et al 05]

repeat
  s ← s0
  repeat  // trials
    identify a_greedy based on Vl
    FIND: sample s' ∝ T(s, a_greedy, s') · gap(s')
    s ← s'
  until gap(s) < ε
  for all states s in the trial, in reverse order
    REVISE s
until gap(s0) < ε

22
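The FIND step is the only part that differs from plain RTDP: successors are weighted by both transition probability and remaining gap. A small sketch, reusing the hypothetical `ssp` interface from the RTDP sketch above, with dictionaries `Vl`, `Vu` for bounds assumed to be initialized elsewhere:

```python
import random

def gap(Vl, Vu, s):
    """Remaining uncertainty at s: upper bound minus lower bound."""
    return Vu[s] - Vl[s]

def brtdp_pick_successor(ssp, Vl, Vu, s, a_greedy):
    """The FIND step of Bounded RTDP: sample s' in proportion to
    T(s, a_greedy, s') * gap(s')."""
    outcomes = [(p, t) for p, t in ssp.transitions(s, a_greedy) if p > 0]
    weights = [p * gap(Vl, Vu, t) for p, t in outcomes]
    if sum(weights) <= 0:                  # every successor has (essentially) converged
        return None                        # caller can end the trial here
    succs = [t for _, t in outcomes]
    return random.choices(succs, weights=weights, k=1)[0]
```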

Slide23

RTDP Trial

[Figure: a Bellman backup at s0. Actions a1, a2, a3 lead to successors with current values Jn; the backup computes Qn+1(s0, a) for each action and Jn+1(s0) as their min, and the greedy action a_greedy = a2 is simulated on the way to the goal]

Slide24

Greedy "On-Policy" RTDP without execution

Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.

Slide25

Comments

Properties
  If all states are visited infinitely often, then Jn → J*
  Only relevant states will be considered
    A state is relevant if the optimal policy could visit it. Notice the emphasis on "optimal policy": just because a rough neighborhood surrounds the National Mall doesn't mean that you will need to know what to do in that neighborhood
Advantages
  Anytime: more probable states explored quickly
Disadvantages
  Complete convergence is slow!
  No termination condition

Do we care about complete convergence? Think Capt. Sullenberger.

Slide26

Slide27

9/26

Slide28

The “Heuristic”

The value function is [equation]. They approximate it by [equation].

Exactly what are they relaxing?
They are assuming that they can make the best outcome of the action happen..
What if we pick the s' corresponding to the highest P?
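The two equations on this slide were images and are missing here. As a hedged reconstruction, consistent with "making the best outcome of the action happen" (the min-min relaxation used in mGPT), they would be:

```latex
% SSP value function (expectation over outcomes):
V(s) = \min_{a}\Big[ c(s,a) + \sum_{s'} T(s,a,s')\, V(s') \Big]
% Relaxed "best outcome" heuristic (replace the expectation with a min):
h_{\min}(s) = \min_{a}\Big[ c(s,a) + \min_{s':\,T(s,a,s')>0} h_{\min}(s') \Big]
```

Picking instead the s' with the highest transition probability, as the slide asks, would give a most-likely-outcome relaxation rather than a best-outcome one.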

Slide29

Monte-Carlo Planning

Consider the Sysadmin problem:

29

[Figure: 2TBN for Sysadmin, relating the server states at time t-1 and time t]
A: Restart(Ser1), Restart(Ser2), Restart(Ser3)
T: P(Ser1^t | Restart^{t-1}(Ser1), Ser1^{t-1}, Ser2^{t-1})
   P(Ser2^t | Restart^{t-1}(Ser2), Ser1^{t-1}, Ser2^{t-1}, Ser3^{t-1})
   P(Ser3^t | Restart^{t-1}(Ser3), Ser2^{t-1}, Ser3^{t-1})
R: ∑_i [Ser_i = ↑]

Slide30

Monte-Carlo Planning: Motivation

Characteristics of Sysadmin:
  FH MDP turned SSP_s0 MDP
  Reaching the goal is trivial, so determinization approaches are not really helpful
  Enormous reachable state space
  High-entropy T (2^|X| outcomes per action, many likely ones)
    Building determinizations can be super-expensive
    Doing Bellman backups can be super-expensive
Try Monte-Carlo planning
  Does not manipulate T or C/R explicitly – no Bellman backups
  Relies on a world simulator – independent of MDP description size

30
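Because Monte-Carlo planning only needs a world simulator, a toy Sysadmin simulator is a few lines. The sketch below uses the line topology implied by the slide's CPTs (Ser1-Ser2-Ser3) but made-up probability values, since the slide gives only the DBN structure:

```python
import random

# Line topology implied by the CPTs above: Ser1 - Ser2 - Ser3.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}

def sysadmin_step(state, action, p_base=0.5, p_neighbor=0.45):
    """One simulated step. state: tuple of booleans (server i up?);
    action: index of the server to restart, or None.
    The probability values are assumptions for illustration only."""
    next_state = []
    for i, up in enumerate(state):
        if action == i:
            next_state.append(True)            # a restarted server comes up
        elif not up:
            next_state.append(False)           # assumption: a down server stays down
        else:
            frac_up = sum(state[j] for j in NEIGHBORS[i]) / len(NEIGHBORS[i])
            next_state.append(random.random() < p_base + p_neighbor * frac_up)
    return tuple(next_state)

def reward(state):
    """R = sum_i [Ser_i is up], as on the Sysadmin slide."""
    return sum(state)

# Example rollout of a trivial policy: always restart the first down server.
s, total = (True, True, True), 0
for t in range(20):
    down = [i for i, up in enumerate(s) if not up]
    s = sysadmin_step(s, down[0] if down else None)
    total += reward(s)
print("20-step return:", total)
```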

Slide31

UCT: A Monte-Carlo Planning Algorithm

UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it
  Similar principle as RTDP
  But action selection, value updates, and guarantees are different
Useful when we have
  Enormous reachable state space
  High-entropy T (2^|X| outcomes per action, many likely ones)
    Building determinizations can be super-expensive
    Doing Bellman backups can be super-expensive
Success stories:
  Go (thought impossible in '05, human grandmaster level at 9x9 in '08)
  Klondike Solitaire (wins 40% of games)
  General Game Playing Competition
  Real-Time Strategy Games
  Probabilistic Planning Competition
  The list is growing…

31

Slide32

Select an arm that probably (w/ high probability) has approximately the best expected reward
Use as few simulator calls (or pulls) as possible

[Figure: a k-armed bandit at state s with arms a1, a2, …, ak and rewards R(s,a1), R(s,a2), …, R(s,ak)]

Background: Multi-Armed Bandit Problem

Slide courtesy of A. Fern

32

Just like an FH MDP with horizon 1!

Slides 33-38

UCT Example (slides courtesy of A. Fern)

[Figure, animated across slides 33-38: Monte-Carlo tree search from the current world state; rollouts end at terminal states with reward 1 or 0, and running value estimates (1, 1/2, …) are kept at the tree nodes]

Build a state-action tree
At a leaf node perform a random rollout (rollout policy); initially the tree is a single leaf
Must select each action at a node at least once
When all node actions have been tried once, select the action according to the tree policy
What is an appropriate tree policy? The rollout policy?

UCT Example

Slide39

Rollout policy: basic UCT uses random
Tree policy:
  Q(s,a): average reward received in current trajectories after taking action a in state s
  n(s,a): number of times action a taken in s
  n(s): number of times state s encountered

Theoretical constant that must be selected empirically in practice. Setting it to the distance to the horizon guarantees arriving at the optimal policy eventually, if R

Slide courtesy of A. Fern

39

UCT Details

Exploration term
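The tree-policy formula itself was a figure on the slide. Basic UCT uses a UCB1-style rule, sketched below with the exploration constant `c` left as the tunable parameter the note refers to; the names `Q`, `n_s`, `n_sa` follow the definitions above, but the code is illustrative rather than the original.

```python
import math

def uct_tree_policy(actions, Q, n_s, n_sa, c=1.0):
    """Pick an action by the UCB1-style rule UCT uses:
    argmax_a  Q(s,a) + c * sqrt( ln n(s) / n(s,a) ).
    Q: dict a -> average reward; n_s: visits to s; n_sa: dict a -> visits to (s,a)."""
    def score(a):
        if n_sa.get(a, 0) == 0:            # untried actions must be selected first
            return float("inf")
        return Q.get(a, 0.0) + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=score)

def update_stats(Q, n_sa, a, ret):
    """Incrementally update the running average Q(s,a) with an observed return."""
    n_sa[a] = n_sa.get(a, 0) + 1
    Q[a] = Q.get(a, 0.0) + (ret - Q.get(a, 0.0)) / n_sa[a]
```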

Slides 40-41

UCT Example, continued (slides courtesy of A. Fern)

[Figure, animated across slides 40-41: the tree policy chooses between actions a1 and a2 at an internal node; node value estimates (1, 1/2, 1/3, 0, …) are updated as more trajectories are added]

When all node actions have been tried once, select the action according to the tree policy

Slide42

To select an action at a state s
  Build a tree using N iterations of Monte-Carlo tree search
    Default policy is uniform random up to level L
    Tree policy is based on the bandit rule
  Select the action that maximizes Q(s,a)
    (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
The more simulations, the more accurate
Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
Possible improvements
  Initialize the state-action pairs with a heuristic (need to pick a weight)
  Think of a better-than-random rollout policy

Slide courtesy of A. Fern

42

UCT Summary & Theoretical Properties

Slide43

LRTDP or UCT?

43

AAAI 2012!

Slide44

Other Advances

Ordering the Bellman backups to maximise information flow.

[Wingate & Seppi’05]

[Dai & Hansen’07]

Partition the state space and combine value iterations from different partitions.

[Wingate & Seppi’05]

[Dai & Goldsmith’07]

External memory version of value iteration

[Edelkamp, Jabbar & Bonet’07]

Slide45

Two Models of Evaluating Probabilistic Planning

IPPC (Probabilistic Planning Competition)
  How often did you reach the goal under the given time constraints?
  FF-HOP, FF-Replan

Evaluate on the quality of the policy
  Converging to the optimal policy faster
  LRTDP, mGPT, Kolobov's approach

Slide46

Online Action Selection

Off-line policy generation
  First compute the whole policy
    Get the initial state
    Compute the optimal policy given the initial state and the goals
  Then just execute the policy
    Loop: do the action recommended by the policy; get the next state; until reaching a goal state
  Pros: can anticipate all problems
  Cons: may take too much time to start executing

Online action selection
  Loop: compute the best action for the current state; execute it; get the new state
  Pros: provides a fast first response
  Cons: may paint itself into a corner..

[Figure: timelines contrasting the two approaches: off-line, one long policy computation followed by execution; online, interleaved select/execute steps]

Slide47

FF-Replan

Simple replanner
  Determinizes the probabilistic problem
    If an action has multiple effect sets with different probabilities, either
      select the most likely one, or
      split the action into multiple actions, one for each set
  Solves for a plan in the determinized problem

[Figure: a determinized plan a1, a2, a3, a4 from S to G; when execution lands in an unexpected state, the planner replans (e.g. a2, a3, a4, or a5) to reach G]
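A sketch of the two determinizations, over an assumed action format in which each probabilistic action is a list of (probability, effect) outcomes; the output is a set of deterministic actions that a classical planner such as FF could consume. The dictionary format and names are illustrative.

```python
def most_likely_determinization(actions):
    """Keep only the single most probable outcome of each action."""
    det = []
    for name, outcomes in actions.items():          # outcomes: [(prob, effect), ...]
        prob, effect = max(outcomes, key=lambda pe: pe[0])
        det.append((name, effect))
    return det

def all_outcomes_determinization(actions):
    """Split every action into one deterministic action per outcome."""
    det = []
    for name, outcomes in actions.items():
        for i, (prob, effect) in enumerate(outcomes, start=1):
            det.append((f"{name}-{i}", effect))     # e.g. A1 -> A1-1, A1-2
    return det

# Example: two actions, each with two probabilistic effect sets.
actions = {"A1": [(0.7, {"x": True}), (0.3, {"y": True})],
           "A2": [(0.6, {"y": True}), (0.4, {"dead_end": True})]}
print(most_likely_determinization(actions))
print(all_outcomes_determinization(actions))
```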

Slide48

All Outcome Replanning (FFRA)

[Figure: an action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into two deterministic actions: Action1 producing Effect 1 and Action2 producing Effect 2]

ICAPS-07

48

Slide49

1st IPPC & Post-Mortem..

IPPC Competitors

Most IPPC competitors used different approaches for offline policy generation.
One group implemented a simple online "replanning" approach in addition to offline policy generation
  Determinize the probabilistic problem (most-likely vs. all-outcomes)
  Loop: get the state S; call a classical planner (e.g. FF) with [S,G] as the problem; execute the first action of the plan
Umpteen reasons why such an approach should do quite badly..

Results and Post-mortem

To everyone's surprise, the replanning approach wound up winning the competition.
Lots of hand-wringing ensued..
  Maybe we should require that the planners really, really use probabilities?
  Maybe the domains should somehow be made "probabilistically interesting"?
Current understanding:
  No reason to believe that off-line policy computation must dominate online action selection
  The "replanning" approach is just a degenerate case of hindsight optimization

Slide50

Reducing calls to FF..

We can reduce calls to FF by memoizing successes
If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0—a0—s1—a1—s2—a2—s3…an—sG
Then, in addition to sending a0 to the simulator, we can memoize {si—ai} as the partial policy.
Whenever a new state is given by the simulator, we can see if it is already in the partial policy
Additionally, FF-replan can consider every state in the partial policy table as a goal state (in that if it reaches them, it knows how to get to the goal state..)
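A sketch of this memoization, assuming a hypothetical `plan_with_ff(state, goal)` wrapper around the classical planner that returns an alternating state/action list; every (si, ai) pair along a returned plan is cached so that later states already covered by the partial policy skip the FF call.

```python
def memoize_plan(partial_policy, plan):
    """plan is an alternating list [s0, a0, s1, a1, ..., sG];
    cache si -> ai for every state along it."""
    for i in range(0, len(plan) - 1, 2):
        partial_policy[plan[i]] = plan[i + 1]

def choose_action(partial_policy, state, goal, plan_with_ff):
    """Reuse the memoized partial policy when possible; replan otherwise.
    plan_with_ff(state, goal) is an assumed wrapper around a planner such as FF,
    run on the determinized problem."""
    if state not in partial_policy:
        plan = plan_with_ff(state, goal)     # one FF call, then cache the whole plan
        memoize_plan(partial_policy, plan)
    return partial_policy[state]
```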

Slide51

Hindsight Optimization for Anticipatory Planning/Scheduling

Consider a deterministic planning (scheduling) domain, where the goals arrive probabilistically
Using up resources and/or doing greedy actions may preclude you from exploiting later opportunities
How do you select actions to perform?
Answer: If you have a distribution of the goal arrivals, then
  Sample goals up to a certain horizon using this distribution
  Now we have a deterministic planning problem with known goals
  Solve it; do the first action from it
  Can improve accuracy with multiple samples
FF-HOP uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (it is the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes).

Slide52

Probabilistic Planning (goal-oriented)

[Figure: a two-step probabilistic planning tree from initial state I; at each state, actions A1 and A2 each have two probabilistic outcomes (left outcomes are more likely); some leaves are goal states and one branch is a dead end. Objective: maximize goal achievement]

Slide53

Probabilistic Planning: All Outcome Determinization

[Figure: the same two-step tree after all-outcome determinization; each probabilistic action A1, A2 is split into deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome. Objective: find the goal]

Slide54

Probabilistic Planning: All Outcome Determinization

[Figure: the same determinized tree as in Slide53]

Slide55

Problems of FF-Replan and better alternative sampling

55

FF-Replan's static determinizations don't respect probabilities.
We need "Probabilistic and Dynamic Determinization"

Sample Future Outcomes and Determinization in Hindsight
Each Future Sample Becomes a Known-Future Deterministic Problem

Slide56

Solving stochastic planning problems via determinizations

Quite an old idea (e.g. envelope extension methods)
What is new is that there is increasing realization that determinizing approaches provide state-of-the-art performance
  Even for probabilistically interesting domains
Should be a happy occasion..

Slide57

Hindsight Optimization (Online Computation of V_HS)

H-horizon future F_H for M = [S, A, T, R]
  Mapping of state, action and time (h < H) to a state: S × A × h → S
Value of a policy π for F_H: R(s, F_H, π)
Pick the action a with highest Q(s, a, H), where
  Q(s, a, H) = R(s) + V*(s, H-1)
  V*(s, H) = max_π E_{F_H} [ R(s, F_H, π) ]
Compare this with the real value
  V_HS(s, H) = E_{F_H} [ max_π R(s, F_H, π) ]
  V_FFRa(s) = max_F V(s, F) ≥ V_HS(s, H) ≥ V*(s, H)
  Q(s, a, H) = R(a) + E_{F_{H-1}} [ max_π R(a(s), F_{H-1}, π) ]
In our proposal, the computation of max_π R(s, F_{H-1}, π) is approximately done by FF [Hoffmann and Nebel '01]

57

Done by FF
Each Future is a Deterministic Problem
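A sketch of the hindsight estimate of Q(s, a, H): sample a number of futures, solve the resulting deterministic H-horizon problem in each (the `solve_deterministic` argument stands in for the FF call), and average. All function arguments here are assumed interfaces, not the FF-HOP code.

```python
def hop_q_estimate(s, a, H, sample_future, apply_action, solve_deterministic,
                   reward, num_samples=30):
    """Approximate Q(s, a, H) = R(s, a) + E_F[ max_pi R(a(s), F_{H-1}, pi) ]
    by sampling futures F and solving each resulting deterministic problem.
    sample_future(h) fixes the outcome of every action for h steps;
    solve_deterministic(state, future) returns the best plan value under that future."""
    total = 0.0
    for _ in range(num_samples):
        future = sample_future(H - 1)             # one sampled "known future"
        s_next = apply_action(s, a, future)       # outcome of a under this future
        total += solve_deterministic(s_next, future)
    return reward(s, a) + total / num_samples

def hop_choose_action(s, actions, H, **kwargs):
    """Pick the action with the highest hindsight Q-estimate."""
    return max(actions, key=lambda a: hop_q_estimate(s, a, H, **kwargs))
```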

Slide58

Implementation FF-Hindsight

Constructs a set of futures

Solves the planning problem using the H-horizon futures using FF

Sums the rewards of each of the plans

Chooses action with largest Qhs value

Slide59

Hindsight Optimization (Online Computation of V_HS)

Pick the action a with highest Q(s, a, H), where
  Q(s, a, H) = R(s, a) + Σ_{s'} T(s, a, s') V*(s', H-1)
Compute V* by sampling
  H-horizon future F_H for M = [S, A, T, R]
    Mapping of state, action and time (h < H) to a state: S × A × h → S
  Common-random-number (correlated) vs. independent futures..
  Time-independent vs. time-dependent futures
Value of a policy π for F_H: R(s, F_H, π)
  V*(s, H) = max_π E_{F_H} [ R(s, F_H, π) ]
But this is still too hard to compute..
Let's swap max and expectation
  V_HS(s, H) = E_{F_H} [ max_π R(s, F_H, π) ]
  max_π R(s, F_{H-1}, π) is approximated by an FF plan

V_HS overestimates V*
  Why? Intuitively, because V_HS can assume that it can use different policies in different futures, while V* needs to pick one policy that works best (in expectation) in all futures.
  But then, V_FFRa overestimates V_HS
  Viewed in terms of J*, V_HS is a more informed admissible heuristic..

59

Slide60

Probabilistic Planning(goal-oriented)

Action

Probabilistic

Outcome

Time 1

Time 2

Goal State

60

Action

State

Maximize Goal Achievement

Dead End

Left Outcomes are more likely

A1

A2

A1

A2

A1

A2

A1

A2

A1

A2

I

Slide61

Improvement Ideas

Reuse

Generated futures that are still relevant

Scoring for action branches at each step

If expected outcomes occur, keep the plan

Future generation

Not just probabilistic

Somewhat even distribution of the space

Adaptation

Dynamic width and horizon for sampling

Actively detect and avoid unrecoverable failures on top of sampling

Slide62

Hindsight Sample 1

[Figure: one sampled future of the two-step tree from initial state I; under this future the plan through A1 reaches the goal while A2 does not (A1: 1, A2: 0); left outcomes are more likely. Objective: maximize goal achievement]

Slide63

Exploiting Determinism

Find the longest prefix for all plans

Apply the actions in the prefix continuously until one is not applicable

Resume ZSL/OSL steps

Slide64

Exploiting Determinism

[Figure: several plans generated for the chosen action a*, each going from S1 to G]
The longest prefix for each plan is identified and executed without running ZSL, OSL or FF!

Slide65

Handling unlikely outcomes: All-outcome Determinization

Assign each possible outcome an action

Solve for a plan

Combine the plan with the plans from the HOP solutions

Slide66

Deterministic Techniques for Stochastic Planning

No longer the Rodney Dangerfield of Stochastic Planning?

Slide67

Determinizations

Most-likely outcome determinization
  Inadmissible
  e.g. if the only path to the goal relies on a less likely outcome of an action
All-outcomes determinization
  Admissible, but not very informed
  e.g. a very unlikely action leads you straight to the goal

Slide68

Problems with transition systems

Transition systems are a great conceptual tool to understand the differences between the various planning problems

…However direct manipulation of transition systems tends to be too cumbersome

The size of the explicit graph corresponding to a transition system is often very large

The remedy is to provide “compact” representations for transition systems

Start by explicating the structure of the “states”

e.g. states specified in terms of state variables

Represent actions not as incidence matrices but rather functions specified directly in terms of the state variables

An action will work in any state where some state variables have certain values. When it works, it will change the values of certain (other) state variables

Slide69

State Variable Models

World is made up of states which are defined in terms of state variables

Can be boolean (or multi-ary or continuous)

States are

complete assignments over state variables

So, k boolean state variables can represent how many states?

Actions change the values of the state variables

Applicability conditions of actions are also specified in terms of partial assignments over state variables

Slide70

Blocks world

State variables: Ontable(x) On(x,y) Clear(x) hand-empty holding(x)

Stack(x,y)
  Prec: holding(x), clear(y)
  Eff: on(x,y), ~cl(y), ~holding(x), hand-empty

Unstack(x,y)
  Prec: on(x,y), hand-empty, cl(x)
  Eff: holding(x), ~clear(x), clear(y), ~hand-empty

Pickup(x)
  Prec: hand-empty, clear(x), ontable(x)
  Eff: holding(x), ~ontable(x), ~hand-empty, ~clear(x)

Putdown(x)
  Prec: holding(x)
  Eff: ontable(x), hand-empty, clear(x), ~holding(x)

Initial state: Complete specification of T/F values to state variables --By convention, variables with F values are omitted

Goal state: A partial specification of the desired state variable/value combinations --desired values can be both positive and negative

Init: Ontable(A), Ontable(B), Clear(A), Clear(B), hand-empty
Goal: ~clear(B), hand-empty

All the actions here have only positive preconditions; but this is not necessary

STRIPS ASSUMPTION: If an action changes a state variable, this must be explicitly mentioned in its effects
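A small Python sketch of this state-variable view: a state is the set of true literals, an operator carries a precondition set, an add list, and a delete list, and applying an operator follows the STRIPS assumption (only mentioned variables change). The encoding below grounds the slide's operators for blocks A and B and runs a two-step plan that reaches the goal ~clear(B), hand-empty.

```python
from collections import namedtuple

Operator = namedtuple("Operator", "name prec add delete")

def pickup(x):
    return Operator(f"Pickup({x})",
                    prec={"hand-empty", f"clear({x})", f"ontable({x})"},
                    add={f"holding({x})"},
                    delete={f"ontable({x})", "hand-empty", f"clear({x})"})

def stack(x, y):
    return Operator(f"Stack({x},{y})",
                    prec={f"holding({x})", f"clear({y})"},
                    add={f"on({x},{y})", "hand-empty"},
                    delete={f"clear({y})", f"holding({x})"})

def applicable(state, op):
    return op.prec <= state                # preconditions are a partial assignment

def apply_op(state, op):
    # STRIPS assumption: everything not in add/delete keeps its value.
    return (state - op.delete) | op.add

# Init from the slide (only true variables are listed).
state = {"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"}
for op in [pickup("A"), stack("A", "B")]:  # a plan achieving ~clear(B), hand-empty
    assert applicable(state, op)
    state = apply_op(state, op)
print("clear(B)" in state, "hand-empty" in state)   # False True -> goal satisfied
```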

Slide71

Why is the STRIPS representation compact? (compared to explicit transition systems)

In explicit transition systems, actions are represented as state-to-state transitions, where each action is represented by an incidence matrix of size |S|x|S|
In the state-variable model, actions are represented only in terms of the state variables whose values they care about and whose values they affect.
Consider a state space of 1024 states. It can be represented by log2(1024) = 10 state variables. If an action needs variable v1 to be true and makes v7 false, it can be represented by just 2 bits (instead of a 1024x1024 matrix)
  Of course, if the action has a complicated mapping from states to states, in the worst case the action representation will be just as large
  The assumption being made here is that actions will have effects on a small number of state variables.

[Representation spectrum: First-order (Sit. Calc), Relational/Propositional (STRIPS rep), Atomic (Transition rep)]

Slide72

Factored Representations of MDPs: Actions

Actions can be represented directly in terms of their effects on the individual state variables (fluents).
The CPTs of the BNs can be represented compactly too!
Write a Bayes network relating the values of fluents at the state before and after the action
  Bayes networks representing fluents at different time points are called "Dynamic Bayes Networks"
  We look at 2TBNs (2-time-slice dynamic Bayes nets)
Go further by using the STRIPS assumption
  Fluents not affected by the action are not represented explicitly in the model
  Called the Probabilistic STRIPS Operator (PSO) model

Slide73

Action CLK

Slide74

Slide75

Slide76

Factored Representations: Reward, Value and Policy Functions

Reward functions can be represented in factored form too. Possible representations include

Decision trees (made up of fluents)

ADDs (Algebraic decision diagrams)

Value functions are like reward functions (so they too can be represented similarly)

Bellman update can then be done directly using factored representations..

Slide77

Slide78

SPUDD's use of ADDs

Slide79

Direct manipulation of ADDs in SPUDD

Slide80

Ideas for Efficient Algorithms..

Use heuristic search (and reachability information): LAO*, RTDP
Use execution and/or simulation
  "Actual execution": Reinforcement learning (the main motivation for RL is to "learn" the model)
  "Simulation": simulate the given model to sample possible futures; policy rollout, hindsight optimization, etc.

Use “factored” representations

Factored representations for Actions, Reward Functions, Values and Policies

Directly manipulating factored representations during the Bellman update

Slide81

Probabilistic Planning

--The competition (IPPC)

--The Action language.. (PPDDL)

Slide82

Slide83

Not ergodic

Slide84

Slide85

Slide86

Reducing Heuristic Computation Cost by exploiting factored representations

The heuristics computed for a state might give us an idea about the heuristic value of other "similar" states
  Similarity is possible to determine in terms of the state structure
Exploit the overlapping structure of heuristics for different states
  E.g. the SAG idea for McLUG
  E.g. the triangle tables idea for plans (c.f. Kolobov)

Slide87

A Plan is a Terrible Thing to Waste

Suppose we have a plan s0—a0—s1—a1—s2—a2—s3…an—sG
We realized that this tells us not just the estimated value of s0, but also of s1, s2, … sn
  So we don't need to compute the heuristic for them again
Is that all?
  If we have states and actions in factored representation, then we can explain exactly what aspects of si are relevant for the plan's success.
  The "explanation" is a proof of correctness of the plan
    Can be based on regression (if the plan is a sequence) or a causal proof (if the plan is partially ordered)
  The explanation will typically be just a subset of the literals making up the state
    That means the plan suffix from si may actually be relevant in many more states that are consistent with that explanation

Slide88

Triangle Table Memoization

Use triangle tables / memoization

[Figure: a blocks-world problem, rearranging the stack C/B/A into A/B/C]
If the above problem is solved, then we don't need to call FF again for the one below:
[Figure: the sub-problem of rearranging B/A into A/B]

Slide89

Explanation-based Generalization (of Successes and Failures)

Suppose we have a plan P that solves a problem [S, G].
We can first find out what aspects of S this plan actually depends on
  Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
Now you can memoize this plan for just that subset of S

Slide90

Relaxations for Stochastic Planning

Determinizations can also be used as a basis for heuristics to initialize the V for value iteration [mGPT; GOTH etc.]
Heuristics come from relaxation
We can relax along two separate dimensions:
  Relax -ve interactions
    Consider +ve interactions alone, using relaxed planning graphs
  Relax uncertainty
    Consider determinizations
  Or a combination of both!

Slide91

Solving Determinizations

If we relax -ve interactions
  Then compute a relaxed plan
    Admissible if the optimal relaxed plan is computed
    Inadmissible otherwise
If we keep -ve interactions
  Then use a deterministic planner (e.g. FF/LPG)
    Inadmissible unless the underlying planner is optimal

Slide92

Dimensions of Relaxation

[Figure: two relaxation dimensions, Uncertainty and Negative Interactions, with increasing consideration along each axis; points 1: Relaxed Plan Heuristic, 2: McLUG, 3: FF/LPG, 4: limited-width stochastic planning?]

Reducing Uncertainty: bound the number of stochastic outcomes → stochastic "width"

Slide93

Dimensions of Relaxation

Dimensions: uncertainty considered (None / Some / Full) vs. -ve interactions considered (None / Some / Full)
  Relaxed Plan: neither
  McLUG: uncertainty, but not -ve interactions
  FF/LPG: -ve interactions, but not uncertainty
  Limited width Stoch Planning: some uncertainty and -ve interactions

Slide94

Expressiveness v. Cost

[Figure: heuristics arranged by expressiveness vs. cost (h = 0, McLUG, FF-Replan, FF, limited-width stochastic planning), trading off nodes expanded against heuristic computation cost (e.g. FFR vs. FF)]

Slide95