Slide 1: Efficient Approaches for Solving Large-scale MDPs
Slides on LRTDP and UCT are courtesy of Mausam / Kolobov.
Slide 2: Ideas for Efficient Algorithms
- Use heuristic search (and reachability information): LAO*, RTDP
- Use execution and/or simulation
  - "Actual execution": reinforcement learning (the main motivation for RL is to "learn" the model)
  - "Simulation": simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
- Use "factored" representations
  - Factored representations for actions, reward functions, values, and policies
  - Directly manipulating factored representations during the Bellman update
Slide 3: Real Time Dynamic Programming [Barto et al. '95]
- Original motivation: an agent acting in the real world
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states; stop when you hit the goal
- RTDP: repeat trials forever
- Converges in the limit as #trials → ∞
We will conduct the discussion in terms of SSP MDPs -- recall that they subsume infinite-horizon MDPs.
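For concreteness, here is a minimal Python sketch of the Bellman backup these algorithms perform at each visited state. The interfaces `actions(s)`, `T(s, a)` (returning `(successor, probability)` pairs), and `cost(s, a)` are illustrative assumptions, not from the slides; SSP MDPs minimize expected cost-to-goal, hence the `min`.

```python
def bellman_backup(s, V, actions, T, cost):
    """Update V(s) in place for an SSP MDP (minimizing expected cost-to-goal).

    actions(s) -> iterable of actions applicable in s
    T(s, a)    -> list of (s_next, prob) pairs
    cost(s, a) -> immediate cost of taking a in s
    """
    q = {a: cost(s, a) + sum(p * V[s2] for s2, p in T(s, a))
         for a in actions(s)}
    a_greedy = min(q, key=q.get)   # SSP MDPs minimize expected cost
    V[s] = q[a_greedy]
    return a_greedy
```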
Slide 4: Trial
[Figure: an SSP with start state s0, intermediate states s1-s8, and goal state sg.]
Slide 5: Trial
[Figure: the same SSP; frontier states carry heuristic values h, backed-up states carry values V.]
Start at the start state. Repeat: perform a Bellman backup; simulate the greedy action.
Slides 6-10: Trial (contd.)
[Animation frames: starting at the start state, repeatedly perform a Bellman backup and simulate the greedy action, until you hit the goal. Along the visited trajectory, heuristic values h are replaced by backed-up values V.]
Slide 11: Trial (contd.)
[Final frame: the trial has reached the goal.]
Backup all states on the trajectory. RTDP: repeat trials forever.
Slide 12: Real Time Dynamic Programming [Barto et al. '95]
- Original motivation: an agent acting in the real world
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states; stop when you hit the goal
- RTDP: repeat trials forever
- Converges in the limit as #trials → ∞
No termination condition!
Slide 13: RTDP Family of Algorithms
    repeat
        s ← s0
        repeat // trials
            REVISE s; identify a_greedy
            FIND: pick s' s.t. T(s, a_greedy, s') > 0
            s ← s'
        until s ∈ G
    until termination test
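A minimal sketch of this loop in Python, reusing the hypothetical `bellman_backup` and interfaces from the earlier sketch; `goal_states` and the trial count are illustrative:

```python
import random

def rtdp(s0, V, actions, T, cost, goal_states, n_trials=1000):
    """Plain RTDP: repeated trials of greedy simulation plus Bellman backups."""
    for _ in range(n_trials):            # no principled termination test
        s = s0
        while s not in goal_states:
            a_greedy = bellman_backup(s, V, actions, T, cost)  # REVISE s
            # FIND: sample s' with T(s, a_greedy, s') > 0
            succs, probs = zip(*T(s, a_greedy))
            s = random.choices(succs, weights=probs)[0]
    return V
```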
Slide 14: Labeling
- Admissible heuristic & monotonicity ⇒ V(s) ≤ V*(s) and Q(s,a) ≤ Q*(s,a)
- Label a state s as solved if V(s) has converged
[Figure: state s whose best action leads toward the goal sg; its other actions have high Q-costs.]
Res_V(s) < ε ⇒ V(s) won't change! Label s as solved.
Slide 15: Labeling (contd.)
[Figure: state s whose best action leads to an already-solved state s' on the way to sg; its other actions have high Q-costs.]
Res_V(s) < ε and s' already solved ⇒ V(s) won't change! Label s as solved.
Slide 16: Labeling (contd.)
[Figure: states s and s' in sequence along the best actions toward sg; other actions have high Q-costs.]
Res_V(s) < ε and Res_V(s') < ε ⇒ V(s) and V(s') won't change! Label s and s' as solved.
Slide 17: Labeled RTDP [Bonet & Geffner '03b]
    repeat
        s ← s0
        label all goal states as solved
        repeat // trials
            REVISE s; identify a_greedy
            FIND: sample s' from T(s, a_greedy, s')
            s ← s'
        until s is solved
        for all states s in the trial
            try to label s as solved
    until s0 is solved
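A simplified sketch of the labeling step, under the same hypothetical interfaces as before. Note this is weaker than the actual CheckSolved procedure of Bonet & Geffner, which searches the whole greedy envelope below s; here we only check the residual and the immediate greedy successors:

```python
def residual(s, V, actions, T, cost):
    """Bellman residual Res_V(s): how much a backup would change V(s)."""
    best_q = min(cost(s, a) + sum(p * V[s2] for s2, p in T(s, a))
                 for a in actions(s))
    return abs(best_q - V[s])

def try_label_solved(s, V, solved, actions, T, cost, eps=1e-4):
    """Label s as solved if its residual is small and all greedy
    successors are already solved (a simplification of CheckSolved)."""
    if residual(s, V, actions, T, cost) >= eps:
        return False
    q = {a: cost(s, a) + sum(p * V[s2] for s2, p in T(s, a))
         for a in actions(s)}
    a_greedy = min(q, key=q.get)
    if all(s2 in solved for s2, _ in T(s, a_greedy)):
        solved.add(s)
        return True
    return False
```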
Slide 18: LRTDP
- Terminates in finite time, due to the labeling procedure
- Anytime: focuses attention on more probable states
- Fast convergence: focuses attention on unconverged states
Slide 19: Picking a Successor (Take 2)
- Labeled RTDP / RTDP: sample s' ∝ T(s, a_greedy, s'). Advantage: more probable states are explored first.
- Labeling advantage: no time is wasted on converged states. Disadvantages: labeling is a hard constraint, and sampling ignores the "amount" of convergence.
- What if we knew how much V(s) is expected to change? Sample s' ∝ expected change.
Slide 20: Upper Bounds in SSPs
- RTDP/LAO* maintain lower bounds; call it Vl
- Additionally associate an upper bound with s: Vu(s) ≥ V*(s)
- Define gap(s) = Vu(s) − Vl(s)
  - Low gap(s): the state is more converged
  - High gap(s): more expected change in its value
Slide 21: Backups on Bounds
- Recall monotonicity: backups on the lower bound continue to be lower bounds, and backups on the upper bound continue to be upper bounds
- Intuitively, Vl will increase to converge to V*, and Vu will decrease to converge to V*
Slide 22: Bounded RTDP [McMahan et al. '05]
    repeat
        s ← s0
        repeat // trials
            identify a_greedy based on Vl
            FIND: sample s' ∝ T(s, a_greedy, s') · gap(s')
            s ← s'
        until gap(s) < ε
        for all states s in the trial, in reverse order
            REVISE s
    until gap(s0) < ε
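A sketch of one Bounded RTDP trial under the same hypothetical interfaces; `Vl` and `Vu` are lower- and upper-bound value tables, and successors are sampled in proportion to T times gap, as in the pseudocode above:

```python
import random

def brtdp_trial(s0, Vl, Vu, actions, T, cost, goal_states, eps=1e-3):
    """One BRTDP trial: act greedily on the lower bound, sample
    successors proportional to T * gap, then REVISE the visited
    states in reverse order (backing up both bounds)."""
    gap = lambda s: Vu[s] - Vl[s]
    trajectory, s = [], s0
    while gap(s) >= eps and s not in goal_states:
        trajectory.append(s)
        q = {a: cost(s, a) + sum(p * Vl[s2] for s2, p in T(s, a))
             for a in actions(s)}
        a_greedy = min(q, key=q.get)
        succs = T(s, a_greedy)
        weights = [p * gap(s2) for s2, p in succs]
        if sum(weights) == 0:
            break                          # all successors converged
        s = random.choices([s2 for s2, _ in succs], weights=weights)[0]
    for s in reversed(trajectory):         # REVISE in reverse order
        for V in (Vl, Vu):
            V[s] = min(cost(s, a) + sum(p * V[s2] for s2, p in T(s, a))
                       for a in actions(s))
```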
Slide 23: RTDP Trial
[Figure: a backup at s0 over actions a1, a2, a3 computes Q_{n+1}(s0, a) from the successor values J_n, sets J_{n+1}(s0) to the minimum, picks a_greedy = a2, and the trial continues toward the goal.]
Slide 24: Greedy "On-Policy" RTDP (without execution)
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back, until the values stabilize.
Slide 25: Comments
- Properties: if all states are visited infinitely often, then Jn → J*
- Only relevant states will be considered
  - A state is relevant if the optimal policy could visit it. Notice the emphasis on "optimal policy": just because a rough neighborhood surrounds the National Mall doesn't mean that you will need to know what to do in that neighborhood.
- Advantage: anytime; more probable states are explored quickly
- Disadvantages: complete convergence is slow! No termination condition.
Do we care about complete convergence? Think Capt. Sullenberger.
Slides 26-27: [figures]
Slide 28: The "Heuristic"
The value function is V(s) = min_a [ c(s,a) + Σ_s' T(s,a,s') V(s') ]. They approximate it by V(s) ≈ min_a [ c(s,a) + min_{s' : T(s,a,s') > 0} V(s') ].
Exactly what are they relaxing? They are assuming that they can make the best outcome of the action happen.
What if we pick the s' corresponding to the highest P?
Slide 29: Monte-Carlo Planning
Consider the Sysadmin problem:
[Figure: a 2-time-slice DBN over servers Ser1, Ser2, Ser3 (Time t-1 → Time t). A: Restart(Ser1), Restart(Ser2), Restart(Ser3). T: P(Ser1_t | Restart_{t-1}(Ser1), Ser1_{t-1}, Ser2_{t-1}), P(Ser2_t | Restart_{t-1}(Ser2), Ser1_{t-1}, Ser2_{t-1}, Ser3_{t-1}), P(Ser3_t | Restart_{t-1}(Ser3), Ser2_{t-1}, Ser3_{t-1}). R: Σ_i [Ser_i = ↑].]
Slide 30: Monte-Carlo Planning: Motivation
- Characteristics of Sysadmin:
  - An FH MDP turned into an SSP_s0 MDP
  - Reaching the goal is trivial, so determinization approaches are not really helpful
  - Enormous reachable state space
  - High-entropy T (2^|X| outcomes per action, many likely ones); building determinizations can be super-expensive, and so can doing Bellman backups
- Try Monte-Carlo planning
  - Does not manipulate T or C/R explicitly: no Bellman backups
  - Relies only on a world simulator, independent of the MDP description size (a toy simulator sketch follows)
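To illustrate the simulator-only interface that Monte-Carlo planning needs, here is a toy Sysadmin step function. The neighbor structure loosely matches the CPTs sketched above, but all probability values are invented for illustration:

```python
import random

def sysadmin_step(up, restart_idx, p_stay=0.9, p_degraded=0.5, p_reboot=0.05):
    """Toy Sysadmin simulator: servers in a line; a running server stays
    up with a probability that degrades when a neighbor is down.
    All probabilities here are illustrative, not from the slides."""
    n = len(up)
    nxt = []
    for i in range(n):
        if i == restart_idx:
            nxt.append(True)                      # restarted server comes up
            continue
        neighbors_up = all(up[j] for j in (i - 1, i + 1) if 0 <= j < n)
        if up[i]:
            nxt.append(random.random() < (p_stay if neighbors_up else p_degraded))
        else:
            nxt.append(random.random() < p_reboot)
    reward = sum(nxt)                             # R = sum_i [Ser_i = up]
    return nxt, reward
```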
Slide 31: UCT: A Monte-Carlo Planning Algorithm
- UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it
  - Similar principle as RTDP, but action selection, value updates, and guarantees are different
- Useful when we have:
  - Enormous reachable state space
  - High-entropy T (2^|X| outcomes per action, many likely ones); building determinizations and doing Bellman backups can be super-expensive
- Success stories:
  - Go (thought impossible in '05, human grandmaster level at 9x9 in '08)
  - Klondike Solitaire (wins 40% of games)
  - General Game Playing Competition
  - Real-Time Strategy Games
  - Probabilistic Planning Competition
  - The list is growing...
Slide 32: Background: Multi-Armed Bandit Problem
- Select an arm that probably (with high probability) has approximately the best expected reward
- Use as few simulator calls (or pulls) as possible (a UCB1 sketch follows)
[Figure: state s with arms a1, a2, ..., ak and reward distributions R(s,a1), R(s,a2), ..., R(s,ak).]
Just like an FH MDP with horizon 1!
Slide courtesy of A. Fern
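A compact sketch of UCB1, the classic bandit rule that UCT lifts to trees; `pull(a)` is an assumed simulator call returning a sampled reward for arm a:

```python
import math

def ucb1(pull, k, n_pulls=1000, c=math.sqrt(2)):
    """UCB1 for a k-armed bandit: pull each arm once, then pick the arm
    maximizing average reward plus an exploration bonus."""
    counts, sums = [0] * k, [0.0] * k
    for a in range(k):                      # pull every arm once first
        sums[a] += pull(a)
        counts[a] = 1
    for t in range(k, n_pulls):
        ucb = [sums[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
               for a in range(k)]
        a = max(range(k), key=lambda i: ucb[i])
        sums[a] += pull(a)
        counts[a] += 1
    return max(range(k), key=lambda a: sums[a] / counts[a])
```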
Slide 33: UCT Example
[Figure: from the current world state, a rollout policy reaches a terminal node with reward 1; the reward is recorded along the sampled path.]
- Build a state-action tree
- At a leaf node perform a random rollout
- Initially the tree is a single leaf
Slide courtesy of A. Fern
Slides 34-38: UCT Example (contd.)
[Animation frames: further simulated trajectories from the current world state, some ending in terminals with reward 0.]
- Must select each action at a node at least once
- When all of a node's actions have been tried once, select the action according to the tree policy; below the tree, continue with the rollout policy
- Node value estimates are updated to the average reward of the trajectories through them (e.g., 1 becomes 1/2)
- What is an appropriate tree policy? Rollout policy?
Slides courtesy of A. Fern
Slide 39: UCT Details
- Rollout policy: basic UCT uses random
- Tree policy:
  - Q(s,a): average reward received in current trajectories after taking action a in state s
  - n(s,a): number of times action a was taken in s
  - n(s): number of times state s was encountered
  - Pick argmax_a [ Q(s,a) + c √(ln n(s) / n(s,a)) ], where the square-root term is the exploration term
- c is a theoretical constant that must be selected empirically in practice. Setting it to the distance to the horizon guarantees arriving at the optimal policy eventually, if R...
Slide courtesy of A. Fern
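A sketch of this tree policy; the `node` structure (fields `actions`, `n`, `n_sa`, `q`) is an illustrative assumption, not from the slides:

```python
import math

def tree_policy_action(node, c):
    """UCT tree policy at an internal node: untried actions first,
    then the UCB rule Q(s,a) + c * sqrt(ln n(s) / n(s,a))."""
    untried = [a for a in node.actions if node.n_sa[a] == 0]
    if untried:
        return untried[0]              # each action tried at least once
    return max(node.actions,
               key=lambda a: node.q[a] +
                             c * math.sqrt(math.log(node.n) / node.n_sa[a]))
```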
Slides 40-41: UCT Example (contd.)
[Animation frames: the tree policy chooses between a1 and a2 at the root; after more simulations, node averages update (e.g., 1/2, 1/3).]
- When all node actions have been tried once, select the action according to the tree policy
Slides courtesy of A. Fern
Slide 42: UCT Summary & Theoretical Properties
- To select an action at a state s, build a tree using N iterations of Monte-Carlo tree search
  - Default policy is uniformly random, up to level L
  - Tree policy is based on the bandit rule
- Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
- The more simulations, the more accurate the estimates
- Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
- Possible improvements:
  - Initialize the state-action pairs with a heuristic (need to pick a weight)
  - Think of a better-than-random rollout policy
Slide courtesy of A. Fern
(A minimal end-to-end sketch follows.)
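Putting the pieces together, a minimal end-to-end UCT sketch. The interfaces `simulator(s, a) -> (s_next, reward)`, `actions(s)`, and `is_terminal(s)` are assumptions, and states are assumed hashable; this is a sketch of the scheme described above, not a tuned implementation:

```python
import math
import random
from collections import defaultdict

def uct_choose(s0, simulator, actions, is_terminal, N=1000, L=50, c=1.0):
    """Run N simulations from s0, growing a tree with the UCB tree
    policy and estimating leaf values with random rollouts; finally
    return argmax_a Q(s0, a) without the exploration term."""
    n_s = defaultdict(int)     # n(s)
    n_sa = defaultdict(int)    # n(s, a)
    q = defaultdict(float)     # Q(s, a), running average
    tree = set()

    def rollout(s, depth):
        total = 0.0
        while depth < L and not is_terminal(s):
            s, r = simulator(s, random.choice(actions(s)))
            total += r
            depth += 1
        return total

    def simulate(s, depth):
        if depth >= L or is_terminal(s):
            return 0.0
        if s not in tree:                  # expand a new leaf, then roll out
            tree.add(s)
            return rollout(s, depth)
        untried = [a for a in actions(s) if n_sa[(s, a)] == 0]
        if untried:
            a = random.choice(untried)     # try each action at least once
        else:                              # UCB tree policy
            a = max(actions(s), key=lambda a: q[(s, a)] +
                    c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))
        s2, r = simulator(s, a)
        ret = r + simulate(s2, depth + 1)
        n_s[s] += 1
        n_sa[(s, a)] += 1
        q[(s, a)] += (ret - q[(s, a)]) / n_sa[(s, a)]   # running average
        return ret

    for _ in range(N):
        simulate(s0, 0)
    return max(actions(s0), key=lambda a: q[(s0, a)])   # no exploration term
```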
Slide 43: LRTDP or UCT?
AAAI 2012!
Slide 44: Other Advances
- Ordering the Bellman backups to maximise information flow [Wingate & Seppi '05; Dai & Hansen '07]
- Partition the state space and combine value iterations from different partitions [Wingate & Seppi '05; Dai & Goldsmith '07]
- External-memory version of value iteration [Edelkamp, Jabbar & Bonet '07]
- ...
Slide 45: Two Models of Evaluating Probabilistic Planning
- IPPC (Probabilistic Planning Competition): how often did you reach the goal under the given time constraints? (FF-HOP, FF-Replan)
- Evaluate on the quality of the policy: converging to the optimal policy faster (LRTDP, mGPT, Kolobov's approach)
Slide 46: Online Action Selection
- Off-line policy generation
  - First compute the whole policy: get the initial state; compute the optimal policy given the initial state and the goals
  - Then just execute the policy: loop (do the action recommended by the policy; get the next state) until reaching a goal state
  - Pros: can anticipate all problems. Cons: may take too much time to start executing.
- Online action selection
  - Loop: compute the best action for the current state; execute it; get the new state
  - Pros: provides a fast first response. Cons: may paint itself into a corner.
[Timeline figure: offline = one long policy computation followed by execution; online = interleaved select/execute steps.]
Slide 47: FF-Replan
- A simple replanner: it determinizes the probabilistic problem
  - If an action has multiple effect sets with different probabilities, either select the most likely one, or split the action into multiple actions, one for each effect set (a sketch of both determinizations follows)
- Solves for a plan in the determinized problem
[Figure: a determinized plan a1-a2-a3-a4 from S to G; when execution deviates from the expected outcome, replanning finds a new plan (e.g., via a5) to G.]
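A sketch of both determinizations on a toy action encoding (a dict with a name and `(probability, effect)` outcome pairs; this encoding is an illustrative assumption, not PPDDL):

```python
def determinize(prob_actions, mode="all"):
    """Determinize a probabilistic action set.

    mode='mostlikely': keep only each action's most probable effect.
    mode='all': split each action into one deterministic action per
    outcome (the all-outcome determinization, cf. slide 48)."""
    det = []
    for act in prob_actions:
        if mode == "mostlikely":
            _, effect = max(act["outcomes"], key=lambda pe: pe[0])
            det.append({"name": act["name"], "effect": effect})
        else:
            for i, (_, effect) in enumerate(act["outcomes"], 1):
                det.append({"name": f"{act['name']}-{i}", "effect": effect})
    return det

# Example: A1 with two outcomes becomes A1-1 and A1-2 under mode="all",
# matching the A1-1/A1-2 naming in the figures on slides 53-54.
a1 = {"name": "A1", "outcomes": [(0.7, "left-effect"), (0.3, "right-effect")]}
print(determinize([a1], mode="all"))
```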
Slide 48: All-Outcome Replanning (FFRA)
[Figure: an action with Effect 1 and Effect 2 (with probabilities p1 and p2) is split into Action1, which yields Effect 1, and Action2, which yields Effect 2.]
ICAPS-07
Slide 49: 1st IPPC & Post-Mortem
IPPC competitors:
- Most IPPC competitors used different approaches for offline policy generation
- One group implemented a simple online "replanning" approach in addition to offline policy generation
  - Determinize the probabilistic problem (most-likely vs. all-outcomes)
  - Loop: get the state S; call a classical planner (e.g., FF) with [S, G] as the problem; execute the first action of the plan
- There were umpteen reasons why such an approach should do quite badly...
Results and post-mortem:
- To everyone's surprise, the replanning approach wound up winning the competition
- Lots of hand-wringing ensued...
  - Maybe we should require that the planners really use probabilities?
  - Maybe the domains should somehow be made "probabilistically interesting"?
- Current understanding:
  - No reason to believe that off-line policy computation must dominate online action selection
  - The "replanning" approach is just a degenerate case of hindsight optimization
Slide 50: Reducing Calls to FF
- We can reduce calls to FF by memoizing successes
- If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0-a0-s1-a1-s2-a2-s3-...-an-sG, then in addition to sending the first action to the simulator, we can memoize {si-ai} as a partial policy
- Whenever a new state is given by the simulator, we can check whether it is already in the partial policy (a sketch follows)
- Additionally, FF-replan can consider every state in the partial policy table as a goal state (in that if it reaches one of them, it knows how to get to the goal state)
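A sketch of this memoization scheme; `call_ff` is an assumed wrapper that plans from a state to the goal and returns the plan's state and action sequences:

```python
def memoize_plan(partial_policy, states, plan_actions):
    """After solving [s0, G] via s0-a0-s1-a1-...-sG, memoize every
    (si, ai) pair as a partial policy."""
    for s, a in zip(states, plan_actions):
        partial_policy[s] = a

def act(s, partial_policy, call_ff):
    """Only call FF when the simulator hands us a state we have not
    already solved for; otherwise replay the memoized action."""
    if s not in partial_policy:
        states, plan_actions = call_ff(s)   # plan from s to the goal
        memoize_plan(partial_policy, states, plan_actions)
    return partial_policy[s]
```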
Slide 51: Hindsight Optimization for Anticipatory Planning/Scheduling
- Consider a deterministic planning (scheduling) domain, where the goals arrive probabilistically
- Using up resources and/or doing greedy actions may preclude you from exploiting later opportunities
- How do you select actions to perform?
- Answer: if you have a distribution of the goal arrivals, then
  - Sample goals up to a certain horizon using this distribution
  - Now we have a deterministic planning problem with known goals
  - Solve it; do the first action from it
  - Can improve accuracy with multiple samples
- FF-HOP uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes).
Slide 52: Probabilistic Planning (goal-oriented)
[Figure: a two-stage tree (Time 1, Time 2) rooted at initial state I, with actions A1 and A2 at each state and probabilistic outcomes; left outcomes are more likely. Some leaves are goal states, one is a dead end. Objective: maximize goal achievement.]
Slides 53-54: Probabilistic Planning: All-Outcome Determinization
[Figure: the same two-stage tree, but every probabilistic action A1, A2 is split into deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome. Objective: find the goal.]
Slide 55: Problems of FF-Replan, and a Better Alternative: Sampling
- FF-Replan's static determinizations don't respect probabilities. We need "probabilistic and dynamic determinization".
- Sample future outcomes and determinize in hindsight: each future sample becomes a known-future deterministic problem.
Slide 56: Solving Stochastic Planning Problems via Determinizations
- Quite an old idea (e.g., envelope extension methods)
- What is new is the increasing realization that determinizing approaches provide state-of-the-art performance, even for probabilistically interesting domains
- Should be a happy occasion...
Slide 57: Hindsight Optimization (Online Computation of V_HS)
- An H-horizon future F_H for M = [S, A, T, R]: a mapping of state, action, and time (h < H) to a state, S × A × h → S
- Value of a policy π for F_H: R(s, F_H, π)
- Pick the action a with the highest Q(s,a,H), where Q(s,a,H) = R(s) + V*(s, H-1) and V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
- Compare this with the real value: V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ]
- V_FFRa(s) = max_F V(s,F) ≥ V_HS(s,H) ≥ V*(s,H)
- Q(s,a,H) = R(a) + E_{F_{H-1}} [ max_π R(a(s), F_{H-1}, π) ], where the computation of max_π R(s, F_{H-1}, π) is approximately done by FF [Hoffmann and Nebel '01]
- Each future is a deterministic problem (done by FF)
Slide 58: Implementation: FF-Hindsight
- Constructs a set of futures
- Solves the planning problem for each H-horizon future using FF
- Sums the rewards of each of the plans
- Chooses the action with the largest Q_hs value (a sketch follows)
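A sketch of this action-selection loop; `sample_future` and `solve_with_ff` (which commits to action a first, then lets FF solve the resulting known-future deterministic problem and returns the plan's reward) are assumed interfaces, and the width W and horizon H are illustrative:

```python
def hindsight_action(s, actions, sample_future, solve_with_ff, W=30, H=20):
    """FF-Hindsight sketch: sample W H-horizon futures, solve each
    resulting deterministic problem with FF after committing to each
    candidate first action, sum the plan rewards into Q_hs(s, a), and
    return the action with the largest Q_hs value."""
    q_hs = {}
    for a in actions(s):
        total = 0.0
        for _ in range(W):
            future = sample_future(H)             # fixes all outcomes
            total += solve_with_ff(s, a, future)  # reward of FF's plan
        q_hs[a] = total
    return max(q_hs, key=q_hs.get)
```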
Slide 59: Hindsight Optimization (Online Computation of V_HS)
- Pick the action a with the highest Q(s,a,H), where Q(s,a,H) = R(s,a) + Σ_{s'} T(s,a,s') V*(s', H-1)
- Compute V* by sampling
  - An H-horizon future F_H for M = [S, A, T, R]: a mapping of state, action, and time (h < H) to a state, S × A × h → S
  - Common-random-number (correlated) vs. independent futures; time-independent vs. time-dependent futures
- Value of a policy π for F_H: R(s, F_H, π); V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
- But this is still too hard to compute... so let's swap max and expectation: V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ], where max_π R(s, F_{H-1}, π) is approximated by an FF plan
- V_HS underestimates V*. Why? Intuitively, because V_HS can assume that it can use different policies in different futures, while V* needs to pick one policy that works best (in expectation) in all futures. But then, V_FFRa underestimates V_HS. Viewed in terms of J*, V_HS is a more informed admissible heuristic.
Slide 60: Probabilistic Planning (goal-oriented)
[Figure: the same two-stage tree as slide 52 (actions A1, A2 from initial state I; left outcomes are more likely; goal states and a dead end). Objective: maximize goal achievement.]
Slide 61: Improvement Ideas
- Reuse
  - Generated futures that are still relevant
  - Scoring for action branches at each step
  - If expected outcomes occur, keep the plan
- Future generation
  - Not just probabilistic
  - Somewhat even distribution of the space
- Adaptation
  - Dynamic width and horizon for sampling
  - Actively detect and avoid unrecoverable failures on top of sampling
Slide 62: Hindsight Sample 1
[Figure: the same tree under one sampled future; in this future the action scores are A1: 1, A2: 0. Left outcomes are more likely; objective: maximize goal achievement.]
Slide 63: Exploiting Determinism
- Find the longest prefix for all plans
- Apply the actions in the prefix continuously until one is not applicable
- Resume ZSL/OSL steps
Slide 64: Exploiting Determinism
[Figure: three plans from S1 to G generated for the chosen action a*.]
The longest prefix of each plan is identified and executed without running ZSL, OSL, or FF!
Slide 65: Handling Unlikely Outcomes: All-Outcome Determinization
- Assign each possible outcome an action
- Solve for a plan
- Combine the plan with the plans from the HOP solutions
Slide 66: Deterministic Techniques for Stochastic Planning
No longer the Rodney Dangerfield of stochastic planning?
Slide 67: Determinizations
- Most-likely outcome determinization
  - Inadmissible, e.g., if the only path to the goal relies on a less likely outcome of an action
- All-outcomes determinization
  - Admissible, but not very informed, e.g., when a very unlikely outcome leads you straight to the goal
Slide 68: Problems with Transition Systems
- Transition systems are a great conceptual tool to understand the differences between the various planning problems
- ...However, direct manipulation of transition systems tends to be too cumbersome
  - The size of the explicit graph corresponding to a transition system is often very large
- The remedy is to provide "compact" representations for transition systems
  - Start by explicating the structure of the "states", e.g., states specified in terms of state variables
  - Represent actions not as incidence matrices but as functions specified directly in terms of the state variables
    - An action will work in any state where some state variables have certain values. When it works, it will change the values of certain (other) state variables.
Slide 69: State Variable Models
- The world is made up of states, which are defined in terms of state variables (boolean, multi-ary, or continuous)
- States are complete assignments over state variables
  - So, k boolean state variables can represent how many states? (2^k)
- Actions change the values of the state variables
- Applicability conditions of actions are also specified in terms of partial assignments over state variables
Slide 70: Blocks World
State variables: Ontable(x), On(x,y), Clear(x), hand-empty, holding(x)
- Stack(x,y): Prec: holding(x), clear(y); Eff: on(x,y), ~clear(y), ~holding(x), hand-empty
- Unstack(x,y): Prec: on(x,y), hand-empty, clear(x); Eff: holding(x), ~clear(x), clear(y), ~hand-empty
- Pickup(x): Prec: hand-empty, clear(x), ontable(x); Eff: holding(x), ~ontable(x), ~hand-empty, ~clear(x)
- Putdown(x): Prec: holding(x); Eff: ontable(x), hand-empty, clear(x), ~holding(x)
Initial state: a complete specification of T/F values to state variables; by convention, variables with F values are omitted.
Goal state: a partial specification of the desired state-variable/value combinations; desired values can be both positive and negative.
Init: Ontable(A), Ontable(B), Clear(A), Clear(B), hand-empty
Goal: ~clear(B), hand-empty
All the actions here have only positive preconditions, but this is not necessary.
STRIPS ASSUMPTION: if an action changes a state variable, this must be explicitly mentioned in its effects (see the sketch below).
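A minimal sketch of STRIPS-style progression over sets of literals, using the slide's Pickup(x) action and initial state (the set encoding is an illustrative choice):

```python
def applicable(state, prec):
    """STRIPS applicability: all (positive) preconditions hold."""
    return prec <= state

def apply_action(state, prec, add, delete):
    """Apply a STRIPS action: the effects mention exactly the
    variables that change (the STRIPS assumption)."""
    assert applicable(state, prec)
    return (state - delete) | add

# Pickup(A) from the slide's initial state:
init = {"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"}
s1 = apply_action(init,
                  prec={"hand-empty", "clear(A)", "ontable(A)"},
                  add={"holding(A)"},
                  delete={"ontable(A)", "hand-empty", "clear(A)"})
print(s1)   # {"ontable(B)", "clear(B)", "holding(A)"}
```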
Slide 71: Why is the STRIPS representation compact? (compared to explicit transition systems)
- In explicit transition systems, actions are represented as state-to-state transitions, where each action is an incidence matrix of size |S| × |S|
- In the state-variable model, actions are represented only in terms of the state variables whose values they care about and whose values they affect
- Consider a state space of 1024 states. It can be represented by log2(1024) = 10 state variables. If an action needs variable v1 to be true and makes v7 false, it can be represented by just 2 bits (instead of a 1024 × 1024 matrix)
- Of course, if the action has a complicated mapping from states to states, in the worst case the action representation will be just as large
- The assumption being made here is that actions will have effects on a small number of state variables
[Figure: representation spectrum, from first-order (situation calculus) through relational/propositional (STRIPS rep) to atomic (transition rep).]
Slide 72: Factored Representations for MDPs: Actions
- Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too!
- Write a Bayes network relating the values of fluents at the state before and after the action
  - Bayes networks representing fluents at different time points are called "dynamic Bayes networks"
  - We look at 2TBNs (2-time-slice dynamic Bayes nets)
- Go further by using the STRIPS assumption
  - Fluents not affected by the action are not represented explicitly in the model
  - Called the Probabilistic STRIPS Operator (PSO) model
Slide 73: Action CLK [figure]
Slides 74-75: [figures]
Slide 76: Factored Representations: Reward, Value and Policy Functions
- Reward functions can be represented in factored form too. Possible representations include:
  - Decision trees (made up of fluents)
  - ADDs (algebraic decision diagrams)
- Value functions are like reward functions (so they too can be represented similarly)
- The Bellman update can then be done directly using factored representations (a small decision-tree sketch follows)
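A minimal sketch of a factored (decision-tree) reward or value function over fluents; the tree encoding is an illustrative assumption (ADDs generalize such trees with shared subgraphs):

```python
def tree_value(state, tree):
    """Evaluate a value/reward function stored as a decision tree over
    fluents: internal nodes are (fluent, true_branch, false_branch)
    tuples, leaves are numbers."""
    while isinstance(tree, tuple):
        fluent, t_branch, f_branch = tree
        tree = t_branch if state[fluent] else f_branch
    return tree

# Sysadmin-style reward (1 per running server) as a tree over 2 fluents:
reward_tree = ("ser1_up", ("ser2_up", 2, 1), ("ser2_up", 1, 0))
print(tree_value({"ser1_up": True, "ser2_up": False}, reward_tree))  # 1
```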
Slide 77: [figure]
Slide 78: SPUDD's use of ADDs
Slide 79: Direct manipulation of ADDs in SPUDD
Slide 80: Ideas for Efficient Algorithms
- Use heuristic search (and reachability information): LAO*, RTDP
- Use execution and/or simulation
  - "Actual execution": reinforcement learning (the main motivation for RL is to "learn" the model)
  - "Simulation": simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
- Use "factored" representations
  - Factored representations for actions, reward functions, values, and policies
  - Directly manipulating factored representations during the Bellman update
Slide 81: Probabilistic Planning
- The competition (IPPC)
- The action language (PPDDL)
Slides 82-83: [figures] Not ergodic
Slides 84-85: [figures]
Slide 86: Reducing Heuristic Computation Cost by Exploiting Factored Representations
- The heuristics computed for a state might give us an idea about the heuristic values of other "similar" states
- Similarity can be determined in terms of the state structure
- Exploit the overlapping structure of heuristics for different states
  - E.g., the SAG idea for McLUG
  - E.g., the triangle tables idea for plans (cf. Kolobov)
Slide 87: A Plan is a Terrible Thing to Waste
- Suppose we have a plan s0-a0-s1-a1-s2-a2-s3-...-an-sG
- We realize that this tells us not just the estimated value of s0, but also of s1, s2, ..., sn, so we don't need to compute the heuristic for them again
- Is that all? If we have states and actions in a factored representation, then we can explain exactly which aspects of si are relevant for the plan's success
  - The "explanation" is a proof of correctness of the plan
  - It can be based on regression (if the plan is a sequence) or a causal proof (if the plan is a partially ordered one)
  - The explanation will typically be just a subset of the literals making up the state
- That means the plan suffix from si may actually be relevant in many more states that are consistent with that explanation
Slide 88: Triangle Table Memoization
Use triangle tables / memoization.
[Figure: a blocks-world problem over blocks A, B, C.]
If the above problem is solved, then we don't need to call FF again for the problem below:
[Figure: the smaller blocks-world subproblem over blocks A and B.]
Slide 89: Explanation-Based Generalization (of Successes and Failures)
- Suppose we have a plan P that solves a problem [S, G]
- We can first find out which aspects of S this plan actually depends on
  - Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
- Now you can memoize this plan for just that subset of S (a sketch follows)
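A sketch of memoizing a plan keyed by its explanation rather than the full state; the set-of-literals states and helper names are illustrative assumptions:

```python
def memoize_with_explanation(memo, explanation, plan):
    """Store a plan keyed by the *explanation*: the subset of the
    literals of S that the correctness proof actually uses."""
    memo.append((frozenset(explanation), plan))

def lookup(memo, state):
    """Any state consistent with a stored explanation can reuse the plan."""
    for explanation, plan in memo:
        if explanation <= state:
            return plan
    return None
```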
Slide 90: Relaxations for Stochastic Planning
- Determinizations can also be used as a basis for heuristics to initialize the V for value iteration [mGPT; GOTH etc.]
- Heuristics come from relaxation. We can relax along two separate dimensions:
  - Relax -ve interactions: consider +ve interactions alone, using relaxed planning graphs
  - Relax uncertainty: consider determinizations
  - Or a combination of both!
Slide 91: Solving Determinizations
- If we relax -ve interactions
  - Then compute a relaxed plan
  - Admissible if the optimal relaxed plan is computed; inadmissible otherwise
- If we keep -ve interactions
  - Then use a deterministic planner (e.g., FF/LPG)
  - Inadmissible unless the underlying planner is optimal
Slide 92: Dimensions of Relaxation
[Figure: a 2-D space with axes "Uncertainty" and "Negative Interactions", with increasing consideration along each axis; point 1 = relaxed-plan heuristic, 2 = McLUG, 3 = FF/LPG, 4 = limited-width stochastic planning.]
- Reducing uncertainty: bound the number of stochastic outcomes, a stochastic "width"
- Limited-width stochastic planning?
Slide 93: Dimensions of Relaxation

-ve interactions \ Uncertainty | None         | Some                              | Full
None                           | Relaxed Plan |                                   | McLUG
Some                           |              |                                   |
Full                           | FF/LPG       | Limited-width stochastic planning |
Slide 94: Expressiveness v. Cost
[Figure: heuristics arranged by expressiveness vs. cost: h = 0, FF-Replan, FF, McLUG, limited-width stochastic planning.]
[Figure: node expansions vs. heuristic computation cost, comparing FFR and FF.]
Slide 95: [figure]