Planning under Uncertainty - PowerPoint Presentation
Uploaded by aaron on 2018-03-16

Presentation Transcript

Slide 1: Planning under Uncertainty

Slide 2: Today’s Topics
- Sequential Decision Problems
- Markov Decision Process (MDP)
- Value Iteration
- Policy Iteration
- Partially Observable MDPs (POMDPs)
- Student Questions about the Midterm

Slide 3: Big Assumption in Most of the Planning Techniques We’ve Seen So Far
- What is it? NO UNCERTAINTY!
- Assumes the agent knows everything about the world and what can happen in it.
- Sources of uncertainty:
  - The agent may not know all states of the world.
  - The agent may not know what state of the world it is in.
  - The outcomes of actions may not be known.

Slide 4: Sequential Decision Problem: Example
- Problem: beginning at the start state, choose an action at each time step.
- The problem terminates when either goal state is reached.
- Possible actions are Up, Down, Left, and Right.
- Assume that the environment is fully observable, i.e., the agent always knows where it is.

Slide 5: Sequential Decision Problem: Example
- Deterministic solution: if the environment is deterministic and the objective is to get the maximum reward, the solution is easy: (Up, Up, Right, Right, Right).

Slide 6: Sequential Decision Problem: Example
- What if actions are unreliable?
- Suppose there is a 0.8 probability of moving to the intended cell, but the rest of the time the agent moves to one of the cells at right angles to the intended one.
- If a boundary or an obstacle is encountered, the agent does not move.
- The probability of reaching the goal state by executing (Up, Up, Right, Right, Right) is 0.8^5 = 0.32768, plus a small probability of reaching the goal state by the other path, for a total of 0.32776.
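A worked version of that calculation: the direct path requires all five moves to succeed, while the other path (around the obstacle) succeeds only if the first four moves all slip sideways and the last move succeeds, which is consistent with the slide's total.

\[
0.8^5 = 0.32768, \qquad 0.1^4 \times 0.8 = 0.00008, \qquad 0.32768 + 0.00008 = 0.32776
\]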

Slide 7: Transition Model
- A transition model is a specification of the outcome probabilities for each action in each possible state.
- T(s, a, s′) denotes the probability of reaching state s′ if action a is done in state s.
- Make the Markov assumption, i.e., the probability of reaching s′ from s depends only on s and not on the history of earlier states.
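A minimal Python sketch of such a transition model for the 4×3 grid world used in the example. Only the 0.8/0.1/0.1 slip behaviour and the "stay put on walls and boundaries" rule come from the slides; the grid size, the obstacle location, and the helper names are illustrative assumptions.

```python
# Sketch of T(s, a, s') for the 4x3 grid world, cells (1,1)..(4,3).
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
RIGHT_ANGLES = {"Up": ["Left", "Right"], "Down": ["Left", "Right"],
                "Left": ["Up", "Down"], "Right": ["Up", "Down"]}
WALLS = {(2, 2)}              # assumed obstacle cell
WIDTH, HEIGHT = 4, 3

def move(s, direction):
    """Deterministic effect of a move; bumping a wall or boundary stays put."""
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT):
        return s
    return nxt

def T(s, a, s_next):
    """P(s_next | s, a): 0.8 for the intended cell, 0.1 for each right-angle slip."""
    p = 0.0
    if move(s, a) == s_next:
        p += 0.8
    for side in RIGHT_ANGLES[a]:
        if move(s, side) == s_next:
            p += 0.1
    return p

# Example: from the start state (1,1), action Up.
print(T((1, 1), "Up", (1, 2)))   # 0.8  (intended cell)
print(T((1, 1), "Up", (2, 1)))   # 0.1  (slips Right)
print(T((1, 1), "Up", (1, 1)))   # 0.1  (slips Left into the boundary, stays put)
```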

Slide 8: Rewards and Utilities
- A utility function must be specified for the agent in order to determine the value of an action.
- Because the problem is sequential, the utility function depends on a sequence of states (an environment history).
- Rewards are assigned to states, i.e., R(s) returns the reward of state s.
- For this example, assume the following:
  - The reward for all states, except for the goal states, is -0.04.
  - The utility of a history is the sum of the rewards of the states visited.
  - E.g., if the agent reaches (4,3) in 10 steps, the total utility is 1 + (10 × -0.04) = 0.6.
- The negative reward is an incentive to stop interacting as quickly as possible.

Slide 9: Markov Decision Process (MDP)
- A specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards.
- Three components:
  - Initial state: S0
  - Transition model: T(s, a, s′)
  - Reward function: R(s)
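A compact sketch of those three components as a Python container. The field names, the dataclass layout, and the example +1/-1 terminal cells are illustrative assumptions; the slides only specify that an MDP consists of an initial state, T(s, a, s′), and R(s).

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[int, int]

@dataclass
class MDP:
    initial_state: State
    transition: Callable[[State, str, State], float]   # T(s, a, s')
    reward: Callable[[State], float]                    # R(s)
    actions: Tuple[str, ...] = ("Up", "Down", "Left", "Right")

# Example wiring for the 4x3 grid world, reusing a transition function like the
# T(s, a, s_next) sketch above and the -0.04 step reward from slide 8
# (the +1/-1 terminal cells are assumed to be (4,3) and (4,2)):
# grid = MDP(initial_state=(1, 1), transition=T,
#            reward=lambda s: {(4, 3): 1.0, (4, 2): -1.0}.get(s, -0.04))
```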

Slide 10: Solution for an MDP
- Since the outcomes of actions are not deterministic, a fixed set of actions cannot be a solution.
- A solution must specify what the agent should do for any state that the agent might reach.
- A policy, denoted by π, recommends an action for a given state, i.e., π(s) is the action recommended by policy π for state s.

Slide 11: Quality of a Policy
- Since the environment is stochastic, each time a given policy is executed starting from the initial state, a different environment history can result.
- Therefore, the quality of a policy is determined by the expected utility of the possible environment histories generated by that policy.

Slide 12: Optimal Policy
- An optimal policy is a policy that yields the highest expected utility.
- The optimal policy is denoted by π*.
- Once π* has been computed for a problem, the agent, after identifying the state s it is in, consults π*(s) for the next action to execute.

Slide 13: Optimal Policy for Example
- Note that at (3,1), the policy goes back towards the initial state. Why?

Slide 14: Balancing Risk and Reward
- The balance of risk and reward depends on the value of R(s).
- This characteristic appears often in the real world, which is why MDPs have been studied in many fields (AI, OR, economics, control theory, etc.).
- The following four slides show π* for four different reward models.

Slide 15: R(s) < -1.6284
- Get out of the environment as fast as possible.

Slide 16: -0.4278 < R(s) < -0.0850
- Take the fastest route to (4,3) without concern for risk.

Slide 17: -0.0221 < R(s) < 0
- Take no risks at all.

Slide 18: R(s) > 0
- Never leave the environment.

Slide 19: Decision-Making Horizon
- Finite horizon: a fixed time N after which nothing matters.
  - The optimal action could change over time. E.g., in our example, suppose the agent starts at (3,1); if N = 3, the optimal action is to take the short cut, but if N = 100, …
  - The optimal policy is nonstationary.
- Infinite horizon: no fixed time limit; the optimal action depends only on the current state.
  - The optimal policy is stationary.

Slide 20: Stationary Preferences between States
- The assumption that preferences remain the same independent of time.
- If you prefer one future to another starting tomorrow, then you should still prefer that future if it were to start today.
- Given stationary preferences, there are two ways to assign utilities to sequences.

Slide 21: Assignment of Utility to State Sequences
- The utility function for environment histories (sequences of states) is denoted Uh([s0, s1, ..., sn]).
- Two methods:
  - Additive rewards: sum up the rewards of the states, i.e., Uh([s0, s1, ...]) = R(s0) + R(s1) + R(s2) + ...
  - Discounted rewards: sum of progressively discounted rewards of the states, i.e., Uh([s0, s1, ...]) = R(s0) + γR(s1) + γ²R(s2) + ..., where the discount factor γ is a number between 0 and 1.
- The closer γ is to 0, the less future rewards count.
- When γ = 1, discounted rewards are the same as additive rewards.
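A small sketch of the two assignments in Python, assuming the rewards along a history are given as a plain list (the helper names are illustrative):

```python
def additive_return(rewards):
    """Additive rewards: the plain sum of the rewards along a history."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted rewards: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: the 10-step history from slide 8 (ten -0.04 steps, then the +1 goal).
history = [-0.04] * 10 + [1.0]
print(additive_return(history))          # ~0.6, as computed on slide 8
print(discounted_return(history, 1.0))   # same as additive when gamma = 1
print(discounted_return(history, 0.9))   # future rewards count for less
```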

Slide 22: Issue with Calculating Utilities on Infinite Horizons
- If all environment histories are infinite (no terminal state is ever reached), using additive rewards leads to comparing utilities of +∞.
- Three solutions:
  - Discounted rewards: if rewards are bounded by Rmax and γ < 1, then Uh([s0, s1, ...]) ≤ Rmax / (1 - γ).
  - Ensure a proper policy, i.e., a policy that is guaranteed to reach a terminal state.
  - Compare policies in terms of average reward (difficult to analyze).
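The discounted-rewards bound is the usual geometric-series argument (a standard derivation, stated here for reference):

\[
U_h([s_0, s_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
= \frac{R_{\max}}{1 - \gamma}, \qquad 0 \le \gamma < 1 .
\]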

Slide 23: Choosing between Policies
- The value of a policy is the expected sum of discounted rewards obtained, where the expectation is taken over all possible state sequences that could occur, given that the policy is executed.

Slide 24: Value Iteration
- Value iteration is an algorithm for computing an optimal policy.
- Basic idea: calculate the utility of each state and then use the state utilities to select an optimal action in each state.

Slide 25: Utility of States
- The utility of a state is the expected utility of the state sequences that might follow it, which are determined by a policy.
- Let Uπ(s) be the utility of state s under policy π, and let s_t be the state the agent is in after executing π for t steps.
- Let U(s) be shorthand for Uπ*(s), the utility under the optimal policy.
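A standard way to write this expected utility under discounted rewards (matching the slide's setup) is:

\[
U^{\pi}(s) = E\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; \pi,\; s_0 = s \right].
\]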

Slide 26: Utilities for Example Problem
- Note that utilities closer to (4,3) are higher because fewer steps are required to reach the exit.

Slide 27: Bellman Equation
- π* selects the action that maximizes the expected utility of the subsequent state.
- The Bellman equation defines U(s) as the reward of s plus the expected discounted utility of the next state, assuming the optimal action is chosen, i.e., U(s) = R(s) + γ max_a Σ_s′ T(s, a, s′) U(s′).

Slide 28: Computing the Bellman Equation on the Example Problem
- The equation for state (1,1) is written out below.
- When we plug in the utilities from slide 26, we find that Up is the best action.
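A version of that equation for state (1,1), assuming the 0.8/0.1/0.1 transition model and the -0.04 step reward from the earlier slides (slips into the boundary leave the agent in (1,1)):

\[
U(1,1) = -0.04 + \gamma \max\left\{
\begin{aligned}
&0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1) && \text{(Up)}\\
&0.9\,U(1,1) + 0.1\,U(1,2) && \text{(Left)}\\
&0.9\,U(1,1) + 0.1\,U(2,1) && \text{(Down)}\\
&0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) && \text{(Right)}
\end{aligned}
\right\}
\]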

Slide 29: Using Bellman Equations for Solving MDPs
- If there are n possible states, then there are n Bellman equations (one for each state).
- To compute the n utilities, we would like to solve the n Bellman equations simultaneously. This is problematic because max is not a linear operator.
- Instead, iterate by applying the Bellman update: U_{i+1}(s) ← R(s) + γ max_a Σ_s′ T(s, a, s′) U_i(s′).
- Start with the utilities of all states initialized to 0.
- This is guaranteed to converge.

Slide 30: Value-Iteration Algorithm
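A minimal Python sketch of value iteration as described on the previous slides; the dictionary-based MDP representation, the stopping threshold, and the policy-extraction helper are illustrative assumptions, not the slide's own pseudocode.

```python
def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-4):
    """Iterate the Bellman update U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in states}                      # start with all utilities at 0
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            best = max(sum(T(s, a, s2) * U[s2] for s2 in states) for a in actions)
            U_next[s] = R(s) + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:     # standard stopping test for error <= epsilon
            return U

def greedy_policy(states, actions, T, U):
    """Extract the policy that picks, in each state, the action of maximum expected utility."""
    return {s: max(actions, key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
            for s in states}
```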

Slide 31: Value-Iteration Convergence
- Figure: evolution of the utilities of selected states under value iteration. Note that some states' utilities stay negative until utility from the +1 goal state propagates back to them.
- Figure: the number of iterations required to guarantee an error of at most ε = c · Rmax, for different values of c, as a function of the discount factor γ.

Slide 32: Are True Utilities for States Required?
- What matters is that the utilities are good enough to recommend the optimal action in each state.
- In practice, πi often becomes optimal before Ui has converged.
- For our example, the policy πi is optimal when i = 4, even though the maximum error in Ui is still 0.46.

Slide 33: Policy Iteration
- Searches policy space.
- Basic idea:
  - Policy evaluation: start with a random policy π0 and calculate the utilities that would result if that policy were executed.
  - Policy improvement: calculate a new MEU (maximum expected utility) policy πi+1 based on the computed utilities.
  - Iterate until the policy does not change.

Slide 34: Policy-Iteration Algorithm
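A minimal Python sketch of the policy-iteration loop described on the previous slide. The evaluation step here uses a fixed number of simplified value-iteration sweeps (the "modified policy iteration" idea mentioned on the next slide); the function names, sweep count, and tie-breaking tolerance are illustrative assumptions.

```python
import random

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=20):
    """Alternate policy evaluation and greedy (MEU) policy improvement until stable."""
    pi = {s: random.choice(actions) for s in states}       # start with a random policy
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: the policy is fixed, so there is no max over actions.
        for _ in range(eval_sweeps):
            U = {s: R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in states)
                 for s in states}
        # Policy improvement: switch to a strictly better action where one exists.
        changed = False
        for s in states:
            def q(a, s=s):
                return sum(T(s, a, s2) * U[s2] for s2 in states)
            best_a = max(actions, key=q)
            if q(best_a) > q(pi[s]) + 1e-12:                # strict improvement ensures termination
                pi[s], changed = best_a, True
        if not changed:
            return pi, U
```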

Slide 35: Policy Evaluation
- Because the policy is fixed, the max operator disappears and standard linear algebra methods can be applied to solve the resulting simultaneous equations.
- The complexity is O(n³), which may be prohibitive for large state spaces.
- Modified policy iteration: instead, run some number of value-iteration steps (simplified because the policy is fixed) to get a reasonable approximation of the utilities.
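For the exact O(n³) variant, the fixed-policy equations U(s) = R(s) + γ Σ_s′ T(s, π(s), s′) U(s′) form a linear system (I − γ T_π) U = R that a standard solver can handle. A sketch with NumPy; the array layout and function name are assumptions:

```python
import numpy as np

def evaluate_policy_exact(states, pi, T, R, gamma=0.9):
    """Solve (I - gamma * T_pi) U = R for the fixed policy pi."""
    n = len(states)
    index = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))                 # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    r = np.zeros(n)
    for s in states:
        r[index[s]] = R(s)
        for s2 in states:
            T_pi[index[s], index[s2]] = T(s, pi[s], s2)
    U = np.linalg.solve(np.eye(n) - gamma * T_pi, r)   # the O(n^3) step
    return {s: U[index[s]] for s in states}
```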

Slide 36: Partial Observability
- What do you do if the system state cannot always be determined, i.e., the outcomes of actions are not fully observable?
- Use a Partially Observable MDP (POMDP). To the MDP model, add:
  - a set of observations O,
  - an observation distribution O(s, o) for each state (the probability of perceiving observation o in state s),
  - an initial state distribution.

Slide 37: POMDP
- Basic decision cycle:
  - Given the current belief state b, execute the action a = π*(b).
  - Receive the observation o.
  - Update the current belief state based on the previous belief state, the action taken, and the new observation (a sketch of this step follows below).
- Solve as an MDP by reasoning in belief space.
- This requires calculating a probability distribution over the possible states given previous observations.
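A minimal sketch of that belief-update step, assuming a transition model T(s, a, s′) and an observation model O(s′, o) as on the previous slide; beliefs are plain dictionaries mapping states to probabilities, and the normalization is the usual filtering step:

```python
def update_belief(b, a, o, states, T, O):
    """New belief: b'(s') proportional to O(s', o) * sum_s T(s, a, s') * b(s)."""
    new_b = {s2: O(s2, o) * sum(T(s, a, s2) * b[s] for s in states) for s2 in states}
    total = sum(new_b.values())                  # normalize so the belief sums to 1
    return {s2: p / total for s2, p in new_b.items()}
```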

Slide 38: Big Problem with MDPs and Variants
- They do not scale: real-world problems have too many states.
- There are methods for focusing the search only on significant states.
- What if an outcome is not in the transition model?
- There have been attempts at hybrid approaches that use an MDP for a short horizon and heuristic-search estimates of the utilities of distant states.