CS 573: Artificial Intelligence


Presentation Transcript

1. CS 573: Artificial Intelligence
Markov Decision Processes
Dan Weld, University of Washington
Slides by Dan Klein & Pieter Abbeel / UC Berkeley (http://ai.berkeley.edu) and by Mausam & Andrey Kolobov

2. Logistics
- PS 2 due today
- Midterm in one week; covers all material through value iteration (Wed/Fri)
- Closed book; you may bring one 8.5 x 11” double-sided sheet of paper

3. Outline
- Adversarial games: minimax search, α-β search, evaluation functions, multi-player and non-zero-sum games
- Stochastic games: expectimax
- Markov decision processes
- Reinforcement learning

4. Agent vs. Environment
- An agent is an entity that perceives and acts.
- A rational agent selects actions that maximize its utility function.
- [Diagram: the agent receives percepts from the environment through its sensors and produces actions through its actuators.]
- Environment characteristics: deterministic vs. stochastic; fully observable vs. partially observable

5. Rational Preferences
- The Axioms of Rationality
- Theorem: rational preferences imply behavior describable as maximization of expected utility

6. MEU Principle
- Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]: given any preferences satisfying these constraints, there exists a real-valued function U such that U(A) ≥ U(B) exactly when A is (weakly) preferred to B, and U([p1, S1; …; pn, Sn]) = Σi pi U(Si)
- I.e., values assigned by U preserve preferences over both prizes and lotteries!
- Maximum expected utility (MEU) principle: choose the action that maximizes expected utility
- Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities
- E.g., a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
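A minimal Python sketch of the MEU principle in action. The prizes, utilities, and lotteries below are invented for illustration; they are not from the slides.

```python
# MEU sketch: a lottery is a list of (probability, outcome) pairs, and a
# rational agent picks the action whose lottery has the highest expected utility.

def expected_utility(lottery, U):
    """Expected utility of a lottery [(p1, o1), (p2, o2), ...]."""
    return sum(p * U(o) for p, o in lottery)

def meu_action(actions, U):
    """Pick the action whose outcome lottery maximizes expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], U))

if __name__ == "__main__":
    U = {"win": 1.0, "draw": 0.6, "lose": 0.0}        # made-up prize utilities
    actions = {
        "safe":  [(1.0, "draw")],                     # guaranteed draw, EU = 0.6
        "risky": [(0.7, "win"), (0.3, "lose")],       # gamble, EU = 0.7
    }
    print(meu_action(actions, U.get))                 # -> 'risky'
```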

7. Human Utilities

8. Utility Scales
- Normalized utilities: u+ = 1.0, u− = 0.0
- Micromorts: one-millionth chance of death; useful for paying to reduce product risks, etc.
- QALYs: quality-adjusted life years; useful for medical decisions involving substantial risk
- Note: behavior is invariant under positive linear transformation
- With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes

9. Human Utilities
- Utilities map states to real numbers. Which numbers?
- Standard approach to assessment (elicitation) of human utilities: compare a prize A to a standard lottery Lp between the “best possible prize” u+ with probability p and the “worst possible catastrophe” u− with probability 1−p
- Adjust the lottery probability p until indifference: A ~ Lp
- The resulting p is a utility in [0, 1]
- Example: “pay $30” compared to the lottery [0.999999, no change; 0.000001, instant death]

10. Money
- Money does not behave as a utility function, but we can talk about the utility of having money (or of being in debt)
- Given a lottery L = [p, $X; (1−p), $Y]:
- The expected monetary value is EMV(L) = p·X + (1−p)·Y
- U(L) = p·U($X) + (1−p)·U($Y)
- Typically, U(L) < U(EMV(L)); in this sense, people are risk-averse
- When deep in debt, people are risk-prone

11. Example: Insurance
- Consider the lottery [0.5, $1000; 0.5, $0]
- What is its expected monetary value? ($500)
- What is its certainty equivalent? The monetary value acceptable in lieu of the lottery: about $400 for most people
- The difference of $100 is the insurance premium
- There’s an insurance industry because people will pay to reduce their risk; if everyone were risk-neutral, no insurance would be needed!
- It’s win-win: you’d rather have the $400, and the insurance company would rather have the lottery (their utility curve is flat and they hold many lotteries)
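A small sketch of the risk-aversion point. The square-root utility function here is an assumption purely for illustration (it yields a certainty equivalent of $250, whereas the slide cites about $400 for most people); the slide only asserts that U(L) < U(EMV(L)) for risk-averse agents.

```python
import math

def U(x):
    return math.sqrt(x)          # any concave utility models risk aversion

lottery = [(0.5, 1000.0), (0.5, 0.0)]
emv = sum(p * x for p, x in lottery)          # expected monetary value: 500.0
eu  = sum(p * U(x) for p, x in lottery)       # expected utility of the lottery
certainty_equivalent = eu ** 2                # sure cash with the same utility

print(emv)                    # 500.0
print(certainty_equivalent)   # 250.0 < 500.0: the agent would trade the lottery
                              # for any sure amount above this
```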

12. Example: Human Rationality?
- Famous example of Allais (1953):
- A: [0.8, $4k; 0.2, $0]    B: [1.0, $3k; 0.0, $0]
- C: [0.2, $4k; 0.8, $0]    D: [0.25, $3k; 0.75, $0]
- Most people prefer B > A and C > D
- But if U($0) = 0, then B > A ⟹ U($3k) > 0.8 U($4k), while C > D ⟹ 0.8 U($4k) > U($3k): the two preferences are inconsistent

13. Non-Deterministic Search

14. Example: Grid World
- A maze-like problem: the agent lives in a grid, and walls block the agent’s path
- Noisy movement: actions do not always go as planned
- 80% of the time, the action North takes the agent North (if there is no wall there)
- 10% of the time, North takes the agent West; 10% East
- If there is a wall in the direction the agent would have been taken, the agent stays put
- The agent receives rewards each time step: a small “living” reward each step (can be negative), and big rewards at the end (good or bad)
- Goal: maximize the sum of rewards

15. Grid World Actions
- Deterministic Grid World vs. Stochastic Grid World

16. Markov Decision Processes
An MDP is defined by:
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s’): the probability that a from s leads to s’, i.e., P(s’ | s, a); also called the model or the dynamics
Example entries: T(s11, E, …) …, T(s31, N, s11) = 0, T(s31, N, s32) = 0.8, T(s31, N, s21) = 0.1, T(s31, N, s41) = 0.1, …
T is a Big Table: 11 x 4 x 11 = 484 entries here
For now, we give this as input to the agent
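One simple way to store the transition “Big Table” in code, shown as a sketch using the entries quoted on the slide (the nested-dict layout is just one convenient choice).

```python
# T(s, a, s') = P(s' | s, a) stored as a nested dict keyed by (state, action).
# State names like "s31" follow the slide's labeling.
T = {
    ("s31", "N"): {"s11": 0.0, "s32": 0.8, "s21": 0.1, "s41": 0.1},
    # ... one entry per (state, action) pair; 11 states x 4 actions here
}

def transition_prob(s, a, s_next):
    return T.get((s, a), {}).get(s_next, 0.0)

print(transition_prob("s31", "N", "s32"))   # 0.8
```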

17. Markov Decision Processes
An MDP is defined by:
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s’): the probability that a from s leads to s’, i.e., P(s’ | s, a); also called the model or the dynamics
- A reward function R(s, a, s’)
Example entries: R(s32, N, s33) = -0.01, R(s32, N, s42) = -1.01, R(s33, E, s43) = 0.99, … (the -0.01 is the “cost of breathing”)
R is also a Big Table! For now, we also give this to the agent

18. Markov Decision Processes
An MDP is defined by:
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s’): the probability that a from s leads to s’, i.e., P(s’ | s, a); also called the model or the dynamics
- A reward function R(s, a, s’); sometimes just R(s) or R(s’)
Example entries: R(s33) = -0.01, R(s42) = -1.01, R(s43) = 0.99

19. Markov Decision Processes
An MDP is defined by:
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s’): the probability that a from s leads to s’, i.e., P(s’ | s, a); also called the model or the dynamics
- A reward function R(s, a, s’); sometimes just R(s) or R(s’), e.g. in R&N
- A start state
- Maybe a terminal state
MDPs are non-deterministic search problems: one way to solve them is with expectimax search, but we’ll have a new tool soon

20. What is Markov about MDPs?
- “Markov” generally means that given the present state, the future and the past are independent
- For Markov decision processes, “Markov” means action outcomes depend only on the current state
- This is just like search, where the successor function can only depend on the current state (not the history)
- Andrey Markov (1856-1922)

21. Policies
- In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
- For MDPs, we want an optimal policy π*: S → A
- A policy π gives an action for each state
- An optimal policy is one that maximizes expected utility if followed
- An explicit policy defines a reflex agent
- Expectimax didn’t output an entire policy; it computed the action for a single state only
- [Figure: the optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s]

22. Optimal Policies
[Figures: the optimal policy under different living rewards R(s) = -0.01, -0.03, -0.4, and -2.0 (the “cost of breathing”)]

23. Example: Racing

24. Example: Racing
- A robot car wants to travel far, quickly
- Three states: Cool, Warm, Overheated
- Two actions: Slow, Fast
- Going faster gets double reward… except:
- From Cool: Slow stays Cool (prob 1.0, reward +1); Fast goes to Cool or Warm (prob 0.5 each, reward +2)
- From Warm: Slow goes to Cool or Warm (prob 0.5 each, reward +1); Fast goes to Overheated (prob 1.0, reward -10)
- Overheated is terminal
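As a concrete data structure, the racing MDP can be written as a nested transition/reward table. This encoding is one reasonable reading of the slide’s diagram, and the later sketches in this transcript reuse the same `racing_mdp` format.

```python
# mdp[state][action] = list of (probability, next_state, reward)
racing_mdp = {
    "cool": {
        "slow": [(1.0, "cool", +1)],
        "fast": [(0.5, "cool", +2), (0.5, "warm", +2)],
    },
    "warm": {
        "slow": [(0.5, "cool", +1), (0.5, "warm", +1)],
        "fast": [(1.0, "overheated", -10)],
    },
    "overheated": {},   # terminal: no actions available
}
```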

25. Racing: Search TreeMight be generated with ExpectiMax, but …?

26. MDP Search Trees
- Each MDP state projects an expectimax-like search tree
- s is a state; (s, a) is a q-state
- (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)

27. Utilities of Sequences

28. Utilities of Sequences
- What preferences should an agent have over reward sequences?
- More or less? [1, 2, 2] or [2, 3, 4]?
- Now or later? [0, 0, 1] or [1, 0, 0]?

29. Discounting
- It’s reasonable to maximize the sum of rewards
- It’s also reasonable to prefer rewards now to rewards later
- One solution: values of rewards decay exponentially
- A reward is worth its full value now, γ times as much one step later, and γ² times as much two steps later

30. Discounting
- How to discount? Each time we descend a level, we multiply by the discount γ
- Why discount? Sooner rewards probably do have higher utility than later rewards; it also helps our algorithms converge
- Example: with a discount of 0.5, U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3 = 2.75, so U([1, 2, 3]) < U([3, 2, 1])
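A two-line check of the discounting example above, written as a tiny Python sketch.

```python
# Discounted utility of a reward sequence: sum_t gamma^t * r_t.
def discounted_utility(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))   # 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 4.25, so [3,2,1] is preferred
```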

31. Stationary Preferences
- Theorem: if we assume stationary preferences, then there are only two ways to define utilities over reward sequences:
- Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
- Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …

32. Quiz: Discounting
- Given: actions East, West, and Exit (Exit only available in the exit states a, e); transitions are deterministic
- Quiz 1: For γ = 1, what is the optimal policy?
- Quiz 2: For γ = 0.1, what is the optimal policy?
- Quiz 3: For which γ are West and East equally good when in state d?

33. Infinite Utilities?!
- Problem: what if the game lasts forever? Do we get infinite rewards?
- Solutions:
- Finite horizon (similar to depth-limited search): terminate episodes after a fixed T steps (e.g. life); gives nonstationary policies (π depends on the time left)
- Discounting: use 0 < γ < 1; smaller γ means a smaller “horizon” – a shorter-term focus
- Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)

34. Recap: Defining MDPs
Markov decision processes:
- Set of states S
- Start state s0
- Set of actions A
- Transitions P(s’ | s, a) (or T(s, a, s’))
- Rewards R(s, a, s’) (and discount γ)
MDP quantities so far:
- Policy = choice of action for each state
- Utility = sum of (discounted) rewards

35. Solving MDPs
- Value Iteration
- Asynchronous VI
- Policy Iteration
- Reinforcement Learning

36. V* = Optimal Value Function
- The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally forever

37. Q*
- The value (utility) of the q-state (s, a): Q*(s, a) = expected utility of (1) starting in state s, (2) taking action a, and (3) acting optimally forever after that
- Q*(s, a) = expected reward from executing a in s and ending in s’, plus the discounted value V*(s’):
  Q*(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

38. π* Specifies The Optimal Policy
- π*(s) = optimal action from state s

39. The Bellman Equations
How to be optimal:
- Step 1: Take the correct first action
- Step 2: Keep being optimal

40. The Bellman Equations
- The definition of “optimal utility” via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
  V*(s) = max_a Q*(s, a) = max_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
- These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over
- Richard Bellman (1920-1984)

41. Gridworld: Q*

42. Gridworld Values V*

43. Values of States
- Fundamental operation: compute the (expectimax) value of a state
- Expected utility under optimal action = average sum of (discounted) rewards
- This is just what expectimax computed!
- Recursive definition of value: V*(s) = max_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

44. Racing Search Tree

45. No End in Sight…
- We’re doing way too much work with expectimax!
- Problem 1: States are repeated. Idea: only compute needed quantities once, like graph search (vs. tree search)
- Problem 2: The tree goes on forever. Rewards at each step ⟹ V changes. Idea: do a depth-limited computation, but with increasing depths until the change is small
- Note: deep parts of the tree eventually don’t matter if γ < 1

46. Time-Limited Values
- Key idea: time-limited values
- Define Vk(s) to be the optimal value of s if the game ends in k more time steps
- Equivalently, it’s what a depth-k expectimax would give from s
[Demo – time-limited values (L8D6)]

47. Time-Limited Values: Avoiding Redundant Computation

48. Value Iteration

49. Value Iteration
- Forall s, initialize V0(s) = 0 (no time steps left means an expected reward of zero)
- Repeat (k += 1) until |Vk+1(s) – Vk(s)| < ε for all s (“convergence”):
  For all s, a:  Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]
  For all s:     Vk+1(s) = max_a Qk+1(s, a)
- Each update is called a “Bellman backup”; this is successive approximation / dynamic programming
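A short Python sketch of the loop above, written against the `racing_mdp` table from the earlier slide (terminal states have no actions and keep value 0). This is an illustrative implementation, not the course’s reference code; the default γ, ε, and iteration cap are assumptions.

```python
def value_iteration(mdp, gamma=0.9, epsilon=1e-6, max_iters=1000):
    V = {s: 0.0 for s in mdp}                        # V0(s) = 0 for all s
    for _ in range(max_iters):
        # Bellman backup: Q_{k+1}(s,a) = sum_s' T(s,a,s') [R(s,a,s') + gamma V_k(s')]
        Q = {s: {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                 for a, outs in mdp[s].items()}
             for s in mdp}
        # V_{k+1}(s) = max_a Q_{k+1}(s,a); terminal states keep value 0
        V_new = {s: (max(Q[s].values()) if Q[s] else 0.0) for s in mdp}
        if max(abs(V_new[s] - V[s]) for s in mdp) < epsilon:   # convergence test
            return V_new
        V = V_new
    return V

# e.g. value_iteration(racing_mdp, gamma=0.9)
#  -> approximately {'cool': 15.5, 'warm': 14.5, 'overheated': 0.0}
```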

50. Example: Bellman Backup
Assume γ ≈ 1. Suppose V1(s1) = 0, V1(s2) = 1, V1(s3) = 2, and that from s: a1 reaches s1 with reward 2; a2 reaches s2 (prob 0.9) or s3 (prob 0.1) with reward 5; a3 reaches s3 with reward 4.5.
- Q1(s, a1) = 2 + γ·0 ≈ 2
- Q1(s, a2) = 5 + γ·(0.9·1 + 0.1·2) ≈ 6.1
- Q1(s, a3) = 4.5 + γ·2 ≈ 6.5
- V2(s) = max = 6.5; the greedy action is a3

51. Example: Value Iteration
- Assume no discount (γ = 1) to keep the math simple!
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]
- Vk+1(s) = max_a Qk+1(s, a)

52. Example: Value Iteration
- V0 = (0, 0, 0) for (cool, warm, overheated)
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)

53. Example: Value Iteration
- Starting from V0 = (0, 0, 0), compute Q1(s, a) for each state and action; overheated is terminal, so its value stays 0
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)

54. Example: Value Iteration
- For the warm state: Q1(warm, slow) = ½(1 + 0) + ½(1 + 0) = 1 and Q1(warm, fast) = -10 + 0 = -10, so V1(warm) = 1
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)

55. Example: Value Iteration
- For the cool state: Q1(cool, slow) = 1·(1 + 0) = 1 and Q1(cool, fast) = ½(2 + 0) + ½(2 + 0) = 2, so V1(cool) = 2
- Thus V1 = (2, 1, 0)
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)

56. Example: Value Iteration
- Q1 values: cool (1, 2); warm (1, -10); overheated: none
- Repeating the backup with V1 = (2, 1, 0): Q2(cool, slow) = 1 + 2 = 3 and Q2(cool, fast) = ½(2 + 2) + ½(2 + 1) = 3.5; Q2(warm, slow) = ½(1 + 2) + ½(1 + 1) = 2.5 and Q2(warm, fast) = -10
- So V2 = (3.5, 2.5, 0)
- Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)

57-70. Value Iteration on Grid World: k = 0, 1, 2, …, 12, and 100
(Noise = 0.2, Discount = 0.9, Living reward = 0; each slide shows the grid of values Vk after k iterations.)
Note (k = 1): if the agent is in (4,3), it has only one legal action, get jewel; it receives the reward and the game is over. If the agent is in the pit, it has only one legal action, die; it receives the penalty and the game is over. The agent does NOT get a reward for moving INTO (4,3).

71. VI: Policy Extraction

72. Computing Actions from Values
- Let’s imagine we have the optimal values V*(s). How should we act?
- In general, it’s not obvious! We need to do a mini-expectimax (one step):
  π*(s) = argmax_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
- This is called policy extraction, since it gets the policy implied by the values

73. Computing Actions from Q-Values
- Let’s imagine we have the optimal q-values Q*(s, a). How should we act?
- Completely trivial to decide: π*(s) = argmax_a Q*(s, a)
- Important lesson: actions are easier to select from q-values than from values!
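Both extraction rules, sketched in Python against the same `mdp[state][action] = [(prob, next_state, reward), ...]` table format assumed earlier; the function names are mine, not the course’s.

```python
def extract_policy_from_values(mdp, V, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')] (one-step expectimax)."""
    policy = {}
    for s, actions in mdp.items():
        if actions:                                   # skip terminal states
            policy[s] = max(actions,
                            key=lambda a: sum(p * (r + gamma * V[s2])
                                              for p, s2, r in actions[a]))
    return policy

def extract_policy_from_q(Q):
    """pi*(s) = argmax_a Q*(s,a) -- no model or lookahead needed."""
    return {s: max(qs, key=qs.get) for s, qs in Q.items() if qs}
```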

74. Value Iteration - Recap
- Forall s, initialize V0(s) = 0 (no time steps left means an expected reward of zero)
- Repeat (k += 1) for all states s and all actions a:
  Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)
- Until |Vk+1(s) – Vk(s)| < ε for all s (“convergence”)
- Theorem: value iteration will converge to the unique optimal values

75. Convergence*
- How do we know the Vk vectors will converge?
- Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
- Case 2: If the discount is less than 1
- Sketch: for any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
- The max difference happens if there is a big reward at the (k+1)-th level; that last layer is at best all Rmax, but everything that far out is discounted by γ^k
- So Vk and Vk+1 differ by at most γ^k max|R|, and as k increases, the values converge

76. Value Iteration - Recap
- Forall s, initialize V0(s) = 0; repeat Bellman backups for all states s and actions a until |Vk+1(s) – Vk(s)| < ε for all s:
  Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)
- Complexity of each iteration?

77. Value Iteration - Recap
- Forall s, initialize V0(s) = 0; repeat Bellman backups for all states s and actions a until |Vk+1(s) – Vk(s)| < ε for all s:
  Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)
- Complexity of each iteration: O(S²A)
- Number of iterations: poly(|S|, |A|, 1/(1-γ))

78. Value Iteration as Successive Approximation
- The Bellman equations characterize the optimal values: V*(s) = max_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
- Value iteration computes them: Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)
- Value iteration is just a fixed-point solution method, computed using dynamic programming
- … though the Vk vectors are also interpretable as time-limited values

79. Problems with Value Iteration
- Value iteration repeats the Bellman updates: Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ];  Vk+1(s) = max_a Qk+1(s, a)
- Problem 1: It’s slow – O(S²A) per iteration
- Problem 2: The “max” at each state rarely changes
- Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2)]

80. VI → Asynchronous VI
- Is it essential to back up all states in each iteration? No!
- States may be backed up many times or not at all, in any order
- As long as no state gets starved, the convergence properties still hold!

81. Prioritization of Bellman Backups
- Are all backups equally important?
- Can we avoid some backups?
- Can we schedule the backups more appropriately?

82-84. Grid World value iteration snapshots for k = 1, 2, 3 (Noise = 0.2, Discount = 0.9, Living reward = 0)

85. Asynch VI: Prioritized Sweeping
- Why back up a state if the values of its successors are unchanged?
- Prefer backing up the state whose successors changed the most
- Keep a priority queue of (state, expected change in value) and back up states in priority order
- After backing up state s’, update the priority queue for all of its predecessors (i.e., all states from which an action can reach s’)
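One common way to realize this idea, sketched in Python over the same MDP table format. The particular priority definition (the size of a one-step value change) and the `theta` cutoff are choices I am assuming for illustration, not necessarily the exact variant on the slide.

```python
import heapq

def backup(mdp, V, s, gamma):
    """One Bellman backup at s under the current values V."""
    if not mdp[s]:                                   # terminal state
        return V[s]
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
               for outs in mdp[s].values())

def prioritized_sweeping(mdp, gamma=0.9, theta=1e-4, max_backups=10000):
    V = {s: 0.0 for s in mdp}
    preds = {s: set() for s in mdp}                  # predecessors of each state
    for s, actions in mdp.items():
        for outs in actions.values():
            for p, s2, _ in outs:
                if p > 0:
                    preds[s2].add(s)
    # priority = how much a backup would change the state's value right now
    pq = [(-abs(backup(mdp, V, s, gamma) - V[s]), s) for s in mdp]
    heapq.heapify(pq)
    for _ in range(max_backups):
        if not pq:
            break
        neg_prio, s = heapq.heappop(pq)
        if -neg_prio < theta:                        # nothing urgent left
            break
        V[s] = backup(mdp, V, s, gamma)              # back up the most urgent state
        for sp in preds[s]:                          # re-prioritize its predecessors
            change = abs(backup(mdp, V, sp, gamma) - V[sp])
            if change > theta:
                heapq.heappush(pq, (-change, sp))
    return V
```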

86. Residual with respect to a Value Function V (ResV)
- Residual at s with respect to V: the magnitude of the change |ΔV(s)| after one Bellman backup at s
- Residual with respect to V: the max residual, ResV = max_s ResV(s)
- ResV < ε is called ε-consistency

87. (General) Asynchronous VI

88. Asynchronous VI
- Pros? Cons?

89. Async VI: Real-Time Dynamic Programming [Barto, Bradtke, Singh ’95]
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup just on the visited states
- RTDP: repeat trials until the value function converges
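A sketch of a single RTDP trial under those two bullet points, assuming the same MDP table format and that terminal states have no actions. The step limit and sampling details are illustrative assumptions.

```python
import random

def rtdp_trial(mdp, V, start, gamma=0.9, max_steps=100):
    """Simulate the greedy policy from `start`, backing up only visited states (in place)."""
    s = start
    for _ in range(max_steps):
        if not mdp[s]:                               # reached a terminal state
            break
        # Q-values at s under the current value estimates
        Q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
             for a, outs in mdp[s].items()}
        a = max(Q, key=Q.get)                        # greedy action
        V[s] = Q[a]                                  # Bellman backup on the visited state only
        # sample the next state from T(s, a, .)
        probs, nexts = zip(*[(p, s2) for p, s2, _ in mdp[s][a]])
        s = random.choices(nexts, weights=probs)[0]
    return V
```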

90. RTDP Trial
[Diagram: from s0, compute Qn+1(s0, a) for each action a1, a2, a3 using the current values Vn of the successors; set Vn+1(s0) to the max, take the greedy action (here a_greedy = a2), and continue toward the goal]

91. Solving MDPs
- Value Iteration
- Policy Iteration
- Reinforcement Learning

92. Policy Methods
- Policy Iteration = Policy Evaluation + Policy Improvement

93. Part 1 - Policy Evaluation

94. Fixed Policies
- Expectimax trees max over all actions to compute the optimal values (“do the optimal action”)
- If we fix some policy π(s), then the tree becomes simpler – only one action per state (“do what π says to do”)
- … though the tree’s value would depend on which policy we fixed

95. Computing Utilities for a Fixed Policy
- A new basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
- Define the utility of a state s under a fixed policy π: Vπ(s) = expected total discounted rewards starting in s and following π
- Recursive relation (a variation of the Bellman equation): Vπ(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ(s’) ]

96. Example: Policy Evaluation
- Always Go Right vs. Always Go Forward

97. Example: Policy Evaluation
- Always Go Right vs. Always Go Forward

98. Policy Evaluation Algorithm
- How do we calculate the V’s for a fixed policy π?
- Idea 1: Turn the recursive Bellman equations into updates (like value iteration): Vπk+1(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπk(s’) ]; efficiency is O(S²) per iteration
- Idea 2: Without the maxes, the Bellman equations are just a linear system; solve with Matlab (or your favorite linear system solver)
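Both ideas sketched in Python for the MDP table format assumed earlier, with `pi` mapping each non-terminal state to an action. The linear-system version solves (I − γ T_π) V = R_π with NumPy; this is an illustrative sketch, not the course’s code.

```python
import numpy as np

def evaluate_policy_iterative(mdp, pi, gamma=0.9, epsilon=1e-8):
    """Idea 1: iterate the fixed-policy Bellman update until convergence."""
    V = {s: 0.0 for s in mdp}
    while True:
        V_new = {s: (sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][pi[s]])
                     if s in pi else 0.0)
                 for s in mdp}
        if max(abs(V_new[s] - V[s]) for s in mdp) < epsilon:
            return V_new
        V = V_new

def evaluate_policy_linear(mdp, pi, gamma=0.9):
    """Idea 2: solve the linear system (I - gamma * T_pi) V = R_pi directly."""
    states = list(mdp)
    idx = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    R = np.zeros(len(states))
    for s in states:
        if s not in pi:                              # terminal state: V = 0
            continue
        for p, s2, r in mdp[s][pi[s]]:
            T[idx[s], idx[s2]] += p
            R[idx[s]] += p * r
    V = np.linalg.solve(np.eye(len(states)) - gamma * T, R)
    return dict(zip(states, V))

# e.g. evaluate_policy_linear(racing_mdp, {"cool": "slow", "warm": "slow"}, gamma=0.9)
```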

99. Part 2 - Policy Iteration

100. Policy Iteration
- Initialize π(s) to random actions
- Repeat:
- Step 1: Policy evaluation: calculate the utilities of π at each s (using a nested loop)
- Step 2: Policy improvement: update the policy using a one-step look-ahead: “For each s, what’s the best action I could execute, assuming I then follow π?” Let π’(s) = this best action; set π = π’
- Until the policy doesn’t change

101. Policy Iteration Details
- Let i = 0; initialize πi(s) to random actions
- Repeat:
- Step 1: Policy evaluation: initialize k = 0 and V0^πi(s) = 0 for all s; repeat until Vπi converges: for each state s, Vk+1^πi(s) = Σs’ T(s, πi(s), s’) [ R(s, πi(s), s’) + γ Vk^πi(s’) ], then let k += 1
- Step 2: Policy improvement: for each state s, πi+1(s) = argmax_a Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vπi(s’) ]
- If πi == πi+1 then it’s optimal; return it. Else let i += 1
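A compact sketch of the full loop, reusing `evaluate_policy_iterative` from the policy-evaluation sketch above and the same MDP table format. It initializes with an arbitrary (rather than random) action per state, which is one assumption made here for simplicity.

```python
def policy_iteration(mdp, gamma=0.9):
    # initialize pi(s) to an arbitrary action in each non-terminal state
    pi = {s: next(iter(actions)) for s, actions in mdp.items() if actions}
    while True:
        V = evaluate_policy_iterative(mdp, pi, gamma)          # Step 1: evaluation
        improved = {}
        for s, actions in mdp.items():                         # Step 2: improvement
            if actions:
                improved[s] = max(actions,
                                  key=lambda a: sum(p * (r + gamma * V[s2])
                                                    for p, s2, r in actions[a]))
        if improved == pi:                                      # policy unchanged: optimal
            return pi, V
        pi = improved

# e.g. policy_iteration(racing_mdp, gamma=0.9)[0] == {'cool': 'fast', 'warm': 'slow'}
```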

102. Example
- Initialize π0 to “always go right”
- Perform policy evaluation, then policy improvement, iterating through the states
- Has the policy changed? Yes! i += 1

103. Example
- π1 says “always go up”
- Perform policy evaluation, then policy improvement, iterating through the states
- Has the policy changed? No! We have the optimal policy

104. Example: Policy Evaluation
- Always Go Right vs. Always Go Forward

105. Policy Iteration Properties
- Policy iteration finds the optimal policy, guaranteed!
- It often converges (much) faster than value iteration

106. Comparison
- Both value iteration and policy iteration compute the same thing (all optimal values)
- In value iteration: every iteration updates both the values and (implicitly) the policy; we don’t track the policy, but taking the max over actions implicitly recomputes it. What is the space being searched?
- In policy iteration: we do fewer iterations, but each one is slower (we must update all Vπ and then choose a new best π). What is the space being searched?
- Both are dynamic programs for solving MDPs

107. Summary So Far: MDP Algorithms
So you want to…
- Compute optimal values: use value iteration or policy iteration
- Compute values for a particular policy: use policy evaluation
- Turn your values into a policy: use policy extraction (one-step lookahead)
These all look the same!
- They all use variations of Bellman updates
- They all use one-step lookahead expectimax fragments
- They differ in whether we plug in a fixed policy or max over actions, and in whether we search the finite (policy) space or the infinite (real-valued value function) space

108. Double Bandits

109. Double-Bandit MDP
- Actions: Blue, Red; states: Win, Lose
- Blue pays $1 (with probability 1.0) from either state
- Red pays $2 with probability 0.75 and $0 with probability 0.25 from either state
- No discount; 100 time steps; both states have the same value

110. Offline Planning
- Solving MDPs is offline planning: you determine all quantities through computation
- You need to know the details of the MDP, but you do not actually play the game!
- With no discount and 100 time steps, both states have the same value: playing Red is worth 150 and playing Blue is worth 100
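The arithmetic behind those two numbers, written out as a quick check.

```python
# Known probabilities, no discount, 100 steps: expected value per episode.
steps = 100
value_red  = (0.75 * 2 + 0.25 * 0) * steps   # 150.0
value_blue = 1.0 * 1 * steps                 # 100
print(value_red, value_blue)
```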

111. Let’s Play!
- $2 $2 $0 $2 $2 $2 $2 $0 $0 $0

112. Online Planning
- Rules changed! Red’s win chance is different.
- [Diagram: the same Win/Lose MDP, but Red’s payoff probabilities are now unknown (shown as ??)]

113. Let’s Play!
- $0 $0 $0 $2 $0 $2 $0 $0 $0 $0

114. Policy Evaluation
- How do we calculate the V’s for a fixed policy π?
- Idea 1: Turn the recursive Bellman equations into updates (like value iteration); efficiency is O(S²) per iteration
- Idea 2: Without the maxes, the Bellman equations are just a linear system; solve with Matlab (or your favorite linear system solver)

115. What Just Happened?
- That wasn’t planning, it was learning! Specifically, reinforcement learning
- There was an MDP, but you couldn’t solve it with just computation; you needed to actually act to figure it out
- Important ideas in reinforcement learning that came up:
- Exploration: you have to try unknown actions to get information
- Exploitation: eventually, you have to use what you know
- Regret: even if you learn intelligently, you make mistakes
- Sampling: because of chance, you have to try things repeatedly
- Difficulty: learning can be much harder than solving a known MDP

116. Next Time: Reinforcement Learning!

117. Asynchronous Value Iteration*
- In value iteration, we update every state in each iteration
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often
- In fact, we can update the policy as seldom or as often as we like, and we will still converge
- Idea: update states whose value we expect to change: if |Vk+1(s) − Vk(s)| is large, then update the predecessors of s

118. Interlude