UE2 Dynamic Decision - PowerPoint Presentation


Presentation Transcript

1. UE2 Dynamic Decision Processes
Instructor: Prof. Xiaolan Xie
Schedule:
Date    Time            Where
10-nov  13:30 – 18:00   S224
17-nov  13:30 – 18:30   S224
18-nov  17:00 – 18:00   S224
05-jan  13:30 – 16:45   224, exam (on courses of me + SDP)
Course materials on: http://www.emse.fr/~xie/master

2. UE2 Dynamic Decision Processes
Learning objectives:
- Be able to model practical dynamic decision problems
- Understand decision policies
- Understand the principle of optimality
- Understand the relation between discounted and average-cost criteria
- Derive structural properties of decisions from the optimality equation
Textbooks:
- C. Cassandras and S. Lafortune, Introduction to Discrete Event Systems, Springer, 2007
- Martin Puterman, Markov Decision Processes, John Wiley & Sons, 1994
- D.P. Bertsekas, Dynamic Programming, Prentice Hall, 1987

3. Plan
- Dynamic programming
- Introduction to Markov decision processes
- Markov decision processes formulation
- Discounted Markov decision processes
- Average cost Markov decision processes
- Continuous-time Markov decision processes

4. Dynamic programming
- Basic principle of dynamic programming
- Some applications
- Stochastic dynamic programming

5. Dynamic programming
- Basic principle of dynamic programming
- Some applications
- Stochastic dynamic programming

6. Introduction
Dynamic programming (DP) is a general optimization technique based on implicit enumeration of the solution space.
The problem should have a particular sequential structure, so that the unknowns can be determined sequentially.
It is based on the "principle of optimality".
A wide range of problems can be put in sequential form and solved by dynamic programming.

7. Introduction
Applications:
- Optimal control
- Most problems in graph theory
- Investment
- Deterministic and stochastic inventory control
- Project scheduling
- Production scheduling
We limit ourselves to discrete optimization.

8. Illustration of DP by a shortest path problem
Problem: We are planning the construction of a highway from city A to city K. The construction alternatives and their costs are given as a graph on nodes A, B, C, D, E, F, G, H, I, J, K (figure on slide). The problem consists in determining the highway with minimum total cost.

9. BELLMAN's principle of optimality
General form: if C belongs to an optimal path from A to B, then the sub-paths from A to C and from C to B are also optimal; in other words, every sub-path of an optimal path is optimal (figure on slide).
Corollary: SP(x0, y) = min { SP(x0, z) + l(z, y) | z : predecessor of y }

10. Solving a problem by DP
1. Extension
Extend the problem to a family of problems of the same nature (SP(x)).
2. Recursive formulation (application of the principle of optimality)
Link the optimal solutions of these problems by a recursive relation.
3. Decomposition into steps or phases
Define the order of resolution of the problems in such a way that, when solving a problem P, the optimal solutions of all other problems needed for the computation of P are already known.
4. Computation by steps

11. Solving a problem by DP
Difficulties in using dynamic programming:
- identification of the family of problems;
- transformation of the problem into a sequential form.

12. Shortest path in an acyclic graph
- Problem setting: find a shortest path from x0 (root of the graph) to a given node y0
- Extension: find a shortest path from x0 to any node y, denoted SP(x0, y)
- Recursive formulation: SP(y) = min { SP(z) + l(z, y) : z predecessor of y }
- Decomposition into steps: at each step k, consider only nodes y with unknown SP(y) but for which SP is known for all predecessors
- Compute SP(y) step by step
Remarks:
- It is a backward dynamic programming recursion
- It is also possible to solve this problem by forward dynamic programming
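A minimal sketch of this recursion in Python, under the assumption of an acyclic graph in which every node is reachable from the root. The node names and edge costs in the example are illustrative assumptions, not the highway data of slide 8.

```python
# Sketch of SP(y) = min { SP(z) + l(z, y) : z predecessor of y } on an acyclic graph.
def shortest_paths(edges, root):
    """edges: dict (z, y) -> cost l(z, y). Returns SP(y) and a best-predecessor map."""
    nodes = {root} | {y for (_, y) in edges} | {z for (z, _) in edges}
    preds = {y: [] for y in nodes}
    for (z, y), cost in edges.items():
        preds[y].append((z, cost))

    SP, best_pred = {root: 0.0}, {}
    # Decomposition into steps: a node is ready once all its predecessors are solved.
    while len(SP) < len(nodes):
        for y in nodes:
            if y in SP or any(z not in SP for z, _ in preds[y]):
                continue
            z, cost = min(preds[y], key=lambda zc: SP[zc[0]] + zc[1])
            SP[y], best_pred[y] = SP[z] + cost, z
    return SP, best_pred

if __name__ == "__main__":
    edges = {("A", "B"): 8, ("A", "C"): 10, ("B", "D"): 7, ("C", "D"): 3, ("D", "E"): 5}
    SP, pred = shortest_paths(edges, "A")
    print(SP["E"], pred)   # SP(E) = 18 via A -> C -> D -> E
```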

13. DP from a control point of view
Consider the control of a discrete-time dynamic system, with costs generated over time depending on the states and the control actions (figure: state t, action, cost, state t+1, from the present decision epoch to the next decision epoch).

14. DP from a control point of view
System dynamics:
xt+1 = ft(xt, ut), t = 0, 1, ..., N-1
where
t : time index
xt : state of the system
ut : control action to decide at time t

15. DP from a control point of view
Criterion to optimize (equation on slide): total cost over the horizon, of the form
gN(xN) + sum over t = 0, ..., N-1 of gt(xt, ut)
where gN is the terminating cost.

16. DP from a control point of view
Value function or cost-to-go function (equation on slide): Jt(xt) = minimal cost that can be accumulated from time t to N when starting from state xt.

17. DP from a control point of view
Optimality equation or Bellman equation (equation on slide), of the form:
Jt(xt) = min over ut of [ gt(xt, ut) + Jt+1(ft(xt, ut)) ], with JN(xN) = gN(xN).

18. Applications
- Single machine scheduling (knapsack)
- Inventory control
- Traveling salesman problem

19. Applications
Single machine scheduling (knapsack)
Problem: Consider a set of N production requests, each needing a production time ti on a bottleneck machine and generating a profit pi. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in order to maximize the total profit.
Formulation:
max sum of pi·Xi
subject to: sum of ti·Xi <= C, Xi in {0, 1}

20. Knapsack Problem
Problem: Mr Radin can take 7 kg without paying an over-weight fee on his return flight. He decides to take advantage of it and looks for local products that he can sell at home for an extra gain. He selects the n most interesting objects, weighs each of them, and bargains the prices.
Which objects should he buy in order to maximize his gain?
Object (i)          1  2  3  4  5  6
Weight (wi)         2  1  1  3  2  1
Expected gain (ri)  8  5  5  6  3  2

21. Knapsack Problem
Generic formulation:
- Time = 1, …, 7
- State st = remaining capacity for objects t, t+1, …
- State space = {0, 1, 2, …, 7}
- Action at time t = select or not object t
- Action space At(s) = {1=YES, 0=NO} if s >= wt, and = {0} if s < wt
- Immediate gain at time t: gt(st, ut) = rt if YES, = 0 if NO
- State transition or system dynamics: st+1 = st – wt if YES, = st if NO

22. Knapsack Problem
Value function: Jn(s) = maximal gain from objects n, n+1, …, 6 with a remaining capacity of s kg.
Optimality equation (equation on slide), of the form:
Jn(s) = max { Jn+1(s), rn + Jn+1(s – wn) if s >= wn }, with J7(s) = 0.
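A minimal sketch of this backward recursion in Python, using the object data of slide 20. It reproduces the maximal total gain J1(7) = 24 reported in the table of slide 23.

```python
# Backward recursion of slide 22 on the data of slide 20; J[7][s] = 0 is the terminal condition.
w = {1: 2, 2: 1, 3: 1, 4: 3, 5: 2, 6: 1}    # weights wi (kg)
r = {1: 8, 2: 5, 3: 5, 4: 6, 5: 3, 6: 2}    # expected gains ri
CAP = 7

J = {7: {s: 0 for s in range(CAP + 1)}}
action = {}
for n in range(6, 0, -1):                    # stages n = 6, ..., 1
    J[n], action[n] = {}, {}
    for s in range(CAP + 1):
        no = J[n + 1][s]                                         # do not take object n
        yes = r[n] + J[n + 1][s - w[n]] if s >= w[n] else None   # take it if feasible
        best, act = no, "N"
        if yes is not None and yes > no:
            best, act = yes, "Y"
        J[n][s], action[n][s] = best, act

print(J[1][CAP])                              # 24, as in the table of slide 23
print({s: action[1][s] for s in range(CAP + 1)})
```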

23. Knapsack Problem
Backward computation, stages n = 6, …, 1 with (wn, rn) = (1,2), (2,3), (3,6), (1,5), (1,5), (2,8) for n = 6, …, 1, and J7(s) = 0:
state s  J6(s)  J5(s)  J4(s)   J3(s)   J2(s)   J1(s)   (optimal action in parentheses)
0        0(N)   0(N)   0(N)    0(N)    0(N)    0(N)
1        2(Y)   2(N)   2(N)    5(Y)    5(N)    5(N)
2        2(Y)   3(Y)   3(N)    7(Y)    10(Y)   10(N)
3        2(Y)   5(Y)   6(Y)    8(Y)    12(Y)   13(Y)
4        2(Y)   5(Y)   8(Y)    11(Y)   13(Y)   18(Y)
5        2(Y)   5(Y)   9(Y)    13(Y)   16(Y)   20(Y)
6        2(Y)   5(Y)   11(Y)   14(Y)   18(Y)   21(Y)
7        2(Y)   5(Y)   11(Y)   16(Y)   19(Y)   24(Y)
The lower half of the slide compares, for each (state, stage), the YES and NO candidate values, with -1 marking an infeasible action. The maximal total gain with the 7 kg capacity is J1(7) = 24.

24. Knapsack Problem
Control map or control policy (rows = state s, columns = stage 1, …, 6):
state  1  2  3  4  5  6
0      N  N  N  N  N  N
1      N  N  Y  N  N  Y
2      N  Y  Y  N  Y  Y
3      Y  Y  Y  Y  Y  Y
4      Y  Y  Y  Y  Y  Y
5      Y  Y  Y  Y  Y  Y
6      Y  Y  Y  Y  Y  Y
7      Y  Y  Y  Y  Y  Y

25. Applications
Inventory control
Problem: determine the purchasing quantity at the beginning of each period in order to minimize the total expense.
Unit price and demand per period are given (table on slide).
Storage capacity 5 (in 000), initial stock = 0
Fixed order cost K = 20 (00$)
Unit inventory holding cost h = 1 (00$)

26. Applications
Inventory control
Generic formulation:
- Time = 1, …, 7
- State st = inventory at the beginning of period t
- State space = {0, 1, 2, …, 5}
- Action at time t = purchasing quantity ut of period t
- Action space A(st) = {max(0, dt – st), …, 5 + dt – st}
- Immediate cost at time t: gt(st, ut) = K + pt·ut + ht·(st + ut – dt) if ut > 0, = ht·(st + ut – dt) otherwise
- State transition or system dynamics: st+1 = st + ut – dt

27. Applications
Inventory control
Value function: Jn(s) = minimal total cost over periods n, n+1, …, 6 when starting with inventory s at the beginning of period n.
Optimality equation (equation on slide), of the form: Jn(s) = min over feasible un of [ gn(s, un) + Jn+1(s + un – dn) ].

28. Applications
Traveling salesman problem
Problem: Data: a graph with N nodes and a distance matrix [dij] between any two nodes i and j.
Question: determine a circuit of minimum total distance passing through each node exactly once.
Extension: C(y, S) = shortest path from y to x0 passing exactly once through each node in S.
Application: machine scheduling with setups.

29. Applications
Total tardiness minimization on a single machine
Job                 1  2  3
Due date di         5  6  5
Processing time pi  3  2  4
Weight wi           3  1  2

30. Stochastic dynamic programming
Model
Consider the control of a discrete-time stochastic dynamic system, with costs generated over time (figure: state t, action, perturbation, stage cost, state t+1, from the present decision epoch to the next decision epoch).

31. Stochastic dynamic programming
Model
System dynamics:
xt+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1
where
t : time index
xt : state of the system
ut : decision at time t
wt : random perturbation

32. Stochastic dynamic programming
Model
Criterion (equation on slide): expected total cost over the horizon, E[ gN(xN) + sum over t of gt(xt, ut, wt) ].

33. Stochastic dynamic programming
Example
Consider the problem of ordering a quantity of a certain item at each of N periods so as to meet a stochastic demand, while minimizing the expected cost incurred.
xt : stock available at the beginning of period t
ut : quantity ordered at the beginning of period t
wt : random demand during period t, with given probability distribution
xt+1 = xt + ut – wt

34. Stochastic dynamic programming
Example
Cost:
- purchasing cost c·ut
- inventory cost r(xt + ut – wt)
Total cost: expected sum of these stage costs over the horizon (equation on slide).
Inventory system: stock xt at period t, order quantity ut, random demand wt, stock at period t+1: xt+1 = xt + ut – wt

35. Stochastic dynamic programming
Model
Open-loop control: order quantities u1, u2, ..., uN-1 are determined once at time 0.
Closed-loop control: the order quantity ut of each period is determined dynamically with the knowledge of the state xt.

36. Stochastic dynamic programming
Control policy
The rule for selecting, at each period t, a control action ut for each possible state xt.
Examples of inventory control policies:
- Order a constant quantity ut = E[wt]
- Order-up-to policy: ut = St – xt if xt <= St, ut = 0 if xt > St, where St is a constant order-up-to level.

37. Stochastic dynamic programming
Optimal control policy
Mathematically, in closed-loop control, we want to find a sequence of functions mt, t = 0, ..., N-1, mapping the state xt into a control ut, so as to minimize the total expected cost.
The sequence p = {m0, ..., mN-1} is called a policy.

38. Stochastic dynamic programming
Optimal control
Cost of a given policy p = {m0, ..., mN-1}: the expected total cost Jp(x0) (equation on slide).
Optimal control: minimize Jp(x0) over all possible policies p.

39. Stochastic dynamic programming
State transition probabilities
State transition probability:
pij(u, t) = P{xt+1 = j | xt = i, ut = u}
depending on the control policy.

40. Stochastic dynamic programming
Basic problem
A discrete-time dynamic system: xt+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1
Finite state space: st in St
Finite control space: ut in Ct
Control policy p = {m0, ..., mN-1} with ut = mt(xt)
State-transition probability: pij(u)
Stage cost: gt(xt, mt(xt), wt)

41. Stochastic dynamic programming
Basic problem
Expected cost of a policy (equation on slide).
The optimal control policy p* is the policy with minimal cost: J*(x) = min over p in P of Jp(x), where P is the set of all admissible policies.
J*(x) : optimal cost function or optimal value function.

42. Stochastic dynamic programming
Principle of optimality
Let p* = {m*0, ..., m*N-1} be an optimal policy for the basic problem over the N time periods. Then the truncated policy {m*i, ..., m*N-1} is optimal for the subproblem of minimizing the total cost (called the cost-to-go function) from time i to time N, starting from state xi at time i.

43. Stochastic dynamic programming
DP algorithm
Theorem: For every initial state x0, the optimal cost J*(x0) of the basic problem is equal to J0(x0), given by the last step of the following algorithm, which proceeds backward in time from period N-1 to period 0 (equations (A) and (B) on slide), of the form:
JN(xN) = gN(xN)   (A)
Jt(xt) = min over ut of E over wt of [ gt(xt, ut, wt) + Jt+1(ft(xt, ut, wt)) ]   (B)
Furthermore, if u*t = m*t(xt) minimizes the right side of Eq. (B) for each xt and t, the policy p* = {m*0, ..., m*N-1} is optimal.

44. Stochastic dynamic programming
Example
Consider the inventory control problem with the following data:
- Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt}
- The inventory capacity is 2, i.e. xt + ut <= 2
- The inventory holding/shortage cost is (xt + ut – wt)^2
- Unit ordering cost is a, i.e. gt(xt, ut, wt) = a·ut + (xt + ut – wt)^2
- N = 3 and the terminal cost gN(xN) = 0
- Demand: P(wt = 0) = 0.1, P(wt = 1) = 0.2, P(wt = 2) = 0.7

45. Stochastic dynamic programming
Example
Generic formulation:
- Time = {1, 2, 3, 4=end}
- State xt = inventory level at the beginning of a period
- State space = {0, 1, 2}
- Action ut = order quantity of period t
- Action space = {0, 1, …, 2 – xt}
- Perturbation dt = demand of period t
- Immediate cost = a·ut + (xt + ut – dt)^2
- System dynamics: xt+1 = max{0, xt + ut – dt}

46. Stochastic dynamic programming
Example
Value function: Jn(s) = minimal expected total cost over periods n, n+1, …, 3 when starting with inventory s at the beginning of period n.
Optimality equation (equation on slide), of the form:
Jn(s) = min over u in {0, …, 2–s} of E over w of [ a·u + (s + u – w)^2 + Jn+1(max{0, s + u – w}) ], with J4(s) = 0.
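A minimal sketch (not from the slides) of this backward recursion in Python for a = 0.25; it reproduces the numbers of slides 47-51, e.g. J3(0) = 1.05 and J1(0) = 3.055.

```python
# Backward recursion of slide 46 with a = 0.25 and the demand law of slide 44.
a = 0.25
demand = {0: 0.1, 1: 0.2, 2: 0.7}           # P(w = 0), P(w = 1), P(w = 2)
states = range(3)                            # inventory levels 0, 1, 2

J = {4: {s: 0.0 for s in states}}            # terminal cost J4(s) = 0
policy = {}
for n in (3, 2, 1):                          # backward in time
    J[n], policy[n] = {}, {}
    for s in states:
        best_u, best_cost = None, float("inf")
        for u in range(0, 2 - s + 1):        # feasible orders: s + u <= 2
            cost = sum(p * (a * u + (s + u - w) ** 2 + J[n + 1][max(0, s + u - w)])
                       for w, p in demand.items())
            if cost < best_cost:
                best_u, best_cost = u, cost
        J[n][s], policy[n][s] = best_cost, best_u

print({s: round(J[1][s], 3) for s in states})   # {0: 3.055, 1: 2.805, 2: 2.555}
print(policy)                                    # order quantities per period and state
```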

47. Stochastic dynamic programming
Example – Immediate cost (a = 0.25)
g(s, u, w) = 0.25u + (s + u – w)^2
mean stage cost = 0.1·g(s,u,0) + 0.2·g(s,u,1) + 0.7·g(s,u,2)
(s, u)   w=0 (p=0.1)  w=1 (p=0.2)  w=2 (p=0.7)  mean
(0, 0)   0            1            4            3
(0, 1)   1.25         0.25         1.25         1.05
(0, 2)   4.5          1.5          0.5          1.1
(1, 0)   1            0            1            0.8
(1, 1)   4.25         1.25         0.25         0.85
(2, 0)   4            1            0            0.6
Myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)

48. Stochastic dynamic programming
Example – Period-3 problem
Stage cost 0.25u + (s + u – w)^2 + remaining cost J4((s + u – w)+), with period-4 optimal cost J4(s') = 0 for s' = 0, 1, 2.
(s, u)   w=0    w=1    w=2    mean total
(0, 0)   0      1      4      3
(0, 1)   1.25   0.25   1.25   1.05
(0, 2)   4.5    1.5    0.5    1.1
(1, 0)   1      0      1      0.8
(1, 1)   4.25   1.25   0.25   0.85
(2, 0)   4      1      0      0.6

49. Stochastic dynamic programming
Example – Periods 2+3 problem
Stage cost 0.25u + (s + u – w)^2 + remaining cost J3((s + u – w)+), with period-3 optimal costs J3(0) = 1.05, J3(1) = 0.8, J3(2) = 0.6.
(s, u)   w=0    w=1    w=2    mean total
(0, 0)   1.05   2.05   5.05   4.05
(0, 1)   2.05   1.3    2.3    2.075
(0, 2)   5.1    2.3    1.55   2.055
(1, 0)   1.8    1.05   2.05   1.825
(1, 1)   4.85   2.05   1.3    1.805
(2, 0)   4.6    1.8    1.05   1.555

50. Stochastic dynamic programming
Example – Periods 1+2+3 problem
Stage cost 0.25u + (s + u – w)^2 + remaining cost J2((s + u – w)+), with period-2 optimal costs J2(0) = 2.055, J2(1) = 1.805, J2(2) = 1.555.
(s, u)   w=0     w=1     w=2     mean total
(0, 0)   2.055   3.055   6.055   5.055
(0, 1)   3.055   2.305   3.305   3.08
(0, 2)   6.055   3.305   2.555   3.055
(1, 0)   2.805   2.055   3.055   2.83
(1, 1)   5.805   3.055   2.305   2.805
(2, 0)   5.555   2.805   2.055   2.555

51. Stochastic dynamic programming
Example – value function & control (optimal policy, a = 0.25)
Stock  3-period policy, stage 1      2-period policy, stage 2      1-period policy, stage 3
       cost-to-go (order quantity)   cost-to-go (order quantity)   cost-to-go (order quantity)
0      3.055 (2)                     2.055 (2)                     1.05 (1)
1      2.805 (1)                     1.805 (1)                     0.8 (0)
2      2.555 (0)                     1.555 (0)                     0.6 (0)

52. Stochastic dynamic programming
Example – Control map or policy
From the long-term policy (s=0, u=2), (s=1, u=1), (s=2, u=0) to the myopic policy (s=0, u=1), (s=1, u=0), (s=2, u=0): long-term to short-term.
Stock   Period-1   Period-2   Period-3
0       2          2          1
1       1          1          0
2       0          0          0

53. Stochastic dynamic programming
Example – Sample paths
Control map or policy:
Stock   Period-1   Period-2   Period-3
0       2          2          1
1       1          1          0
2       0          0          0
Sample paths (demand scenarios (2,1,2), (1,2,1), (0,0,1)); values listed for periods 1 to 4:
Sample path 1: initial stock 0, 0, 1, 0; control 2, 2, 0; demand 2, 1, 2
Sample path 2: initial stock 0, 1, 0, 0; control 2, 1, 1; demand 1, 2, 1
Sample path 3: initial stock 0, 2, 2, 1; control 2, 0, 0; demand 0, 0, 1
Sample path 4: initial stock 2, 0, 1, 0; control 0, 2, 0; demand 2, 1, 2

54. Introduction to Markov decision process

55. Sequential decision model
(figure: present state, action, costs, next state)
Key ingredients:
- A set of decision epochs
- A set of system states
- A set of available actions
- A set of state/action dependent immediate costs
- A set of state/action dependent transition probabilities
Policy: a sequence of decision rules chosen in order to minimize the cost function
Issues:
- Existence of an optimal policy
- Form of the optimal policy
- Computation of the optimal policy

56. Applications
- Inventory management
- Bus engine replacement
- Highway pavement maintenance
- Bed allocation in hospitals
- Personnel staffing in fire departments
- Traffic control in communication networks
- …

57. Example
Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d.
State Xt = stock level. Action at = make or rest.
(figure: transition diagram on states 0, 1, 2, 3, … with "make" transitions at rate p and demand transitions at rate d)

58. Example
Zero stock policy (M/M/1 chain of backorders, states …, -2, -1, 0):
p0 = 1 - r, p-n = r^n · p0, with r = d/p
average cost = b·r/(1 - r)
Hedging point policy with hedging point 1 (states …, -2, -1, 0, 1):
p1 = 1 - r, p-n = r^(n+1) · p1
average cost = h·(1 - r) + r·b·r/(1 - r)
The hedging point policy is better iff h/b < r/(1 - r), where h is the unit holding cost and b the unit backorder cost.
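A quick numeric check of this comparison in Python; the values of p, d, h and b below are illustrative assumptions, not course data.

```python
# Compare zero-stock and hedging-point-1 average costs (slide 58).
p, d = 1.0, 0.6          # production rate, demand rate (illustrative)
h, b = 1.0, 5.0          # unit holding cost, unit backorder cost (illustrative)
rho = d / p

zero_stock_cost = b * rho / (1 - rho)
hedging_1_cost = h * (1 - rho) + rho * b * rho / (1 - rho)

print(round(zero_stock_cost, 3), round(hedging_1_cost, 3))
# Hedging point 1 is better iff h/b < rho/(1 - rho): both tests agree.
print(hedging_1_cost < zero_stock_cost, h / b < rho / (1 - rho))
```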

59. MDP Model formulation
MDP = Markov Decision Process

60. Decision epochs
Times at which decisions are made.
The set T of decision epochs can be either a discrete set or a continuum.
The set T can be finite (finite horizon problem) or infinite (infinite horizon).

61. State and action sets
At each decision epoch, the system occupies a state.
S : the set of all possible system states.
As : the set of allowable actions in state s.
A = union over s in S of As : the set of all possible actions.
S and As can be:
- finite sets
- countably infinite sets
- compact sets

62. Costs and transition probabilities
As a result of choosing action a in As in state s at decision epoch t,
- the decision maker receives a cost Ct(s, a), and
- the system state at the next decision epoch is determined by the probability distribution pt(· | s, a).
If the cost depends on the state at the next decision epoch, then
Ct(s, a) = sum over j in S of Ct(s, a, j)·pt(j | s, a)
where Ct(s, a, j) is the cost if the next state is j.
A Markov decision process is characterized by {T, S, As, pt(· | s, a), Ct(s, a)}.

63. Example of inventory management
Consider the inventory control problem with the following data:
- Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt}
- The inventory capacity is 2, i.e. xt + ut <= 2
- The inventory holding/shortage cost is (xt + ut – wt)^2
- Unit ordering cost is 0.25, i.e. gt(xt, ut, wt) = 0.25·ut + (xt + ut – wt)^2
- N = 3 and the terminal cost gN+1(xN+1) = 0
- Demand: P(wt = 0) = 0.1, P(wt = 1) = 0.7, P(wt = 2) = 0.2

64. Example of inventory management
Decision epochs: T = {0, 1, 2, …, N}
Set of states: S = {0, 1, 2}, indicating the initial stock Xt
Action sets As, indicating the possible order quantity Ut: A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}
Transition probabilities:
(s, a)   P(0|s,a)  P(1|s,a)  P(2|s,a)
(0, 0)   1
(0, 1)   0.9       0.1
(0, 2)   0.2       0.7       0.1
(1, 0)   0.9       0.1
(1, 1)   0.2       0.7       0.1
(2, 0)   0.2       0.7       0.1
Cost function:
(s, a)   C(s, a)
(0, 0)   3
(0, 1)   1.05
(0, 2)   1.1
(1, 0)   0.8
(1, 1)   0.85
(2, 0)   0.6

65. Decision Rules
A decision rule prescribes a procedure for action selection in each state at a specified decision epoch.
A decision rule can be either:
- Markovian (memoryless) if the selection of the action at is based only on the current state st;
- History dependent if the action selection depends on the past history, i.e. the sequence of states/actions ht = (s1, a1, …, st-1, at-1, st).

66. Decision Rules
A decision rule can also be either:
- Deterministic if the decision rule selects one action with certainty;
- Randomized if the decision rule only specifies a probability distribution on the set of actions.

67. Decision Rules
As a result, decision rules can be:
- HR : history dependent and randomized
- HD : history dependent and deterministic
- MR : Markovian and randomized
- MD : Markovian and deterministic

68. Policies
A policy specifies the decision rule to be used at every decision epoch.
A policy p is a sequence of decision rules, i.e. p = {d1, d2, …, dN-1}.
A policy is stationary if dt = d for all t.
Stationary deterministic and stationary randomized policies are important for infinite horizon Markov decision processes.

69. Example
Decision epochs: T = {1, 2, …, N}
States: S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21}
Costs: Ct(s1, a11) = 5, Ct(s1, a12) = 10, Ct(s2, a21) = -1, terminal costs CN(s1) = CN(s2) = 0
Transition probabilities: pt(s1|s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1|s1, a12) = 0, pt(s2|s1, a12) = 1, pt(s1|s2, a21) = 0, pt(s2|s2, a21) = 1
(figure: two-state transition diagram with arcs labelled {cost, probability})

70. Example
A deterministic Markov policy (one state, one action; also called a control map):
Decision epoch 1: d1(s1) = a11, d1(s2) = a21
Decision epoch 2: d2(s1) = a12, d2(s2) = a21

71. Example
A randomized Markov policy (one state, one probability distribution over actions):
Decision epoch 1: P1,s1(a11) = 0.7, P1,s1(a12) = 0.3, P1,s2(a21) = 1
Decision epoch 2: P2,s1(a11) = 0.4, P2,s1(a12) = 0.6, P2,s2(a21) = 1

72. Example
A deterministic history-dependent policy (one history, one action); the figure adds a third action a13 in state s1 with {cost 0, probability 1}.
Decision epoch 1: d1(s1) = a11, d1(s2) = a21
Decision epoch 2:
history h        d2(h)
(s1, a11, s1)    a13
(s1, a12, s1)    infeasible
(s1, a13, s1)    a11
(s2, a21, s1)    infeasible
(*, *, s2)       a21

73. Example
A randomized history-dependent policy (at s = s2, select a21):
Decision epoch 1: P1,s1(a11) = 0.6, P1,s1(a12) = 0.3, P1,s1(a13) = 0.1, P1,s2(a21) = 1
Decision epoch 2:
history h        P(a = a11)   P(a = a12)   P(a = a13)
(s1, a11, s1)    0.4          0.3          0.3
(s1, a12, s1)    infeasible   infeasible   infeasible
(s1, a13, s1)    0.8          0.1          0.1
(s2, a21, s1)    infeasible   infeasible   infeasible

74. Stochastic inventory example revisited
Decision epochs: T = {0, 1, 2, …, N}
Set of states: S = {0, 1, 2}, indicating the initial stock Xt
Action sets As, indicating the possible order quantity Ut: A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}
Transition probabilities:
(s, a)   P(0|s,a)  P(1|s,a)  P(2|s,a)
(0, 0)   1
(0, 1)   0.9       0.1
(0, 2)   0.7       0.2       0.1
(1, 0)   0.9       0.1
(1, 1)   0.7       0.2       0.1
(2, 0)   0.7       0.2       0.1
Cost function:
(s, a)   C(s, a)
(0, 0)   3
(0, 1)   1.05
(0, 2)   1.1
(1, 0)   0.8
(1, 1)   0.85
(2, 0)   0.6

75. Stochastic inventory control policies
State s = inventory at the beginning of a period
Action a = order quantity such that s + a <= 2
MD : Markovian and deterministic
- Stationary: {s=0: a=2, s=1: a=1, s=2: a=0}
- Nonstationary: {(s,a) = (0,2), (1,1), (2,0)} for periods 1 to 5; {(s,a) = (0,1), (1,0), (2,0)} from period 6 on
MR : Markovian and randomized
- Stationary: {s=0: a=2 w.p. 0.5, a=0 w.p. 0.5; s=1: a=1; s=2: a=0}
- Nonstationary: {(s,a) = (0,2), (1,1), (2,0)} for periods 1 to 5; {(s,a) = (0, 2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} from period 6 on
where w.p. = with probability

76. Stochastic inventory control policies
HD : history dependent and deterministic
s    Action a
0    2 if lost sales (s + a - d < 0) in the last two periods; 1 if demand in the last period; 0 if no demand in the last period
1    1 if lost sale in the last period; 0 if no demand in the last period
2    0
HR : history dependent and randomized
s    Action a
0    2 if lost sales in the last two periods; 2 w.p. 0.5 & 0 w.p. 0.5 if demand in the last period; 1 w.p. 0.3 & 0 w.p. 0.7 if no demand in the last period
1    1 w.p. 0.5 & 0 w.p. 0.5 if lost sale in the last period; 0 if no demand in the last period
2    0

77. RemarksEach Markov policy leads to a discrete time Markov Chain and the policy can be evaluated by solving the related Markov chain.

78. Remarks
MD : Markovian and deterministic: s=0: a=2, s=1: a=1, s=2: a=0
Transition matrix (stationary Markov chain, rows = state s, columns = next state 0, 1, 2):
s=0: 0.7   0.2   0.1
s=1: 0.7   0.2   0.1
s=2: 0.7   0.2   0.1
MR : Markovian and randomized: s=0: a=2 w.p. 0.5, a=0 w.p. 0.5; s=1: a=1; s=2: a=0
Transition matrix:
s=0: 0.85  0.1   0.05
s=1: 0.7   0.2   0.1
s=2: 0.7   0.2   0.1

79. Remarks
Nonstationary MD : Markovian and deterministic: {(s,a) = (0,2), (1,1), (2,0)} for periods 1 to 2; {(s,a) = (0,1), (1,0), (2,0)} from period 3 on
Nonstationary MR : Markovian and randomized: {(s,a) = (0,2), (1,1), (2,0)} for periods 1 to 2; {(s,a) = (0, 2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} from period 3 on
Transition matrices of the MD policy (rows = state, columns = next state 0, 1, 2):
Periods 1 and 2:            Periods 3 and 4:
s=0: 0.7  0.2  0.1          s=0: 0.9  0.1
s=1: 0.7  0.2  0.1          s=1: 0.9  0.1
s=2: 0.7  0.2  0.1          s=2: 0.7  0.2  0.1

80. Finite Horizon Markov Decision Processes

81. Assumptions
Assumption 1: The decision epochs T = {1, 2, …, N}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Criterion (equation on slide): minimize the expected total cost over all policies in PHR, where PHR is the set of all possible policies.

82. Optimality of Markov deterministic policies
Theorem: Assume S is finite or countable, and that As is finite for each s in S. Then there exists a Markovian deterministic policy which is optimal.

83. Optimality equations
Theorem: The value functions ut*(s) (minimal expected cost from epoch t onward, starting in state s; equation on slide) satisfy the following optimality equation, of the form:
ut*(s) = min over a in As of { Ct(s, a) + sum over j in S of pt(j | s, a)·u*t+1(j) }
and the action a that minimizes the above term defines the optimal policy.

84. Optimality equations
The optimality equation can also be expressed as:
ut*(s) = min over a in As of Qt(s, a), with Qt(s, a) = Ct(s, a) + sum over j in S of pt(j | s, a)·u*t+1(j)
where Q(s, a) is a Q-function used to evaluate the consequence of an action a taken from a state s.

85. Backward induction algorithm
1. Set t = N and set the terminal values uN*(s) for all s (terminal cost, equation on slide).
2. Substitute t-1 for t and compute, for each st in S:
ut*(st) = min over a in Ast of { Ct(st, a) + sum over j in S of pt(j | st, a)·u*t+1(j) }
3. Repeat step 2 until t = 1.
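A minimal sketch of the backward induction algorithm in Python, applied to the finite-horizon inventory MDP of slide 64 (N = 3 periods, zero terminal cost). The probabilities and costs are the tables of that slide; no other data is assumed.

```python
# Backward induction (slide 85) on the finite-horizon inventory MDP of slide 64.
P = {(0, 0): {0: 1.0},
     (0, 1): {0: 0.9, 1: 0.1},
     (0, 2): {0: 0.2, 1: 0.7, 2: 0.1},
     (1, 0): {0: 0.9, 1: 0.1},
     (1, 1): {0: 0.2, 1: 0.7, 2: 0.1},
     (2, 0): {0: 0.2, 1: 0.7, 2: 0.1}}
C = {(0, 0): 3.0, (0, 1): 1.05, (0, 2): 1.1, (1, 0): 0.8, (1, 1): 0.85, (2, 0): 0.6}
A = {0: [0, 1, 2], 1: [0, 1], 2: [0]}        # feasible order quantities per state
N = 3

u = {N + 1: {s: 0.0 for s in A}}             # terminal values are zero
policy = {}
for t in range(N, 0, -1):                    # backward in time
    u[t], policy[t] = {}, {}
    for s in A:
        # u_t(s) = min_a { C(s,a) + sum_j p(j|s,a) u_{t+1}(j) }
        q = {a: C[s, a] + sum(p * u[t + 1][j] for j, p in P[s, a].items()) for a in A[s]}
        policy[t][s] = min(q, key=q.get)
        u[t][s] = q[policy[t][s]]

print(u[1], policy)
```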

86. Infinite Horizon discounted Markov decision processes

87. Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities: C(s, a) and p(j | s, a) do not vary from decision epoch to decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | <= M for all a in As and all s in S (to be relaxed)

88. Assumptions
Criterion (equation on slide): minimize the expected total discounted cost, of the form E[ sum over t of l^(t-1)·C(st, at) ], over all policies in PHR,
where 0 < l < 1 is the discounting factor and PHR is the set of all possible policies.

89. Discounting factor
Large discounting factor l close to 1 : long-term optimum
Small discounting factor l close to 0 : short-term optimum or myopic policy

90. Optimality equations
Theorem: Under Assumptions 1-5, the optimal cost function V*(s) exists and satisfies the following optimality equation:
V*(s) = min over a in As of { C(s, a) + l·sum over j in S of p(j | s, a)·V*(j) }
Further, V*(·) is the unique solution of the optimality equation. Moreover, a stationary policy p is optimal iff (if and only if) it attains the minimum in the optimality equation.

91. Computation of optimal policy: Value Iteration
Value iteration algorithm:
1. Select any bounded value function V0 (e.g. V0 = 0); let n = 0.
2. For each s in S, compute
Vn+1(s) = min over a in As of { C(s, a) + l·sum over j in S of p(j | s, a)·Vn(j) }
3. Repeat step 2 until convergence. For each s in S, compute the resulting stationary policy
d(s) = argmin over a in As of { C(s, a) + l·sum over j in S of p(j | s, a)·Vn(j) }
Meaning of Vn: with V0 = 0, Vn is the optimal expected discounted cost over a horizon of n periods.
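A minimal sketch of value iteration in Python, reusing the stationary cost and transition tables of slide 64; the discount factor and stopping tolerance are illustrative assumptions.

```python
# Value iteration (slide 91) on the stationary inventory MDP of slide 64.
P = {(0, 0): {0: 1.0}, (0, 1): {0: 0.9, 1: 0.1}, (0, 2): {0: 0.2, 1: 0.7, 2: 0.1},
     (1, 0): {0: 0.9, 1: 0.1}, (1, 1): {0: 0.2, 1: 0.7, 2: 0.1}, (2, 0): {0: 0.2, 1: 0.7, 2: 0.1}}
C = {(0, 0): 3.0, (0, 1): 1.05, (0, 2): 1.1, (1, 0): 0.8, (1, 1): 0.85, (2, 0): 0.6}
A = {0: [0, 1, 2], 1: [0, 1], 2: [0]}
lam, tol = 0.9, 1e-8                          # discount factor and tolerance (assumed)

V = {s: 0.0 for s in A}                       # V0 = 0
while True:
    Q = {s: {a: C[s, a] + lam * sum(p * V[j] for j, p in P[s, a].items()) for a in A[s]}
         for s in A}
    V_new = {s: min(Q[s].values()) for s in A}
    if max(abs(V_new[s] - V[s]) for s in A) < tol:   # stop at (approximate) convergence
        break
    V = V_new

policy = {s: min(Q[s], key=Q[s].get) for s in A}
print({s: round(V_new[s], 3) for s in A}, policy)
```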

92. Computation of optimal policy: Value Iteration
Theorem: Under Assumptions 1-5,
- Vn converges to V*;
- the stationary policy defined in the value iteration algorithm converges to an optimal policy.

93. Computation of optimal policy: Policy Iteration
Policy iteration algorithm:
1. Select an arbitrary stationary policy p0; let n = 0.
2. (Policy evaluation) Obtain the value function Vn of policy pn.
3. (Policy improvement) Choose pn+1 = {dn+1, dn+1, …} such that dn+1(s) minimizes C(s, a) + l·sum over j of p(j | s, a)·Vn(j) over a in As.
4. Repeat steps 2-3 until pn+1 = pn.

94. Computation of optimal policy: Policy Iteration
Policy evaluation: for any stationary deterministic policy p = {d, d, …}, its value function Vp is the unique solution of the following linear equation:
Vp(s) = C(s, d(s)) + l·sum over j in S of p(j | s, d(s))·Vp(j)
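A minimal sketch of policy iteration in Python on the same inventory MDP of slide 64; the discount factor is an illustrative assumption. Policy evaluation solves the linear system above with numpy, and improvement is greedy with respect to the evaluated value function.

```python
# Policy iteration (slides 93-94) on the inventory MDP of slide 64.
import numpy as np

S = [0, 1, 2]
A = {0: [0, 1, 2], 1: [0, 1], 2: [0]}
P = {(0, 0): [1.0, 0.0, 0.0], (0, 1): [0.9, 0.1, 0.0], (0, 2): [0.2, 0.7, 0.1],
     (1, 0): [0.9, 0.1, 0.0], (1, 1): [0.2, 0.7, 0.1], (2, 0): [0.2, 0.7, 0.1]}
C = {(0, 0): 3.0, (0, 1): 1.05, (0, 2): 1.1, (1, 0): 0.8, (1, 1): 0.85, (2, 0): 0.6}
lam = 0.9                                     # discount factor (assumed)

d = {s: A[s][0] for s in S}                   # arbitrary initial policy p0
while True:
    # Policy evaluation: solve (I - lam * P_d) V = C_d
    P_d = np.array([P[s, d[s]] for s in S])
    C_d = np.array([C[s, d[s]] for s in S])
    V = np.linalg.solve(np.eye(len(S)) - lam * P_d, C_d)
    # Policy improvement: greedy action with respect to V
    d_new = {s: min(A[s], key=lambda a: C[s, a] + lam * np.dot(P[s, a], V)) for s in S}
    if d_new == d:                            # stop when the policy no longer changes
        break
    d = d_new

print(np.round(V, 3), d)
```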

95. Computation of optimal policy: Policy Iteration
Theorem: The value functions Vn generated by the policy iteration algorithm satisfy Vn+1 <= Vn. Further, if Vn+1 = Vn, then Vn = V*.

96. Computation of optimal policy: Linear programming
Recall the optimality equation:
V(s) = min over a in As of { C(s, a) + l·sum over j in S of p(j | s, a)·V(j) }
The optimal value function can be determined by the following linear program, with weights a(s) > 0 and sum over s of a(s) = 1:
max sum over s of a(s)·V(s)
subject to: V(s) <= C(s, a) + l·sum over j in S of p(j | s, a)·V(j), for all s and all a in As
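A minimal sketch of this LP in Python on the inventory MDP of slide 64, solved with scipy's linprog; the discount factor and the weights a(s) are illustrative assumptions.

```python
# Primal LP of slide 96 for the inventory MDP of slide 64.
import numpy as np
from scipy.optimize import linprog

S = [0, 1, 2]
A = {0: [0, 1, 2], 1: [0, 1], 2: [0]}
P = {(0, 0): [1.0, 0.0, 0.0], (0, 1): [0.9, 0.1, 0.0], (0, 2): [0.2, 0.7, 0.1],
     (1, 0): [0.9, 0.1, 0.0], (1, 1): [0.2, 0.7, 0.1], (2, 0): [0.2, 0.7, 0.1]}
C = {(0, 0): 3.0, (0, 1): 1.05, (0, 2): 1.1, (1, 0): 0.8, (1, 1): 0.85, (2, 0): 0.6}
lam, alpha = 0.9, np.ones(len(S)) / len(S)    # discount factor and weights (assumed)

# maximize sum_s alpha(s) V(s)  <=>  minimize -alpha . V
# subject to V(s) - lam * sum_j p(j|s,a) V(j) <= C(s,a) for every (s, a)
A_ub, b_ub = [], []
for s in S:
    for a in A[s]:
        row = -lam * np.array(P[s, a])
        row[s] += 1.0
        A_ub.append(row)
        b_ub.append(C[s, a])

res = linprog(c=-alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * len(S), method="highs")
print(np.round(res.x, 3))                     # optimal value function V*(s)
```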

97. Computation of optimal policy: Linear programming
Dual linear program (on slide):
1. An optimal basic solution x* gives a deterministic optimal policy.
2. x(s, a) = total discounted joint probability, under the initial-state distribution a, that the system occupies state s and chooses action a.
3. The dual linear program extends to a constrained model with an upper limit C on a total discounted cost (constraint equation on slide).

98. Extension to unbounded costs
Theorem 1. Under the condition C(s, a) >= 0 (or C(s, a) <= 0) for all states s and control actions a, the optimal cost function V*(s) among all stationary deterministic policies satisfies the optimality equation.
Theorem 2. Assume that the set of control actions is finite. Then, under the condition C(s, a) >= 0 for all states s and control actions a, we have V*(s) = lim of VN(s) as N goes to infinity, where VN(s) is the solution of the value iteration algorithm with V0(s) = 0.
Implication of Theorem 2: the optimal cost can be obtained as the limit of value iteration, and the optimal stationary policy can also be obtained in the limit.

99. Example
Consider a computer system consisting of M different processors. Using processor i for a job incurs a finite cost Ci, with C1 < C2 < ... < CM. When we submit a job to this system, processor i is assigned to our job with probability pi. At this point we can (a) decide to go with this processor or (b) choose to hold the job until a lower-cost processor is assigned. The system periodically returns to our job and assigns a processor in the same way. Waiting until the next processor assignment incurs a fixed finite cost c.
Question: How do we decide between going with the processor currently assigned to our job and waiting for the next assignment?
Suggestions:
- The state definition should include all information useful for the decision.
- The problem belongs to the class of so-called stochastic shortest path problems.

100. Why does it work: Preliminary
Policy p value function (cost minimization, equation on slide).
Without loss of generality, 0 <= C(s, a) <= M: if | C(s, a) | <= M, apply the transformation C'(s, a) = C(s, a) + M.

101. Why does it work: DP & optimality equation
DP (dynamic programming) recursion and optimality equation (equations on slide).

102. Why does it work: DP & optimality equation
DP operator T and contraction property of the DP operator (equations on slide).

103. Why does it work: DP convergence
Lemma 1: If 0 <= C(s, a) <= M, then VN(s) is monotone converging and lim of VN(s) = V*(s).
Proof (sketch on slide): part one follows from VN(s) <= VN+1(s) (due to C(s, a) >= 0) and VN(s) <= M/(1-l) (due to C(s, a) <= M), taking the min on both sides of the inequalities; this monotone bounded property guarantees the existence of V*(s).

104. Why does it work: convergence of value iteration
Lemma 2: If 0 <= C(s, a) <= M, then for any bounded function f, lim of T^N(f)(s) = V*(s) and, similarly, lim of Tp^N(f)(s) = Vp(s) (proof on slide).

105. Why does it work: optimality equation
Theorem 1: If 0 <= C(s, a) <= M, V*(s) is the unique bounded solution of the optimality equation. Moreover, a stationary policy is optimal iff p(s) is a minimizer of the right-hand term.

106. Why does it work: optimality equation
Theorem 1 (restated; proof continued on slide): If 0 <= C(s, a) <= M, V*(s) is the unique bounded solution of the optimality equation, and a stationary policy is optimal iff p(s) is a minimizer of the right-hand term.

107. Why does it work: optimality equation
Theorem A (same statement, proof on slide): If 0 <= C(s, a) <= M, V*(s) is the unique bounded solution of the optimality equation, and a stationary policy is optimal iff p(s) is a minimizer of the right-hand term.

108. Why does it work: convergence of policy iteration
Theorem B: The value functions Vn generated by the policy iteration algorithm satisfy Vn+1 <= Vn.

109. Why does it work: convergence of policy iteration
Theorem B (restated; proof continued on slide): The value functions Vn generated by the policy iteration algorithm satisfy Vn+1 <= Vn.

110. Infinite Horizon average cost Markov decision processes

111. Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities: C(s, a) and p(j | s, a) do not vary from decision epoch to decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | <= M for all a in As and all s in S
Assumption 6: The Markov chain corresponding to any stationary deterministic policy contains a single recurrent class (unichain) – existence of a steady state for all policies

112. Assumptions
Criterion (equation on slide): minimize the long-run average cost per period over all policies in PHR, where PHR is the set of all possible policies.

113. Optimal policy
Main Theorem: Under Assumptions 1-6,
- There exists an optimal stationary deterministic policy.
- There exist a real g and a value function h(s) that satisfy the following optimality equation:
g + h(s) = min over a in As of { C(s, a) + sum over j in S of p(j | s, a)·h(j) }
- For any solutions (g, h) and (g', h') of the optimality equation: (a) g = g' is the optimal average cost; (b) h(s) = h'(s) + k (closure under translation).
- Any minimizer of the optimality equation is an optimal policy.

114. Relation between discounted and average cost MDP
It can be shown that (why: see online material), for any given reference state x0, the discounted value function decomposes (equation on slide) into an average cost term and a differential cost term, of the form Vl(s) ≈ g/(1-l) + h(s).

115. Relation between discounted and average cost MDP
(equations on slide)
where x0 = a given reference state and h(s) = differential reward/cost (starting from s instead of x0).

116. Relation between discounted and average cost MDP
Why (derivation on slide): if the limits are interchangeable; if the discounted policy converges to an average cost policy (Blackwell optimality).

117. Computation of the optimal policy by LP
Recall the optimality equation:
g + h(s) = min over a in As of { C(s, a) + sum over j in S of p(j | s, a)·h(j) }
This leads to the following LP for optimal policy computation:
max g
subject to: g + h(s) <= C(s, a) + sum over j in S of p(j | s, a)·h(j), for all s and all a in As
Remark: value iteration and policy iteration can also be extended to the average cost case.

118. Computation of optimal policy: Value Iteration
1. Select any bounded value function h0 with h0(s0) = 0; let n = 0.
2. For each s in S, compute (relative value iteration step on slide), of the form
wn+1(s) = min over a in As of { C(s, a) + sum over j of p(j | s, a)·hn(j) }, hn+1(s) = wn+1(s) – wn+1(s0)
3. Repeat step 2 until convergence. For each s in S, compute the resulting greedy policy.
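A minimal sketch of relative value iteration for the average cost criterion in Python, on the stationary inventory MDP of slide 64, with reference state s0 = 0; the stopping tolerance and iteration cap are illustrative assumptions.

```python
# Relative value iteration (slide 118) on the stationary inventory MDP of slide 64.
P = {(0, 0): {0: 1.0}, (0, 1): {0: 0.9, 1: 0.1}, (0, 2): {0: 0.2, 1: 0.7, 2: 0.1},
     (1, 0): {0: 0.9, 1: 0.1}, (1, 1): {0: 0.2, 1: 0.7, 2: 0.1}, (2, 0): {0: 0.2, 1: 0.7, 2: 0.1}}
C = {(0, 0): 3.0, (0, 1): 1.05, (0, 2): 1.1, (1, 0): 0.8, (1, 1): 0.85, (2, 0): 0.6}
A = {0: [0, 1, 2], 1: [0, 1], 2: [0]}
s0, tol = 0, 1e-9

h = {s: 0.0 for s in A}                       # h0 = 0, so h0(s0) = 0
g = 0.0
for _ in range(10_000):                       # iteration cap to guarantee termination
    w = {s: min(C[s, a] + sum(p * h[j] for j, p in P[s, a].items()) for a in A[s]) for s in A}
    g = w[s0]                                 # current estimate of the average cost
    h_new = {s: w[s] - g for s in A}          # renormalize so that h(s0) = 0
    if max(abs(h_new[s] - h[s]) for s in A) < tol:
        break
    h = h_new

policy = {s: min(A[s], key=lambda a: C[s, a] + sum(p * h[j] for j, p in P[s, a].items()))
          for s in A}
print(round(g, 4), {s: round(h[s], 4) for s in A}, policy)
```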

119. Computation of optimal policy: Policy Iteration
1. Select any policy p0; let n = 0.
2. Policy evaluation: determine gp = the stationary expected average cost of pn and solve the evaluation equations (on slide) for the differential values.
3. Policy improvement (equation on slide).
4. Set n := n + 1 and repeat steps 2-3 until convergence.

120. Extensions to unbounded cost
Theorem. Assume that the set of control actions is finite. Suppose that there exist a finite constant L and some state x0 such that
| Vl(x) – Vl(x0) | <= L
for all states x and for all l in (0, 1). Then, for some sequence {ln} converging to 1, the following limits exist and satisfy the optimality equation (equations on slide).
Easy extension to policy iteration.
More conditions: Sennott, L.I. (1999) Stochastic Dynamic Programming and the Control of Queueing Systems, New York: Wiley.

121. Why does it work: convergence of policy iteration
Theorem: If all policies generated by policy iteration are unichain, then gn+1 >= gn.

122. Continuous time Markov decision processes

123. Assumptions
Assumption 1: The decision epochs T = R+
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary cost rates and transition rates: C(s, a) and m(j | s, a) do not vary from decision epoch to decision epoch

124. Assumptions
Criterion (equation on slide): discounted or average cost over continuous time, as developed in the following slides.

125. Example
Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d.
State Xt = stock level. Action at = make or rest.
(figure: transition diagram on states 0, 1, 2, 3, … with "make" transitions at rate p and demand transitions at rate d)

126. Formal definition
Decision epochs: T = [0, +∞)
Set of states: S = {0, 1, 2}, indicating the stock Xt
Action: order quantity, with A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}
Cost rates:
(s, a)   C(s, a)
(0, 0)   3
(0, 1)   1.05
(0, 2)   1.1
(1, 0)   0.8
(1, 1)   0.85
(2, 0)   0.6
Transition rates m(j | s, a) (table on slide): (0, 0): 0,52; (0, 1): 0,1; (0, 2): 0,2 0,1; (1, 0): 0,9; (1, 1): 0,7 0,1; (2, 0): 0,7 0,2

127. Uniformization
Any continuous-time Markov chain can be converted to a discrete-time chain through a process called "uniformization".
Each continuous-time Markov chain is characterized by the transition rates mij of all possible transitions.
The sojourn time Ti in each state i is exponentially distributed with rate m(i) = sum over j ≠ i of mij, i.e. E[Ti] = 1/m(i).
Transitions out of different states are not paced by a common clock; they are asynchronous, with timing depending on m(i).

128. Uniformization
In order to synchronize (uniformize) the transitions at the same pace, we choose a uniformization rate
g >= MAX{m(i)}
"Uniformized" Markov chain:
- transitions occur only at instants generated by a common Poisson process of rate g (also called the standard clock);
- state-transition probabilities: pij = mij / g, pii = 1 - m(i)/g
where the self-loop transitions correspond to fictitious events.

129. Uniformization
(figure: a two-state CTMC with rates a (S1 to S2) and b (S2 to S1), its uniformized CTMC with self-loop rates g-a and g-b, and the corresponding DTMC with probabilities a/g, 1-a/g, b/g, 1-b/g)
Step 1: Determine the rates of the states: m(S1) = a, m(S2) = b
Step 2: Select a uniformization rate g >= max{m(i)}
Step 3: Add self-loop transitions to the states of the CTMC.
Step 4: Derive the corresponding uniformized DTMC.

130. Uniformization
Rates associated to the states (two-queue example on slide):
m(0,0) = l1 + l2, m(1,0) = m1 + l2, m(0,1) = l1 + m2, m(1,1) = m1

131. Uniformization
For a Markov decision process, the uniformization rate should be such that
g >= m(s, a) = sum over j in S of m(j | s, a), for all states s and all possible control actions a.
The state-transition probabilities of the uniformized Markov decision process become:
p(j | s, a) = m(j | s, a) / g, for j ≠ s
p(s | s, a) = 1 - sum over j in S of m(j | s, a) / g
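A minimal sketch of this conversion in Python; the example rates at the bottom are illustrative assumptions (a make/rest production system), not data from the course.

```python
# Uniformization of slide 131: convert transition rates m(j|s,a) into the transition
# probabilities of the uniformized discrete-time MDP.
def uniformize(rates, g=None):
    """rates: dict (s, a) -> dict j -> m(j|s,a). Returns (g, probs) with
    probs[(s, a)][j] = m(j|s,a)/g plus a self-loop probability 1 - m(s,a)/g."""
    out_rate = {sa: sum(m.values()) for sa, m in rates.items()}
    if g is None:
        g = max(out_rate.values())            # g >= m(s, a) for all (s, a)
    probs = {}
    for (s, a), m in rates.items():
        probs[(s, a)] = {j: r / g for j, r in m.items()}
        probs[(s, a)][s] = probs[(s, a)].get(s, 0.0) + 1.0 - out_rate[(s, a)] / g
    return g, probs

p, d = 2.0, 1.0                               # production rate, demand rate (illustrative)
rates = {(0, "make"): {1: p},                 # at stock 0, no demand transition downward
         (1, "make"): {2: p, 0: d},
         (1, "rest"): {0: d}}
print(uniformize(rates))
```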

132. Uniformization
(figure: the make/rest production chain of slide 125 and its uniformized Markov decision process at rate g = p + d, where "make" transitions have probability p/g, demand transitions have probability d/g, and "not make" yields self-loops of probability p/g)

133. Uniformization
Under uniformization, a sequence of discrete decision epochs T1, T2, … is generated, where Tk+1 – Tk = EXP(g).
The discrete-time Markov chain describes the state of the system at these decision epochs.
All criteria can be easily converted.
(figure: timeline T0, T1, T2, T3 with EXP(g) inter-epoch times, a Poisson process at rate g, and a continuous cost C(s, a) per unit time)

134. Cost function conversion for the uniformized Markov chain
Discounted cost of a stationary policy p (with only continuous cost rates; equations on slide), using:
- state changes and actions taken only at the epochs Tk;
- mutual independence of (Xk, ak) and the event clocks (Tk, Tk+1);
- Tk is a Poisson process at rate g.
Average cost of a stationary policy p (with only continuous cost rates; equation on slide).

135. Cost function conversion for the uniformized Markov chain
Tk is a Poisson process at rate g, i.e. Tk = t1 + … + tk with ti = EXP(g) (derivation on slide).

136. Optimality equation: discounted cost case
Equivalent discrete-time discounted MDP:
- a discrete-time Markov chain with uniform transition rate g
- a discount factor l = g/(g + b)
- a stage cost C(s, a)/(b + g)
Optimality equation (on slide).

137. Optimality equation: average cost case
Equivalent discrete-time average-cost MDP:
- a discrete-time Markov chain with uniform transition rate g
- a stage cost C(s, a)/g incurred whenever a state s is entered and an action a is chosen
Optimality equation for the average cost per uniformized period (on slide):
where g = average cost per uniformized period (multiplying by the uniformization rate gives the average cost per time unit), and h(s) = differential cost with respect to a reference state s0, with h(s0) = 0.

138. Optimality equation: average cost case
Multiplying both sides of the optimality equation by the uniformization rate leads to alternative optimality equation 1 (on slide):
where G = optimal average cost per time unit and H(s) = modified differential cost, with H(s) = g·(V(s) – V(s0)).

139. Optimality equation: average cost case
Alternative optimality equation 2 (Hamilton-Jacobi-Bellman equation, on slide):
where h(s) = differential cost with respect to a reference state s0, and m(j | s, a) = transition rate from (s, a) to j, i.e. m(j | s, a) = g·p(j | s, a) for j ≠ s and m(s | s, a) = g·p(s | s, a) – g.

140. Example (continued)
Uniformize the Markov decision process with rate g = p + d.
The optimality equation (on slide).

141. Example (continued)
From the optimality equation: if V(s) is convex, then there exists a K such that:
- V(s+1) – V(s) > 0 and the decision is not to produce, for all s >= K, and
- V(s+1) – V(s) <= 0 and the decision is to produce, for all s < K.

142. Example (continued)
Convexity proved by value iteration.
Proof by induction: V0 is convex; if Vn is convex with minimum at s = K, then Vn+1 is convex (figure on slide).

143. Example (continued)
Convexity proved by value iteration.
Assume Vn is convex with minimum at s = K. Vn+1 is convex if DU(s) <= DU(s+1), where DU(s) = U(s+1) – U(s).
This is true for s+1 < K-1 and s > K-1 by induction. The proof is established if
DU(K-2) <= DU(K-1), using DVn(K-1) <= 0, and DU(K-1) <= DU(K), using 0 <= DVn(K)
(figure on slide).

144. Extended cost structure
(figure: timeline T0, T1, T2, T3 with EXP(m(s|s,a)) sojourn times; in state (s, a): an initial fixed cost K(s, a), a continuous cost C(s, a) per unit time, and a terminal fixed cost k(s, a, j) on the transition to (j, a'))
Conversion into a modified cost rate: case of average cost (equation on slide).

145. Condition for optimality of monotone policies (first order properties)

146. Monotone policy
A monotone policy is a policy p(s) that is nondecreasing or nonincreasing in s.
Question: when does there exist an optimal monotone policy?
Answers: monotonicity (addressed here) and convexity (addressed in the previous example).
Only the finite-horizon case is considered here, but the results can be extended to the discounted and average cost cases.

147. Submodularity and Supermodularity
A function g(x, y) is said to be supermodular if, for x+ >= x- and y+ >= y-,
g(x+, y+) + g(x-, y-) >= g(x+, y-) + g(x-, y+)
It is said to be submodular if the reverse inequality holds.
Supermodularity corresponds to increasing differences, i.e. g(x+, y) – g(x-, y) nondecreasing in y.
Submodularity corresponds to decreasing differences.

148. Submodularity and Supermodularity
(examples of supermodular functions on slide)
Property 1: If g(x, y) is supermodular (submodular), then
f(x) = the min or max selection of the set argmax over y of g(x, y) of maximizers
is monotone nondecreasing (nonincreasing) in x.
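A small numeric illustration of Property 1 in Python, not from the slides: the function g below is an illustrative assumption chosen to be supermodular, and the largest maximizer over y is checked to be nondecreasing in x.

```python
# Numeric check of Property 1 on a grid.
X, Y = range(5), range(5)

def g(x, y):
    # g(x, y) = x*y - (y - 2)^2 has cross difference +1 > 0, hence is supermodular.
    return x * y - (y - 2) ** 2

# Supermodularity: g(x+1, y+1) + g(x, y) >= g(x+1, y) + g(x, y+1)
supermodular = all(g(x + 1, y + 1) + g(x, y) >= g(x + 1, y) + g(x, y + 1)
                   for x in range(4) for y in range(4))

# Largest maximizer of g(x, .) for each x
f = [max(y for y in Y if g(x, y) == max(g(x, yy) for yy in Y)) for x in X]

print(supermodular, f)   # True, and f is nondecreasing in x, e.g. [2, 3, 3, 4, 4]
```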

149. Dynamic Programming Operator
DP operator T (equations on slide, in two equivalent forms).

150. DP Operator: monotonicity preservation
Property 2: T[Vt](s) is nondecreasing (nonincreasing) in s if
- r(s, a) is nondecreasing (nonincreasing) in s for all a, and
- the transition law is stochastically monotone, i.e. the tail probabilities sum over snext >= k of p(snext | s, a) are nondecreasing in s for all a and all k.

151. DP Operator: control monotonicity

152. DP Operator: control monotonicity

153. Batch delivery model
Customer demand Dt for a product arrives over time.
State set S = {0, 1, …}: quantity of pending demand
Action set A = {0 = no delivery, 1 = deliver all pending demand}
Cost C(s, a) = h·s·(1-a) + a·K, where h = unit holding cost and K = fixed delivery cost
Transition: snext = s·(1-a) + D, where P(D = i) = pi, i = 0, 1, …
GOAL: minimize the total cost
Submodularity implies a(s) nondecreasing in s.

154. Batch delivery model
Minimizing a submodular function corresponds to maximizing a supermodular function (details on slide).

155. A machine replacement model
The machine deteriorates by a random number I of states per period.
State set S = {0, 1, …}, from best to worst condition
Action set A = {1 = replace, 0 = do not replace}
Reward r(s, a) = R – h(s·(1-a)) – a·K, where R = fixed income per period, h(s) = nondecreasing operating cost, K = replacement cost
Transition: snext = s·(1-a) + I, where P(I = i) = pi, i = 0, 1, …
GOAL: maximize the total reward
Supermodularity implies a(s) nondecreasing in s.

156. A machine replacement model

157. A general framework for value function property analysis
Based on G. Koole, "Structural results for the control of queueing systems using event-based dynamic programming," Queueing Systems 30:323-339, 1998.

158. Introduction: event operators

159. Introduction: a single server queue
- exponential server
- Poisson arrivals, whose admission can be controlled
- l: arrival rate, m: service rate, l + m = 1
- c: unit rejection cost
- C(x): holding cost of x customers

160. Introduction: discrete-time queue
- 1: customer arrival rate, i.e. one customer per period
- p: geometric service rate
- x: queue length before the admission decision and service completion

161. One-dimension models : operators

162. One-dimension models: operators

163. One-dimension models: operators

164. One-dimension models : operators

165. One-dimension models : properties

166. One-dimension models : property propagation

167. One-dimension models : property propagation

168. One-dimension models: property propagation
Proof of Lemma 1.
- Tcosts and Tunif: the results follow directly, as increasingness and convexity are closed under convex combinations.
- TA(1): the results follow directly, by replacing x by x + e1 in the inequalities.
- TFS(1): certain terms cancel out.
- TD(1): increasingness follows as for TA(1), except if x1 = b1; in this case TD(1)f(x) = TD(1)f(x + e1). Also for convexity, the only non-trivial case is x1 = b1, which reduces to f(x) <= f(x + e1).
- TMD(1): roughly the same arguments are used.
- TAC(1): (proof on slide)
- TCD(1): similar proof as for TAC(1).

169. One-dimension models : property propagation

170. One-dimension models : property propagation

171. A single server queue
- l: arrival rate, m: service rate, l + m = 1
- c: unit rejection cost
- C(x): holding cost of x customers

172. Discrete-time queue
- 1: customer arrival rate
- p: geometric service rate
- x: queue length before the admission decision and service completion

173. Production-inventory system

174. Multi-machine production-inventory with preemption

175. Examples of Tenv(i)
The control policy keeps its structure but depends on the environment.

176. Examples

177. Two-dimension models : operators

178. Two-dimension models : properties

179. Two-dimension models : properties
(exchange-argument diagrams on slide)
- Super(i,j) + SuperC(i,j) implies Conv(i) + Conv(j)
- Super(i,j) + Conv(i) + Conv(j) implies SubC(i,j)
- Sub(i,j) + SubC(i,j) implies Conv(i) + Conv(j)
- Sub(i,j) + Conv(i) + Conv(j) implies SuperC(i,j)

180. 2-dimension models : property propagation

181. 2-dimension models : property propagation

182. 2-dimension models : property propagation

183. 2-dimension models : property propagation

184. 2-dimension models : property propagation

185. 2-dimension models : property propagation

186. 2-dimension models : property propagation

187. 2-dimension models
Control structure under Super(1, 2) + SuperC(1, 2)
- Super(1, 2) + SuperC(1, 2) implies Conv(1) + Conv(2)
- Conv(1) for TAC(1): threshold admission in x1
- Conv(2) for TAC(2): threshold admission in x2
- Super(1, 2) for TAC(1): the threshold is of threshold form in x2
- Super(1, 2) for TAC(2): of threshold form in x1
- SuperC(1, 2) for TAC(1): rejection in x + e2 implies rejection in x + e1
- SuperC(1, 2) for TAC(2): rejection in x + e1 implies rejection in x + e2
- TAC(1) & TAC(2): a decreasing switching curve below which customers are admitted (figure on slide: admission below, rejection above)
- TCD(1) and TCD(2) can be seen as dual to TAC(1) and TAC(2), with corresponding results.

188. 2-dimension models
Control structure under Super(1, 2) + SuperC(1, 2)
- SuperC(1, 2) for TR: an increasing switching curve above (below) which customers are assigned to queue 1 (2)
- SuperC(1, 2) for TCJ(1,2): the optimal control is increasing in x1 and decreasing in x2, i.e. an increasing switching curve, below which jockeying occurs
- TCJ(2,1): an increasing switching curve, above which jockeying occurs
(figure on slide: queue 1 / queue 2 switching curve)

189. 2-dimension models : property propagation

190. 2-dimension models
Control structure under Super(1, 2)
Admission control for class 1 is decreasing in the class-2 state, and vice versa.

191. 2-dimension models : property propagation

192. 2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)
- Sub(1, 2) + SubC(1, 2) implies Conv(1) + Conv(2)
- Conv(i): threshold admission rule for TAC(i) in xi
- Sub(1, 2): TAC(1) is of threshold form in x2
- Sub(1, 2): TAC(2) is of threshold form in x1
- SubC(1, 2) for TAC(1) (TAC(2)): an increasing switching curve above (below) which customers are admitted
- The effects of TCD(i) amount to balancing, in some sense, the two queues; the two queues "attract" each other
- TACF(1,2) has a decreasing switching curve below which customers are admitted

193. 2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)
(figure on slide: admission regions for queue 1 and queue 2 delimited by the TAC(1) and TAC(2) switching curves)

194. Examples: a queue served by two servers
- A common queue served by two servers (1 = fast, 2 = slow)
- Poisson arrivals to the queue
- Exponential servers, but with different mean service times
- Goal: minimize the mean sojourn time

195. Examples: a queue served by two servers
(figure on slide: switching curve TCJ(1,2), region labelled "to slow")

196. Examples: production line with Poisson demand
- M1 feeds buffer 1, M2 transfers to buffer 2
- Poisson demand filled from queue 2
- Production rate control of both machines

197. Examples: tandem queues with Poisson demand
(figure on slide: M1 produces into x1, M2 produces into x2)

198. Examples: admission control of tandem queues
- Two tandem queues: queue 1 feeds queue 2
- Convex holding costs hi(xi)
- Service rate control of both queues
- Admission control of arrivals to queue 1

199. Examples: cyclic tandem queues
- Two cyclic queues: queue 1 feeds queue 2, and vice versa
- Convex holding costs hi(xi)
- Service rate control of both queues

200. Multi-machine production-inventory with non-preemption
(figure on slide: switching curves AC(1) and AC(2))

201. Examples: stochastic knapsack
- Packing a knapsack of integer volume B with objects from 2 different classes to maximize profit
- Poisson arrivals

202. Examples