Reminder: Markov decision process

Observe states $x$, apply actions $u$, receive rewards $r$
System: stochastic dynamics $x_{k+1} \sim f(x_k, u_k, \cdot)$
Performance: reward function $r_{k+1} = r(x_k, u_k, x_{k+1})$
Controller: $u_k = h(x_k)$

Goal

Find $h$ that maximizes, from any $x_0$, the expected discounted return:
$$R^h(x_0) = E\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\Big|\, h \Big\}$$
Equivalent to minimizing the expected cost:
$$E\{J(x_0) \mid h\} = E\Big\{ \sum_{k=0}^{\infty} \gamma^k (-r_{k+1}) \,\Big|\, h \Big\}$$
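
To make the return concrete, here is a minimal Python sketch of a finite-horizon estimate of $R^h(x_0)$; the reward sequence and discount factor are illustrative values, not taken from the lecture:

```python
# Finite-horizon estimate of R^h(x0) = sum_k gamma^k * r_{k+1}.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Illustrative reward sequence and discount factor:
print(discounted_return([1.0, 0.0, -0.5, 1.0], gamma=0.9))
# = 1.0 + 0.9*0.0 + 0.81*(-0.5) + 0.729*1.0 = 1.324
```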

Value & policy iteration with Q-functions

Q-value iteration: turn the Bellman optimality equation into an iterative update
repeat
    $Q(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q(x',u') \}$  $\forall x,u$
until convergence to $Q^*$

Policy iteration: iteratively evaluate & improve policies
repeat
    policy evaluation: solve $Q^h(x,u) = E\{ r(x,u,x') + \gamma Q^h(x',h(x')) \}$
        (e.g. by using a VI-like update)
    policy improvement: $h(x) \leftarrow \arg\max_u Q^h(x,u)$  $\forall x$
until convergence to $h^*$
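
A minimal Python sketch of Q-value iteration on a finite MDP with a known model; the transition and reward arrays `P` and `R` are illustrative placeholders, not a model from the lecture:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[x,u,x'] = transition probability, R[x,u,x'] = reward."""
    n_x, n_u, _ = P.shape
    Q = np.zeros((n_x, n_u))
    while True:
        # Bellman optimality update: Q(x,u) <- E{ r(x,u,x') + gamma * max_u' Q(x',u') }
        Q_new = np.einsum('xuy,xuy->xu', P, R + gamma * Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new   # numerical approximation of Q*
        Q = Q_new
```

The greedy policy then follows as $h^*(x) = \arg\max_u Q^*(x,u)$, i.e. `Q.argmax(axis=1)`.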

Taxonomy of methods

By path to solution:
1. Approximate value iteration: find $\hat{Q}^*$, use it to compute $\hat{h}^*$
2. Approximate policy iteration: evaluate $h$ (find $\hat{Q}^h$), improve $h$, repeat
3. Approximate policy search: look directly for $\hat{h}^*$

By level of interaction:
1. Offline (batch): data collected in advance
2. Online: learn by interaction

Assumptions

Deterministic system
BFs normalized: $\sum_{i=1}^{N} \phi_i(x) = 1$
At the center $x_i$ of each BF: $\phi_i(x_i) = 1$ and $\phi_{i'}(x_i) = 0$ $\forall i' \neq i$
Simplest BFs satisfying this: triangular $\Rightarrow$ multilinear interpolation

Fuzzy Q-iteration

Revisit exact Q-iteration:
repeat at each iteration $\ell$
    $Q_{\ell+1}(x,u) \leftarrow r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u')$ for all $x,u$
        $=: [T(Q_\ell)](x,u)$, with $T: \mathcal{Q} \to \mathcal{Q}$ the Bellman mapping
until convergence to $Q^*$
output $Q^*$, $h^*(x) = \arg\max_u Q^*(x,u)$

Fuzzy Q-iteration (Buşoniu et al., 2007):
repeat at each iteration $\ell$
    $W_{\ell+1,[i,j]} \leftarrow [T(\hat{Q}_{W_\ell})](x_i,u_j) = r(x_i,u_j,x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u')$ for all $i,j$
until convergence to $W^*$
output $\hat{Q}_{W^*}$, $h_{W^*}(x) = \arg\max_{u_j} \hat{Q}_{W^*}(x,u_j)$
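
A minimal Python sketch of fuzzy Q-iteration on a one-dimensional deterministic system; the grid, action set, dynamics `f`, and reward `r` are all illustrative placeholders, not the lecture's example:

```python
import numpy as np

centers = np.linspace(-1.0, 1.0, 11)   # BF centers x_i
actions = np.array([-1.0, 0.0, 1.0])   # discrete actions u_j
gamma = 0.9

def f(x, u):                            # placeholder deterministic dynamics
    return np.clip(x + 0.1 * u, -1.0, 1.0)

def r(x, u, x_next):                    # placeholder reward
    return -x_next**2

def phi(x):
    """Triangular BFs: phi_i(x_i) = 1, phi_{i'}(x_i) = 0, sum_i phi_i(x) = 1."""
    w = np.maximum(0.0, 1.0 - np.abs(x - centers) / (centers[1] - centers[0]))
    return w / w.sum()

W = np.zeros((len(centers), len(actions)))
for _ in range(200):                    # in practice: iterate until W converges
    W_new = np.empty_like(W)
    for i, xi in enumerate(centers):
        for j, uj in enumerate(actions):
            x_next = f(xi, uj)
            # W[i,j] <- [T(Q_W)](x_i, u_j) = r + gamma * max_u' Q_W(x', u')
            W_new[i, j] = r(xi, uj, x_next) + gamma * np.max(phi(x_next) @ W)
    W = W_new

h = lambda x: actions[int(np.argmax(phi(x) @ W))]   # greedy policy h_W
```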

Solution quality

Characterize approximation power by the minimum distance to $Q^*$:
$$\varepsilon^* = \min_{W \in \mathbb{R}^{NM}} \big\| Q^* - \hat{Q}_W \big\|_\infty$$

Theorem (continued)
Returned Q-function is near-optimal:
$$\big\| Q^* - \hat{Q}_{W^*} \big\|_\infty \leq \frac{2\varepsilon^*}{1-\gamma}$$
... and the corresponding policy is also near-optimal:
$$\big\| Q^* - Q^{h_{W^*}} \big\|_\infty \leq \frac{4\gamma\varepsilon^*}{(1-\gamma)^2}$$
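
To get a sense of scale, here are the bounds instantiated with illustrative numbers ($\gamma = 0.9$, $\varepsilon^* = 0.01$; not values from the lecture):

```latex
\left\| Q^* - \hat{Q}_{W^*} \right\|_\infty \le \frac{2 \cdot 0.01}{1 - 0.9} = 0.2,
\qquad
\left\| Q^* - Q^{h_{W^*}} \right\|_\infty \le \frac{4 \cdot 0.9 \cdot 0.01}{(1 - 0.9)^2} = 3.6
```

Note how much looser the policy bound is: the $(1-\gamma)^2$ denominator makes it blow up as $\gamma \to 1$.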

Example: fuzzy QI for the inverted pendulum

BFs over state space: triangular, on a 41×21 equidistant grid
Action discretization: grid of 5 actions, centered on 0

1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
    AVI: Fuzzy Q-iteration
    AVI: Fitted Q-iteration
    API: Least-squares policy iteration
3 Online temporal-difference RL
    Classical Q-learning and SARSA
    Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions

Generalizing fuzzy Q-iteration

Arbitrary approximator, parametric or nonparametric
    (fuzzy QI: restricted BFs + discrete actions)
Arbitrary transition samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$, $i_s = 1, \ldots, n_s$
    (fuzzy QI: center-discrete action pairs $(x_i, u_j)$)
Possibly stochastic system

Fitted Q-iteration (cont'd)

If the system is stochastic, we actually aim for:
$$\hat{Q}_W(x,u) \approx E\Big\{ r(x,u,x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u') \Big\}$$
$\Rightarrow$ regression on $(x_{i_s}, u_{i_s}) \to q_{i_s}$ naturally achieves this
Convergence to a near-optimal region (given MDP and approximator characteristics)
(Munos & Szepesvári, 2008)

[Figure: left, fitted QI; right, fuzzy QI]
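
A minimal Python sketch of fitted Q-iteration; the synthetic batch, the placeholder dynamics and reward, and the choice of an extra-trees regressor are illustrative assumptions (any supervised regressor can be substituted):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
actions = np.array([-1.0, 1.0])
gamma = 0.95

# Fake batch of samples (x_s, u_s, r_s, x'_s) from placeholder dynamics.
x = rng.uniform(-1, 1, size=1000)
u = rng.choice(actions, size=1000)
x_next = np.clip(x + 0.1 * u, -1, 1)
r = -x_next**2

model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(np.column_stack([x, u]), r)        # Q_1: regression on rewards only

for _ in range(20):                          # fitted QI iterations
    # Targets: q_s = r_s + gamma * max_u' Q_hat(x'_s, u')
    q_next = np.column_stack([
        model.predict(np.column_stack([x_next, np.full_like(x_next, a)]))
        for a in actions
    ])
    targets = r + gamma * q_next.max(axis=1)
    model = ExtraTreesRegressor(n_estimators=50, random_state=0)
    model.fit(np.column_stack([x, u]), targets)   # regress (x_s,u_s) -> q_s
```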

Guarantees

Theorem (Bertsekas & Tsitsiklis 1996, Lagoudakis & Parr 2003)
If every policy evaluation is $\varepsilon$-accurate: $\| \hat{Q}^{h_\ell} - Q^{h_\ell} \|_\infty \leq \varepsilon$,
a near-optimal policy is eventually obtained:
$$\limsup_{\ell \to \infty} \big\| Q^{h_\ell} - Q^* \big\|_\infty \leq \frac{2\gamma\varepsilon}{(1-\gamma)^2}$$

Projected policy evaluation (LSTD)

Linear approximator: $\hat{Q}_W(x,u) = \phi^\top(x,u)\, W$
Bellman equation $\hat{Q}^h = T^h(\hat{Q}^h)$ instantiated to $\hat{Q}_W = P[T^h(\hat{Q}_W)]$,
with $P$ a weighted least-squares projection
Has a meaningful solution under conditions on $P$
Called least-squares temporal difference

Projected policy evaluation (cont'd)

$\hat{Q}_W = P[T^h(\hat{Q}_W)]$
$\hat{Q}_W$ linear in $W$; $T^h(Q)$ linear in $Q$; $P(Q)$ linear in $Q$
$\Rightarrow$ rewrite as a linear equation in $W$: $AW = b$
$A, b$ can be estimated from samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$:
$$A \leftarrow A + \phi(x_{i_s}, u_{i_s}) \big[ \phi(x_{i_s}, u_{i_s}) - \gamma\, \phi(x'_{i_s}, h(x'_{i_s})) \big]^\top$$
$$b \leftarrow b + \phi(x_{i_s}, u_{i_s})\, r_{i_s}$$
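
A minimal Python sketch of these sample-based estimates; `phi`, `h`, and the sample list are placeholders for the feature map, the evaluated policy, and the batch:

```python
import numpy as np

def lstd_weights(phi, samples, h, gamma=0.9):
    """Solve A W = b for the projected policy evaluation fixed point.

    phi(x, u) -> feature vector; h(x) -> action of the evaluated policy;
    samples = list of transitions (x, u, r, x_next).
    """
    n = len(phi(*samples[0][:2]))
    A = np.zeros((n, n))
    b = np.zeros(n)
    for x, u, r, x_next in samples:
        f = phi(x, u)
        # A <- A + phi(x,u) [phi(x,u) - gamma * phi(x', h(x'))]^T
        A += np.outer(f, f - gamma * phi(x_next, h(x_next)))
        # b <- b + phi(x,u) * r
        b += f * r
    # Assumes A is invertible (holds under the conditions on P).
    return np.linalg.solve(A, b)
```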

Example: LSPI for the inverted pendulum

BFs over state space: RBFs on a 15×9 grid
Action discretization: 3 actions
Samples: 7500 transitions from random $(x,\ \text{discrete } u)$ pairs

Comparison between API and AVI

Similar theoretical behavior in general: convergence to a near-optimal region
In practice, API typically converges in fewer iterations...
... but also has larger complexity per iteration

1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
    AVI: Fuzzy Q-iteration
    AVI: Fitted Q-iteration
    API: Least-squares policy iteration
3 Online temporal-difference RL
    Classical Q-learning and SARSA
    Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions

Q-learning

1. Take the (model-based) Q-iteration update:
$$Q_{\ell+1}(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u') \}$$
2. Use the transition sample $(x_k, u_k, r_{k+1}, x_{k+1})$ at each step $k$:
$$Q(x_k,u_k) \leftarrow r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u')$$
Note: $x_{k+1} = x'$ and $r_{k+1} = r(x,u,x')$ in the deterministic case; in the stochastic case they just provide a sample of the r.h.s.
3. Make the update incremental, with learning rate $\alpha_k \in (0,1]$:
$$Q(x_k,u_k) \leftarrow Q(x_k,u_k) + \alpha_k \big[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u') - Q(x_k,u_k) \big]$$
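
A minimal Python sketch of step 3 for a tabular Q-function, with integer-indexed states and actions (an illustrative convention, not prescribed by the slides):

```python
import numpy as np

def q_learning_update(Q, x, u, r_next, x_next, alpha, gamma=0.9):
    """One incremental Q-learning update of a tabular Q (n_x, n_u), in place."""
    # Temporal difference: error in the Bellman optimality equation
    # for the current sample (x_k, u_k, r_{k+1}, x_{k+1}).
    td = r_next + gamma * np.max(Q[x_next]) - Q[x, u]
    Q[x, u] += alpha * td
```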

Temporal difference

$\big[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u') - Q(x_k,u_k) \big]$ is the Q-learning temporal difference:
the error in the Bellman optimality equation for the current sample

Exploration-exploitation tradeoff

Essential condition for convergence to $Q^*$: all $(x,u)$ pairs must be visited infinitely often
$\Rightarrow$ exploration necessary: sometimes, choose actions randomly
Exploitation of current knowledge is also necessary: sometimes, choose actions greedily
Simple solution: $\varepsilon$-greedy
$$u_k = \begin{cases} \arg\max_u Q(x_k,u) & \text{with probability } 1-\varepsilon_k \\ \text{a random action} & \text{with probability } \varepsilon_k \end{cases}$$
with exploration probability $\varepsilon_k \in (0,1)$ decreasing over time
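
A minimal Python sketch of $\varepsilon$-greedy selection for a tabular Q-function; the $1/k$ decay is one common illustrative schedule, not prescribed here:

```python
import numpy as np

def eps_greedy(Q, x, k, rng, eps0=1.0):
    """Q: tabular Q-function of shape (n_x, n_u); eps_k decreases over time."""
    eps_k = eps0 / (k + 1)                     # illustrative decay schedule
    if rng.random() < eps_k:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[x]))                # exploit: greedy action
```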

SARSA

Algorithm: SARSA with $\varepsilon$-greedy exploration
for every trial do
    initialize $x_0$
    $u_0 = \arg\max_u Q(x_0,u)$ w.p. $1-\varepsilon_0$; random w.p. $\varepsilon_0$
    repeat at each step $k$
        apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
        $u_{k+1} = \arg\max_u Q(x_{k+1},u)$ w.p. $1-\varepsilon_{k+1}$; random w.p. $\varepsilon_{k+1}$
        $Q(x_k,u_k) \leftarrow Q(x_k,u_k) + \alpha_k \big[ r_{k+1} + \gamma Q(x_{k+1},u_{k+1}) - Q(x_k,u_k) \big]$
    until trial finished
end for
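
A minimal Python sketch of this loop on a toy 5-state chain; the chain, schedules, and episode length are illustrative assumptions. Note the on-policy TD target uses $Q(x_{k+1},u_{k+1})$, not $\max_{u'}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, gamma = 5, 2, 0.9
Q = np.zeros((n_x, n_u))

def step(x, u):  # placeholder dynamics: move left/right, reward at right end
    x_next = min(max(x + (1 if u == 1 else -1), 0), n_x - 1)
    return x_next, 1.0 if x_next == n_x - 1 else 0.0

def eps_greedy(x, eps):
    return int(rng.integers(n_u)) if rng.random() < eps else int(np.argmax(Q[x]))

for trial in range(200):
    x = 0
    u = eps_greedy(x, eps=1.0 / (trial + 1))
    for k in range(50):                      # "until trial finished"
        x_next, r_next = step(x, u)
        u_next = eps_greedy(x_next, eps=1.0 / (trial + 1))
        # on-policy temporal difference
        Q[x, u] += 0.1 * (r_next + gamma * Q[x_next, u_next] - Q[x, u])
        x, u = x_next, u_next
```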

1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
    AVI: Fuzzy Q-iteration
    AVI: Fitted Q-iteration
    API: Least-squares policy iteration
3 Online temporal-difference RL
    Classical Q-learning and SARSA
    Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions

Approximate Q-learning

Parametric approximation $\hat{Q}_W(x,u)$
Gradient descent on the error $[Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k)]^2$:
$$W \leftarrow W - \tfrac{1}{2}\alpha_k \frac{\partial}{\partial W}\big[ Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k) \big]^2
= W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\big[ Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k) \big]$$
Estimate $Q^*(x_k,u_k)$ using Bellman:
$$W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\big[ r_{k+1} + \gamma \max_{u'} \hat{Q}_W(x_{k+1},u') - \hat{Q}_W(x_k,u_k) \big]$$
(approximate temporal difference)
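
A minimal Python sketch of one such update for a linear approximator, where $\hat{Q}_W(x,u) = \phi(x,u)^\top W$ so $\partial \hat{Q}_W / \partial W = \phi(x,u)$; `phi` and the discrete action set are illustrative placeholders:

```python
import numpy as np

def approx_q_update(W, phi, actions, x, u, r_next, x_next, alpha, gamma=0.9):
    """One approximate Q-learning step for linear Q_W(x,u) = phi(x,u) @ W."""
    q_next = max(phi(x_next, a) @ W for a in actions)
    # approximate temporal difference
    td = r_next + gamma * q_next - phi(x, u) @ W
    return W + alpha * phi(x, u) * td        # gradient step: dQ_W/dW = phi(x,u)
```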

Approximate SARSA

Approximate SARSA, $\varepsilon$-greedy (Sutton & Barto, 1998)
for every trial do
    initialize $x_0$
    $u_0 = \arg\max_u \hat{Q}_W(x_0,u)$ w.p. $1-\varepsilon_0$; random w.p. $\varepsilon_0$
    repeat at each step $k$
        apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
        $u_{k+1} = \arg\max_u \hat{Q}_W(x_{k+1},u)$ w.p. $1-\varepsilon_{k+1}$; random w.p. $\varepsilon_{k+1}$
        $W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\big[ r_{k+1} + \gamma \hat{Q}_W(x_{k+1},u_{k+1}) - \hat{Q}_W(x_k,u_k) \big]$
    until trial finished
end for
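
A minimal Python sketch of this loop with linear features on a toy 1-D task; the RBF feature construction, placeholder dynamics and reward, and all schedules are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([-1.0, 1.0])
gamma, alpha = 0.95, 0.05
centers = np.linspace(-1, 1, 9)

def phi(x, u):  # state RBFs replicated per discrete action
    f = np.exp(-((x - centers) / 0.25) ** 2)
    out = np.zeros(len(centers) * len(actions))
    j = int(np.argmax(actions == u))
    out[j * len(centers):(j + 1) * len(centers)] = f
    return out

W = np.zeros(len(centers) * len(actions))

def eps_greedy(x, eps):
    if rng.random() < eps:
        return rng.choice(actions)
    return actions[int(np.argmax([phi(x, a) @ W for a in actions]))]

for trial in range(100):
    x = rng.uniform(-1, 1)
    u = eps_greedy(x, eps=1.0 / (trial + 1))
    for k in range(30):
        x_next = np.clip(x + 0.1 * u, -1, 1)   # placeholder dynamics
        r_next = -x_next**2                    # placeholder reward
        u_next = eps_greedy(x_next, eps=1.0 / (trial + 1))
        td = r_next + gamma * phi(x_next, u_next) @ W - phi(x, u) @ W
        W += alpha * phi(x, u) * td            # gradient step: dQ_W/dW = phi
        x, u = x_next, u_next
```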

Application: robotic goalkeeper

Vision-based control: learn how to catch a ball using the video camera image
6 states, 2 actions (mo