Offline AVI & API



Presentation Transcript

Reminder: Markov decision process
Observe states $x$, apply actions $u$, receive rewards $r$.
System: stochastic dynamics $x_{k+1} \sim f(x_k, u_k, \cdot)$
Performance: reward function $r_{k+1} = r(x_k, u_k, x_{k+1})$
Controller: $u_k = h(x_k)$

Goal
Find $h$ to maximize, from any $x_0$, the expected discounted return:
$R^h(x_0) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\middle|\, h \right\}$
Equivalent to minimizing the expected cost:
$E\{J(x_0) \mid h\} = E\left\{ \sum_{k=0}^{\infty} \gamma^k (-r_{k+1}) \,\middle|\, h \right\}$

Value & policy iteration with Q-functions

Q-value iteration: turn the Bellman optimality equation into an iterative update:
repeat
  $Q(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q(x',u') \}$, for all $x, u$
until convergence to $Q^*$

Policy iteration: iteratively evaluate and improve policies:
repeat
  policy evaluation: solve $Q^h(x,u) = E\{ r(x,u,x') + \gamma Q^h(x', h(x')) \}$ (e.g. by using a VI-like update)
  policy improvement: $h(x) \leftarrow \arg\max_u Q^h(x,u)$, for all $x$
until convergence to $h^*$
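As a concrete illustration of the Q-value iteration update above, here is a minimal tabular sketch; the transition array `P`, reward array `R`, and the stopping tolerance are illustrative assumptions for a small finite MDP, not part of the slides.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular Q-value iteration.

    P[x, u, y] : transition probabilities, R[x, u, y] : rewards.
    Repeats Q(x,u) <- E{ r(x,u,x') + gamma * max_u' Q(x',u') } until convergence.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Bellman optimality backup for every (x, u) pair
        Q_new = np.einsum('xuy,xuy->xu', P, R + gamma * Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Greedy policy from the converged Q-function: h*(x) = argmax_u Q*(x, u)
# h = q_value_iteration(P, R).argmax(axis=1)
```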

Taxonomy of methods
By path to solution:
1. Approximate value iteration: find $\hat{Q}^*$, use it to compute $\hat{h}^*$
2. Approximate policy iteration: evaluate $h$ (find $\hat{Q}^h$), improve $h$, repeat
3. Approximate policy search: look directly for $\hat{h}^*$
By level of interaction:
1. Offline (batch): data collected in advance
2. Online: learn by interaction

Assumptions (for fuzzy Q-iteration)
Deterministic system.
BFs normalized: $\sum_{i=1}^{N} \phi_i(x) = 1$.
At the center $x_i$ of each BF, $\phi_i(x_i) = 1$ and $\phi_{i'}(x_i) = 0$ for all $i' \neq i$.
The simplest BFs satisfying this are triangular, giving multilinear interpolation.

Fuzzy Q-iteration
Revisit exact Q-iteration:
repeat at each iteration $\ell$
  $Q_{\ell+1}(x,u) \leftarrow r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u')$ for all $x, u$
  $=: [T(Q_\ell)](x,u)$, with $T: \mathcal{Q} \to \mathcal{Q}$ the Bellman mapping
until convergence to $Q^*$
output $Q^*$, $h^*(x) = \arg\max_u Q^*(x,u)$

Fuzzy Q-iteration (Buşoniu et al., 2007):
repeat at each iteration $\ell$
  $W_{\ell+1,i,j} \leftarrow [T(\hat{Q}_{W_\ell})](x_i,u_j)$ for all $i, j$
  $= r(x_i,u_j,x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u')$
until convergence to $W^*$
output $\hat{Q}_{W^*}$, $h_{W^*}(x) = \arg\max_{u_j} \hat{Q}_{W^*}(x,u_j)$
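A rough sketch of the fuzzy Q-iteration loop above. The basis-function evaluator `phi`, the deterministic model `f` and `r`, and the stopping rule are illustrative assumptions; the approximator is $\hat{Q}_W(x,u_j) = \sum_i \phi_i(x)\,W_{i,j}$, as in the slides.

```python
import numpy as np

def fuzzy_q_iteration(centers, phi, actions, f, r, gamma=0.95, tol=1e-8, max_iters=1000):
    """Fuzzy Q-iteration sketch (in the spirit of Busoniu et al., 2007).

    centers : list of BF centers x_i        phi(x) : normalized BF vector at x
    actions : discrete action set u_j       f, r   : deterministic model
    """
    N, M = len(centers), len(actions)
    W = np.zeros((N, M))
    for _ in range(max_iters):
        W_new = np.empty_like(W)
        for i, x in enumerate(centers):
            for j, u in enumerate(actions):
                x_next = f(x, u)
                # W_{l+1,i,j} <- r(x_i, u_j, x') + gamma * max_u' Q_W(x', u')
                W_new[i, j] = r(x, u, x_next) + gamma * np.max(phi(x_next) @ W)
        if np.max(np.abs(W_new - W)) < tol:
            break
        W = W_new
    return W  # greedy policy: h(x) = actions[np.argmax(phi(x) @ W)]
```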

Solution quality
Characterize the approximation power by the minimum distance to $Q^*$:
$\varepsilon^* = \min_{W \in \mathbb{R}^{NM}} \left\| Q^* - \hat{Q}_W \right\|_\infty$
Theorem (continued). The returned Q-function is near-optimal:
$\left\| Q^* - \hat{Q}_{W^*} \right\|_\infty \le \frac{2\varepsilon^*}{1-\gamma}$
...and the corresponding policy is also near-optimal:
$\left\| Q^* - Q^{h_{W^*}} \right\|_\infty \le \frac{4\gamma\varepsilon^*}{(1-\gamma)^2}$

Example: Fuzzy QI for the inverted pendulum
BFs over state space: triangular, on a 41 x 21 equidistant grid.
Action discretization: grid of 5 actions, centered on 0.

Outline
1. Introduction & Recap
2. Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3. Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4. Optimistic planning
5. Conclusions

Generalizing fuzzy Q-iteration
Arbitrary approximator, parametric or nonparametric (fuzzy QI: restricted BFs + discrete actions).
Arbitrary transition samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$, $i_s = 1, \ldots, n_s$ (fuzzy QI: center-discrete-action pairs $(x_i, u_j)$).
Possibly stochastic system.

Fitted Q-iteration (cont'd)
If the system is stochastic, we actually aim for:
$\hat{Q}_W(x,u) \approx E\left\{ r(x,u,x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u') \right\}$
Regression on $(x_{i_s}, u_{i_s}) \to q_{i_s}$ naturally achieves this.
Convergence to a near-optimal region (given MDP and approximator characteristics) (Munos & Szepesvári, 2008).
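A minimal fitted Q-iteration sketch under stated assumptions: a fixed batch of transitions, a finite action set, and a generic regressor with a scikit-learn-style fit/predict interface. All names here are illustrative, not from the slides.

```python
import numpy as np

def fitted_q_iteration(samples, regressor, actions, gamma=0.95, n_iters=100):
    """Fitted Q-iteration on a batch of transitions.

    samples   : list of (x, u, r, x') tuples collected in advance (offline/batch)
    regressor : any object with fit(X, y) / predict(X), e.g. an ensemble of trees
    actions   : finite action set used when taking the max over u'
    """
    X = np.array([np.r_[x, u] for x, u, _, _ in samples])   # regression inputs (x, u)
    rewards = np.array([r for _, _, r, _ in samples])
    next_states = np.array([x_next for _, _, _, x_next in samples])

    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            targets = rewards                                 # first pass: Q_0 = 0
        else:
            # q_is = r_is + gamma * max_u' Qhat(x'_is, u')
            next_q = np.column_stack([
                q_model.predict(np.c_[next_states, np.full(len(samples), u)])
                for u in actions
            ])
            targets = rewards + gamma * next_q.max(axis=1)
        regressor.fit(X, targets)                             # regression on (x_is, u_is) -> q_is
        q_model = regressor
    return q_model
```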

(Figure: left, fitted QI; right, fuzzy QI.)

Guarantees
Theorem (Bertsekas & Tsitsiklis, 1996; Lagoudakis & Parr, 2003).
If every policy evaluation is $\varepsilon$-accurate, $\| \hat{Q}^{h_\ell} - Q^{h_\ell} \|_\infty \le \varepsilon$, a near-optimal policy is eventually obtained:
$\limsup_{\ell \to \infty} \| Q^{h_\ell} - Q^* \|_\infty \le \frac{2\gamma\varepsilon}{(1-\gamma)^2}$

Projected policy evaluation (LSTD)
Linear approximator $\hat{Q}_W(x,u) = \phi^\top(x,u)\, W$.
The Bellman equation $\hat{Q}^h = T^h(\hat{Q}^h)$ is instantiated to $\hat{Q}_W = P[T^h(\hat{Q}_W)]$, with $P$ a weighted least-squares projection.
Has a meaningful solution under conditions on $P$. Called least-squares temporal difference.

Projected policy evaluation (cont'd)
$\hat{Q}_W = P[T^h(\hat{Q}_W)]$: $\hat{Q}_W$ is linear in $W$, $T^h(Q)$ is linear in $Q$, and $P(Q)$ is linear in $Q$, so this can be rewritten as a linear equation in $W$: $A W = b$.
$A$ and $b$ can be estimated from samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$:
$A \leftarrow A + \phi(x_{i_s},u_{i_s}) \left[ \phi(x_{i_s},u_{i_s}) - \gamma\, \phi(x'_{i_s}, h(x'_{i_s})) \right]^\top$
$b \leftarrow b + \phi(x_{i_s},u_{i_s})\, r_{i_s}$
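The sample-based estimates of $A$ and $b$ above translate directly into code. This sketch assumes a feature map `phi(x, u)` and a fixed policy `h` to evaluate; both are illustrative callables.

```python
import numpy as np

def lstd_q(samples, phi, h, gamma=0.95):
    """Least-squares temporal difference for Q^h (projected policy evaluation).

    samples : list of (x, u, r, x') transitions
    phi     : feature map phi(x, u) -> vector of length n
    h       : policy being evaluated, h(x') -> u'
    Builds A and b sample by sample, then solves A W = b.
    """
    n = len(phi(*samples[0][:2]))
    A = np.zeros((n, n))
    b = np.zeros(n)
    for x, u, r, x_next in samples:
        feat = phi(x, u)
        feat_next = phi(x_next, h(x_next))
        # A <- A + phi(x,u) [phi(x,u) - gamma * phi(x', h(x'))]^T
        A += np.outer(feat, feat - gamma * feat_next)
        # b <- b + phi(x,u) * r
        b += feat * r
    # least-squares solve is used in case A is (near-)singular
    W = np.linalg.lstsq(A, b, rcond=None)[0]
    return W
```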

Example: LSPI for the inverted pendulum
BFs over state space: RBFs on a 15 x 9 grid.
Action discretization: 3 actions.
Samples: 7500 transitions from random (x, discrete u) pairs.

Comparison between API and AVI
Similar theoretical behavior in general: convergence to a near-optimal region.
In practice, API typically converges in fewer iterations...
...but also has larger complexity per iteration.

Online temporal-difference RL

Q-learning
1. Take the (model-based) Q-iteration update: $Q_{\ell+1}(x,u) \leftarrow E\{ r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u') \}$
2. Use the transition sample $(x_k, u_k, r_{k+1}, x_{k+1})$ at each step $k$:
   $Q(x_k,u_k) \leftarrow r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u')$
   Note that $x_{k+1} = x'$ and $r_{k+1} = r(x,u,x')$ in the deterministic case; in the stochastic case they just provide a sample of the right-hand side.
3. Make the update incremental with learning rate $\alpha_k \in (0,1]$:
   $Q(x_k,u_k) \leftarrow Q(x_k,u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u') - Q(x_k,u_k) \right]$
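One incremental Q-learning step, exactly the update in item 3 above, as a small sketch; the tabular representation (discrete states and actions indexing a Q array) and the fixed step size are assumptions for illustration.

```python
import numpy as np

def q_learning_update(Q, x, u, r_next, x_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step.

    Q : array of shape (n_states, n_actions), indexed by discrete x and u.
    Applies Q(x,u) <- Q(x,u) + alpha * [r + gamma * max_u' Q(x',u') - Q(x,u)].
    """
    td_error = r_next + gamma * np.max(Q[x_next]) - Q[x, u]   # temporal difference
    Q[x, u] += alpha * td_error
    return Q
```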

Temporal difference
$\left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1},u') - Q(x_k,u_k) \right]$ is the Q-learning temporal difference: the "error" in the Bellman optimality equation for the current sample.

Exploration-exploitation tradeoff
An essential condition for convergence to $Q^*$ is that all $(x,u)$ pairs are visited infinitely often, so exploration is necessary: sometimes, choose actions randomly.
Exploitation of current knowledge is also necessary: sometimes, choose actions greedily.
Simple solution: $\varepsilon$-greedy:
$u_k = \arg\max_u Q(x_k,u)$ with probability $(1-\varepsilon_k)$, a random action with probability $\varepsilon_k$,
with exploration probability $\varepsilon_k \in (0,1)$ decreasing in time.

SARSA
Algorithm: SARSA with $\varepsilon$-greedy exploration
for every trial do
  initialize $x_0$; $u_0 = \arg\max_u Q(x_0,u)$ w.p. $(1-\varepsilon_0)$, random w.p. $\varepsilon_0$
  repeat at each step $k$
    apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
    $u_{k+1} = \arg\max_u Q(x_{k+1},u)$ w.p. $(1-\varepsilon_{k+1})$, random w.p. $\varepsilon_{k+1}$
    $Q(x_k,u_k) \leftarrow Q(x_k,u_k) + \alpha_k \left[ r_{k+1} + \gamma Q(x_{k+1},u_{k+1}) - Q(x_k,u_k) \right]$
  until trial finished
end for
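A sketch of one SARSA trial with $\varepsilon$-greedy exploration. The environment interface (`reset`/`step`), the fixed $\varepsilon$ and $\alpha$, and the tabular Q array are illustrative assumptions rather than part of the algorithm statement above.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon, rng):
    """Greedy action w.p. (1 - epsilon), random action w.p. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[x]))

def sarsa_trial(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """One SARSA trial; env is a hypothetical environment with
    reset() -> x and step(u) -> (x_next, r_next, done)."""
    rng = rng or np.random.default_rng()
    x = env.reset()
    u = epsilon_greedy(Q, x, epsilon, rng)
    done = False
    while not done:
        x_next, r_next, done = env.step(u)
        u_next = epsilon_greedy(Q, x_next, epsilon, rng)
        # on-policy temporal difference: uses the action actually chosen next
        Q[x, u] += alpha * (r_next + gamma * Q[x_next, u_next] - Q[x, u])
        x, u = x_next, u_next
    return Q
```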

Approximate Q-learning

Parametric approximation $\hat{Q}_W(x,u)$.
Gradient descent on the error $[Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k)]^2$:
$W \leftarrow W - \tfrac{1}{2}\alpha_k \frac{\partial}{\partial W}\left[ Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k) \right]^2 = W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\left[ Q^*(x_k,u_k) - \hat{Q}_W(x_k,u_k) \right]$
Estimate $Q^*(x_k,u_k)$ using the Bellman equation:
$W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\left[ r_{k+1} + \gamma \max_{u'} \hat{Q}_W(x_{k+1},u') - \hat{Q}_W(x_k,u_k) \right]$
(the approximate temporal difference)
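For a linear-in-parameters approximator $\hat{Q}_W(x,u) = \phi^\top(x,u) W$, the gradient $\partial \hat{Q}_W / \partial W$ is just $\phi(x,u)$, so the update above reduces to the following sketch; the feature map `phi` and the finite action set are assumptions for illustration.

```python
import numpy as np

def approx_q_learning_update(W, phi, actions, x, u, r_next, x_next,
                             alpha=0.1, gamma=0.95):
    """One approximate Q-learning step for a linear Qhat_W(x,u) = phi(x,u)^T W.

    W <- W + alpha * phi(x,u) * [r + gamma * max_u' Qhat_W(x',u') - Qhat_W(x,u)],
    since dQhat_W(x,u)/dW = phi(x,u) in the linear case.
    """
    feat = phi(x, u)
    max_next = max(phi(x_next, u_next) @ W for u_next in actions)   # max over discrete actions
    td_error = r_next + gamma * max_next - feat @ W                 # approximate temporal difference
    return W + alpha * feat * td_error
```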

Approximate SARSA
Approximate SARSA with $\varepsilon$-greedy exploration (Sutton & Barto, 1998):

for every trial do
  initialize $x_0$; $u_0 = \arg\max_u \hat{Q}_W(x_0,u)$ w.p. $(1-\varepsilon_0)$, random w.p. $\varepsilon_0$
  repeat at each step $k$
    apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
    $u_{k+1} = \arg\max_u \hat{Q}_W(x_{k+1},u)$ w.p. $(1-\varepsilon_{k+1})$, random w.p. $\varepsilon_{k+1}$
    $W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k,u_k)}{\partial W}\left[ r_{k+1} + \gamma \hat{Q}_W(x_{k+1},u_{k+1}) - \hat{Q}_W(x_k,u_k) \right]$
  until trial finished
end for
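The corresponding approximate SARSA update for the same linear approximator, sketched under the same assumptions as the block above; the only change from approximate Q-learning is that the temporal difference uses the action actually taken next instead of the max.

```python
def approx_sarsa_update(W, phi, x, u, r_next, x_next, u_next, alpha=0.1, gamma=0.95):
    """One approximate SARSA step for a linear Qhat_W(x,u) = phi(x,u)^T W."""
    feat = phi(x, u)
    # on-policy temporal difference with the next action u_next actually selected
    td_error = r_next + gamma * phi(x_next, u_next) @ W - feat @ W
    return W + alpha * feat * td_error
```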

Application: robotic goalkeeper
Vision-based control: learn how to catch the ball using the video camera image. 6 states, 2 actions (mo...
