[...] continuous control tasks that require generalizing from data collected for a different task.

1. Introduction

Recent advances in machine learning using deep neural networks have shown significant successes in scaling to large datasets, such as ImageNet (Deng et al., 2009) in computer vision, SQuAD (Rajpurkar et al., 2016) in NLP, and RoboNet (Dasari et al., 2019) in robot learning. Reinforcement learning (RL) methods, in contrast, struggle to scale to many real-world applications, e.g., autonomous driving (Yu et al., 2018) and healthcare (Gottesman et al., 2019), because they rely on costly online trial-and-error. Designing RL algorithms that can learn from diverse, static datasets [...]

Model-based RL algorithms (Kumar et al., 2016; Janner et al., 2019; Luo et al., 2018) make a natural choice for enabling generalization, for a number of reasons. First, model-based RL algorithms effectively receive more supervision, since the model is trained on every transition, even in sparse-reward settings. Second, they are trained with supervised learning, which provides more stable and less noisy gradients than bootstrapping. Lastly, uncertainty estimation techniques, such as bootstrap ensembles, are well developed for supervised learning methods (Lakshminarayanan et al., 2017; Kuleshov et al., 2018; Snoek et al., 2019) and are known to perform poorly for value-based RL methods ([...]).

[...] $r(s, a)$ the reward function, $\mu_0$ the initial state distribution, and $\gamma \in (0, 1)$ the discount factor. The goal in RL is to optimize a policy $\pi(a|s)$ that maximizes the expected discounted return $\eta_M(\pi) := \mathbb{E}_{\pi, T, \mu_0}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\big]$. The value function $V^{\pi}_M(s) := \mathbb{E}_{\pi, T}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\big]$ gives the expected discounted return under $\pi$ when starting from state $s$.

In the model-based approach we will have a dynamics model $\widehat{T}$ estimated from the transitions in the static dataset $D_{\mathrm{env}} = \{(s, a, r, s')\}$, giving the model MDP $\widehat{M} = (S, A, \widehat{T}, r, \mu_0, \gamma)$. [...] Let $\rho^{\pi}_{\widehat{T}}(s, a) := \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi, \widehat{T})\, \pi(a|s)$ denote the discounted occupancy measure of $\pi$ under $\widehat{T}$. Note that $\rho^{\pi}_{\widehat{T}}$, as defined here, is not a properly normalized probability distribution, as it integrates to $1/(1-\gamma)$. We will denote (improper) expectations with respect to $\rho^{\pi}_{\widehat{T}}$ by $\mathbb{E}_{\rho^{\pi}_{\widehat{T}}}$. [...] $\eta_{\widehat{M}}(\pi)$, the return under the estimated dynamics. The error of this estimator depends, potentially in a complex fashion, on the error of $\widehat{T}$.
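To make the preceding discussion concrete, the sketch below fits a bootstrap ensemble of dynamics models to a static dataset $D_{\mathrm{env}}$ of $(s, a, r, s')$ transitions by supervised regression on every transition, predicting the state change and the reward. It is a minimal sketch, not the paper's implementation: the network architecture, the hyperparameters, and the choice to regress onto $s' - s$ are assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation): fit a bootstrap ensemble of
# dynamics models to a static dataset of transitions by supervised regression.
# Each model predicts [s' - s, r] from (s, a); architecture and hyperparameters
# below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),  # outputs [predicted s' - s, predicted r]
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_ensemble(states, actions, rewards, next_states,
                   n_models=5, epochs=20, batch_size=256, lr=1e-3):
    """Train each ensemble member on its own bootstrap resample of D_env."""
    n, state_dim = states.shape
    action_dim = actions.shape[1]
    targets = np.concatenate([next_states - states, rewards[:, None]], axis=1)
    ensemble = []
    for _ in range(n_models):
        idx = np.random.randint(0, n, size=n)  # bootstrap resample of the dataset
        s = torch.as_tensor(states[idx], dtype=torch.float32)
        a = torch.as_tensor(actions[idx], dtype=torch.float32)
        y = torch.as_tensor(targets[idx], dtype=torch.float32)
        model = DynamicsModel(state_dim, action_dim)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            perm = torch.randperm(n)
            for start in range(0, n, batch_size):
                batch = perm[start:start + batch_size]
                loss = nn.functional.mse_loss(model(s[batch], a[batch]), y[batch])
                opt.zero_grad()
                loss.backward()
                opt.step()
        ensemble.append(model)
    return ensemble
```

With such an ensemble in hand, the spread of the members' predictions at a given $(s, a)$ gives a rough epistemic-uncertainty signal of the kind the bootstrap-ensemble literature cited above develops; the later sketches reuse it.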

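Given a learned model, $\eta_{\widehat{M}}(\pi)$, the return under the estimated dynamics, can be approximated by Monte Carlo rollouts of the policy inside the model. The sketch below assumes the ensemble produced by the previous snippet, a policy callable, and a sampler for the initial state distribution $\mu_0$; truncating the infinite discounted sum at a finite horizon and averaging the ensemble's predictions are simplifications chosen for illustration.

```python
# Minimal sketch: Monte Carlo estimate of eta_{M_hat}(pi), the expected
# discounted return of a policy under the learned dynamics model. Assumes the
# ensemble from the previous sketch, a policy(state) -> action callable, and a
# sampler for the initial state distribution mu_0; the horizon truncation is an
# illustrative approximation of the infinite discounted sum.
import numpy as np
import torch

def estimate_model_return(ensemble, policy, sample_initial_state,
                          gamma=0.99, horizon=200, n_rollouts=50):
    returns = []
    for _ in range(n_rollouts):
        state = sample_initial_state()            # s_0 ~ mu_0
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            a = torch.as_tensor(action, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                # Average the ensemble's predictions (one simple choice; sampling
                # a member per step is another common option).
                pred = torch.stack([m(s, a) for m in ensemble]).mean(0).squeeze(0)
            delta, reward = pred[:-1].numpy(), float(pred[-1])
            total += discount * reward
            discount *= gamma
            state = state + delta                 # predicted s_{t+1}
        returns.append(total)
    return float(np.mean(returns))
```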
[...] the integral probability metric $d_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \big|\, \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \,\big|$, relates the error of the model with the error of the return. Although Assumption 3.1 is somewhat abstract, it typically holds in practice: if we assume that the reward function is bounded by $r_{\max}$, then $V^{\pi}_M$ is bounded by $r_{\max}/(1-\gamma)$, so Assumption 3.1 holds with $c = r_{\max}/(1-\gamma)$ and $\mathcal{F} = \{f : \|f\|_{\infty} \le 1\}$. In this case $d_{\mathcal{F}}$ is the total variation distance. We note that with stronger assumptions on the MDP, one can also obtain bounds in terms of 1-Wasserstein distance or maximum mean discrepancy.

Assumption 3.2. We assume a function $u : S \times A \to \mathbb{R}$ which is an admissible error estimator for $\widehat{T}$, meaning that $d_{\mathcal{F}}(\widehat{T}(s, a), T(s, a)) \le u(s, a)$ for all $s, a$.

Lemma 3.3. Let $M$ and $\widehat{M}$ be two MDPs with reward function $r$ but different dynamics $T$ and $\widehat{T}$ respectively. Then, under Assumptions 3.1 and 3.2, letting $\lambda := \gamma c$, $\big|\eta_{\widehat{M}}(\pi) - \eta_M(\pi)\big| \le \lambda\, \mathbb{E}_{(s, a) \sim \rho^{\pi}_{\widehat{T}}}[u(s, a)]$.

[...] Then the learned policy $\hat{\pi}$ in MOPO (Algorithm 1) satisfies $\eta_M(\hat{\pi}) \ge$ [...]

[Results table fragment (D4RL datasets): the "medium walker2d" row with values 498.4, 3752.7, 645.5 ± 464.8, 582.6 ± 348.8, preceded by trailing values 1527.9 and 1030.0 from the previous row; the column headers did not survive the transcript.]
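Assumption 3.2's admissible error estimator $u(s, a)$ is generally not available exactly; a common heuristic is to use disagreement across a bootstrap ensemble as a proxy for the model error. The sketch below uses that proxy to form an uncertainty-penalized reward $\tilde{r}(s, a) = \hat{r}(s, a) - \lambda\, u(s, a)$, with $\lambda$ playing the role of $\gamma c$ from Lemma 3.3 but treated as a tunable hyperparameter. This is an illustrative stand-in, not the exact estimator analyzed above, and it assumes the ensemble from the earlier sketches.

```python
# Minimal sketch: an uncertainty-penalized reward in the spirit of Lemma 3.3,
#   r_tilde(s, a) = r_hat(s, a) - lam * u(s, a),
# where u(s, a) is a heuristic proxy for the admissible error estimator of
# Assumption 3.2: the disagreement (standard deviation) of the bootstrap
# ensemble's next-state predictions. lam stands in for gamma * c and is
# treated here as a tunable hyperparameter.
import torch

def penalized_reward(ensemble, state, action, lam=1.0):
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    a = torch.as_tensor(action, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        preds = torch.stack([m(s, a) for m in ensemble]).squeeze(1)  # (K, state_dim + 1)
    reward_hat = preds[:, -1].mean().item()             # mean predicted reward
    u = preds[:, :-1].std(dim=0).norm().item()          # ensemble disagreement as u(s, a)
    return reward_hat - lam * u
```

Optimizing a policy in the learned model against $\tilde{r}$ trades off estimated return against the error term $\mathbb{E}_{(s, a) \sim \rho^{\pi}_{\widehat{T}}}[u(s, a)]$ that appears in Lemma 3.3, discouraging the policy from exploiting state-action regions where the model is unreliable.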

References

Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning, 2020.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018a.

Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018b.

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12498–12509, 2019.

Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018.

Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11761–11771, 2019.

Kumar, V., Todorov, E., and Levine, S. Optimal control with learned local models: Application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 378–383. IEEE, 2016.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning, pp. 1–9, 2013.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
