Exploration: Part 2
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine

Class Notes

1. Homework 4 due today!

Recap: what's the problem? This is easy (mostly); this is impossible. Why?

Recap: classes of exploration methods in deep RL
• Optimistic exploration:
  • new state = good state
  • requires estimating state visitation frequencies or novelty
  • typically realized by means of exploration bonuses
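The exploration-bonus idea above can be sketched in a few lines. This is a minimal count-based example (not from the lecture): the bonus form 1/sqrt(N(s)) and the coefficient `beta` are common but illustrative choices, and `bonus_reward` is a hypothetical helper.

```python
import math
from collections import defaultdict

def bonus_reward(reward, state, counts, beta=1.0):
    """Augment the task reward with an optimism bonus that decays as a
    state is visited more often: r+(s) = r(s) + beta / sqrt(N(s))."""
    counts[state] += 1
    return reward + beta / math.sqrt(counts[state])

counts = defaultdict(int)
# First visit to state "A" gets a large bonus; repeat visits decay.
r1 = bonus_reward(0.0, "A", counts)   # bonus = 1/sqrt(1) = 1.0
r2 = bonus_reward(0.0, "A", counts)   # bonus = 1/sqrt(2), smaller
r3 = bonus_reward(0.0, "B", counts)   # new state, full bonus again
```

The decaying bonus makes rarely visited states look temporarily valuable, which is exactly the "new state = good state" optimism above.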
• Thompson sampling style algorithms:
  • learn a distribution over Q-functions or policies
  • sample and act according to the sample
• Information gain style algorithms:
  • reason about information gain from visiting new states

Posterior sampling in deep RL
Thompson sampling: what do we sample? How do we represent the distribution?
Since Q-learning is off-policy, we don't care which Q-function was used to collect the data.
Bootstrap: Osband et al., "Deep Exploration via Bootstrapped DQN"

Why does this work? (Osband et al., "Deep Exploration via Bootstrapped DQN")
• Exploring with random actions (e.g., epsilon-greedy): oscillate back and forth, might not go to a coherent or interesting place
• Exploring with random Q-functions: commit to a randomized but internally consistent strategy for an entire episode
+ no change to original reward function
- very good bonuses often do better
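The episode-level sampling that makes this work can be sketched as follows, assuming a tiny tabular setting. The random tables stand in for bootstrapped heads trained on different resamplings of the buffer, and `run_episode` is a hypothetical helper: sample one Q-function at the start of the episode, then act greedily with respect to it throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

n_heads, n_states, n_actions = 5, 4, 2
# An ensemble of Q-functions (random tables standing in for
# bootstrapped heads trained on resampled replay data).
q_heads = rng.normal(size=(n_heads, n_states, n_actions))

def run_episode(q_heads, states):
    """Thompson-sampling-style acting: sample ONE head per episode and
    act greedily w.r.t. it for every step, giving a randomized but
    internally consistent exploration strategy."""
    head = rng.integers(len(q_heads))        # sampled once, not per step
    q = q_heads[head]
    return [int(np.argmax(q[s])) for s in states]

actions = run_episode(q_heads, states=[0, 1, 2, 3])
```

Contrast with epsilon-greedy, which would re-randomize at every step and therefore never commit to a coherent strategy.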
Reasoning about information gain (approximately)
Info gain is generally intractable to use exactly, regardless of what is being estimated!
A few approximations (Schmidhuber '91, Bellemare '16):
• intuition: if the density changed a lot, the state was novel (Houthooft et al., "VIME")
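This density-change intuition can be sketched with a toy count-based density model (an illustrative stand-in, not the VIME implementation): the bonus is the prediction gain, i.e., how much the log-density at a state changes after observing that state.

```python
import math
from collections import Counter

class StateDensity:
    """Toy categorical density model over discrete states
    (Laplace-smoothed), standing in for a learned density model."""
    def __init__(self, n_states):
        self.n_states = n_states
        self.counts = Counter()
        self.total = 0
    def prob(self, s):
        return (self.counts[s] + 1) / (self.total + self.n_states)
    def update(self, s):
        self.counts[s] += 1
        self.total += 1

def prediction_gain(model, s):
    """Novelty bonus: change in log-density at s after observing s.
    Large gain => the model had to move a lot => s was novel."""
    before = model.prob(s)
    model.update(s)
    after = model.prob(s)
    return math.log(after) - math.log(before)

m = StateDensity(n_states=3)
g1 = prediction_gain(m, 0)   # first visit: density changes a lot
g2 = prediction_gain(m, 0)   # repeat visit: smaller change
```

The same quantity underlies the pseudo-count construction in Bellemare '16, where the density model's prediction gain is converted into an effective visit count.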
VIME implementation (Houthooft et al., "VIME"):
+ appealing mathematical formalism
- models are more complex, generally harder to use effectively

Approximate IG: exploration with model errors
Stadie et al. 2015:
• encode image observations using an autoencoder
• build a predictive model on the autoencoder latent states
• use the model error as an exploration bonus
Schmidhuber et al. (see, e.g., "Formal Theory of Creativity, Fun, and Intrinsic Motivation"):
• exploration bonus for model error
• exploration bonus for model gradient
• many other variations
Many others! (low novelty vs. high novelty)

Recap: classes of exploration methods in deep RL
• Optimistic exploration:
  • exploration with counts and pseudo-counts
  • different models for estimating densities
• Thompson sampling style algorithms:
  • maintain a distribution over models via bootstrapping
  • distribution over Q-functions
• Information gain style algorithms:
  • generally intractable
  • can use variational approximation to information gain

Suggested readings
Schmidhuber. (1992). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
Stadie, Levine, Abbeel. (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.

Break

Imitation vs. Reinforcement Learning
Imitation learning:
• requires demonstrations
• must address distributional shift
• simple, stable supervised learning
• only as good as the demo
Reinforcement learning:
• requires a reward function
• must address exploration
• potentially non-convergent RL
• can become arbitrarily good
Can we get the best of both? E.g., what if we have demonstrations and rewards?

Addressing distributional shift with RL?
[diagram: update the reward using samples & demos; the generator produces policy samples from π, which the reward r scores]
IRL already addresses distributional shift via RL (this part is regular "forward" RL), but it doesn't use a known reward function!

Simplest combination: pretrain & finetune
• Demonstrations can overcome exploration: they show us how to do the task
• Reinforcement learning can improve beyond the performance of the demonstrator
• Idea: initialize with imitation learning, then finetune with reinforcement learning!

Simplest combination: pretrain & finetune (Muelling et al. '13)
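The pretrain & finetune recipe can be sketched in a toy bandit setting (illustrative, not any paper's implementation): behavior cloning first concentrates a softmax policy on the demo action, then policy-gradient finetuning improves beyond the (suboptimal) demonstrator. For clarity the exact expected policy gradient is used instead of sampled REINFORCE.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_actions = 3
theta = np.zeros(n_actions)        # policy logits, pi(a) = softmax(theta)[a]

# --- Stage 1: pretrain with imitation (behavior cloning).
# The demonstrator always picks action 2; maximize log pi(2).
for _ in range(50):
    p = softmax(theta)
    grad = -p
    grad[2] += 1.0                 # gradient of log pi(a=2)
    theta += 0.1 * grad
bc_policy = softmax(theta)         # concentrated on the demo action

# --- Stage 2: finetune with policy gradient on the true reward.
# The demo action (2) is suboptimal: action 1 is actually best, so RL
# can improve beyond the demonstrator. The exact expected gradient is
# dE[r]/dtheta_a = pi(a) * (r_a - E[r]).
reward = np.array([0.0, 1.0, 0.5])
for _ in range(2000):
    p = softmax(theta)
    theta += 0.1 * p * (reward - p @ reward)
final_policy = softmax(theta)
```

The failure mode discussed next also shows up here: if the early finetuning gradients are noisy or the reward is misleading, the cloned initialization can be destroyed before RL finds anything better.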
Pretrain & finetune vs. DAgger

What's the problem? Pretrain & finetune can be very bad (due to distribution shift): the first batch of (very) bad data can destroy the initialization. Can we avoid forgetting the demonstrations?

Off-policy reinforcement learning
• Off-policy RL can use any data
• If we let it use demonstrations as off-policy samples, can that mitigate the exploration challenges?
• Since demonstrations are provided as data in every iteration, they are never forgotten
• But the policy can still become better than the demos, since it is not forced to mimic them
(Either off-policy policy gradient with importance sampling, or off-policy Q-learning.)

Policy gradient with demonstrations: include both demonstrations and experience. Why is this a good idea? Don't we want on-policy samples? (optimal importance sampling)

Policy gradient with demonstrations: how do we construct the sampling distribution?
This works best with self-normalized importance sampling (self-normalized IS vs. standard IS).

Example: importance sampling with demos (Levine, Koltun '13, "Guided policy search")
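A minimal sketch of why self-normalized IS is convenient here: the weights are normalized to sum to one, so the target density only needs to be known up to a constant. The example below (names illustrative) estimates a Gaussian mean under a shifted proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

def snis(f_vals, logp, logq):
    """Self-normalized importance sampling estimate of E_p[f] from
    samples drawn under q. Normalizing the weights means logp and logq
    only need to be correct up to a shared additive constant."""
    w = np.exp(logp - logq)
    w = w / w.sum()                 # self-normalization step
    return float(np.sum(w * f_vals))

# Estimate E_p[x] = 1 for p = N(1, 1) using samples from q = N(0, 1).
x = rng.normal(0.0, 1.0, size=200_000)   # samples from the proposal q
logq = -0.5 * x ** 2                     # both densities up to the
logp = -0.5 * (x - 1.0) ** 2             # same dropped constant
est = snis(x, logp, logq)                # close to 1.0
```

Standard IS would require the exact normalizing constants; self-normalization trades a small bias for that robustness, which matters when the sampling distribution mixes demos and experience.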
Q-learning with demonstrations
• Q-learning is already off-policy, no need to bother with importance weights!
• Simple solution: drop the demonstrations into the replay buffer
(Vecerik et al., '17, "Leveraging Demonstrations for Deep Reinforcement Learning...")

What's the problem?
• Importance sampling: a recipe for getting stuck
• Q-learning: just good data is not enough

More problems with Q-learning
A dataset of transitions (a "replay buffer") used for off-policy Q-learning.
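The "drop the demonstrations into the replay buffer" recipe can be sketched as a buffer that never evicts demo transitions and mixes some into every minibatch; the class and its `demo_fraction` knob are hypothetical illustrations.

```python
import random

class DemoReplayBuffer:
    """Replay buffer that keeps demonstration transitions permanently
    and mixes them into every minibatch, so the demos are never
    forgotten no matter how much agent experience accumulates."""
    def __init__(self, demos, capacity=10_000, demo_fraction=0.25):
        self.demos = list(demos)          # never evicted
        self.agent = []                   # FIFO agent experience
        self.capacity = capacity
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.agent.append(transition)
        if len(self.agent) > self.capacity:
            self.agent.pop(0)             # evict oldest agent data only

    def sample(self, batch_size):
        n_demo = min(len(self.demos),
                     max(1, int(batch_size * self.demo_fraction)))
        batch = random.sample(self.demos, n_demo)
        if self.agent:
            batch += random.choices(self.agent, k=batch_size - n_demo)
        return batch

buf = DemoReplayBuffer(demos=[("s0", "a0", 1.0, "s1")])
for _ in range(100):
    buf.add(("s", "a", 0.0, "s2"))
batch = buf.sample(8)                     # always contains demo data
```

This guarantees the "never forgotten" property above, but as the slides note, good data alone is not enough: the Q-function can still pick out-of-distribution actions.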
(See, e.g., Riedmiller, Neural Fitted Q-Iteration '05; Ernst et al., Tree-Based Batch Mode RL '05.)
What action will this pick?

More problems with Q-learning
See: Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction.
See also: Fujimoto, Meger, Precup. Off-Policy Deep Reinforcement Learning without Exploration.
[plot: naïve RL vs. distribution matching (BCQ) vs. random data]
Only use values inside the support region: support constraint, pessimistic w.r.t. epistemic uncertainty (BEAR).

So far...
• Pure imitation learning
  • easy and stable supervised learning
  • distributional shift
  • no chance to get better than the demonstrations
• Pure reinforcement learning
  • unbiased reinforcement learning, can get arbitrarily good
  • challenging exploration and optimization problem
• Initialize & finetune
  • almost the best of both worlds
  • ...but can forget the demo initialization due to distributional shift
• Pure reinforcement learning, with demos as off-policy data
  • unbiased reinforcement learning, can get arbitrarily good
  • demonstrations don't always help
• Can we strike a compromise? A little bit of supervised, a little bit of RL?
Imitation as an auxiliary loss function: combine the RL objective with an imitation term (or some variant of this), i.e., maximize the expected reward plus a weighted demo log-likelihood; need to be careful in choosing this weight.

Example: hybrid policy gradient: increase the demo likelihood + the standard policy gradient (Rajeswaran et al., '17, "Learning Complex Dexterous Manipulation...")

Example: hybrid Q-learning: Q-learning loss + n-step Q-learning loss + regularization loss (because why not...) (Hester et al., '17, "Learning from Demonstrations...")

What's the problem?
• Need to tune the weight
• The design of the objective, especially for imitation, takes a lot of care
• Algorithm becomes problem-dependent

• Pure imitation learning
  • easy and stable supervised learning
  • distributional shift
  • no chance to get better than the demonstrations
• Pure reinforcement learning
  • unbiased reinforcement learning, can get arbitrarily good
  • challenging exploration and optimization problem
• Initialize & finetune
  • almost the best of both worlds
  • ...but can forget the demo initialization due to distributional shift
• Pure reinforcement learning, with demos as off-policy data
  • unbiased reinforcement learning, can get arbitrarily good
  • demonstrations don't always help
• Hybrid objective, imitation as an "auxiliary loss"
  • like initialization & finetuning, almost the best of both worlds
  • no forgetting
  • but no longer pure RL: may be biased, may require lots of tuning
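The hybrid objective and its sensitivity to the weight can be sketched in a toy bandit setting (illustrative; the update and `lam` are not from any particular paper): each step ascends the expected reward plus lam times the demo log-likelihood. A small lam lets RL find the truly optimal action; a large lam locks in the demo action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hybrid_update(theta, reward, demo_action, lam, lr=0.1):
    """One ascent step on E_pi[r] + lam * log pi(demo_action).
    lam trades off RL against imitation and must be tuned per problem."""
    p = softmax(theta)
    rl_grad = p * (reward - p @ reward)   # exact expected policy gradient
    bc_grad = -p.copy()
    bc_grad[demo_action] += 1.0           # gradient of log pi(demo_action)
    return theta + lr * (rl_grad + lam * bc_grad)

reward = np.array([0.0, 1.0, 0.5])        # action 1 is truly best
demo_action = 2                           # the demonstrator is suboptimal

theta = np.zeros(3)
for _ in range(3000):
    theta = hybrid_update(theta, reward, demo_action, lam=0.05)
p_small_lam = softmax(theta)              # small weight: RL wins

theta = np.zeros(3)
for _ in range(3000):
    theta = hybrid_update(theta, reward, demo_action, lam=5.0)
p_big_lam = softmax(theta)                # big weight: imitation dominates
```

The two runs differ only in lam, illustrating the tuning burden and the potential bias listed above.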