Deep Reinforcement Learning: Q-Learning
Slide 1: Deep Reinforcement Learning: Q-Learning
Garima Lalwani, Karan Ganju, Unnat Jain
Slide 2: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 4: Q-Learning
David Silver's Introduction to RL lectures
Peter Abbeel's Artificial Intelligence - Berkeley (Spring 2015)
Slide 7: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 8: Function Approximation - Why?
Value functions:
- Every state s has an entry V(s)
- Every state-action pair (s, a) has an entry Q(s, a)
How to get Q(s,a)? → Table lookup. What about large MDPs?
- Estimate the value function with function approximation
- Generalise from seen states to unseen states
Slide 9: Function Approximation - How?
Why Q? How to approximate?
- Features for state s: (s, a) → x(s, a)
- Linear model: Q(s,a) = w^T x(s,a)
- Deep Neural Nets (CS598): Q(s,a) = NN(s,a)
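A minimal sketch of the linear case above, with a hypothetical one-hot featurizer standing in for x(s,a) (the feature map, sizes, and weight bump are illustrative assumptions):

```python
import numpy as np

def featurize(s, a, n_states=5, n_actions=2):
    # Hypothetical one-hot feature vector x(s, a); a stand-in
    # for whatever hand-crafted features the problem provides.
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_linear(w, s, a):
    # Linear approximation: Q(s,a) = w^T x(s,a)
    return w @ featurize(s, a)

w = np.zeros(10)
w += 0.5 * featurize(3, 1)   # pretend a TD update bumped this weight
print(q_linear(w, 3, 1))     # 0.5
print(q_linear(w, 0, 0))     # 0.0
```

With one-hot features this degenerates to a table lookup; with richer features the same w generalises across states.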
Slide 10: Function Approximation - Demo
Slide 11: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 12: Deep Q Network
1) Input: 4 images = current frame + 3 previous
2) Output: Q(s,a_i) for each action: Q(s,a_1), Q(s,a_2), Q(s,a_3), ..., Q(s,a_18)
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slides 14-17: Deep Q Network (build)
1) Input: 𝝓(s) = stack of 4 images (current frame + 3 previous)
2) Output: Q(s,a_i) for each of the 18 actions: Q(s,a_1), Q(s,a_2), Q(s,a_3), ..., Q(s,a_18)
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
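The input/output interface above can be sketched as follows; a single linear layer stands in for the paper's conv net, and all weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def dqn_forward(phi, W):
    # Stand-in for the conv net: flatten the 4-frame stack and apply one
    # linear layer to produce one Q-value per action (18 for Atari).
    return phi.reshape(-1) @ W

phi = rng.random((4, 84, 84))                # 𝝓(s): 4 stacked 84x84 frames
W = rng.normal(0, 0.01, (4 * 84 * 84, 18))   # placeholder weights
q = dqn_forward(phi, W)                      # one Q-value per action
greedy_action = int(np.argmax(q))            # greedy policy w.r.t. Q
print(q.shape, greedy_action)
```

The key design point survives the simplification: one forward pass yields Q-values for all actions at once, so acting greedily is a single argmax.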
Slides 18-20: Supervised SGD (lec 2) vs Q-Learning SGD
- SGD update assuming supervision
- SGD update for Q-Learning
David Silver's Deep Learning Tutorial, ICML 2016
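The Q-Learning SGD update can be sketched for the linear-Q case: the bootstrapped target r + γ max_a' Q(s',a') replaces the missing supervision, and for a linear Q the gradient w.r.t. w is just x(s,a). Step size and features are illustrative:

```python
import numpy as np

def q_sgd_update(w, x_sa, r, next_xs, alpha=0.1, gamma=0.99):
    # One Q-Learning SGD step for a linear Q: Q(s,a) = w @ x(s,a).
    # The target bootstraps off the max over next-state-action features;
    # an empty next_xs marks a terminal transition (target = r).
    q_next = max(w @ x for x in next_xs) if next_xs else 0.0
    target = r + gamma * q_next
    td_error = target - w @ x_sa
    return w + alpha * td_error * x_sa    # grad of linear Q is x(s,a)

w = np.zeros(3)
x_sa = np.array([1.0, 0.0, 0.0])
w = q_sgd_update(w, x_sa, r=1.0, next_xs=[], alpha=0.5)  # terminal transition
print(w)  # [0.5 0.  0. ]
```

Unlike the supervised case, the "label" here moves as w moves, which is exactly why the training tricks on the next slides are needed.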
Slides 21-22: Training tricks
Issues:
- Data is sequential
  - Successive samples are correlated, non-iid
  - An experience is visited only once in online learning
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slide 23: Training tricks
Issues:
- Data is sequential
  - Successive samples are correlated, non-iid
  - An experience is visited only once in online learning
Solution: 'Experience Replay' - work on a dataset, sample randomly and repeatedly
- Build the dataset:
  - Take action a_t according to an ε-greedy policy
  - Store the transition/experience (s_t, a_t, r_{t+1}, s_{t+1}) in dataset D (the 'replay memory')
- Sample a random minibatch (32 experiences) of (s, a, r, s') from D
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slide 25: Training tricks
Issues:
- Data is sequential → Experience replay
  - Successive samples are correlated, non-iid
  - An experience is visited only once in online learning
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slides 26-27: Training tricks
Issues:
- Data is sequential → Experience replay
- Policy changes rapidly with slight changes to Q-values; policy may oscillate
Solution: 'Target Network' - stale updates
- C-step delay between updates of Q and its use as targets
- Network 2 (w_i): Q-values are updated every SGD step
- Network 1 (w_{i-1}): tells me the Q(s,a) targets
- After 10,000 SGD updates, Network 1 is synced to Network 2
WikiCommons [Img link]
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
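A minimal sketch of the target-network trick, with a toy sync interval and gradient in place of the real 10,000-step delay and backprop:

```python
import numpy as np

class TargetNetworkTrainer:
    # Sketch of 'stale updates': w is the online network updated every SGD
    # step; w_target is frozen for sync_every steps, then copied from w.
    def __init__(self, n_weights, sync_every=10_000):
        self.w = np.zeros(n_weights)
        self.w_target = self.w.copy()
        self.sync_every = sync_every
        self.step_count = 0

    def sgd_step(self, grad, lr=1e-3):
        self.w -= lr * grad                    # online net: every step
        self.step_count += 1
        if self.step_count % self.sync_every == 0:
            self.w_target = self.w.copy()      # targets: every C steps

trainer = TargetNetworkTrainer(4, sync_every=3)
for _ in range(3):
    trainer.sgd_step(np.ones(4), lr=1.0)
print(trainer.w, trainer.w_target)  # equal again right after the sync
```

Targets computed from w_target stay fixed for C steps, which damps the oscillation described above.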
Slides 28-29: Training tricks
Issues:
- Data is sequential → Experience replay
  - Successive samples are correlated, non-iid
  - An experience is visited only once in online learning
- Policy changes rapidly with slight changes to Q-values → Target Network
  - Policy may oscillate
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slide 30: DQN - Results
Why not just use VGGNet features?
Slide 31: DQN - Results
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slide 32: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slides 33-35: Q-Learning for Roulette
Hasselt, Hado V. "Double Q-learning." In Advances in Neural Information Processing Systems, pp. 2613-2621. 2010.
https://en.wikipedia.org/wiki/File:130227spielbankwiesbadenbyRalfR094.jpg
Slides 36-37: Q-Learning Overestimation: Function Approximation
Q Estimate vs. Actual Q Value
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Slide 38: One (Estimator) Isn't Good Enough?
Slide 39: One (Estimator) Isn't Good Enough? Use Two.
https://pbs.twimg.com/media/C5ymV2tVMAYtAev.jpg
Slides 40-43: Double Q-Learning
Two estimators:
- Estimator Q_1: obtain the best action
- Estimator Q_2: evaluate Q for the chosen action
The chance of both estimators overestimating at the same action is lower.
Q Target vs. Double Q Target: the Double Q target selects the action with Q_1 but evaluates it with Q_2.
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
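The two-estimator target can be sketched directly; the Q-values below are made up purely to show the effect:

```python
import numpy as np

def double_q_target(q1_next, q2_next, r, gamma=0.99):
    # Double Q-Learning target: Q1 picks the argmax action,
    # Q2 (an independent estimator) supplies the value for it.
    a_star = int(np.argmax(q1_next))
    return r + gamma * q2_next[a_star]

q1 = np.array([1.0, 5.0, 2.0])   # Q1's (possibly overestimated) next-state values
q2 = np.array([1.2, 0.8, 2.1])   # independent estimator used for evaluation
print(double_q_target(q1, q2, r=0.0, gamma=1.0))  # 0.8, not 5.0
```

A standard Q target would bootstrap off max(q1) = 5.0; decoupling selection from evaluation means Q1's spurious spike at action 1 is not propagated.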
Slide 44: Results - All Atari Games
Slide 45: Results - Solves Overestimations
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Slide 46: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 47: Pong - Up or Down?
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Slide 48: Enduro - Left or Right?
http://img.vivaolinux.com.br/imagens/dicas/comunidade/Enduro.png
Slide 49: Enduro - Left or Right?
http://twolivesleft.com/Codea/User/enduro.png
Slide 50: Advantage Function
Learning action values ≈ inherently learning both state values and the relative value of each action in that state!
We can use this to help generalize learning for the state values.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
Slides 51-53: Dueling Architecture
Aggregating module combines the value and advantage streams.
http://torch.ch/blog/2016/04/30/dueling_dqn.html
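The mean-subtracted aggregation rule from the dueling paper can be sketched as:

```python
import numpy as np

def dueling_aggregate(v, a):
    # Dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)).
    # Subtracting the mean makes the V/A decomposition identifiable:
    # adding a constant to A and subtracting it from V no longer changes Q.
    return v + (a - a.mean())

v = 2.0                            # value stream output V(s)
adv = np.array([1.0, -1.0, 0.0])   # advantage stream output A(s, ·)
print(dueling_aggregate(v, adv))   # [3. 1. 2.]
```

Because V(s) is shared across actions, a single update improves the estimate for every action in that state, which is the generalization benefit described on slide 50.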
Slide 54: Results
Where does V(s) attend to? Where does A(s,a) attend to?
Slide 55: Results
Improvements of the dueling architecture over the Prioritized DDQN baseline, measured by the metric above over 57 Atari games.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
Slide 56: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slides 57-58: Moving to more General and Complex Games
Not all games are representable as MDPs; some may be POMDPs:
- FPS shooter games
- Scrabble
- Even Atari games
Is keeping the entire history a solution? LSTMs!
Slides 59-61: Deep Recurrent Q-Learning
Replace the DQN's FC layer with an LSTM; hidden states h_1, h_2, h_3 carry information across frames.
Hausknecht, Matthew, and Peter Stone. "Deep recurrent Q-learning for partially observable MDPs." arXiv preprint arXiv:1507.06527 (2015).
Slide 62: DRQN Results - Misses
Slide 63: DRQN Results - Paddle Deflection
Slide 64: DRQN Results - Wall Deflections
Slide 65: Results - Robustness to partial observability (POMDP vs. MDP)
Slide 66: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 67: Application of DRQN: Playing 'Doom'
Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning."
Slide 68: Doom Demo
Slide 69: How does DRQN help?
- Observe o_t instead of s_t (limited field of view)
- Instead of estimating Q(s_t, a_t), estimate Q(h_t, a_t) where h_t = LSTM(h_{t-1}, o_t)
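The Q(h_t, a_t) recurrence can be sketched with a toy tanh recurrence standing in for the LSTM; all weights and sizes here are illustrative placeholders:

```python
import numpy as np

def recurrent_q_step(h_prev, o_t, W_h, W_o, W_q):
    # Toy stand-in for h_t = LSTM(h_{t-1}, o_t): a tanh recurrence that
    # summarizes the observation history under partial observability.
    h_t = np.tanh(W_h @ h_prev + W_o @ o_t)
    q_t = W_q @ h_t                 # one Q-value per action, from h_t not s_t
    return h_t, q_t

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 8))
W_o = rng.normal(size=(8, 4))
W_q = rng.normal(size=(3, 8))

h = np.zeros(8)                     # empty history at episode start
for _ in range(5):                  # roll the recurrence over 5 observations
    h, q = recurrent_q_step(h, rng.random(4), W_h, W_o, W_q)
print(q.shape)
```

The point is structural: the Q-values depend on the accumulated hidden state h, so information outside the current field of view can still influence the action.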
Slide 70: Architecture - Comparison with Baseline DRQN
Slides 71-73: Training Tricks
Jointly train the DRQN model and game-feature detection.
What do you think is the advantage of this?
The CNN layers capture relevant information about game features that maximise action-value scores.
Slides 74-75: Modular Architecture
- Action network (DRQN): "Enemy spotted"
- Navigation network (DQN): "All clear!"
Slide 76: Modular Network - Advantages
- The networks can be trained and tested independently
- Both can be trained in parallel
- Reduces the state-action pair space: faster training
- Mitigates camper behavior: the "tendency to stay in one area of the map and wait for enemies"
Slide 77: Rewards Formulation for Doom
What do you think?
Slide 78: Rewards Formulation for Doom
- Positive rewards for kills; negative rewards for suicides
- Small intermediate rewards:
  - Positive reward for object pickup
  - Negative reward for losing health
  - Negative reward for shooting or losing ammo
  - Small positive reward proportional to the distance travelled since the last step (so the agent avoids running in circles)
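A sketch of this shaped reward; the coefficients below are illustrative assumptions, not the paper's exact values:

```python
def doom_step_reward(kills, suicides, pickups, health_delta, ammo_delta, dist):
    # Shaped per-step reward following the bullet list above.
    # All coefficients are made-up magnitudes for illustration.
    r = 100.0 * kills - 100.0 * suicides          # sparse extrinsic signal
    r += 4.0 * pickups                             # object pickup bonus
    if health_delta < 0:
        r += 0.04 * health_delta                   # penalty for losing health
    if ammo_delta < 0:
        r += 0.04 * ammo_delta                     # penalty for spending ammo
    r += 9e-5 * dist                               # reward distance travelled
    return r

r_total = doom_step_reward(kills=1, suicides=0, pickups=0,
                           health_delta=-10, ammo_delta=-1, dist=100.0)
print(r_total)
```

The small intermediate terms densify an otherwise sparse kill/death signal, which is what makes the long episodes trainable.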
Slide 79: Performance with a Separate Navigation Network
Slide 80: Results
Slide 81: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 82: h-DQN
Slide 83: Double DQN
Slide 84: Dueling Networks
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Slides 85-86: How is this game different?
- Complex game environment
- Sparse and longer-range delayed rewards
- Insufficient exploration: we need temporally extended exploration
Solution: divide the extrinsic goal into hierarchical intrinsic subgoals.
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Slide 87: Intrinsic Goals in Montezuma's Revenge
Slides 88-89: Hierarchy of DQNs (Agent ↔ Environment)
Slide 90: Architecture Block for h-DQN
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Slide 91: h-DQN Learning Framework (1)
- V(s, g): value function of a state for achieving the given goal g ∈ G
- Option: a multi-step action policy to achieve an intrinsic goal g ∈ G (can also be a primitive action)
- π_g: policy over options to achieve goal g
- The agent learns which intrinsic goals are important and the correct sequence of such policies π_g
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Slide 92: h-DQN Learning Framework (2)
- Objective function for the meta-controller: maximise the cumulative extrinsic reward F_t
- Objective function for the controller: maximise the cumulative intrinsic reward R_t
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Slide 93: Training
- Two disjoint memories D_1 and D_2 for experience replay
  - Experiences (s_t, g_t, f_t, s_{t+N}) for Q_2 are stored in D_2
  - Experiences (s_t, a_t, g_t, r_t, s_{t+1}) for Q_1 are stored in D_1
- Different time scales
  - Transitions for the controller (Q_1) are picked at every time step
  - Transitions for the meta-controller (Q_2) are picked only when the controller terminates, i.e., on reaching the intrinsic goal or when the episode ends
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
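The two disjoint memories can be sketched as follows; the stored tuples match the slide, while the capacities and the toy episode are illustrative:

```python
from collections import deque

class HierarchicalReplay:
    # Sketch of h-DQN's two disjoint replay memories: D1 holds the
    # controller's per-step transitions, D2 the meta-controller's
    # goal-level transitions, stored only when a goal terminates.
    def __init__(self, capacity=50_000):
        self.d1 = deque(maxlen=capacity)   # (s_t, a_t, g_t, r_t, s_{t+1})
        self.d2 = deque(maxlen=capacity)   # (s_t, g_t, f_t, s_{t+N})

    def store_controller(self, s, a, g, r, s_next):
        self.d1.append((s, a, g, r, s_next))

    def store_meta(self, s, g, f, s_after):
        self.d2.append((s, g, f, s_after))

replay = HierarchicalReplay()
for t in range(10):                        # every time step feeds D1
    replay.store_controller(t, 0, "key", 0.1, t + 1)
replay.store_meta(0, "key", 1.0, 10)       # D2 only at goal termination
print(len(replay.d1), len(replay.d2))      # 10 1
```

The asymmetry in fill rates is the point: the meta-controller trains on far fewer, temporally extended transitions than the controller.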
Slide 94: Results
Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi and Joshua B. Tenenbaum. “
Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation.
” NIPS 2016
Slide 95: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
Slide 96: References
Basic RL
- David Silver's Introduction to RL lectures
- Peter Abbeel's Artificial Intelligence - Berkeley (Spring 2015)
DQN
- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
DDQN
- Hasselt, Hado V. "Double Q-learning." In Advances in Neural Information Processing Systems, pp. 2613-2621. 2010.
- Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Dueling DQN
- Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
Slide 97: References
DRQN
- Hausknecht, Matthew, and Peter Stone. "Deep recurrent Q-learning for partially observable MDPs." arXiv preprint arXiv:1507.06527 (2015).
Doom
- Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS games with deep reinforcement learning."
h-DQN
- Kulkarni, Tejas D., Karthik Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." NIPS 2016.
Additional NLP/Vision applications
- Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
- Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- Zhu, Yuke, et al. "Target-driven visual navigation in indoor scenes using deep reinforcement learning." arXiv preprint arXiv:1609.05143 (2016).
Slide 98: Deep Q-Learning for text-based games
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
Slide 99: Text-Based Games: Back in the 1970s
- Predecessors to modern graphical games
- MUDs (Multi-User Dungeon games) are still prevalent
Slide 100: State Spaces and Action Spaces
- Hidden state space h ∈ H, but given a textual description: {ψ : H → S}
- Actions are commands (action-object pairs): A = {(a, o)}
- T_{hh'}(a, o): transition probabilities
- Jointly learn state representations and control policies, so the learned strategy/policy builds directly on the text interpretation
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
Slide 101: Learning Representations and Control Policies
Slide 102: Results (1) - Learnt useful representations for the game
Slide 103: Results (2)
Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. "Language understanding for text-based games using deep reinforcement learning." EMNLP 2015.
Slide 104: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
More applications:
Text based games
Object Detection
Indoor Navigation
Slides 105-110: Object detection as an RL problem?
States: ?
Actions: ?
[image link]
Slides 111-114: Object detection as an RL problem?
States: fc6 feature of a pretrained VGG-19
Actions:
- relative translation by c*(x2-x1), c*(y2-y1)
- scale
- aspect ratio
- trigger when IoU is high
Reward:
J. Caicedo and S. Lazebnik, ICCV 2015 [image link]
Slides 115-117: Object detection as an RL problem?
State (s): current bounding box (plus a history of actions)
Outputs: Q(s, a_1 = scale up), Q(s, a_2 = scale down), Q(s, a_3 = shift left), ..., Q(s, a_9 = trigger)
J. Caicedo and S. Lazebnik, ICCV 2015
Slide 118: Object detection as an RL problem?
Fine details:
- Class-specific, attention-action model
- Does not follow a fixed sliding-window trajectory; the trajectory is image-dependent
- Uses a 16-pixel neighbourhood to incorporate context
J. Caicedo and S. Lazebnik, ICCV 2015
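The box-transformation actions can be sketched as follows; the action names, the set shown, and the fraction c are illustrative assumptions rather than the paper's exact parameterisation:

```python
def apply_action(box, action, c=0.2):
    # Hypothetical sketch of the box-transformation actions: each action
    # moves or rescales the current window by a fraction c of its size.
    x1, y1, x2, y2 = box
    dw, dh = c * (x2 - x1), c * (y2 - y1)
    if action == "shift_left":
        return (x1 - dw, y1, x2 - dw, y2)
    if action == "shift_right":
        return (x1 + dw, y1, x2 + dw, y2)
    if action == "scale_up":
        return (x1 - dw / 2, y1 - dh / 2, x2 + dw / 2, y2 + dh / 2)
    if action == "scale_down":
        return (x1 + dw / 2, y1 + dh / 2, x2 - dw / 2, y2 - dh / 2)
    return box                      # 'trigger' leaves the box unchanged

print(apply_action((0, 0, 10, 10), "shift_left"))  # (-2.0, 0, 8.0, 10)
```

An episode is then a sequence of such transformations ending with a trigger once the window overlaps the object well; because the moves depend on the image, the trajectory is not a fixed sliding-window scan.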
Slide 119: Object detection as an RL problem?
J. Caicedo and S. Lazebnik,
ICCV 2015
Slide 120: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
More applications:
Text based games
Object detection
Indoor Navigation
Slides 121-124: Navigation as an RL problem?
States: ResNet-50 features
Actions:
- Forward/backward 0.5 m
- Turn left/right 90 deg
- Trigger
"Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning." Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi.
Slides 125-126: Navigation as an RL problem?
State (s): current frame and the target frame
Outputs: Q(s, a_1 = forward), Q(s, a_2 = backward), Q(s, a_3 = turn left), Q(s, a_4 = turn right), ..., Q(s, a_6 = trigger)
"Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning." Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi.
Slide 127: Navigation as an RL problem?
Real environment vs. simulated environment
Slide 128: Today's takeaways
Bonus RL recap
Functional Approximation
Deep Q Network
Double Deep Q Network
Dueling Networks
Recurrent DQN
Solving “Doom”
Hierarchical DQN
More applications:
Text based games
Object detection
Indoor Navigation
Slide 129: Q-Learning Overestimation: Function Approximation
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Slides 130-132: Q-Learning Overestimation: Intuition
What we want: max_a E[Q(s,a)]
What we estimate in Q-Learning: E[max_a Q(s,a)] ≥ max_a E[Q(s,a)] [Jensen's Inequality]
https://hadovanhasselt.files.wordpress.com/2015/12/doubleqposter.pdf
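The gap between max_a E[Q] and E[max_a Q] is easy to demonstrate numerically: with all true action values equal to 0, the max over noisy estimates is biased upward, while a double estimator is not. The noise scale and action count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

# All true action values are 0; each estimate carries zero-mean noise,
# so max_a E[Q(s,a)] = 0 while E[max_a Qhat(s,a)] > 0.
noisy_q = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
single_max = noisy_q.max(axis=1).mean()        # Q-Learning-style estimate

# Double estimator: one noisy set picks the argmax,
# an independent set evaluates the chosen action.
noisy_q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
a_star = noisy_q.argmax(axis=1)
double_est = noisy_q2[np.arange(n_trials), a_star].mean()

print(f"single: {single_max:.2f}  double: {double_est:.2f}")
```

The single estimator settles near the expected maximum of 10 standard normals (about 1.5), while the double estimate hovers near the true value of 0, which is exactly the correction Double Q-Learning exploits.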
Slide 133: Double Q-Learning: Function Approximation
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." In AAAI, pp. 2094-2100. 2016.
Slide 134: Results
Mean and median scores across all 57 Atari games, measured as percentages of human performance.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
Slides 135-138: Results - Comparison to 10-frame DQN
DRQN captures in one frame (plus its history state) what DQN captures in a stack of 10 for Flickering Pong:
- 10-frame DQN conv1 captures paddle information
- conv2 captures paddle and ball-direction information
- conv3 captures paddle, ball direction, velocity, and deflection information
Scores are comparable to 10-frame DQN: DRQN outperforms on some games and loses on some.
Hausknecht, Matthew, and Peter Stone. "Deep recurrent Q-learning for partially observable MDPs." arXiv preprint arXiv:1507.06527 (2015).