Summary of part I: prediction and RL
Prediction is important for action selection
The problem: prediction of future reward
The algorithm: temporal difference learning
Neural implementation: dopamine dependent learning in BG
A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
Precise (normative!) theory for generation of dopamine firing patterns
Explains anticipatory dopaminergic responding, second order conditioning
Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas
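As a minimal sketch of the temporal-difference prediction rule summarized above (the episode structure, learning rate and toy chain are illustrative assumptions, not the slides' exact task): the prediction error δ_t = r_t + V(s_{t+1}) − V(s_t) drives learning of the value function, and it is this quantity that the dopamine recordings below are compared against.

```python
import numpy as np

def td0_prediction(episodes, n_states, alpha=0.1, gamma=1.0):
    """TD(0) prediction: learn state values V from sampled (state, reward) episodes.

    Each episode is a list of (state, reward) pairs; the value after the
    terminal step is 0.  delta = r + gamma * V(s') - V(s) is the TD error.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        for t, (s, r) in enumerate(episode):
            v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            delta = r + gamma * v_next - V[s]   # TD (prediction) error, dopamine-like signal
            V[s] += alpha * delta
    return V

# toy example: a 3-state chain 0 -> 1 -> 2 with reward only at the end
episodes = [[(0, 0.0), (1, 0.0), (2, 1.0)] for _ in range(200)]
print(td0_prediction(episodes, n_states=3))
```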
prediction error hypothesis of dopamine
[figure: measured firing rate vs. model prediction error]
Bayer & Glimcher (2005)
at end of trial: δ_t = r_t − V_t (just like R-W)
Global plan
Reinforcement learning I: prediction; classical conditioning; dopamine
Reinforcement learning II: dynamic programming; action selection
Pavlovian misbehaviour
vigour
Chapter 9 of Theoretical Neuroscience
Action Selection
Evolutionary specification
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching
Delayed reinforcement: these tasks; mazes; chess
Bandler; Blanchard
Immediate Reinforcement
stochastic policy: based on action values
Indirect Actor
use RW rule:
switch every 100 trials
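The indirect-actor slide shows only a figure; the following is a hedged sketch of the scheme it describes: action values learned with a Rescorla-Wagner (delta) rule, and a softmax ("stochastic policy based on action values") over them. The reward probabilities, the inverse temperature beta, and the exact switching contingency are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def indirect_actor(n_trials=400, alpha=0.1, beta=3.0):
    """Two-armed bandit: RW-updated action values Q, softmax choice.

    Reward probabilities of the two levers swap every 100 trials,
    as in the slide's "switch every 100 trials".
    """
    Q = np.zeros(2)
    p_reward = np.array([0.8, 0.2])
    choices = []
    for t in range(n_trials):
        if t > 0 and t % 100 == 0:
            p_reward = p_reward[::-1]                          # swap contingencies
        p_choice = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax policy
        a = rng.choice(2, p=p_choice)
        r = float(rng.random() < p_reward[a])
        Q[a] += alpha * (r - Q[a])                             # Rescorla-Wagner / delta rule
        choices.append(a)
    return np.array(choices)

print(indirect_actor()[:20])
```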
Direct Actor
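The direct-actor slides likewise show their update equations only as images; here is a hedged sketch of one standard form of such an update (REINFORCE-style, with the running average reward as a baseline), which adjusts action propensities directly rather than via learned action values. The parameter values and the choice of baseline are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def direct_actor(n_trials=400, epsilon=0.1, alpha_r=0.05):
    """Direct actor: learn action propensities m directly from reward.

    p(a) = softmax(m); after each trial the propensities move in proportion
    to (r - r_bar), where r_bar tracks the average reward.
    """
    m = np.zeros(2)                      # action propensities
    r_bar = 0.0                          # running reward baseline
    p_reward = np.array([0.8, 0.2])
    for t in range(n_trials):
        p = np.exp(m) / np.exp(m).sum()
        a = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[a])
        grad = -p                        # gradient of log p(a) for a softmax policy
        grad[a] += 1.0
        m += epsilon * (r - r_bar) * grad
        r_bar += alpha_r * (r - r_bar)   # update the baseline
    return m

print(direct_actor())
```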
Could we Tell?
correlate past rewards and actions with present choice
indirect actor (separate clocks):
direct actor (single clock):
Matching: Concurrent VI-VI
Lau & Glimcher; Corrado, Sugrue, Newsome
Matching
income, not return
approximately exponential in r
alternation choice kernel
Action at a (Temporal) Distance
learning an appropriate action at x=1 depends on the actions at x=2 and x=3
it gains no immediate feedback
idea: use prediction as surrogate feedback
[figure: three-state chain, x=1, x=2, x=3]
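To make "prediction as surrogate feedback" concrete, here is a hedged sketch of an actor-critic learner on a small chain like the x=1,2,3 maze in the figure. The maze layout, reward values, and parameters are illustrative assumptions; the point is only that the critic's TD error trains both the values and the policy, including at x=1 where no immediate reward arrives.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(m):
    e = np.exp(m - m.max())
    return e / e.sum()

def actor_critic_chain(n_episodes=2000, alpha_v=0.1, alpha_m=0.1):
    """Actor-critic on a tiny maze (illustrative layout, not the slide's exact one).

    States: 0 (x=1, start), 1 (x=2), 2 (x=3).  Choosing L/R at state 0 leads
    to state 1/2; L/R at states 1 and 2 lead to (made-up) terminal rewards.
    The critic's TD error is the surrogate feedback for the start-state action.
    """
    rewards = {1: (0.0, 5.0), 2: (2.0, 0.0)}   # terminal rewards for (L, R)
    V = np.zeros(3)                             # critic: state values
    M = np.zeros((3, 2))                        # actor: action propensities
    for _ in range(n_episodes):
        s = 0
        a = rng.choice(2, p=softmax(M[s]))
        s_next = 1 + a                          # L -> x=2, R -> x=3
        delta = 0.0 + V[s_next] - V[s]          # no immediate reward at x=1
        V[s] += alpha_v * delta
        M[s, a] += alpha_m * delta              # the same TD error trains the actor
        s = s_next
        a = rng.choice(2, p=softmax(M[s]))
        delta = rewards[s][a] - V[s]            # terminal TD error
        V[s] += alpha_v * delta
        M[s, a] += alpha_m * delta
    return V, M

V, M = actor_critic_chain()
print(np.round(V, 2))
```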
Action Selection
start with policy:
evaluate it:
improve it: thus choose R more frequently than L, C
[figure: three-state chain x=1, x=2, x=3, with values 0.025, -0.175, -0.125, 0.125]
Policy
value is too pessimistic
action is better than average
[figure: three-state chain x=1, x=2, x=3]
actor/critic
[figure: actor units m1, m2, m3, …, mn]
dopamine signals to both the motivational & motor striatum appear, surprisingly, to be the same
suggestion: training both values & policies
Formally: Dynamic Programming
policy iteration:
evaluate: V^π(x) = E[ r + V^π(x') | x, π ]
improve: π'(x) = argmax_a E[ r + V^π(x') | x, a ]
value iteration:
V(x) ← max_a E[ r + V(x') | x, a ]
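As a concrete counterpart to the dynamic-programming slide, here is a hedged sketch of value iteration on a generic finite MDP; the random transition and reward arrays are placeholders for whatever task is at hand.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].

    P has shape (S, A, S) with P[s, a, s'] the transition probability;
    R has shape (S, A) with the expected immediate reward.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                    # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # optimal values and greedy policy
        V = V_new

# placeholder MDP: 4 states, 2 actions, random dynamics
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(4), size=(4, 2))
R = rng.normal(size=(4, 2))
V, policy = value_iteration(P, R)
print(np.round(V, 2), policy)
```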
Variants: SARSA
Morris et al., 2006
Variants: Q learning
Roesch et al., 2007
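The two "variants" slides contrast on-policy SARSA with off-policy Q-learning; the difference lies entirely in the bootstrap target, as this hedged sketch shows (the state, action and reward variables are generic placeholders):

```python
# one-step updates; s, a, r, s2, a2 describe the sampled transition
# (a2, the action actually taken next, is needed only by SARSA)
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action the policy actually takes next."""
    delta = r + gamma * Q[s2][a2] - Q[s][a]
    Q[s][a] += alpha * delta
    return delta

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best available next action."""
    delta = r + gamma * max(Q[s2]) - Q[s][a]
    Q[s][a] += alpha * delta
    return delta
```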
Summary
prediction learning: Bellman evaluation
actor-critic: asynchronous policy iteration
indirect method (Q learning): asynchronous value iteration
Impulsivity & Hyperbolic Discounting
humans (and animals) show impulsivity in: diets, addictions, spending, …
intertemporal conflict between short- and long-term choices
often explained via hyperbolic discount functions
an alternative is a Pavlovian imperative towards an immediate reinforcer
framing, trolley dilemmas, etc.
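A brief numerical illustration of why hyperbolic discounting produces the intertemporal conflict mentioned above (the reward magnitudes, delays, and the discount parameter k are made up for illustration): with hyperbolic weights 1/(1 + k·t), a smaller-sooner reward can beat a larger-later one when both are near, yet lose when both are pushed into the future, a preference reversal that a single exponential discount cannot produce.

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolic discounting: value / (1 + k * delay)."""
    return value / (1.0 + k * delay)

# smaller-sooner (SS) vs larger-later (LL), illustrative numbers
for shift in (0, 10):                 # evaluate now vs. with everything 10 steps away
    ss = hyperbolic(6.0, 1 + shift)   # 6 units after 1 (+shift) steps
    ll = hyperbolic(10.0, 3 + shift)  # 10 units after 3 (+shift) steps
    print(f"shift={shift}: SS={ss:.2f}, LL={ll:.2f}, prefer {'SS' if ss > ll else 'LL'}")
```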
Direct/Indirect Pathways
direct: D1: GO; learn from DA increase
indirect: D2: noGO; learn from DA decrease
hyperdirect (STN): delays actions given strongly attractive choices
Frank
Frank
DARPP-32: D1 effect
DRD2: D2 effect
Three Decision Makers
tree search
position evaluation
situation memory
Multiple Systems in RL
model-based RL: build a forward model of the task and its outcomes; search in the forward model (online DP); optimal use of information; computationally ruinous
cache-based RL: learn Q values, which summarize future worth; computationally trivial; bootstrap-based, so statistically inefficient
learn both; select according to uncertainty
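A hedged sketch of the contrast drawn above, on a generic finite MDP: the model-based system evaluates an action by explicit lookahead in a learned forward model, while the cache-based (model-free) system simply reads out a stored Q value. The arrays, search depth and discount are illustrative assumptions.

```python
import numpy as np

def model_based_value(P, R, s, a, depth=3, gamma=0.9):
    """Model-based: evaluate (s, a) by recursive search in the forward model (P, R).

    P[s, a, s'] are transition probabilities, R[s, a] the expected rewards.
    Cost grows exponentially with depth ("computationally ruinous").
    """
    if depth == 0:
        return R[s, a]
    future = 0.0
    for s2, prob in enumerate(P[s, a]):
        if prob > 0:
            best = max(model_based_value(P, R, s2, a2, depth - 1, gamma)
                       for a2 in range(P.shape[1]))
            future += prob * best
    return R[s, a] + gamma * future

def cached_value(Q, s, a):
    """Model-free: just look up the cached Q value ("computationally trivial")."""
    return Q[s, a]
```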
Animal Canary
OFC; dlPFC; dorsomedial striatum; BLA?
dorsolateral striatum; amygdala
Two Systems:
Behavioural Effects
Effects of Learning
distributional value iteration
(Bayesian Q learning)
fixed additional uncertainty per step
One Outcome
shallow tree implies goal-directed control wins
Human Canary
...if a leads to c and c leads to £££, then do more of a or b?
MB: b
MF: a (or even no effect)
[figure: states a, b, c]
Behaviour
action values depend on both systems:
expect that the weighting will vary by subject (but be fixed)
Neural Prediction Errors (1-2)
note that MB RL does not use this prediction error – training signal?
R ventral striatum (anatomical definition)
Neural Prediction Errors (1)
right nucleus accumbens
behaviour
1-2, not 1
Vigour
Two components to choice:
what: lever pressing; direction to run; meal to choose
when / how fast / how vigorously
free operant tasks
real-valued DP
The model
[figure: semi-Markov model of free-operant behaviour. In each state (S0, S1, S2) the agent chooses an (action, latency) pair, e.g. (LP, τ1), (LP, τ2); each choice incurs costs (a unit cost plus a vigour cost that grows with speed, i.e. "how fast") and yields rewards; candidate actions include LP (lever press), NP (nose poke), Other; further labels in the figure: UR, PR, goal]
The model
[figure: same semi-Markov model as above]
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs, per unit time)
ARL (average reward RL)
Average Reward RL
Compute differential values of actions
Differential value of taking action L with latency τ when in state x:
Q_{L,τ}(x) = Rewards − Costs + Future Returns
ρ = average rewards minus costs, per unit time
steady-state behaviour (not learning dynamics)
(extension of Schwartz 1993)
Average Reward Cost/Benefit Tradeoffs
1. Which action to take?
choose the action with the largest expected reward minus cost
2. How fast to perform it?
slower is less costly (vigour cost)
but slower delays (all) rewards: net rate of rewards = cost of delay (opportunity cost of time)
choose the rate that balances vigour and opportunity costs
explains faster (irrelevant) actions under hunger, etc.
masochism
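Making the vigour/opportunity-cost tradeoff explicit (this follows the logic of the slide, but the 1/τ form of the vigour cost and the parameter values are assumptions for illustration, not the authors' exact expressions): acting with latency τ costs roughly C_v/τ in vigour and τ·ρ in forgone average reward, and minimizing their sum gives τ* = sqrt(C_v/ρ), so richer contexts (larger ρ) should produce faster responding.

```python
import numpy as np

def total_cost(tau, c_vigour, rho):
    """Cost of acting with latency tau: vigour cost + opportunity cost of time."""
    return c_vigour / tau + rho * tau

# illustrative numbers for the vigour-cost constant and the average reward rate
c_vigour, rho = 2.0, 0.5
taus = np.linspace(0.1, 10, 1000)
tau_numeric = taus[np.argmin(total_cost(taus, c_vigour, rho))]
tau_closed_form = np.sqrt(c_vigour / rho)      # from d/dtau: -c/tau^2 + rho = 0
print(tau_numeric, tau_closed_form)            # both approximately 2.0
```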
Optimal response rates
[figures: experimental data (Niv, Dayan, Joel, unpublished) and model simulation; both show the 1st nose poke as a function of seconds since reinforcement]
Optimal response rates
[figures: model simulation (% responses on lever A vs. % reinforcements on lever A, with the perfect-matching line) and experimental data (Herrnstein 1961: pigeons A and B, % responses on key A vs. % reinforcements on key A, perfect matching)]
More: # responses; interval length; amount of reward; ratio vs. interval; breaking point; temporal structure; etc.
Effects of motivation (in the model)
[figure: RR25 schedule; mean latency of LP and Other responses under low vs. high utility; energizing effect]
Effects of motivation (in the model)
[figures: (1) energizing effect: mean latency of LP and Other under low vs. high utility (RR25); (2) directing effect: response rate per minute vs. seconds from reinforcement (UR 50%)]
Relation to Dopamine
What about tonic dopamine?
Phasic dopamine firing = reward prediction error
Tonic dopamine = average reward rate
NB: the phasic signal is still the RPE for choice/value learning
[figures: Aberman and Salamone 1999 (experimental data) and model simulation; # LPs in 30 minutes on ratio schedules 1, 4, 16, 64, control vs. DA depleted]
explains pharmacological manipulations
dopamine control of vigour through BG pathways
eating-time confound
context/state dependence (motivation & drugs?)
less switching = perseveration
Tonic dopamine hypothesis
[figures: Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992; firing rate and reaction time]
…also explains effects of phasic dopamine on response times
Sensory Decisions as Optimal Stopping
consider listening to:
decision: choose, or sample
Optimal Stopping
equivalent of state u=1 is …, and of states u=2, 3 is …
Transition Probabilities
Computational Neuromodulation
dopamine: phasic = prediction error for reward; tonic = average reward (vigour)
serotonin: phasic = prediction error for punishment?
acetylcholine: expected uncertainty?
norepinephrine: unexpected uncertainty; neural interrupt?
Conditioning
Ethology: optimality; appropriateness
Psychology: classical/operant conditioning
Computation: dynamic programming; Kalman filtering
Algorithm: TD/delta rules; simple weights
Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum
prediction: of important events
control: in the light of those predictions
Markov Decision Process
class of stylized tasks with states, actions & rewards
at each timestep t the world takes on state s_t and delivers reward r_t, and the agent chooses an action a_t
Markov Decision Process
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.
Markov Decision Process
Stochastic process defined by:
reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov Decision Process
Stochastic process defined by:
reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)
Markov property: the future is conditionally independent of the past, given s_t
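A minimal sketch of the MDP interface described on these slides, mirroring the "World/Robot" dialogue (the state numbering, rewards and dynamics are placeholders; the point is only the structure r_t ~ P(r_t | s_t) and s_{t+1} ~ P(s_{t+1} | s_t, a_t)):

```python
import numpy as np

rng = np.random.default_rng(4)

class MDP:
    """Finite MDP: transition tensor P[s, a, s'], state-dependent reward means R[s]."""
    def __init__(self, P, R):
        self.P, self.R = P, R

    def step(self, s, a):
        r = self.R[s]                                      # reward depends on the current state
        s_next = rng.choice(len(self.R), p=self.P[s, a])   # Markov transition
        return r, s_next

# placeholder world: 3 states, 2 actions
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = np.array([3.0, -7.0, 1.0])
world, s = MDP(P, R), 0
for _ in range(3):
    a = rng.integers(2)                                    # "Robot: I'll take action a"
    r, s = world.step(s, a)
    print(f"action {a} -> reward {r}, new state {s}")
```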
The optimal policy
Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies
Theorem: for every MDP there exists (at least) one deterministic optimal policy
by the way, why is the optimal policy just a mapping from states to actions? couldn't you earn more reward by choosing a different action depending on the last 2 states?
Pavlovian & Instrumental Conditioning
Pavlovian: learning values and predictions, using the TD error
Instrumental: learning actions, by reinforcement (leg flexion) or by a (TD) critic
(actually different forms: goal-directed & habitual)
Pavlovian-Instrumental Interactions
synergistic: conditioned reinforcement; Pavlovian-instrumental transfer (Pavlovian cue predicts the instrumental outcome); behavioural inhibition to avoid aversive outcomes
neutral: Pavlovian-instrumental transfer (Pavlovian cue predicts an outcome with the same motivational valence)
opponent: Pavlovian-instrumental transfer (Pavlovian cue predicts the opposite motivational valence); negative automaintenance
-ve Automaintenance in Autoshaping
simple choice task
N: nogo gives reward r=1
G: go gives reward r=0
learn three quantities: the average value; the Q value for N; the Q value for G
the instrumental propensity is:
-ve Automaintenance in Autoshaping
Pavlovian action
assert: the Pavlovian impetus towards G is v(t)
weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov
new propensities; new action choice
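One hedged reading of the competition described here, as a runnable sketch (the update rules, the exact form of the Pavlovian bias, and the way ω is applied are assumptions for illustration, not the original model's equations): the instrumental system learns propensities for go (G) and nogo (N), the Pavlovian value v of the stimulus adds an impetus towards G, and the two are mixed with weight ω before a softmax choice with gain μ.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neg_automaintenance(n_trials=5000, omega=0.3, mu=5.0, alpha=0.05):
    """Sketch of Pavlovian/instrumental competition in -ve automaintenance.

    N (nogo) is rewarded (r=1), G (go) is not (r=0), yet a Pavlovian impetus
    proportional to the stimulus value v keeps pushing responding towards G.
    """
    m = np.zeros(2)                                   # instrumental propensities: [N, G]
    v = 0.0                                           # Pavlovian value of the stimulus
    p_go = []
    for _ in range(n_trials):
        combined = (1 - omega) * m + omega * np.array([0.0, v])   # mix with Pavlovian impetus
        p = softmax(mu * combined)
        a = rng.choice(2, p=p)
        r = 1.0 if a == 0 else 0.0                    # N: r=1, G: r=0
        grad = -p
        grad[a] += 1.0
        m += alpha * (r - v) * grad                   # direct-actor-style update (assumed form)
        v += alpha * (r - v)                          # Pavlovian value tracks average reward
        p_go.append(p[1])
    return float(np.mean(p_go[-500:]))                # asymptotic P(go)

print(neg_automaintenance())
```

In this sketch, larger ω sustains more responding to G even though G is never rewarded, which is the qualitative pattern the next slide's equilibrium curves describe.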
-ve Automaintenance in Autoshaping
basic -ve automaintenance effect (μ=5)
lines are theoretical asymptotes
equilibrium probabilities of action