Summary of part I: prediction and RL


Presentation Transcript

Slide1

Summary of part I: prediction and RL

Prediction is important for action selection
The problem: prediction of future reward
The algorithm: temporal difference learning
Neural implementation: dopamine-dependent learning in BG
A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model
Precise (normative!) theory for generation of dopamine firing patterns
Explains anticipatory dopaminergic responding, second-order conditioning
Compelling account for the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas

Slide2

prediction error hypothesis of dopamine

[figure: model prediction error vs. measured firing rate]

Bayer & Glimcher (2005)

at end of trial: δ(t) = r(t) − V(t)  (just like R-W)
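A minimal sketch of this end-of-trial update (my own illustration, not from the slides), with hypothetical names alpha for the learning rate and V for the learned prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1          # learning rate (hypothetical value)
V = 0.0              # learned prediction of reward for the cue
p_reward = 0.75      # probability that a trial is rewarded

for trial in range(200):
    r = float(rng.random() < p_reward)   # reward delivered at the end of the trial
    delta = r - V                        # prediction error, delta = r - V, as on the slide
    V += alpha * delta                   # Rescorla-Wagner / end-of-trial TD update

print(V)   # converges toward p_reward (about 0.75)
```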

Slide3

Global plan

Reinforcement learning I: prediction; classical conditioning; dopamine
Reinforcement learning II: dynamic programming; action selection
Pavlovian misbehaviour
Vigour
Chapter 9 of Theoretical Neuroscience

Slide4

Action Selection

Evolutionary specification
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching
Delayed reinforcement: these tasks; mazes; chess

Bandler; Blanchard

Slide5

Immediate Reinforcement

stochastic policy: based on action values
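The policy itself is only shown as a figure; one standard reading of "stochastic policy based on action values", sketched here with a hypothetical inverse-temperature parameter beta, is a softmax:

```python
import numpy as np

def softmax_policy(action_values, beta=1.0):
    """Probability of each action under a softmax of its learned value."""
    prefs = beta * np.asarray(action_values, dtype=float)
    prefs -= prefs.max()                # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

print(softmax_policy([1.0, 0.5], beta=2.0))   # the higher-valued action is chosen more often
```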

Slide6

Indirect Actor

use RW rule
switch every 100 trials
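A minimal sketch of the indirect actor as described here, under my own parameter choices: action values learned with the RW (delta) rule, actions drawn from a softmax of those values, and reward contingencies swapped every 100 trials.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.1, 3.0            # learning rate, softmax inverse temperature (hypothetical)
Q = np.zeros(2)                   # action values for the two levers
p_reward = np.array([0.8, 0.2])   # reward probabilities, swapped every 100 trials

for trial in range(400):
    if trial > 0 and trial % 100 == 0:
        p_reward = p_reward[::-1]            # switch contingencies every 100 trials
    prefs = np.exp(beta * (Q - Q.max()))
    p = prefs / prefs.sum()                  # stochastic (softmax) policy from action values
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    Q[a] += alpha * (r - Q[a])               # RW rule on the chosen action's value
```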

Slide7

Direct Actor
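The direct-actor slides are figure-only in this transcript; the sketch below follows the standard textbook form (e.g. Theoretical Neuroscience ch. 9), in which action propensities are adjusted directly by the reinforcement comparison rather than via learned values. Parameter names and values are my own.

```python
import numpy as np

rng = np.random.default_rng(2)
eps, beta = 0.1, 1.0               # learning rate and inverse temperature (hypothetical)
m = np.zeros(2)                    # action propensities: parameters of the policy itself
r_bar = 0.0                        # running average reward used as the comparison term
p_reward = np.array([0.8, 0.2])

for trial in range(400):
    p = np.exp(beta * (m - m.max()))
    p /= p.sum()                             # softmax policy over the propensities
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    m += eps * (r - r_bar) * ((np.arange(2) == a) - p)   # reinforce chosen action vs. the rest
    r_bar += 0.05 * (r - r_bar)              # slowly tracked average reward
```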

Slide8

Direct Actor

Slide9

Could we Tell?

correlate past rewards and actions with present choice
indirect actor (separate clocks)
direct actor (single clock)
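One way to cash out "correlate past rewards, actions with present choice", sketched under my own assumptions: build lagged regressors from reward and choice history and fit linear kernels, whose shapes the two actor variants predict differently.

```python
import numpy as np

def choice_history_kernels(choices, rewards, n_lags=5):
    """Linear sketch: regress the current choice on the last n_lags rewards and choices."""
    choices = np.asarray(choices, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    X, y = [], []
    for t in range(n_lags, len(choices)):
        past_r = rewards[t - n_lags:t][::-1]      # most recent lag first
        past_c = choices[t - n_lags:t][::-1]
        X.append(np.concatenate([past_r, past_c, [1.0]]))
        y.append(choices[t])
    coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return coef[:n_lags], coef[n_lags:2 * n_lags]   # reward kernel, choice kernel
```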

Slide10

Matching: Concurrent VI-VI

Lau, Glimcher, Corrado, Sugrue, Newsome

Slide11

Matching

income, not return
approximately exponential in r
alternation choice kernel

Slide12

Action at a (Temporal) Distance

learning an appropriate action at x=1 depends on the actions at x=2 and x=3
gains no immediate feedback
idea: use prediction as surrogate feedback

[figure: maze states x=1, x=2, x=3]

Slide13

Action Selection

start with policy; evaluate it; improve it
thus choose R more frequently than L, C

[figure: maze states x=1, x=2, x=3 with values 0.025, −0.175, −0.125, 0.125]

Slide14

Policy

value is too pessimistic
action is better than average

[figure: maze states x=1, x=2, x=3]

Slide15

actor/critic

[figure: actor weights m1, m2, m3, …, mn]

dopamine signals to both motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies
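A minimal actor-critic sketch of "training both values & policies" from a single prediction error (a toy chain task and parameter values of my own, not a model from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 3, 2
alpha_v, alpha_m, beta, gamma = 0.1, 0.1, 2.0, 0.9    # hypothetical parameters
V = np.zeros(n_states)                 # critic: state values
M = np.zeros((n_states, n_actions))    # actor: action propensities per state

def step(x, a):
    """Toy chain: action 1 moves right; reward only for action 1 in the last state."""
    if x == n_states - 1:
        return None, (1.0 if a == 1 else 0.0)
    return (x + 1 if a == 1 else x), 0.0

for episode in range(500):
    x = 0
    while x is not None:
        p = np.exp(beta * (M[x] - M[x].max()))
        p /= p.sum()
        a = rng.choice(n_actions, p=p)
        x_next, r = step(x, a)
        v_next = 0.0 if x_next is None else V[x_next]
        delta = r + gamma * v_next - V[x]     # one TD error ...
        V[x] += alpha_v * delta               # ... trains the critic's values
        M[x, a] += alpha_m * delta            # ... and the actor's policy
        x = x_next
```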

Slide16

Formally: Dynamic Programming

policy iteration
value iteration
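The update equations on this slide did not survive the transcript; below is a generic sketch of both schemes for a finite MDP, using my own notation (P holds one transition matrix per action, R the expected immediate rewards):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """V(x) <- max_a [ R(x,a) + gamma * sum_y P(y|x,a) V(y) ]."""
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('axy,y->xa', P, V)     # P: (actions, states, states)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation with greedy policy improvement."""
    n_states, _ = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[policy, np.arange(n_states), :]          # transitions under current policy
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)   # evaluate
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        new_policy = Q.argmax(axis=1)                     # improve
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```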

Slide17

Variants: SARSA

Morris et al., 2006

Slide18

Variants: Q learning

Roesch et al., 2007
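For reference, a minimal sketch of the two update rules these variants refer to (generic implementations of my own, not the tasks of Morris et al. or Roesch et al.; Q is a NumPy array of shape (n_states, n_actions)):

```python
def sarsa_update(Q, x, a, r, x_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap from the action actually taken next (on-policy)."""
    Q[x, a] += alpha * (r + gamma * Q[x_next, a_next] - Q[x, a])

def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap from the best available next action (off-policy)."""
    Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
```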

Slide19

Summary

prediction learning: Bellman evaluation
actor-critic: asynchronous policy iteration
indirect method (Q learning): asynchronous value iteration

Slide20

Impulsivity & Hyperbolic Discounting

humans (and animals) show impulsivity in: diets; addiction; spending, …
intertemporal conflict between short- and long-term choices
often explained via hyperbolic discount functions
alternative is a Pavlovian imperative to an immediate reinforcer
framing, trolley dilemmas, etc.
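A small illustration of why hyperbolic discounting creates intertemporal conflict (all numbers arbitrary): with V = r / (1 + k·D), preference between a small-soon and a large-late reward can reverse as both recede into the future, which exponential discounting never produces.

```python
def hyperbolic(r, delay, k=1.0):
    """Hyperbolic discounting: V = r / (1 + k * delay)."""
    return r / (1.0 + k * delay)

small_soon = (2.0, 1.0)    # (reward, delay): arbitrary illustrative numbers
large_late = (5.0, 5.0)

for extra_wait in (0.0, 10.0):   # evaluate now, and when both options are 10 units further away
    v_ss = hyperbolic(small_soon[0], small_soon[1] + extra_wait)
    v_ll = hyperbolic(large_late[0], large_late[1] + extra_wait)
    print(f"extra wait {extra_wait:4.1f}: small-soon {v_ss:.3f}  large-late {v_ll:.3f}")
# Far in advance the large-late reward wins; close up, the small-soon one does.
```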

Slide21

Direct/Indirect Pathways

direct: D1: GO; learn from DA increase
indirect: D2: noGO; learn from DA decrease
hyperdirect (STN): delay actions given strongly attractive choices

Frank

Slide22

Frank

DARPP-32: D1 effect
DRD2: D2 effect

Slide23

Three Decision Makers

tree search
position evaluation
situation memory

Slide24

Multiple Systems in RL

model-based RL: build a forward model of the task and outcomes; search in the forward model (online DP); optimal use of information; computationally ruinous
cache-based RL: learn Q values, which summarize future worth; computationally trivial; bootstrap-based, so statistically inefficient
learn both; select according to uncertainty
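A schematic contrast of the two value computations (toy code of my own, not a model of any specific task): the cached system reads a stored Q value in one lookup, while the model-based system recomputes it by searching a learned forward model.

```python
import numpy as np

# Learned ingredients: P[a, x, y] = transition probabilities, R[x, a] = expected reward
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.1, 0.9], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
Q_cached = np.zeros((2, 2))        # model-free: filled in incrementally from experience

def mb_value(x, a, depth, gamma=0.9):
    """Model-based value: explicit lookahead through the learned model (online DP)."""
    if depth == 0:
        return R[x, a]
    future = sum(P[a, x, y] * max(mb_value(y, b, depth - 1) for b in range(2))
                 for y in range(2))
    return R[x, a] + gamma * future

print(Q_cached[0, 1])              # cached answer: a single table lookup (here untrained)
print(mb_value(0, 1, depth=3))     # model-based answer: tree search, costly but up to date
```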

Slide25

Animal Canary

OFC; dlPFC; dorsomedial striatum; BLA?
dorsolateral striatum; amygdala

Slide26

Two Systems:

Slide27

Behavioural Effects

Slide28

Effects of Learning

distributional value iteration

(Bayesian Q learning)

fixed additional uncertainty per step

Slide29

One Outcome

shallow tree implies goal-directed control wins

Slide30

Human Canary...

if a  c and c  £££, then do more of a or b?
MB: b
MF: a (or even no effect)

[figure: states a, b, c]

Slide31

Behaviour

action values depend on both systems
expect that the weighting between them will vary by subject (but be fixed)

Slide32

Neural Prediction Errors (12)

note that MB RL does not use this prediction error – training signal?

R ventral striatum

(anatomical definition)

Slide33

Neural Prediction Errors (1)

right nucleus accumbens

behaviour

1-2, not 1

Slide34

Pavlovian Control

Slide35

The 6-State Task

Huys et al., 2012

Slide36

Evaluation

full tree
random pruning
specific pruning
pigeon
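A rough sketch of the kind of pruned tree evaluation these model components name (my own simplified rendering of the idea, not the authors' code): expansion of a branch stops stochastically, and stops more readily immediately below a large loss (specific pruning).

```python
import random

def evaluate(state, depth, transitions, p_general=0.2, p_specific=0.7, loss_cutoff=-70):
    """Depth-limited tree evaluation with stochastic pruning.

    transitions[state] -> list of (reward, next_state) pairs; all parameters hypothetical.
    """
    if depth == 0 or not transitions.get(state):
        return 0.0
    values = []
    for reward, next_state in transitions[state]:
        # stop expanding below this branch with some probability,
        # and much more readily when the branch starts with a large loss
        p_prune = p_specific if reward <= loss_cutoff else p_general
        if random.random() < p_prune:
            future = 0.0
        else:
            future = evaluate(next_state, depth - 1, transitions,
                              p_general, p_specific, loss_cutoff)
        values.append(reward + future)
    return max(values)
```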

Slide37

Results

full lookahead
unbiased termination
value-based pruning
pigeon effect
full model
best model

Slide38

Parameter Values

BDI (mean 3.7; range 0-15)

more reliance on pruning means more ‘prone to depression’

Slide39

Direct Evidence of Pruning

no weird loss aversion
+140 vs -X

Slide40

Vigour

Two components to choice:
what: lever pressing; direction to run; meal to choose
when / how fast / how vigorous
free operant tasks
real-valued DP

Slide41

The model

[figure: choose (action, latency) = (LP, τ1), then (LP, τ2), …; each choice incurs costs (unit cost plus vigour cost, depending on how fast) and yields rewards; states S0, S1, S2; actions LP, NP, Other; UR, PR; goal]

Slide42

The model

[figure: choose (action, latency) = (LP, τ1), (LP, τ2), …, with costs and rewards over time; states S0, S1, S2]

Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per time)

ARL

Slide43

Compute differential values of actions

Differential value of taking action L with latency τ when in state x:
QL,τ(x) = Rewards − Costs + Future Returns
ρ = average rewards minus costs, per unit time
steady-state behaviour (not learning dynamics)
(extension of Schwartz 1993)

Average Reward RL

Slide44

Average Reward Cost/benefit Tradeoffs

1. Which action to take?
Choose action with largest expected reward minus cost

2. How fast to perform it?
slow: less costly (vigour cost)
slow: delays (all) rewards; net rate of rewards = cost of delay (opportunity cost of time)
Choose rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc.
masochism
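A tiny numerical illustration of that balance (example values are my own): if responding with latency tau incurs a vigour cost Cv/tau and an opportunity cost rho*tau (the average reward foregone while waiting), the optimal latency sits where the two trade off, tau* = sqrt(Cv/rho), so a higher average reward rate speeds responding.

```python
import numpy as np

def optimal_latency(vigour_cost, reward_rate):
    """Latency minimising vigour_cost/tau + reward_rate*tau (vigour vs. opportunity cost)."""
    tau = np.linspace(0.05, 10.0, 2000)
    total_cost = vigour_cost / tau + reward_rate * tau
    return tau[np.argmin(total_cost)]

for rho in (0.5, 1.0, 2.0):        # higher average reward rate ...
    print(rho, round(float(optimal_latency(vigour_cost=1.0, reward_rate=rho)), 2))
# ... gives shorter optimal latencies (faster responding), close to sqrt(Cv / rho).
```

Slide45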


Optimal response rates

[figure: experimental data (Niv, Dayan, Joel, unpublished) and model simulation; rate of 1st nose poke vs. seconds since reinforcement]

Slide46

Optimal response rates

[figure: model simulation (% responses on lever A vs. % reinforcements on lever A, model vs. perfect matching) and experimental data (Herrnstein 1961: pigeons A and B, % responses on key A vs. % reinforcements on key A, perfect matching)]

More: # responses; interval length; amount of reward; ratio vs. interval; breaking point; temporal structure; etc.

Slide47

Effects of motivation (in the model)

[figure: RR25 schedule; mean latency of LP and Other responses for low vs. high utility (energizing effect)]

Slide48

Effects of motivation (in the model)

[figure: 1. energizing effect: mean latency of LP and Other for low vs. high utility; 2. directing effect: response rate per minute vs. seconds from reinforcement, UR 50% and RR25]

Slide49

Relation to Dopamine

What about tonic dopamine?
Phasic dopamine firing = reward prediction error (more / less)

Slide50

Tonic dopamine = average reward rate
NB: phasic signal = RPE for choice/value learning

[figure: # LPs in 30 minutes, control vs. DA depleted (Aberman and Salamone 1999 data and model simulation)]

explains pharmacological manipulations
dopamine control of vigour through BG pathways
eating time confound
context/state dependence (motivation & drugs?)
less switching = perseveration

Slide51

Tonic dopamine hypothesis

[figure: firing rate and reaction time data (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992)]

…also explains effects of phasic dopamine on response times

Slide52

Sensory Decisions as Optimal Stopping

consider listening to:
decision: choose, or sample
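The quantities on the following slides are image-only in this transcript; as a generic sketch of the choose-or-sample logic, here is sequential evidence accumulation with a stopping rule on the posterior (likelihoods and threshold are arbitrary choices of mine):

```python
import numpy as np

def sample_or_choose(p_high=0.6, threshold=0.95, max_samples=200, seed=0):
    """Keep sampling noisy binary observations until the posterior over the two
    alternatives is confident enough, then commit (an optimal-stopping sketch)."""
    rng = np.random.default_rng(seed)
    log_lr = 0.0                                  # log likelihood ratio: H1 vs H0
    step = np.log(p_high / (1.0 - p_high))
    for n in range(1, max_samples + 1):
        obs = rng.random() < p_high               # the truth is H1 in this simulation
        log_lr += step if obs else -step
        posterior = 1.0 / (1.0 + np.exp(-log_lr))
        if posterior > threshold or posterior < 1.0 - threshold:
            return ("H1" if posterior > threshold else "H0"), n
    return "undecided", max_samples

print(sample_or_choose())                         # e.g. ('H1', number of samples taken)
```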

Slide53

Optimal Stopping

Slide54

Key Quantities

Slide55

Optimal Stopping

Slide56

Optimal Stopping

equivalent of state u=1 is …, and of states u=2,3 is …

Slide57

Transition Probabilities

Slide58

Computational Neuromodulation

dopamine
  phasic: prediction error for reward
  tonic: average reward (vigour)
serotonin
  phasic: prediction error for punishment?
acetylcholine: expected uncertainty?
norepinephrine: unexpected uncertainty; neural interrupt?

Slide59

Conditioning

Ethology: optimality; appropriateness
Psychology: classical/operant conditioning
Computation: dynamic progr.; Kalman filtering
Algorithm: TD/delta rules; simple weights
Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions