Summary of part I: prediction and RL



Presentation Transcript

Slide1

Summary of part I: prediction and RL

Prediction is important for action selection
The problem: prediction of future reward
The algorithm: temporal difference learning
Neural implementation: dopamine-dependent learning in BG
A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model
Precise (normative!) theory for generation of dopamine firing patterns
Explains anticipatory dopaminergic responding, second-order conditioning
Compelling account for the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas

Slide2

prediction error hypothesis of dopamine

model prediction error vs. measured firing rate

Bayer & Glimcher (2005)

at end of trial: δ_t = r_t − V_t (just like Rescorla-Wagner)

Slide3
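As a concrete gloss on the prediction-error equation above, here is a minimal Python sketch of end-of-trial value learning driven by δ_t = r_t − V_t; the learning rate, reward probability and trial loop are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rescorla_wagner(rewards, epsilon=0.1):
    """Trial-by-trial value learning driven by the prediction error
    delta_t = r_t - V_t (the quantity hypothesised to be signalled
    by phasic dopamine)."""
    V = 0.0
    deltas = []
    for r in rewards:
        delta = r - V          # prediction error at end of trial
        V += epsilon * delta   # update the value estimate
        deltas.append(delta)
    return V, deltas

# example: reward delivered on 80% of trials
rng = np.random.default_rng(0)
rewards = (rng.random(200) < 0.8).astype(float)
V, deltas = rescorla_wagner(rewards)
print(V)  # hovers near the true reward probability of 0.8
```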

Global plan

Reinforcement learning I: prediction; classical conditioning; dopamine
Reinforcement learning II: dynamic programming; action selection
Pavlovian misbehaviour
Vigour
Chapter 9 of Theoretical Neuroscience

Slide4

Action Selection

Evolutionary specification
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon; rat; human matching
Delayed reinforcement: these tasks; mazes; chess

Bandler; Blanchard

Slide5

Immediate Reinforcement

stochastic policy, based on action values

Slide6

Indirect Actor

use RW rule:

switch every 100 trials

Slide7
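A hedged Python sketch of the indirect actor on the previous slide: action values are learned with the Rescorla-Wagner rule and choices are drawn from a softmax on those values (the "stochastic policy based on action values"). The two-armed bandit, the reward probabilities, the inverse temperature and the learning rate are my assumptions; only the RW update and the 100-trial switch come from the slides.

```python
import numpy as np

def indirect_actor(n_trials=400, epsilon=0.1, beta=3.0, seed=0):
    """Indirect actor: learn one action value per option with the RW rule,
    choose by softmax on those values; the two reward probabilities are
    swapped every 100 trials."""
    rng = np.random.default_rng(seed)
    m = np.zeros(2)                    # action values (one 'clock' per action)
    p_reward = np.array([0.8, 0.2])    # assumed reward probabilities
    choices = []
    for t in range(n_trials):
        if t > 0 and t % 100 == 0:
            p_reward = p_reward[::-1]  # switch every 100 trials
        p = np.exp(beta * m) / np.exp(beta * m).sum()   # softmax policy
        a = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[a])
        m[a] += epsilon * (r - m[a])   # RW update of the chosen action's value
        choices.append(a)
    return np.array(choices), m
```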

Direct Actor

Slide8

Direct Actor

Slide9
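The transcript keeps only the slide titles here, so the following Python sketch is an assumption about what the direct actor looks like (in the style of Dayan & Abbott, Chapter 9): the softmax policy parameters are adjusted directly, using reward relative to a single running-average baseline (one shared 'clock'), rather than via separate per-action value estimates.

```python
import numpy as np

def direct_actor(n_trials=400, epsilon=0.1, beta=1.0, seed=0):
    """Direct actor: adjust softmax policy parameters m directly, using
    (r - r_bar) as the reinforcement signal, where r_bar is a single
    running-average reward baseline shared by all actions."""
    rng = np.random.default_rng(seed)
    m = np.zeros(2)                    # action propensities (policy parameters)
    r_bar = 0.0                        # shared reward baseline
    p_reward = np.array([0.8, 0.2])    # assumed reward probabilities
    for _ in range(n_trials):
        p = np.exp(beta * m) / np.exp(beta * m).sum()
        a = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[a])
        grad = -p.copy()
        grad[a] += 1.0                 # d log pi(a) / d m
        m += epsilon * (r - r_bar) * grad   # push chosen action up if r beats baseline
        r_bar += epsilon * (r - r_bar)
    return m, r_bar
```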

Could we Tell?

correlate past rewards, actions with present choice
indirect actor (separate clocks)
direct actor (single clock)

Slide10

Matching: Concurrent VI-VI

Lau, Glimcher, Corrado, Sugrue, Newsome

Slide11

Matching

income, not return
approximately exponential in r
alternation choice kernel

Slide12
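One way to read the previous slide concretely; this is my sketch, not the authors' model: local income (not return) on each option is tracked with approximately exponential filters over past rewards, a choice kernel does the same for past choices to capture alternation, and both feed a logistic choice rule. All parameter values are assumptions.

```python
import numpy as np

def choice_probabilities(rewards, choices, tau_r=0.2, tau_c=0.3, w_r=5.0, w_c=-1.0):
    """Per-trial P(choose option 1) in a two-option matching task.
    rewards[t] is the reward obtained on trial t, choices[t] in {0, 1}.
    A negative choice-kernel weight w_c produces alternation."""
    income = np.zeros(2)   # exponentially filtered reward income per option
    kernel = np.zeros(2)   # exponentially filtered history of past choices
    p1 = []
    for r, c in zip(rewards, choices):
        drive = w_r * (income[1] - income[0]) + w_c * (kernel[1] - kernel[0])
        p1.append(1.0 / (1.0 + np.exp(-drive)))
        income *= (1 - tau_r)
        income[c] += tau_r * r
        kernel *= (1 - tau_c)
        kernel[c] += tau_c
    return np.array(p1)
```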

Action at a (Temporal) Distance

learning an appropriate action at x=1 depends on the actions at x=2 and x=3
gains no immediate feedback
idea: use prediction as surrogate feedback

Slide13

Action Selection

start with policy:
evaluate it:
improve it:
thus choose R more frequently than L, C

[figure: three-state task (x=1, x=2, x=3) with evaluated values 0.025, -0.175, -0.125, 0.125]

Slide14

Policy

value is too pessimistic
action is better than average

Slide15

actor/critic

m_1, m_2, m_3, …, m_n

dopamine signals to both motivational & motor striatum appear, surprisingly the same

suggestion: training both values & policies

Slide16
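To make the "training both values & policies" suggestion concrete, here is a small actor-critic sketch in Python (my illustration, not the slides' code): a single TD error trains both the critic's state values and the actor's action propensities m. The three-state chain task and all parameters are assumptions.

```python
import numpy as np

def actor_critic(n_episodes=500, epsilon=0.1, beta=1.0, gamma=1.0, seed=0):
    """Actor-critic on a 3-state chain: from each state choose L or R;
    R moves toward the rewarded terminal state. One TD error trains
    both the critic (V) and the actor (propensities m)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = 3, 2
    V = np.zeros(n_states)                 # critic: state values
    m = np.zeros((n_states, n_actions))    # actor: action propensities
    for _ in range(n_episodes):
        s = 0
        while True:
            p = np.exp(beta * m[s]); p /= p.sum()
            a = rng.choice(n_actions, p=p)
            s_next = s + 1 if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # shared TD error
            V[s] += epsilon * delta                                  # critic update
            grad = -p.copy()
            grad[a] += 1.0
            m[s] += epsilon * delta * grad                           # actor update
            if done:
                break
            s = s_next
    return V, m
```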

Formally: Dynamic Programming

policy iteration: evaluate the current policy π by solving V^π(x) = Σ_a π(a|x) [ r(x,a) + Σ_x' P(x'|x,a) V^π(x') ], then improve π greedily with respect to V^π

value iteration: iterate V(x) ← max_a [ r(x,a) + Σ_x' P(x'|x,a) V(x') ]
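A compact Python sketch of the two dynamic-programming schemes named above (my illustration; the tabular representation, discount factor and convergence threshold are assumptions).

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[a, s, s'] = transition probabilities, R[s, a] = expected reward.
    Iterate the Bellman optimality update until convergence."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # values and greedy policy
        V = V_new

def policy_iteration(P, R, gamma=0.95):
    """Alternate policy evaluation (solve a linear system) and greedy improvement."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[policy, np.arange(n_states)]          # transitions under the policy
        r_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```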

 Slide17

Variants: SARSA

Morris et al, 2006

Slide18

Variants: Q learning

Roesch et al, 2007

Slide19
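The two variants on these slides differ only in the prediction error they use, which is what the recordings test; a hedged sketch of the two update rules in Python (the learning rate and discount are assumed, and Q is a numpy array of shape (n_states, n_actions)):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """On-policy: bootstrap from the action actually chosen next."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Off-policy: bootstrap from the best available next action."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return delta
```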

Summary

prediction learning: Bellman evaluation
actor-critic: asynchronous policy iteration
indirect method (Q learning): asynchronous value iteration

Slide20

Impulsivity & Hyperbolic Discounting

humans (and animals) show impulsivity in: diets; addiction; spending, …
intertemporal conflict between short- and long-term choices
often explained via hyperbolic discount functions
alternative is a Pavlovian imperative to an immediate reinforcer
framing, trolley dilemmas, etc

Slide21
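For reference, a small Python comparison of exponential and hyperbolic discount functions (the parameter values and the small-soon vs. large-late example are illustrative assumptions). Hyperbolic curves cross as the rewards draw near, which is the usual account of the preference reversals behind impulsivity.

```python
def exponential_discount(r, delay, gamma=0.9):
    """Standard RL discounting: value falls by a constant factor per step."""
    return r * gamma ** delay

def hyperbolic_discount(r, delay, k=0.5):
    """Hyperbolic discounting: steep near t=0, shallow later."""
    return r / (1.0 + k * delay)

# preference reversal: small-soon (r=1 at t) vs. large-late (r=2 at t+5)
for t in [0, 2, 10]:
    soon = hyperbolic_discount(1.0, t)
    late = hyperbolic_discount(2.0, t + 5)
    print(t, "choose soon" if soon > late else "choose late")
```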

Direct/Indirect Pathways

direct: D1: GO; learn from DA increase
indirect: D2: noGO; learn from DA decrease
hyperdirect (STN): delay actions given strongly attractive choices

Frank

Slide22

Frank

DARPP-32: D1 effect
DRD2: D2 effect

Slide23

Three Decision Makers

tree search
position evaluation
situation memory

Slide24

Multiple Systems in RL

model-based RL: build a forward model of the task and its outcomes; search in the forward model (online DP); optimal use of information; computationally ruinous
cache-based RL: learn Q values, which summarize future worth; computationally trivial; bootstrap-based, so statistically inefficient
learn both – select according to uncertainty

Slide25
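A hedged Python sketch of the contrast drawn on the previous slide; the tiny two-stage task, its probabilities and rewards are my assumptions: the model-based value is computed by searching a forward model at decision time, while the cached (model-free) value is just a table updated by bootstrapping from experience.

```python
import numpy as np

# Assumed toy task: 2 actions at the root, each leading (stochastically)
# to one of two second-stage states with known terminal rewards.
P = np.array([[0.7, 0.3],      # P(stage-2 state | root action 0)
              [0.3, 0.7]])     # P(stage-2 state | root action 1)
stage2_reward = np.array([1.0, 0.0])

def model_based_values():
    """Search the forward model at choice time: expected reward per action."""
    return P @ stage2_reward

def cached_update(Q, a, r, alpha=0.1):
    """Model-free: just update the cached value of the taken action."""
    Q[a] += alpha * (r - Q[a])
    return Q

Q = np.zeros(2)                 # cached (model-free) values
print(model_based_values())     # [0.7, 0.3] -- correct as soon as the model is known
# the cache only becomes accurate after many bootstrapped updates
```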

Animal Canary

OFC; dlPFC; dorsomedial striatum; BLA?
dorsolateral striatum; amygdala

Slide26

Two Systems:

Slide27

Behavioural Effects

Slide28

Effects of Learning

distributional value iteration

(Bayesian Q learning)

fixed additional uncertainty per step

Slide29

One Outcome

shallow tree implies goal-directed control wins

Slide30

Human Canary

...if a → c and c → £££, then do more of a or b?
MB: b
MF: a (or even no effect)

Slide31

Behaviour

action values depend on both systems (a weighted combination of model-based and model-free values)
expect that the weighting will vary by subject (but be fixed)

Slide32
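A minimal sketch of such a weighted combination feeding a softmax choice rule, as is common in this literature; the weight symbol w, the inverse temperature and the softmax form are my assumptions, since the slide's own equation did not survive extraction.

```python
import numpy as np

def combined_choice_probs(q_mb, q_mf, w=0.5, beta=3.0):
    """Mix model-based and model-free action values with weight w,
    then pass the net values through a softmax to get choice probabilities."""
    q = w * np.asarray(q_mb) + (1.0 - w) * np.asarray(q_mf)
    z = np.exp(beta * (q - q.max()))
    return z / z.sum()

print(combined_choice_probs([0.7, 0.3], [0.2, 0.6], w=0.8))
```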

Neural Prediction Errors (12)

note that MB RL does not use this prediction error – training signal?

R ventral striatum

(anatomical definition)

Slide33

Neural Prediction Errors (1)

right nucleus accumbens

behaviour

1-2, not 1

Slide34

Vigour

Two components to choice:
what: lever pressing; direction to run; meal to choose
when / how fast / how vigorous

free operant tasks

real-valued DP

Slide35

The model

choose (action, latency τ): e.g. (LP, τ1), then (LP, τ2), …
each choice incurs costs (a unit cost plus a vigour cost that depends on how fast the action is performed) and can earn rewards
[figure: semi-Markov task diagram with states S0, S1, S2 and actions LP (lever press), NP (nose poke), Other]

Slide36

The model

Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time)

ARL

Slide37

Average Reward RL

Compute differential values of actions

Q_{L,τ}(x) = Rewards − Costs + Future Returns
the differential value of taking action L with latency τ when in state x
ρ = average rewards minus costs, per unit time
steady-state behavior (not learning dynamics)
(extension of Schwartz 1993)

Slide38

Average Reward Cost/benefit Tradeoffs

1. Which action to take?
choose the action with the largest expected reward minus cost

2. How fast to perform it?
slow → less costly (vigour cost), but slow → delays (all) rewards
the net rate of rewards is the cost of delay (the opportunity cost of time)
choose the rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc.
masochism

Slide39
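The balance described on the previous slide has a simple closed form under common assumptions (my sketch of the standard average-reward analysis, not code from the slides): if responding with latency τ incurs a vigour cost Cv/τ and delaying everything by τ forfeits ρ·τ of average reward, the best latency is τ* = sqrt(Cv/ρ), so a higher average reward rate ρ (e.g. under hunger) speeds all actions.

```python
import numpy as np

def optimal_latency(vigour_cost, avg_reward_rate):
    """Minimize cost(tau) = vigour_cost / tau + avg_reward_rate * tau.
    Setting the derivative to zero gives tau* = sqrt(vigour_cost / rho)."""
    return np.sqrt(vigour_cost / avg_reward_rate)

# a richer environment (higher rho) makes responding faster (more vigorous)
for rho in [0.5, 1.0, 2.0]:
    print(rho, optimal_latency(vigour_cost=1.0, avg_reward_rate=rho))
```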

Optimal response rates

[figure: latency to 1st nose poke as a function of seconds since reinforcement; experimental data (Niv, Dayan, Joel, unpublished) alongside model simulation]

Slide40

Optimal response rates

[figure: % responses on lever A vs. % reinforcements on lever A; the model simulation lies close to perfect matching, as do experimental data from Pigeon A and Pigeon B (% responses on key A vs. % reinforcements on key A; Herrnstein 1961)]

More: # responses; interval length; amount of reward; ratio vs. interval; breaking point; temporal structure; etc.

Slide41

Effects of motivation (in the model)

[figure: RR25 schedule; mean latency of LP and Other responses under low vs. high utility; energizing effect]

Slide42

Effects of motivation (in the model)

[figure: 1. energizing effect: mean latency of LP and Other under low vs. high utility (RR25); 2. directing effect: response rate per minute vs. seconds from reinforcement when reward utility is changed (UR, 50%)]

Slide43

What about tonic dopamine?

Phasic dopamine firing = reward prediction error

Relation to Dopamine

Slide44

Tonic dopamine = average reward rate
NB: the phasic signal remains the RPE for choice/value learning

[figure: # LPs in 30 minutes on different ratio schedules, control vs. DA-depleted; experimental data (Aberman and Salamone 1999) and model simulation]

explains pharmacological manipulations
dopamine control of vigour through BG pathways
eating time confound
context/state dependence (motivation & drugs?)
less switching = perseveration

Slide45

Tonic dopamine hypothesis

[figure: firing rate and reaction time data; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]

…also explains effects of phasic dopamine on response times

Slide46

Sensory Decisions as Optimal Stopping

consider listening to:
decision: choose, or sample

Slide47

Optimal Stopping

equivalent of state u=1 is …, and states u=2,3 is …

Slide48

Transition Probabilities

Slide49

Computational Neuromodulation

dopamine: phasic = prediction error for reward; tonic = average reward (vigour)
serotonin: phasic = prediction error for punishment?
acetylcholine: expected uncertainty?
norepinephrine: unexpected uncertainty; neural interrupt?

Slide50

Conditioning

Ethology: optimality; appropriateness
Psychology: classical / operant conditioning
Computation: dynamic programming; Kalman filtering
Algorithm: TD/delta rules; simple weights
Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions

Slide51

Markov Decision Process

class of stylized tasks with states, actions & rewards
at each timestep t the world takes on state s_t and delivers reward r_t, and the agent chooses an action a_t

Slide52

Markov Decision Process

World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I’ll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I’ll take action 1.
World: You’re in state 34 (again). Your immediate reward is 3. You have 3 actions.

Slide53
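The dialogue above is just the agent-environment interaction loop of an MDP; a minimal Python sketch of that loop (the toy random world, the state numbering and the random "robot" policy are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def world_step(state):
    """Toy stand-in for the world: returns (reward, number of available actions)."""
    reward = rng.integers(-10, 10)
    n_actions = rng.integers(2, 4)
    return reward, n_actions

state = 34
for t in range(3):
    reward, n_actions = world_step(state)
    print(f"World: You are in state {state}. Your immediate reward is {reward}. "
          f"You have {n_actions} actions.")
    action = rng.integers(n_actions)   # the robot's (here: random) policy
    print(f"Robot: I'll take action {action}.")
    state = rng.integers(100)          # next state chosen by the world
```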

Markov Decision Process

Stochastic process defined by:

reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)

Slide54

Markov Decision Process

Stochastic process defined by:

reward function: r_t ~ P(r_t | s_t)
transition function: s_{t+1} ~ P(s_{t+1} | s_t, a_t)

Markov property: the future is conditionally independent of the past, given s_t

Slide55

The optimal policy

Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies.
Theorem: for every MDP there exists (at least) one deterministic optimal policy.
By the way, why is the optimal policy just a mapping from states to actions? Couldn’t you earn more reward by choosing a different action depending on the last 2 states?

Slide56

Pavlovian & Instrumental Conditioning

Pavlovian: learning values and predictions using the TD error
Instrumental: learning actions, by reinforcement (leg flexion) or by the (TD) critic
(actually different forms: goal-directed & habitual)

Slide57

Pavlovian-Instrumental Interactions

synergistic: conditioned reinforcement; Pavlovian-instrumental transfer (Pavlovian cue predicts the instrumental outcome); behavioural inhibition to avoid aversive outcomes
neutral: Pavlovian-instrumental transfer (Pavlovian cue predicts an outcome with the same motivational valence)
opponent: Pavlovian-instrumental transfer (Pavlovian cue predicts the opposite motivational valence); negative automaintenance

Slide58

-ve Automaintenance in Autoshaping

simple choice task
N: nogo gives reward r=1
G: go gives reward r=0
learn three quantities: the average value, the Q value for N, and the Q value for G
the instrumental propensity is …

Slide59

-ve Automaintenance in Autoshaping

Pavlovian action
assert: the Pavlovian impetus towards G is v(t)
weight Pavlovian and instrumental advantages by ω (the competitive reliability of Pavlov)
new propensities
new action choice

Slide60
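A hedged Python sketch of the competition described on the two preceding slides (my reading, with assumed softmax propensities and parameter values): the instrumental system learns Q values that favour N (nogo, rewarded), while the Pavlovian system adds an impetus proportional to the learned value v towards G (go); mixing the two with weight ω keeps some non-rewarded "go" responding at equilibrium, the negative automaintenance effect.

```python
import numpy as np

def neg_automaintenance(omega, n_trials=5000, eps=0.05, beta=5.0, seed=0):
    """N (nogo) earns r=1, G (go) earns r=0. Instrumental propensities come
    from Q values; a Pavlovian impetus v (the learned average value) is added
    to G with weight omega. Returns the asymptotic probability of G."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)          # Q[0] = value of N, Q[1] = value of G
    v = 0.0                  # Pavlovian (state) value
    p_go = 0.0
    for _ in range(n_trials):
        # mix instrumental and Pavlovian advantages
        m = (1 - omega) * Q + omega * np.array([0.0, v])
        p = np.exp(beta * m) / np.exp(beta * m).sum()
        a = rng.choice(2, p=p)
        r = 1.0 if a == 0 else 0.0
        Q[a] += eps * (r - Q[a])     # instrumental learning
        v += eps * (r - v)           # Pavlovian value learning
        p_go = p[1]
    return p_go

for omega in [0.0, 0.2, 0.5]:
    print(omega, round(neg_automaintenance(omega), 3))
```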

-ve Automaintenance in Autoshaping

basic –ve automaintenance effect (μ=5)
lines are theoretical asymptotes
equilibrium probabilities of action