Behaviorist Psychology - PowerPoint Presentation

Presentation Transcript

Slide1

Behaviorist Psychology

R+, R-, P+, P- (positive and negative reinforcement, positive and negative punishment)

B. F. Skinner’s operant conditioning

Slide2

Behaviorist Psychology

R+, R-, P+, P-

B. F. Skinner’s operant conditioning / behaviorist shaping

Reward schedule

Frequency

Randomized

Slide3
Slide4

In topology, two continuous functions from one topological space to another are called homotopic if one can be “continuously deformed” into the other, such a deformation being called a homotopy between the two functions.

What does shaping mean for computational reinforcement learning? T. Erez and W. D. Smart, 2008
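For reference, the quoted definition can be written out formally; this is the standard textbook statement, not something shown on the slides:

```latex
% Continuous maps f, g : X -> Y are homotopic if f can be continuously
% deformed into g; the deformation H is the homotopy.
\[
H : X \times [0,1] \to Y \ \text{continuous}, \qquad
H(x,0) = f(x), \quad H(x,1) = g(x) \quad \text{for all } x \in X.
\]
```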

Slide5

According to Erez and Smart, there are (at least) six ways of thinking about shaping from a homotopic standpoint. What can you come up with?

Slide6

Modify reward: successive approximations, subgoaling

Dynamics: physical properties of the environment or of the agent, e.g. changing the length of the pole or the amount of noise

Internal parameters: simulated annealing, a schedule for the learning rate, complexification in NEAT

Initial state: learning from easy missions

Action space: infants lock their arms; an abstracted (simpler) action space

Extending the time horizon: POMDPs, decreasing the discount factor, the pole-balancing time horizon

Slide7

The infant development timeline and its application to robot shaping. J. Law et al., 2011

Bio-inspired vs. bio-mimicking vs. computational understanding

Slide8

Potential-based reward shaping (Slides 8-14)

Sam Devlin, http://swarmlab.unimaas.nl/ala2014/tutorials/pbrs-tut.pdf
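Concretely, potential-based shaping adds F(s, s') = γΦ(s') - Φ(s) to the environment reward, which leaves the optimal policy unchanged (Ng, Harada & Russell, 1999). Below is a minimal Python sketch of that idea for tabular Q-learning; the grid-world potential (distance to a goal at (9, 9)), the learning rate, and the discount factor are illustrative assumptions, not values from Devlin's tutorial.

```python
from collections import defaultdict

GAMMA = 0.99   # assumed discount factor
ALPHA = 0.1    # assumed learning rate

def potential(state):
    # Illustrative potential: negative Manhattan distance to an assumed
    # goal cell at (9, 9) in a grid world.
    x, y = state
    return -(abs(9 - x) + abs(9 - y))

def shaped_reward(r, s, s_next):
    # Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s),
    # added on top of the environment reward r.
    return r + GAMMA * potential(s_next) - potential(s)

Q = defaultdict(float)  # tabular action values Q[(state, action)]

def q_update(s, a, r, s_next, actions):
    # Standard Q-learning update applied to the shaped reward.
    target = shaped_reward(r, s, s_next) + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```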

Slide15

Other recent advances

Harutyunyan, A., Devlin, S., Vrancx, P., & Nowé, A. Expressing Arbitrary Reward Functions as Potential-Based Advice. AAAI-15

Brys, T., Harutyunyan, A., Taylor, M. E., & Nowé, A. Policy Transfer using Reward Shaping. AAMAS-15

Harutyunyan, A., Brys, T., Vrancx, P., & Nowé, A. Multi-Scale Reward Shaping via an Off-Policy Ensemble. AAMAS-15

Brys, T., Nowé, A., Kudenko, D., & Taylor, M. E. Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence. AAAI-14

Slide16

Learning from feedback (interactive shaping)

TAMER

Knox and Stone, K-CAP 2009

Key insight: the trainer evaluates behavior using a model of its long-term quality

Slide17

Learning from feedback (interactive shaping)

TAMER

Learn a model Ĥ of human reinforcement

Directly exploit the model to determine the action

If greedy: a_t = argmax_a Ĥ(s_t, a)

Knox and Stone, K-CAP 2009
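A minimal sketch of the TAMER idea under simplifying assumptions: a tabular model of human reinforcement, an assumed learning rate, and no credit assignment for delayed feedback (which the full framework handles). It illustrates the "learn Ĥ, act greedily on Ĥ" loop rather than reproducing Knox and Stone's implementation.

```python
from collections import defaultdict

ALPHA = 0.2             # assumed learning rate for the human-reward model
H = defaultdict(float)  # tabular estimate H[(state, action)] of human reinforcement

def choose_action(state, actions):
    # Greedy exploitation of the learned human-reward model:
    # a = argmax_a H(s, a). The human signal is treated as already
    # reflecting long-term quality, so no discounting is used.
    return max(actions, key=lambda a: H[(state, a)])

def update(state, action, human_feedback):
    # Supervised update toward the trainer's latest feedback signal.
    H[(state, action)] += ALPHA * (human_feedback - H[(state, action)])
```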

Slide18

Learning from feedback (interactive shaping)

TAMER

Or, one could try to maximize both R and H. How would you combine them?
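One simple possibility, in the spirit of later TAMER+RL work on combining the two signals, is to bias action selection with the human-reward model. This is only an illustration of one combination scheme, and the weight w below is an assumed hyperparameter that would typically be annealed as the MDP return estimates improve.

```python
W = 10.0  # assumed weight on the human-reward model

def combined_action(state, actions, Q, H):
    # Action biasing: pick argmax_a [ Q(s, a) + w * H(s, a) ], so the human
    # model steers action selection while Q still drives long-run behavior.
    return max(actions, key=lambda a: Q[(state, a)] + W * H[(state, a)])
```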

Slide19

An a priori comparison

Interface

LfD interface may be familiar to video game players; the LfF interface is simpler and task-independent

Demonstration more specifically points to the correct action

Task expertise

LfF: easier to judge than to control

Easier for the human to increase expertise while training with LfD

Cognitive load

Less for LfF

Slide20

http://www.bradknox.net/human-reward/tamer/tamer-in-action/

Slide21

Other recent work

Guangliang Li, Hayley Hung, Shimon Whiteson, and W. Bradley Knox. Using Informative Behavior to Increase Engagement in the TAMER Framework. AAMAS 2013

Slide22

Bayesian Inference Approach

TAMER and COBOT treat feedback as numeric rewards

Here, feedback is categorical

Use a Bayesian approach

Find the maximum a posteriori (MAP) estimate of the target behavior

Slide23

Goal

Human can give positive or negative feedback

Agent tries to learn policy λ*

λ* maps observations to actions

For now: think contextual bandit

Slide24

Example: Dog Training

Teach dog to sit & shake

λ*: mapping from observations to actions

Feedback: {“Bad Dog”, “Good Boy”}

“Sit” →

“Shake” →

Slide25

History in Dog Training

Feedback history h:

Observation: “sit”, Action: , Feedback: “Bad Dog”

Observation: “sit”, Action: , Feedback: “Good Boy”

…

Does it really make sense to assign numeric rewards to these?
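For concreteness, a history like this can be represented as a list of (observation, action, feedback) tuples; the action labels below are illustrative placeholders, since the slide showed the actions as pictures.

```python
# Feedback history h; "+" stands for l+, "-" for l-, "0" for l0 (neutral).
history = [
    ("sit", "lie_down", "-"),   # trainer said "Bad Dog"
    ("sit", "sit", "+"),        # trainer said "Good Boy"
    ("shake", "shake", "0"),    # no feedback
]
```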

Slide26

Bayesian Framework

Trainer desires policy λ*

h_t is the training history at time t

Find the MAP hypothesis of λ*: argmax_λ p(h_t | λ) p(λ)

p(h_t | λ) is the model of the training process

p(λ) is the prior distribution over policies

Slide27

Learning from Feedback

λ* is a mapping from observations to actions

At time t:

Agent observes o_t

Agent takes action a_t

Trainer gives feedback l_t

Slide28

l+ is positive feedback (reward)

l0 is no feedback (neutral)

l- is negative feedback (punishment)

# of positive feedbacks for action a, observation o: p_{o,a}

# of negative feedbacks for action a, observation o: n_{o,a}

# of neutral feedbacks for action a, observation o: µ_{o,a}
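To make the count notation concrete, p_{o,a}, n_{o,a}, and µ_{o,a} can be accumulated from a feedback history like the one sketched earlier; the "+"/"-"/"0" labels are the same illustrative encoding used there.

```python
from collections import defaultdict

def count_feedback(history):
    # history: iterable of (observation, action, feedback) tuples.
    pos = defaultdict(int)  # p_{o,a}: positive feedbacks per (o, a)
    neg = defaultdict(int)  # n_{o,a}: negative feedbacks per (o, a)
    neu = defaultdict(int)  # neutral feedbacks per (o, a)
    for o, a, l in history:
        if l == "+":
            pos[(o, a)] += 1
        elif l == "-":
            neg[(o, a)] += 1
        else:
            neu[(o, a)] += 1
    return pos, neg, neu
```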

Slide29

Assumed trainer behavior

Decide if the action is correct: does λ*(o) = a? The trainer makes an error with probability ε

Decide whether to give feedback: µ+ and µ- are the probabilities of neutral feedback

If the trainer thinks the action is correct, give positive feedback with probability 1 - µ+

If the trainer thinks the action is incorrect, give negative feedback with probability 1 - µ-

These probabilities could depend on the trainer

Slide30

Feedback Probabilities

Under this trainer model, the probability of feedback l_t at time t is:

p(l_t = l+ | o_t, a_t, λ*) = (1 - ε)(1 - µ+) if λ*(o_t) = a_t, and ε(1 - µ+) otherwise

p(l_t = l- | o_t, a_t, λ*) = ε(1 - µ-) if λ*(o_t) = a_t, and (1 - ε)(1 - µ-) otherwise

p(l_t = l0 | o_t, a_t, λ*) = (1 - ε)µ+ + εµ- if λ*(o_t) = a_t, and εµ+ + (1 - ε)µ- otherwise

Slide31

Fixed Parameter (Bayesian) Learning Algorithm

Assume µ+ = µ-, and that it is fixed

Neutral feedback doesn’t affect inference

Inference becomes:

Slide32

Bayesian Algorithm

Initially, assume all policies are equally probable

Maximum likelihood hypothesis = maximum a posteriori hypothesis

Given the training history up to time t, the equation on the previous slide follows from the following general statement:

Action depends only on the current observation

Error rate: between 0 and 0.5, so that it cancels out
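A minimal sketch of the fixed-parameter inference under the stated assumptions (µ+ = µ- and fixed, error rate strictly between 0 and 0.5, contextual-bandit setting). Under those assumptions the neutral counts and the exact value of ε drop out of the argmax, so the MAP policy simply picks, for each observation, the action with the largest positive-minus-negative feedback count; the count tables reuse the illustrative encoding from the earlier sketches.

```python
from collections import defaultdict

def map_policy(pos, neg, actions):
    # With mu+ = mu- fixed and 0 < epsilon < 0.5, maximizing the likelihood
    # reduces to choosing, per observation, argmax_a [ p_{o,a} - n_{o,a} ].
    observations = {o for (o, _) in list(pos) + list(neg)}
    return {o: max(actions, key=lambda a: pos[(o, a)] - neg[(o, a)])
            for o in observations}

# Example with illustrative counts:
pos = defaultdict(int, {("sit", "sit"): 2})
neg = defaultdict(int, {("sit", "lie_down"): 1})
print(map_policy(pos, neg, ["sit", "shake", "lie_down"]))  # -> {'sit': 'sit'}
```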

Slide33

Inferring Neutral

Try to learn µ+ and µ-

Don’t assume they’re equal

Many trainers don’t use punishment; neutral feedback could be punishment

Some don’t use reward; neutral feedback could be reward

Slide34

EM step

where λ_i is the i-th estimate of the maximum likelihood hypothesis

Can simplify this (eventually) to:

α has to do with the value of neutral feedback (relative to |β|)

β is negative when neutral implies punishment and positive when neutral implies reward

Slide35

User Study

Slide36

Comparisons

Sim-TAMER: numerical reward function; zero ignored; no delay assumed

Sim-COBOT: similar to Sim-TAMER, but doesn’t ignore zero rewards

Slide37

Categorical Feedback outperforms Numeric Feedback (Slides 37-38)

Slide39

Leveraging Neutral Improves Performance (Slides 39-40)

Slide41

More recent work

Sequential tasks

Learn language simultaneously

How do humans create a sequence of tasks?

Slide42

Poll: What’s next?