A Simple Reinforcement Learning Algorithm For Biped Walking

Jun Morimoto, Gordon Cheng
Department of Humanoid Robotics and Computational Neuroscience, ATR Computational Neuroscience Labs
xmorimo@atr.co.jp, gordon@atr.co.jp, http://www.cns.atr.co.jp/hrcn

Christopher G. Atkeson and Garth Zeglin
The Robotics Institute, Carnegie Mellon University
cga@cs.cmu.edu, garthz@ri.cmu.edu, http://www.ri.cmu.edu

Abstract: We propose a model-based reinforcement learning algorithm for biped walking in which the robot learns to appropriately place the swing leg. This decision is based on a learned model of the Poincare map of the periodic walking pattern. The model maps from a state at the middle of a step and a foot placement to a state at the next middle of a step.

Fig. 1. Three link robot model
Fig. 2. Five link biped robot

TABLE I
PHYSICAL PARAMETERS OF THE THREE LINK ROBOT MODEL (mass [kg] and inertia [kg m^2] of the trunk and leg)

II. ESTIMATION OF NATURAL BIPED WALKING TIMING

In order for our foot placement algorithm to place the foot at the appropriate time, we must estimate the natural biped walking period or, equivalently, frequency. This timing changes, for example, when walking down slopes. Our goal is to adapt the walking cycle timing to the dynamics of the robot and environment.

TABLE II
PHYSICAL PARAMETERS OF THE FIVE LINK ROBOT MODEL (mass [kg], length [m], and inertia [kg m^2] of the trunk, thigh, and shin)

A. Estimation method

We derive the target walking frequency from the walking period, which is measured as an actual half-cycle period (one footfall to the next):

    ω_meas_n = π / T_half_n,

since one half cycle corresponds to a phase advance of π. The update rule for the walking frequency is

    ω̂_{n+1} = ω̂_n + K_ω (ω_meas_n − ω̂_n),        (2)

where K_ω is the frequency adaptation gain and ω̂_n is the estimated frequency after n steps. An interesting feature of this method is that this simple averaging (low-pass filtering) method (Eq. 2) can estimate appropriate timing of the walking cycle for the given robot dynamics. This method was also adopted in [14], [15]. Several studies suggest that phase resetting is effective for matching the walking cycle timing to the natural dynamics of the biped [19], [24], [14], [15]. Here we propose an adaptive phase resetting method. The phase is reset when the swing leg touches the ground: at the n-th touchdown the phase variable is reset to the average touchdown phase φ̄_n, which is itself adapted as

    φ̄_{n+1} = φ̄_n + K_φ (φ_n − φ̄_n),

where φ_n is the phase observed at the n-th touchdown (before resetting), φ̄ is the average phase, and K_φ is the phase adaptation gain.

B. A simple example of timing estimation

We use the simulated three link biped robot (Fig. 1) to demonstrate the timing estimation method. A target biped walking trajectory is generated using sinusoidal functions with ω = 10 rad/s, and a simple controller is designed to follow the target trajectories for each leg:

    τ_l = k(θ^d_l(φ) − θ_l) − b θ̇_l,    τ_r = k(θ^d_r(φ) − θ_r) − b θ̇_r,

where τ_l denotes the left hip torque, τ_r denotes the right hip torque, k is a position gain, b is a velocity gain, and θ_l, θ_r are the left and right hip joint angles. The estimated phase is given by φ(t) = ω̂ t, where t is the current time. For comparison, we apply this controller to the simulated robot without using the timing estimation method, so ω is fixed and φ increases linearly with time (the walking period was fixed, with frequency ω = 10 rad/sec). The initial average phases for the right and left legs, the frequency adaptation gain K_ω, and the phase adaptation gain K_φ were set to fixed values. Starting from an initial condition with a specified body velocity, the simulated 3 link robot walked stably on one of the two test downward slopes (Fig. 3 (Top)), but it could not walk stably on the other downward slope (Fig. 3 (Bottom)). When we used the online estimate ω̂ and the adaptive phase resetting method, the robot walked stably on both test downward slopes (Fig. 4 (Top) and (Bottom)). In Figure 5, we show the estimated walking frequency.

Fig. 3. Biped walking pattern without timing adaptation: (Top) downward slope, (Bottom) downward slope
Fig. 4. Biped walking pattern with timing adaptation: (Top) downward slope, (Bottom) downward slope
Fig. 5. Estimated walking frequency [rad/s] at each step
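The adaptation rules above lend themselves to a compact implementation. The following Python snippet is a minimal sketch, not the authors' code: the class and gain names (WalkingTimingEstimator, k_omega, k_phi) and the default values are assumptions, it tracks a single average touchdown phase rather than separate phases for the left and right legs, and it assumes the low-pass-filter form of Eq. 2.

```python
import math

class WalkingTimingEstimator:
    """Minimal sketch of the walking-frequency and phase-reset adaptation
    of Section II-A (single-leg version; gains and initial values assumed)."""

    def __init__(self, omega_init=10.0, phase_bar_init=0.0,
                 k_omega=0.1, k_phi=0.1):
        self.omega_hat = omega_init      # estimated walking frequency [rad/s]
        self.phase_bar = phase_bar_init  # average phase at touchdown [rad]
        self.k_omega = k_omega           # frequency adaptation gain (assumed value)
        self.k_phi = k_phi               # phase adaptation gain (assumed value)
        self.phase = phase_bar_init      # current phase estimate [rad]

    def advance(self, dt):
        """Integrate the phase between touchdowns using the current estimate."""
        self.phase += self.omega_hat * dt
        return self.phase

    def on_touchdown(self, half_cycle_duration):
        """Called when the swing leg touches the ground.

        half_cycle_duration: measured time from the previous footfall to this one.
        """
        # One half cycle corresponds to a phase advance of pi.
        omega_meas = math.pi / half_cycle_duration
        # Eq. 2: low-pass filter the measured frequency.
        self.omega_hat += self.k_omega * (omega_meas - self.omega_hat)
        # Adapt the average touchdown phase, then reset the phase to it.
        touchdown_phase = self.phase % (2.0 * math.pi)
        self.phase_bar += self.k_phi * (touchdown_phase - self.phase_bar)
        self.phase = self.phase_bar
```

In use, advance() would be called every control cycle to drive the sinusoidal target trajectory, and on_touchdown() once per footfall.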
III. MODEL-BASED REINFORCEMENT LEARNING FOR BIPED LOCOMOTION

To walk stably we need to control the placement as well as the timing of the next step. Here, we propose a learning method to acquire a stabilizing controller.

A. Model-based reinforcement learning

We use a model-based reinforcement learning framework [4], [17]. Reinforcement learning requires a source of reward. We learn a Poincare map of the effect of foot placement, and then learn a corresponding value function for states at phases φ = π/2 and φ = 3π/2 (Fig. 6), where we define φ = 0 as the right foot touchdown.

1) Learning the Poincare map of biped walking: We learn a model that predicts the state of the biped a half cycle ahead, based on the current state and the foot placement at touchdown. We are predicting the location of the system in a Poincare section at phase φ = 3π/2 based on the system's location in a Poincare section at phase φ = π/2. We use the same model to predict the location at phase φ = π/2 based on the location at phase φ = 3π/2 (Fig. 6). Because the state of the robot changes drastically at foot touchdown (φ = 0, π), we select the phases φ = π/2 and φ = 3π/2 as Poincare sections. We approximate this Poincare map using a function approximator with a parameter vector w_m:

    x_{3π/2} = f̂(x_{π/2}, u_{π/2}; w_m),

where the input state is x = (d, ḋ); d denotes the horizontal distance between the stance foot position and the body position (Fig. 7). Here, we use the hip position as the body position because the center of mass is almost at the same position as the hip (Fig. 2). The action of the robot, u = θ_act, is the target knee joint angle of the swing leg, which determines the foot placement (Fig. 7).

2) Representation of biped walking trajectories and the low-level controller: One cycle of biped walking is represented by four via-points for each joint (Fig. 6). The output of the current policy, θ_act, is used to specify via-points (Table III). We interpolate trajectories between target postures using the minimum jerk criterion [6], [21], except for pushing off at the stance knee joint. For pushing off at the stance knee, we instantaneously change the desired joint angle to a fixed target to deliver a push off that accelerates the motion. Zero desired velocity and acceleration are specified at each via-point. To follow the generated target trajectories, the torque output at each joint is given by a PD servo controller:

    τ_j = k(θ^d_j(φ) − θ_j) − b θ̇_j,

where θ^d_j is the target joint angle for the j-th joint; the position gain k is set to a common value except for the knee joint of the stance leg (a different gain was used there), and the velocity gain b is set to a fixed value. Table III shows the target postures.

3) Rewards: The robot receives a reward if it successfully continues walking and a punishment (negative reward) if it falls down. On each transition from phase π/2 (or 3π/2) to phase 3π/2 (or π/2 of the next cycle), the robot receives a reward of 0.1 if the height of the body has remained above 0.35 m during the past half cycle. If the height of the body goes below 0.35 m, the robot is given a negative reward (-1) and the trial is terminated.

TABLE III
TARGET POSTURES (RIGHT HIP, RIGHT KNEE, LEFT HIP, LEFT KNEE) AT EACH VIA-POINT PHASE; θ_act IS PROVIDED BY THE OUTPUT OF THE CURRENT POLICY. THE UNIT FOR NUMBERS IN THIS TABLE IS DEGREES.

Fig. 6. Biped walking trajectory using four via-points: we update parameters and select actions at the Poincare sections at phase π/2 and 3π/2. L: left leg, R: right leg

4) Learning the value function: In a reinforcement learning framework, the learner tries to create a controller which maximizes the expected total return. Here, we define the value function for the policy μ:

    V^μ(x(t)) = E[ r(t+1) + γ r(t+2) + γ^2 r(t+3) + ... ],        (9)

where r(t) is the reward at time t and γ is the discount factor. In this framework, we evaluate the value function only at φ(t) = π/2 and φ(t) = 3π/2. Thus, we consider our learning framework to be model-based reinforcement learning for a semi-Markov decision process (SMDP) [18]. We use a function approximator with a parameter vector w_v to estimate the value function V̂(x; w_v). By considering the deviation from equation (9), we can define the temporal difference error (TD error) [17], [18]:

    δ(t) = r(t+1) + γ V̂(x(t+1)) − V̂(x(t)),

where t and t+1 denote successive times at which φ = π/2 or φ = 3π/2. The update rule for the value function can be derived as

    V̂(x(t)) ← V̂(x(t)) + α δ(t),

where α is a learning rate. The parameter vector w_v is updated by equation (19). We followed the definition of the value function in [17].

Fig. 7. (Left) Input state, (Right) Output of the controller
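To make the half-cycle value update concrete, here is a minimal sketch under stated assumptions. It stands in for the paper's RFWR approximator with a simple normalized Gaussian basis; the class name ValueFunction and the values of alpha and gamma are placeholders, not the authors' settings.

```python
import numpy as np

class ValueFunction:
    """Sketch of V_hat(x; w_v), evaluated only at the Poincare sections
    (phi = pi/2 and 3*pi/2). Normalized Gaussian features stand in for the
    RFWR approximator used in the paper."""

    def __init__(self, centers, widths, alpha=0.2, gamma=0.95):
        self.centers = np.asarray(centers)   # basis centers over x = (d, d_dot)
        self.widths = np.asarray(widths)     # basis widths (assumed diagonal metric)
        self.w_v = np.zeros(len(self.centers))
        self.alpha = alpha                   # learning rate (assumed value)
        self.gamma = gamma                   # discount factor (assumed value)

    def features(self, x):
        diff = (np.asarray(x) - self.centers) / self.widths
        act = np.exp(-0.5 * np.sum(diff ** 2, axis=1))
        return act / (np.sum(act) + 1e-12)

    def value(self, x):
        return float(self.features(x) @ self.w_v)

    def td_update(self, x, reward, x_next, terminal=False):
        """One TD update per half cycle: x and x_next are states at successive
        Poincare sections; reward is +0.1 for staying up, -1 for falling."""
        v_next = 0.0 if terminal else self.value(x_next)
        delta = reward + self.gamma * v_next - self.value(x)
        self.w_v += self.alpha * delta * self.features(x)
        return delta
```

td_update() would be called once per half cycle with the states observed at successive Poincare sections and the +0.1 / -1 reward defined above.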
5) Learning a policy for biped locomotion: We use a stochastic policy to generate exploratory actions. The policy is represented by a probabilistic model:

    π(u(t) | x(t)) = (1 / sqrt(2π σ^2)) exp( −(u(t) − A(x(t); w_a))^2 / (2σ^2) ),

where A(x(t); w_a) denotes the mean of the model, which is represented by a function approximator with parameter vector w_a. We change the variance σ^2 according to the trial number, so that exploration shrinks as the number of trials grows. The output of the policy is thus drawn from a normal distribution with mean A(x(t); w_a) and variance σ^2.

We derive the update rule for the policy by using the value function and the estimated Poincare map:
1) Derive the gradient of the value function ∂V/∂x at the current state.
2) Derive the gradient of the dynamics model ∂f̂/∂u at the previous state and the nominal action.
3) Update the policy:

    A(x; w_a) ← A(x; w_a) + β (∂V/∂x)(∂f̂/∂u),

where β is the learning rate. The parameter vector w_a is updated by equation (19). We can consider the output u as an option in the SMDP [18], initiated in state x(t) at the time when φ(t) = π/2 (or 3π/2), and terminating at the time when φ = 3π/2 (or π/2 of the next cycle).

6) Function approximator: We used Receptive Field Weighted Regression (RFWR) [16] as the function approximator for the policy, the value function, and the estimated dynamics model. Here, we approximate the target function as

    ŷ(x) = Σ_k a_k(x) w_k^T x̃_k / Σ_k a_k(x),    a_k(x) = exp( −(1/2)(x − c_k)^T D_k (x − c_k) ),

where c_k is the center of the k-th basis function, D_k is the distance metric of the k-th basis function, K is the number of basis functions, and x̃_k = ((x − c_k)^T, 1)^T is the augmented state. The update rule for the parameter w_k is given by:

    w_k ← w_k + a_k P_k x̃_k e_k,
    P_k ← (1/λ) [ P_k − (P_k x̃_k x̃_k^T P_k) / (λ / a_k + x̃_k^T P_k x̃_k) ],        (19)

where λ is the forgetting factor and e_k is the prediction error of the k-th local model. We align basis functions at even intervals in each dimension of the input space (Fig. 7); a fixed grid of basis functions is used for approximating the policy and the value function. We also align 20 basis functions at even intervals in the output space of θ_act (Fig. 7), and we used 8000 basis functions for approximating the Poincare map. We set the distance metric D_k to a fixed diagonal matrix for the policy and the value function, and to another fixed diagonal matrix for the Poincare map. The centers of the basis functions and the distance metrics of the basis functions are fixed during learning.
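The three-step update can be sketched compactly. The snippet below is an illustration under stated assumptions, not the authors' implementation: the policy mean is taken to be linear in a feature vector (the paper uses RFWR), the action is the scalar swing-knee target, and the names PolicyMean, beta, and sigma are placeholders; value_grad_x and model_grad_u stand for the gradients obtained from the learned value function and Poincare map.

```python
import numpy as np

class PolicyMean:
    """Sketch of the stochastic policy u ~ N(A(x; w_a), sigma^2) and the
    model-based update A <- A + beta * (dV/dx) * (df/du) from Section III-A-5."""

    def __init__(self, features, n_features, beta=0.1, sigma=0.1):
        self.features = features          # callable: x -> feature vector
        self.w_a = np.zeros(n_features)   # policy parameters
        self.beta = beta                  # policy learning rate (assumed value)
        self.sigma = sigma                # exploration std. dev. (shrinks over trials)

    def mean(self, x):
        return float(self.features(x) @ self.w_a)

    def sample_action(self, x, rng):
        # Exploratory action: Gaussian around the current mean.
        return self.mean(x) + self.sigma * rng.standard_normal()

    def update(self, x_prev, value_grad_x, model_grad_u):
        """One update at a Poincare section.

        value_grad_x: dV/dx evaluated at the current section state (step 1).
        model_grad_u: df/du of the learned Poincare map at the previous state
                      and nominal action (step 2).
        """
        # Step 3: move the mean action at x_prev in the direction that
        # increases the predicted value of the next section state.
        direction = float(np.dot(value_grad_x, model_grad_u))
        self.w_a += self.beta * direction * self.features(x_prev)
```

Because the gradient comes from the learned Poincare map rather than from sampled returns alone, each half-cycle transition can update the policy mean directly.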
IV. RESULTS

A. Learning foot placement

We applied the proposed method to the 5 link simulated robot (Fig. 2). We used a manually generated initial step to get the pattern started, and we set the walking period (equivalently, the target frequency in rad/sec) to a fixed value. A trial terminated after 50 steps or after the robot fell down. Figure 8 (Top) shows the walking pattern before learning, and Figure 8 (Middle) shows the walking pattern after 30 trials. Target knee joint angles for the swing leg were varied because of the exploratory behavior (Fig. 8 (Middle)). Figure 10 shows the accumulated reward at each trial. We defined a successful trial as one in which the robot achieved 50 steps. A stable biped walking controller was acquired after 80 trials (averaged over 5 experiments). The shape of the value function is shown in Fig. 11; the maximum of the value function is located at positive d and negative ḋ. Figure 9 shows joint angle trajectories of stable biped walking after learning. Note that the robot added energy to its initially slow walk by choosing θ_act appropriately, which affects both the foot placement and the subsequent push off. The acquired walking pattern is shown in Fig. 8 (Bottom).

Fig. 8. Acquired biped walking pattern: (Top) before learning, (Middle) after 30 trials, (Bottom) after learning
Fig. 9. Joint angle trajectories after learning: left and right hip and knee angles [deg] versus time [sec], with target trajectories
Fig. 10. Accumulated reward at each trial: results of five experiments
Fig. 11. Shape of acquired value function

B. Estimation of biped walking period

The estimated phase of the cycle plays an important role in this controller: it is essential that the controller phase matches the dynamics of the mechanical system. We applied the timing estimation method described in Section II-A to the learned biped controller. The initial average phases for the left and right legs, the frequency adaptation gain, and the phase adaptation gain were again set to fixed values.

We evaluated the combined method on a downward slope. The simulated robot with the controller acquired in the previous section could not walk stably on the downward slope (Fig. 12 (Top)). However, when we used the online estimate of the walking period and the adaptive phase resetting method together with the learned controller, the robot walked stably on the downward slope (Fig. 12 (Bottom)).

Fig. 12. Biped walking pattern with timing adaptation on a downward slope: (Top) without timing adaptation, (Bottom) with timing adaptation

C. Stability analysis of the acquired policy

We analyzed the stability of the acquired policy in terms of the Poincare map, mapping from the Poincare section at phase π/2 to the section at phase 3π/2 (Fig. 6). We estimated the Jacobian matrix of the Poincare map at the Poincare sections and checked whether |λ_i| < 1 or not, where λ_i are its eigenvalues [3], [7]. Because we used differentiable functions as function approximators, we can estimate the Jacobian matrix of the closed-loop map based on

    dx_{3π/2} / dx_{π/2} = ∂f̂/∂x + (∂f̂/∂u)(∂u(x)/∂x).

Figure 13 shows the average eigenvalues at each trial. The eigenvalues decreased as the learning proceeded, and the walking cycle became stable, i.e., |λ_i| < 1.

Fig. 13. Averaged eigenvalues |λ_1|, |λ_2| of the Jacobian matrix at each trial
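As a concrete illustration of this check, the sketch below evaluates the closed-loop Jacobian from the formula above and tests the eigenvalue criterion. It assumes the gradients of the learned Poincare map and of the policy mean are available as callables (df_dx, df_du, du_dx are hypothetical names; with RFWR these gradients exist in closed form), and it treats the action as the scalar swing-knee target.

```python
import numpy as np

def poincare_jacobian(df_dx, df_du, du_dx, x_star):
    """Closed-loop Jacobian of the learned Poincare map under the policy,
    d x_next / d x = df/dx + (df/du)(du/dx), evaluated at x_star.
    df_dx, df_du, du_dx are hypothetical callables returning the partial
    derivatives of the learned model f_hat and of the policy mean."""
    A = np.asarray(df_dx(x_star))                  # shape (n, n)
    B = np.asarray(df_du(x_star)).reshape(-1, 1)   # shape (n, 1): scalar action u
    C = np.asarray(du_dx(x_star)).reshape(1, -1)   # shape (1, n)
    return A + B @ C

def is_locally_stable(df_dx, df_du, du_dx, x_star):
    """Check the stability criterion used in the paper: all |lambda_i| < 1."""
    J = poincare_jacobian(df_dx, df_du, du_dx, x_star)
    eig_magnitudes = np.abs(np.linalg.eigvals(J))
    return bool(np.all(eig_magnitudes < 1.0)), eig_magnitudes
```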
V. DISCUSSION

In this study, we used the swing leg knee angle θ_act to decide foot placement because the lower leg has smaller mass and tracking the target joint angle at the knee is easier than using the hip joint. However, using the hip joints, or using different variables as the output of the policy, are interesting topics for future work. We are also considering using captured data of a human walking pattern [23] as a nominal trajectory instead of using a hand-designed walking pattern. We are currently applying the proposed approach to the physical biped robot.

In previous work, we proposed a trajectory optimization method for biped locomotion [10], [11] based on differential dynamic programming [5], [9]. We are considering combining this trajectory optimization method with the proposed reinforcement learning method.

ACKNOWLEDGMENT

We would like to thank Mitsuo Kawato, Jun Nakanishi, and Gen Endo at ATR Computational Neuroscience Laboratories, Japan, and Seiichi Miyakoshi of the Digital Human Research Center, AIST, Japan, for helpful discussions. Atkeson is partially supported by NSF award ECS-0325383.

REFERENCES

[1] H. Benbrahim and J. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22:283-302, 1997.
[2] C. Chew and G. A. Pratt. Dynamic bipedal walking assisted by learning. Robotica, 20:477-491, 2002.
[3] R. Q. van der Linde. Passive bipedal walking with phasic muscle contraction. Biological Cybernetics, 82:227-237, 1999.
[4] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.
[5] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.
[6] T. Flash and N. Hogan. The coordination of arm movements: An experimentally confirmed mathematical model. The Journal of Neuroscience, 5:1688-1703, 1985.
[7] M. Garcia, A. Chatterjee, A. Ruina, and M. J. Coleman. The simplest walking model: stability, complexity, and scaling. ASME Journal of Biomechanical Engineering, 120(2):281-288, 1998.
[8] K. Hirai, M. Hirose, and T. Takenaka. The development of Honda humanoid robot. In Proceedings of the 1998 IEEE International Conference on Robotics and Automation, pages 160-165, 1998.
[9] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.
[10] J. Morimoto and C. G. Atkeson. Robust low-torque biped walking using differential dynamic programming with a minimax criterion. In P. Bidaud and F. Ben Amar, editors, Proceedings of the 5th International Conference on Climbing and Walking Robots, pages 453-459. Professional Engineering Publishing, Bury St Edmunds and London, UK, 2002.
[11] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1563-1570. MIT Press, Cambridge, MA, 2003.
[12] K. Nagasaka, M. Inaba, and H. Inoue. Stabilization of dynamic walk on a humanoid using torso position compliance control. In Proceedings of the 17th Annual Conference of the Robotics Society of Japan, pages 1193-1194, 1999.
[13] Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for biped robot. In Proceedings of the 2nd International Symposium on Adaptive Motion of Animals and Machines, page ThP-II-5, 2003.
[14] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion with dynamical movement primitives. In Workshop on Robot Programming by Demonstration, IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 2003.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems (to appear), 2004.
[16] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047-2084, 1998.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[18] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
[19] K. Tsuchiya, S. Aoi, and K. Tsujita. Locomotion control of a biped locomotion robot using nonlinear oscillators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1745-1750, Las Vegas, NV, USA, 2003.
[20] M. Vukobratovic, B. Borovac, D. Surla, and D. Stokic. Biped Locomotion: Dynamics, Stability, Control and Applications. Springer-Verlag, Berlin, 1990.
[21] Y. Wada and M. Kawato. A theory for cursive handwriting based on the minimization principle. Biological Cybernetics, 73:3-15, 1995.
[22] J. Yamaguchi, A. Takanishi, and I. Kato. Development of a biped walking robot compensating for three-axis moment by trunk motion. Journal of the Robotics Society of Japan, 11(4):581-586, 1993.
[23] K. Yamane and Y. Nakamura. Dynamics filter: concept and implementation of on-line motion generator for human figures. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, pages 688-693, 2000.
[24] T. Yamasaki, T. Nomura, and S. Sato. Possible functional roles of phase resetting during walking. Biological Cybernetics, 88(6):468-496, 2003.