A Simple Reinforcement Learning Algorithm For Biped Walking

Jun Morimoto, Gordon Cheng
Department of Humanoid Robotics and Computational Neuroscience, ATR Computational Neuroscience Labs
xmorimo@atr.co.jp, gordon@atr.co.jp, http://www.cns.atr.co.jp/hrcn

Christopher G. Atkeson and Garth Zeglin
The Robotics Institute, Carnegie Mellon University
cga@cs.cmu.edu, garthz@ri.cmu.edu, http://www.ri.cmu.edu

Abstract — We propose a model-based reinforcement learning algorithm for biped walking in which the robot learns to appropriately place the swing leg. This decision is based on a learned model of the Poincare map of the periodic walking pattern. The model maps from a state at the middle of a step and a foot placement to a state at the middle of the next step.



Proceedings of the 2004 IEEE

Fig. 1. Three link robot model.

Fig. 2. Five link biped robot.

TABLE I. Physical parameters of the three link robot model (mass and inertia of the trunk and legs).

II. ESTIMATION OF NATURAL BIPED WALKING TIMING

In order for our foot placement algorithm to place the foot at the appropriate time, we must estimate the natural biped walking period, or equivalently, frequency. This timing changes, for example, when walking down slopes. Our goal is to adapt the walking cycle timing to the dynamics of the robot and environment.
TABLE II. Physical parameters of the five link robot model (mass, length, and inertia of the trunk, thighs, and shins).

A. Estimation method

We derive the target walking frequency from the walking frequency measured over an actual half-cycle period (one footfall to the next):

    ω_meas = π / T_half,    (1)

where T_half is the measured half-cycle period. The update rule for the walking frequency is

    ω̂_{n+1} = ω̂_n + K (ω_meas − ω̂_n),    (2)

where K is the frequency adaptation gain and ω̂_n is the estimated frequency after n steps. An interesting feature of this method is that this simple averaging (low-pass filtering) method (Eq. 2) can estimate the appropriate timing of the walking cycle for the given robot dynamics. This method was also adopted in [14], [15].

Several studies suggest that phase resetting is effective for matching the walking cycle timing to the natural dynamics of the biped [19], [24], [14], [15]. Here we propose an adaptive phase resetting method. The phase is reset when the swing leg touches the ground: at touchdown, the phase is reset toward the average touchdown phase, which is itself adapted with a phase adaptation gain.

B. A simple example of timing estimation

We use the simulated three link biped robot (Fig. 1) to demonstrate the timing estimation method. A target biped walking trajectory is generated using sinusoidal functions, and a simple PD controller with a position gain and a velocity gain is designed to produce the left and right hip torques that follow the target trajectories for each leg. The estimated phase is given by φ(t) = ω̂ t, where t is the current time.

For comparison, we apply this controller to the simulated robot without using the timing estimation method, so ω̂ is fixed and the phase increases linearly with time (the walking period was fixed so that the frequency was ω = 10 rad/sec). The initial average phase was set separately for the right and left legs, and the frequency adaptation gain and phase adaptation gain were set to fixed values.

With an initial condition that included a small forward body velocity, the simulated 3 link robot walked stably on one of the test downward slopes (Fig. 3 (Top)). However, the robot could not walk stably on the other downward slope (Fig. 3 (Bottom)). When we used the online estimate of ω̂ and the adaptive phase resetting method, the robot walked stably on both test downward slopes (Fig. 4 (Top) and Fig. 4 (Bottom)). In Figure 5, we show the estimated walking frequency.
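The frequency estimator in Eqs. (1)-(2) can be sketched in a few lines; this is a minimal reading of the method, not the authors' code, and the gain and frequencies below are illustrative values, not the paper's settings.

```python
import math

def measured_frequency(t_half: float) -> float:
    """One footfall to the next spans half a cycle (phase pi), Eq. (1)."""
    return math.pi / t_half

def update_frequency(omega_hat: float, omega_meas: float, K: float) -> float:
    """Simple averaging (low-pass filtering) of the estimate, Eq. (2)."""
    return omega_hat + K * (omega_meas - omega_hat)

# Example: the estimate converges toward the actual stepping frequency.
omega_hat = 10.0               # initial estimate [rad/s]
true_t_half = math.pi / 8.0    # the robot actually steps at 8 rad/s
for _ in range(50):
    omega_hat = update_frequency(omega_hat,
                                 measured_frequency(true_t_half), K=0.2)
```

With gain K = 0.2, fifty footfalls are more than enough for the estimate to settle on the robot's actual frequency, which matches the convergence behavior shown in Fig. 5.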
Fig. 3. Biped walking pattern without timing adaptation: (Top) downward slope, (Bottom) downward slope.

Fig. 4. Biped walking pattern with timing adaptation: (Top) downward slope, (Bottom) downward slope.

Fig. 5. Estimated walking frequency [rad/s] over steps.

III. MODEL-BASED REINFORCEMENT LEARNING FOR BIPED LOCOMOTION

To walk stably, we need to control the placement as well as the timing of the next step. Here, we propose a learning method to acquire a stabilizing controller.

A. Model-based reinforcement learning

We use a model-based reinforcement learning framework [4], [17]. Reinforcement learning requires a source of reward. We learn a Poincare map of the effect of foot placement, and then learn a corresponding value function for states at phases φ = π/2 and φ = 3π/2 (Fig. 6), where we define φ = 0 as the right foot touchdown.

1) Learning the Poincare map of biped walking: We learn a model that predicts the state of the biped a half cycle ahead, based on the current state and the foot placement at touchdown. We are predicting the location of the system in a Poincare section at phase φ = 3π/2 based on the system's location in a Poincare section at phase φ = π/2. We use the same model to predict the location at phase φ = π/2 based on the location at phase φ = 3π/2 (Fig. 6). Because the state of the robot changes drastically at foot touchdown (φ = 0, π), we select the phases π/2 and 3π/2 as the Poincare sections. We approximate this Poincare map using a function approximator with a parameter vector:

    x_{3π/2} = f(x_{π/2}, u_{π/2}),
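The half-cycle map x' = f(x, u) above is learned from observed transitions. As a sketch of that idea (the paper uses RFWR as the function approximator; here a plain linear least-squares fit stands in, and the "true" map below is a made-up example for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" half-cycle dynamics for illustration: x' = A x + B u
A = np.array([[0.8, 0.1], [0.0, 0.9]])
B = np.array([0.2, 0.5])

# Collect (state, action, next-state) transitions at the Poincare sections.
X, U, Xn = [], [], []
for _ in range(200):
    x = rng.normal(size=2)    # state at phase pi/2
    u = rng.normal()          # foot-placement action
    X.append(x); U.append(u); Xn.append(A @ x + B * u)

# Regress the next section state on the augmented input (x, u, 1).
Z = np.column_stack([X, U, np.ones(len(U))])
W, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)

def f_hat(x, u):
    """Learned Poincare map: predicts the state a half cycle ahead."""
    return np.concatenate([x, [u, 1.0]]) @ W
```

With noiseless linear data the fit recovers the map exactly; with a real robot, the same regression target (state at the next section) would be fit by a local model such as RFWR.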
where the input state is defined as x = (d, ḋ), and d denotes the horizontal distance between the stance foot position and the body position (Fig. 7). Here, we use the hip position as the body position because the center of mass is at almost the same position as the hip (Fig. 2). The action of the robot, θ_act, is the target knee joint angle of the swing leg, which determines the foot placement (Fig. 7).

2) Representation of biped walking trajectories and the low-level controller: One cycle of biped walking is represented by four via-points for each joint (Fig. 6). The output of the current policy, θ_act, is used to specify via-points (Table III). We interpolate trajectories between target postures using the minimum jerk criterion [6], [21], except for pushing off at the stance knee joint. For pushing off at the stance knee, we instantaneously change the desired joint angle to a fixed target to deliver a push-off that accelerates the motion. Zero desired velocity and acceleration are specified at each via-point. To follow the generated target trajectories, the torque output at each joint is given by a PD servo controller:

    τ_j = k (θ_j^d − θ_j) + b (θ̇_j^d − θ̇_j),

where θ_j^d is the target joint angle for the j-th joint, k is the position gain (a different value is used for the knee joint of the stance leg), and b is the velocity gain. Table III shows the target postures.

3) Rewards: The robot gets a reward if it successfully continues walking and a punishment (negative reward) if it falls down. On each transition from phase π/2 (or 3π/2) to phase 3π/2 (or π/2), the robot gets a reward of 0.1 if the height of the body remains above 0.35 m during the past half cycle. If the height of the body goes below 0.35 m, the robot is given a negative reward (−1) and the trial is terminated.

TABLE III. Target postures (right hip, right knee, left hip, left knee) at each phase; θ_act is provided by the output of the current policy. The unit for numbers in this table is degrees.

Fig. 6. Biped walking trajectory using four via-points: we update parameters and select actions at the Poincare sections at phases π/2 and 3π/2. L: left leg, R: right leg.
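The low-level controller above can be sketched directly. With zero velocity and acceleration specified at both via-points, the minimum-jerk interpolation between two postures reduces to the standard quintic profile 10s³ − 15s⁴ + 6s⁵; the gains in the PD servo are left as parameters since the paper's numeric values are not recoverable here.

```python
def min_jerk(theta0: float, theta1: float, t: float, T: float) -> float:
    """Minimum-jerk interpolation between two via-points over duration T,
    with zero velocity and acceleration at both endpoints."""
    s = t / T
    return theta0 + (theta1 - theta0) * (10 * s**3 - 15 * s**4 + 6 * s**5)

def pd_torque(theta_d: float, theta: float,
              dtheta_d: float, dtheta: float,
              k: float, b: float) -> float:
    """PD servo: tau = k*(theta_d - theta) + b*(dtheta_d - dtheta)."""
    return k * (theta_d - theta) + b * (dtheta_d - dtheta)
```

The quintic passes exactly through both postures and, by symmetry, is at the midpoint angle at the midpoint time, so the PD servo only has to correct tracking error around a smooth reference.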
4) Learning the value function: In a reinforcement learning framework, the learner tries to create a controller which maximizes the expected total return. Here, we define the value function for the policy:

    V(x(t)) = E[ r(t+1) + γ r(t+2) + γ² r(t+3) + ··· ],    (9)

where r(t) is the reward at time t and γ is the discount factor. In this framework, we evaluate the value function only at φ(t) = π/2 and φ(t) = 3π/2. Thus, we consider our learning framework as model-based reinforcement learning for a semi-Markov decision process (SMDP) [18]. We use a function approximator with a parameter vector to estimate the value function, V̂(x). By considering the deviation from equation (9), we can define the temporal difference error (TD-error) [17], [18]:

    δ(t) = r(t + T) + γ V̂(x(t + T)) − V̂(x(t)),

where t is a time when φ(t) = π/2 or 3π/2 and t + T is the time of the next Poincare section crossing. The update rule for the value function can be derived as

    V̂(x(t)) ← V̂(x(t)) + α δ(t),

where α is a learning rate. The parameter vector is updated by equation (19). We followed the definition of the value function in [17].

Fig. 7. (Left) Input state; (Right) Output of the controller.

5) Learning a policy for biped locomotion: We use a stochastic policy to generate exploratory actions. The policy is represented by a probabilistic model:

    π(x(t), u(t)) = (1 / √(2πσ²)) exp( −(u(t) − A(x(t)))² / (2σ²) ),

where A(x) denotes the mean of the model, which is represented by a function approximator with a parameter vector. We decrease the variance σ² as the number of trials grows. The output of the policy is thus drawn from a normal distribution with mean A(x) and variance σ².

We derive the update rule for the policy by using the value function and the estimated Poincare map:
1) Derive the gradient of the value function ∂V/∂x at the current state.
2) Derive the gradient of the dynamics model ∂f/∂u at the previous state and the nominal action.
3) Update the policy mean:

    A(x) ← A(x) + β (∂V/∂x)(∂f/∂u),

where β is the learning rate. The parameter vector is updated by equation (19). We can consider the output u as an action in the SMDP [18], initiated in state x at the time when φ = π/2 (or 3π/2) and terminating at the time when φ = 3π/2 (or π/2).

6) Function approximator: We use Receptive Field Weighted Regression (RFWR) [16] as the function approximator for the policy, the value function, and the estimated dynamics model. We approximate the target function as

    ŷ(x) = Σ_k a_k(x) h_k(x) / Σ_k a_k(x),
    a_k(x) = exp( −(1/2)(x − c_k)ᵀ D_k (x − c_k) ),
    h_k(x) = w_kᵀ x̃,

where c_k is the center of the k-th basis function, D_k is the distance metric of the k-th basis function, K is the number of basis functions, and x̃ = (xᵀ, 1)ᵀ is the augmented state. The update rule for the parameter w_k is given by

    w_k ← w_k + a_k P_k x̃ (y − w_kᵀ x̃),    (19)
    P_k ← (1/λ) [ P_k − (P_k x̃ x̃ᵀ P_k) / (λ/a_k + x̃ᵀ P_k x̃) ],

where λ is the forgetting factor.
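The normalized-Gaussian prediction at the heart of RFWR can be sketched as follows; this shows only the prediction step (not the recursive update of Eq. 19), in one input dimension, with an illustrative grid, distance metric, and hand-set local weights rather than learned ones.

```python
import numpy as np

centers = np.linspace(-1.0, 1.0, 11)   # basis centers on an even grid
D = 25.0                               # fixed distance metric (inverse sq. width)
# Local linear models h_k(x) = w_k . (x, 1); weights here are illustrative.
W = np.stack([centers, np.ones_like(centers)], axis=1)

def predict(x: float) -> float:
    """Blend local linear models by normalized Gaussian activations."""
    a = np.exp(-0.5 * D * (x - centers) ** 2)   # activations a_k(x)
    h = W @ np.array([x, 1.0])                  # local model outputs h_k(x)
    return float(a @ h / a.sum())               # normalized weighted blend
```

Because the activations are normalized, the prediction is a convex combination of the local models near x, which is what lets the paper keep the basis centers and distance metrics fixed while only the local weights adapt.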
We align the basis functions at even intervals in each dimension of the input space (Fig. 7), and use an even grid of basis functions for approximating the policy and the value function. We also align 20 basis functions at even intervals in the output space (Fig. 7). We used 8000 basis functions for approximating the Poincare map. We set the distance metrics to fixed diagonal matrices for the policy, the value function, and the Poincare map. The centers of the basis functions and the distance metrics of the basis functions are fixed during learning.

IV. RESULTS

A. Learning foot placement

We applied the proposed method to the 5 link simulated robot (Fig. 2). We used a manually generated initial step to get the pattern started, and set the walking period and corresponding frequency to fixed values. A trial terminated after 50 steps or after the robot fell down. Figure 8 (Top) shows the walking pattern before learning, and Figure 8 (Middle) shows the walking pattern after 30 trials. Target knee joint angles for the swing leg varied because of the exploratory behavior (Fig. 8 (Middle)). Figure 10 shows the accumulated reward at each trial. We defined a trial as successful when the robot achieved 50 steps. A stable biped walking controller was acquired after 80 trials (averaged over 5 experiments). The shape of the value function is shown in Fig. 11; the maximum of the value function is located at positive ḋ and negative d. Figure 9 shows the joint angle trajectories of stable biped walking after learning. Note that the robot added energy to its initially slow walk by choosing θ_act appropriately, which affects both the foot placement and the subsequent push-off. The acquired walking pattern is shown in Fig. 8 (Bottom).

B. Estimation of biped walking period

The estimated phase of the cycle plays an important role in this controller. It is essential that the controller phase matches the dynamics of the mechanical system. We applied the timing estimation method described in section II-A to the learned biped controller. The initial average phase was set separately for the left and right legs, and the frequency adaptation gain and phase adaptation gain were set to fixed values.
Fig. 8. Acquired biped walking pattern: (Top) before learning, (Middle) after 30 trials, (Bottom) after learning.

We evaluated the combined method on a downward slope. The simulated robot with the controller acquired in the previous section could not walk stably on the downward slope (Fig. 12 (Top)). However, when we used the online estimate of the walking period and the adaptive phase resetting method together with the learned controller, the robot walked stably on the downward slope (Fig. 12 (Bottom)).

C. Stability analysis of the acquired policy

We analyzed the stability of the acquired policy in terms of the Poincare map, mapping from the Poincare section at phase π/2 to the section at phase 3π/2 (Fig. 6). We estimated the Jacobian matrix of the Poincare map at the Poincare sections, and checked whether or not |λ_i| < 1, where λ_i are the eigenvalues of the Jacobian [3], [7]. Because we used differentiable functions as function approximators, we can estimate the Jacobian matrix based on

    dx'/dx = ∂f/∂x + (∂f/∂u)(∂u/∂x).

Figure 13 shows the average eigenvalues at each trial. The eigenvalues decreased as the learning proceeded, and the walk became stable, i.e., |λ_i| < 1.

V. DISCUSSION

In this study, we used the swing leg knee angle θ_act to decide foot placement because the lower leg has smaller mass, and tracking the target joint angle at the knee is easier than using the hip joint. However, using hip joints, or using different variables for the output of the policy, are interesting topics for future work. We are also considering using captured data of a human walking pattern [23] as a nominal trajectory instead of using a hand-designed walking pattern. We are currently applying the proposed approach to the physical biped robot.
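The stability check above can be sketched numerically: finite-difference the closed-loop return map F(x) = f(x, u(x)) at a fixed point and test whether all eigenvalues lie inside the unit circle. The linear map and policy below are toy stand-ins, not the learned controller.

```python
import numpy as np

# Toy closed-loop Poincare map for illustration: f(x, u) = A x + B u,
# with a hypothetical linear foot-placement policy u(x).
A = np.array([[0.8, 0.1], [0.0, 0.9]])
B = np.array([[0.2], [0.5]])

def policy(x):
    return -0.5 * x[0:1]          # u(x): place the foot against the lean

def F(x):
    return A @ x + B @ policy(x)  # closed-loop return map

def jacobian(F, x0, eps=1e-6):
    """Central finite-difference Jacobian of F at x0."""
    n = len(x0)
    J = np.zeros((n, n))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        J[:, i] = (F(x0 + dx) - F(x0 - dx)) / (2 * eps)
    return J

J = jacobian(F, np.zeros(2))
stable = bool(np.all(np.abs(np.linalg.eigvals(J)) < 1.0))
```

For this toy map the closed-loop Jacobian is A + B (∂u/∂x), whose eigenvalues have magnitude below 1, so the periodic orbit is judged stable, the same criterion tracked across trials in Fig. 13.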
Fig. 9. Joint angle trajectories after learning (left/right hip and knee angles [deg] versus time [sec], with targets).

Fig. 10. Accumulated reward at each trial: results of five experiments.

In previous work, we proposed a trajectory optimization method for biped locomotion [10], [11] based on differential dynamic programming [5], [9]. We are considering combining this trajectory optimization method with the proposed reinforcement learning method.

ACKNOWLEDGMENT

We would like to thank Mitsuo Kawato, Jun Nakanishi, and Gen Endo at ATR Computational Neuroscience Laboratories, Japan, and Seiichi Miyakoshi of the Digital Human Research Center, AIST, Japan, for helpful discussions. Atkeson is partially supported by NSF award ECS-0325383.

Fig. 11. Shape of the acquired value function.

Fig. 12. Biped walking pattern on a downward slope: (Top) without timing adaptation, (Bottom) with timing adaptation.

Fig. 13. Averaged eigenvalues of the Jacobian matrix at each trial.

REFERENCES

[1] H. Benbrahim and J. Franklin. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems, 22:283-302, 1997.
[2] C. Chew and G. A. Pratt. Dynamic bipedal walking assisted by learning. 20:477-491, 2002.
[3] R. Q. Van der Linde. Passive bipedal walking with phasic muscle. Biological Cybernetics, 82:227-237, 1999.
[4] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219-245, 2000.
[5] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.
[6] T. Flash and N. Hogan. The coordination of arm movements: An experimentally confirmed mathematical model. The Journal of Neuroscience, 5:1688-1703, 1985.
[7] M. Garcia, A. Chatterjee, A. Ruina, and M. J. Coleman. The simplest walking model: stability, complexity, and scaling. ASME Journal of Biomechanical Engineering, 120(2):281-288, 1998.
[8] K. Hirai, M. Hirose, and T. Takenaka. The development of Honda humanoid robot. In Proceedings of the 1998 IEEE International Conference on Robotics and Automation, pages 160-165, 1998.
[9] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.
[10] J. Morimoto and C. G. Atkeson. Robust low-torque biped walking using differential dynamic programming with a minimax criterion. In Philippe Bidaud and Faiz Ben Amar, editors, Proceedings of the 5th International Conference on Climbing and Walking Robots, pages 453-459. Professional Engineering Publishing, Bury St Edmunds and London, UK, 2002.
[11] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1563-1570. MIT Press, Cambridge, MA, 2003.
[12] K. Nagasaka, M. Inaba, and H. Inoue. Stabilization of dynamic walk on a humanoid using torso position compliance control. In Proceedings of the 17th Annual Conference on Robotics Society of Japan, pages 1193-1194, 1999.
[13] Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for biped robot. In Proceedings of the 2nd International Symposium on Adaptive Motion of Animals and Machines, pages ThP-II-5, 2003.
[14] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion with dynamical movement primitives. In Workshop on Robot Programming by Demonstration, IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 2003.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems (to appear), 2004.
[16] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047-2084, 1998.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[18] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
[19] K. Tsuchiya, S. Aoi, and K. Tsujita. Locomotion control of a biped locomotion robot using nonlinear oscillators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1745-1750, Las Vegas, NV, USA, 2003.
[20] J. Vucobratovic, B. Borovac, D. Surla, and D. Stokic. Biped Locomotion: Dynamics, Stability, Control and Applications. Springer-Verlag, Berlin.
[21] Y. Wada and M. Kawato. A theory for cursive handwriting based on the minimization principle. Biological Cybernetics, 73:3-15, 1995.
[22] J. Yamaguchi, A. Takanishi, and I. Kato. Development of a biped walking robot compensating for three-axis moment by trunk motion. Journal of the Robotics Society of Japan, 11(4):581-586, 1993.
[23] K. Yamane and Y. Nakamura. Dynamics filter: concept and implementation of on-line motion generator for human figures. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, pages 688-693, 2000.
[24] T. Yamasaki, T. Nomura, and S. Sato. Possible functional roles of phase resetting during walking. Biological Cybernetics, 88(6):468-496, 2003.