
Autonomous helicopter flight via reinforcement learning

Andrew Y. Ng, Stanford University, Stanford, CA 94305
H. Jin Kim, Michael I. Jordan, and Shankar Sastry, University of California, Berkeley, CA 94720

Abstract

Autonomous helicopter flight represents a challenging control problem, with complex, noisy dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

1 Introduction

Helicopters represent a challenging control problem with high-dimensional, complex, asymmetric, noisy, nonlinear dynamics, and are widely regarded as significantly more difficult to control than fixed-wing aircraft [6]. Consider, for instance, a helicopter hovering in place. A single horizontally-oriented main rotor is attached to the helicopter via the rotor shaft. Suppose the main rotor rotates clockwise (viewed from above), blowing air downwards and hence generating upward thrust. By applying clockwise torque to the main rotor to make it rotate, our helicopter experiences an anti-torque that tends to cause the main chassis to spin anti-clockwise. Thus, it is necessary to use a tail rotor to blow air sideways/rightwards to generate an appropriate moment to counteract the spin. But this sideways force now causes the helicopter to drift leftwards. So, for a helicopter to hover in place, it must actually be tilted slightly to the right, so that the main rotor's thrust is directed downwards and slightly to the left, to counteract this tendency to drift sideways. Helicopter flight is rife with such examples of ingenious solutions to problems caused by solutions to other problems, and of complex, nonintuitive dynamics that make helicopters challenging to control.

In this paper, we describe a successful application of reinforcement learning to designing a controller for autonomous helicopter flight. Due to space constraints, our description of this work is necessarily brief; a detailed treatment is provided in [8]. For a discussion of related work on autonomous flight, also see [8, 12].

2 Autonomous Helicopter

The helicopter used in this work was a Yamaha R-50 helicopter, which is approximately 3.6 m long, carries up to a 20 kg payload, and is shown in Figure 1a. A detailed description of the design and construction of its instrumentation is in [12]. The helicopter carries an Inertial Navigation System (INS) consisting of 3 accelerometers and 3 rate gyroscopes installed in exactly orthogonal x, y, z directions, and a differential GPS system, which, with the assistance of a ground station, gives position estimates with a resolution of 2 cm. An onboard navigation computer runs a Kalman filter which integrates the sensor information from the GPS, INS, and a digital compass, and reports (at 50 Hz) 12 numbers corresponding to the estimates of the helicopter's position (x, y, z), orientation (roll φ, pitch θ, yaw ω), velocity (ẋ, ẏ, ż) and angular velocities (φ̇, θ̇, ω̇).

Figure 1: (a) Autonomous helicopter. (b) Helicopter hovering under control of learned policy.

Most helicopters are controlled via a 4-dimensional action space:

a1, a2: The longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls. The rotor plane is the plane in which the helicopter's rotors rotate. By tilting this plane either forwards/backwards or sideways, these controls cause the helicopter to accelerate forwards/backwards or sideways.

a3: The (main rotor) collective pitch control. As the helicopter main rotor's blades sweep through the air, they generate an amount of upward thrust that (generally) increases with the angle at which the rotor blades are tilted. By varying the tilt angle of the rotor blades, the collective pitch control affects the main rotor's thrust.

a4: The tail rotor collective pitch control. Using a mechanism similar to the main rotor collective pitch control, this controls the tail rotor's thrust.

Using the position estimates given by the Kalman filter, our task is to pick good control actions every 50th of a second.

3 Model identification

To fit a model of the helicopter's dynamics, we began by asking a human pilot to fly the helicopter for several minutes, and recorded the 12-dimensional helicopter state and 4-dimensional helicopter control inputs as it was flown. In what follows, we used 339 seconds of flight data for model fitting, and another 140 seconds of data for hold-out testing.

There are many natural symmetries in helicopter flight. For instance, a helicopter at (0, 0, 0) facing east behaves in a way related only by a translation and rotation to one at (10, 10, 50) facing north, if we command each to accelerate forwards. We would like to encode these symmetries directly into the model rather than force an algorithm to learn them from scratch. Thus, model identification is typically done not in the spatial (world) coordinates s = [x, y, z, φ, θ, ω, ẋ, ẏ, ż, φ̇, θ̇, ω̇], but instead in the helicopter body coordinates, in which the x, y, and z axes are forwards, sideways, and down relative to the current position of the helicopter. Where there is risk of confusion, we will use superscripts s and b to distinguish between spatial and body coordinates; thus, ẋ^b is forward velocity, regardless of orientation. Our model is identified in the body coordinates s^b = [φ, θ, ẋ^b, ẏ^b, ż^b, φ̇, θ̇, ω̇], which has four fewer variables than s^s. Note that once this model is built, it is easily converted back using simple geometry to one in terms of spatial coordinates.

Our main tool for model fitting was locally weighted linear regression (e.g., [11, 3]). Given a dataset {(xi, yi), i = 1, ..., m} where the xi's are vector-valued inputs and the yi's are the real-valued outputs to be predicted, we let X be the design matrix whose i-th row is xi, and let y be the vector of yi's. In response to a query at x, locally weighted linear regression makes the prediction y = βᵀx, where β = (XᵀWX)⁻¹XᵀWy, and W is a diagonal matrix with (say) Wii = exp(−(1/2)(x − xi)ᵀΣ⁻¹(x − xi)), so that the regression gives datapoints near x a larger weight. Here, Σ⁻¹ determines how weights fall off with distance from x, and was picked in our experiments via leave-one-out cross validation.¹ Using the estimator for the noise σ² given in [3], this gives a model y = βᵀx + ε, where ε ∼ Normal(0, σ²).

Figure 2: (a) Examples of plots comparing a model fit using the parameterization described in the text (solid lines) to some other models (dash-dot lines). Each point plotted shows the mean-squared error between the predicted value of a state variable—when a model is used to simulate the helicopter's dynamics for a certain duration indicated on the x-axis—and the true value of that state variable (as measured on test data) after the same duration. Top left: Comparison of ẋ-error to a model not using the a1s, etc. terms. Top right: Comparison of ẋ-error to a model omitting the intercept (bias) term. Bottom: Comparison of ẋ and θ̇ to the linear deterministic model identified by [12]. (b) The solid line is the true helicopter ẏ state on 10 s of test data. The dash-dot line is the helicopter state predicted by our model, given the initial state at time 0 and all the intermediate control inputs. The dotted lines show two standard deviations in the estimated state. Every two seconds, the estimated state is "reset" to the true state, and the track restarts with zero error. Note that the estimated state is of the full, high-dimensional state of the helicopter, but only ẏ is shown here. (c) Policy class. The pictures inside the circles indicate whether a node outputs the sum of its inputs, or the tanh of the sum of its inputs. Each edge with an arrow in the picture denotes a tunable parameter. The solid lines show the hovering policy class (Section 5). The dashed lines show the extra weights added for trajectory following (Section 6).

By applying locally weighted regression with the state s_t and action a_t as inputs, and the one-step differences (e.g., θ_{t+1} − θ_t) of each of the state variables in turn as the target output, this gives us a nonlinear, stochastic model of the dynamics, allowing us to predict s_{t+1} as a function of s_t and a_t plus noise.

We actually used several refinements to this model. Similar to the use of body coordinates to exploit symmetries, there is other prior knowledge that can be incorporated. Since both θ_t and θ̇_t are state variables, and we know that (at 50 Hz) θ_{t+1} ≈ θ_t + θ̇_t/50, there is no need to carry out a regression for θ. Similarly, we know that the roll angle φ of the helicopter should have no direct effect on forward velocity ẋ. So, when performing regression to estimate ẋ, the coefficient in β corresponding to φ can be set to 0. This allows us to reduce the number of parameters that have to be fit. Similar reasoning allows us to conclude (cf. [12]) that certain other parameters should be 0, 1/50 or g (gravity), and these were also hard-coded into the model. Finally, we added three extra (unobserved) variables a1s, b1s, and ω̇fb to model latencies in the responses to the controls. (See [8] for details.)

Some of the (other) choices that we considered in selecting a model include whether to use the a1s, b1s and/or ω̇fb terms; whether to include an intercept term; at what frequency to identify the model; whether to hardwire certain coefficients as described; and whether to use weighted or unweighted linear regression. Our main tool for choosing among the models was plots such as those shown in Figure 2a. (See figure caption.) We were particularly interested in checking how accurate a model is not just for predicting s_{t+1} from s_t, a_t, but how accurate it is at longer time scales. Each of the panels in Figure 2a shows, for a model, the mean-squared error (as measured on test data) between the helicopter's true position and the estimated position at a certain time in the future (indicated on the x-axis).

¹Actually, since we were fitting a model to a time series, samples tend to be correlated in time, and the presence of temporally close-by samples—which will be spatially close-by as well—may make data seem more abundant than in reality (leading to a bigger Σ⁻¹ than might be optimal for test data). Thus, when leaving out a sample in cross validation, we actually left out a large window (16 seconds) of data around that sample, to diminish this bias.
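The locally weighted regression step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the prediction rule y = βᵀx with β = (XᵀWX)⁻¹XᵀWy; the synthetic dataset and the choice Σ⁻¹ = I are illustrative, not the helicopter data:

```python
import numpy as np

def lwr_predict(X, y, x_query, sigma_inv):
    """Locally weighted linear regression prediction at x_query.

    X: (m, d) design matrix; y: (m,) targets; sigma_inv: (d, d)
    matrix controlling how weights fall off with distance.
    """
    diffs = X - x_query                        # (m, d)
    # W_ii = exp(-0.5 * (x_query - x_i)^T Sigma^{-1} (x_query - x_i))
    w = np.exp(-0.5 * np.einsum("md,de,me->m", diffs, sigma_inv, diffs))
    XtW = X.T * w                              # scales column i by w_i
    beta = np.linalg.solve(XtW @ X, XtW @ y)   # (X^T W X)^{-1} X^T W y
    return beta @ x_query

# Noisy linear data: y = 2*x0 - x1, with an intercept column of ones.
rng = np.random.default_rng(0)
X = np.c_[rng.uniform(-1, 1, (200, 2)), np.ones(200)]
y = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.standard_normal(200)
pred = lwr_predict(X, y, np.array([0.3, -0.2, 1.0]), np.eye(3))
```

On this globally linear data the locally weighted fit recovers the underlying function, so the prediction at (0.3, −0.2) is close to 2(0.3) − (−0.2) = 0.8. In the paper's setting the inputs are the state and action, and a separate regression is run for each one-step state difference.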
The helicopter's blade-tip moves at an appreciable fraction of the speed of sound. Given the danger and expense (about $70,000) of autonomous helicopters, we wanted to verify the fitted model carefully, so as to be reasonably confident that a controller tested successfully in simulation will also be safe in real life. Space precludes a full discussion, but one of our concerns was the possibility that unmodeled correlations in ε might mean the noise variance of the actual dynamics is much larger than predicted by the model. (See [8] for details.) To check against this, we examined many plots such as the one shown in Figure 2b, to check that the helicopter state "rarely" goes outside the error bars predicted by our model at various time scales (see caption).

4 Reinforcement learning: The PEGASUS algorithm

We used the PEGASUS reinforcement learning algorithm of [9], which we briefly review here. Consider an MDP with state space S, initial state s0 ∈ S, action space A, state transition probabilities P_sa(·), reward function R: S → ℝ, and discount factor γ. Also let some family Π of policies π: S → A be given, and suppose our goal is to find a policy in Π with high utility, where the utility of π is defined to be

U(π) = E[R(s0) + γR(s1) + γ²R(s2) + ⋯ | π],

where the expectation is over the random sequence of states s0, s1, ... visited over time when π is executed in the MDP starting from state s0.

These utilities are in general intractable to calculate exactly, but suppose we have a computer simulator of the MDP's dynamics—that is, a program that inputs s, a and outputs s' drawn from P_sa(·). Then a standard way to define an estimate Û(π) of U(π) is via Monte Carlo: We can use the simulator to sample a trajectory s0, s1, ..., and by taking the empirical sum of discounted rewards R(s0) + γR(s1) + ⋯ on this sequence, we obtain one "sample" with which to estimate U(π). More generally, we could generate m such sequences, and average to obtain a better estimator. We can then try to optimize the estimated utilities and search for "arg max_π Û(π)." Unfortunately, this is a difficult stochastic optimization problem: Evaluating Û(π) involves a Monte Carlo sampling process, and two different evaluations of Û(π) will typically give slightly different answers. Moreover, even if the number of samples m that we average over is arbitrarily large, Û(π) will fail with probability 1 to be a ("uniformly") good estimate of U(π). In our experiments, this fails to learn any reasonable controller for our helicopter.

The PEGASUS method uses the observation that almost all computer simulations of the form described sample s' ∼ P_sa(·) by first calling a random number generator to get one (or more) random numbers p, and then calculating s' as some deterministic function of the inputs s, a and the random p. If we demand that the simulator expose its interface to the random number generator, then by pre-sampling all the random numbers p in advance and fixing them, we can then use these same, fixed, random numbers to evaluate any policy. Since all the random numbers are fixed, Û: Π → ℝ is just an ordinary deterministic function, and standard search heuristics can be used to search for arg max_π Û(π). Importantly, this also allows us to show that, so long as we average over a number of samples m that is at most polynomial in all quantities of interest, then with high probability, Û will be a uniformly good estimate of U (|Û(π) − U(π)| ≤ ε). This also allows us to give guarantees on the performance of the solutions found. For further discussion of PEGASUS and other work such as variance reduction and stochastic estimation methods (cf. [5, 10]), see [8].

5 Learning to Hover

One previous attempt had been made to use a learning algorithm to fly this helicopter, using μ-synthesis [2]. This succeeded in flying the helicopter in simulation, but not on the actual helicopter (Shim, pers. comm.). Similarly, preliminary experiments using H2 and H∞ controllers to fly a similar helicopter were also unsuccessful. These comments should not be taken as conclusive of the viability of any of these methods; rather, we take them to be indicative of the difficulty and subtlety involved in learning a helicopter controller.
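The fixed-random-number construction at the heart of PEGASUS (Section 4) can be sketched with a toy one-dimensional simulator. The dynamics, reward, and policies below are illustrative stand-ins, not the helicopter model; the point is only that pre-sampling the noise makes Û(π) an ordinary deterministic function:

```python
import numpy as np

def make_utility_estimator(step, reward, s0, m, horizon, gamma, seed=0):
    """Return a deterministic Monte Carlo estimate U_hat(policy).

    step(s, a, p) must compute the next state as a deterministic
    function of (s, a) and a pre-sampled random number p.
    """
    rng = np.random.default_rng(seed)
    # Pre-sample and fix ALL randomness: m trajectories x horizon steps.
    P = rng.standard_normal((m, horizon))

    def u_hat(policy):
        total = 0.0
        for i in range(m):
            s, discount = s0, 1.0
            for t in range(horizon):
                a = policy(s)
                total += discount * reward(s, a)
                s = step(s, a, P[i, t])
                discount *= gamma
        return total / m

    return u_hat

# Toy 1-D system (illustrative): drift plus control plus noise;
# the reward penalizes squared distance from the origin.
step = lambda s, a, p: 0.9 * s + a + 0.1 * p
reward = lambda s, a: -s * s
u_hat = make_utility_estimator(step, reward, s0=1.0, m=5, horizon=50, gamma=0.99)
policy = lambda s: -0.5 * s
# Fixed random numbers => repeated evaluations agree exactly.
assert u_hat(policy) == u_hat(policy)
```

Because Û is now deterministic, it can be handed to any standard optimizer (gradient ascent, random-walk search, etc.) over the policy's parameters, which is exactly how it is used in the next section.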
Figure 3: Comparison of hovering performance of learned controller (solid line) vs. Yamaha-licensed/specially trained human pilot (dotted line). Top: x, y, z velocities. Bottom: x, y, z positions.

We began by learning a policy for hovering in place. We want a controller that, given the current helicopter state and a desired hovering position and orientation (x*, y*, z*, ω*), computes controls a ∈ [−1, 1]⁴ to make it hover stably there. For our policy class, we chose the simple neural network depicted in Figure 2c (solid edges only). Each of the edges in the figure represents a weight, and the connections were chosen via simple reasoning about which control channel should be used to control which state variables. For instance, consider the longitudinal (forward/backward) cyclic pitch control a1, which causes the rotor plane to tilt forward/backward, thus causing the helicopter to pitch (and/or accelerate) forward or backward. From Figure 2c, we can read off the a1 control as

t1 = w1 + w2·err_x^b + w3·tanh(w4·err_x^b) + w5·ẋ^b + w6·θ;  a1 = w7·tanh(w8·t1) + w9·t1.

Here, the wi's are the tunable parameters (weights) of the network, and err_x^b = x^b − x*^b is defined to be the error in the x^b-position (forward direction, in body coordinates) between where the helicopter currently is and where we wish it to hover.

We chose a quadratic cost function on the (spatial representation of the) state, where²

R(s) = −(α_x(x − x*)² + α_y(y − y*)² + α_z(z − z*)² + α_ẋ ẋ² + α_ẏ ẏ² + α_ż ż² + α_ω(ω − ω*)²).   (1)

This encourages the helicopter to hover near (x*, y*, z*, ω*), while also keeping the velocity small and not making abrupt movements. The weights α_x, α_y, etc. (distinct from the weights wi parameterizing our policy class) were chosen to scale each of the terms to be roughly the same order of magnitude. To encourage small actions and smooth control of the helicopter, we also used a quadratic penalty for actions: R(a) = −(α_a1 a1² + α_a2 a2² + α_a3 a3² + α_a4 a4²), and the overall reward was R(s, a) = R(s) + R(a).

Using the model identified in Section 3, we can now apply PEGASUS to define approximations Û(π) to the utilities of policies. Since policies are smoothly parameterized in the weights, and the dynamics are themselves continuous in the actions, the estimates of utilities are also continuous in the weights.³ We may thus apply standard hill-climbing algorithms to maximize Û(π) in terms of the policy's weights. We tried both a gradient ascent algorithm, in which we numerically evaluate the derivative of Û(π) with respect to the weights and then take a step in the indicated direction, and a random-walk algorithm in which we propose a random perturbation to the weights, and move there if it increases Û(π). Both of these algorithms worked well, though with gradient ascent, it was important to scale the derivatives appropriately, since the estimates of the derivatives were sometimes numerically unstable.⁴ It was also important to apply some standard heuristics to prevent its solutions from diverging (such as verifying after each step that we did indeed take a step uphill on the objective Û, and undoing/redoing the step using a smaller step size if this was not the case). The most expensive step in policy search was the repeated Monte Carlo evaluation to obtain Û(π).

²The ω − ω* error term is computed with appropriate wrapping about 2π rad, so that if ω* = 0.01 rad and the helicopter is currently facing ω = 2π − 0.01 rad, the error is 0.02, not 2π − 0.02 rad.

³Actually, this is not true. One last component of the reward that we did not mention earlier was that, if in performing the locally weighted regression, the matrix XᵀWX is singular to numerical precision, then we declare the helicopter to have "crashed," terminate the simulation, and give it a huge negative (−50000) reward. Because the test checking if XᵀWX is singular to numerical precision returns either 1 or 0, Û(π) has a discontinuity between "crash" and "not-crash."

Figure 4: Top row: Maneuver diagrams from RC helicopter competition. [Images courtesy of the Academy of Model Aeronautics.] Bottom row: Actual trajectories flown using learned controller.
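The a1 readout and the quadratic reward above translate directly into code. This is a minimal sketch: the weight vector and the α scalings are illustrative placeholders (not the learned values), and the yaw-wrapping of footnote 2 is omitted for brevity:

```python
import numpy as np

def a1_control(w, err_xb, xdot_b, theta):
    """Longitudinal cyclic pitch a1, read off the Figure 2c network:
    t1 is a weighted sum of the position error, forward velocity, and
    pitch; a1 mixes a tanh unit with a linear pass-through of t1."""
    t1 = (w[0] + w[1] * err_xb + w[2] * np.tanh(w[3] * err_xb)
          + w[4] * xdot_b + w[5] * theta)
    return w[6] * np.tanh(w[7] * t1) + w[8] * t1

def reward(s, s_star, a, alpha_s, alpha_a):
    """R(s, a) = R(s) + R(a): negative quadratic penalties on the
    state errors and on the four control magnitudes."""
    err = s - s_star   # velocity entries have target 0
    return -np.dot(alpha_s, err ** 2) - np.dot(alpha_a, a ** 2)

w = 0.1 * np.ones(9)   # illustrative weights, not the learned policy
a1 = a1_control(w, err_xb=1.0, xdot_b=0.0, theta=0.0)
r = reward(np.array([1.0, 0, 0, 0, 0, 0, 0.0]), np.zeros(7),
           np.full(4, a1), np.ones(7), np.ones(4))
```

With a positive forward-position error and small weights, the network commands a small positive a1, and the reward is negative, as expected of a pure penalty.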
To speed this up, we parallelized our implementation: Monte Carlo evaluations using different samples were run on different computers, and the results were then aggregated to obtain Û(π). We ran PEGASUS using 30 Monte Carlo evaluations of 35 seconds of flying time each, and γ = 0.9995. Figure 1b shows the result of implementing and running the resulting policy on the helicopter. On its maiden flight, our learned policy was successful in keeping the helicopter stabilized in the air. (We note that [1] was also successful at using our PEGASUS algorithm to control a subset, the cyclic pitch controls, of a helicopter's dynamics.)

We also compare the performance of our learned policy against that of our human pilot trained and licensed by Yamaha to fly the R-50 helicopter. Figure 3 shows the velocities and positions of the helicopter under our learned policy and under the human pilot's control. As we see, our controller was able to keep the helicopter flying more stably than was the human pilot. Videos of the helicopter flying are available at http://www.cs.stanford.edu/~ang/rl/

6 Flying competition maneuvers

We were next interested in making the helicopter learn to fly several challenging maneuvers. The Academy of Model Aeronautics (AMA), to our knowledge the largest RC helicopter organization, holds an annual RC helicopter competition, in which helicopters have to be accurately flown through a number of maneuvers. This competition is organized into Class I (for beginners, with the easiest maneuvers) through Class III (with the most difficult maneuvers, for the most advanced pilots). We took the first three maneuvers from the most challenging, Class III, segment of their competition. Figure 4 shows maneuver diagrams from the AMA website.

⁴A problem exacerbated by the discontinuities described in the previous footnote.
In the first of these maneuvers (III.1), the helicopter starts from the middle of the base of a triangle, flies backwards to the lower-right corner, performs a 180° pirouette (turning in place), flies backwards up an edge of the triangle, backwards down the other edge, performs another 180° pirouette, and flies backwards to its starting position. Flying backwards is a significantly less stable maneuver than flying forwards, which makes this maneuver interesting and challenging. In the second maneuver (III.2), the helicopter has to perform a nose-in turn, in which it flies backwards out to the edge of a circle, pauses, and then flies in a circle while always keeping the nose of the helicopter pointed at the center of rotation. After it finishes circling, it returns to the starting point. Many human pilots seem to find this second maneuver particularly challenging. Lastly, maneuver III.3 involves flying the helicopter in a vertical rectangle, with two 360° pirouettes in opposite directions halfway along the rectangle's vertical segments.

How does one design a controller for flying trajectories? Given a controller for keeping a system's state at a point (x*, y*, z*, ω*), one standard way to make the system move through a particular trajectory is to slowly vary (x*, y*, z*, ω*) along a sequence of setpoints on that trajectory. (E.g., see [4].) For instance, if we ask our helicopter to hover at (0, 0, 0, 0), then a fraction of a second later ask it to hover at (0.01, 0, 0, 0), then at (0.02, 0, 0, 0) and so on, our helicopter will slowly fly in the x^s-direction. By taking this procedure and "wrapping" it around our old policy class from Figure 2c, we thus obtain a computer program—that is, a new policy class—not just for hovering, but also for flying arbitrary trajectories. I.e., we now have a family of policies that take as input a trajectory, and that attempt to make the helicopter fly that trajectory. Moreover, we can now also retrain the policy's parameters for accurate trajectory following, not just hovering.

For flying trajectories, we also augmented the policy class to take into account more of the coupling between the helicopter's different subdynamics. For instance, the simplest way to turn is to change the tail rotor collective pitch/thrust, so that it yaws either left or right. This works well for small turns, but for large turns, the thrust from the tail rotor also tends to cause the helicopter to drift sideways. Thus, we enriched the policy class to allow it to correct for this drift by applying the appropriate cyclic pitch controls. Also, having a helicopter climb or descend changes the amount of work done by the main rotor, and hence the amount of torque/anti-torque generated, which can cause the helicopter to turn. So, we also added a link between the collective pitch control and the tail rotor control. These modifications are shown in Figure 2c (dashed lines).

We also needed to specify a reward function for trajectory following. One simple choice for R would have been to use Equation (1) with the newly-defined (time-varying) (x*, y*, z*, ω*). But we did not consider this to be a good choice. Specifically, consider making the helicopter fly in the increasing x-direction, so that (x*, y*, z*, ω*) starts off as (0, 0, 0, 0) (say), and has its first coordinate x* slowly increased over time. Then, while the actual helicopter position x^s will indeed increase, it will also almost certainly lag consistently behind x*. This is because the hovering controller is always trying to "catch up" to the moving (x*, y*, z*, ω*). Thus, x − x* may remain large, and the helicopter will continuously incur an α_x(x − x*)² cost, even if it is in fact flying a very accurate trajectory in the increasing x-direction exactly as desired. It would be undesirable to have the helicopter risk flying more aggressively to reduce this fake "error," particularly if doing so comes at the cost of increased error in the other coordinates. So, we changed the reward function to penalize deviation not from (x*, y*, z*, ω*), but instead from (x_p, y_p, z_p, ω_p), where (x_p, y_p, z_p, ω_p) is the "projection" of the helicopter's position onto the path of the idealized, desired trajectory. (In our example of flying in a straight line, for a helicopter at (x, y, z, ω), we easily see (x_p, y_p, z_p, ω_p) = (x, 0, 0, 0).) Thus, we imagine an "external observer" that looks at the actual helicopter state and estimates which part of the idealized trajectory the helicopter is trying to fly through (taking care not to be confused if a trajectory loops back on itself), and the learning algorithm pays a penalty that is quadratic in the distance between the actual position and the "tracked" position on the idealized trajectory.
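The "external observer" projection can be sketched for an idealized trajectory represented as a piecewise-linear sequence of waypoints (a hypothetical representation chosen for illustration; the paper does not specify its data structures):

```python
import numpy as np

def projected_error(pos, waypoints):
    """Distance from pos to its nearest projection onto a
    piecewise-linear idealized trajectory given by waypoints."""
    best = np.inf
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        seg = b - a
        # Clamp the projection parameter to stay on the segment.
        t = np.clip(np.dot(pos - a, seg) / np.dot(seg, seg), 0.0, 1.0)
        best = min(best, float(np.linalg.norm(pos - (a + t * seg))))
    return best

# Straight line along x: a helicopter lagging at x = 3 while the
# setpoint is far ahead incurs zero projected error, matching the
# discussion above; only sideways deviation is penalized.
line = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
err = projected_error(np.array([3.0, 0.0, 0.0]), line)   # on the path
```

A quadratic penalty on this projected distance (rather than on the distance to the moving setpoint) is what removes the fake "lag" cost described above.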
We also needed to make sure the helicopter is rewarded for making progress along the trajectory. To do this, we used the potential-based shaping rewards of [7]. Since we are already tracking where along the desired trajectory the helicopter is, we chose a potential function that increases along the trajectory. Thus, whenever the helicopter's (x_p, y_p, z_p, ω_p) makes forward progress along this trajectory, it receives positive reward. (See [7].)

Finally, our modifications have decoupled our definition of the reward function from (x*, y*, z*, ω*) and the evolution of (x*, y*, z*, ω*) in time. So, we are now also free to consider allowing (x*, y*, z*, ω*) to evolve in a way that is different from the path of the desired trajectory, but nonetheless in a way that allows the helicopter to follow the actual, desired trajectory more accurately. (In control theory, there is a related practice of using the inverse dynamics to obtain better tracking behavior.) We considered several alternatives, but the main one used ended up being a modification for flying trajectories that have both a vertical and a horizontal component (such as along the two upper edges of the triangle in III.1). Specifically, it turns out that the z (vertical) response of the helicopter is very fast: to climb, we need only increase the collective pitch control, which almost immediately causes the helicopter to start accelerating upwards. In contrast, the x and y responses are much slower. Thus, if (x*, y*, z*, ω*) moves at 45° upwards as in maneuver III.1, the helicopter will tend to track the z-component of the trajectory much more quickly, so that it accelerates into a climb steeper than 45°, resulting in a "bowed-out" trajectory. Similarly, an angled descent results in a "bowed-in" trajectory. To correct for this, we artificially slowed down the z-response, so that when (x*, y*, z*, ω*) is moving into an angled climb or descent, the (x*, y*, ω*) portion will evolve normally with time, but the changes to z* will be delayed by Δt seconds, where Δt is another parameter in our policy class, to be automatically learned by our algorithm.

Using this setup and retraining our policy's parameters for accurate trajectory following, we were able to learn a policy that flies all three of the competition maneuvers fairly accurately. Figure 4 (bottom) shows actual trajectories taken by the helicopter while flying these maneuvers. Videos of the helicopter flying these maneuvers are also available at the URL given at the end of Section 5.
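The potential-based shaping term of [7] used above has a simple form: with a potential Φ chosen to be progress along the desired trajectory, the extra reward for a transition s → s' is F = γΦ(s') − Φ(s). A minimal sketch, with illustrative progress values (the actual Φ would be computed from the tracked position):

```python
def shaping_bonus(phi_s, phi_s_next, gamma=0.9995):
    """Potential-based shaping term F = gamma * Phi(s') - Phi(s) [7].
    With Phi equal to progress along the desired trajectory, forward
    progress yields a positive bonus and backtracking a negative one."""
    return gamma * phi_s_next - phi_s

# Tracked progress (say, arc length in meters) grows from 2.0 to 2.5:
bonus = shaping_bonus(2.0, 2.5)
```

Because the shaping term is potential-based, it changes the rewards without changing which policies are optimal, which is exactly the guarantee of [7] that makes it safe to add here.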
References

[1] J. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Int'l Conf. Robotics and Automation. IEEE, 2001.
[2] G. Balas, J. Doyle, K. Glover, A. Packard, and R. Smith. μ-analysis and synthesis toolbox user's guide, 1995.
[3] W. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Stat. Assoc., 74, 1979.
[4] Gene F. Franklin, J. David Powell, and Abbas Emami-Naeini. Feedback Control of Dynamic Systems. Addison-Wesley, 1995.
[5] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23:462–466, 1952.
[6] J. Leishman. Principles of Helicopter Aerodynamics. Cambridge Univ. Press, 2000.
[7] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proc. 16th ICML, pages 278–287, 1999.
[8] Andrew Y. Ng. Shaping and policy search in reinforcement learning. PhD thesis, EECS, University of California, Berkeley, 2003.
[9] Andrew Y. Ng and Michael I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000.
[10] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[11] C. Atkeson, S. Schaal, and A. Moore. Locally weighted learning. AI Review, 11, 1997.
[12] Hyunchul Shim. Hierarchical flight control system synthesis for rotorcraft-based unmanned aerial vehicles. PhD thesis, Mech. Engr., U.C. Berkeley, 2000.