Heuristically Accelerated Reinforcement Learning: Theoretical and Experimental Results

Reinaldo A. C. Bianchi (Centro Universitario FEI, Brazil, email: rbianchi@fei.edu.br), Carlos H. C. Ribeiro (Instituto Tecnologico de Aeronautica, Brazil, email: carlos@ita.br) and Anna H. R. Costa (Escola Politecnica da Universidade de Sao Paulo, Brazil)

Abstract. Since finding control policies using Reinforcement Learning (RL) can be very time consuming, in recent years several authors have investigated how to speed up RL algorithms by making improved action selections based on heuristics. In this work we present new theoretical results (convergence and a superior limit for value estimation errors) for the class that encompasses all heuristics-based algorithms, called Heuristically Accelerated Reinforcement Learning. We also expand this new class by proposing three new algorithms, the Heuristically Accelerated Q(λ), SARSA(λ) and TD(λ), the first algorithms that use both heuristics and eligibility traces. Empirical evaluations were conducted in traditional control problems and results show that using heuristics significantly enhances the performance of the learning process.

This paper is organized as follows: Sections 2 and 3 briefly review RL and the heuristic approach to speed up RL. Section 4 presents the new theoretical results, Section 5 introduces the new algorithms, Section 6 reports the experiments and Section 7 concludes.

2 Reinforcement Learning

Q-learning [15] and SARSA [14] are two algorithms that can be used to iteratively approximate the action-value function. The Q-learning update rule is:

    Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)],     (4)

where γ is the discount factor and α is the learning rate. The SARSA algorithm is a modification of Q-learning that eliminates the maximization over the actions in equation (4), separating the choice of the actions to be taken from the update of the Q values. The Q(λ) [15] and the SARSA(λ) [10] algorithms extend the original algorithms by, instead of updating a single state-action pair at each iteration, updating all pairs in an eligibility trace, a mechanism initially proposed in the TD(λ) algorithm [13]. Finally, to work with problems with continuous state spaces, the algorithms can be implemented using function approximators instead of a table to compute the action-value function.
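As an illustration of the two update rules above, here is a minimal tabular sketch in Python (the variable names and the toy transition are ours, not the paper's):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9          # learning rate and discount factor
Q = defaultdict(float)           # tabular action-value function Q(s, a)
actions = [0, 1, 2]

def q_learning_update(s, a, r, s2):
    # Q-learning: bootstrap on the greedy action in the next state
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s2, a2):
    # SARSA: bootstrap on the action actually chosen in the next state
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One hypothetical transition with reward 1.0
q_learning_update('s0', 0, 1.0, 's1')
print(round(Q[('s0', 0)], 3))  # one step moves Q(s0, 0) toward the target: 0.1
```

The only difference between the two functions is the bootstrap term, which is exactly the point made in the text: SARSA replaces the max over next actions with the next action actually taken.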
In this work the algorithms use two function approximators to compute the action values: a CMAC function approximator [1] and the function approximator described in Barto et al. [2].

3 Heuristics in Reinforcement Learning

Bianchi et al. [4] defined a Heuristically Accelerated Reinforcement Learning (HARL) algorithm as a way to solve an MDP problem with explicit use of a heuristic function H : S × A → R for influencing the choice of actions by the learning agent. The heuristic function is strongly associated with the policy, indicating which action must be taken regardless of the action-value of the other actions that could be used in the state. The heuristic function is an action policy modifier which does not interfere with the standard bootstrap-like update mechanism of RL algorithms.

In the HARL algorithms, instead of using only the value (or action-value) estimation in the action selection method of an RL algorithm, a mathematical combination between the estimation function and a heuristic function is used:

    F(s,a) ⊲⊳ ξH(s,a)^β,

where F : S × A → R is an estimate of a value function that defines the expected cumulative reward (for example, it is the Q function for the Q-learning algorithm); H : S × A → R is the heuristic function that plays a role in the action choice, defining the importance of executing action a in state s; ⊲⊳ is a function that operates on real numbers and produces a value from an ordered set; and ξ and β are design parameters used to control the influence of the heuristic function (they can be lowered to decrease the influence of the heuristic with time). This formulation is more general than previous proposals, allowing heuristics to be used with different action selection methods and RL algorithms.

One proposed strategy for action choice is an ε-greedy mechanism where F ⊲⊳ ξH^β is considered, thus:

    π(s) = argmax_a [F(s,a) ⊲⊳ ξH(s,a)^β]  if q ≤ p,  a_random otherwise,     (6)

where p is a parameter that defines the exploration/exploitation tradeoff, q is a random number drawn with uniform probability and a_random is an action randomly chosen among those available in state s.

Another possible strategy that can use heuristics is Boltzmann exploration [12], a strategy that assigns a probability to any possible action according to its expected utility, i.e., actions with higher F ⊲⊳ ξH^β have greater probability of being chosen. A HARL using this strategy chooses action a_t with probability:

    P(a_t | s_t) = e^{[F(s_t,a_t) ⊲⊳ ξH(s_t,a_t)^β]/τ} / Σ_a e^{[F(s_t,a) ⊲⊳ ξH(s_t,a)^β]/τ},     (8)
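The heuristically influenced ε-greedy choice can be sketched in a few lines (a sketch assuming ⊲⊳ is addition; the deterministic stand-in RNG and all names are ours, not the paper's):

```python
import random

xi, beta, p = 1.0, 1.0, 0.9   # heuristic weight, exponent, exploitation probability

def choose_action(s, F, H, actions, rng=random):
    """epsilon-greedy choice influenced by the heuristic: argmax F + xi * H**beta."""
    if rng.random() <= p:
        return max(actions, key=lambda a: F[(s, a)] + xi * H[(s, a)] ** beta)
    return rng.choice(actions)

# Toy state: F slightly prefers action 1, but the heuristic pushes action 0.
F = {('s', 0): 1.0, ('s', 1): 1.1}
H = {('s', 0): 0.5, ('s', 1): 0.0}

class AlwaysExploit:                  # deterministic stand-in for the RNG
    def random(self): return 0.0      # always take the greedy branch
    def choice(self, xs): return xs[0]

print(choose_action('s', F, H, [0, 1], rng=AlwaysExploit()))  # heuristic wins: 0
```

Note that the heuristic only biases the argmax; the stored F values are untouched, which is exactly why the convergence argument of Section 4 goes through.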
where τ is the temperature, which decreases with time.

In general, the value of H(s,a) must be larger than the variation among the values of F(s,a) for a given state s, so that it can influence the action choice. On the other hand, it must be as small as possible in order to minimize the error. If ⊲⊳ is a sum, a heuristics that can be used with the ε-greedy mechanism can be defined by:

    H(s,a) = max_a' F(s,a') − F(s,a) + η  if a = π^H(s),  0 otherwise,     (9)

where η is a small value and π^H(s) is a heuristic policy, obtained using an appropriate method, that indicates the action that is desired to be executed in state s. For instance, given the values of F(s,a) for the three possible actions in a state, if the desired action is the first one, equation (9) assigns it a heuristic value slightly larger than the gap to the best F value, and zero for the other actions.

An important characteristic of a HARL algorithm is that the heuristic function can be modified or adapted online, as learning progresses and new information for enhancement of the heuristics becomes available. In particular, either prior domain information or initial learning stage information can be used to define heuristics to accelerate learning.

4 Theoretical results

As the heuristic function is used only in the choice of the action to be taken, a new HARL algorithm differs from the original RL one only in the way exploration is carried out. As the RL algorithm operation is not modified, many of the conclusions reached for RL are also valid for HARL. In this section we present new theorems that confirm this statement and limit the maximum error caused by using a heuristics.

Theorem 1. Consider a HARL agent learning in a deterministic MDP, with finite sets of states and actions, bounded rewards, a discount factor γ such that 0 ≤ γ < 1, and where the values used in the heuristic function are bounded. For this agent, the Q values will converge to Q*, with probability one, uniformly over all the states, if each state-action pair is visited infinitely often.

Proof: In HARL, the update of the value function approximation does not depend explicitly on the value of the heuristics. The necessary conditions for the convergence of an RL algorithm that could be affected by the use of the heuristics are the ones that depend on the choice of the action. Of the conditions presented in [8], the only one that depends on the action choice is the necessity of infinite visitation to each state-action pair.
As equation (6) considers an ε-greedy exploration strategy, regardless of the fact that the value function is influenced by the heuristics, the infinite visitation condition is guaranteed and the algorithm converges. The condition of infinite visitation of each state-action pair can be considered valid for other exploration strategies (e.g., the Boltzmann exploration in Equation 8) by using other visitation strategies, such as intercalating steps where the algorithm makes alternate use of the heuristics and exploration steps, receding the influence of the heuristics with time, or using the heuristics during a period of time smaller than the total learning time. q.e.d.

The following theorem guarantees that small errors in the approximation of an optimal value function cannot produce arbitrarily bad performance when actions are selected using the ε-greedy rule influenced by heuristics (Equation 6). The proofs here are based on the work of Singh [11, Section 4.5.1].

Definition 1. The loss in the approximation of the value function caused by the use of heuristics can be defined as:

    L(s) = V*(s) − V^H(s),     (10)

where V^H is the estimated value function calculated for the policy indicated by the heuristics, π^H.

The theorem presented below defines the upper bound for the loss.

Theorem 2. The maximal loss that can be caused by the use of a heuristic function bounded by h_min ≤ H(s,a) ≤ h_max in a HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards, a discount factor γ such that 0 ≤ γ < 1, and where ⊲⊳ is the addition, has an upper bound:

    L(s) ≤ ξ[h_max^β − h_min^β].     (11)

Proof: There exists a state z that causes the maximum loss: ∀s ∈ S, L(z) ≥ L(s). For this state, consider an optimal action a, whose execution results in the state z_a, and the action b indicated by the heuristics, whose execution results in the state z_b. Because the choice of action is made following an ε-greedy policy, b must seem at least as good as a:

    F(z,a) + ξH(z,a)^β ≤ F(z,b) + ξH(z,b)^β.

Rearranging this equation we have:

    F(z,a) − F(z,b) ≤ ξ[H(z,b)^β − H(z,a)^β].     (12)

Using the definition of the loss in the approximation of the value function (Equation 10) and the definition of F:

    L(z) = F(z,a) − F(z,b).     (13)

Substituting (12) in (13) gives:

    L(z) ≤ ξ[H(z,b)^β − H(z,a)^β].

Because the action b is chosen instead of the action a, H(z,b) ≥ H(z,a). As the value of H is bounded by h_min ≤ H(s,a) ≤ h_max, it can be concluded that:

    L(s) ≤ ξ[h_max^β − h_min^β],  ∀s ∈ S.  q.e.d.

It is possible to improve the definition of the maximal loss. The following two lemmas are results known to be valid for RL algorithms, which are also valid for the HARL algorithms.

Lemma 1. For any RL or HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards r(s,a) ≤ r_max and a discount factor γ such that 0 ≤ γ < 1, the maximum value that Q(s,a) can reach has an upper bound of r_max / (1 − γ).

Proof: From the expected discounted value function definition (Equation 1) and from the definition of the action-value function (Equation 3), Q(s,a) is the expected discounted sum of the rewards obtained when starting from s, selecting action a and acting optimally thereafter, where γ is the discount factor. Assuming that, in the best case, the reward received at all steps is r_max, we have:

    Q(s,a) ≤ r_max + γ r_max + γ² r_max + ...

Finally, in the limit, we have:

    Q_max = lim_{n→∞} Σ_{i=0}^{n} γ^i r_max = r_max / (1 − γ).

If the positive reward is given only when the terminal state is reached and there are no other rewards, we conclude that Q_max = r_max. q.e.d.

Lemma 2. For any RL or HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards r(s,a) ≥ r_min and a discount factor γ such that 0 ≤ γ < 1, the minimum value that Q(s,a) can reach has a lower bound of r_min / (1 − γ).

Proof: Assuming that, in the worst case, the reward received at all steps is r_min, in the limit we have:

    Q_min = lim_{n→∞} Σ_{i=0}^{n} γ^i r_min = r_min / (1 − γ).  q.e.d.

[Figure 1. Problem where one state has actions that will receive both the maximum and the minimum values for the action-value function.]

Theorem 3. The maximal loss that can be caused by the use of a heuristic function defined by equation (9) in a HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards, a discount factor γ such that 0 ≤ γ < 1, and where ⊲⊳ is the addition, has an upper bound:

    L(s) ≤ ξ[(r_max − r_min) / (1 − γ) + η]^β.

Proof: From Equation (9), we have h_min = 0 and h_max = max_a F(s,a) − min_a F(s,a) + η. The value of the heuristics will be maximum when both the maximum and the minimum action values are found in the same state. In this case,

    h_max = r_max / (1 − γ) − r_min / (1 − γ) + η.

By substitution of h_max and h_min in the result of Theorem 2, we have:

    L(s) ≤ ξ[h_max^β − h_min^β] = ξ[(r_max − r_min) / (1 − γ) + η]^β.  q.e.d.

Figure 1 shows an example of a problem configuration where both the maximum and the minimum action values are found in the same state. In it, one state is terminal; the move to it generates a reward r_max and any other movement generates a reward r_min.

5 HARL Algorithms

A generic procedure for a HARL algorithm was defined by Bianchi et al. [4] as a process that is sequentially repeated until a stopping criterion is met (Algorithm 1). Based on this description it is possible to create many new algorithms from existing RL ones. The first HARL algorithm proposed was the Heuristically Accelerated Q-Learning (HAQL) [3], an extension of the Q-
Learning algorithm [15]. The only difference between the two algorithms is that HAQL makes use of a heuristic function H(s,a) in the ε-greedy action choice rule, which can be written as:

    π(s) = argmax_a [Q(s,a) + ξH(s,a)^β]  if q ≤ p,  a_random otherwise.     (21)

Algorithm 1 The HARL generic algorithm [Bianchi et al., 2008]
    Produce an arbitrary estimation for the value function.
    Define an initial heuristic function H.
    Observe the current state s.
    repeat
        Select an action a by adequately combining the heuristic function and the value function.
        Execute a.
        Receive the reinforcement r(s,a) and observe the next state s'.
        Update the value (or the action-value) function.
        Update H(s,a) using an appropriate method.
        Update the state s ← s'.
    until a stopping criterion is met

In this work we propose three new algorithms: the HA-Q(λ), which extends the Q(λ) algorithm by using the same action choice rule as the HAQL (Eq. 21); the HA-SARSA(λ), which extends the SARSA(λ) algorithm [10] in the same way; and the HA-TD(λ), which extends the traditional TD(λ) algorithm [13] using an action choice rule in which a probability function is influenced by the heuristic (shown in Section 6.2). Except for the new action choice rule, the new algorithms work exactly as the original ones.

6 Experiments using Heuristics

This section presents two experiments conducted to verify that the approach based on heuristics can be applied to different RL algorithms, that it is domain independent and that it can be used in problems with continuous state space. The heuristics used were defined based on a priori knowledge of the domain. It is important to notice that the heuristics used here are not a complete solution (i.e., the optimal policy) to the problems.

6.1 The Mountain Car problem using HAQL, HA-SARSA(λ) and HA-Q(λ)

The Mountain Car Problem [9] is a domain that has been traditionally used by researchers to test new reinforcement learning algorithms. In this problem, a car that is located at the bottom of a valley must be pushed back and forward until it reaches the top of a hill. The agent must generalize across continuous state variables in order to learn how to drive the car up to the goal state. Two continuous variables describe the car state: the horizontal position, restricted to the range [-1.2, 0.6], and the velocity, restricted to the range [-0.07, 0.07]. The car may select one of three actions on every step: Left, Neutral and Right, which change the velocity by -0.0007, 0 and 0.0007, respectively. To solve this problem, six algorithms were used.
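As a concrete reading of Algorithm 1 combined with the HAQL rule of Eq. (21), here is a tabular sketch with β = 1 and a hypothetical toy chain environment (all names and the environment are ours, not the paper's):

```python
from collections import defaultdict
import random

def haql_episode(env, pi_h, Q, alpha=0.1, gamma=0.9, xi=1.0, eta=0.01,
                 p=0.9, rng=random):
    """One HAQL episode: heuristic epsilon-greedy choice, plain Q-learning update."""
    s = env.reset()
    done = False
    while not done:
        # Heuristic in the spirit of Eq. (9): boost only the suggested action.
        best = max(Q[(s, b)] for b in env.actions)
        H = {a: best - Q[(s, a)] + eta if a == pi_h(s) else 0.0
             for a in env.actions}
        # epsilon-greedy on Q + xi * H (Eq. 21 with beta = 1).
        if rng.random() <= p:
            a = max(env.actions, key=lambda a: Q[(s, a)] + xi * H[a])
        else:
            a = rng.choice(env.actions)
        s2, r, done = env.step(a)
        # The heuristic never enters the update itself.
        target = r + gamma * max(Q[(s2, b)] for b in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q

class ChainEnv:
    """Toy 4-state chain; reward 1.0 on reaching the terminal state."""
    actions = [0, 1]
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += 1
        done = self.s >= 3
        return self.s, (1.0 if done else 0.0), done

class AlwaysExploit:                  # deterministic stand-in for the RNG
    def random(self): return 0.0
    def choice(self, xs): return xs[0]

Q = haql_episode(ChainEnv(), pi_h=lambda s: 1, Q=defaultdict(float),
                 rng=AlwaysExploit())
print(round(Q[(2, 1)], 3))  # only the transition into the goal is updated: 0.1
```

Swapping the Q-learning target for a SARSA or TD(λ) target, while keeping the same action choice, yields the corresponding HA- variants described above.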
The algorithms evaluated were the Q-Learning, the SARSA(λ), the Q(λ), the HAQL, the HA-SARSA(λ) and the HA-Q(λ): the first three are classic RL algorithms, and the last three are heuristic versions of them. Because the input variables are continuous, a CMAC function approximator [1] with 10 layers and 8 input positions for each variable was used to represent the action-value function (in the six algorithms).

The heuristics used was defined following a simple rule: always increase the module of the velocity. The value of the heuristics used in the HARLs is defined using Eq. (9), assigning a positive value to the action suggested by this rule and zero to the other actions. The parameters used in the simulation (learning rate, exploration rate, discount factor and heuristic parameters) are the same for all algorithms. A negative reward is received when applying a force, and a reward is given upon reaching the goal state.

Table 1 shows, for the six algorithms, the number of steps of the best solution and the time to find it (average of training sessions, limited to a fixed number of episodes). It may be noted that the Q-Learning algorithm has the worst performance, as expected. It can also be seen that the algorithms that use heuristics are faster than the others.

    Algorithm       Best Solution (steps)    Time (in sec.)
    Q-Learning      430 ± 45                 31 ± 3
    SARSA(λ)        171 ± 14                 41 ± 18
    Q(λ)            123 ± 11                 24 ± 14
    HAQL            115 ± 1                  7 ± 5
    HA-SARSA(λ)     119 ± 1                  4 ± 1
    HA-Q(λ)         107 ± 1                  4 ± 1

Table 1. Results for the Mountain Car problem: average number of steps of the best solution and the average time to find it.

Figure 2 shows the evolution of the number of steps needed to reach the goal for the six algorithms (average of training sessions).
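The "always increase the module of the velocity" rule used above can be written as a tiny heuristic policy (our own sketch; actions encoded as 0 = Left, 1 = Neutral, 2 = Right, matching the action set of the problem):

```python
def suggested_action(velocity):
    """Push in the direction of the current velocity, increasing its module."""
    if velocity > 0:
        return 2   # Right
    if velocity < 0:
        return 0   # Left
    return 1       # Neutral: no preferred direction when standing still

def H(state, action):
    """Constant bonus for the suggested action, zero otherwise (a simplified
    stand-in for the Eq. (9) construction)."""
    position, velocity = state
    return 1.0 if action == suggested_action(velocity) else 0.0

print(H((-0.5, 0.03), 2))  # car moving right, Right is suggested: 1.0
```

The heuristic needs only the sign of the velocity, which illustrates how coarse a priori knowledge can be while still biasing exploration usefully.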
As expected, the Q-learning has the worst performance (the beginning of its evolution is not presented because its values lie above the plotted range) and the HARL algorithms present the best results. As the learning proceeds, the performance of all algorithms becomes similar, as expected (Q-learning will reach the optimal solution given enough steps). This figure also allows one to infer the reason why the time required for the RL algorithms to find the solution with fewer steps (presented in Table 1) is greater than the time needed by the HARL algorithms: the smaller number of steps executed by the HARLs at the beginning of training.

The paths made by the car in the state space (position × speed) at the first training session, when controlled by the SARSA(λ) and HA-SARSA(λ) algorithms, can be seen in Figure 3 (the optimal control policy is also presented). It can be seen how SARSA(λ) explores the environment at the beginning of training and, when compared with HA-SARSA(λ), one can notice that the great advantage of the HARL algorithms is to not perform such an intense exploration of the state space.

Finally, Student's t test was used to verify the hypothesis that the use of heuristics speeds up the learning process. Results confirm the hypothesis, with a confidence level greater than 95%.

[Figure 2. Evolution of the number of steps needed to reach the goal for the six algorithms (average of training sessions, vertical axis in log scale).]

6.2 The cart-pole problem using HA-TD(λ)

The cart-pole task has been used since the early work on RL, such as [2]. The goal of the cart-pole task is to maintain the vertical position of the pole while keeping the cart within a fixed boundary [12]. A failure occurs when the pole is tilted more than 12 degrees from vertical or if
the cart hits the end of the track.

[Figure 3. Paths made by the car in the state space.]

The state variables for this problem are continuous: the position of the cart, restricted to the range [-2.4, 2.4], the speed of the cart, the angle between the pole and the vertical, restricted to the range [-12, 12] degrees, and the rate of change of the pole's angle (the dynamic equations can be found in [2]).

We use two algorithms to solve this problem: TD(λ) and HA-TD(λ), which implements a heuristic version of the TD(λ) algorithm. The heuristics used was similar to that of the previous section: if the pole is falling to the left, move the cart to the left; if it is falling to the right, move to the right. This heuristics influences the choice of actions, which is given by a probability function,

[Figure 4. Evolution of the number of steps until failure for the cart-pole problem. This is the average of 100 trials, therefore it is not possible to see individual results that reach the success criterion of a fixed number of steps without failure.]
where, if the rounding of its output equals zero, the applied force is positive, and if the rounding equals one, the force is negative. It can be seen that the heuristics, in this case, is combined with the value inside the rule that is used by the TD(λ) algorithm (which is not the ε-greedy rule).

To run our experiments, we used the simulator distributed by Sutton and Barto [12], which implements the function approximator described in Barto et al. [2]. Trials consisted of a fixed number of episodes. The goal is to keep the pole from falling for a given number of steps, in which case the trial terminates successfully. The parameters used were the same as in Barto et al. [2]. A negative reward is given upon failure. The pole is reset to vertical after each failure.

Table 2 shows the results obtained (average over trials). One can see that both the number of the episode in which the pole was first successfully controlled and the number of steps needed to learn to balance the pole are smaller for HA-TD(λ). Figure 4 shows the number of steps until failure for both algorithms. It can be seen that at the beginning the HA-TD(λ) presents a better performance, and that both algorithms become similar as they converge to the optimal policy. Finally, Student's t test was used to verify the hypothesis that the use of heuristics speeds up the learning process. The results confirm that HA-TD(λ) is significantly better than TD(λ) until the 50th episode, with a confidence level greater than 95%.
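For illustration, the cart-pole heuristic and its influence on a probabilistic action rule can be sketched as follows. The sigmoid form below is our assumption, standing in for the probabilistic push rule of Barto et al. [2], not the paper's exact equation; all names are ours:

```python
import math

def heuristic_force(theta_dot):
    """If the pole is falling to one side, push the cart toward that same side.
    Encoding: -1.0 = push left, +1.0 = push right."""
    return -1.0 if theta_dot < 0 else 1.0

def push_right_probability(value, h, xi=1.0):
    """Probability of a positive (rightward) force, nudged by the heuristic.
    The heuristic is added to the learned value before the squashing, so it
    biases the action probabilities without touching the TD(lambda) update."""
    return 1.0 / (1.0 + math.exp(-(value + xi * h)))

# Pole falling to the right (positive angular rate): a right push becomes likelier.
p = push_right_probability(0.0, heuristic_force(0.4))
print(p > 0.5)  # True
```

This mirrors the HA-TD(λ) idea described above: the heuristic enters only the action-probability rule, while the value update remains the standard one.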
    Algorithm    First Successful Episode    Steps until 1st success
    TD(λ)        67 ± 16                     1,115,602 ± 942,752
    HA-TD(λ)     23 ± 14                     637,708 ± 237,398

Table 2. Results for the cart-pole problem.

7 Conclusion

In this work we presented new theoretical results for the class that encompasses all heuristics-based algorithms, called Heuristically Accelerated Reinforcement Learning. We also contributed three new learning algorithms, HA-Q(λ), HA-SARSA(λ) and HA-TD(λ), the first algorithms that use both heuristics and eligibility traces. Empirical evaluations of these algorithms in the mountain-car and cart-pole problems were carried out. Experimental results showed that the performance of the learning algorithm can be improved even using very simple heuristic functions. An important topic to be investigated in future works is the use of generalization in the value function space to generate the heuristic function.

ACKNOWLEDGEMENTS

Reinaldo Bianchi acknowledges the support of FAPESP (grant number 2012/04089-3). Carlos Ribeiro is grateful to FAPESP (2012/10528-0 and 2011/17610-0) and CNPq (305772/2010-4), and Anna Costa is grateful to FAPESP (2011/19280-8) and CNPq.

REFERENCES

[1] J. S. Albus, 'A new approach to manipulator control: The cerebellar model articulation controller (CMAC)', Trans. of the ASME, J. Dynamic Systems, Measurement, and Control, (3), 220-227, (1975).
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson, 'Neuronlike elements that can solve difficult learning control problems', IEEE Transactions on Systems, Man, and Cybernetics, (13), 834-846, (1983).
[3] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. R. Costa, 'Heuristically Accelerated Q-learning: a new approach to speed up reinforcement learning', Lecture Notes in Artificial Intelligence, 245-254, (2004).
[4] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. R. Costa, 'Accelerating autonomous learning by using heuristic selection of actions', Journal of Heuristics, (2), 135-168, (2008).
[5] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna Helena Reali Costa, 'Heuristic selection of actions in multiagent reinforcement learning', ed. Manuela M. Veloso, pp. 690-695, (2007).
[6] Reinaldo A. C. Bianchi, Raquel Ros, and Ramon Lopez de Mantaras, 'Improving reinforcement learning by using case based heuristics', in Lecture Notes in Computer Science, pp. 75-
89. Springer, (2009).
[7] A. Burkov and B. Chaib-draa, 'Adaptive play Q-learning with initial heuristic approximation', pp. 1749-1754. IEEE, (2007).
[8] M. L. Littman and C. Szepesvari, 'A generalized reinforcement-learning model: convergence and applications', in ICML'96, pp. 310-318, (1996).
[9] Andrew Moore, 'Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces', in Proceedings of the Eighth International Conference on Machine Learning, Morgan Kaufmann, (June 1991).
[10] G. Rummery and M. Niranjan, 'On-line Q-learning using connectionist systems', Technical Report CUED/F-INFENG/TR 166, Cambridge University, Engineering Department, 1994.
[11] S. P. Singh, Learning to Solve Markovian Decision Processes, Ph.D. dissertation, University of Massachusetts, Amherst, 1994.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[13] R. S. Sutton, 'Learning to predict by the methods of temporal differences', Machine Learning, (1), 9-44, (1988).
[14] R. S. Sutton, 'Generalization in reinforcement learning: Successful examples using sparse coarse coding', in Advances in Neural Information Processing Systems, pp. 1038-1044. The MIT Press, (1996).
[15] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. dissertation, University of Cambridge, 1989.