
Heuristically Accelerated Reinforcement Learning: Theoretical and Experimental Results

Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro and Anna H. R. Costa

Centro Universitário FEI, Brazil, email: rbianchi@fei.edu.br
Instituto Tecnológico de Aeronáutica, Brazil, email: carlos@ita.br
Escola Politécnica da Universidade de São Paulo, Brazil

Since finding control policies using Reinforcement Learning (RL) can be very time consuming, in recent years several authors have investigated how to speed up RL algorithms by making improved action selections based on heuristics. In this work we present new theoretical results (convergence and an upper bound for value estimation errors) for the class that encompasses all heuristics-based algorithms, called Heuristically Accelerated Reinforcement Learning. We also expand this new class by proposing three new algorithms, the Heuristically Accelerated Q(λ), SARSA(λ) and TD(λ), the first algorithms that use both heuristics and eligibility traces. Empirical evaluations were conducted in traditional control problems, and the results show that using heuristics significantly enhances the performance of the learning process.

This paper is organized as follows: Sections 2 and 3 briefly review RL and the heuristic approach to speed up RL. Section 4 presents [...]
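Two algorithms that can be used to iteratively approximate the optimal action-value function $Q^*$ are the Q-learning [15] and the SARSA [14] algorithms. The Q-learning rule is:

$$\hat{Q}(s_t,a_t) \leftarrow \hat{Q}(s_t,a_t) + \alpha \left[ r(s_t,a_t) + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1},a_{t+1}) - \hat{Q}(s_t,a_t) \right] \tag{4}$$

where $\gamma$ is a discount factor and $\alpha$ is the learning rate. The SARSA algorithm is a modification of Q-learning that eliminates the maximization of the actions in equation (4), separating the choice of the actions to be taken from the update of the Q values.

The Q(λ) [15] and the SARSA(λ) [10] algorithms extend the original algorithms: instead of updating one state-action pair at each iteration, they update all pairs in an eligibility trace, proposed initially in the TD(λ) algorithm [13].

Finally, to work with problems with continuous state spaces, algorithms can be implemented using function approximators, instead of a table, to compute the action-value function. In this work the algorithms use two function approximators to compute the value: a CMAC function approximator [1] and the function approximator described in Barto et al. [2].

3 Heuristics in Reinforcement Learning

Bianchi et al. [4] defined a Heuristically Accelerated Reinforcement Learning (HARL) algorithm as a way to solve an MDP problem with explicit use of a heuristic function $H: S \times A \to \Re$ for influencing the choice of actions by the learning agent. The heuristic function is strongly associated with the policy, indicating which action must be taken regardless of the action-value of the other actions that could be used in the state. The heuristic function is an action policy modifier which does not interfere with the standard bootstrap-like update mechanism of RL algorithms.

In the HARL algorithms, instead of using only the value (or action-value) estimation in the action selection method of an RL algorithm, a mathematical combination between the estimation function and a heuristic function is used:

$$F_t(s_t,a_t) \bowtie \xi H_t(s_t,a_t)^{\beta} \tag{5}$$

where:
- $F: S \times A \to \Re$ is an estimate of a value function that defines the expected cumulative reward (for example, it is $\hat{Q}_t(s_t,a_t)$ for Q-learning);
- $H: S \times A \to \Re$ is the heuristic function that plays a role in the action choice, defining the importance of executing action $a_t$ in state $s_t$;
- $\bowtie$ is a function that operates on real numbers and produces a value from an ordered set;
- $\xi$ and $\beta$ are design parameters used to control the influence of the heuristic function (they can be lowered to decrease the influence of the heuristic with time).

This formulation is more general than other previous proposals, allowing heuristics to be used with different action selection methods and RL algorithms. One proposed strategy for action choice is an ε-greedy mechanism where $H$ is considered, thus:

$$\pi(s_t) = \begin{cases} \arg\max_{a_t} \left[ F_t(s_t,a_t) \bowtie \xi H_t(s_t,a_t)^{\beta} \right] & \text{if } q \le p, \\ a_{random} & \text{otherwise,} \end{cases} \tag{6}$$

where $p$ is a parameter that defines the exploration/exploitation tradeoff, $q$ is a random number between 0 and 1, and $a_{random}$ is an action randomly chosen among those available in state $s_t$.

Another possible strategy that can use heuristics is Boltzmann exploration [12], a strategy that assigns a probability to any possible action according to its expected utility, i.e., actions with higher $F$ have greater probability of being chosen. A HARL using this strategy chooses action $a_t$ with probability:

$$\Pr(a_t \mid s_t) = \frac{e^{\left[ F_t(s_t,a_t) \bowtie \xi H_t(s_t,a_t)^{\beta} \right] / \tau}}{\sum_{a \in A} e^{\left[ F_t(s_t,a) \bowtie \xi H_t(s_t,a)^{\beta} \right] / \tau}} \tag{8}$$

where $\tau$ is the temperature, which decreases with time.

In general, the value of $H_t(s_t,a_t)$ must be larger than the variation among the values of $F_t(s_t,a)$ for a given $s_t$, so that it can influence the action choice. On the other hand, it must be as small as possible in order to minimize the error. If $\bowtie$ is a sum and $\xi = \beta = 1$, a heuristic that can be used with the ε-greedy mechanism can be defined by:

$$H(s_t,a) = \begin{cases} \max_{a'} \left[ F_t(s_t,a') \right] - F_t(s_t,\pi^H(s_t)) + \eta & \text{if } a = \pi^H(s_t), \\ 0 & \text{otherwise,} \end{cases} \tag{9}$$

where $\eta$ is a small value and $\pi^H(s_t)$ is a heuristic obtained using an appropriate method, that is desired to be used in $s_t$. For instance, given the values of $F_t(s_t,a)$ for three possible actions $[a_1\ a_2\ a_3]$ in a given state, if the desired action is the first one ($a_1$), a small $\eta$ yields $H(s_t,a_1) = \max_{a'}[F_t(s_t,a')] - F_t(s_t,a_1) + \eta$ and zero for the other actions.

An important characteristic of a HARL algorithm is that the heuristic function can be modified or adapted online, as learning progresses and new information for enhancement of the heuristic becomes available. In particular, either prior domain information or initial learning stage information can be used to define heuristics to accelerate learning.

4 Theoretical results

As the heuristic function is used only in the choice of the action to be taken, a new HARL algorithm is different from the original RL one only in the way exploration is carried out. As the RL algorithm operation is not modified, many of the conclusions reached in RL are also valid for HARL. In this section we present new theorems that confirm this statement and limit the maximum error caused by using a heuristic.

Theorem 1 Consider a HARL agent learning in a deterministic MDP, with finite sets of states and actions, bounded rewards ($r_{\min} \le r(s,a) \le r_{\max}$ for all $(s,a)$), discount factor $\gamma$ such that $0 \le \gamma < 1$ and where the values used on the heuristic function are bounded by $h_{\min} \le H(s,a) \le h_{\max}$. For this agent, the value estimates $F_t$ will converge to the optimal values $F^*$, with probability one uniformly over all the states, if each state-action pair is visited infinitely often.

Proof: In HARL, the update of the value function approximation does not depend explicitly on the value of the heuristic. The necessary conditions for the convergence of an RL algorithm that could be affected by the use of the heuristic are the ones that depend on the choice of the action. Of the conditions presented in [8], the only one that depends on the action choice is the necessity of infinite visitation to each state-action pair. As Equation 6 considers an exploration strategy that is ε-greedy regardless of the fact that the value function is influenced by the heuristic, the infinite visitation condition is guaranteed and the algorithm converges. The condition of infinite visitation of each state-action pair can be considered valid for other exploration strategies (e.g., Boltzmann exploration in Equation 8) by using other visitation strategies, such as intercalating steps where the algorithm makes alternate use of the heuristic and exploration steps, receding the influence of the heuristic with time, or using the heuristic during a period of time smaller than the total learning time. q.e.d.

The following theorem guarantees that small errors in the approximation of an optimal value function cannot produce arbitrarily bad performance when actions are selected using the ε-greedy rule influenced by heuristics (Equation 6). The proofs here are based on the work of Singh [11, Section 4.5.1].

Definition 1 The loss in the approximation of the value function caused by the use of heuristics can be defined as:

$$L_H(s_t) = F_t(s_t,\pi^*) - F_t(s_t,\pi^H), \quad \forall s_t \in S, \tag{10}$$

where $F_t(s_t,\pi^H)$ is the estimated value function calculated for the policy indicated by the heuristics, $\pi^H$.

The theorem presented below defines the upper bound for the loss $L_H(s_t)$.

Theorem 2 The maximal loss that can be caused by the use of a heuristic function bounded by $h_{\min} \le H(s,a) \le h_{\max}$ in a HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards, discount factor $\gamma$ such that $0 \le \gamma < 1$ and where $\bowtie$ is the addition, has an upper bound:

$$L_H(s_t) \le \xi \left[ h_{\max}^{\beta} - h_{\min}^{\beta} \right]. \tag{11}$$

Proof: There exists a state $z$ that causes maximum loss: $\exists z \in S,\ \forall s \in S,\ L_H(z) \ge L_H(s)$. For this state $z$, consider an optimal action $a = \pi^*(z)$ and the action indicated by the heuristic, $b = \pi^H(z)$; each of them leads to some successor state. Because the choice of action is made following an ε-greedy policy, $b$ must seem at least as good as $a$:

$$F_t(z,a) + \xi H_t(z,a)^{\beta} \le F_t(z,b) + \xi H_t(z,b)^{\beta}.$$

Rearranging this equation we have:

$$F_t(z,a) - F_t(z,b) \le \xi H_t(z,b)^{\beta} - \xi H_t(z,a)^{\beta}. \tag{12}$$

Using the definition of the loss in the approximation of the value function (Equation 10) and the definition of $z$:

$$L_H(z) = F_t(z,a) - F_t(z,b). \tag{13}$$

Substituting (12) in (13) gives:

$$L_H(z) \le \xi \left[ H_t(z,b)^{\beta} - H_t(z,a)^{\beta} \right].$$

Because the action $b$ is chosen instead of the action $a$, $H_t(z,b)^{\beta} \ge H_t(z,a)^{\beta}$. As the value of $H$ is bounded by $h_{\min} \le H(s,a) \le h_{\max}$, it can be concluded that:

$$L_H(s_t) \le \xi \left[ h_{\max}^{\beta} - h_{\min}^{\beta} \right], \quad \forall s_t \in S.$$

q.e.d.

It is possible to improve the definition of the maximal loss. The following two lemmas are results known to be valid for RL algorithms, which are also valid for the HARL algorithms.

Lemma 1 For any RL or HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards ($r_{\min} \le r(s,a) \le r_{\max}$), discount factor $\gamma$ such that $0 \le \gamma < 1$, the maximum value that $F(s_t,a_t)$ can reach has an upper bound of $r_{\max}/(1-\gamma)$.

Proof: From the expected discounted value function definition (Equation 1) we have:

$$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}.$$

And from the definition of the action-value function (Equation 3):

$$Q^{\pi}(s_t,a_t) = r_t + \gamma V^{\pi}(s_{t+1}) = \sum_{i=0}^{\infty} \gamma^i r_{t+i},$$

where $r_{t+i}$ is the sequence of rewards obtained when starting from $s_t$ and using $\pi$ to select the actions, and where $\gamma$ is the discount factor such that $0 \le \gamma < 1$. Assuming that, in the best case, all received rewards in all steps were $r_{\max}$, we have that:

$$\max F(s_t,a_t) = r_{\max} + \gamma r_{\max} + \gamma^2 r_{\max} + \dots + \gamma^n r_{\max} = \sum_{i=0}^{n} \gamma^i r_{\max}.$$

Finally, in the limit $n \to \infty$, we have:

$$\max F(s_t,a_t) = \lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i r_{\max} = \frac{r_{\max}}{1-\gamma}.$$

If the positive reward is given only when the terminal state is reached and there are no other rewards, we conclude that $\max F(s_t,a_t) \le r_{\max}$.

Lemma 2 For any RL or HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards ($r_{\min} \le r(s,a) \le r_{\max}$), discount factor $\gamma$ such that $0 \le \gamma < 1$, the minimum value that $F(s_t,a_t)$ can reach has a lower bound of $r_{\min}/(1-\gamma)$.

Proof: Assuming that, in the worst case, all received rewards in all steps were $r_{\min}$, we have that:

$$\min F(s_t,a_t) = r_{\min} + \gamma r_{\min} + \gamma^2 r_{\min} + \dots + \gamma^n r_{\min} = \sum_{i=0}^{n} \gamma^i r_{\min}.$$

In the limit when $n \to \infty$:

$$\min F(s_t,a_t) = \lim_{n \to \infty} \sum_{i=0}^{n} \gamma^i r_{\min} = \frac{r_{\min}}{1-\gamma}.$$

Figure 1. Problem where a state has actions that will receive both the maximum and minimum values for the action-value function.

Theorem 3 The maximal loss that can be caused by the use of a heuristic function in a HARL algorithm learning in a deterministic MDP, with finite sets of states and actions, bounded rewards ($r_{\min} \le r(s,a) \le r_{\max}$), discount factor $\gamma$ such that $0 \le \gamma < 1$ and where $\bowtie$ is the addition, has an upper bound:

$$L_H(s_t) \le \xi \left[ \frac{r_{\max} - r_{\min}}{1-\gamma} + \eta \right]^{\beta}.$$

Proof: From Equation 9, we have $h_{\min} = 0$ when $a \ne \pi^H(s_t)$, and

$$h_{\max} = \max_{a} \left[ F_t(s_t,a) \right] - F_t(s_t,\pi^H(s_t)) + \eta$$

when $a = \pi^H(s_t)$. The value of the heuristic will be maximum when both the $\max F(s_t,a)$ and the $\min F(s_t,a)$ are found in the same state. In this case

$$h_{\max} = \frac{r_{\max}}{1-\gamma} - \frac{r_{\min}}{1-\gamma} + \eta.$$

By substitution of $h_{\max}$ and $h_{\min}$ in the result of Theorem 2, we have:

$$L_H(s_t) \le \xi \left[ h_{\max}^{\beta} - h_{\min}^{\beta} \right] = \xi \left[ \left( \frac{r_{\max}}{1-\gamma} - \frac{r_{\min}}{1-\gamma} + \eta \right)^{\beta} - 0^{\beta} \right] = \xi \left[ \frac{r_{\max} - r_{\min}}{1-\gamma} + \eta \right]^{\beta}$$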
q.e.d.

Figure 1 shows an example of a problem configuration where both the $\max F(s_t,a)$ and the $\min F(s_t,a)$ are found in the same state. In it, one state is a terminal state; a move to the terminal state generates a reward $r_{\max}$ and any other movement generates a reward $r_{\min}$.

5 HARL Algorithms

A generic procedure for a HARL algorithm was defined by Bianchi et al. [4] as a process that is sequentially repeated until a stopping criterion is met (Algorithm 1). Based on this description it is possible to create many new algorithms from existing RL ones.

Algorithm 1 The HARL generic algorithm (Bianchi et al. [4])
    Produce an arbitrary estimation for the value function.
    Define an initial heuristic function H(s, a).
    Observe the current state s.
    repeat
        Select an action a by adequately combining the heuristic function and the value function.
        Execute a.
        Receive the reinforcement r(s, a) and observe the next state s'.
        Update the value (or the action-value) function.
        Update H(s, a) using an appropriate method.
        Update the state: s ← s'.
    until a stopping criterion is met

The first HARL algorithm proposed was the Heuristically Accelerated Q-Learning (HAQL) [3], as an extension of the Q-Learning algorithm [15]. The only difference between the two algorithms is that HAQL makes use of a heuristic function $H(s,a)$ in the ε-greedy action choice rule, which can be written as:

$$\pi(s) = \begin{cases} \arg\max_{a} \left[ \hat{Q}(s,a) + \xi H(s,a)^{\beta} \right] & \text{if } q \le p, \\ a_{random} & \text{otherwise.} \end{cases} \tag{21}$$

In this work we propose three new algorithms: the HA-Q(λ), which extends the Q(λ) algorithm by using the same action choice rule as the HAQL (Eq. 21); the HA-SARSA(λ), which extends the SARSA(λ) algorithm [10] in the same way; and the HA-TD(λ), which extends the traditional TD(λ) algorithm [13], using an action choice rule in which a probability function is influenced by the heuristic (shown in Section 6.2). Except for the new action choice rule, the new algorithms work exactly as the original ones.

6 Experiments using Heuristics

This section presents two experiments conducted to verify that the approach based on heuristics can be applied to different RL algorithms, that it is domain independent and that it can be used in problems with continuous state space. The heuristics used were defined based on a priori knowledge of the domain. It is important to notice that the heuristics used here are not a complete solution (i.e., the optimal policy) to the problems.

6.1 The Mountain Car problem using HAQL and HA-SARSA(λ)

The Mountain Car Problem [9] is a domain that has been traditionally used by researchers to test new reinforcement learning algorithms. In this problem, a car that is located at the bottom of a valley must be pushed back and forward until it reaches the top of a hill. The agent must generalize across continuous state variables in order to learn how to drive the car up to the goal state.

Two continuous variables describe the car state: the horizontal position, restricted to the range [-1.2, 0.6], and the velocity, restricted to the range [-0.07, 0.07]. The car may select one of three actions on every step: Left, Neutral and Right, which change the velocity by -0.0007, 0, and 0.0007, respectively.

To solve this problem, six algorithms were used: the Q-Learning, the SARSA(λ), the Q(λ), the HAQL, the HA-SARSA(λ) and the HA-Q(λ). The first three are classic RL algorithms, and the last three are heuristic versions of them. Because the input variables are continuous, a CMAC function approximator [1] with 10 layers and 8 input positions for each variable was used to represent the value-action function (in the six algorithms).

The heuristic used was defined following a simple rule: always increase the modulus of the velocity. The value of the heuristic used in the HARLs is defined using Eq. (9): the action that pushes in the direction of the current velocity receives the value given by Eq. (9), and the other actions receive zero.

The parameters used in the simulation, including the exploration rate, are the same for all algorithms [...]. Rewards are given when applying a force and when reaching the goal state [...].

Table 1 shows, for the six algorithms, the number of steps of the best solution and the time to find it (average over the training sessions, each limited to a fixed number of episodes). It may be noted that the Q-Learning algorithm has the worst performance, as expected. It can also be seen that the algorithms that use heuristics are faster than the others.

Algorithm      Best Solution (steps)   Time (in sec.)
Q-Learning     430 ± 45                31 ± 3
SARSA(λ)       171 ± 14                41 ± 18
Q(λ)           123 ± 11                24 ± 14
HAQL           115 ± 1                 7 ± 5
HA-SARSA(λ)    119 ± 1                 4 ± 1
HA-Q(λ)        107 ± 1                 4 ± 1

Table 1. Results for the Mountain Car problem: average number of steps of the best solution and the average time to find it.

Figure 2 shows the evolution of the number of steps needed to reach the goal for the six algorithms (average over the training sessions). As expected, Q-learning has the worst performance (the beginning of its evolution is not presented because values are above [...] steps) and the HARL algorithms present the best results. As the learning proceeds, the performance of all algorithms becomes similar, as expected (Q-learning will reach the optimal solution after [...] steps). This figure also allows one to infer the reason why the time required for the RL algorithms to find the solution with fewer steps (presented in Table 1) is greater than the time needed by the HARL algorithms: the smaller number of steps executed by the HARLs at the beginning of training.

Figure 2. Evolution of the number of steps needed to reach the goal for the six algorithms (average over the training sessions, steps axis in log scale). Axes: steps per trial versus episodes.

The paths made by the car in the state space (position × speed) at the first training session, when controlled by the SARSA(λ) and HA-SARSA(λ) algorithms, can be seen in Figure 3 (the optimal control policy is also presented). It can be seen how SARSA(λ) explores the environment at the beginning of training and, when compared with HA-SARSA(λ), one can notice that the great advantage of the HARL algorithms is that they do not perform such an intense exploration of the state space.

Figure 3. Paths made by the car in the state space (velocity versus position) for SARSA(λ), HA-SARSA(λ) and the optimal policy.

Finally, Student's t-test was used to verify the hypothesis that the use of heuristics speeds up the learning process. Results confirm the hypothesis, with a confidence level greater than [...].

6.2 The cart-pole problem using HATD(λ)

The cart-pole task has been used since the early work on RL, such as [2]. The goal of the cart-pole task is to maintain the vertical position of the pole while keeping the cart within a fixed boundary [12]. A failure occurs when the pole is tilted more than 12 degrees from vertical or if the cart hits the end of the track. The state variables for this problem are continuous: the position of the cart, in [-2.4, 2.4]; the speed of the cart; the angle between the pole and the vertical, in [-12, 12] degrees; and the rate of change of the pole's angle (the dynamic equations can be found in [2]).

We use two algorithms to solve this problem: TD(λ) and HATD(λ), which implements a heuristic version of the TD(λ) algorithm. The heuristic used was similar to that of the previous section: if the pole is falling to the left, move the cart to the left; if it is falling to the right, move to the right [...]. This heuristic influences the choice of actions, which is given by a probability function [...]: if the rounding of this probability equals zero, the force applied is positive; if the rounding is equal to one, the force is negative. It can be seen that the heuristic, in this case, is combined with the value function inside the rule that is used by the TD(λ) algorithm (which is not the ε-greedy rule).

To run our experiments, we used the simulator distributed by Sutton and Barto [12], which implements the function approximator described in Barto et al. [2]. Trials consisted of [...] episodes. The goal is to keep the pole from falling for [...] steps, in which case the trial terminates successfully. The parameters used were the same as in Barto et al. [2]. The value of $\eta$ used by the HATD(λ) is [...]. The reward is [...] upon failure. The pole is reset to vertical after each failure.

Table 2 shows the results obtained (average over the trials). One can see that both the number of the episode in which the pole was successfully controlled and the number of steps needed to learn to balance the pole are smaller in HATD(λ). Figure 4 shows the number of steps until failure for both algorithms. It can be seen that at the beginning the HATD(λ) presents a better performance, and that both algorithms become similar as they converge to the optimal policy.

Figure 4. Evolution of the number of steps until failure for the cart-pole problem, for TD(λ) and HA-TD(λ) (time steps until failure versus episodes). This is the average of 100 trials, therefore it is not possible to see individual results that reach the success criterion of [...] steps without failure.

Finally, Student's t-test was used to verify the hypothesis that the use of heuristics speeds up the learning process. The results confirm that HATD(λ) is significantly better than TD(λ) until the 50th episode, with a confidence level greater than 95%.
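The ε-greedy action choice of Equation 6, combined with the heuristic of Equation 9, can be illustrated with a minimal Python sketch. It assumes that the combination operator is addition and that the value estimates for one state are held in a plain list. The function names (`heuristic_values`, `harl_epsilon_greedy`, `mountain_car_desired_action`), the action encoding and all numeric values are hypothetical and are not taken from the experiments above.

```python
import random


def heuristic_values(f_values, desired_action, eta=0.01):
    """Heuristic in the style of Eq. 9 for a single state (hypothetical helper).

    The desired action receives max(F) - F[desired] + eta; all others get 0.
    """
    h = [0.0] * len(f_values)
    h[desired_action] = max(f_values) - f_values[desired_action] + eta
    return h


def harl_epsilon_greedy(f_values, h_values, p=0.9, xi=1.0, beta=1.0, rng=random):
    """Action choice in the style of Eq. 6, assuming the combination is a sum.

    If a random draw q satisfies q <= p, the action maximising
    F + xi * H**beta is returned; otherwise a uniformly random action is.
    """
    if rng.random() <= p:
        scores = [f + xi * (h ** beta) for f, h in zip(f_values, h_values)]
        return max(range(len(scores)), key=scores.__getitem__)
    return rng.randrange(len(f_values))


def mountain_car_desired_action(velocity):
    """Rule 'always increase the modulus of the velocity'.

    Assumed encoding: 0 = Left, 1 = Neutral, 2 = Right; a zero velocity is
    arbitrarily mapped to Right in this sketch.
    """
    return 0 if velocity < 0 else 2


# Illustrative values only (not from the paper).
f = [1.0, 1.1, 1.2]
h = heuristic_values(f, desired_action=0, eta=0.01)
greedy_action = harl_epsilon_greedy(f, h, p=1.0)  # p = 1 disables random moves
```

With `p = 1.0` the combined score of the desired action exceeds the best plain value estimate by `eta`, so the heuristic action is selected even though its own estimate is the lowest; lowering `xi` over time would reduce this influence.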
Algorithm   First Successful Episode   Steps until 1st success
TD(λ)       67 ± 16                    1,115,602 ± 942,752
HATD(λ)     23 ± 14                    637,708 ± 237,398

Table 2. Results for the cart-pole problem.

7 Conclusion

In this work we presented new theoretical results for the class that encompasses all heuristics-based algorithms, called Heuristically Accelerated Reinforcement Learning. We have also contributed three new learning algorithms, HA-Q(λ), HA-SARSA(λ) and HATD(λ), the first algorithms that use both heuristics and eligibility traces. Empirical evaluations of these algorithms in the mountain-car and cart-pole problems were carried out.

Experimental results showed that the performance of the learning algorithm can be improved even using very simple heuristic functions. An important topic to be investigated in future works is the use of generalization in the value function space to generate the heuristic.

ACKNOWLEDGEMENTS

Reinaldo Bianchi acknowledges the support of the FAPESP (grant number 2012/04089-3). Carlos Ribeiro is grateful to FAPESP (2012/10528-0 and 2011/17610-0) and CNPq (305772/2010-4) and Anna Costa is grateful to FAPESP (2011/19280-8) and CNPq.

References

[1] J. S. Albus, A new approach to manipulator control: The cerebellar model articulation controller (CMAC), Trans. of the ASME, J. Dynamic Systems, Measurement, and Control, (3), 220–227, (1975).
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuronlike elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, (13), 834–846, (1983).
[3] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. R. Costa, Heuristically Accelerated Q-learning: a new approach to speed up reinforcement learning, Lecture Notes in Artificial Intelligence, 245–254, (2004).
[4] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. R. Costa, Accelerating autonomous learning by using heuristic selection of actions, Journal of Heuristics, (2), 135–168, (2008).
[5] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna Helena Reali Costa, Heuristic selection of actions in multiagent reinforcement learning, in [...], ed., Manuela M. Veloso, pp. 690–695, (2007).
[6] Reinaldo A. C. Bianchi, Raquel Ros, and Ramon Lopez de M[...], Improving reinforcement learning by using case based heuristics, in Lecture Notes in Computer Science, pp. 75–89. Springer, (2009).
[7] A. Burkov and B. Chaib-draa, Adaptive play Q-learning with initial heuristic approximation, in [...], pp. 1749–1754. IEEE, (2007).
[8] M. L. Littman and C. Szepesvari, A generalized reinforcement learning model: convergence and applications, in ICML 96, pp. 310–318.
[9] Andrew Moore, Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces, in Proceedings of the Eighth International Conference on Machine Learning (June 1991). Morgan Kaufmann.
[10] G. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, 1994. Technical Report CUED/F-INFENG/TR 166. Cambridge University, Engineering Department.
[11] S. P. Singh, Learning to solve Markovian Decision Processes, Ph.D. Dissertation, University of Massachusetts, Amherst, 1994.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[13] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, (1), 9–44, (1988).
[14] R. S. Sutton, Generalization in reinforcement learning: Successful examples using sparse coarse coding, in Advances in Neural Information Processing Systems, pp. 1038–1044. The MIT Press, (1996).
[15] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. dissertation, University of Cambridge, 1989.
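As a numerical illustration of the upper bound stated in Theorem 3, the following Python sketch evaluates ξ[(r_max − r_min)/(1 − γ) + η]^β via the limits of Lemmas 1 and 2. The function name `harl_loss_bound` and the parameter values are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the settings used in the experiments.

```python
def harl_loss_bound(r_max, r_min, gamma, eta, xi=1.0, beta=1.0):
    """Upper bound on the loss, following the form of Theorem 3.

    Assumes the combination operator is addition and 0 <= gamma < 1.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    f_max = r_max / (1.0 - gamma)  # largest reachable value (Lemma 1)
    f_min = r_min / (1.0 - gamma)  # smallest reachable value (Lemma 2)
    h_max = f_max - f_min + eta    # largest heuristic value under Eq. 9
    return xi * h_max ** beta


# Illustrative values only (not from the paper).
bound = harl_loss_bound(r_max=1.0, r_min=-1.0, gamma=0.5, eta=0.5)
```

Here the reachable values span [-2, 2], so the largest heuristic value is 4.5 and, with ξ = β = 1, the bound equals 4.5. The bound grows as γ approaches 1, since the reachable range of values widens.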
