Journal of Machine Learning Research 1 (2008) 815-857. Submitted 6/05; Revised 2/07; Published 5/08.

Finite-Time Bounds for Fitted Value Iteration

Rémi Munos (REMI.MUNOS@INRIA.FR), SequeL project, INRIA Lille - Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Csaba Szepesvári (SZEPESVA@CS.UALBERTA.CA), Department of Computing Science, University of Alberta, Edmonton T6G 2E8, Canada

Editor: Shie Mannor

Abstract

In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples, for a large class of MDPs, arbitrarily good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted $L^p$-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.

Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard's inequality, statistical learning theory, optimal control

1. Introduction

During the last decade, reinforcement learning (RL) algorithms have been
successfully applied to a number of difficult control problems, such as job-shop scheduling (Zhang and Dietterich, 1995), backgammon (Tesauro, 1995), elevator control (Crites and Barto, 1997), machine maintenance (Mahadevan et al., 1997), dynamic channel allocation (Singh and Bertsekas, 1997), or airline seat allocation (Gosavi, 2004). The set of possible states in these problems is very large, and so only algorithms that can successfully generalize to unseen states are expected to achieve non-trivial performance.

(Most of this work was done while the second author was with the Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary. © 2008 Rémi Munos and Csaba Szepesvári.)

The approach in the above-mentioned works is to learn an approximation to the optimal value function, which assigns to each state the best possible expected long-term cumulative reward resulting from starting the control from the selected state. Knowledge of the optimal value function is sufficient to achieve optimal control (Bertsekas and Tsitsiklis, 1996), and in many cases an approximation to the optimal value function is already sufficient to achieve a good control performance. In large state-space problems, a function approximation method is used to represent the learnt value function. In all the above successful applications this is the approach used. Yet, the interaction of RL algorithms and function approximation methods is still not very well understood. Our goal in this paper is to improve this situation by studying one of the simplest ways to combine RL and function approximation, namely, using function approximation in value iteration. The advantage of studying such a simple combination is that some technical difficulties are avoided, yet, as we will see, the problem studied is challenging enough to make its study worthwhile.

Value iteration is a dynamic programming algorithm which uses 'value backups' to generate a sequence of value functions (i.e., functions defined over the state space) in a recursive manner. After a sufficiently large number of iterations the obtained function can be used to compute a good policy. Exact computation of the value backups requires computing parametric integrals over the state space. Except in a few special cases, neither such exact
computations, nor the exact representation of the resulting functions is possible. The idea underlying sampling-based fitted value iteration (FVI) is to calculate the backups approximately using Monte-Carlo integration at a finite number of points and then find a best fit within a user-chosen set of functions to the computed values. The hope is that if the function set is rich enough then the fitted value function will be a good approximation to the next iterate, ultimately leading to a good policy. A large number of successful experimental works concerned algorithms that share many similarities with FVI (e.g., Wang and Dietterich, 1999; Dietterich and Wang, 2002; Lagoudakis and Parr, 2003; Jung and Uthmann, 2004; Ernst et al., 2005; Riedmiller, 2005). Hence, in this paper we concentrate on the theoretical analysis of FVI, as we believe that such an analysis can yield useful insights into why and when sampling-based approximate dynamic programming (ADP) can be expected to perform well. The relative simplicity of the setup allows a simplified analysis, yet it already shares many of the difficulties that one has to overcome in other, more involved scenarios. (In our follow-up work we studied other variants of sampling-based ADP. The reader interested in such extensions should consult the papers Antos et al. (2006, 2007, 2008).)

Despite the appealing simplicity of the idea and the successful demonstrations of various sampling-based ADP algorithms, without any further considerations it is still unclear whether sampling-based ADP, and in particular sampling-based FVI, is indeed a "good" algorithm. In particular, Baird (1995) and Tsitsiklis and Van Roy (1996) gave simple counterexamples showing that FVI can be unstable. These counterexamples are deceptively simple: the MDP is finite, exact backups can be and are computed, the approximate value function is calculated using a linear combination of a number of fixed basis functions, and the optimal value function can be represented exactly by such a linear combination. Hence, the function set seems sufficiently rich. Despite this, the iterates diverge. Since value iteration without projection is well behaved, we must conclude that the unstable behavior is the result of the errors introduced when the iterates are projected onto the function space. Our aim in this paper is to develop a better understanding of why, despite the conceivable difficulties, practitioners often find that sampling-based FVI is well behaving and, in particular, we want to develop a theory explaining when to expect good performance.

The setting studied in this paper is as follows: We assume that the state space is a compact subset of a Euclidean space and that the MDP has a finite number of actions. The problem is to find a policy (or controller) that maximizes the expectation of the infinite-horizon, discounted sum of rewards. We shall assume that the solver can sample any transition from any state, that is, that a generative model (or simulator) of the environment is available. This model has been used in a number of previous works (e.g., Kearns et al., 1999; Ng and Jordan, 2000; Kakade and Langford, 2002; Kakade, 2003). An extension of the present work to the case when only a single trajectory is available for learning is published elsewhere (Antos et al., 2006).

We investigate two versions of the basic algorithm: In the multi-sample variant a fresh sample set is generated in each iteration, while in the single-sample variant the same sample set is used throughout all the iterations. Interestingly, we find that no serious degradation of performance results from reusing the samples. In fact, we find that when the discount factor is close to one then the single-sample variant can be expected to be more efficient in the sense of yielding smaller errors using fewer samples. The motivation of this comparison is to get prepared for the case when the samples are given or when they are expensive to generate for some reason.

Our results come in the form of high-probability bounds on the performance as a function of the number of samples generated, some properties of the MDP and the function class used for approximating the value functions. We will compare our bounds to those available in supervised learning (regression), where alike bounds have two terms: one bounding the bias of the algorithm, the other bounding the variance, or estimation error. The term bounding the bias decreases when the approximation power of the function class is increased (hence this term is occasionally called the approximation error term). The term bounding the variance decreases as the number of samples is increased, but increases when the richness of the function class is increased.

Although our bounds are similar to bounds of supervised learning, there are some notable differences. In regression estimation, the approximation power of the function set is usually measured w.r.t. (with respect to) some fixed reference class $\mathcal{G}$:

$d(\mathcal{G}, \mathcal{F}) = \sup_{g \in \mathcal{G}} \inf_{f \in \mathcal{F}} \|f - g\|.$

The reference class $\mathcal{G}$ is typically a classical smoothness class, such as a Lipschitz space. This measure is inadequate for our purposes since in the counterexamples of Baird (1995) and Tsitsiklis and Van Roy (1996) the target function (whatever function space it belongs to) can be approximated with zero error, but FVI still exhibits unbounded errors. In fact, our bounds use a different characterization of the approximation power of the function class $\mathcal{F}$, which we call the inherent Bellman error of $\mathcal{F}$:

$d(T\mathcal{F}, \mathcal{F}) = \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|f - Tg\|.$

Here $T$ is the Bellman operator underlying the MDP (capturing the essential properties of the MDP's dynamics) and $\|\cdot\|$ is an appropriate weighted $p$-norm that is chosen by the user (the exact definitions will be given in Section 2). Observe that no external reference class is used in the definition of $d(T\mathcal{F}, \mathcal{F})$: the inherent Bellman error reflects how well the function space $\mathcal{F}$ is 'aligned' to the Bellman operator, that is, the dynamics of the MDP. In the above-mentioned counterexamples the inherent Bellman error of the chosen function space is infinite and so the bound (correctly) indicates the possibility of divergence.

The bounds on the variance are closer to their regression counterparts: Just like in regression, our variance bounds depend on the capacity of the function space used and decay polynomially with the number of samples.
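For intuition, the inherent Bellman error $d(T\mathcal{F},\mathcal{F})$ can be approximated numerically in toy settings. The sketch below, with an entirely made-up three-state MDP and a one-parameter linear class $\mathcal{F} = \{\theta\,\phi\}$, approximates the outer supremum by a grid over $\theta$ and computes the inner infimum in closed form as a weighted least-squares projection (the $p = 2$ case); all numbers and names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical tiny MDP (illustrative, not from the paper):
# 3 states, 2 actions, transition tensor P[a, x, y], expected reward r[x, a].
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # each P[a, x, :] is a distribution
r = rng.random((n_states, n_actions))
mu = np.full(n_states, 1.0 / n_states)   # weighting distribution mu

phi = np.array([0.0, 1.0, 2.0])          # single feature; F = {theta * phi}

def T(V):
    """Bellman optimality operator on a finite MDP."""
    return np.max(r + gamma * np.einsum('axy,y->xa', P, V), axis=1)

def proj_dist(target):
    """inf_{f in F} ||f - target||_{2,mu} via weighted least squares (p = 2)."""
    theta = np.sum(mu * phi * target) / np.sum(mu * phi * phi)
    return np.sqrt(np.sum(mu * (theta * phi - target) ** 2))

# d(TF, F) approximated by the sup over a finite grid of theta values.
d_TF_F = max(proj_dist(T(theta * phi)) for theta in np.linspace(-5, 5, 201))
print(d_TF_F)
```

The grid over $\theta$ only lower-bounds the true supremum, but for a one-dimensional bounded parameter set it conveys the idea: a strictly positive value signals that some $Tg$ with $g \in \mathcal{F}$ falls outside $\mathcal{F}$.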
However, the rate of decay is slightly worse than the corresponding rate of (optimal) regression procedures. The difference comes from the bias of approximating the maximum of expected values by the maximum of sample averages. Nevertheless, the bounds still indicate that the maximal error of the procedure, in the limit when the number of samples grows to infinity, converges to a finite positive number. This in turn is controlled by the inherent Bellman error of the function space.

As was already hinted above, our bounds display the usual bias-variance tradeoff: In order to keep both the approximation and estimation errors small, the number of samples and the power of the function class have to be increased simultaneously. When this is done in a principled way, the resulting algorithm becomes consistent: its error in the limit disappears for a large class of MDPs. Consistency is an important property of any MDP algorithm. If an algorithm fails to prove to be consistent, we would be suspicious about its use.

Our bounds apply only to those MDPs whose so-called discounted-average concentrability of future-state distributions is finite. The precise meaning of this will be given in Section 5; here we note in passing that this condition holds trivially in every finite MDP, and also, more generally, if the MDP's transition kernel possesses a bounded density. This latter class of MDPs has been considered in many previous theoretical works (e.g., Chow and Tsitsiklis, 1989, 1991; Rust, 1996b; Szepesvári, 2001). In fact, this class of MDPs is quite large in the sense that it contains hard instances. This is discussed in some detail in Section 8. As far as practical examples are concerned, let us mention that resource allocation problems will typically have this property. We will also show a connection between our concentrability factor and Lyapunov exponents, well known from the stability analysis of dynamical systems.

Our proofs build on a recent technique proposed by Munos (2003) that allows the propagation of weighted $p$-norm losses in approximate value iteration. In contrast, most previous analyses of FVI relied on propagating errors w.r.t. the maximum norm (Gordon, 1995; Tsitsiklis and Van Roy, 1996). The advantage of using $p$-norm loss bounds is that it allows the analysis of algorithms that use $p$-norm fitting (in particular, 2-norm fitting). Unlike Munos (2003) and the follow-up work (Munos, 2005), we explicitly deal with infinite state spaces and the effects of using a finite random sample, that is, the bias-variance dilemma and the consistency of sampling-based FVI.

The paper is organized as follows: In the next section we introduce the concepts and notation used in the rest of the paper. The problem is formally defined and the algorithms are given in Section 3. Next, we develop finite-sample bounds for the error committed in a single iteration (Section 4). This bound is used in proving our main results in Section 5. We extend these results to the problem of obtaining a good policy in Section 6, followed by a construction that allows one to achieve asymptotic consistency when the unknown MDP is smooth with an unknown smoothness factor (Section 7). The relationship to previous works is discussed in detail in Section 8. An experiment in a simulated environment, highlighting the main points of the analysis, is given in Section 9. The proofs of the statements are given in the Appendix.

2. Markovian Decision Processes

A discounted Markovian Decision Process (discounted MDP) is a 5-tuple $(\mathcal{X}, \mathcal{A}, P, S, \gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition probability kernel, $S$ is the reward kernel and $0 \le \gamma < 1$ is the discount factor (Bertsekas and Shreve, 1978; Puterman, 1994). In this paper we consider continuous state space, finite action MDPs (i.e., $|\mathcal{A}| < +\infty$). For the sake of
simplicity we assume that $\mathcal{X}$ is a bounded, closed subset of a Euclidean space, $\mathbb{R}^d$. The system of Borel-measurable sets of $\mathcal{X}$ shall be denoted by $\mathcal{B}(\mathcal{X})$.

The interpretation of an MDP as a control problem is as follows: Each initial state $X_0$ and action sequence $a_0, a_1, \ldots$ gives rise to a sequence of states $X_1, X_2, \ldots$ and rewards $R_1, R_2, \ldots$ satisfying, for any Borel-measurable sets $B$ and $C$, the equalities

$P(X_{t+1} \in B \mid X_t = x, a_t = a) = P(B \mid x, a)$, and $P(R_t \in C \mid X_t = x, a_t = a) = S(C \mid x, a)$.

Equivalently, we write $X_{t+1} \sim P(\cdot \mid X_t, a_t)$, $R_t \sim S(\cdot \mid X_t, a_t)$. In words, we say that when action $a_t$ is executed from state $X_t = x$ the process makes a transition from $x$ to the next state $X_{t+1}$ and a reward, $R_t$, is incurred. The history of the process up to time $t$ is $H_t = (X_0, a_0, R_0, \ldots, X_{t-1}, a_{t-1}, R_{t-1}, X_t)$. We assume that the random rewards $\{R_t\}$ are bounded by some positive number $\hat{R}_{\max}$, w.p.1 (with probability one).¹

A policy is a sequence of functions that maps possible histories to probability distributions over the space of actions. Hence if the space of histories at time step $t$ is denoted by $\mathcal{H}_t$ then a policy $\pi$ is a sequence $\pi_0, \pi_1, \ldots$, where $\pi_t$ maps $\mathcal{H}_t$ to $M(\mathcal{A})$, the space of all probability distributions over $\mathcal{A}$.² 'Following a policy' means that for any time step $t$, given the history $x_0, a_0, \ldots, x_t$, the probability of selecting an action $a$ equals $\pi_t(x_0, a_0, \ldots, x_t)(a)$. A policy is called stationary if $\pi_t$ depends only on the last state visited. Equivalently, a policy $\pi = (\pi_0, \pi_1, \ldots)$ is called stationary if $\pi_t(x_0, a_0, \ldots, x_t) = \pi_0(x_t)$ holds for all $t \ge 0$. A policy is called deterministic if for any history $x_0, a_0, \ldots, x_t$ there exists some action $a$ such that $\pi_t(x_0, a_0, \ldots, x_t)$ is concentrated on this action. Hence, any deterministic stationary policy can be identified with some mapping from the state space to the action space and so in the following, at the price of abusing the notation and the terminology slightly, we will call such mappings policies, too.

The goal is to find a policy $\pi$ that maximizes the expected total discounted reward given any initial state. Under this criterion the value of a policy $\pi$ at a state $x \in \mathcal{X}$ is given by

$V^{\pi}(x) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R_t^{\pi} \,\middle|\, X_0 = x\right],$

where $R_t^{\pi}$ is the reward incurred at time $t$ when executing policy $\pi$. The optimal expected total discounted reward when the process is started from state $x$ shall be denoted by $V^*(x)$; $V^*$ is called the optimal value function and is defined by $V^*(x) = \sup_{\pi} V^{\pi}(x)$. A policy $\pi$ is called optimal if it attains the optimal values for any state $x \in \mathcal{X}$, that is, if $V^{\pi}(x) = V^*(x)$ for all $x \in \mathcal{X}$. We also let $Q^*(x, a)$ denote the long-term total expected discounted reward when the process is started from state $x$, the first executed action is $a$, and it is assumed that after the first step an optimal policy is followed. Since by assumption the action space is finite, the rewards are bounded, and we assume discounting, the existence of deterministic stationary optimal policies is guaranteed (Bertsekas and Shreve, 1978).

1. This condition could be replaced by a standard moment condition on the random rewards (Györfi et al., 2002) without changing the results.
2. In fact, $\pi_t$ must be a measurable mapping so that we are allowed to talk about the probability of executing an action. Measurability issues by now are well understood and hence we shall not deal with them here. For a complete treatment the interested reader is referred to Bertsekas and Shreve (1978).

Let us now introduce a few function spaces and operators that will be needed in the rest of the paper. Let us denote the space of bounded measurable functions with domain $\mathcal{X}$ by $B(\mathcal{X})$. Further, the space of measurable functions bounded by $0 \le V_{\max} < +\infty$ shall be denoted by $B(\mathcal{X}; V_{\max})$. A deterministic stationary policy $\pi : \mathcal{X} \to \mathcal{A}$ defines the transition probability kernel $P^{\pi}$ according to $P^{\pi}(dy \mid x) = P(dy \mid x, \pi(x))$. From this kernel two related operators are derived: a right-linear operator $P^{\pi} : B(\mathcal{X}) \to B(\mathcal{X})$, defined by

$(P^{\pi} V)(x) = \int V(y)\, P^{\pi}(dy \mid x),$

and a left-linear operator $P^{\pi} : M(\mathcal{X}) \to M(\mathcal{X})$, defined by

$(\mu P^{\pi})(dy) = \int P^{\pi}(dy \mid x)\, \mu(dx).$

Here $\mu \in M(\mathcal{X})$ and $M(\mathcal{X})$ is the space of all probability distributions over $\mathcal{X}$. In words, $(P^{\pi} V)(x)$ is the expected value of $V$ after following $\pi$ for a single time-step when starting from $x$, and $\mu P^{\pi}$ is the distribution of states if the system is started from $X_0 \sim \mu$ and policy $\pi$ is followed for a single time-step. The product of two kernels $P^{\pi_1}$ and $P^{\pi_2}$ is defined in the natural way:

$(P^{\pi_1} P^{\pi_2})(dz \mid x) = \int P^{\pi_1}(dy \mid x)\, P^{\pi_2}(dz \mid y).$

Hence, $\mu P^{\pi_1} P^{\pi_2}$ is the distribution of states if the system is started from $X_0 \sim \mu$, policy $\pi_1$ is followed for the first step and then policy $\pi_2$ is followed for the second step. The interpretation of $(P^{\pi_1} P^{\pi_2} V)(x)$ is similar.

We say that a (deterministic stationary) policy $\pi$ is greedy w.r.t. a function $V \in B(\mathcal{X})$ if, for all $x \in \mathcal{X}$,

$\pi(x) \in \arg\max_{a \in \mathcal{A}} \left[ r(x, a) + \gamma \int V(y)\, P(dy \mid x, a) \right],$

where $r(x, a) = \int z\, S(dz \mid x, a)$ is the expected reward of executing action $a$ in state $x$. We assume that $r$ is a bounded, measurable function. Actions maximizing $r(x, a) + \gamma \int V(y)\, P(dy \mid x, a)$ are said to be greedy w.r.t. $V$. Since $\mathcal{A}$ is finite, the set of greedy actions is non-empty for any function $V$.

Define the operator $T : B(\mathcal{X}) \to B(\mathcal{X})$ by

$(TV)(x) = \max_{a \in \mathcal{A}} \left[ r(x, a) + \gamma \int V(y)\, P(dy \mid x, a) \right], \quad V \in B(\mathcal{X}).$

Operator $T$ is called the Bellman operator underlying the MDP. Similarly, to any stationary deterministic policy $\pi$ there corresponds an operator $T^{\pi} : B(\mathcal{X}) \to B(\mathcal{X})$ defined by

$(T^{\pi} V)(x) = r(x, \pi(x)) + \gamma (P^{\pi} V)(x).$

It is well known that $T$ is a contraction mapping in supremum norm with contraction coefficient $\gamma$: $\|TV - TV'\|_{\infty} \le \gamma \|V - V'\|_{\infty}$. Hence, by Banach's fixed-point theorem, $T$ possesses a unique fixed point. Moreover, this fixed point turns out to be equal to the optimal value function, $V^*$. Then a simple contraction argument shows that the so-called value-iteration algorithm,

$V_{k+1} = T V_k,$

with arbitrary $V_0 \in B(\mathcal{X})$, yields a sequence of iterates $V_k$ that converges to $V^*$ at a geometric rate. The contraction arguments also show that if $|r(x, a)|$ is bounded by $R_{\max} > 0$ then $V^*$ is bounded by $R_{\max}/(1-\gamma)$ and if $V_0 \in B(\mathcal{X}; R_{\max}/(1-\gamma))$ then the same holds for the $V_k$, too. Proofs of these statements can be found in many textbooks, such as in that of Bertsekas and Shreve (1978).

Our initial set of assumptions on the class of MDPs considered is summarized as follows:

Assumption A0 [MDP Regularity] The MDP $(\mathcal{X}, \mathcal{A}, P, S, \gamma)$ satisfies the following conditions: $\mathcal{X}$ is a bounded, closed subset of some Euclidean space, $\mathcal{A}$ is finite and the discount factor satisfies $0 \le \gamma < 1$. The reward kernel $S$ is such that the immediate reward function $r$ is a bounded measurable function with bound $R_{\max}$. Further, the support of $S(\cdot \mid x, a)$ is included in $[-\hat{R}_{\max}, \hat{R}_{\max}]$ independently of $(x, a) \in \mathcal{X} \times \mathcal{A}$.

Let $\mu$ be a distribution over $\mathcal{X}$. For a real-valued measurable function $g$ defined over $\mathcal{X}$, $\|g\|_{p,\mu}$ is defined by $\|g\|_{p,\mu}^p = \int |g(x)|^p\, \mu(dx)$. The space of functions with bounded $\|\cdot\|_{p,\mu}$-norm shall be denoted by $L^p(\mathcal{X}; \mu)$.

3. Sampling-based Fitted Value Iteration

The parameters of sampling-based FVI are a distribution $\mu \in M(\mathcal{X})$, a function set $\mathcal{F} \subset B(\mathcal{X})$, an initial value function $V_0 \in \mathcal{F}$, and the integers $N$, $M$ and $K$. The algorithm works by computing a series of functions $V_1, V_2, \ldots \in \mathcal{F}$ in a recursive manner. The $(k+1)$th iterate is obtained from the $k$th function as follows: First a Monte-Carlo estimate of $T V_k$ is computed at a number of random states $(X_i)_{1 \le i \le N}$:

$\hat{V}(X_i) = \max_{a \in \mathcal{A}} \frac{1}{M} \sum_{j=1}^{M} \left[ R_j^{X_i, a} + \gamma V_k(Y_j^{X_i, a}) \right], \quad i = 1, 2, \ldots, N.$

Here the base points $X_1, \ldots, X_N$ are sampled from the distribution $\mu$, independently of each other. For each of these base points and for each possible action $a \in \mathcal{A}$, the next states $Y_j^{X_i, a} \in \mathcal{X}$ and rewards $R_j^{X_i, a} \in \mathbb{R}$ are drawn via the help of the generative model of the MDP:

$Y_j^{X_i, a} \sim P(\cdot \mid X_i, a), \quad R_j^{X_i, a} \sim S(\cdot \mid X_i, a) \quad (j = 1, 2, \ldots, M,\ i = 1, \ldots, N).$

By assumption, $(Y_j^{X_i, a}, R_j^{X_i, a})$ and $(Y_{j'}^{X_{i'}, a'}, R_{j'}^{X_{i'}, a'})$ are independent of each other whenever $(i, j, a) \ne (i', j', a')$. The next iterate $V_{k+1}$ is obtained as the best fit in $\mathcal{F}$ to the data $(X_i, \hat{V}(X_i))_{i=1,2,\ldots,N}$ w.r.t. the $p$-norm based empirical loss:

$V_{k+1} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} \left| f(X_i) - \hat{V}(X_i) \right|^p. \qquad (1)$

These steps are iterated $K > 0$ times, yielding the sequence $V_1, \ldots, V_K$.

We study two variants of this algorithm: When a fresh sample is generated in each iteration, we call the algorithm the multi-sample variant. The total number of samples used by the multi-sample algorithm is thus $KNM$. Since in a single iteration only a fraction of these samples is used, one may wonder if it were more sample-efficient to use all the samples in all the iterations.³

3. Sample-efficiency becomes an issue when the sample generation process is not controlled (the samples are given) or when it is expensive to generate the samples due to the high cost of simulation.
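The two steps above (Monte-Carlo backup at random base points, then a $p$-norm fit) can be sketched in a few lines. The sketch below is a minimal multi-sample variant with $p = 2$ on an invented one-dimensional MDP; the dynamics, reward, feature map, and all constants are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, N, M, K = 0.9, 200, 10, 20
actions = [-1.0, 1.0]

# Hypothetical generative model on X = [0, 1] (illustrative, not from the paper):
# next state is a clipped noisy drift, reward favors staying near the middle.
def step(x, a):
    y = np.clip(x + 0.1 * a + 0.05 * rng.standard_normal(x.shape), 0.0, 1.0)
    r = 1.0 - np.abs(y - 0.5)
    return y, r

# F: linear span of a fixed polynomial feature map, fit by least squares (p = 2).
def features(x):
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

theta = np.zeros(4)                      # V_0 = 0
for k in range(K):
    X = rng.random(N)                    # base points X_i ~ mu = Uniform[0, 1]
    backups = np.full(N, -np.inf)
    for a in actions:                    # hat V(X_i): max_a of Monte-Carlo averages
        est = np.zeros(N)
        for _ in range(M):
            Y, R = step(X, a)
            est += R + gamma * features(Y) @ theta
        backups = np.maximum(backups, est / M)
    # V_{k+1} = argmin_{f in F} sum_i |f(X_i) - hat V(X_i)|^2   (Equation 1, p = 2)
    theta, *_ = np.linalg.lstsq(features(X), backups, rcond=None)
```

Because a fresh `X`, `Y`, `R` set is drawn every pass through the loop, this is the multi-sample variant; reusing one fixed sample set across all $K$ iterations would give the single-sample variant discussed next.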
We shall call this version of the algorithm the single-sample variant (details will be given in Section 5). A possible counterargument against the single-sample variant is that, since the samples used in subsequent iterations are correlated, the bias due to sampling errors may get amplified. One of our interesting theoretical findings is that the bias-amplification effect is not too severe and, in fact, the single-sample variant of the algorithm is well behaving and can be expected to outperform the multi-sample variant. In the experiments we will see such a case.

Let us now discuss the choice of the function space $\mathcal{F}$. Generally, $\mathcal{F}$ is selected to be a finitely parameterized class of functions:

$\mathcal{F} = \{ f_{\theta} \in B(\mathcal{X}) \mid \theta \in \Theta \}, \quad \dim(\Theta) < +\infty.$

Our results apply to both linear ($f_{\theta}(x) = \theta^T \phi(x)$) and non-linear ($f_{\theta}(x) = f(x; \theta)$) parameterizations, such as wavelet-based approximations, multi-layer neural networks or kernel-based regression techniques. Another possibility is to use the kernel mapping idea underlying many recent state-of-the-art supervised-learning methods, such as support-vector machines, support-vector regression or Gaussian processes (Cristianini and Shawe-Taylor, 2000). Given a (positive definite) kernel function $K$, $\mathcal{F}$ can be chosen as a closed convex subset of the reproducing-kernel Hilbert-space (RKHS) associated to $K$. In this case the optimization problem (1) still admits a finite, closed-form solution despite the fact that the function space $\mathcal{F}$ cannot be written in the above form for any finite dimensional parameter space (Kimeldorf and Wahba, 1971; Schölkopf and Smola, 2002).

4. Approximating the Bellman Operator

The purpose of this section is to bound the error introduced in a single iteration of the algorithm. There are two components of this error: the approximation error caused by projecting the iterates into the function space $\mathcal{F}$ and the estimation error caused by using a finite, random sample.

The approximation error can be best explained by introducing the metric projection operator: Fix the sampling distribution $\mu \in M(\mathcal{X})$ and let $p \ge 1$. The metric projection of $TV$ onto $\mathcal{F}$ w.r.t. the $\mu$-weighted $p$-norm is defined by

$\Pi_{\mathcal{F}} TV = \arg\min_{f \in \mathcal{F}} \|f - TV\|_{p,\mu}.$

Here $\Pi_{\mathcal{F}} : L^p(\mathcal{X}; \mu) \to \mathcal{F}$; for $g \in L^p(\mathcal{X}; \mu)$, $\Pi_{\mathcal{F}} g$ gives the best approximation to $g$ in $\mathcal{F}$.⁴ The approximation error in the $k$th step, for $V = V_k$, is $d_{p,\mu}(TV, \mathcal{F}) = \|\Pi_{\mathcal{F}} TV - TV\|_{p,\mu}$. More generally, we let

$d_{p,\mu}(TV, \mathcal{F}) = \inf_{f \in \mathcal{F}} \|f - TV\|_{p,\mu}.$

Hence, the approximation error can be made small by selecting $\mathcal{F}$ to be large enough. We shall discuss how this can be accomplished for a large class of MDPs in Section 7.

4. The existence and uniqueness of best approximations is one of the fundamental problems of approximation theory. Existence can be guaranteed under fairly mild conditions, such as the compactness of $\mathcal{F}$ w.r.t. $\|\cdot\|_{p,\mu}$, or if $\mathcal{F}$ is finite dimensional (Cheney, 1966). Since the metric projection operator is needed for discussion purposes only, here we simply assume that $\Pi_{\mathcal{F}}$ is well-defined.

Let us now turn to the discussion of the estimation error. In the $k$th iteration, given $V = V_k$, the function $V' = V_{k+1}$ is computed as follows:

$\hat{V}(X_i) = \max_{a \in \mathcal{A}} \frac{1}{M} \sum_{j=1}^{M} \left[ R_j^{X_i, a} + \gamma V(Y_j^{X_i, a}) \right], \quad i = 1, 2, \ldots, N, \qquad (2)$

$V' = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} \left| f(X_i) - \hat{V}(X_i) \right|^p, \qquad (3)$

where the random samples satisfy the conditions of the previous section and, for the sake of simplicity, we assume that the minimizer in Equation (3) exists. Clearly, for a fixed $X_i$, $\max_a \frac{1}{M} \sum_{j=1}^{M} (R_j^{X_i,a} + \gamma V(Y_j^{X_i,a})) \to (TV)(X_i)$ as $M \to \infty$, w.p.1. Hence, for large enough $M$, $\hat{V}(X_i)$ is a good approximation to $(TV)(X_i)$. On the other hand, if $N$ is big then for any $(f, g) \in \mathcal{F} \times \mathcal{F}$, the empirical $p$-norm loss, $\frac{1}{N} \sum_{i=1}^{N} |f(X_i) - g(X_i)|^p$, is a good approximation to the true loss $\|f - g\|_{p,\mu}^p$. Hence, we expect the minimizer of (3) to be close to the minimizer of $\|f - TV\|_{p,\mu}$. Since the function $x^p$ is strictly increasing for $x > 0$, $p > 0$, the minimizer of $\|f - TV\|_{p,\mu}$ over $\mathcal{F}$ is just the metric projection of $TV$ on $\mathcal{F}$, hence for $N, M$ big, $V'$ can be expected to be close to $\Pi_{\mathcal{F}} TV$.

Note that Equation (3) looks like an ordinary $p$-norm function fitting. One difference though is that in regression the target function equals the regressor $g(x) = \mathbb{E}[\hat{V}(X_i) \mid X_i = x]$, whilst in our case the target function is $TV$ and $TV \ne g$. This is because

$\mathbb{E}\!\left[ \max_{a \in \mathcal{A}} \frac{1}{M} \sum_{j=1}^{M} \left[ R_j^{X_i, a} + \gamma V(Y_j^{X_i, a}) \right] \,\middle|\, X_i \right] \;\ge\; \max_{a \in \mathcal{A}} \mathbb{E}\!\left[ \frac{1}{M} \sum_{j=1}^{M} \left[ R_j^{X_i, a} + \gamma V(Y_j^{X_i, a}) \right] \,\middle|\, X_i \right].$

In fact, if we had an equality here then we would have no reason to set $M > 1$: in a pure regression setting it is always better to have a completely fresh pair of samples than to have a pair where the covariate is set to be equal to some previous sample. Due to $M > 1$, the rate of convergence with the sample size of sampling-based FVI will be slightly worse than the rates available for regression.

Above we argued that for $N$ large enough and for a fixed pair $(f, g) \in \mathcal{F} \times \mathcal{F}$, the empirical loss will approximate the true loss, that is, the estimation error will be small. However, we need this property to hold for $V'$. Since $V'$ is the minimizer of the empirical loss, it depends on the random samples and hence is a random object by itself, and so the argument that the estimation error is small for any fixed, deterministic pair of functions cannot be used with $V'$. This is, however, the situation in supervised learning problems, too. A simple idea developed there is to bound the estimation error of $V'$ by the worst-case estimation error over $\mathcal{F}$:

$\left| \frac{1}{N} \sum_{i=1}^{N} |V'(X_i) - g(X_i)|^p - \|V' - g\|_{p,\mu}^p \right| \;\le\; \sup_{f \in \mathcal{F}} \left| \frac{1}{N} \sum_{i=1}^{N} |f(X_i) - g(X_i)|^p - \|f - g\|_{p,\mu}^p \right|.$

This inequality holds w.p.1 since for any random event $\omega$, $V' = V'(\omega)$ is an element of $\mathcal{F}$. The right-hand side here is the maximal deviation of a large number of empirical averages from their respective means. The behavior of this quantity is the main focus of empirical process theory and we shall use the tools developed there, in particular Pollard's tail inequality (cf. Theorem 5 in Appendix A).

When bounding the size of maximal deviations, the size of the function set becomes a major factor. When the function set has a finite number of elements, a bound follows by exponential tail inequalities and a union bounding argument.
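The inequality noted above, that the expectation of the maximum of sample averages exceeds the maximum of expectations, is easy to see in a small simulation. The sketch below uses two hypothetical actions whose backup values both have mean zero (an invented setup for illustration); the max over the two noisy sample averages is biased upward.

```python
import numpy as np

rng = np.random.default_rng(2)
M, trials = 10, 100000
# Two hypothetical actions with the same true expected backup value, 0.0:
# each sample average over M draws fluctuates, and the max picks the luckier one.
samples = rng.standard_normal((trials, 2, M))   # (trial, action, j)
sample_means = samples.mean(axis=2)             # (1/M) sum_j ...
lhs = sample_means.max(axis=1).mean()           # E[ max_a of sample averages ]
rhs = sample_means.mean(axis=0).max()           # max_a of expected values, ~ 0
print(lhs, rhs)
```

Here `lhs` comes out noticeably positive while `rhs` stays near zero, which is exactly the bias that forces $M > 1$ and costs the extra factor in the convergence rate.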
When $\mathcal{F}$ is infinite, the 'capacity' of $\mathcal{F}$, measured by the (empirical) covering number of $\mathcal{F}$, can be used to derive an appropriate bound: Let $\varepsilon > 0$, $q \ge 1$, and $x^{1:N} \stackrel{\mathrm{def}}{=} (x_1, \ldots, x_N) \in \mathcal{X}^N$ be fixed. The $(\varepsilon, q)$-covering number of the set

$\mathcal{F}(x^{1:N}) = \{ (f(x_1), \ldots, f(x_N)) \mid f \in \mathcal{F} \}$

is the smallest integer $m$ such that $\mathcal{F}(x^{1:N})$ can be covered by $m$ balls of the normed space $(\mathbb{R}^N, \|\cdot\|_q)$ with centers in $\mathcal{F}(x^{1:N})$ and radius $N^{1/q}\varepsilon$. The $(\varepsilon, q)$-covering number of $\mathcal{F}(x^{1:N})$ is denoted by $\mathcal{N}_q(\varepsilon, \mathcal{F}(x^{1:N}))$. When $q = 1$, we use $\mathcal{N}$ instead of $\mathcal{N}_1$. When the $X^{1:N}$ are i.i.d. with common underlying distribution $\mu$, then $\mathbb{E}[\mathcal{N}_q(\varepsilon, \mathcal{F}(X^{1:N}))]$ shall be denoted by $\mathcal{N}_q(\varepsilon, \mathcal{F}, N, \mu)$. By Jensen's inequality, $\mathcal{N}_p \le \mathcal{N}_q$ for $p \le q$. The logarithm of $\mathcal{N}_q$ is called the $q$-norm metric entropy of $\mathcal{F}$. For $q = 1$, we shall call $\log \mathcal{N}_1$ the metric entropy of $\mathcal{F}$ (without any qualifiers).

The idea underlying covering numbers is that what really matters when bounding maximal deviations is how much the functions in the function space vary at the actual samples. Of course, without imposing any conditions on the function space, covering numbers can grow as a function of the sample size. For specific choices of $\mathcal{F}$, however, it is possible to bound the covering numbers of $\mathcal{F}$ independently of the number of samples. In fact, according to a well-known result due to Haussler (1995), covering numbers can be bounded as a function of the so-called pseudo-dimension of the function class. The pseudo-dimension, or VC-subgraph dimension, $V_{\mathcal{F}^+}$ of $\mathcal{F}$ is defined as the VC-dimension of the subgraphs of functions in $\mathcal{F}$.⁵ The following statement gives the bound that does not depend on the number of sample points:

Proposition 1 (Haussler (1995), Corollary 3) For any set $\mathcal{X}$, any points $x^{1:N} \in \mathcal{X}^N$, any class $\mathcal{F}$ of functions on $\mathcal{X}$ taking values in $[0, L]$ with pseudo-dimension $V_{\mathcal{F}^+} < \infty$, and any $\varepsilon > 0$,

$\mathcal{N}(\varepsilon, \mathcal{F}(x^{1:N})) \le e\,(V_{\mathcal{F}^+} + 1)\left( \frac{2 e L}{\varepsilon} \right)^{V_{\mathcal{F}^+}}.$

For a given set of functions $\mathcal{F}$ let $a + \mathcal{F}$ denote the set of functions shifted by the constant $a$: $a + \mathcal{F} = \{ f + a \mid f \in \mathcal{F} \}$. Clearly, neither the pseudo-dimension nor covering numbers are changed by shifts. This allows one to extend Proposition 1 to function sets with functions taking values in $[-L, +L]$.

Bounds on the pseudo-dimension are known for many popular function classes, including linearly parameterized function classes, multi-layer perceptrons, radial basis function networks, and several non- and semi-parametric function classes (cf. Niyogi and Girosi, 1999; Anthony and Bartlett, 1999; Györfi et al., 2002; Zhang, 2002, and the references therein). If $q$ is the dimensionality of the function space, these bounds take the form $O(\log q)$, $O(q)$ or $O(q \log q)$.⁶

Another route to get a useful bound on the number of samples is to derive an upper bound on the metric entropy that grows with the number of samples at a sublinear rate. As an example, consider the following class of bounded-weight, linearly parameterized functions:

$\mathcal{F}_A = \{ f_{\theta} : \mathcal{X} \to \mathbb{R} \mid f_{\theta}(x) = \theta^T \phi(x),\ \|\theta\|_q \le A \}.$

5. The VC-dimension of a set system is defined as follows (Sauer, 1972; Vapnik and Chervonenkis, 1971): Given a set system $\mathcal{C}$ with base set $U$, we say that $\mathcal{C}$ shatters the points of $A \subset U$ if all possible $2^{|A|}$ subsets of $A$ can be obtained by intersecting $A$ with elements of $\mathcal{C}$. The VC-dimension of $\mathcal{C}$ is the cardinality of the largest subset $A \subset U$ that can be shattered.
6. Again, similar bounds are known to hold for the supremum-norm metric entropy.

It is known that for finite-dimensional smooth parametric classes the metric entropy scales with $\dim(\phi) < +\infty$. If $\phi$ is the feature map associated with some positive definite kernel function $K$ then $\phi$ can be infinite dimensional (this class arises if one 'kernelizes' FVI). In this case the bounds due to Zhang (2002) can be used. These bound the metric entropy by $\lceil A^2 B^2 / \varepsilon^2 \rceil \log(2N + 1)$, where $B$ is an upper bound on $\sup_{x \in \mathcal{X}} \|\phi(x)\|_{p'}$ with $p' = 1/(1 - 1/q)$.⁷

4.1 Finite-sample Bounds

The following lemma shows that with high probability, $V'$ is a good approximation to $TV$ when some element of $\mathcal{F}$ is close to $TV$ and the number of samples is high enough:

Lemma 1 Consider an MDP satisfying Assumption A0. Let $V_{\max} = R_{\max}/(1-\gamma)$, fix a real number $p \ge 1$, integers $N, M \ge 1$, $\mu \in M(\mathcal{X})$ and $\mathcal{F} \subset B(\mathcal{X}; V_{\max})$. Pick any $V \in B(\mathcal{X}; V_{\max})$ and let $V' = V'(V, N, M, \mu, \mathcal{F})$ be defined by Equation (3). Let $\mathcal{N}_0(N) = \mathcal{N}\!\left( \frac{1}{8}\left(\frac{\varepsilon}{4}\right)^p, \mathcal{F}, N, \mu \right)$. Then for any $\varepsilon, \delta > 0$,

$\|V' - TV\|_{p,\mu} \le d_{p,\mu}(TV, \mathcal{F}) + \varepsilon$

holds w.p. at least $1 - \delta$ provided that

$N > 128 \left( \frac{8 V_{\max}}{\varepsilon} \right)^{2p} \big( \log(1/\delta) + \log(32\, \mathcal{N}_0(N)) \big) \qquad (4)$

and

$M > \frac{8 (\hat{R}_{\max} + \gamma V_{\max})^2}{\varepsilon^2} \big( \log(1/\delta) + \log(8 N |\mathcal{A}|) \big). \qquad (5)$

As we have seen before, for a large number of choices of $\mathcal{F}$, the metric entropy of $\mathcal{F}$ is independent of $N$. In such cases Equation (4) gives an explicit bound on $N$ and $M$. In the resulting bound, the total number of samples per iteration, $NM$, scales with $\varepsilon^{-(2p+2)}$ (apart from logarithmic terms). The comparable bound for the pure regression setting is $\varepsilon^{-2p}$. The additional quadratic factor is the price to pay because of the biasedness of the values $\hat{V}(X_i)$.

The main ideas of the proof are illustrated using Figure 1 (the proof of the Lemma can be found in Appendix A). The left-hand side of this figure depicts the space of bounded functions over $\mathcal{X}$, while the right-hand side shows a corresponding vector space. The spaces are connected by the mapping $f \mapsto \tilde{f} \stackrel{\mathrm{def}}{=} (f(X_1), \ldots, f(X_N))^T$. In particular, this mapping sends the set $\mathcal{F}$ into the set $\tilde{\mathcal{F}} = \{ \tilde{f} \mid f \in \mathcal{F} \}$. The proof goes by upper bounding the distance between $V'$ and $TV$ in terms of the distance between $f^*$ and $TV$. Here $f^* = \Pi_{\mathcal{F}} TV$ is the best fit to $TV$ in $\mathcal{F}$. The choice of $f^*$ is motivated by the fact that $V'$ is the best fit in $\mathcal{F}$ to the data $(X_i, \hat{V}(X_i))_{i=1,\ldots,N}$ w.r.t. the $p$-norm $\|\cdot\|_p$. The bound is developed by relating a series of distances to each other: In particular, if $N$ is large then $\|V' - TV\|_{p,\mu}$ and $\|\tilde{V}' - \widetilde{TV}\|_p$ are expected to be close to each other. On the other hand, if $M$ is large then $\hat{V}$ and $\widetilde{TV}$ are expected to be close to each other. Hence, $\|\tilde{V}' - \widetilde{TV}\|_p$ and $\|\tilde{V}' - \hat{V}\|_p$ are expected to be close to each other. Now, since $\tilde{V}'$ is the best fit to $\hat{V}$ in $\tilde{\mathcal{F}}$, the distance between $\tilde{V}'$ and $\hat{V}$ is not larger than the distance between the image $\tilde{f}$ of an arbitrary function $f \in \mathcal{F}$ and $\hat{V}$. Choosing $f = f^*$, we conclude that the distance between $\tilde{f}^*$ and $\hat{V}$ is not smaller than $\|\tilde{V}' - \hat{V}\|_p$.

7. Similar bounds exist for the supremum-norm metric entropy.
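Proposition 1 is simple enough to evaluate directly. The snippet below plugs in some made-up numbers (pseudo-dimension 4, range $[0, 1]$, $\varepsilon = 0.1$) to show the key property the text emphasizes: the bound is finite, grows as $\varepsilon$ shrinks, and involves no dependence on $N$ at all.

```python
import math

# Numeric illustration of Proposition 1 (Haussler's covering-number bound):
# N(eps, F(x^{1:N})) <= e * (V + 1) * (2*e*L / eps) ** V, with no N anywhere.
def haussler_bound(V, L, eps):
    return math.e * (V + 1) * (2 * math.e * L / eps) ** V

# Illustrative numbers: pseudo-dimension 4, functions valued in [0, 1], eps = 0.1.
print(haussler_bound(4, 1.0, 0.1))
```

Since the right-hand side never mentions the sample points, the same bound holds for every $N$, which is what makes Equation (4) an explicit sample-size condition for such classes.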
Figure 1: Illustration of the proof of Lemma 1 for bounding the distance of $V'$ and $TV$ in terms of the distance of $f^*$ and $TV$, where $f^*$ is the best fit to $TV$ in $\mathcal{F}$ (cf. Equations 2, 3). For a function $f \in B(\mathcal{X})$, $\tilde{f} = (f(X_1), \ldots, f(X_N))^T \in \mathbb{R}^N$. The set $\tilde{\mathcal{F}}$ is defined by $\{ \tilde{f} \mid f \in \mathcal{F} \}$. Segments on the figure connect objects whose distances are compared in the proof. [The figure shows the two spaces $(B(\mathcal{X}), \|\cdot\|_{p,\mu})$ and $(\mathbb{R}^N, \|\cdot\|_p)$ side by side; it is not reproduced here.]

Exploiting again that $M$ is large, we see that the distance between $\tilde{f}^*$ and $\hat{V}$ must be close to that between $\tilde{f}^*$ and $\widetilde{TV}$, which in turn must be close to the $L^p(\mathcal{X}; \mu)$ distance of $f^*$ and $TV$ if $N$ is big enough. Hence, if $\|f^* - TV\|_{p,\mu}$ is small then so is $\|V' - TV\|_{p,\mu}$.

4.2 Bounds for the Single-Sample Variant

When analyzing the error of sampling-based FVI, we would like to use Lemma 1 for bounding the error committed when approximating $T V_k$ starting from $V_k$ based on a new sample. When doing so, however, we have to take into account that $V_k$ is random. Yet Lemma 1 requires that $V$, the function whose Bellman image is approximated, be some fixed (non-random) function. The problem is easily resolved in the multi-sample variant of the algorithm by noting that the samples used in calculating $V_{k+1}$ are independent of the samples used to calculate $V_k$. A formal argument is presented in Appendix B.3. The same argument, however, does not work for the single-sample variant of the algorithm, where $V_{k+1}$ and $V_k$ are both computed using the same set of random variables. The purpose of this section is to extend Lemma 1 to cover this case. In formulating this result we will need the following definition: For $\mathcal{F} \subset B(\mathcal{X})$ let us define

$\mathcal{F}^{T} = \{ f - Tg \mid f \in \mathcal{F},\ g \in \mathcal{F} \}.$

The following result holds:

Lemma 2 Denote by $\Omega$ the sample space underlying the random variables $\{X_i\}$, $\{Y_j^{X_i, a}\}$, $\{R_j^{X_i, a}\}$, $i = 1, \ldots, N$; $j = 1, \ldots, M$; $a \in \mathcal{A}$. Then the result of Lemma 1 continues to hold if $V$ is a random function satisfying $V(\omega) \in \mathcal{F}$, $\omega \in \Omega$, provided that

$N = O\big( V_{\max}^2 (1/\varepsilon)^{2p} \log\big( \mathcal{N}(c\varepsilon, \mathcal{F}^{T}, N, \mu)/\delta \big) \big)$ and
$M = O\big( (\hat{R}_{\max} + \gamma V_{\max})^2 / \varepsilon^2 \, \log\big( N |\mathcal{A}|\, \mathcal{N}(c'\varepsilon, \mathcal{F}, M, \mu)/\delta \big) \big),$
FINITE-TIME BOUNDS FOR FITTED VALUE ITERATION

where c, c′ > 0 are constants independent of the parameters of the MDP and the function space F.

The proof can be found in Appendix A.1. Note that the sample-size bounds in this lemma are similar to those of Lemma 1, except that N now depends on the metric entropy of F^T and M depends on the metric entropy of F.

Let us now give two examples when explicit bounds on the covering number of F^T can be given using simple means. For the first example note that if g : (R×R, ‖·‖_1) → R is Lipschitz⁸ with Lipschitz constant G then the ε-covering number of the space of functions of the form h(x) = g(f_1(x), f_2(x)), f_1 ∈ F_1, f_2 ∈ F_2, can be bounded by N(ε/(2G), F_1, n, µ) N(ε/(2G), F_2, n, µ) (this follows directly from the definition of covering numbers). Since g(x, y) = x − y is Lipschitz with G = 1,

N(ε, F^T, n, µ) ≤ N(ε/2, F, n, µ) N(ε/2, TF, n, µ).

Hence it suffices to bound the covering numbers of the space TF = { Tf : f ∈ F }. One possibility to do this is as follows. Assume that X is compact, F = { f_θ : θ ∈ Θ }, Θ is compact, and the mapping H : (Θ, ‖·‖) → (B(X), L^∞) defined by H(θ) = f_θ is Lipschitz with coefficient L. Fix x_{1:n} and consider N(ε, TF(x_{1:n})). Let θ_1, θ_2 be arbitrary. Then |Tf_{θ_1}(x) − Tf_{θ_2}(x)| ≤ ‖Tf_{θ_1} − Tf_{θ_2}‖_∞ ≤ γ‖f_{θ_1} − f_{θ_2}‖_∞ ≤ γL‖θ_1 − θ_2‖. Now assume that C = {θ_1, …, θ_m} is an ε/(Lγ)-cover of the space Θ and consider any n ≥ 1, (x_1, …, x_n) ∈ X^n, θ ∈ Θ. Let θ_i be the nearest neighbor of θ in C. Then ‖(Tf_θ)(x_{1:n}) − (Tf_{θ_i})(x_{1:n})‖_1 ≤ n‖Tf_θ − Tf_{θ_i}‖_∞ ≤ nε. Hence, N(ε, TF(x_{1:n})) ≤ N(ε/(Lγ), Θ). Note that the mapping H can be shown to be Lipschitzian for many function spaces of interest. As an example let us consider the space of linearly parameterized functions taking the form f_θ = θ^T φ with a suitable basis function φ : X → R^{d_φ}. By the Cauchy-Schwarz inequality, ‖θ_1^T φ − θ_2^T φ‖_∞ = sup_{x∈X} |⟨θ_1 − θ_2, φ(x)⟩| ≤ ‖θ_1 − θ_2‖_2 sup_{x∈X} ‖φ(x)‖_2. Hence, by choosing the ℓ_2 norm in the space Θ, we get that θ ↦ θ^T φ is Lipschitz with coefficient ‖ ‖φ(·)‖_2 ‖_∞ (this gives a bound on the metric entropy that is linear in d_φ).

5. Main Results

For the sake of specificity, let us reiterate the algorithms. Let V_0 ∈ F. The single-sample variant of sampling-based FVI produces a sequence of functions {V_k}_{0≤k≤K} ⊂ F satisfying

V_{k+1} = argmin_{f∈F} Σ_{i=1}^N | f(X_i) − max_{a∈A} (1/M) Σ_{j=1}^M [ R^j_{X_i,a} + γ V_k(Y^j_{X_i,a}) ] |^p.    (6)

The multi-sample variant is obtained by using a fresh set of samples in each iteration:

V_{k+1} = argmin_{f∈F} Σ_{i=1}^N | f(X^k_i) − max_{a∈A} (1/M) Σ_{j=1}^M [ R^{j,k}_{X^k_i,a} + γ V_k(Y^{j,k}_{X^k_i,a}) ] |^p.    (7)

Let π_k be a greedy policy w.r.t. V_k. We are interested in bounding the loss due to using policy π_k instead of an optimal one, where the loss is measured by a weighted p-norm:

L_k = ‖V* − V^{π_k}‖_{p,ρ}.

Here ρ is a distribution whose role is to put more weight on those parts of the state space where performance matters more. A particularly sensible choice is to set ρ to be the distribution over the states from which we start to use π_k. In this case if p = 1 then L_k measures the expected loss. For p > 1 the loss does not have a similarly simple interpretation, except that with p → ∞ we recover the supremum-norm loss. Hence increasing p generally means that the evaluation becomes more pessimistic.

8. A mapping g between normed function spaces (B_1, ‖·‖) and (B_2, ‖·‖) is Lipschitz with factor C > 0 if for all x, y ∈ B_1, ‖g(x) − g(y)‖ ≤ C‖x − y‖.

Let us now discuss how we arrive at a bound on the expected p-norm loss. By the results of the previous section we have a bound on the error introduced in any given iteration. Hence, all we need to show is that the errors do not blow up as they are propagated through the algorithm. Since the previous section's bounds are given in terms of weighted p-norms, it is natural to develop weighted p-norm bounds for the whole algorithm. Let us concentrate on the case when in all iterations the error committed is bounded. Since we use weighted p-norm bounds, the usual supremum-norm analysis does not work. However, a similar argument can be used. The sketch of this argument is as follows. Since we are interested in developing a bound on the performance of the greedy policy w.r.t. the final estimate of V*, we first develop a pointwise analogue of supremum-norm Bellman-error bounds:

(I − γP^π)(V* − V^π) ≤ γ(P^{π*} − P^π)(V* − V).

Here V plays the role of the final value-function estimate, π is a greedy policy w.r.t. V, and V^π is its value function. Hence, we see that it suffices to develop upper and lower bounds on V* − V with V = V_K. For the upper estimate, we use that V* − TV_k = TV* − TV_k = T^{π*}V* − T^{π_k}V_k ≤ T^{π*}V* − T^{π*}V_k = γP^{π*}(V* − V_k). Hence, if V_{k+1} = TV_k − ε_k then V* − V_{k+1} ≤ γP^{π*}(V* − V_k) + ε_k. An analogous reasoning results in the lower bound V* − V_{k+1} ≥ γP^{π_k}(V* − V_k) + ε_k. Here π_k is a policy greedy w.r.t. V_k. Now, exploiting that the operator P^π is linear for any π, iterating these bounds yields upper and lower bounds on V* − V_K as a function of {ε_k}_k. A crucial step of the argument is to replace T, the non-linear Bellman operator, by linear operators (P^π for suitable π), since propagating errors through linear operators is easy, while in general it is impossible to do the same with non-linear operators. Actually, as we propagate the errors, it is not hard to foresee that operator products of the form P^{π_K} P^{π_{K−1}} ··· P^{π_k} enter our bounds and that the error amplification caused by these product operators is the major source of the possible increase of the error. Note that if a supremum-norm analysis were followed (p = ∞), we would immediately find that the maximum amplification by these product operators is bounded by one: since, as is well known, for any policy π, |∫V(y) P(dy|x, π(x))| ≤ ∫|V(y)| P(dy|x, π(x)) ≤ ‖V‖_∞ ∫P(dy|x, π(x)) = ‖V‖_∞, that is, ‖P^π‖_∞ ≤ 1. Hence

‖P^{π_K} ··· P^{π_k}‖_∞ ≤ ‖P^{π_K}‖_∞ ··· ‖P^{π_k}‖_∞ ≤ 1,

and starting from the pointwise bounds, one recovers the well-known supremum-norm bounds by just taking the supremum of the bounds' two sides. Hence, the pointwise bounding technique yields as tight bounds as the previous supremum-norm bounding technique. However, since in the algorithm only the weighted p-norm errors are controlled, instead of taking the pointwise supremum, we integrate the pointwise bounds w.r.t. the measure ρ to derive the desired p-norm bounds, provided that the induced operator norm of these operator products w.r.t. weighted p-norms can be bounded. One simple assumption that allows this is as follows:

Assumption A1 [Uniformly stochastic transitions] For all x ∈ X and a ∈ A, assume that P(·|x, a) is absolutely continuous w.r.t. µ and the Radon-Nikodym derivative of P w.r.t. µ is bounded uniformly with bound C_µ:

C_µ := sup_{x∈X, a∈A} ‖ dP(·|x, a)/dµ ‖_∞ < +∞.

Assumption A1 can be written in the form P(·|x, a) ≤ C_µ µ(·), an assumption that was introduced by Munos (2003) in a finite MDP context for the analysis of approximate policy iteration. Clearly, if Assumption A1 holds then for p ≥ 1, by Jensen's inequality, |∫V(y) P(dy|x, π(x))|^p ≤ ∫|V(y)|^p P(dy|x, π(x)) ≤ C_µ ∫|V(y)|^p dµ(y), hence ‖P^π V‖_{p,ρ} ≤ C_µ^{1/p} ‖V‖_{p,µ} and thus the induced norm of P^π is bounded by C_µ^{1/p}. Note that when µ is the Lebesgue measure over X then Assumption A1 becomes equivalent to assuming that the transition probability kernel P(dy|x, a) admits a uniformly bounded density. The noisier the dynamics, the smaller the constant C_µ. Although C_µ < +∞ looks like a strong restriction, the class of MDPs that admit this restriction is still quite large in the sense that there are hard instances in it (this is discussed in detail in Section 8). However, the above assumption certainly excludes completely or partially deterministic MDPs, which might be important, for example, in financial applications. Let us now consider another assumption that allows for such systems, too. The idea is that for the analysis we only need to reason about the operator norms of weighted sums of the products of arbitrary stochastic kernels. This motivates the following assumption:

Assumption A2 [Discounted-average concentrability of future-state distributions] Given ρ, µ, m ≥ 1 and an arbitrary sequence of stationary policies {π_m}_{m≥1}, assume that the future-state distribution ρP^{π_1}P^{π_2}···P^{π_m} is absolutely continuous w.r.t. µ. Assume that

c(m) := sup_{π_1,…,π_m} ‖ d(ρP^{π_1}P^{π_2}···P^{π_m})/dµ ‖_∞        (8)

satisfies

C_{ρ,µ} := (1−γ)² Σ_{m≥1} m γ^{m−1} c(m) < +∞.

We shall call c(m) the m-step concentrability of a future-state distribution, while we call C_{ρ,µ} the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution µ. Hence, in general we expect c(m) to grow with m. In fact, the condition that C_{ρ,µ} is finite is a growth rate condition on c(m). Thanks to discounting, C_{ρ,µ} is finite for a reasonably large class of systems: in fact, we will now argue that Assumption A2 is weaker than Assumption A1 and that C_{ρ,µ} is finite when the top-Lyapunov exponent of the MDP is finite.

To show the first statement it suffices to see that c(m) ≤ C_µ holds for any m. This holds since by definition, for any distribution ν and policy π, νP^π ≤ C_µ µ. Then take ν = ρP^{π_1}···P^{π_{m−1}} and π = π_m to conclude that ρP^{π_1}···P^{π_{m−1}}P^{π_m} ≤ C_µ µ and so c(m) ≤ C_µ.

Let us now turn to the comparison with the top-Lyapunov exponent of the MDP. As our starting point we take the definition of top-Lyapunov exponent associated with sequences of finite dimensional matrices: if {P_t}_t is a sequence of square matrices with non-negative entries and {y_t}_t is a sequence of vectors that satisfy y_{t+1} = P_t y_t then, by definition, the top-Lyapunov exponent is γ̂_top = limsup_{t→∞} (1/t) log^+(‖y_t‖_∞). If the top-Lyapunov exponent is positive then the associated system is sensitive to its initial conditions (unstable). A negative top-Lyapunov exponent, on the other hand, indicates that the system is stable; in case of certain stochastic systems the existence of strictly stationary non-anticipating realizations is equivalent to a negative Lyapunov exponent (Bougerol and Picard, 1992).⁹ Now, one may think of y_t as a probability distribution over the state space and the matrices as the transition kernels. One way to generalize the above definition to controlled systems and infinite state spaces is to identify y_t with the future-state distribution when the policies are selected to maximize the growth rate of ‖y_t‖_∞. This gives rise to γ̂_top = limsup_{m→∞} (1/m) log c(m), where c(m) is defined by (8).¹⁰ Then, by elementary arguments, we get that if γ̂_top < log(1/γ) then Σ_{m≥0} m γ^m c(m) < ∞. In fact, if γ̂_top ≤ 0 then C_{ρ,µ} < ∞. Hence, we interpret C_{ρ,µ} < +∞ as a weak stability condition.

Since Assumption A1 is stronger than Assumption A2, in the proofs we will proceed by first developing a proof under Assumption A2. The reason Assumption A1 is still considered is that it will allow us to derive supremum-norm performance bounds even though in the algorithm we control only the weighted p-norm bounds.

As a final preparatory step before the presentation of our main results, let us define the inherent Bellman error associated with the function space F (as in the introduction) by

d_{p,µ}(TF, F) = sup_{f∈F} d_{p,µ}(Tf, F).

Note that d_{p,µ}(TF, F) generalizes the notion of Bellman errors to function spaces in a natural way: as we have seen, the error in iteration k depends on d_{p,µ}(TV̂_k, F). Since V̂_k ∈ F, the inherent Bellman error gives a uniform bound on the errors of the individual iterations.¹¹
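Assumption A2 can be checked concretely for simple growth models of c(m). The snippet below is our own sanity check, with arbitrary numerical values: it verifies that a constant bound c(m) = C_µ, as implied by Assumption A1, yields C_{ρ,µ} = C_µ (since (1−γ)² Σ_{m≥1} m γ^{m−1} = 1), and that a polynomially growing c(m) still gives a finite coefficient.

```python
# Numeric sanity check (ours, not from the paper) of the discounted-average
# concentrability coefficient C_{rho,mu} = (1-gamma)^2 * sum_{m>=1} m*gamma^(m-1)*c(m).
gamma, C_mu = 0.9, 5.0

# Constant c(m) = C_mu (the bound implied by Assumption A1): the coefficient
# collapses to C_mu, because sum_{m>=1} m*gamma^(m-1) = 1/(1-gamma)^2.
C_rho_mu = (1 - gamma)**2 * sum(m * gamma**(m - 1) * C_mu for m in range(1, 20000))
print(C_rho_mu)    # ~= C_mu

# A polynomially growing c(m) = 1 + m still yields a finite coefficient,
# here (1-gamma)^2 * [1/(1-gamma)^2 + (1+gamma)/(1-gamma)^3] = 1 + (1+gamma)/(1-gamma).
C_poly = (1 - gamma)**2 * sum(m * gamma**(m - 1) * (1 + m) for m in range(1, 20000))
print(C_poly)
```

The truncation of the series at m = 20000 is numerically irrelevant here since γ^m decays geometrically.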
The next theorem is the main result of the paper. It states that with high probability the final performance of the policy found by the algorithm can be made as close to a constant times the inherent Bellman error of the function space F as desired by selecting a sufficiently high number of samples. Hence, sampling-based FVI can be used to find near-optimal policies if F is sufficiently rich:

Theorem 2 Consider an MDP satisfying Assumptions A0 and A2. Fix p ≥ 1, µ ∈ M(X) and let V_0 ∈ F ⊂ B(X; V_max). Then for any ε, δ > 0, there exist integers K, M and N such that K is linear in log(1/ε), log V_max and log(1/(1−γ)), and N, M are polynomial in 1/ε, log(1/δ), log(1/(1−γ)), V_max, R̂_max, log(|A|) and log(N(cε(1−γ)²/(C^{1/p}_{ρ,µ}γ), F, N, µ)) for some constant c > 0, such that if the multi-sample variant of sampling-based FVI is run with parameters (N, M, µ, F) and π_K is a policy greedy w.r.t. the Kth iterate then w.p. at least 1−δ,

‖V* − V^{π_K}‖_{p,ρ} ≤ (2γ/(1−γ)²) C^{1/p}_{ρ,µ} d_{p,µ}(TF, F) + ε.

If, instead of Assumption A2, Assumption A1 holds then w.p. at least 1−δ,

‖V* − V^{π_K}‖_∞ ≤ (2γ/(1−γ)²) C^{1/p}_µ d_{p,µ}(TF, F) + ε.

Further, the results continue to hold for the single-sample variant of sampling-based FVI with the exception that N depends on log(N(cε, F^T, N, µ)) and M depends on log(N(c′ε, F, M, µ)) for appropriate c, c′ > 0.

9. The lack of existence of such solutions would probably preclude any sample-based estimation of the system.
10. Here we allow the sequence of policies to be changed with each m. It is an open question whether a single sequence of policies would give the same result.
11. More generally, d_{p,µ}(G, F) := sup_{g∈G} d_{p,µ}(g, F) = sup_{g∈G} inf_{f∈F} ‖g − f‖_{p,µ}.
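To make the algorithm of the theorem concrete, here is a minimal sketch of the multi-sample update (7) for p = 2 with a linear function class F = {w_0 + w_1 x}. The toy MDP, its generative model, and all numerical constants below are illustrative stand-ins of our own, not the setting analyzed in the paper.

```python
import random

# Sketch (ours) of multi-sample FVI, Equation (7), with p = 2 and F linear.
random.seed(1)
GAMMA, ACTIONS = 0.9, (0.0, 0.1)       # discount; two "drift" actions (stand-ins)

def sample_next(x, a):                  # generative model: Y ~ P(.|x, a)
    return min(1.0, max(0.0, x + a + random.gauss(0.0, 0.05)))

def reward(x, a):                       # cost grows with the state
    return -x

def fit_ls(xs, ys):                     # exact least squares for f(x) = w0 + w1*x
    n = len(xs); sx = sum(xs); sxx = sum(x * x for x in xs)
    sy = sum(ys); sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    return ((sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det)

def fvi(K=10, N=200, M=20):
    w = (0.0, 0.0)                      # V_0 = 0
    for _ in range(K):                  # each iteration draws a fresh sample set
        xs = [random.random() for _ in range(N)]          # X_i^k ~ mu
        ys = []
        for x in xs:                    # Monte Carlo backup, then max over actions
            backups = [sum(reward(x, a) + GAMMA * (w[0] + w[1] * sample_next(x, a))
                           for _ in range(M)) / M for a in ACTIONS]
            ys.append(max(backups))
        w = fit_ls(xs, ys)              # V_{k+1} = argmin_f sum_i |f(X_i^k) - target_i|^2
    return w

w = fvi()
print(w)                                # weights of the final iterate V_K
```

Since the reward is −x and the state only drifts upward, the fitted value function should be decreasing in x (w_1 < 0).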
The proof is given in Appendix B. Assuming that the pseudo-dimension of the function space F is finite as in Proposition 1, a close examination of the proof gives the following high-probability bound for the multi-sample variant:

‖V* − V^{π_K}‖_{p,ρ} ≤ (2γ/(1−γ)²) C^{1/p}_{ρ,µ} d_{p,µ}(TF, F) + O(γ^K V_max)
  + O( [ (V_{F^+}/N)(log(N) + log(K/δ)) ]^{1/(2p)} + [ (1/M)(log(N|A|) + log(K/δ)) ]^{1/2} ).    (9)

Here N, M, K are arbitrary integers and the bound holds w.p. 1−δ. The first term bounds the approximation error, the second arises due to the finite number of iterations, while the last two terms bound the estimation error. This form of the bound allows us to reason about the likely best choice of N and M given a fixed budget of n = KNM samples (or n̂ = NM samples per iteration). Indeed, optimizing the bound yields that the best choice of N and M (apart from constants) is given by N = (V_{F^+})^{1/(p+1)} n̂^{p/(p+1)}, M = (n̂/V_{F^+})^{1/(p+1)}, resulting in the bound (n/(K V_{F^+}))^{−1/(2p+2)} for the estimation error, disregarding logarithmic terms. Note that the choice of N, M does not influence the other error terms.

Now, let us consider the single-sample variant of FVI. A careful inspection of the proof results in an inequality identical to (9), just with the pseudo-dimension of F replaced by that of F^T and 1/M replaced by V_{F^+}/M. We may again ask the question of how to choose N, M given a fixed-size budget of n = NM samples. The formulae are similar to the previous ones. The resulting optimized bound on the estimation error is (n/(V_{F^T} V_{F^+}))^{−1/(2p+2)}. It follows that, given a fixed budget of n samples, provided that K > V_{F^T} the bound for the single-sample variant is better than the one for the multi-sample variant. In both cases a logical choice is to set K to minimize the respective bounds. In fact, the optimal choice turns out to be K ∝ 1/log(1/γ) ≈ 1/(1−γ) in both cases. Hence as γ approaches one, the single-sample variant of FVI can be expected to become more efficient, provided that everything else is kept the same. It is interesting to note that as γ becomes larger the number of times the samples are reused increases, too. That the single-sample variant becomes more efficient is because the variance-reduction effect of sample reuse is stronger than the increase of the bias. Our computer simulations (Section 9) confirm this experimentally.

Another way to use the above bound is to make comparisons with the rates available in non-parametric regression. First, notice that the approximation error of F is defined as the inherent Bellman error of F instead of using an external reference class. This seems reasonable since we are trying to find an approximate fixed point of T within F. The estimation error, for a sample size of n, can be seen to be bounded by O(n^{−1/(2(p+1))}), which for p = 2 gives O(n^{−1/6}). In regression, the comparable error (when using a bounding technique similar to ours) is bounded by n^{−1/4} (Györfi et al., 2002). With considerably more work, using the techniques of Lee et al. (1996) (see also Chapter 11 of Györfi et al., 2002), in regression it is possible to get a rate of n^{−1/2}, at the price of multiplying the approximation error by a constant larger than one. It seems possible to use these techniques to improve the exponent of N from 1/(2p) to 1/p in Equation (9) (at the price of increasing the influence of the approximation error). Then the new rate would become n^{−1/4}. This is still worse than the best possible rate for non-parametric regression. The additional factor comes from the need to use the samples to control the bias of the target values (i.e., that we need M → ∞). Thus, in the case of FVI, the inferior rate as compared with regression seems unavoidable. By switching from state value functions to action-value functions it seems quite possible to eliminate this inefficiency. In this case the capacity of the function space would increase (in particular, in Equation (9) V_{F^+} would be replaced by |A| V_{F^+}).

6. Randomized Policies

The previous result shows that by making the inherent Bellman error of the function space small enough, we can ensure a close-to-optimal performance if one uses a policy greedy w.r.t. the last value-function estimate, V_K. However, the computation of such a greedy policy requires the evaluation of some expectations, whose exact values are often difficult to compute. In this section we show that by computations analogous to those used in obtaining the iterates we can compute a randomized near-optimal policy based on V_K. Let us call an action a α-greedy w.r.t. the function V and state x, if

r(x, a) + γ ∫ V(y) P(dy|x, a) ≥ (TV)(x) − α.

Given V_K and a state x ∈ X we can use sampling to draw an α-greedy action w.p. at least 1−λ by executing the following procedure: let R^j_{x,a} ~ S(·, x, a), Y^j_{x,a} ~ P(·|x, a), j = 1, 2, …, M′ with M′ = M′(α, λ), and compute the approximate value of a at state x using

Q_{M′}(x, a) = (1/M′) Σ_{j=1}^{M′} [ R^j_{x,a} + γ V_K(Y^j_{x,a}) ].

Let the policy π_{K,λ} : X → A be defined by

π_{K,λ}(x) = argmax_{a∈A} Q_{M′}(x, a).

The following result holds:

Theorem 3 Consider an MDP satisfying Assumptions A0 and A2. Fix p ≥ 1, µ ∈ M(X) and let V_0 ∈ F ⊂ B(X; V_max). Select α = (1−γ)ε/8, λ = ε(1−γ)/(8V_max) and let M′ = O(|A| R̂²_max log(|A|/λ)/α²). Then, for any ε, δ > 0, there exist integers K, M and N such that K is linear in log(1/ε), log V_max and log(1/(1−γ)), and N, M are polynomial in 1/ε, log(1/δ), 1/(1−γ), V_max, R̂_max, log(|A|) and log(N(cε(1−γ)²/C^{1/p}_{ρ,µ}, F, µ)) for some c > 0, such that if {V_k}_{k=1}^K are the iterates generated by multi-sample FVI with parameters (N, M, K, µ, F) then for the policy π_{K,λ} as defined above, w.p. at least 1−δ, we have

‖V* − V^{π_{K,λ}}‖_{p,ρ} ≤ (4γ/(1−γ)²) C^{1/p}_{ρ,µ} d_{p,µ}(TF, F) + ε.

An analogous result holds for the supremum-norm loss under Assumptions A0 and A1 with C_{ρ,µ} replaced by C_µ.

The proof can be found in Appendix C. A similar result holds for the single-sample variant of FVI. We note that in place of the above uniform sampling model one could also use the Median Elimination Algorithm of Even-Dar et al. (2002), resulting in a reduction of M′ by a factor of log(|A|). However, for the sake of compactness we do not explore this option here.
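The sampling procedure behind π_{K,λ} can be sketched in a few lines. The toy generative model and the stand-in for V_K below are our own placeholders; only the structure, M′ Monte Carlo draws per action followed by an argmax over the estimates Q_{M′}(x, a), follows the construction above.

```python
import random

# Sketch (ours) of the Section 6 procedure: estimate Q_{M'}(x, a) by Monte Carlo
# and act greedily w.r.t. the estimates. Model and V_K are placeholders.
random.seed(2)
GAMMA = 0.9

def sample_model(x, a):                 # one draw (R, Y) ~ (S(.|x,a), P(.|x,a))
    y = min(1.0, max(0.0, x + a + random.gauss(0.0, 0.05)))
    return -x, y                        # reward is a state-dependent cost

def V_K(x):                             # stand-in for the final FVI iterate
    return -5.0 * x

def pi_K(x, actions=(0.0, 0.1), M_prime=100):
    def q(a):                           # Q_{M'}(x, a), Monte Carlo average
        draws = [sample_model(x, a) for _ in range(M_prime)]
        return sum(r + GAMMA * V_K(y) for r, y in draws) / M_prime
    return max(actions, key=q)

print(pi_K(0.5))                        # the drift-free action keeps the value higher
```

Larger M′ makes the argmax more likely to pick an α-greedy action, matching the role M′ = M′(α, λ) plays in Theorem 3.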
FINITE-TIMEBOUNDSFORFITTEDVALUEITERATION7.AsymptoticConsistencyAhighlydesirablepropertyofanylearningalgorithmisthatasthenumberofsamplesgrowstoinnity,theerrorofthealgorithmshouldconvergetozero;inotherwords,thealgorithmshouldbeconsistent.SamplingbasedFVIwithaxedfunctionspaceFisnotconsistent:Ourpreviousresultsshowthatinsuchacasethelossconvergesto2g(1g)2C1=pr;µdp;µ(TF;F).Asimpleideatoremedythissituationistoletthefunctionspacegrowwiththenumberofsamples.InregressionthecorrespondingmethodwasproposedbyGrendander(1981)andiscalledthemethodofsieves.ThepurposeofthissectionistoshowthatFVIcombinedwiththismethodgivesaconsistentalgorithmforalargeclassofMDPs,namelyforthosethathaveLipschitzianrewardsandtransitions.Itisimportanttoemphasizethatalthoughtheresultsinthissectionassumethesesmoothnessconditions,themethoditselfdoesnotrequiretheknowledgeofthesmoothnessfactors.ItisleftforfutureworktodeterminewhethersimilarresultsholdforlargerclassesofMDPs.Thesmoothnessofthetransitionprobabilitiesandrewardsisdenedw.r.t.changesintheinitialstate:8(x;x0;a)2XXA,kP(jx;a)P(jx0;a)kLP\rxx0\ra;jr(x;a)r(x0;a)jLr\rxx0\ra:Herea;LP;Lr�0aretheunknownsmoothnessparametersoftheMDPandkP(jx;a)P(jx0;a)kdenotesthetotalvariationnormofthesignedmeasureP(jx;a)P(jx0;a).12Themethodisbuiltonthefollowingobservation:IftheMDPissmoothintheabovesenseandifV2B(X)isuniformlyboundedbyVmaxthenTVisL=(Lr+gVmaxLP)-Lipschitzian(withexponent0a1):j(TV)(x)(TV)(x0)j(Lr+gVmaxLP)kxx0ka;8x;x02X:Hence,ifFnisrestrictedtoVmax-boundedfunctionsthenTFndef=fTVjV2FngcontainsL-LipschitzVmax-boundedfunctionsonly:TFnLip(a;L;Vmax)def=ff2B(X)jkfk¥Vmax;jf(x)f(y)jLkxykag:Bythedenitionofdp;µ,dp;µ(TFn;Fn)dp;µ(Lip(a;L;Vmax);Fn):Henceifwemaketheright-handsideconvergetozeroasn!¥thensowilldodp;µ(TFn;Fn).Thequantity,dp;µ(Lip(a;L;Vmax);Fn)isnothingbuttheapproximationerroroffunctionsintheLips-chitzclassLip(a;L;Vmax)byelementsofFn.Now,dp;µ(Lip(a;L;Vmax);Fn)dp;µ(Lip(a;L);Fn),whereLip(a;L)isthesetofLipschitz-functionswithLipschitzconstantLandweexploitedthatLip(a;L)=[Vma
x�0Lip(a;L;Vmax).InapproximationtheoryanapproximationclassfFngissaidtobeuniversalifforanya;L�0,limn!¥dp;µ(Lip(a;L);Fn)=0:12.LetµbeasignedmeasureoverX.Thenthetotalvariationmeasure,jµjofµisdenedbyjµj(B)=supå¥=1jµ(Bi)j,wherethesupremumistakenoverallatmostcountablepartitionsofBintopairwisedisjointpartsfromtheBorelsetsoverX.Thetotalvariationnormkµkofµiskµk=jµj(X).833 MUNOSANDSZEPESV´ARIForalargevarietyofapproximationclasses(e.g.,approximationbypolynomials,Fourierbasis,wavelets,functiondictionaries)notonlyuniversalityisestablished,butvariantsofJackson'stheo-remgiveusratesofconvergenceoftheapproximationerror:dp;µ(Lip(a;L);Fn)=O(Lna)(e.g.,DeVore,1997).Oneremainingissueisthatclassicalapproximationspacesarenotuniformlybounded(i.e.,thefunctionsinthemdonotassumeauniformbound),whileourpreviousargumentshowingthattheimagespaceTFnisasubsetofLipschitzfunctionscriticallyreliesonthatFnisuniformlybounded.Onesolutionistousetruncations:LetTVmaxbethetruncationoperator,TVmaxr=(sign(r)Vmax;ifjrj�Vmax;r;otherwise:Now,asimplecalculationshowsthatdp;µ(Lip(a;L)\B(X;Vmax);TVmaxFn)dp;µ(Lip(a;L);Fn);whereTVmaxFn=fTVmaxfjf2Fng.This,togetherwithTheorem2givesrisetothefollowingresult:Corollary4ConsideranMDPsatisfyingAssumptionsA0andA2andassumethatbothitsimme-diaterewardfunctionandtransitionkernelareLipschitzian.Fixp1,µ2M(X)andletfFng,beauniversalapproximationclasssuchthatthepseudo-dimensionofTVmaxFngrowssublinearlyinn.Then,foreache;d�0thereexistanindexn0suchthatforanynn0thereexistintegersK;N;Mthatarepolynomialin1=e,log(1=d),1=(1g),Vmax,ˆRmax,log(jAj),andV(TVmaxFn)+suchthatifVKistheoutputofmulti-sampleFVIwhenitusesthefunctionsetTVmaxFnandXiµthenkVVpKkp;reholdsw.p.atleast1d.AnidenticalresultholdsforkVVpKk¥whenAs-sumptionA2isreplacedbyAssumptionA1.Theresultextendstosingle-sampleFVIasbefore.OneaspectinwhichthiscorollaryisnotsatisfactoryisthatsolvingtheoptimizationproblemdenedbyEquation(1)overTVmaxFniscomputationallychallengingevenwhenFnisaclassoflinearlyparameterizedfunctionsandp=2.Oneideaistodotheoptimiza
tionrstoverFnandthentruncatetheobtainedfunctions.Theresultingprocedurecanbeshowntobeconsistent(cf.,Chapter10ofGy¨oretal.,2002,foranalikeresultinaregressionsetting).Itisimportanttoemphasizethattheconstructionusedinthissectionisjustoneexampleofhowourmainresultmayleadtoconsistentalgorithms.AnimmediateextensionofthepresentworkwouldbetotargetthebestpossibleconvergenceratesforagivenMDPbyusingpenalizedestimation.Weleavethestudyofsuchmethodsforfuturework.8.DiscussionofRelatedWorkSamplingbasedFVIhasrootsthatdatebacktotheearlydaysofdynamicprogramming.Oneoftherstexamplesofusingvalue-functionapproximationmethodsistheworkofSamuelwhousedbothlinearandnon-linearmethodstoapproximatevaluefunctionsinhisprogramsthatlearnedtoplaythegameofcheckers(Samuel,1959,1967).Atthesametime,BellmanandDreyfus(1959)exploredtheuseofpolynomialsforacceleratingdynamicprogramming.Bothintheseworksandalsoinmostlaterworks(e.g.,Reetz,1977;Morin,1978)FVIwithrepresentativestateswasconsidered.834 FINITE-TIMEBOUNDSFORFITTEDVALUEITERATIONOftheseauthors,onlyReetz(1977)presentstheoreticalresultswho,ontheotherhand,consideredonlyone-dimensionalfeaturespaces.FVIisaspecialcaseofapproximatevalueiteration(AVI)whichencompassesanyalgorithmoftheformVt+1=TVt+et,wheretheerrorsetarecontrolledinsomeway.Iftheerrorterms,et,areboundedinsupremumnorm,thenastraightforwardanalysisshowsthatasymptotically,theworst-caseperformance-lossforthepolicygreedyw.r.t.themostrecentiteratescanbeboundedby2g(1g)2supt1ketk¥(e.g.,BertsekasandTsitsiklis,1996).WhenVt+1isthebestapproximationofTVtinFthensupt1ketk¥canbeupperboundedbytheinherentBellmanerrord¥(TF;F)=supf2Finfg2FkgTfk¥andwegettheloss-bound2g(1g)2d¥(TF;F).Apartfromthesmoothnessfactors(Cr;µ,Cµ)andtheestimationerrorterm,ourloss-boundshavethesameform(cf.,Equa-tion9).Inparticular,ifµisabsolutelycontinuousw.r.t.theLebesguemeasurethenlettingp!¥allowsustorecoverthesepreviousbounds(sincethenC1=pµdp;µ(TF;F)!d¥(TF;F)).Further,weexpectthatthep-normboundswouldbetightersincethesupremumnormissensitivetooutlier
s.Adifferentanalysis,originallyproposedbyGordon(1995)andTsitsiklisandVanRoy(1996),goesbyassumingthattheiteratessatisfyVt+1=PTVt,wherePisanoperatorthatmapsboundedfunctionstothefunctionspaceF.WhileGordon(1995)andTsitsiklisandVanRoy(1996)consid-eredtheplanningscenariowithknowndynamicsandmakinguseofasetofrepresentativestates,subsequentresultsbySinghetal.(1995),OrmoneitandSen(2002)andSzepesv´ariandSmart(2004)consideredlessrestrictedproblemsettings,thoughnoneoftheseauthorspresentednite-samplebounds.ThemainideaintheseanalysesisthattheaboveiteratesmustconvergetosomelimitV¥ifthecompositeoperatorPTisasupremum-normcontraction.SinceTisacontraction,thisholdswheneverPisasupremum-normnon-expansion.Inthiscase,thelossofusingthepolicygreedyw.r.t.V¥canbeboundedby4g(1g)2eP,whereePisthebestapproximationtoVbyxedpointsofP:eP=inff2F:Pf=fkfVk¥(e.g.,TsitsiklisandVanRoy,1996,Theorem2).Inpracticeaspecialclassofapproximationmethodscalledaveragersareused(Gordon,1995).ForthesemethodsPisguaranteedtobeanon-expansion.Kernelregressionmethods,suchask-nearestneighborssmoothingwithxedcenters,treebasedsmoothing(Ernstetal.,2005),orlinearinterpolationwithaxedsetofbasisfunctionssuchassplineinterpolationwithxedknotsallbelongtothisclass.InalltheseexamplesPisalinearoperatorandtakestheformPf=a+ån=1(Lif)fiwithsomefunctiona,appropriatebasisfunctions,fi,andlinearfunctionalsLi(i=1;2;:::;n).OneparticularlyinterestingcaseiswhenLif=f(xi)forsomepointsfxig,f00,åifi1,a0,(Pf)(xi)=f(xi)and(fi(xj))ijhasfullrank.InthiscaseallmembersofthespacespannedbythebasisfunctionsffigarexedpointsofP.HenceeP=d¥(span(f1;:::;fn);V)andsothelossoftheprocedureisdirectlycontrolledbythesizeofFn=span(f1;:::;fn).Letusnowdiscussthechoiceofthefunctionspacesinaveragersandsampling-basedFVI.Inthecaseofaveragers,theclassisrestricted,buttheapproximationrequirement,makingePsmall,seemstobeeasiertosatisfythanthecorrespondingrequirementwhichasksformakingtheinherentBellmanresidualofthefunctionspaceFnsmall.WethinkthatinthelackofknowledgeofVthisadvantagemightbe
minorandcanbeoffsetbythelargerfreedomtochooseFn(i.e.,nonlinear,orkernel-basedmethodsareallowed).Infact,whenVisunknownonemustresorttothegenericpropertiesoftheclassofMDPsconsidered(e.g.,smoothness)inordertondtheappropriatefunctionspace.Sincetheoptimalpolicyisunknown,too,itisnotquiteimmediatethatthefactthatonlyasinglefunction(thatdependsonanunknownMDP)mustbewellapproximatedshouldbeanadvantage.Still,onemayarguethattheself-referentialnatureoftheinherentBellman-835 MUNOSANDSZEPESV´ARIerrormakesthedesignforsampling-basedFVIharder.AswehaveshowninSection7,providedthattheMDPsaresmooth,designingthesespacesisnotnecessarilyharderthandesigningafunctionapproximatorforsomeregressiontask.LetusnowdiscusssomeotherrelatedworkswheretheauthorsconsidertheerrorresultingfromsomeMonte-Carloprocedure.OnesetofresultscloselyrelatedtotheonespresentedhereisduetoTsitsiklisandRoy(2001).Theseauthorsstudiedsampling-basedttedvalueiterationwithlin-earfunctionapproximators.However,theyconsideredadifferentclassofMDPs:nitehorizon,optimalstoppingwithdiscountedtotalrewards.Inthissettingthenext-statedistributionundertheconditionofnotstoppingisuncontrolled—thestateofthemarketevolvesindependentlyofthede-cisionmaker.TsitsiklisandRoy(2001)arguethatinthiscaseitisbettertosamplefulltrajectoriesthantogeneratesamplesinsomeother,arbitraryway.Theiralgorithmimplementsapproximatebackwardpropagationofthevalues(byL2ttingwithlinearfunctionapproximators),exploitingthattheproblemhasaxed,nitehorizon.Theirmainresultshowsthattheestimationerrorcon-vergestozerow.p.1asthenumberofsamplesgrowstoinnity.Further,aboundontheasymptoticperformanceisgiven.Duetothespecialstructureoftheproblem,thisbounddependsonlyonhowwelltheoptimalvaluefunctionisapproximatedbythechosenfunctionspace.Certainly,becauseoftheknowncounterexamples(Baird,1995;TsitsiklisandVanRoy,1996),wecannothopesuchaboundtoholdinthegeneralcase.Theworkpresentedherebuildsonourpreviouswork.Fornitestate-spaceMDPs,Munos(2003,2005)consideredplanningscenarioswithknowndynamicsanalyzingthe
stabilityofbothapproximatepolicyiterationandvalueiterationwithweightedL2(resp.,Lp)norms.PreliminaryversionsoftheresultspresentedherewerepublishedinSzepesv´ariandMunos(2005).Usingtech-niquessimilartothosedevelopedhere,recentlywehaveprovedresultsforthelearningscenariowhenonlyasingletrajectoryofsomexedbehaviorpolicyisknown(Antosetal.,2006).Weknowofnootherworkthatwouldhaveconsideredtheweightedp-normerroranalysisofsampling-basedFVIforcontinuousstate-spaceMDPsandinadiscounted,innite-horizonsettings.Oneworkwheretheauthorstudiesttedvalueiterationandwhichcomeswithanite-sampleanalysisisbyMurphy(2005),who,justlikeTsitsiklisandRoy(2001),studiednitehorizonprob-lemswithnodiscounting.13Becauseofthenite-horizonsetting,theanalysisisconsiderablesim-pler(thealgorithmworksbackwards).ThesamplescomefromanumberofindependenttrajectoriesjustlikeinthecaseofTsitsiklisandRoy(2001).Theerrorboundscomeintheformofperformancedifferencesbetweenapairofgreedypolicies:Oneofthepoliciesfromthepairisgreedyw.r.t.thevaluefunctionreturnedbythealgorithm,whiletheotherisgreedyw.r.t.tosomearbitrary`test'functionfromthefunctionsetconsideredinthealgorithm.Thederivedboundshowsthatthenum-berofsamplesneededisexponentialinthehorizonoftheproblemandisproportionaltoe4,whereeisthedesiredestimationerror.Theapproximationerroroftheprocedure,however,isnotconsid-ered:Murphysuggeststhattheoptimalaction-valuefunctioncouldbeaddedatvirtuallynocosttothefunctionsetsusedbythealgorithm.Accordingly,herboundsscaleonlywiththecomplexityofthefunctionclassanddonotscaledirectlywiththedimensionalityofthestatespace(justthrough13.WelearntoftheresultsofMurphy(2005)aftersubmittingourpaper.Oneinterestingaspectofthispaperisthattheresultsarepresentedforpartiallyobservableproblems.However,sinceallvalue-functionapproximationmethodsintroducestatealiasinganyway,resultsworkedoutforthefullyobservablecasecarrythroughtothelimitedfeedbackcasewithoutanychangeexceptthattheapproximationpowerofthefunctionapproximationmethodisfurtherlimitedbytheinformationthatisf
edintotheapproximator.Basedonthisobservationonemaywonderifitispossibletogetconsistentalgorithmsthatavoidanexplicit`stateestimation'component.However,thisremainsthesubjectoffuturework.836 FINITE-TIMEBOUNDSFORFITTEDVALUEITERATIONthecomplexityofthefunctionclass).Oneinterpretationofthisisthatifweareluckytochooseafunctionapproximatorsothattheoptimalvaluefunction(atallstages)canberepresentedexactlywithitthentherateofconvergencecanbefast.Intheunluckycase,noboundisgiven.Wewillcomebacktothediscussionofworst-casesamplecomplexityafterdiscussingtheworkbyKakadeandLangford(KakadeandLangford,2002;Kakade,2003).ThealgorithmconsideredbyKakadeandLangfordiscalledconservativepolicyiteration(CPI).Thealgorithmisdesignedfordiscountedinnitehorizonproblems.Thegeneralversionsearchesinaxedpolicyspace,P,ineachstepanoptimizerpickingapolicythatmaximizestheaverageoftheempiricaladvantagesofthepreviouspolicyatanumberofstates(basepoints)sampledfromsomedistribution.Theseadvantagescouldbeestimatedbysamplingsufcientlylongtrajectoriesfromthebasepoints.Thepolicypickedthiswayismixedintothepreviouspolicytopreventperformancedropsduetodrasticchanges,hencethenameofthealgorithm.Theorems7.3.1and7.3.3Kakade(2003)giveboundsonthelossofusingthepolicyreturnedbythisprocedurerelativetousingsomeotherpolicyp(e.g.,anear-optimalpolicy)asafunctionofthetotalvariationdistancebetweenn,thedistributionusedtosamplethebasepoints(thisdistributionisprovidedbytheuser),andthediscountedfuture-statedistributionunderlyingpwhenpisstartedfromarandomstatesampledfromn(dp;n).Thus,unlikeinthepresentpapertheerroroftheprocedurecanonlybecontrolledbyndingadistributionthatminimizesthedistancetodp;g,wherepisanear-optimalpolicy.Thismightbeasdifcultastheproblemofndingagoodpolicy.Theorem6.2inKakadeandLangford(2002)boundstheexpectedperformancelossundernasafunctionoftheimprecisionoftheoptimizerandtheRadon-Nykodimderivativeofdp;nanddp0;n,wherep0isthepolicyreturnedbythealgorithm.Howeverthisresultappliesonlytothecasewhenthepolicysetisunrestricted,andhenceth
eresultislimitedtoniteMDPs.Nowletusdiscusstheworst-casesamplecomplexityofsolvingMDPs.Averysimpleobser-vationisthatitshouldbeimpossibletogetboundsthatscalepolynomiallywiththedimensionofthestate-spaceunlessspecialconditionsaremadeontheproblem.Thisisbecausetheproblemofestimatingthevaluefunctionofapolicyinatrivialnite-horizonproblemwithasingletimestepisequivalenttoregression.HenceknownlowerboundsforregressionmustapplytoRL,aswell(seeStone,1980,1982andChapter3ofGy¨or,Kohler,Krzyzak,andWalk,2002forsuchbounds).Inparticular,fromtheseboundsitfollowsthattheminimaxsamplecomplexityofRLisexponentialinthedimensionalityofthestatespaceprovidedtheclassofMDPsislargeenough.Hence,itisnotsurprisingthatunlessveryspecialconditionsareimposedontheclassofMDPsconsidered,FVIanditsvariantsaresubjecttothecurse-of-dimensionality.OnewaytohelpwiththisexponentialscalingiswhenthealgorithmiscapableoftakingadvantageofthepossibleadvantageouspropertiesoftheMDPtobesolved.Inouropinion,onemajoropenprobleminRListodesignsuchmethods(ortoshowthatsomeexistingmethodpossessesthisproperty).Thecurse-of-dimensionalityisnotspecictoFVIvariants.Infact,aresultofChowandTsit-siklis(1989)statesthefollowing:ConsideraclassofMDPswithX=[0;1]d.AssumethatforanyMDPintheclass,thetransitionprobabilitykernelunderlyingtheMDPhasadensityw.r.t.theLebesguemeasureandthesedensitieshaveacommonupperbound.Further,theMDPswithintheclassareassumedtobeuniformlysmooth:foranyMDPintheclasstheLipschitzconstantoftherewardfunctionoftheMDPisboundedbyanappropriateconstantandthesameholdsfortheLipschitzconstantofthedensityfunction.Fixadesiredprecision,e.Then,anyalgorithmthatisguaranteedtoreturnane-optimalapproximationtotheoptimalvaluefunctionmustquery(sample)therewardfunctionandthetransitionprobabilitiesatleastW(1=ed)-times,forsomeMDPwithinthe837 
MUNOSANDSZEPESV´ARIclassconsidered.Hence,evenclassesofsmoothMDPswithuniformlyboundedtransitiondensitieshaveveryhardinstances.Thesituationchangesdramaticallyifoneisallowedtointerleavecomputationsandcontrol.Inthiscase,buildingontherandom-discretizationmethodofRust(1996b),itispossibletoachievenear-optimalbehaviorbyusinganumberofsamplesperstepthatscalespolynomiallyintheimpor-tantquantities(Szepesv´ari,2001).Inparticular,thisresultshowsthatitsufcestoletthenumberofsamplesscalelinearlyinthedimensionalityofthestatespace.Interestingly,thisresultholdsforaclassofMDPsthatsubsumestheoneconsideredbyChowandTsitsiklis(1989).Therandom-discretizationmethodrequiresthattheMDPsintheclasssatisfyAssumptionA1withacommonconstantandalsotheknowledgeofthedensityunderlyingthetransitionprobabilitykernel.Whenthedensitydoesnotexistorisnotknown,itcouldbeestimated.However,estimatingconditionaldensityfunctionsitselfisalsosubjecttothecurseofdimensionality,hence,theadvantageoftherandom-discretizationmethodmeltsawayinsuchsituations,makingsampling-basedFVIaviablealternative.Thisisthecaseindeed,sincetheresultspresentedhererequireweakerassumptionsonthetransitionprobabilitykernel(AssumptionA2)andthusapplytoabroaderclassofproblems.Anothermethodthatinterleavescomputationsandcontrolisthesparsetrajectory-treemethodofKearnsetal.(1999).Thesparsetrajectory-treemethodbuildsarandomlookaheadtreetocomputesamplebasedapproximationtothevaluesofeachoftheactionsavailableatthecurrentstate.Thismethoddoesnotrequiretheknowledgeofthedensityunderlyingthetransitionprobabilitykernel,nordoesitrequireanyassumptionsontheMDP.Unfortunately,thecomputationalcostofthismethodscalesexponentiallyinthe`e-horizon',logg(Rmax=(e(1g))).Thisputsseverelimitsontheutilityofthismethodwhenthediscountfactorisclosetooneandthenumberofactionsismoderate.Kearnsetal.(1999)arguethatwithoutimposingadditionalassumptions(i.e.,smoothness)ontheMDPtheexponentialdependencyontheeffectivehorizontimeisunavoidable(asimilardependenceonthehorizonshowsupintheboundsofMurphy,
9. Simulation Study

The purpose of this section is to illustrate the tradeoffs involved in using FVI. Since identical or very similar algorithms have been used successfully in many prior empirical studies (e.g., Longstaff and Shwartz, 2001; Haugh, 2003; Jung and Uthmann, 2004), we do not attempt a thorough empirical evaluation of the algorithm.

9.1 An Optimal Replacement Problem

The problem used as a testbed is a simple one-dimensional optimal replacement problem, described for example by Rust (1996a). The system has a one-dimensional state. The state variable, $x_t \in \mathbb{R}_+$, measures the accumulated utilization of a product, such as the odometer reading on a car. By convention, we let $x_t = 0$ denote a brand new product. At each time step, $t$, there are two possible decisions: either keep ($a_t = K$) or replace ($a_t = R$) the product. The latter action implies an additional cost $C$ of selling the existing product and replacing it by a new one. The transition to a new state occurs with the following exponential densities:

$$p(y|x,K) = \begin{cases} \beta e^{-\beta(y-x)}, & \text{if } y \ge x; \\ 0, & \text{if } y < x, \end{cases} \qquad p(y|x,R) = \begin{cases} \beta e^{-\beta y}, & \text{if } y \ge 0; \\ 0, & \text{if } y < 0. \end{cases}$$

The reward function is $r(x,K) = -c(x)$, where $c(x)$ represents the cost of maintaining the product. By assumption, $c$ is monotonically increasing. The reward associated with the replacement of the product is independent of the state and is given by $r(x,R) = -C - c(0)$. The optimal value function solves the Bellman optimality equation:

$$V^*(x) = \max\Big[ -c(x) + \gamma \int_x^\infty p(y|x,K)\,V^*(y)\,dy,\;\; -C - c(0) + \gamma \int_0^\infty p(y|x,R)\,V^*(y)\,dy \Big].$$

Here the first argument of the max represents the total future reward given that the product is not replaced, while the second argument gives the total future reward provided that the product is replaced. This equation has a closed-form solution:

$$V^*(x) = \begin{cases} \displaystyle \int_x^{\bar{x}} \frac{c'(y)}{1-\gamma}\Big(1 - \gamma e^{-\beta(1-\gamma)(y-x)}\Big)\,dy - \frac{c(\bar{x})}{1-\gamma}, & \text{if } x \le \bar{x}; \\[2ex] -\dfrac{c(\bar{x})}{1-\gamma}, & \text{if } x > \bar{x}. \end{cases}$$

Here $\bar{x}$ is the unique solution to

$$C = \int_0^{\bar{x}} \frac{c'(y)}{1-\gamma}\Big(1 - \gamma e^{-\beta(1-\gamma)y}\Big)\,dy.$$

The optimal policy is $\pi^*(x) = K$ if $x \in [0, \bar{x}]$, and $\pi^*(x) = R$ if $x > \bar{x}$.

9.2 Results

We chose the numerical values $\gamma = 0.6$, $\beta = 0.5$, $C = 30$, $c(x) = 4x$. This gives $\bar{x} \simeq 4.8665$, and the optimal value function, plotted in Figure 2, is

$$V^*(x) = \begin{cases} -10x + 30\big(e^{0.2(x-\bar{x})} - 1\big), & \text{if } x \le \bar{x}; \\ -10\bar{x}, & \text{if } x > \bar{x}. \end{cases}$$

We consider approximation of the value function using polynomials of degree $l$. As suggested in Section 7, we used truncation. In order to make the state space bounded, we actually consider a problem that closely approximates the original one. For this we fix an upper bound for the states, $x_{\max} = 10$, and modify the problem definition such that if the next state $y$ happens to be outside of the domain $[0, x_{\max}]$ then the product is replaced immediately, and a new state is drawn as if action $R$ were chosen in the previous time step. By the choice of $x_{\max}$, $\int_{x_{\max}}^\infty p(dy|x,R)$ is negligible, and hence the optimal value function of the altered problem closely matches that of the original problem when restricted to $[0, x_{\max}]$.

We chose the distribution $\mu$ to be uniform over the state space $[0, x_{\max}]$. The transition density functions $p(\cdot|x,a)$ are bounded by $\beta$, thus Assumption A1 holds with $C_\mu = \beta x_{\max} = 5$.

Figure 2 illustrates two iterates ($k = 2$ and $k = K = 20$) of the multi-sample version of sampling-based FVI: the dots represent the points $\{(X_n, \hat{V}_{M,k+1}(X_n))\}_{1 \le n \le N}$ for $N = 100$, where $X_i$ is drawn from $\mu$ and $\{\hat{V}_{M,k+1}(X_n)\}_{1 \le n \le N}$ is computed using (2) with $V = V_k$ and $M = 10$ samples. The grey curve is the best fit (minimizing the least-squares error to the data, that is, $p = 2$) in $\mathcal{F}$ (for $l = 4$) and the thin black curve is the optimal value function.

[Figure 2: Illustration of two iteration steps of sampling-based FVI (top: k = 2, bottom: k = 20). The dots represent the pairs $(X_i, \hat{V}_{M,k}(X_i))$, $i = 1, \ldots, N$, based on $M = 10$ sampled transitions per base point and $N = 100$ base points. The grey curve is the best fit among polynomials of degree $l = 4$. The thin black curve is the optimal value function.]

Figure 3 shows the $L^\infty$ approximation errors $\|V^* - V_K\|_\infty$ for different values of the degree $l$ of the polynomial regression, and different values of the number of base points $N$ and of the number of sampled next states $M$. The number of iterations was set to $K = 20$. The reason that the figure shows the error of approximating $V^*$ by $V_K$ (i.e., $e_K = \|V^* - V_K\|_\infty$) instead of $\|V^* - V^{\pi_K}\|_\infty$ is that in this problem this latter error converges very fast and is thus less interesting. (The technique developed in the paper can be readily used to derive a bound on the estimation error of $V^*$.) Of course, the performance loss is always upper bounded (in $L^\infty$-norm) by the approximation error, thanks to the well-known bound (e.g., Bertsekas and Tsitsiklis, 1996): $\|V^* - V^{\pi_K}\|_\infty \le 2/(1-\gamma)\,\|V^* - V_K\|_\infty$.

From Figure 3 we observe that when the degree $l$ of the polynomials increases, the error first decreases, because of the decrease of the inherent approximation error, but eventually increases because of overfitting.
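The multi-sample FVI procedure used in this experiment can be sketched in a few lines. This is our own minimal illustration, not the authors' code: the truncation bound `vmax`, the cost-free forced replacement when a next state overflows `XMAX`, and the use of `numpy.polyfit` for the degree-$l$ least-squares step are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, BETA, COST, XMAX = 0.6, 0.5, 30.0, 10.0

def step(x, a):
    """Generative model of the (truncated) replacement problem.
    a = 0: keep, a = 1: replace. Overflowing next states trigger an
    immediate replacement (a simplifying reading of the truncation rule)."""
    r = -4.0 * x if a == 0 else -COST          # r(x,K) = -c(x), r(x,R) = -C - c(0)
    y = (x if a == 0 else 0.0) + rng.exponential(1.0 / BETA)
    if y > XMAX:
        y = rng.exponential(1.0 / BETA)        # redraw as if action R were chosen
    return r, min(y, XMAX)

def fitted_value_iteration(n_base=100, m_samples=10, degree=4, n_iter=20):
    xs = rng.uniform(0.0, XMAX, size=n_base)   # base points X_1..X_N drawn from mu
    coef = np.zeros(degree + 1)                # V_0 = 0
    vmax = 4.0 * XMAX / (1.0 - GAMMA)          # crude bound used for truncation
    for _ in range(n_iter):
        v = lambda y: np.clip(np.polyval(coef, y), -vmax, vmax)
        targets = []
        for x in xs:                           # Monte-Carlo Bellman backups
            q = [np.mean([r + GAMMA * v(y)
                          for r, y in (step(x, a) for _ in range(m_samples))])
                 for a in (0, 1)]
            targets.append(max(q))
        coef = np.polyfit(xs, targets, degree)  # least-squares fit (p = 2)
    return coef

coef = fitted_value_iteration()
```

With the paper's settings ($N = 100$, $M = 10$, $l = 4$, $K = 20$) the fitted polynomial tracks the decreasing shape of $V^*$, including its plateau past $\bar{x}$, up to errors of the size seen in Figure 3.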
Figure 3 thus illustrates the different components of the bound (9): the approximation error term $d_{p,\mu}(T\mathcal{F}, \mathcal{F})$ decreases with $l$ (as discussed in Section 7), whereas the estimation error bound, being a function of the pseudo-dimension of $\mathcal{F}$, increases with $l$ (in such a linear approximation architecture $V_{\mathcal{F}^+}$ equals the number of basis functions plus one, that is, $V_{\mathcal{F}^+} = l + 2$), with rate $O\big(((l+2)/N)^{1/(2p)}\big) + O(1/M^{1/2})$, disregarding logarithmic factors. According to this bound, the estimation error decreases when the number of samples increases, which is corroborated by the experiments: overfitting decreases when the numbers of samples $N, M$ increase. Note that truncation never happens except when the degree of the polynomial is very large compared with the number of samples.

[Figure 3: Approximation errors $\|V^* - V_K\|_\infty$ of the function $V_K$ returned by sampling-based FVI after $K = 20$ iterations, for different values of the polynomial degree $l$, for $N = 100$, $M = 10$ (plain curve), $N = 100$, $M = 100$ (dotted curve), and $N = 1000$, $M = 10$ (dashed curve). The plotted values are the average over 100 independent runs.]

In our second set of experiments we investigated whether, in this problem, the single-sample or the multi-sample variant of the algorithm is more advantageous, provided that sample collection is expensive or limited in some way. Figure 4 shows the distributional character of $V_K - V^*$ as a function of the state. The order of the fitted polynomials is 5. The solid (black) curve shows the mean error (representing the bias) for 50 independent runs, the dashed (blue) curves show the upper and lower confidence intervals at 1.5 times the observed standard deviation, while the dash-dotted (red) curves show the minimum/maximum approximation errors. Note the peak at $\bar{x}$: the value function at this point is non-smooth, introducing a bias that converges to zero rather slowly (the same effect in Fourier analysis is known as the Gibbs phenomenon). It is also evident from the figure that the approximation error near the edges of the state space is larger. In polynomial interpolation for uniform arrangements of the base points, the error actually blows up at the ends of the interval as the order of interpolation is increased (Runge's phenomenon). A general suggestion to avoid this is to increase the denseness of points near the edges or to introduce more flexible methods (e.g., splines). In FVI the edge effect is ultimately washed out, but it may still cause a considerable slowdown of the procedure when the behavior of the value function near the boundaries is critical.

[Figure 4: Approximation errors for the multi-sample (left figure) and the single-sample (right) variants of sampling-based FVI. The figures show the distribution of errors for approximating the optimal value function as a function of the state, as measured using 50 independent runs. For both versions, $N = 100$, $K = 10$. However, for the multi-sample variant $M = 10$, while for the single-sample variant $M = 100$, making the total number of samples used equal in the two cases.]

Now, as to the comparison of the single- and multi-sample algorithms, it should be apparent from this figure that for this specific setup the single-sample variant is actually preferable: the bias does not seem to increase much due to the reuse of samples, while the variance of the estimates decreases significantly.

10. Conclusions

We considered sampling-based FVI for discounted, large (possibly infinite) state space, finite-action Markovian decision processes, when only a generative model of the environment is available. In each iteration, the image of the previous iterate under the Bellman operator is approximated at a finite number of points using a simple Monte-Carlo technique. A regression method is then used to fit a function to the data obtained. The main contributions of the paper are performance bounds for this procedure that hold with high probability. The bounds scale with the inherent Bellman error of the function space used in the regression step, and with the stochastic stability properties of the MDP. It is an open question whether the finiteness of the inherent Bellman error is necessary for the stability of FVI, but the counterexamples discussed in the introduction suggest that the inherent Bellman residual of the function space should indeed play a crucial role in the final performance of FVI. Even less is known about whether the stochastic stability conditions are necessary or whether they can be relaxed.

We argued that by increasing the number of samples and the richness of the function space at the same time, the resulting algorithm can be shown to be consistent for a wide class of MDPs. The derived rates show that, in line with our expectations, FVI will typically suffer from the curse of dimensionality, except when some specific conditions (extreme smoothness, only a few relevant state variables, sparsity, etc.) are met. Since these conditions can be difficult to verify a priori for any practical problem, adaptive methods are needed. We believe that the techniques developed in this paper may serve as a solid foundation for developing and studying such algorithms. One immediate possibility along this line would be to extend our results to penalized empirical risk minimization, where a penalty term penalizing the roughness of the candidate functions is added to the empirical risk. The advantage of this approach is that, without any a priori knowledge of the smoothness class, the method allows one to achieve the optimal rate of convergence (see Györfi et al., 2002, Section 21.2).

Another problem left for future work is to improve the scaling of our bounds. An important open question is to establish tight lower bounds on the rate of convergence for value-function based RL methods. There are other ways to improve the performance of our algorithm that are more directly related to the specifics of RL. Both Tsitsiklis and Van Roy (2001) and Kakade (2003) argued that $\mu$, the distribution used to sample the states, should be selected to match the future-state distribution of a (near-)optimal policy. Since the only way to learn about the optimal policy is by running the algorithm, one idea is to change the sampling distribution by moving it closer to the future-state distribution of the most recent policy. The improvement would presumably manifest itself in a decrease of the term including $C_{\rho,\mu}$. Another possibility is to adaptively choose $M$, the number of sampled next states, based on the available local information, as in active learning, hoping that in this way the sample efficiency of the algorithm can be further improved.

Acknowledgments

Csaba Szepesvári greatly acknowledges the support received from the Hungarian National Science Foundation (OTKA), Grant No. T047193, the Hungarian Academy of Sciences (Bolyai Fellowship), the Alberta Ingenuity Fund, NSERC, and the Computer and Automation Research Institute of the Hungarian Academy of Sciences. We would like to thank Barnabás Póczos and the anonymous reviewers for helpful comments, suggestions and discussions.

Appendix A. Proof of Lemma 1

In order to prove Lemma 1 we use the following inequality due to Pollard:

Theorem 5 (Pollard, 1984) Let $\mathcal{F}$ be a set of measurable functions $f : \mathcal{X} \to [0, K]$ and let $\varepsilon > 0$, $N$ be arbitrary. If $X_i$, $i = 1, \ldots, N$, is an i.i.d. sequence taking values in the space $\mathcal{X}$, then

$$P\Big( \sup_{f \in \mathcal{F}} \Big| \frac{1}{N}\sum_{i=1}^N f(X_i) - E[f(X_1)] \Big| > \varepsilon \Big) \le 8\,E\big[\mathcal{N}(\varepsilon/8, \mathcal{F}(X^{1:N}))\big]\, e^{-\frac{N\varepsilon^2}{128 K^2}}.$$

Here one should perhaps work with outer expectations because, in general, the supremum of uncountably many random variables cannot be guaranteed to be measurable. However, since for specific examples of the function space $\mathcal{F}$ measurability can typically be established by routine separability arguments, we altogether ignore these measurability issues in this paper.

Now, let us prove Lemma 1, which stated the finite-sample bound for a single iterate.

Proof  Let $\Omega$ denote the sample space underlying the random variables. Let $\varepsilon'' > 0$ be arbitrary and let $f^*$ be such that $\|f^* - TV\|_{p,\mu} \le \inf_{f \in \mathcal{F}} \|f - TV\|_{p,\mu} + \varepsilon''$. Define $\|\cdot\|_{p,\hat{\mu}}$ by

$$\|f\|_{p,\hat{\mu}}^p = \frac{1}{N} \sum_{i=1}^N |f(X_i)|^p.$$

We prove the lemma by showing that the following sequence of inequalities holds simultaneously on a set of events of measure not smaller than $1 - \delta$:

$$\|V' - TV\|_{p,\mu} \le \|V' - TV\|_{p,\hat{\mu}} + \varepsilon' \qquad (10)$$
$$\le \|V' - \hat{V}\|_{p,\hat{\mu}} + 2\varepsilon' \qquad (11)$$
$$\le \|f^* - \hat{V}\|_{p,\hat{\mu}} + 2\varepsilon' \qquad (12)$$
$$\le \|f^* - TV\|_{p,\hat{\mu}} + 3\varepsilon' \qquad (13)$$
$$\le \|f^* - TV\|_{p,\mu} + 4\varepsilon' \qquad (14)$$
$$= d_{p,\mu}(TV, \mathcal{F}) + 4\varepsilon' + \varepsilon''.$$

It follows then that $\|V' - TV\|_{p,\mu} \le \inf_{f \in \mathcal{F}} \|f - TV\|_{p,\mu} + 4\varepsilon' + \varepsilon''$ w.p. at least $1 - \delta$. Since $\varepsilon'' > 0$ was arbitrary, it also follows that $\|V' - TV\|_{p,\mu} \le \inf_{f \in \mathcal{F}} \|f - TV\|_{p,\mu} + 4\varepsilon'$ w.p. at least $1 - \delta$. Now, the lemma follows by choosing $\varepsilon' = \varepsilon/4$.

Let us now turn to the proof of (10)–(14). First, observe that (12) holds due to the choice of $V'$, since $\|V' - \hat{V}\|_{p,\hat{\mu}} \le \|f - \hat{V}\|_{p,\hat{\mu}}$ holds for all functions $f$ from $\mathcal{F}$, and thus the same inequality holds for $f^* \in \mathcal{F}$, too. Thus, (10)–(14) will be established if we prove that (10), (11), (13) and (14) all hold w.p. at least $1 - \delta'$ with $\delta' = \delta/4$. Let

$$Q = \max\Big( \big|\, \|V' - TV\|_{p,\mu} - \|V' - TV\|_{p,\hat{\mu}} \,\big|,\; \big|\, \|f^* - TV\|_{p,\mu} - \|f^* - TV\|_{p,\hat{\mu}} \,\big| \Big).$$

We claim that

$$P(Q > \varepsilon') \le \delta', \qquad (15)$$

where $\delta' = \delta/4$. From this, (10) and (14) will follow. In order to prove (15), note that for all $\omega \in \Omega$, $V' = V'(\omega) \in \mathcal{F}$. Hence,

$$\sup_{f \in \mathcal{F}} \big|\, \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat{\mu}} \,\big| \ge \big|\, \|V' - TV\|_{p,\mu} - \|V' - TV\|_{p,\hat{\mu}} \,\big|$$

holds pointwise in $\Omega$. Therefore the inequality

$$\sup_{f \in \mathcal{F}} \big|\, \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat{\mu}} \,\big| \ge Q \qquad (16)$$

holds pointwise in $\Omega$, too, and hence

$$P(Q > \varepsilon') \le P\Big( \sup_{f \in \mathcal{F}} \big|\, \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat{\mu}} \,\big| > \varepsilon' \Big).$$

We claim that

$$P\Big( \sup_{f \in \mathcal{F}} \big|\, \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat{\mu}} \,\big| > \varepsilon' \Big) \le P\Big( \sup_{f \in \mathcal{F}} \big|\, \|f - TV\|^p_{p,\mu} - \|f - TV\|^p_{p,\hat{\mu}} \,\big| > (\varepsilon')^p \Big). \qquad (17)$$

Consider any event $\omega$ such that $\sup_{f \in \mathcal{F}} |\, \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat{\mu}} \,| > \varepsilon'$. For any such event $\omega$, there exists a function $f_0 \in \mathcal{F}$ such that $|\, \|f_0 - TV\|_{p,\mu} - \|f_0 - TV\|_{p,\hat{\mu}} \,| > \varepsilon'$. Pick such a function. Assume first that $\|f_0 - TV\|_{p,\hat{\mu}} \le \|f_0 - TV\|_{p,\mu}$. Hence $\|f_0 - TV\|_{p,\hat{\mu}} + \varepsilon' \le \|f_0 - TV\|_{p,\mu}$. Since $p \ge 1$, the elementary inequality $x^p + y^p \le (x + y)^p$ holds for any non-negative numbers $x, y$. Hence we get

$$\|f_0 - TV\|^p_{p,\hat{\mu}} + (\varepsilon')^p \le \big( \|f_0 - TV\|_{p,\hat{\mu}} + \varepsilon' \big)^p \le \|f_0 - TV\|^p_{p,\mu},$$

and thus $|\, \|f_0 - TV\|^p_{p,\hat{\mu}} - \|f_0 - TV\|^p_{p,\mu} \,| \ge (\varepsilon')^p$. This inequality can be shown to hold by an analogous reasoning when $\|f_0 - TV\|_{p,\hat{\mu}} > \|f_0 - TV\|_{p,\mu}$. Inequality (17) now follows since

$$\sup_{f \in \mathcal{F}} \big|\, \|f - TV\|^p_{p,\mu} - \|f - TV\|^p_{p,\hat{\mu}} \,\big| \ge \big|\, \|f_0 - TV\|^p_{p,\mu} - \|f_0 - TV\|^p_{p,\hat{\mu}} \,\big|.$$

Now, observe that $\|f - TV\|^p_{p,\mu} = E[\,|f(X_1) - (TV)(X_1)|^p\,]$, and $\|f - TV\|^p_{p,\hat{\mu}}$ is thus just the sample-average approximation of $\|f - TV\|^p_{p,\mu}$. Hence, by noting that the covering number associated with $\{f - TV \mid f \in \mathcal{F}\}$ is the same as the covering number of $\mathcal{F}$, calling for Theorem 5 results in

$$P\Big( \sup_{f \in \mathcal{F}} \big|\, \|f - TV\|^p_{p,\mu} - \|f - TV\|^p_{p,\hat{\mu}} \,\big| > (\varepsilon')^p \Big) \le 8\,E\Big[\mathcal{N}\Big(\frac{(\varepsilon')^p}{8}, \mathcal{F}(X^{1:N})\Big)\Big]\, e^{-\frac{N}{2}\left(\frac{(\varepsilon')^p}{8(2V_{\max})^p}\right)^2}.$$

By making the right-hand side upper bounded by $\delta' = \delta/4$ we find a lower bound on $N$, displayed in turn in (4). This finishes the proof of (15).

Now, let us prove inequalities (11) and (13). Let $f$ denote an arbitrary random function such that $f = f(x; \omega)$ is measurable for each $x \in \mathcal{X}$, and assume that $f$ is uniformly bounded by $V_{\max}$. Making use of the triangle inequality $|\, \|f - g\|_{p,\hat{\mu}} - \|f - h\|_{p,\hat{\mu}} \,| \le \|g - h\|_{p,\hat{\mu}}$, we get that

$$\big|\, \|f - TV\|_{p,\hat{\mu}} - \|f - \hat{V}\|_{p,\hat{\mu}} \,\big| \le \|TV - \hat{V}\|_{p,\hat{\mu}}. \qquad (18)$$

Hence, it suffices to show that $\|TV - \hat{V}\|_{p,\hat{\mu}} \le \varepsilon'$ holds w.p. $1 - \delta'$. For this purpose we shall use Hoeffding's inequality (Hoeffding, 1963) and union bound arguments. Fix any index $i$ ($1 \le i \le N$). Let $K_1 = \hat{R}_{\max} + \gamma V_{\max}$. Then, by assumption, $R^{X_i,a}_j + \gamma V(Y^{X_i,a}_j) \in [-K_1, K_1]$ holds w.p. 1, and thus by Hoeffding's inequality,

$$P\Big( \Big| E\big[R^{X_i,a}_1 + \gamma V(Y^{X_i,a}_1) \,\big|\, X^{1:N}\big] - \frac{1}{M}\sum_{j=1}^M \big(R^{X_i,a}_j + \gamma V(Y^{X_i,a}_j)\big) \Big| > \varepsilon' \,\Big|\, X^{1:N} \Big) \le 2 e^{-\frac{2M(\varepsilon')^2}{(2K_1)^2}}, \qquad (19)$$

where $X^{1:N} = (X_1, \ldots, X_N)$. Making the right-hand side upper bounded by $\delta'/(N|\mathcal{A}|)$ we find a lower bound on $M$ (cf. Equation 5). Since

$$\big| (TV)(X_i) - \hat{V}(X_i) \big| \le \max_{a \in \mathcal{A}} \Big| E\big[R^{X_i,a}_1 + \gamma V(Y^{X_i,a}_1) \,\big|\, X^{1:N}\big] - \frac{1}{M}\sum_{j=1}^M \big(R^{X_i,a}_j + \gamma V(Y^{X_i,a}_j)\big) \Big|,$$

it follows by a union bounding argument that

$$P\big( \big| (TV)(X_i) - \hat{V}(X_i) \big| > \varepsilon' \,\big|\, X^{1:N} \big) \le \delta'/N,$$

and hence another union bounding argument yields

$$P\Big( \max_{i=1,\ldots,N} \big| (TV)(X_i) - \hat{V}(X_i) \big|^p > (\varepsilon')^p \,\Big|\, X^{1:N} \Big) \le \delta'.$$

Taking the expectation of both sides of this inequality gives $P\big( \max_{i=1,\ldots,N} | (TV)(X_i) - \hat{V}(X_i) |^p > (\varepsilon')^p \big) \le \delta'$. Hence also

$$P\Big( \frac{1}{N}\sum_{i=1}^N \big| (TV)(X_i) - \hat{V}(X_i) \big|^p > (\varepsilon')^p \Big) \le \delta',$$

and therefore, by (18), $P\big( |\, \|f - TV\|_{p,\hat{\mu}} - \|f - \hat{V}\|_{p,\hat{\mu}} \,| > \varepsilon' \big) \le \delta'$. Using this with $f = V'$ and $f = f^*$ shows that inequalities (11) and (13) each hold w.p. at least $1 - \delta'$. This finishes the proof of the lemma.

Now, let us turn to the proof of Lemma 2, which stated a finite-sample bound for the single-sample variant of the algorithm.

A.1 Proof of Lemma 2

Proof  The proof is analogous to that of Lemma 1, hence we only give the differences. Up to (16) the two proofs proceed in an identical way; however, from (16) we continue by concluding that

$$\sup_{g \in \mathcal{F}} \sup_{f \in \mathcal{F}} \big|\, \|f - Tg\|_{p,\mu} - \|f - Tg\|_{p,\hat{\mu}} \,\big| \ge Q$$

holds pointwise in $\Omega$. From this point onward, $\sup_{f \in \mathcal{F}}$ is replaced by $\sup_{g,f \in \mathcal{F}}$ throughout the proof of (15). The proof goes through as before until the point where Pollard's inequality is used. At this point, since we have two suprema, we need to consider covering numbers corresponding to the function set

$$\mathcal{F}^T = \{ f - Tg \mid f \in \mathcal{F},\, g \in \mathcal{F} \}.$$

In the second part of the proof we must also use Pollard's inequality in place of Hoeffding's. In particular, (19) is replaced with

$$P\Big( \sup_{g \in \mathcal{F}} \Big| E\big[R^{X_i,a}_1 + \gamma g(Y^{X_i,a}_1) \,\big|\, X^{1:N}\big] - \frac{1}{M}\sum_{j=1}^M \big(R^{X_i,a}_j + \gamma g(Y^{X_i,a}_j)\big) \Big| > \varepsilon' \,\Big|\, X^{1:N} \Big) \le 8\,E\big[\mathcal{N}(\varepsilon'/8, \mathcal{F}^+(Z^{1:M}_{i,a}))\big]\, e^{-\frac{M(\varepsilon')^2}{128 K_1^2}},$$

where $Z^{i,a}_j = (R^{X_i,a}_j, Y^{X_i,a}_j)$ and $\mathcal{F}^+ = \{ h : \mathbb{R} \times \mathcal{X} \to \mathbb{R} \mid h(s, x) = s\,\mathbb{I}_{\{|s| \le V_{\max}\}} + f(x) \text{ for some } f \in \mathcal{F} \}$. The proof is concluded by noting that the covering numbers of $\mathcal{F}^+$ can be bounded in terms of the covering numbers of $\mathcal{F}$, using the arguments presented after the lemma at the end of Section 4.

Appendix B. Proof of Theorem 2

The theorem states PAC-bounds on the sample size of sampling-based FVI. The idea of the proof is to show that (i) if the errors in each iteration are small, then the final error will be small when $K$, the number of iterations, is high enough, and (ii) the previous results (Lemmas 1 and 2) show that the errors stay small with high probability in each iteration, provided that $M, N$ are high enough. Putting these results together gives the main result. Hence, we need to show (i). First, note that iteration (7) or (6) may be written $V_{k+1} = TV_k - \varepsilon_k$, where $\varepsilon_k$, defined by $\varepsilon_k = TV_k - V_{k+1}$, is the approximation error of the Bellman operator applied to $V_k$ due to sampling. The proof is done in two steps: we first prove a statement that gives pointwise bounds (i.e., bounds that hold for any state $x \in \mathcal{X}$), which is then used to prove the necessary $L^p$ bounds. Parts (i) and (ii) are connected in Sections B.3 and B.4.

B.1 Pointwise Error Bounds

Lemma 3 We have

$$V^* - V^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \Big\{ \sum_{k=0}^{K-1} \gamma^{K-k} \big[ (P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \big] |\varepsilon_k| + \gamma^{K+1} \big[ (P^{\pi^*})^{K+1} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_0} \big] |V^* - V_0| \Big\}. \qquad (20)$$

Proof  Since $TV_k \ge T^{\pi^*} V_k$, we have

$$V^* - V_{k+1} = T^{\pi^*} V^* - T^{\pi^*} V_k + T^{\pi^*} V_k - T V_k + \varepsilon_k \le \gamma P^{\pi^*} (V^* - V_k) + \varepsilon_k,$$

from which we deduce by induction

$$V^* - V_K \le \sum_{k=0}^{K-1} \gamma^{K-k-1} (P^{\pi^*})^{K-k-1} \varepsilon_k + \gamma^K (P^{\pi^*})^K (V^* - V_0). \qquad (21)$$
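The bound (20) involves the operator $(I - \gamma P^{\pi_K})^{-1}$, and the argument relies on two of its properties: it equals the Neumann series $\sum_{m \ge 0} \gamma^m (P^{\pi_K})^m$, and it is therefore a monotonic operator (its kernel is componentwise nonnegative). These facts are easy to check numerically; the sketch below is our own illustration (the 3-state kernel and $\gamma$ are arbitrary choices):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])  # arbitrary row-stochastic transition kernel

inv = np.linalg.inv(np.eye(3) - gamma * P)               # (I - gamma P)^{-1}
neumann = sum(gamma ** m * np.linalg.matrix_power(P, m)  # truncated Neumann series
              for m in range(500))

print(np.allclose(inv, neumann))  # the two representations agree
print((inv >= 0).all())           # nonnegative entries: monotonicity
```

The rows of `inv` sum to $1/(1-\gamma)$, which is where the $1/(1-\gamma)$ factors in the performance bounds originate.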
Similarly, from the definition of $\pi_k$ (so that $T^{\pi_k} V_k = T V_k$) and since $T V^* \ge T^{\pi_k} V^*$, we have

$$V^* - V_{k+1} = T V^* - T^{\pi_k} V^* + T^{\pi_k} V^* - T V_k + \varepsilon_k \ge \gamma P^{\pi_k} (V^* - V_k) + \varepsilon_k.$$

Thus, by induction,

$$V^* - V_K \ge \sum_{k=0}^{K-1} \gamma^{K-k-1} \big( P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_{k+1}} \big) \varepsilon_k + \gamma^K \big( P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_0} \big) (V^* - V_0). \qquad (22)$$

Now, from the definition of $\pi_K$, $T^{\pi_K} V_K = T V_K \ge T^{\pi^*} V_K$, and we have

$$V^* - V^{\pi_K} = T^{\pi^*} V^* - T^{\pi^*} V_K + T^{\pi^*} V_K - T^{\pi_K} V_K + T^{\pi_K} V_K - T^{\pi_K} V^{\pi_K} \le \gamma P^{\pi^*} (V^* - V_K) + \gamma P^{\pi_K} (V_K - V^* + V^* - V^{\pi_K}),$$

hence

$$(I - \gamma P^{\pi_K})(V^* - V^{\pi_K}) \le \gamma (P^{\pi^*} - P^{\pi_K})(V^* - V_K),$$

and since $(I - \gamma P^{\pi_K})$ is invertible and its inverse is a monotonic operator (an operator $T$ is monotonic if $x \le y$ implies $Tx \le Ty$; here we may write $(I - \gamma P^{\pi_K})^{-1} = \sum_{m \ge 0} \gamma^m (P^{\pi_K})^m$), we deduce

$$V^* - V^{\pi_K} \le \gamma (I - \gamma P^{\pi_K})^{-1} (P^{\pi^*} - P^{\pi_K})(V^* - V_K).$$

Now, using (21) for the $P^{\pi^*}$ term and (22) for the $-P^{\pi_K}$ term,

$$V^* - V^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \Big\{ \sum_{k=0}^{K-1} \gamma^{K-k} \big[ (P^{\pi^*})^{K-k} - P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \big] \varepsilon_k + \gamma^{K+1} \big[ (P^{\pi^*})^{K+1} - P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_0} \big] (V^* - V_0) \Big\},$$

from which (20) follows by taking the absolute value of both sides.

B.2 Lp Error Bounds

We have the following approximation results.

Lemma 4 For any $\eta > 0$, there exists $K$ that is linear in $\log(1/\eta)$ (and $\log V_{\max}$) such that, if the $L^p(\mu)$ norm of the approximation errors is bounded by some $\varepsilon$ ($\|\varepsilon_k\|_{p,\mu} \le \varepsilon$ for all $0 \le k < K$), then the following hold:

Given Assumption A1 we have

$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} C_\mu^{1/p}\, \varepsilon + \eta. \qquad (23)$$

Given Assumption A2 we have

$$\|V^* - V^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, \varepsilon + \eta. \qquad (24)$$

Note that if $\|\varepsilon_k\|_\infty \le \varepsilon$ then, letting $p \to \infty$, we get back the well-known, unimprovable supremum-norm error bound

$$\limsup_{K \to \infty} \|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\, \varepsilon$$

for approximate value iteration (Bertsekas and Tsitsiklis, 1996). (In fact, by inspecting the proof below, it turns out that the weaker condition $\limsup_{k \to \infty} \|\varepsilon_k\|_\infty \le \varepsilon$ suffices for this, too.)

Proof  We have seen that if A1 holds then A2 also holds, and for any distribution $\rho$, $C_{\rho,\mu} \le C_\mu$. Thus, if the bound (24) holds for any $\rho$, then choosing $\rho$ to be a Dirac at each state proves (23). Thus we only need to prove (24). We may rewrite (20) as

$$V^* - V^{\pi_K} \le \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big[ \sum_{k=0}^{K-1} \alpha_k A_k |\varepsilon_k| + \alpha_K A_K |V^* - V_0| \Big],$$

with the positive coefficients

$$\alpha_k = \frac{(1-\gamma)\gamma^{K-k-1}}{1-\gamma^{K+1}}, \text{ for } 0 \le k < K, \qquad \alpha_K = \frac{(1-\gamma)\gamma^K}{1-\gamma^{K+1}}$$

(defined such that they sum to 1), and the probability kernels

$$A_k = \frac{1-\gamma}{2} (I - \gamma P^{\pi_K})^{-1} \big[ (P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \big], \text{ for } 0 \le k < K,$$
$$A_K = \frac{1-\gamma}{2} (I - \gamma P^{\pi_K})^{-1} \big[ (P^{\pi^*})^{K+1} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_0} \big].$$

We have

$$\|V^* - V^{\pi_K}\|^p_{p,\rho} = \int \rho(dx)\, |V^*(x) - V^{\pi_K}(x)|^p \le \Big[ \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big]^p \int \rho(dx) \Big[ \sum_{k=0}^{K-1} \alpha_k A_k |\varepsilon_k| + \alpha_K A_K |V^* - V_0| \Big]^p (x)$$
$$\le \Big[ \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big]^p \int \rho(dx) \Big[ \sum_{k=0}^{K-1} \alpha_k A_k |\varepsilon_k|^p + \alpha_K A_K |V^* - V_0|^p \Big](x),$$

by using Jensen's inequality twice (since the coefficients $\alpha_k$, $k \in [0, K]$, sum to 1 and the $A_k$ are positive linear operators with $A_k \mathbf{1} = \mathbf{1}$), that is, by the convexity of $x \mapsto |x|^p$. The term $|V^* - V_0|$ may be bounded by $2 V_{\max}$. Now, under Assumption A2, $\rho A_k \le (1-\gamma) \sum_{m \ge 0} \gamma^m c(m + K - k)\, \mu$, and we deduce

$$\|V^* - V^{\pi_K}\|^p_{p,\rho} \le \Big[ \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big]^p \Big[ \sum_{k=0}^{K-1} \alpha_k (1-\gamma) \sum_{m \ge 0} \gamma^m c(m + K - k)\, \|\varepsilon_k\|^p_{p,\mu} + \alpha_K (2V_{\max})^p \Big].$$

Replacing the $\alpha_k$ by their values and using the definition of $C_{\rho,\mu}$ and $\|\varepsilon_k\|_{p,\mu} \le \varepsilon$, we have

$$\|V^* - V^{\pi_K}\|^p_{p,\rho} \le \Big[ \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big]^p \Big[ \frac{(1-\gamma)^2}{1-\gamma^{K+1}} \sum_{m \ge 0} \sum_{k=0}^{K-1} \gamma^{m+K-k-1} c(m + K - k)\, \varepsilon^p + \frac{(1-\gamma)\gamma^K}{1-\gamma^{K+1}} (2V_{\max})^p \Big]$$
$$\le \Big[ \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2} \Big]^p \Big[ \frac{1}{1-\gamma^{K+1}} C_{\rho,\mu}\, \varepsilon^p + \frac{(1-\gamma)\gamma^K}{1-\gamma^{K+1}} (2V_{\max})^p \Big].$$

Thus there is $K$ linear in $\log(1/\eta)$ and $\log V_{\max}$, namely any $K$ with

$$\gamma^K \le \Big[ \frac{(1-\gamma)^2}{4\gamma V_{\max}}\, \eta \Big]^p,$$

such that the second term is bounded by $\eta^p$; thus

$$\|V^* - V^{\pi_K}\|^p_{p,\rho} \le \Big[ \frac{2\gamma}{(1-\gamma)^2} \Big]^p \big( C_{\rho,\mu}\, \varepsilon^p + \eta^p \big),$$

and hence $\|V^* - V^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, \varepsilon + \eta$.

B.3 From Pointwise Expectations to Conditional Expectations

We will need the following lemma in the proof of the theorem:

Lemma 5 Assume that $X, Y$ are independent random variables taking values in the respective measurable spaces $\mathcal{X}$ and $\mathcal{Y}$. Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a Borel-measurable function such that $E[f(X,Y)]$ exists. Assume that for all $y \in \mathcal{Y}$, $E[f(X,y)] \ge 0$. Then $E[f(X,Y)|Y] \ge 0$ holds, too, w.p. 1.

This lemma is an immediate consequence of the following result, whose proof is given for the sake of completeness:

Lemma 6 Assume that $X, Y$ are independent random variables taking values in the respective measurable spaces $\mathcal{X}$ and $\mathcal{Y}$. Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a Borel-measurable function and assume that $E[f(X,Y)]$ exists. Let $g(y) = E[f(X,y)]$. Then $E[f(X,Y)|Y] = g(Y)$ holds w.p. 1.

Proof  Let us first consider the case when $f$ has the form $f(x,y) = \mathbb{I}_{\{x \in A\}} \mathbb{I}_{\{y \in B\}}$, where $A \subset \mathcal{X}$, $B \subset \mathcal{Y}$ are measurable sets. Write $r(x) = \mathbb{I}_{\{x \in A\}}$ and $s(y) = \mathbb{I}_{\{y \in B\}}$. Then $E[f(X,Y)|Y] = E[r(X)s(Y)|Y] = s(Y)\,E[r(X)|Y]$, since $s(Y)$ is $Y$-measurable. Since $X$ and $Y$ are independent, so are $r(X)$ and $Y$, and thus $E[r(X)|Y] = E[r(X)]$. On the other hand, $g(y) = E[r(X)s(y)] = s(y)\,E[r(X)]$, and thus it indeed holds that $E[f(X,Y)|Y] = g(Y)$ w.p. 1. Now, by the additivity of expectations, the same relation holds for sums of functions of the above form and hence, ultimately, for all simple functions. If $f$ is nonnegative valued, then we can find a sequence of increasing simple functions $f_n$ with limit $f$.
By Lebesgue's monotone convergence theorem, $g_n(y) \stackrel{\mathrm{def}}{=} E[f_n(X,y)] \to E[f(X,y)]\; (= g(y))$. Further, since Lebesgue's monotone convergence theorem also holds for conditional expectations, we also have $E[f_n(X,Y)|Y] \to E[f(X,Y)|Y]$. Since $g_n(Y) = E[f_n(X,Y)|Y] \to E[f(X,Y)|Y]$ w.p. 1 and $g_n(Y) \to g(Y)$ w.p. 1, we get that $g(Y) = E[f(X,Y)|Y]$ w.p. 1. The extension to an arbitrary function follows by decomposing the function into its positive and negative parts.

B.4 Proof of Theorem 2

Proof  Let us consider first the multi-sample variant of the algorithm under Assumption A2. Fix $\varepsilon, \delta > 0$. Let the iterates produced by the algorithm be $V_1, \ldots, V_K$. Our aim is to show that, by selecting the number of iterates $K$ and the numbers of samples $N, M$ large enough, the bound

$$\|V^* - V^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon \qquad (25)$$

holds w.p. at least $1 - \delta$. First, note that by construction the iterates $V_k$ remain bounded by $V_{\max}$. By Lemma 4, under Assumption A2, on all those events where the error $\varepsilon_k = T V_k - V_{k+1}$ of each iterate is below (in $L^p(\mu)$-norm) some level $\bar{\varepsilon}$, we have

$$\|V^* - V^{\pi_K}\|_{p,\rho} \le \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, \bar{\varepsilon} + \eta, \qquad (26)$$

provided that $K = \Omega(\log(1/\eta))$. Now choose $\varepsilon_0 = (\varepsilon/2)\,(1-\gamma)^2/(2\gamma C_{\rho,\mu}^{1/p})$ and $\eta = \varepsilon/2$. Let $f(\varepsilon, \delta)$ denote the function that gives the lower bounds on $N, M$ in Lemma 1 based on the value of the desired estimation error $\varepsilon$ and confidence $\delta$, and let $(N, M) \ge f(\varepsilon_0, \delta/K)$. One difficulty is that $V_k$, the $k$-th iterate, is random itself, hence Lemma 1 (stated for deterministic functions) cannot be applied directly. However, thanks to the independence of samples between iterates, this is easy to fix via an application of Lemma 5. To show this, let us denote the collection of random variables used in the $k$-th step by $S_k$. Hence, $S_k$ consists of the $N$ base points, as well as the $|\mathcal{A}| N M$ next states and rewards. Further, introduce the notation $V'(V, S_k)$ to denote the result of solving the optimization problem (2)–(3) based on the sample $S_k$ and starting from the value function $V \in B(\mathcal{X})$. By Lemma 1,

$$P\big( \|V'(V, S_k) - TV\|_{p,\mu} \le d_{p,\mu}(TV, \mathcal{F}) + \varepsilon_0 \big) \ge 1 - \delta/K.$$

Now let us apply Lemma 5 with $X := S_k$, $Y := V_k$ and $f(S, V) = \mathbb{I}_{\{\|V'(V,S) - TV\|_{p,\mu} \le d_{p,\mu}(TV, \mathcal{F}) + \varepsilon_0\}} - (1 - \delta/K)$. Since $S_k$ is independent of $V_k$, the lemma can indeed be applied. Hence,

$$P\big( \|V'(V_k, S_k) - TV_k\|_{p,\mu} \le d_{p,\mu}(TV_k, \mathcal{F}) + \varepsilon_0 \,\big|\, V_k \big) \ge 1 - \delta/K.$$

Taking the expectation of both sides gives $P\big( \|V'(V_k, S_k) - TV_k\|_{p,\mu} \le d_{p,\mu}(TV_k, \mathcal{F}) + \varepsilon_0 \big) \ge 1 - \delta/K$. Since $V'(V_k, S_k) = V_{k+1}$ and $\varepsilon_k = TV_k - V_{k+1}$, we thus have that

$$\|\varepsilon_k\|_{p,\mu} \le d_{p,\mu}(TV_k, \mathcal{F}) + \varepsilon_0 \qquad (27)$$

holds except for a set of bad events $B_k$ of measure at most $\delta/K$. Hence, inequality (27) holds simultaneously for $k = 1, \ldots, K$, except for the events in $B = \cup_k B_k$. Note that $P(B) \le \sum_{k=1}^K P(B_k) \le \delta$. Now pick any event in the complement of $B$. For such an event, (26) holds with $\bar{\varepsilon} = d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon_0$. Plugging in the definitions of $\varepsilon_0$ and $\eta$, we obtain (25).

Now assume that the MDP satisfies Assumption A1. As before, we conclude that (27) holds except for the events in $B_k$, and with the same choice of $N$ and $M$ we still have $P(B) = P(\cup_k B_k) \le \delta$. Now, using (23), we conclude that, except on the set $B$,

$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} C_\mu^{1/p}\, d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon,$$

concluding the first part of the proof. For single-sample FVI the proof proceeds identically, except that one now uses Lemma 2 in place of Lemma 1.

Appendix C. Proof of Theorem 3

Proof  We would like to prove that the policy defined in Section 6 gives close-to-optimal performance. Let us prove first the statement under Assumption A2. By the choice of $M'$, it follows, using Hoeffding's inequality (see also Even-Dar et al., 2002, Theorem 1), that $\pi_{K,\lambda}$ selects $\alpha$-greedy actions w.p. at least $1 - \lambda$. Let $\pi_K$ be a policy that selects $\alpha$-greedy actions. A straightforward adaptation of the proof of Lemma 5.17 of Szepesvári (2001) yields that for all states $x \in \mathcal{X}$,

$$|V^{\pi_{K,\lambda}}(x) - V^{\pi_K}(x)| \le \frac{2 V_{\max} \lambda}{1-\gamma}. \qquad (28)$$

Now, use the triangle inequality to get

$$\|V^* - V^{\pi_{K,\lambda}}\|_{p,\rho} \le \|V^* - V^{\pi_K}\|_{p,\rho} + \|V^{\pi_K} - V^{\pi_{K,\lambda}}\|_{p,\rho}.$$

By (28), the second term can be bounded by $2 V_{\max} \lambda / (1-\gamma)$, so let us consider the first term. A modification of Lemmas 3 and 4 yields the following result, the proof of which will be given at the end of this section:

Lemma 7 The bound

$$\|V^* - V^{\pi_K}\|_{p,\rho} \le 2^{1-1/p} \Big[ \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p} \Big( \max_{0 \le k < K} \|\varepsilon_k\|_{p,\mu} + \eta \Big) + \frac{\alpha}{1-\gamma} \Big] \qquad (29)$$

holds for $K$ such that $\gamma^K \le \big[ \frac{(1-\gamma)^2}{4\gamma V_{\max}}\, \eta \big]^p$.

Again, let $f(\varepsilon, \delta)$ be the function that gives the bounds on $N, M$ in Lemma 1 for given $\varepsilon$ and $\delta$, and set $(N, M) \ge f(\varepsilon_0, \delta/K)$, for $\varepsilon_0$ to be chosen later. Using the same argument as in the proof of Theorem 2 and Lemma 1, we may conclude that $\|\varepsilon_k\|_{p,\mu} \le d_{p,\mu}(TV_k, \mathcal{F}) + \varepsilon_0 \le d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon_0$ holds except for a set $B_k$ with $P(B_k) \le \delta/K$. Thus, except on the set $B = \cup_k B_k$ of measure not more than $\delta$,

$$\|V^* - V^{\pi_{K,\lambda}}\|_{p,\rho} \le 2^{1-1/p} \Big[ \frac{2\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p} \big( d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon_0 + \eta \big) + \frac{\alpha}{1-\gamma} \Big] + \frac{2 V_{\max} \lambda}{1-\gamma}$$
$$\le \frac{4\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \frac{4\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, \varepsilon_0 + 2\eta + \frac{2\alpha}{1-\gamma} + \frac{2 V_{\max} \lambda}{1-\gamma}.$$
Now define $\alpha = \varepsilon(1-\gamma)/8$, $\eta = \varepsilon/8$, $\varepsilon_0 = \frac{\varepsilon}{4} \frac{(1-\gamma)^2}{4\gamma C_{\rho,\mu}^{1/p}}$ and $\lambda = \frac{\varepsilon}{4} \frac{1-\gamma}{2 V_{\max}}$ to conclude that

$$\|V^* - V^{\pi_{K,\lambda}}\|_{p,\rho} \le \frac{4\gamma}{(1-\gamma)^2} C_{\rho,\mu}^{1/p}\, d_{p,\mu}(T\mathcal{F}, \mathcal{F}) + \varepsilon$$

holds everywhere except on $B$. Also, just like in the proof of Theorem 2, we get that under Assumption A1 the statement for the supremum norm holds as well. It thus remains to prove Lemma 7:

Proof [Lemma 7]  Write $\mathbf{1}$ for the constant function that equals 1. Since $\pi_K$ is $\alpha$-greedy w.r.t. $V_K$, we have $T V_K \ge T^{\pi_K} V_K \ge T V_K - \alpha \mathbf{1}$. Thus, similarly to the proof of Lemma 3, we have

$$V^* - V^{\pi_K} = T^{\pi^*} V^* - T^{\pi^*} V_K + T^{\pi^*} V_K - T V_K + T V_K - T^{\pi_K} V_K + T^{\pi_K} V_K - T^{\pi_K} V^{\pi_K} \le \gamma P^{\pi^*} (V^* - V_K) + \gamma P^{\pi_K} (V_K - V^* + V^* - V^{\pi_K}) + \alpha \mathbf{1},$$

hence

$$V^* - V^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \big[ \gamma (P^{\pi^*} - P^{\pi_K})(V^* - V_K) \big] + \frac{\alpha \mathbf{1}}{1-\gamma},$$

and, by using (21) and (22), we deduce

$$V^* - V^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \Big\{ \sum_{k=0}^{K-1} \gamma^{K-k} \big[ (P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \big] |\varepsilon_k| + \gamma^{K+1} \big[ (P^{\pi^*})^{K+1} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_0} \big] |V^* - V_0| \Big\} + \frac{\alpha \mathbf{1}}{1-\gamma}.$$

Now, from the inequality $|a + b|^p \le 2^{p-1}(|a|^p + |b|^p)$, we deduce, by following the same lines as in the proof of Lemma 4, that

$$\|V^* - V^{\pi_K}\|^p_{p,\rho} \le 2^{p-1} \Big\{ \Big[ \frac{2\gamma}{(1-\gamma)^2} \Big]^p C_{\rho,\mu} \Big[ \Big( \max_{0 \le k < K} \|\varepsilon_k\|_{p,\mu} \Big)^p + \eta^p \Big] + \Big( \frac{\alpha}{1-\gamma} \Big)^p \Big\},$$

and Lemma 7 follows.

References

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.

A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In G. Lugosi and H. U. Simon, editors, The Nineteenth Annual Conference on Learning Theory, COLT 2006, Proceedings, volume 4005 of LNCS/LNAI, pages 574-588, Berlin, Heidelberg, June 2006. Springer-Verlag. (Pittsburgh, PA, USA, June 22-25, 2006.)

A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pages 330-337. IEEE, April 2007. (Honolulu, Hawaii, Apr 1-5, 2007.)
A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89-129, 2008.

Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37, San Francisco, CA, 1995. Morgan Kaufmann.

R. E. Bellman and S. E. Dreyfus. Functional approximation and dynamic programming. Math. Tables and Other Aids Comp., 13:247-251, 1959.

D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

P. Bougerol and N. Picard. Strict stationarity of generalized autoregressive processes. Annals of Probability, 20:1714-1730, 1992.

E. W. Cheney. Introduction to Approximation Theory. McGraw-Hill, London, New York, 1966.

C. S. Chow and J. N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5:466-488, 1989.

C. S. Chow and J. N. Tsitsiklis. An optimal multigrid algorithm for continuous state discrete time stochastic control. IEEE Transactions on Automatic Control, 36(8):898-914, 1991.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.

R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems 9, 1997.

R. DeVore. Nonlinear approximation. Acta Numerica, 1997.

T. G. Dietterich and X. Wang. Batch value function approximation via support vectors. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.

E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Fifteenth Annual Conference on Computational Learning Theory (COLT), pages 255-270, 2002.

G. J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261-268, San Francisco, CA, 1995. Morgan Kaufmann.

A. Gosavi. A reinforcement learning algorithm based on policy iteration for average reward: Empirical results with yield management and convergence analysis. Machine Learning, 55:5-29, 2004.

U. Grenander. Abstract Inference. Wiley, New York, 1981.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.

M. Haugh. Duality theory and simulation in financial engineering. In Proceedings of the Winter Simulation Conference, pages 327-334, 2003.

D. Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

T. Jung and T. Uthmann. Experiments in value function approximation with sparse support vector regression. In ECML, pages 180-191, 2004.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267-274, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

S. M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markovian decision processes. In Proceedings of IJCAI'99, pages 1324-1331, 1999.

G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.

M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.

W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.

F. A. Longstaff and E. S. Shwartz. Valuing American options by simulation: A simple least-squares approach. Rev. Financial Studies, 14(1):113-147, 2001.

S. Mahadevan, N. Marchalleck, T. Das, and A. Gosavi. Self-improving factory simulation using continuous-time average-reward reinforcement learning. In Proceedings of the 14th International Conference on Machine Learning (IMLC'97), 1997.

T. L. Morin. Computational advances in dynamic programming. In Dynamic Programming and its Applications, pages 53-90. Academic Press, 1978.

R. Munos. Error bounds for approximate policy iteration. In 19th International Conference on Machine Learning, pages 560-567, 2003.

R. Munos. Error bounds for approximate value iteration. American Conference on Artificial Intelligence, 2005.

S. A. Murphy. A generalization error for Q-learning. Journal of Machine Learning Research, 6:1073-1097, 2005.

A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406-415, 2000.

P. Niyogi and F. Girosi. Generalization bounds for function approximation from scattered noisy data. Advances in Computational Mathematics, 10:51-80, 1999.

D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161-178, 2002.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

D. Reetz. Approximate solutions of a discounted Markovian decision problem. Bonner Mathematischer Schriften, 98: Dynamische Optimierungen: 77-92, 1977.

M. Riedmiller. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317-328, 2005.

J. Rust. Numerical dynamic programming in economics. In H. Amman, D. Kendrick, and J. Rust, editors, Handbook of Computational Economics. Elsevier, North Holland, 1996a.

J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65:487-516, 1996b.

A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, pages 210-229, 1959. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, editors, McGraw-Hill, New York, 1963.

A. L. Samuel. Some studies in machine learning using the game of checkers, II: recent progress. IBM Journal on Research and Development, pages 601-617, 1967.

N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

S. P. Singh and D. P. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems 9, 1997.

S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In Proceedings of Neural Information Processing Systems 7, pages 361-368. MIT Press, 1995.

C. J. Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8:1348-1360, 1980.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040-1053, 1982.

Cs. Szepesvári. Efficient approximate planning in continuous space Markovian decision problems. AI Communications, 13:163-176, 2001.

Cs. Szepesvári. Efficient approximate planning in continuous space Markovian decision problems. Journal of European Artificial Intelligence Research, 2000. Accepted.

Cs. Szepesvári and R. Munos. Finite time bounds for sampling based fitted value iteration. In ICML'2005, pages 881-886, 2005.

Cs. Szepesvári and W. D. Smart. Interpolation-based Q-learning. In R. Greiner and D. Schuurmans, editors, Proceedings of the International Conference on Machine Learning, pages 791-798, 2004.

G. J. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58-67, March 1995.

J. N. Tsitsiklis and B. Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12:694-703, 2001.

J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59-94, 1996.

V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

X. Wang and T. G. Dietterich. Efficient value function approximation using regression trees. In Proceedings of the IJCAI Workshop on Statistical Machine Learning for Large-Scale Optimization, Stockholm, Sweden, 1999.

T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.

W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the International Joint Conference on Artificial Intelligence, 1995.