Scaling Up AVI with Options



1999)). In addition, much effort has gone into algorithms that learn "good" options for exploration (Stolle & Precup, 2002; Mannor et al., 2004). These algorithms may produce options that are also useful for planning. Lastly, options enable greater flexibility to model real-world problems where the time between decisions may vary. For example, in inventory management problems, orders may be placed when inventory is running low. This strategy makes the time between orders a random variable, which is more naturally modeled by options than by primitive actions. Thus, options are an important candidate for investigating planning with temporally extended actions.

The main technical contributions of this paper are extending the finite-sample/finite-iteration analysis of AVI to planning with options. First, we introduce a generalization of the Fitted Value Iteration (FVI) algorithm that incorporates samples generated by options. We show that when the set of options contains the primitive actions, our generalized algorithm converges approximately as fast as FVI with only primitive actions (Proposition 1). Then we develop precise conditions under which our generalized FVI algorithm converges faster with options than with only primitive actions (Theorem 1). These conditions turn out to depend critically on whether the iterates produced by FVI underestimate the optimal value function. Finally, our experimental results in two domains demonstrate that the convergence rate of planning with options can be significantly faster than planning with only primitive actions. However, as predicted by our theoretical analysis, the improvement in convergence only occurs when the iterates of our algorithm underestimate the optimal value function, which can be controlled in practice by setting the initial estimate of the optimal value function pessimistically. Our analysis of FVI suggests that options can play an important role in planning by inducing fast convergence.

2. Background

A Markov Decision Process (MDP) is defined by a 5-tuple ⟨X, A, P, R, γ⟩, where X is a set of states, A is a set of primitive actions, P maps from state-action pairs to a probability distribution over states, R is a mapping from state-action pairs to reward distributions bounded on the interval [−R_MAX, R_MAX], and γ ∈ [0, 1) is a discount factor. Let B(X; V_MAX) denote the set of functions with domain X and range bounded by [−V_MAX, V_MAX], and M(X) the set of all probability measures on X. Throughout this paper we will consider MDPs where X is a bounded subset of a d-dimensional Euclidean space and A is a finite set of primitive actions.

A deterministic, stationary policy π : X → A for an MDP is a mapping from states to primitive actions. We denote the set of deterministic, stationary policies by Π. The objective of planning in an MDP is to derive a policy that maximizes V^π(x) = E[ ∑_{t=0}^{∞} γ^t R_t(x_t, π(x_t)) | x_0 = x, π ], where V^π(x) is the long-term value of following π starting in state x. The function V^π is called the value function of the policy π, and it is well known that it can be written recursively as (T^π V^π)(x) = E[R(x, π(x))] + γ ∫ P(y | x, π(x)) V^π(y) dy, where T^π is a backup operator with respect to π and V^π is its unique fixed point. Given V ∈ B(X; V_MAX), the greedy policy with respect to V is defined by π(x) = argmax_{a∈A} E[R(x, a)] + γ ∫ P(y | x, a) V(y) dy. We denote the optimal value function by V* = max_{π∈Π} V^π. A policy π is optimal if its corresponding value function is V*, and a policy π is ε-optimal if V^π(x) ≥ V*(x) − ε for all x ∈ X.
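Since everything that follows builds on the backup operator T^π and the greedy-policy rule, a small worked example may help make the recursion concrete. The sketch below is not from the paper: the two-state MDP, the function names, and the tolerance are illustrative assumptions. It evaluates a fixed policy by iterating T^π to its fixed point V^π and then extracts the greedy policy with respect to that value function.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative only; not from the paper).
# P[a] is the |X| x |X| transition matrix for action a, R[a] the expected rewards.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.7, 0.3]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}
gamma = 0.9

def apply_T_pi(V, pi):
    """One application of the backup operator T^pi:
    (T^pi V)(x) = E[R(x, pi(x))] + gamma * sum_y P(y | x, pi(x)) V(y)."""
    return np.array([R[pi[x]][x] + gamma * P[pi[x]][x] @ V for x in range(len(V))])

def evaluate_policy(pi, tol=1e-10):
    """Iterate T^pi to its unique fixed point V^pi."""
    V = np.zeros(len(P[0]))
    while True:
        V_next = apply_T_pi(V, pi)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

def greedy_policy(V):
    """pi(x) = argmax_a E[R(x, a)] + gamma * sum_y P(y | x, a) V(y)."""
    return np.array([max(P, key=lambda a: R[a][x] + gamma * P[a][x] @ V)
                     for x in range(len(V))])

V_pi = evaluate_policy(pi=np.array([0, 0]))
print(V_pi, greedy_policy(V_pi))
```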
The Bellman operator T is defined by

(TV)(x) = max_{a∈A} E[R(x, a)] + γ ∫ P(y | x, a) V(y) dy,   (1)

where V ∈ B(X; V_MAX), and T is known to have fixed point V*. Equation (1) defines the Value Iteration (VI) algorithm. VI converges to V*, but it is computationally intractable in MDPs with extremely large or continuous state-spaces.

Primitive action Fitted Value Iteration (PFVI) is a generalization of VI to handle large or continuous state-spaces. PFVI runs iteratively, producing a sequence of K ≥ 1 estimates {V_k}_{k=1}^{K} of the optimal value function, and returns a policy π_K that is greedy with respect to the final estimate V_K. During each iteration k, the algorithm computes a set of empirical estimates V̂_k of T V_{k−1} for N states, and then fits a function approximator to V̂_k. To generate V̂_k, N states {x_i}_{i=1}^{N} are sampled from a distribution μ ∈ M(X). For each sampled state x_i and each primitive action a ∈ A, L next states {y^a_{i,j}}_{j=1}^{L} and rewards {r^a_{i,j}}_{j=1}^{L} are sampled from the MDP simulator S. For the kth iteration, the estimates of the Bellman backups are computed by

V̂_k(x_i) = max_{a∈A} (1/L) ∑_{j=1}^{L} [ r^a_{i,j} + γ V_{k−1}(y^a_{i,j}) ],   (2)

where V_0 is the initial estimate of the optimal value function given as an argument to PFVI. The kth estimate of the optimal value function is obtained by applying a supervised learning algorithm that produces

V_k = argmin_{f∈F} ∑_{i=1}^{N} | f(x_i) − V̂_k(x_i) |^p,   (3)

where p ≥ 1 and F ⊆ B(X; V_MAX) is the hypothesis space of the supervised learning algorithm.

Munos & Szepesvári (2008) presented a finite-sample, finite-iteration analysis of PFVI with guarantees dependent on the Lp-norm rather than the more conservative max-norm.

Then the update resulting from applying the Bellman operator to the previous iterate V_{k−1} is estimated by

V̂_k(x_i) ← max_{o∈O_{x_i}} (1/L) ∑_{j=1}^{L} [ r^o_{i,j} + γ^{τ^o_{i,j}} V_{k−1}(y^o_{i,j}) ],   (7)

and we obtain the best fit according to (3). In addition to returning a next state and reward, S also returns the number of timesteps that the option executed before terminating. This additional information is needed to compute (7). Otherwise, the differences between PFVI and OFVI are minor, and it is natural to ask whether OFVI has similar finite-sample and convergence behavior compared to PFVI.
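The only differences between the PFVI backup (2) and the OFVI backup (7) are the set being maximized over and the discounting: an option's sampled reward r^o_{i,j} already accumulates the discounted rewards along its trajectory, and the successor value is discounted by γ^{τ^o_{i,j}}, where τ^o_{i,j} is the option's sampled duration. The sketch below spells out both estimates side by side; the simulator interfaces and function names are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def pfvi_backup(x_i, actions, simulate, V_prev, gamma, L):
    """Equation (2): hat{V}_k(x_i) = max_a (1/L) sum_j [r + gamma * V_{k-1}(y)].
    `simulate(x, a)` is assumed to return (next_state, reward) for a primitive action."""
    best = -np.inf
    for a in actions:
        samples = [simulate(x_i, a) for _ in range(L)]
        est = np.mean([r + gamma * V_prev(y) for (y, r) in samples])
        best = max(best, est)
    return best

def ofvi_backup(x_i, options_at, simulate_option, V_prev, gamma, L):
    """Equation (7): hat{V}_k(x_i) = max_{o in O_{x_i}} (1/L) sum_j
    [r + gamma**tau * V_{k-1}(y)], where the simulator also returns tau,
    the number of timesteps the option executed before terminating."""
    best = -np.inf
    for o in options_at(x_i):                      # only options initializable at x_i
        samples = [simulate_option(x_i, o) for _ in range(L)]
        est = np.mean([r + (gamma ** tau) * V_prev(y) for (y, r, tau) in samples])
        best = max(best, est)
    return best
```

When O contains each primitive action as a one-step option (so that τ = 1 always), the OFVI backup reduces to the PFVI backup, which is the setting Proposition 1 below compares against. In either case the targets {V̂_k(x_i)} are then fed to the regression step (3).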
Proposition 1. For any ε, δ > 0 and K ≥ 1, fix p ≥ 1. Given a set of options O containing all primitive actions A, an initial state distribution ρ ∈ M(X), a sampling distribution μ ∈ M(X), and V_0 ∈ B(X; V_MAX), if A1(ρ, μ) holds, then there exist positive integers N and L such that when OFVI is executed, the policy π′_K returned by OFVI satisfies

‖V* − V^{π′_K}‖_{p,ρ} ≤ (2γ / (1 − γ)²) [ C_{ρ,μ}^{1/p} b_{p,μ}(TF, F) + ε ] + (γ^{K+1})^{1/p} · 2‖V* − V_0‖_∞ / (1 − γ)   (8)

with probability at least 1 − δ.

A proof of Proposition 1, as well as sufficient values for N and L, is given in the supplementary material. Inequality (8) bounds the suboptimality of the policy returned by OFVI, similar to inequality (4) from Munos & Szepesvári (2008). As long as the set of options O contains all primitive actions A, OFVI has worst-case performance that is comparable to PFVI. The main difference between the bound in Proposition 1 and Munos & Szepesvári (2008, Theorem 2) is that the inherent Bellman error in Proposition 1 may be larger than the inherent Bellman error with only primitive actions, because backups computed by OFVI may span multiple timesteps, resulting in more complex targets for the supervised learning algorithm to fit. However, the last terms characterizing the convergence rates of (4) and (8) are identical.

Proposition 1 implies that OFVI always converges approximately as fast as PFVI. However, the two algorithms may converge to different regions of the value function space due to the larger inherent Bellman error of OFVI. In the following section, we will investigate conditions under which OFVI converges faster than PFVI.

4. Convergence Rate of OFVI

We are interested in analyzing the case where OFVI plans with an option set consisting of the set of primitive actions and a few additional temporally extended actions. In most cases, a temporally extended action can only be initialized from a subset of the state-space. The following defines the set of states that have access to temporally extended actions that follow approximately optimal policies from those states.

Definition 1. Let O be the given set of options, κ ≥ 0, and d ≥ 1. The (κ, d)-omega set ω_{κ,d} contains the states x ∈ X such that there exists an option o ∈ O_x satisfying (1) the duration of executing o from x satisfies inf_{Y⊆X} E[D^o_{x,Y}] ≥ d; and (2) o is κ-optimal with respect to x (i.e., Q*(x, o) ≥ V*(x) − κ).

Temporally extended actions are only useful for planning if they are frequently encountered. The following assumption guarantees that, even if the temporally extended actions are sparsely scattered throughout the state-space, they are not too difficult to reach from any state that we are likely to encounter starting from x_0.

Assumption 2. [A2(κ, d, η, μ, j)] Let κ, η ≥ 0, d, j ≥ 1, and μ ∈ M(X). For any m primitive policies π_1, π_2, ..., π_m, let ν = μ P^{π_1} P^{π_2} ··· P^{π_m}. There exists a κ-optimal policy π̂ such that either (1) Pr_{x∼ν}[x ∈ ω_{κ,d}] ≥ 1 − η; or (2) there exists i ∈ {1, 2, ..., j} such that Pr_{y∼ν_i}[y ∈ ω_{κ,d}] ≥ 1 − η, where ν_i = ν P^i_{π̂} for i = 1, 2, ..., j.

A2(κ, d, η, μ, j) assumes that, starting from x_0, at each timestep every possible trajectory either encounters a state in ω_{κ,d} with high probability (1 − η), meaning that the agent almost always encounters states with temporally extended actions that are useful for planning, or that from the agent's current state there exists a policy π̂ that transitions to ω_{κ,d} with high probability in at most j timesteps. Under A2, useful temporally extended actions do not need to be at every state. They may be scattered sparsely throughout the state-space. This assumption is weak, since it can be made true for any MDP and any set of options containing the primitive actions by tuning the parameter values. Furthermore, the agent does not need to know π̂. It is sufficient that π̂ exists.

Faster convergence depends critically on the optimism or pessimism of the sequence of iterates produced by OFVI. We say that an estimate V ∈ B(X; V_MAX) of the optimal value function is optimistic if V(x) ≥ V*(x) for all x ∈ X, and we say that V is pessimistic if V(x) ≤ V*(x) for all x ∈ X. In fact, the SMDP Bellman operator has a slower convergence rate than the MDP Bellman operator T when acting on entries of V ∈ B(X; V_MAX) that are optimistic (Hauskrecht et al., 1998). This means that OFVI can only converge more quickly than PFVI when some of the iterates {V_k}_{k=0}^{K} are pessimistic in at least part of the state-space. For standard value iteration this is not a problem, because we can set the initial estimate V_0 to be pessimistic, and the fact that T is monotonic and converges to V* ensures that every subsequent iterate remains pessimistic.
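Because V* is bounded in [−V_MAX, V_MAX], pessimism of V_0 is easy to guarantee in practice: a constant initial estimate of −V_MAX is always pessimistic, and +V_MAX always optimistic. The helper below is a minimal illustrative sketch (the names are ours, not the paper's); the membership check is only a spot check on sampled states, since pessimism as defined above is a statement about every x ∈ X.

```python
def initial_estimate(v_max, pessimistic=True):
    """Constant initial value function V_0.
    V_0(x) = -V_MAX is pessimistic (V_0 <= V*), +V_MAX is optimistic (V_0 >= V*)."""
    v0 = -v_max if pessimistic else v_max
    return lambda x: v0

def is_pessimistic(V, V_star, sampled_states):
    """Check V(x) <= V*(x) on a set of sampled states (a necessary condition only;
    pessimism must hold for every state in X)."""
    return all(V(x) <= V_star(x) for x in sampled_states)
```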
Figure 1. Optimal Replacement Task: Convergence rates of PFVI and OFVI averaged over 100 trials (std. deviations are too small to visualize). (a) When the initial value function estimate V_0 is optimistic (V_0 ≥ V*), there is no difference between the convergence rates of PFVI and OFVI. (b) However, when V_0 is pessimistic (V_0 ≤ V*), OFVI converges faster than PFVI.

Figure 2. Optimal Replacement Task: Average iterates V_k (k = 2, 5, and 10) for PFVI and OFVI. (a) Columns 1, 2, and 3 show that the convergence rates of OFVI and PFVI are qualitatively similar when the initial value function is optimistic (V_0 ≥ V*). (b) When the initial value function is pessimistic (V_0 ≤ V*), OFVI's value function estimate after k = 2 iterations (column 1) is qualitatively similar to PFVI's value function estimate after k = 5 iterations (column 2).

Figure 3. Inventory Management Task: Average values predicted by pessimistic iterates for PFVI and OFVI (shaded regions denote 1 std. deviation).

Figure 4. Inventory Management Task: Cumulative reward of policies derived from PFVI and OFVI after each iteration compared to uniform random (Rand) and 1-step greedy policies (shaded regions denote 1 std. deviation).

The solid black line depicts the optimal value function V*. With an optimistic initial value function, the behavior of PFVI and OFVI is qualitatively identical. However, with a pessimistic initial value function, OFVI's second iterate is qualitatively similar to PFVI's fifth iterate.

5.2. Cyclic Inventory Management Task

In a basic inventory management setting, an agent controls the order policy for a single commodity (Scarf, 1959). Each round, the demand for the commodity is revealed (sampled from a distribution) and subtracted from the agent's inventory. The agent decides the quantity to order, and the order is filled immediately. If the agent did not have sufficient supply to meet the demand, it receives a high penalty (i.e., a large negative reward). On the other hand, ordering commodities and storing them are also penalized (i.e., given negative rewards). The objective is to find the policy that balances these penalties over time.

Cyclic inventory management problems are further complicated because the demand distribution changes after each round, but the distributions repeat after a finite number of rounds (Sethi & Cheng, 1997).

We introduce an eight-commodity, cyclic inventory management problem with finite storage. The demand distributions cycle every 12 months and there are 24 rounds per year. The agent must manage eight different commodities that are stored together in a finite warehouse. Ordering too much of quantity i means that there is less room in the warehouse for quantity j ≠ i. Thus the agent must work out complex trade-offs that depend on both the current inventory levels and the time of year. The details of the task and the exact parameters used in our experiments are described in the supplementary material.

The state-space was described by ⟨ψ, x⟩, where ψ ∈ {1, 2, ..., 24} denotes the period in the cycle and x is a vector determining the quantity of each commodity stored in the warehouse. To approximate the value function, we partitioned the state-space by the 24 periods in the cycle. Thus, each iterate was constructed from 24 independent function approximators. Because of the high dimensionality of the inventory space, we needed a function approximation architecture with good generalization properties. After experimenting with various architectures, we found that linear approximations with a fixed grid of one-dimensional radial basis functions generalized well with limited samples. Cross-validation was used to select grid density and basis widths.

For the OFVI condition, we created options based on the intuition that good inventory management policies order commodities in large quantities and make zero orders on as many rounds as possible to avoid the base ordering penalty. We defined options that always place zero orders and terminate once the inventory level of a specific commodity drops below a threshold. Options were added for twenty different threshold levels for each commodity.
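The zero-order options described above admit a very compact implementation. The sketch below is an illustrative rendering under assumed names and state layout (period, 8-dimensional inventory vector); the exact threshold levels and task parameters are given in the paper's supplementary material.

```python
import numpy as np

class ZeroOrderOption:
    """Place zero orders until the inventory of one commodity drops below a threshold.
    The state is assumed to be (period, inventory), with inventory a length-8 vector."""

    def __init__(self, commodity, threshold):
        self.commodity = commodity
        self.threshold = threshold

    def can_initiate(self, state):
        _, inventory = state
        # Only start while the tracked commodity is still at or above the threshold.
        return inventory[self.commodity] >= self.threshold

    def act(self, state):
        # The option's policy: order nothing for any commodity.
        return np.zeros(8)

    def should_terminate(self, state):
        _, inventory = state
        return inventory[self.commodity] < self.threshold

# Twenty threshold levels per commodity, as described in the text
# (the specific grid of levels here is an assumption).
options = [ZeroOrderOption(c, t)
           for c in range(8)
           for t in np.linspace(1, 20, 20)]
```

Each such option is initializable only while its tracked commodity is above the threshold, so the admissible set O_x varies with the state, exactly as the backup in (7) requires.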
This problem is difficult for two reasons. First, the agent must manage eight different commodities with limited, shared storage. Making a large order of one commodity reduces the space available for other commodities. Second, the demand distributions are cyclic, requiring adaptation to the time of year. The agent must plan ahead, stockpiling commodities when demand is low. However, if the agent fills its warehouse, it will not be able to adapt to unexpected situations that arise due to the problem's stochastic demands.

Figure 3 shows average (over 5,000 sampled states) predicted values for PFVI and OFVI when V_0 is pessimistic. The average predicted value increases more rapidly for OFVI than for PFVI. The curve marked with squares in Figure 3 depicts the fastest possible convergence rate with only primitive actions when V_0 is pessimistic. On most iterations, this curve falls well below the convergence rate of OFVI, until OFVI seems to have approximately converged.

Figure 4 shows cumulative reward received by policies derived from the iterates of PFVI and OFVI when V_0 is pessimistic. Cumulative rewards were recorded over 100 timesteps starting from a state with zero inventory. Policies derived from PFVI and OFVI outperform random and 1-step greedy policies after a single iteration. Policies derived from OFVI converge to a good solution with fewer iterations than PFVI. The decrease in performance of policies derived from PFVI and OFVI in later iterations is due to approximation error (the fact that our function approximation architecture does not exactly fit the points generated by Bellman backups, explained by the first term on the right-hand side of (4), (8), and (9)). When V_0 is optimistic, policies derived from PFVI and OFVI achieve similar cumulative reward at each iteration (not shown).

6. Discussion

We demonstrated two different tasks where augmenting the primitive actions with temporally extended actions leads to faster convergence. As our theoretical analysis predicted, when V_0 was pessimistic, the additional temporally extended actions helped OFVI to converge faster than PFVI. However, adding additional options increases the computational and sample complexity of each iteration of OFVI. Thus, randomly generating hundreds of options will probably not lead to an overall improvement in the computational complexity of AVI. However, adding a few options does not significantly increase the cost of each iteration.

A natural extension of this work is to consider automatically generating options that speed up planning. Many previous works have looked at generating options (McGovern & Barto, 2001; Stolle & Precup, 2002; Mannor et al., 2004; Silver & Ciosek, 2012). What is missing from the literature are theoretical analyses that enable us to evaluate and compare different strategies for generating options.

Options may have other benefits for planning besides improving the convergence rate. For example, options may enable a planning algorithm to "skip over" regions of the state-space with highly complex dynamics without impacting the quality of the planned policy. In partially observable environments, options may be exploited to decrease uncertainty about the hidden state by "skipping over" regions of the state-space where there is large observation variance, or by "testing" hypotheses about the hidden state (Mann et al., 2013). Options may also play an important role in robust optimization, where the dynamics of temporally extended actions are known with greater certainty than the dynamics of a sequence of primitive actions.

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Program (FP7/2007-2013) / ERC Grant Agreement No. 306638.

References

Hauskrecht, Milos, Meuleau, Nicolas, Kaelbling, Leslie Pack, Dean, Thomas, and Boutilier, Craig. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 220–229, 1998.

He, Ruijie, Brunskill, Emma, and Roy, Nicholas. Efficient planning under uncertainty with macro-actions. Journal of Artificial Intelligence Research, 40:523–570, 2011.

Mann, Timothy A., Park, Yunjung, Jeong, Sungmoon, Lee, Minho, and Choe, Yoonsuck. Autonomous and interactive improvement of binocular visual depth estimation through sensorimotor interaction. Autonomous Mental Development, IEEE Transactions on, 5(1):74–84, 2013. ISSN 1943-0604. doi: 10.1109/TAMD.2012.2216524.
Mannor, Shie, Menache, Ishai, Hoze, Amit, and Klein, Uri. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, pp. 71–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. doi: 10.1145/1015330.1015355. URL http://doi.acm.org/10.1145/1015330.1015355.

McGovern, Amy and Barto, Andrew G. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the 18th International Conference on Machine Learning, pp. 361–368, San Francisco, USA, 2001.

Munos, Rémi. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, 2005.

Munos, Rémi and Szepesvári, Csaba. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Precup, Doina, Sutton, Richard S., and Singh, Satinder. Theoretical results on reinforcement learning with temporally abstract options. In European Conference on Machine Learning, 1998.
