Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

Omar Besbes, Columbia University, New York, NY, ob2105@columbia.edu
Yonatan Gur, Stanford University, Stanford, CA, ygur@stanford.edu
Assaf Zeevi, Columbia University, New York, NY, assaf@gsb.columbia.edu

Abstract

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small can this price be as a function of the horizon length T. This problem has been studied extensively when the reward distributions do not change over time; an assumption that supports a sharp characterization of the regret, yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

1 Introduction

Background and motivation. In the presence of uncertainty and partial feedback on rewards, an agent that faces a sequence of decisions needs to judiciously use information collected from past observations when trying to optimize future actions. A widely studied paradigm that captures this tension between the acquisition cost of new information (exploration) and the generation of instantaneous rewards based on the existing information (exploitation) is that of multi-armed bandits (MAB), originally proposed in the context of drug testing by [1], and placed in a general setting by [2]. The original setting has a gambler choosing among K slot machines at each round of play, and upon that selection observing a reward realization. In this classical formulation the rewards are assumed to be independent and identically distributed according to an unknown distribution that characterizes each machine. The objective is to maximize the expected sum of (possibly discounted) rewards received over a given (possibly infinite) time horizon. Since their inception, MAB problems with various modifications have been studied extensively in Statistics, Economics, Operations Research, and Computer Science, and are used to model a plethora of dynamic optimization problems under uncertainty; examples include clinical trials ([3]), strategic pricing ([4]), investment in innovation ([5]), packet routing ([6]), on-line auctions ([7]), assortment selection ([8]), and on-line advertising ([9]), to name but a few.
For overviews and further references cf. the monographs by [10], [11] for Bayesian/dynamic programming formulations, and [12] that covers the machine learning literature and the so-called adversarial setting. Since the set of MAB instances in which one can identify the optimal policy is extremely limited, a typical yardstick to measure performance of a candidate policy is to compare it to a benchmark: an oracle that at each time instant selects the arm that maximizes expected reward. The difference between the performance of the policy and that of the oracle is called the regret. When the growth of the regret as a function of the horizon T is sub-linear, the policy is long-run average optimal: its long-run average performance converges to that of the oracle. Hence the first-order objective is to develop policies with this characteristic. The precise rate of growth of the regret as a function of T provides a refined measure of policy performance. [13] is the first paper that provides a sharp characterization of the regret growth rate in the context of the traditional (stationary random rewards) setting, often referred to as the stochastic MAB problem. Most of the literature has followed this path with the objective of designing policies that exhibit the "slowest possible" rate of growth in the regret (often referred to as rate optimal policies).

In many application domains, several of which were noted above, temporal changes in the reward distribution structure are an intrinsic characteristic of the problem. These are ignored in the traditional stochastic MAB formulation, but there have been several attempts to extend that framework. The origin of this line of work can be traced back to [14] who considered a case where only the state of the chosen arm can change, giving rise to a rich line of work (see, e.g., [15], and [16]). In particular, [17] introduced the term restless bandits; a model in which the states (associated with reward distributions) of arms change in each step according to an arbitrary, yet known, stochastic process. Considered a hard class of problems (cf. [18]), this line of work has led to various approximations (see, e.g., [19]), relaxations (see, e.g., [20]), and considerations of more detailed processes (see, e.g., [21] for irreducible Markov processes, and [22] for a class of history-dependent rewards).

Departure from the stationarity assumption that has dominated much of the MAB literature raises fundamental questions as to how one should model temporal uncertainty in rewards, and how to benchmark performance of candidate policies. One view is to allow the reward realizations to be selected at any point in time by an adversary. These ideas have their origins in game theory with the work of [23] and [24], and have since seen significant development; [25] and [12] provide reviews of this line of research. Within this so-called adversarial formulation, the efficacy of a policy over a given time horizon T is often measured relative to a benchmark defined by the single best action one could have taken in hindsight (after seeing all reward realizations). The single best action benchmark represents a static oracle, as it is constrained to a single (static) action. This static oracle can perform quite poorly relative to a dynamic oracle that follows the optimal dynamic sequence of actions, as the latter optimizes the (expected) reward at each time instant over all possible actions.¹ Thus, a potential limitation of the adversarial framework is that even if a policy has a "small" regret relative to a static oracle, there is no guarantee with regard to its performance relative to the dynamic oracle.

Main contributions. The main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward "variation" and the minimal achievable regret. More specifically, the paper's contributions are along four dimensions. On the modeling side we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world type phenomena, yet is mathematically tractable.
The main constraint that we impose on the evolution of the mean rewards is that their variation over the relevant time horizon is bounded by a variation budget $V_T$; a concept that was recently introduced in [26] in the context of non-stationary stochastic approximation. This limits the power of nature compared to the adversarial setup discussed above, where rewards can be picked to maximally affect the policy's performance at each instance within $\{1,\ldots,T\}$. Nevertheless, this constraint allows for a rich class of temporal changes, extending most of the treatment in the non-stationary stochastic MAB literature, which mainly focuses on a finite number of changes in the mean rewards, see, e.g., [27] and references therein. We further discuss connections with studied non-stationary instances in Section 6.

The second dimension of contribution lies in the analysis domain. For a general class of non-stationary reward distributions we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved, uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon $T$, the variation budget $V_T$, and the number of arms $K$. Our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear and is of order $(K V_T)^{1/3} T^{2/3}$. Our analysis complements studied non-stationary instances by treating a broad and flexible class of temporal changes in the reward distributions, yet still establishing optimality results and showing that sublinear regret is achievable. Our results provide a spectrum of orders of the minimax regret ranging between order $T^{2/3}$ (when $V_T$ is a constant independent of $T$) and order $T$ (when $V_T$ grows linearly with $T$), mapping allowed variation to best achievable performance.

With the analysis described above we shed light on the exploration-exploitation trade-off that characterizes the non-stationary reward setting, and the change in this trade-off compared to the stationary setting. In particular, our results highlight the tension that exists between the need to "remember" and "forget." This is characteristic of several algorithms that have been developed in the adversarial MAB literature, e.g., the family of exponential weight methods such as EXP3, EXP3.S and the like; see, e.g., [28], and [12]. In a nutshell, the fewer past observations one retains, the larger the stochastic error associated with one's estimates of the mean rewards, while at the same time using more past observations increases the risk of these being biased.

One interesting observation drawn in this paper connects the adversarial MAB setting and the non-stationary environment studied here. In particular, as in [26], it is seen that an optimal policy in the adversarial setting may be suitably calibrated to perform near-optimally in the non-stationary stochastic setting. This will be further discussed after the main results are established.

¹ Under non-stationary rewards it is immediate that the single best action may be sub-optimal in many decision epochs, and the performance gap between the static and the dynamic oracles can grow linearly with $T$.

2 Problem Formulation

Let $\mathcal{K} = \{1,\ldots,K\}$ be a set of arms. Let $\mathcal{T} = \{1,2,\ldots,T\}$ denote a sequence of decision epochs faced by a decision maker. At any epoch $t \in \mathcal{T}$, the decision-maker pulls one of the $K$ arms. When pulling arm $k \in \mathcal{K}$ at epoch $t \in \mathcal{T}$, a reward $X_t^k \in [0,1]$ is obtained, where $X_t^k$ is a random variable with expectation $\mu_t^k = \mathbb{E}\big[X_t^k\big]$. We denote the best possible expected reward at decision epoch $t$ by $\mu_t^*$, i.e., $\mu_t^* = \max_{k \in \mathcal{K}} \mu_t^k$.

Changes in the expected rewards of the arms. We assume the expected reward of each arm $\mu_t^k$ may change at any decision epoch. We denote by $\mu^k$ the sequence of expected rewards of arm $k$: $\mu^k = \{\mu_t^k\}_{t=1}^{T}$. In addition, we denote by $\mu$ the sequence of vectors of all $K$ expected rewards: $\mu = \{\mu^k\}_{k=1}^{K}$.
We assume that the expected reward of each arm can change an arbitrary number of times, but bound the total variation of the expected rewards:
\[
\sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right|. \tag{1}
\]
Let $\{V_t : t = 1,2,\ldots\}$ be a non-decreasing sequence of positive real numbers such that $V_1 = 0$, $K V_t \le t$ for all $t$, and for normalization purposes set $V_2 = 2K^{-1}$. We refer to $V_T$ as the variation budget over $\mathcal{T}$. We define the corresponding temporal uncertainty set as the set of reward vector sequences that are subject to the variation budget $V_T$ over the set of decision epochs $\{1,\ldots,T\}$:
\[
\mathcal{V} = \left\{ \mu \in [0,1]^{K \times T} : \sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right| \le V_T \right\}.
\]
The variation budget captures the constraint imposed on the non-stationary environment faced by the decision-maker. While limiting the possible evolution in the environment, it allows for numerous forms in which the expected rewards may change: continuously, in discrete shocks, and at a changing rate (Figure 1 depicts two different variation patterns that correspond to the same variation budget). In general, the variation budget $V_T$ is designed to depend on the number of pulls $T$.

Figure 1: Two instances of variation in the mean rewards: (Left) A fixed variation budget (that equals 3) is "spent" over the whole horizon. (Right) The same budget is "spent" in the first third of the horizon.

Admissible policies, performance, and regret. Let $U$ be a random variable defined over a probability space $(\mathbb{U}, \mathcal{U}, \mathbb{P}_u)$. Let $\pi_1 : \mathbb{U} \to \mathcal{K}$ and $\pi_t : [0,1]^{t-1} \times \mathbb{U} \to \mathcal{K}$ for $t = 2,3,\ldots$ be measurable functions. With some abuse of notation we denote by $\pi_t \in \mathcal{K}$ the action at time $t$, that is given by
\[
\pi_t =
\begin{cases}
\pi_1(U) & t = 1,\\[2pt]
\pi_t\big(X_{t-1}^{\pi}, \ldots, X_1^{\pi}, U\big) & t = 2,3,\ldots
\end{cases}
\]
The mappings $\{\pi_t : t = 1,\ldots,T\}$ together with the distribution $\mathbb{P}_u$ define the class of admissible policies. We denote this class by $\mathcal{P}$. We further denote by $\{\mathcal{H}_t,\ t = 1,\ldots,T\}$ the filtration associated with a policy $\pi \in \mathcal{P}$, such that $\mathcal{H}_1 = \sigma(U)$ and $\mathcal{H}_t = \sigma\big(\{X_j^{\pi}\}_{j=1}^{t-1}, U\big)$ for all $t \in \{2,3,\ldots\}$. Note that policies in $\mathcal{P}$ are non-anticipating, i.e., depend only on the past history of actions and observations, and allow for randomized strategies via their dependence on $U$.

We define the regret under policy $\pi \in \mathcal{P}$ compared to a dynamic oracle as the worst-case difference between the expected performance of pulling at each epoch $t$ the arm which has the highest expected reward at epoch $t$ (the dynamic oracle performance) and the expected performance under policy $\pi$:
\[
\mathcal{R}^{\pi}(\mathcal{V}, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^* - \mathbb{E}^{\pi}\left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\},
\]
where the expectation $\mathbb{E}^{\pi}[\cdot]$ is taken with respect to the noisy rewards, as well as to the policy's actions. In addition, we denote by $\mathcal{R}^*(\mathcal{V}, T)$ the minimal worst-case regret that can be guaranteed by an admissible policy $\pi \in \mathcal{P}$, that is, $\mathcal{R}^*(\mathcal{V}, T) = \inf_{\pi \in \mathcal{P}} \mathcal{R}^{\pi}(\mathcal{V}, T)$. Then, $\mathcal{R}^*(\mathcal{V}, T)$ is the best achievable performance. In the following sections we study the magnitude of $\mathcal{R}^*(\mathcal{V}, T)$. We analyze the magnitude of this quantity by establishing upper and lower bounds; in these bounds we refer to a constant $C$ as absolute if it is independent of $K$, $V_T$, and $T$.
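To make the formulation concrete, the following minimal sketch (illustrative only, not taken from the paper; NumPy is assumed and all parameter values and helper names are hypothetical) builds one admissible Bernoulli instance $\mu \in \mathcal{V}$, verifies the variation-budget constraint defining $\mathcal{V}$, and measures the regret of a fixed-arm policy against the dynamic oracle $\sum_t \mu_t^*$:

```python
# Sketch of the Section 2 objects: mean rewards mu[t, k] that drift under a
# variation budget V_T, and regret measured against the dynamic oracle.
import numpy as np

rng = np.random.default_rng(0)
T, K, V_T = 10_000, 3, 3.0          # horizon, arms, variation budget (hypothetical values)

# One admissible instance: every arm moves by a small step each epoch, so the
# total variation sum_t sup_k |mu[t+1,k] - mu[t,k]| stays within the budget V_T.
step = V_T / (T - 1)
mu = np.empty((T, K))
mu[0] = rng.uniform(0.25, 0.75, size=K)
for t in range(1, T):
    mu[t] = np.clip(mu[t - 1] + rng.choice([-step, step], size=K), 0.0, 1.0)

variation = np.abs(np.diff(mu, axis=0)).max(axis=1).sum()
assert variation <= V_T + 1e-9      # the instance belongs to the uncertainty set V

def dynamic_oracle_value(mu):
    """Expected cumulative reward of pulling the best arm at every epoch."""
    return mu.max(axis=1).sum()

def expected_regret(mu, pulls):
    """Regret of a sequence of pulled arms, in expectation over the Bernoulli noise."""
    collected = mu[np.arange(len(pulls)), pulls].sum()
    return dynamic_oracle_value(mu) - collected

# The single best arm in hindsight (the static oracle) as a candidate policy.
static_best = int(mu.sum(axis=0).argmax())
print("regret of the single best arm vs. the dynamic oracle:",
      expected_regret(mu, np.full(T, static_best)))
```

Whenever the arms' mean rewards cross during the horizon, the single best arm in hindsight already lags the dynamic oracle; this is exactly the gap that guarantees stated against a static oracle do not control.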
3 Lower bound on the best achievable performance

We next provide a lower bound on the best achievable performance.

Theorem 1. Assume that rewards have a Bernoulli distribution. Then, there is some absolute constant $C > 0$ such that for any policy $\pi \in \mathcal{P}$ and for any $T \ge 1$, $K \ge 2$ and $V_T \in [K^{-1}, K^{-1}T]$,
\[
\mathcal{R}^{\pi}(\mathcal{V}, T) \ge C (K V_T)^{1/3} T^{2/3}.
\]

We note that when reward distributions are stationary, there are known policies such as UCB1 ([29]) that achieve regret of order $\sqrt{T}$ in the stochastic setup. When the reward structure is non-stationary and defined by the class $\mathcal{V}$, then no policy may achieve such a performance and the best performance must incur a regret of at least order $T^{2/3}$. This additional complexity embedded in the non-stationary stochastic MAB problem compared to the stationary one will be further discussed in Section 6. We note that Theorem 1 also holds when $V_T$ is increasing with $T$. In particular, when the variation budget is linear in $T$, the regret grows linearly and long-run average optimality is not achievable.

The driver of the change in the best achievable performance relative to the one established in a stationary environment is a second trade-off (over the tension between exploring different arms and capitalizing on the information already collected) introduced by the non-stationary environment, between "remembering" and "forgetting": estimating the expected rewards is done based on past observations of rewards. While keeping track of more observations may decrease the variance of mean reward estimates, the non-stationary environment implies that "old" information is potentially less relevant due to possible changes in the underlying rewards. The changing rewards give incentive to dismiss old information, which in turn encourages enhanced exploration. The proof of Theorem 1 emphasizes the impact of these trade-offs on the achievable performance.

Key ideas in the proof. At a high level the proof of Theorem 1 builds on ideas of identifying a worst-case "strategy" of nature (e.g., [28], proof of Theorem 5.1), adapting them to our setting. While the proof is deferred to the online companion (as supporting material), we next describe the key ideas when $V_T = 1$.² We define a subset of vector sequences $\mathcal{V}' \subset \mathcal{V}$ and show that when $\mu$ is drawn randomly from $\mathcal{V}'$, any admissible policy must incur regret of order $(K V_T)^{1/3} T^{2/3}$. We define a partition of the decision horizon $\mathcal{T}$ into batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\tilde{\Delta}_T$ each (except, possibly the last batch):
\[
\mathcal{T}_j = \left\{ t : (j-1)\tilde{\Delta}_T + 1 \le t \le \min\big\{ j \tilde{\Delta}_T, T \big\} \right\}, \quad \text{for all } j = 1, \ldots, m, \tag{2}
\]
where $m = \lceil T / \tilde{\Delta}_T \rceil$ is the number of batches. In $\mathcal{V}'$, in every batch there is exactly one "good" arm with expected reward $1/2 + \varepsilon$ for some $0 < \varepsilon \le 1/4$, and all the other arms have expected reward $1/2$. The "good" arm is drawn independently in the beginning of each batch according to a discrete uniform distribution over $\{1,\ldots,K\}$. Thus, the identity of the "good" arm can change only between batches. By selecting $\varepsilon$ such that $\varepsilon T / \tilde{\Delta}_T \le V_T$, any $\mu \in \mathcal{V}'$ is composed of expected reward sequences with a variation of at most $V_T$, and therefore $\mathcal{V}' \subset \mathcal{V}$. Given the draws under which expected reward sequences are generated, nature prevents any accumulation of information from one batch to another, since at the beginning of each batch a new "good" arm is drawn independently of the history.

The proof of Theorem 1 establishes that when $\varepsilon \approx 1/\sqrt{\tilde{\Delta}_T}$ no admissible policy can identify the "good" arm with high probability within a batch. Since there are $\tilde{\Delta}_T$ epochs in each batch, the regret that any policy must incur along a batch is of order $\tilde{\Delta}_T \cdot \varepsilon \approx \sqrt{\tilde{\Delta}_T}$, which yields a regret of order $\sqrt{\tilde{\Delta}_T} \cdot T / \tilde{\Delta}_T \approx T / \sqrt{\tilde{\Delta}_T}$ throughout the whole horizon. Selecting the smallest feasible $\tilde{\Delta}_T$ such that the variation budget constraint is satisfied leads to $\tilde{\Delta}_T \approx T^{2/3}$, yielding a regret of order $T^{2/3}$ throughout the horizon.

² For the sake of simplicity, the discussion in this paragraph assumes a variation budget that is fixed and independent of $T$; the proof of Theorem 3 details a general treatment for a budget that depends on $T$.
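To see why this balancing pins down the batch size, here is an informal order-of-magnitude version of the argument above (a sketch only, not the formal proof; constants and the dependence on $K$ are suppressed). With $\varepsilon \approx 1/\sqrt{\tilde{\Delta}_T}$, the per-batch regret is of order $\tilde{\Delta}_T \varepsilon \approx \sqrt{\tilde{\Delta}_T}$, so over the $T/\tilde{\Delta}_T$ batches
\[
\text{regret} \;\approx\; \sqrt{\tilde{\Delta}_T}\cdot\frac{T}{\tilde{\Delta}_T} \;=\; \frac{T}{\sqrt{\tilde{\Delta}_T}},
\qquad
\text{subject to}\quad \varepsilon\cdot\frac{T}{\tilde{\Delta}_T} \;\approx\; \frac{T}{\tilde{\Delta}_T^{3/2}} \;\le\; V_T .
\]
The smallest feasible batch size is therefore $\tilde{\Delta}_T \approx (T/V_T)^{2/3}$, which gives regret of order $T/\sqrt{\tilde{\Delta}_T} \approx V_T^{1/3} T^{2/3}$ (and $T^{2/3}$ when $V_T$ is a constant), consistent with the $(K V_T)^{1/3} T^{2/3}$ rate of Theorem 1 once the $K$-dependence is reinstated.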
4 A near-optimal policy

We apply the ideas underlying the lower bound in Theorem 1 to develop a rate optimal policy for the non-stationary stochastic MAB problem with a variation budget. Consider the following policy:

Rexp3. Inputs: a positive number $\gamma$, and a batch size $\Delta_T$.
1. Set batch index $j = 1$.
2. Repeat while $j \le \lceil T / \Delta_T \rceil$:
   (a) Set $\tau = (j-1)\Delta_T$.
   (b) Initialization: for any $k \in \mathcal{K}$ set $w_t^k = 1$.
   (c) Repeat for $t = \tau + 1, \ldots, \min\{T, \tau + \Delta_T\}$:
       - For each $k \in \mathcal{K}$, set
         \[
         p_t^k = (1 - \gamma)\,\frac{w_t^k}{\sum_{k'=1}^{K} w_t^{k'}} + \frac{\gamma}{K}.
         \]
       - Draw an arm $k'$ from $\mathcal{K}$ according to the distribution $\{p_t^k\}_{k=1}^{K}$.
       - Receive a reward $X_t^{k'}$.
       - For $k'$ set $\hat{X}_t^{k'} = X_t^{k'} / p_t^{k'}$, and for any $k \neq k'$ set $\hat{X}_t^k = 0$. For all $k \in \mathcal{K}$ update:
         \[
         w_{t+1}^k = w_t^k \exp\left\{ \frac{\gamma \hat{X}_t^k}{K} \right\}.
         \]
   (d) Set $j = j + 1$, and return to the beginning of step 2.

Clearly $\pi \in \mathcal{P}$. The Rexp3 policy uses Exp3, a policy introduced by [30] for solving a worst-case sequential allocation problem, as a subroutine, restarting it every $\Delta_T$ epochs.
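For readers who prefer executable pseudocode, here is a minimal runnable sketch of the Rexp3 procedure above (NumPy assumed; the environment callback `draw_reward(t, k)` is a hypothetical stand-in for observing $X_t^{k}$, and the tuning mirrors the batch size and $\gamma$ used in Theorem 2 below):

```python
# Sketch of Rexp3: Exp3 restarted every Delta_T epochs.
import math
import numpy as np

def rexp3(T, K, V_T, draw_reward, seed=None):
    rng = np.random.default_rng(seed)
    # Tuning as in Theorem 2: batch size Delta_T and exploration rate gamma.
    delta_T = math.ceil((K * math.log(K)) ** (1 / 3) * (T / V_T) ** (2 / 3))
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * delta_T)))
    pulls, rewards, t = [], [], 0
    while t < T:                                  # step 2: one Exp3 run per batch
        w = np.ones(K)                            # (b) reset the weights: "forget" old data
        for _ in range(min(delta_T, T - t)):      # (c) epochs of the current batch
            p = (1 - gamma) * w / w.sum() + gamma / K
            k = int(rng.choice(K, p=p))
            x = draw_reward(t, k)                 # only the pulled arm's reward is observed
            x_hat = np.zeros(K)
            x_hat[k] = x / p[k]                   # importance-weighted reward estimate
            w *= np.exp(gamma * x_hat / K)
            w /= w.max()                          # rescaling for numerical stability; p is unchanged
            pulls.append(k)
            rewards.append(x)
            t += 1
    return np.array(pulls), np.array(rewards)

# Hypothetical usage against the Bernoulli instance mu from the earlier sketch:
# pulls, rewards = rexp3(T, K, V_T, lambda t, k: float(np.random.random() < mu[t, k]))
```

Resetting the weights at the start of each batch is the "forgetting" step; within a batch the updates coincide with the Exp3 weights of the boxed procedure above.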
Theorem 2. Let $\pi$ be the Rexp3 policy with a batch size $\Delta_T = \big\lceil (K \log K)^{1/3} (T / V_T)^{2/3} \big\rceil$ and with $\gamma = \min\left\{1, \sqrt{\tfrac{K \log K}{(e-1)\Delta_T}}\right\}$. Then, there is some absolute constant $C$ such that for every $T \ge 1$, $K \ge 2$, and $V_T \in [K^{-1}, K^{-1}T]$:
\[
\mathcal{R}^{\pi}(\mathcal{V}, T) \le C (K \log K \cdot V_T)^{1/3} T^{2/3}.
\]

Theorem 2 is obtained by establishing a connection between the regret relative to the single best action in the adversarial setting, and the regret with respect to the dynamic oracle in the non-stationary stochastic setting with a variation budget. Several classes of policies, such as exponential-weight (including Exp3) and polynomial-weight policies, have been shown to achieve regret of order $\sqrt{T}$ with respect to the single best action in the adversarial setting (see chapter 6 of [12] for a review). While in general these policies tend to perform well numerically, there is no guarantee for their performance relative to the dynamic oracle studied in this paper, since the single best action itself may incur linear regret relative to the dynamic oracle; see also [31] for a study of the empirical performance of one class of algorithms. The proof of Theorem 2 shows that any policy that achieves regret of order $\sqrt{T}$ with respect to the single best action in the adversarial setting can be used as a subroutine to obtain near-optimal performance with respect to the dynamic oracle in our setting.

Rexp3 emphasizes the two trade-offs discussed in the previous section. The first trade-off, information acquisition versus capitalizing on existing information, is captured by the subroutine policy Exp3. In fact, any policy that achieves a good performance compared to a single best action benchmark in the adversarial setting must balance exploration and exploitation. The second trade-off, "remembering" versus "forgetting," is captured by restarting Exp3 and forgetting any acquired information every $\Delta_T$ pulls. Thus, old information that may slow down the adaptation to the changing environment is being discarded.

Theorem 1 and Theorem 2 together characterize the minimax regret (up to a multiplicative factor, logarithmic in the number of arms) in a full spectrum of variations $V_T$:
\[
\mathcal{R}^*(\mathcal{V}, T) \asymp (K V_T)^{1/3} T^{2/3}.
\]
Hence, we have quantified the impact of the extent of change in the environment on the best achievable performance in this broad class of problems. For example, for the case in which $V_T = C T^{\beta}$ for some absolute constant $C$ and $0 \le \beta \le 1$, the best achievable regret is of order $T^{(2+\beta)/3}$.

We finally note that restarting is only one way of adapting policies from the adversarial MAB setting to achieve near optimality in the non-stationary stochastic setting; a way that articulates well the principles leading to near optimality. In the online companion we demonstrate that near optimality can be achieved by other adaptation methods, showing that the Exp3.S policy (given in [28]) can be tuned by $\alpha = 1/T$ and $\gamma \asymp (K V_T / T)^{1/3}$ to achieve near optimality in our setting, without restarting.

5 Proof of Theorem 2

The structure of the proof is as follows. First, we break the horizon into a sequence of batches of size $\Delta_T$ each, and analyze the performance gap between the single best action and the dynamic oracle in each batch. Then, we plug in a known performance guarantee for Exp3 relative to the single best action, and sum over batches to establish the regret of Rexp3 relative to the dynamic oracle.

Step 1 (Preliminaries). Fix $T \ge 1$, $K \ge 2$, and $V_T \in [K^{-1}, K^{-1}T]$. Let $\pi$ be the Rexp3 policy, tuned by $\gamma = \min\left\{1, \sqrt{\tfrac{K \log K}{(e-1)\Delta_T}}\right\}$ and $\Delta_T \in \{1,\ldots,T\}$ (to be specified later on). We break the horizon $\mathcal{T}$ into a sequence of batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\Delta_T$ each (except, possibly $\mathcal{T}_m$) according to (2). Let $\mu \in \mathcal{V}$, and fix $j \in \{1,\ldots,m\}$. We decompose the regret in batch $j$:
\[
\mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \left( \mu_t^* - \mu_t^{\pi_t} \right) \right]
= \underbrace{\sum_{t \in \mathcal{T}_j} \mu_t^* - \mathbb{E}\left[ \max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} X_t^k \right\} \right]}_{J_{1,j}}
\; + \;
\underbrace{\mathbb{E}\left[ \max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} X_t^k \right\} \right] - \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \mu_t^{\pi_t} \right]}_{J_{2,j}}. \tag{3}
\]
The first component, $J_{1,j}$, is the expected loss associated with using a single action over batch $j$. The second component, $J_{2,j}$, is the expected regret relative to the best static action in batch $j$.

Step 2 (Analysis of $J_{1,j}$ and $J_{2,j}$). Defining $\mu_{T+1}^k = \mu_T^k$ for all $k \in \mathcal{K}$, we denote the variation in expected rewards along batch $\mathcal{T}_j$ by $V_j = \sum_{t \in \mathcal{T}_j} \max_{k \in \mathcal{K}} \left| \mu_{t+1}^k - \mu_t^k \right|$. We note that:
\[
\sum_{j=1}^{m} V_j = \sum_{j=1}^{m} \sum_{t \in \mathcal{T}_j} \max_{k \in \mathcal{K}} \left| \mu_{t+1}^k - \mu_t^k \right| \le V_T. \tag{4}
\]
Let $k_0$ be an arm with best expected performance over $\mathcal{T}_j$: $k_0 \in \arg\max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} \mu_t^k \right\}$. Then,
\[
\max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} \mu_t^k \right\} = \sum_{t \in \mathcal{T}_j} \mu_t^{k_0} = \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} X_t^{k_0} \right] \le \mathbb{E}\left[ \max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} X_t^k \right\} \right], \tag{5}
\]
and therefore, one has:
\[
J_{1,j} = \sum_{t \in \mathcal{T}_j} \mu_t^* - \mathbb{E}\left[ \max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} X_t^k \right\} \right]
\overset{(a)}{\le} \sum_{t \in \mathcal{T}_j} \left( \mu_t^* - \mu_t^{k_0} \right)
\le \Delta_T \max_{t \in \mathcal{T}_j} \left\{ \mu_t^* - \mu_t^{k_0} \right\}
\overset{(b)}{\le} 2 V_j \Delta_T, \tag{6}
\]
for any $\mu \in \mathcal{V}$ and $j \in \{1,\ldots,m\}$, where (a) holds by (5) and (b) holds by the following argument: otherwise there is an epoch $t_0 \in \mathcal{T}_j$ for which $\mu_{t_0}^* - \mu_{t_0}^{k_0} > 2 V_j$. Indeed, let $k_1 = \arg\max_{k \in \mathcal{K}} \mu_{t_0}^k$. In such a case, for all $t \in \mathcal{T}_j$ one has $\mu_t^{k_1} \ge \mu_{t_0}^{k_1} - V_j > \mu_{t_0}^{k_0} + V_j \ge \mu_t^{k_0}$, since $V_j$ is the maximal variation in batch $\mathcal{T}_j$. This, however, contradicts the optimality of $k_0$ at epoch $t$, and thus (6) holds. In addition, Corollary 3.2 in [28] points out that the regret incurred by Exp3 (tuned by $\gamma = \min\left\{1, \sqrt{\tfrac{K \log K}{(e-1)\Delta_T}}\right\}$) along $\Delta_T$ epochs, relative to the single best action, is bounded by $2\sqrt{e-1}\sqrt{\Delta_T K \log K}$. Therefore, for each $j \in \{1,\ldots,m\}$ one has
\[
J_{2,j} = \mathbb{E}\left[ \max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} X_t^k \right\} \right] - \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \mu_t^{\pi_t} \right]
\overset{(a)}{\le} 2\sqrt{e-1}\sqrt{\Delta_T K \log K}, \tag{7}
\]
for any $\mu \in \mathcal{V}$, where (a) holds since within each batch arms are pulled according to Exp3($\gamma$).

Step 3 (Regret throughout the horizon). Summing over $m = \lceil T / \Delta_T \rceil$ batches we have:
\[
\mathcal{R}^{\pi}(\mathcal{V}, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^* - \mathbb{E}^{\pi}\left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\}
\overset{(a)}{\le} \sum_{j=1}^{m} \left( 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 V_j \Delta_T \right)
\overset{(b)}{\le} \left( \frac{T}{\Delta_T} + 1 \right) 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 \Delta_T V_T
= \frac{2\sqrt{e-1}\sqrt{K \log K}\, T}{\sqrt{\Delta_T}} + 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 \Delta_T V_T, \tag{8}
\]
where (a) holds by (3), (6), and (7); and (b) follows from (4). Finally, selecting $\Delta_T = \big\lceil (K \log K)^{1/3} (T / V_T)^{2/3} \big\rceil$, we establish:
\[
\mathcal{R}^{\pi}(\mathcal{V}, T)
\le 2\sqrt{e-1} (K \log K \cdot V_T)^{1/3} T^{2/3}
+ 2\sqrt{e-1}\sqrt{\left( (K \log K)^{1/3} (T/V_T)^{2/3} + 1 \right) K \log K}
+ 2\left( (K \log K)^{1/3} (T/V_T)^{2/3} + 1 \right) V_T
\overset{(a)}{\le} \left( 2 + 2\sqrt{2}\sqrt{e-1} + 4 \right) (K \log K \cdot V_T)^{1/3} T^{2/3},
\]
where (a) follows from $T \ge K \ge 2$ and $V_T \in [K^{-1}, K^{-1}T]$. This concludes the proof.

6 Discussion

Unknown variation budget. The Rexp3 policy relies on prior knowledge of $V_T$, but predictions of $V_T$ may be inaccurate (such estimation can be maintained from historical data if actions are occasionally randomized, for example, by fitting $V_T = T^{\alpha}$). Denoting the "true" variation budget by $V_T$ and the estimate that is used by the agent when tuning Rexp3 by $\hat{V}_T$, one may observe that the analysis in the proof of Theorem 2 holds until equation (8), but then $\Delta_T$ will be tuned using $\hat{V}_T$. This implies that when $V_T$ and $\hat{V}_T$ are "close," Rexp3 still guarantees long-run average optimality. For example, suppose that Rexp3 is tuned by $\hat{V}_T = T^{\alpha}$, but the variation is $V_T = T^{\alpha+\delta}$. Then sublinear regret (of order $T^{2/3+\alpha/3+\delta}$) is guaranteed as long as $\delta < (1-\alpha)/3$; e.g., if $\alpha = 0$ and $\delta = 1/4$, Rexp3 guarantees regret of order $T^{11/12}$ (accurate tuning would have guaranteed order $T^{3/4}$). Since there are no restrictions on the rate at which the variation budget can be spent, an interesting and potentially challenging open problem is to delineate to what extent it is possible to design adaptive policies that do not use prior knowledge of $V_T$, yet guarantee "good" performance.
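As an informal check of the rates quoted above (a back-of-the-envelope computation based on equation (8), ignoring constants and the $K \log K$ factor): tuning with $\hat{V}_T = T^{\alpha}$ gives $\Delta_T \approx (T/\hat{V}_T)^{2/3} = T^{2(1-\alpha)/3}$, and the dominant terms of (8) under the true budget $V_T = T^{\alpha+\delta}$ are
\[
\frac{T}{\sqrt{\Delta_T}} + \Delta_T V_T
\;\approx\; T^{(2+\alpha)/3} + T^{2(1-\alpha)/3 + \alpha + \delta}
\;\approx\; T^{2/3 + \alpha/3 + \delta},
\]
which is sublinear precisely when $\delta < (1-\alpha)/3$; with $\alpha = 0$ and $\delta = 1/4$ this yields order $T^{11/12}$, versus order $T^{3/4}$ under accurate tuning.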
Contrasting with traditional (stationary) MAB problems. The characterized minimax regret in the stationary stochastic setting is of order $\sqrt{T}$ when expected rewards can be arbitrarily close to each other, and of order $\log T$ when rewards are "well separated" (see [13] and [29]). Contrasting the minimax regret (of order $V_T^{1/3} T^{2/3}$) we have established in the stochastic non-stationary MAB problem with those established in stationary settings allows one to quantify the "price of non-stationarity," which mathematically captures the added complexity embedded in changing rewards versus stationary ones (as a function of the allowed variation). Clearly, additional complexity is introduced even when the allowed variation is fixed and independent of the horizon length.

Contrasting with other non-stationary MAB instances. The class of MAB problems with non-stationary rewards that is formulated in the current paper extends other MAB formulations that allow rewards to change in a more structured manner. For example, [32] consider a setting where rewards evolve according to a Brownian motion and regret is linear in $T$; our results (when $V_T$ is linear in $T$) are consistent with theirs. Two other representative studies are those of [27], that study a stochastic MAB problem in which expected rewards may change a finite number of times, and [28], that formulate an adversarial MAB problem in which the identity of the best arm may change a finite number of times. Both studies suggest policies that, utilizing the prior knowledge that the number of changes must be finite, achieve regret of order $\sqrt{T}$ relative to the best sequence of actions. However, the performance of these policies can deteriorate to regret that is linear in $T$ when the number of changes is allowed to depend on $T$. When there is a finite variation ($V_T$ is fixed and independent of $T$) but not necessarily a finite number of changes, we establish that the best achievable performance deteriorates to regret of order $T^{2/3}$. In that respect, it is not surprising that the "hard case" used to establish the lower bound in Theorem 1 describes a nature's strategy that allocates variation over a large (as a function of $T$) number of changes in the expected rewards.

Low variation rates. While our formulation focuses on "significant" variation in the mean rewards, our established bounds also hold for "smaller" variation scales; when $V_T$ decreases from $O(1)$ to $O(T^{-1/2})$ the minimax regret rate decreases from $T^{2/3}$ to $\sqrt{T}$. Indeed, when the variation scale is $O(T^{-1/2})$ or smaller, the rate of regret coincides with that of the classical stochastic MAB setting.

References

[1] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[2] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 55:527–535, 1952.
[3] M. Zelen. Play the winner rule and the controlled clinical trials. Journal of the American Statistical Association, 64:131–146, 1969.
[4] D. Bergemann and J. Valimaki. Learning and strategic pricing. Econometrica, 64:1125–1149, 1996.
[5] D. Bergemann and U. Hege. The financing of innovation: Learning and stopping. RAND Journal of Economics, 36(4):719–752, 2005.
[6] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), pages 45–53, 2004.
[7] R. D. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 594–605, 2003.
[8] F. Caro and G. Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53:276–292, 2007.
[9] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SIAM International Conference on Data Mining, 2007.
[10] D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments. Chapman and Hall, 1985.
[11] J. C. Gittins. Multi-Armed Bandit Allocation Indices. John Wiley and Sons, 1989.
[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, UK, 2006.
[13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[14] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. North-Holland, 1974.
[15] J. C. Gittins. Bandit processes and dynamic allocation indices (with discussion). Journal of the Royal Statistical Society, Series B, 41:148–177, 1979.
[16] P. Whittle. Arm acquiring bandits. The Annals of Probability, 9:284–292, 1981.
[17] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25A:287–298, 1988.
[18] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queueing network control. In Structure in Complexity Theory Conference, pages 318–322, 1994.
[19] D. Bertsimas and J. Nino-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80–90, 2000.
[20] S. Guha and K. Munagala. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 483–493, 2007.
[21] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov bandits. In Algorithmic Learning Theory, pages 214–228. Springer Berlin Heidelberg, 2012.
[22] M. G. Azar, A. Lazaric, and E. Brunskill. Stochastic optimization of a locally smooth function under correlated bandit feedback. arXiv preprint arXiv:1402.0562, 2014.
[23] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8, 1956.
[24] J. Hannan. Approximation to Bayes risk in repeated plays. In Contributions to the Theory of Games, Volume 3. Princeton University Press, 1957.
[25] D. P. Foster and R. V. Vohra. Regret in the on-line decision problem. Games and Economic Behaviour, 29:7–35, 1999.
[26] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Working paper, 2014.
[27] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, pages 174–188. Springer Berlin Heidelberg, 2011.
[28] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32:48–77, 2002.
[29] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–246, 2002.
[30] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[31] C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag. Multi-armed bandit, dynamic environments and meta-bandits. NIPS-2006 workshop, Online trading between exploration and exploitation, Whistler, Canada, 2006.
[32] A. Slivkins and E. Upfal. Adapting to a changing environment: The Brownian restless bandits. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 343–354, 2008.