Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards

Omar Besbes, Columbia University, New York, NY, ob2105@columbia.edu
Yonatan Gur, Stanford University, Stanford, CA, ygur@stanford.edu
Assaf Zeevi, Columbia University, New York, NY, assaf@gsb.columbia.edu

Abstract

In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of $K$ arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler's objective is to maximize his cumulative expected earnings over some given horizon of play $T$. To do this, the gambler needs to acquire information about arms (exploration) while simultaneously optimizing immediate rewards (exploitation); the price paid due to this trade-off is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length $T$. This problem has been studied extensively when the reward distributions do not change over time; an assumption that supports a sharp characterization of the regret, yet is often violated in practical settings. In this paper, we focus on a MAB formulation which allows for a broad range of temporal uncertainties in the rewards, while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward variation and the minimal achievable regret, and by establishing a connection between the adversarial and the stochastic MAB frameworks.

1 Introduction

Background and motivation. In the presence of uncertainty and partial feedback on rewards, an agent that faces a sequence of decisions needs to judiciously use information collected from past observations when trying to optimize future actions. A widely studied paradigm that captures this tension between the acquisition cost of new information (exploration) and the generation of instantaneous rewards based on the existing information (exploitation) is that of multi-armed bandits (MAB), originally proposed in the context of drug testing by [1], and placed in a general setting by [2]. The original setting has a gambler choosing among $K$ slot machines at each round of play, and upon that selection observing a reward realization. In this classical formulation
therewardsareassumedtobeindependentandidenticallydistributedaccordingtoanunknowndistributionthatcharacterizeseachmachine.Theobjectiveistomaximizetheexpectedsumof(possiblydis-counted)rewardsreceivedoveragiven(possiblyinnite)timehorizon.Sincetheirinception,MABproblemswithvariousmodicationshavebeenstudiedextensivelyinStatistics,Economics,Oper-ationsResearch,andComputerScience,andareusedtomodelaplethoraofdynamicoptimizationproblemsunderuncertainty;examplesincludeclinicaltrials([3]),strategicpricing([4]),investmentininnovation([5]),packetrouting([6]),on-lineauctions([7]),assortmentselection([8]),andon-1 lineadvertising([9]),tonamebutafew.Foroverviewsandfurtherreferencescf.themonographsby[10],[11]forBayesian/dynamicprogrammingformulations,and[12]thatcoversthemachinelearningliteratureandtheso-calledadversarialsetting.SincethesetofMABinstancesinwhichonecanidentifytheoptimalpolicyisextremelylimited,atypicalyardsticktomeasureperformanceofacandidatepolicyistocompareittoabenchmark:anoraclethatateachtimeinstantselectsthearmthatmaximizesexpectedreward.Thedifferencebetweentheperformanceofthepolicyandthatoftheoracleiscalledtheregret.WhenthegrowthoftheregretasafunctionofthehorizonTissub-linear,thepolicyislong-runaverageoptimal:itslongrunaverageperformanceconvergestothatoftheoracle.Hencetherstorderobjectiveistodeveloppolicieswiththischaracteristic.ThepreciserateofgrowthoftheregretasafunctionofTprovidesarenedmeasureofpolicyperformance.[13]istherstpaperthatprovidesasharpcharacterizationoftheregretgrowthrateinthecontextofthetraditional(stationaryrandomrewards)setting,oftenreferredtoasthestochasticMABproblem.Mostoftheliteraturehasfollowedthispathwiththeobjectiveofdesigningpoliciesthatexhibittheslowestpossiblerateofgrowthintheregret(oftenreferredtoasrateoptimalpolicies).Inmanyapplicationdomains,severalofwhichwerenotedabove,temporalchangesintherewarddistributionstructureareanintrinsiccharacteristicoftheproblem.Theseareignoredinthetradi-tionalstochasticMABformulation,buttherehavebeenseveral
attempts to extend that framework. The origin of this line of work can be traced back to [14], who considered a case where only the state of the chosen arm can change, giving rise to a rich line of work (see, e.g., [15] and [16]). In particular, [17] introduced the term restless bandits: a model in which the states (associated with reward distributions) of arms change in each step according to an arbitrary, yet known, stochastic process. Considered a hard class of problems (cf. [18]), this line of work has led to various approximations (see, e.g., [19]), relaxations (see, e.g., [20]), and considerations of more detailed processes (see, e.g., [21] for irreducible Markov processes, and [22] for a class of history-dependent rewards).

Departure from the stationarity assumption that has dominated much of the MAB literature raises fundamental questions as to how one should model temporal uncertainty in rewards, and how to benchmark performance of candidate policies. One view is to allow the reward realizations to be selected at any point in time by an adversary. These ideas have their origins in game theory with the work of [23] and [24], and have since seen significant development; [25] and [12] provide reviews of this line of research. Within this so-called adversarial formulation, the efficacy of a policy over a given time horizon $T$ is often measured relative to a benchmark defined by the single best action one could have taken in hindsight (after seeing all reward realizations). The single best action benchmark represents a static oracle, as it is constrained to a single (static) action. This static oracle can perform quite poorly relative to a dynamic oracle that follows the optimal dynamic sequence of actions, as the latter optimizes the (expected) reward at each time instant over all possible actions.¹ Thus, a potential limitation of the adversarial framework is that even if a policy has a small regret relative to a static oracle, there is no guarantee with regard to its performance relative to the dynamic oracle.

Main contributions. The main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward variation and the minimal achievable regret. More specifically, the paper's contributions are along four dimensions.

On the modeling side we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world phenomena, yet is mathematically tractable. The main constraint that we impose on the evolution of the mean rewards is that their variation over the relevant time horizon is bounded by a variation budget $V_T$; a concept that was recently introduced in [26] in the context of non-stationary stochastic approximation. This limits the power of nature compared to the adversarial setup discussed above, where rewards can be picked to maximally affect the policy's performance at each instance within $\{1, \ldots, T\}$. Nevertheless, this constraint allows for a rich class of temporal changes, extending most of the treatment in the non-stationary stochastic MAB literature, which mainly focuses on a finite number of changes in the mean rewards; see, e.g., [27] and references therein. We further discuss connections with studied non-stationary instances in §6.

The second dimension of contribution lies in the analysis domain. For a general class of non-stationary reward distributions we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved,

¹Under non-stationary rewards it is immediate that the single best action may be sub-optimal in many decision epochs, and the performance gap between the static and the dynamic oracles can grow linearly with $T$.
uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon $T$, the variation budget $V_T$, and the number of arms $K$. Our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear and is of order $(K V_T)^{1/3} T^{2/3}$. Our analysis complements studied non-stationary instances by treating a broad and flexible class of temporal changes in the reward distributions, yet still establishing optimality results and showing that sublinear regret is achievable. Our results provide a spectrum of orders of the minimax regret ranging between order $T^{2/3}$ (when $V_T$ is a constant independent of $T$) and order $T$ (when $V_T$ grows linearly with $T$), mapping allowed variation to best achievable performance.

With the analysis described above we shed light on the exploration-exploitation trade-off that characterizes the non-stationary reward setting, and the change in this trade-off compared to the stationary setting. In particular, our results highlight the tension that exists between the need to "remember" and to "forget." This is characteristic of several algorithms that have been developed in the adversarial MAB literature, e.g., the family of exponential weight methods such as EXP3, EXP3.S and the like; see, e.g., [28] and [12]. In a nutshell, the fewer past observations one retains, the larger the stochastic error associated with one's estimates of the mean rewards, while at the same time using more past observations increases the risk of these being biased.

One interesting observation drawn in this paper connects the adversarial MAB setting and the non-stationary environment studied here. In particular, as in [26], it is seen that an optimal policy in the adversarial setting may be suitably calibrated to perform near-optimally in the non-stationary stochastic setting. This will be further discussed after the main results are established.

2 Problem Formulation

Let $\mathcal{K} = \{1, \ldots, K\}$ be a set of arms. Let $\mathcal{T} = \{1, 2, \ldots, T\}$ denote a sequence of decision epochs faced by a decision-maker. At any epoch $t \in \mathcal{T}$, the decision-maker pulls one of the $K$ arms. When pulling arm $k \in \mathcal{K}$ at epoch $t \in \mathcal{T}$, a reward $X_t^k \in [0,1]$ is obtained, where $X_t^k$ is a random variable with expectation $\mu_t^k = \mathbb{E}\big[X_t^k\big]$. We denote the best possible expected reward at decision epoch $t$ by $\mu_t^*$, i.e., $\mu_t^* = \max_{k \in \mathcal{K}} \mu_t^k$.

Changes in the expected rewards of the arms. We assume the expected reward of each arm $\mu_t^k$ may change at any decision epoch. We denote by $\mu^k$ the sequence of expected rewards of arm $k$: $\mu^k = \{\mu_t^k\}_{t=1}^{T}$. In addition, we denote by $\mu$ the sequence of vectors of all $K$ expected rewards: $\mu = \{\mu^k\}_{k=1}^{K}$. We assume that the expected reward of each arm can change an arbitrary number of times, but bound the total variation of the expected rewards:

$$\sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right|. \quad (1)$$

Let $\{V_t : t = 1, 2, \ldots\}$ be a non-decreasing sequence of positive real numbers such that $V_1 = 0$, $K V_t \le t$ for all $t$, and for normalization purposes set $V_2 = 2 K^{-1}$. We refer to $V_T$ as the variation budget over $\mathcal{T}$. We define the corresponding temporal uncertainty set as the set of reward vector sequences that are subject to the variation budget $V_T$ over the set of decision epochs $\{1, \ldots, T\}$:

$$\mathcal{V} = \left\{ \mu \in [0,1]^{K \times T} : \sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right| \le V_T \right\}.$$

The variation budget captures the constraint imposed on the non-stationary environment faced by the decision-maker. While limiting the possible evolution in the environment, it allows for numerous forms in which the expected rewards may change: continuously, in discrete shocks, and at a changing rate (Figure 1 depicts two different variation patterns that correspond to the same variation budget). In general, the variation budget $V_T$ is designed to depend on the number of pulls $T$.

Admissible policies, performance, and regret. Let $U$ be a random variable defined over a probability space $(\mathbb{U}, \mathcal{U}, \mathbf{P}_u)$. Let $\pi_1 : \mathbb{U} \to \mathcal{K}$ and $\pi_t : [0,1]^{t-1} \times \mathbb{U} \to \mathcal{K}$ for $t = 2, 3, \ldots$ be measurable functions. With some abuse of notation we denote by $\pi_t \in \mathcal{K}$ the action at time $t$, that is given by

$$\pi_t = \begin{cases} \pi_1(U), & t = 1, \\ \pi_t\big(X_{t-1}^{\pi}, \ldots, X_1^{\pi}, U\big), & t = 2, 3, \ldots \end{cases}$$

Figure 1: Two instances of variation in the mean rewards: (Left) A fixed variation budget (that equals 3) is spent over the whole horizon. (Right) The same budget is spent in the first third of the horizon.

The mappings $\{\pi_t : t = 1, \ldots, T\}$ together with the distribution $\mathbf{P}_u$ define the class of admissible policies. We denote this class by $\mathcal{P}$. We further denote by $\{\mathcal{H}_t, \; t = 1, \ldots, T\}$ the filtration associated with a policy $\pi \in \mathcal{P}$, such that $\mathcal{H}_1 = \sigma(U)$ and $\mathcal{H}_t = \sigma\big(\{X_j^{\pi}\}_{j=1}^{t-1}, U\big)$ for all $t \in \{2, 3, \ldots\}$. Note that policies in $\mathcal{P}$ are non-anticipating, i.e., depend only on the past history of actions and observations, and allow for randomized strategies via their dependence on $U$.

We define the regret under policy $\pi \in \mathcal{P}$ compared to a dynamic oracle as the worst-case difference between the expected performance of pulling at each epoch $t$ the arm which has the highest expected reward at epoch $t$ (the dynamic oracle performance) and the expected performance under policy $\pi$:

$$\mathcal{R}^{\pi}(\mathcal{V}, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^* - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\},$$

where the expectation $\mathbb{E}[\cdot]$ is taken with respect to the noisy rewards, as well as to the policy's actions. In addition, we denote by $\mathcal{R}^*(\mathcal{V}, T)$ the minimal worst-case regret that can be guaranteed by an admissible policy $\pi \in \mathcal{P}$, that is, $\mathcal{R}^*(\mathcal{V}, T) = \inf_{\pi \in \mathcal{P}} \mathcal{R}^{\pi}(\mathcal{V}, T)$. Then, $\mathcal{R}^*(\mathcal{V}, T)$ is the best achievable performance. In the following sections we study the magnitude of $\mathcal{R}^*(\mathcal{V}, T)$. We analyze the magnitude of this quantity by establishing upper and lower bounds; in these bounds we refer to a constant $C$ as absolute if it is independent of $K$, $V_T$, and $T$.

3 Lower bound on the best achievable performance

We next provide a lower bound on the best achievable performance.

Theorem 1 Assume that rewards have a Bernoulli distribution. Then, there is some absolute constant $C > 0$ such that for any policy $\pi \in \mathcal{P}$ and for any $T \ge 1$, $K \ge 2$ and $V_T \in \big[K^{-1}, K^{-1}T\big]$,

$$\mathcal{R}^{\pi}(\mathcal{V}, T) \ge C (K V_T)^{1/3} T^{2/3}.$$

We note that when reward distributions are stationary, there are known policies such as UCB1 ([29]) that achieve regret of order $\sqrt{T}$ in the stochastic setup. When the reward structure is non-stationary and defined by the class $\mathcal{V}$, no policy may achieve such a performance, and the best performance must incur a regret of at least order $T^{2/3}$. This additional complexity embedded in the non-stationary stochastic MAB problem compared to the stationary one will be further discussed in §6.

We note that Theorem 1 also holds when $V_T$ is increasing with $T$. In particular, when the variation budget is linear in $T$, the regret grows linearly and long-run average optimality is not achievable. The driver of the change in the best achievable performance relative to the one established in a stationary environment is a second trade-off (on top of the tension between exploring different arms and capitalizing on the information already collected) introduced by the non-stationary environment, between "remembering" and "forgetting": estimating the expected rewards is done based on past observations of rewards. While keeping track of more observations may decrease the variance of mean reward estimates, the non-stationary environment implies that old information is potentially less relevant due to possible changes in the underlying rewards. The changing rewards give incentive to dismiss old information, which in turn encourages enhanced exploration. The proof of Theorem 1 emphasizes the impact of these trade-offs on the achievable performance.
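The "remembering versus forgetting" tension described above can be made concrete with a small numerical sketch. The following is our own illustration, not part of the paper's analysis; the linear drift pattern, the window sizes, and the `window_estimate` helper are arbitrary assumptions. With a drifting mean, a trailing average over a long window accumulates bias of the order of the variation spent during that window, even though a longer window would reduce the stochastic error of noisy observations:

```python
def window_estimate(rewards, n):
    """Average of the last n observed rewards (a 'memory' of size n)."""
    recent = rewards[-n:]
    return sum(recent) / len(recent)

# Noiseless toy sequence: the mean reward drifts upward by 0.005 per epoch,
# so a trailing window of n epochs can be biased by up to ~0.005 * n.
means = [0.2 + 0.005 * t for t in range(100)]

short = window_estimate(means, 5)    # remembers little: small bias under drift
long_ = window_estimate(means, 80)   # remembers a lot: heavily biased here
true_now = means[-1]                 # current mean, ~0.695
```

In this noiseless sequence the short window is off by about 0.01 while the long window is off by roughly 0.2; with noisy rewards the short window would instead pay a stochastic error of order $1/\sqrt{n}$, which is the trade-off discussed above.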
Key ideas in the proof. At a high level the proof of Theorem 1 builds on ideas of identifying a worst-case strategy of nature (e.g., [28], proof of Theorem 5.1), adapting them to our setting. While the proof is deferred to the online companion (as supporting material), we next describe the key ideas when $V_T = 1$.² We define a subset of vector sequences $\mathcal{V}' \subset \mathcal{V}$ and show that when $\mu$ is drawn randomly from $\mathcal{V}'$, any admissible policy must incur regret of order $(K V_T)^{1/3} T^{2/3}$. We define a partition of the decision horizon $\mathcal{T}$ into batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\tilde{T}$ each (except, possibly, the last batch):

$$\mathcal{T}_j = \left\{ t : (j-1)\tilde{T} + 1 \le t \le \min\big\{ j\tilde{T}, T \big\} \right\}, \quad \text{for all } j = 1, \ldots, m, \quad (2)$$

where $m = \lceil T / \tilde{T} \rceil$ is the number of batches. In $\mathcal{V}'$, in every batch there is exactly one "good" arm with expected reward $1/2 + \varepsilon$ for some $0 < \varepsilon \le 1/4$, and all the other arms have expected reward $1/2$. The good arm is drawn independently at the beginning of each batch according to a discrete uniform distribution over $\{1, \ldots, K\}$. Thus, the identity of the good arm can change only between batches. By selecting $\varepsilon$ such that $\varepsilon T / \tilde{T} \le V_T$, any $\mu \in \mathcal{V}'$ is composed of expected reward sequences with a variation of at most $V_T$, and therefore $\mathcal{V}' \subset \mathcal{V}$. Given the draws under which expected reward sequences are generated, nature prevents any accumulation of information from one batch to another, since at the beginning of each batch a new good arm is drawn independently of the history.

The proof of Theorem 1 establishes that when $\varepsilon \approx 1/\sqrt{\tilde{T}}$ no admissible policy can identify the good arm with high probability within a batch. Since there are $\tilde{T}$ epochs in each batch, the regret that any policy must incur along a batch is of order $\tilde{T} \cdot \varepsilon \approx \sqrt{\tilde{T}}$, which yields a regret of order $\sqrt{\tilde{T}} \cdot T / \tilde{T} \approx T / \sqrt{\tilde{T}}$ throughout the whole horizon. Selecting the smallest feasible $\tilde{T}$ such that the variation budget constraint is satisfied leads to $\tilde{T} \approx T^{2/3}$, yielding a regret of order $T^{2/3}$ throughout the horizon.

4 A near-optimal policy

We apply the ideas underlying the lower bound in Theorem 1 to develop a rate optimal policy for the non-stationary stochastic MAB problem with a variation budget. Consider the following policy:

Rexp3. Inputs: a positive number $\gamma$, and a batch size $\Delta_T$.
1. Set batch index $j = 1$.
2. Repeat while $j \le \lceil T / \Delta_T \rceil$:
   (a) Set $\tau = (j-1) \Delta_T$.
   (b) Initialization: for any $k \in \mathcal{K}$ set $w_t^k = 1$.
   (c) Repeat for $t = \tau + 1, \ldots, \min\{T, \tau + \Delta_T\}$:
       - For each $k \in \mathcal{K}$, set $p_t^k = (1-\gamma) \dfrac{w_t^k}{\sum_{k'=1}^{K} w_t^{k'}} + \dfrac{\gamma}{K}$.
       - Draw an arm $k'$ from $\mathcal{K}$ according to the distribution $\{p_t^k\}_{k=1}^{K}$.
       - Receive a reward $X_t^{k'}$.
       - For $k'$ set $\hat{X}_t^{k'} = X_t^{k'} / p_t^{k'}$, and for any $k \ne k'$ set $\hat{X}_t^k = 0$. For all $k \in \mathcal{K}$ update: $w_{t+1}^k = w_t^k \exp\left\{ \gamma \hat{X}_t^k / K \right\}$.
   (d) Set $j = j + 1$, and return to the beginning of step 2.

Clearly $\pi \in \mathcal{P}$. The Rexp3 policy uses Exp3, a policy introduced by [30] for solving a worst-case sequential allocation problem, as a subroutine, restarting it every $\Delta_T$ epochs.

²For the sake of simplicity, the discussion in this paragraph assumes a variation budget that is fixed and independent of $T$; the proof of Theorem 3 details a general treatment for a budget that depends on $T$.

Theorem 2 Let $\pi$ be the Rexp3 policy with a batch size $\Delta_T = \left\lceil (K \log K)^{1/3} (T/V_T)^{2/3} \right\rceil$ and with $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$. Then, there is some absolute constant $C$ such that for every $T \ge 1$, $K \ge 2$, and $V_T \in \big[K^{-1}, K^{-1}T\big]$:

$$\mathcal{R}^{\pi}(\mathcal{V}, T) \le C (K \log K \cdot V_T)^{1/3} T^{2/3}.$$

Theorem 2 is obtained by establishing a connection between the regret relative to the single best action in the adversarial setting, and the regret with respect to the dynamic oracle in the non-stationary stochastic setting with a variation budget. Several classes of policies, such as exponential-weight (including Exp3) and polynomial-weight policies, have been shown to achieve regret of order $\sqrt{T}$ with respect to the single best action in the adversarial setting (see chapter 6 of [12] for a review). While in general these policies tend to perform well numerically, there is no guarantee for their performance relative to the dynamic oracle studied in this paper, since the single best action itself may incur linear regret relative to the dynamic oracle; see also [31] for a study of the empirical performance of one class of algorithms. The proof of Theorem 2 shows that any policy that achieves regret of order $\sqrt{T}$ with respect to the single best action in the adversarial setting can be used as a subroutine to obtain near-optimal performance with respect to the dynamic oracle in our setting.

Rexp3 emphasizes the two trade-offs discussed in the previous section. The first trade-off, information acquisition versus capitalizing on existing information, is captured by the subroutine policy Exp3. In fact, any policy that achieves a good performance compared to a single best action benchmark in the adversarial setting must balance exploration and exploitation. The second trade-off, "remembering" versus "forgetting," is captured by restarting Exp3 and forgetting any acquired information every $\Delta_T$ pulls. Thus, old information that may slow down the adaptation to the changing environment is being discarded.

Theorem 1 and Theorem 2 together characterize the minimax regret (up to a multiplicative factor, logarithmic in the number of arms) in a full spectrum of variations $V_T$:

$$\mathcal{R}^*(\mathcal{V}, T) \asymp (K V_T)^{1/3} T^{2/3}.$$

Hence, we have quantified the impact of the extent of change in the environment on the best achievable performance in this broad class of problems. For example, for the case in which $V_T = C T^{\alpha}$ for some absolute constant $C$ and $0 \le \alpha \le 1$, the best achievable regret is of order $T^{(2+\alpha)/3}$.

We finally note that restarting is only one way of adapting policies from the adversarial MAB setting to achieve near optimality in the non-stationary stochastic setting; a way that articulates well the principles leading to near optimality. In the online companion we demonstrate that near optimality can be achieved by other adaptation methods, showing that the Exp3.S policy (given in [28]) can be tuned by $\alpha = 1/T$ and $\gamma \approx (K V_T / T)^{1/3}$ to achieve near optimality in our setting, without restarting.

5 Proof of Theorem 2

The structure of the proof is as follows. First, we break the horizon into a sequence of batches of size $\Delta_T$ each, and analyze the performance gap between the single best action and the dynamic oracle in each batch. Then, we plug in a known performance guarantee for Exp3 relative to the single best action, and sum over batches to establish the regret of Rexp3 relative to the dynamic oracle.

Step 1 (Preliminaries). Fix $T \ge 1$, $K \ge 2$, and $V_T \in \big[K^{-1}, K^{-1}T\big]$. Let $\pi$ be the Rexp3 policy, tuned by $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$ and $\Delta_T \in \{1, \ldots, T\}$ (to be specified later on). We break the horizon $\mathcal{T}$ into a sequence of batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\Delta_T$ each (except, possibly, $\mathcal{T}_m$) according to (2). Let $\mu \in \mathcal{V}$, and fix $j \in \{1, \ldots, m\}$. We decompose the regret in batch $j$:

$$\mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \big(\mu_t^* - \mu_t^{\pi_t}\big) \right] = \underbrace{\sum_{t \in \mathcal{T}_j} \mu_t^* - \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right]}_{J_{1,j}} + \underbrace{\mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] - \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \mu_t^{\pi_t} \right]}_{J_{2,j}}. \quad (3)$$

The first component, $J_{1,j}$, is the expected loss associated with using a single action over batch $j$. The second component, $J_{2,j}$, is the expected regret relative to the best static action in batch $j$.

Step 2 (Analysis of $J_{1,j}$ and $J_{2,j}$). Defining $\mu_{T+1}^k = \mu_T^k$ for all $k \in \mathcal{K}$, we denote the variation in expected rewards along batch $\mathcal{T}_j$ by $V_j = \sum_{t \in \mathcal{T}_j} \max_{k \in \mathcal{K}} \big|\mu_{t+1}^k - \mu_t^k\big|$. We note that:

$$\sum_{j=1}^{m} V_j = \sum_{j=1}^{m} \sum_{t \in \mathcal{T}_j} \max_{k \in \mathcal{K}} \big|\mu_{t+1}^k - \mu_t^k\big| \le V_T. \quad (4)$$

Let $k_0$ be an arm with best expected performance over $\mathcal{T}_j$: $k_0 \in \arg\max_{k \in \mathcal{K}} \big\{ \sum_{t \in \mathcal{T}_j} \mu_t^k \big\}$. Then,

$$\max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} \mu_t^k \right\} = \sum_{t \in \mathcal{T}_j} \mu_t^{k_0} = \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} X_t^{k_0} \right] \le \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right], \quad (5)$$

and therefore, one has:

$$J_{1,j} = \sum_{t \in \mathcal{T}_j} \mu_t^* - \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] \overset{(a)}{\le} \sum_{t \in \mathcal{T}_j} \big(\mu_t^* - \mu_t^{k_0}\big) \le \Delta_T \max_{t \in \mathcal{T}_j} \big\{\mu_t^* - \mu_t^{k_0}\big\} \overset{(b)}{\le} 2 V_j \Delta_T, \quad (6)$$

for any $\mu \in \mathcal{V}$ and $j \in \{1, \ldots, m\}$, where (a) holds by (5) and (b) holds by the following argument: otherwise there is an epoch $t_0 \in \mathcal{T}_j$ for which $\mu_{t_0}^* - \mu_{t_0}^{k_0} > 2 V_j$. Indeed, let $k_1 = \arg\max_{k \in \mathcal{K}} \mu_{t_0}^k$. In such a case, for all $t \in \mathcal{T}_j$ one has $\mu_t^{k_1} \ge \mu_{t_0}^{k_1} - V_j > \mu_{t_0}^{k_0} + V_j \ge \mu_t^{k_0}$, since $V_j$ is the maximal variation in batch $\mathcal{T}_j$. This, however, contradicts the optimality of $k_0$, and thus (6) holds.

In addition, Corollary 3.2 in [28] points out that the regret incurred by Exp3 (tuned by $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$) along $\Delta_T$ epochs, relative to the single best action, is bounded by $2\sqrt{e-1}\sqrt{\Delta_T K \log K}$. Therefore, for each $j \in \{1, \ldots, m\}$ one has

$$J_{2,j} = \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] - \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \mu_t^{\pi_t} \right] \overset{(a)}{\le} 2\sqrt{e-1} \sqrt{\Delta_T K \log K}, \quad (7)$$

for any $\mu \in \mathcal{V}$, where (a) holds since within each batch arms are pulled according to Exp3($\gamma$).

Step 3 (Regret throughout the horizon). Summing over $m = \lceil T / \Delta_T \rceil$ batches we have:

$$\mathcal{R}^{\pi}(\mathcal{V}, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^* - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\} \overset{(a)}{\le} \sum_{j=1}^{m} \left\{ 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 V_j \Delta_T \right\} \overset{(b)}{\le} \left( \frac{T}{\Delta_T} + 1 \right) \cdot 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 \Delta_T V_T = \frac{2\sqrt{e-1}\sqrt{K \log K} \cdot T}{\sqrt{\Delta_T}} + 2\sqrt{e-1}\sqrt{\Delta_T K \log K} + 2 \Delta_T V_T, \quad (8)$$

where (a) holds by (3), (6), and (7); and (b) follows from (4). Finally, selecting $\Delta_T = \left\lceil (K \log K)^{1/3} (T/V_T)^{2/3} \right\rceil$, we establish:

$$\mathcal{R}^{\pi}(\mathcal{V}, T) \le 2\sqrt{e-1}\, (K \log K \cdot V_T)^{1/3} T^{2/3} + 2\sqrt{e-1} \sqrt{\left( (K \log K)^{1/3}(T/V_T)^{2/3} + 1 \right) K \log K} + 2\left( (K \log K)^{1/3}(T/V_T)^{2/3} + 1 \right) V_T \overset{(a)}{\le} \left( 2 + 2\sqrt{2}\sqrt{e-1} + 4 \right) (K \log K \cdot V_T)^{1/3} T^{2/3},$$

where (a) follows from $T \ge K \ge 2$ and $V_T \in \big[K^{-1}, K^{-1}T\big]$. This concludes the proof.

6 Discussion

Unknown variation budget. The Rexp3 policy relies on prior knowledge of $V_T$, but predictions of $V_T$ may be inaccurate (such an estimate can be maintained from historical data if actions are occasionally randomized, for example, by fitting $V_T = T^{\alpha}$). Denoting the true variation budget by $V_T$ and the estimate that is used by the agent when tuning Rexp3 by $\hat{V}_T$, one may observe that the analysis in the proof of Theorem 2 holds until equation (8), but then $\Delta_T$ will be tuned using $\hat{V}_T$. This implies that when $V_T$ and $\hat{V}_T$ are "close," Rexp3 still guarantees long-run average optimality. For example, suppose that Rexp3 is tuned by $\hat{V}_T = T^{\alpha}$, but the variation is $V_T = T^{\alpha + \delta}$. Then sublinear regret (of order $T^{2/3 + \alpha/3 + \delta}$) is guaranteed as long as $\delta < (1-\alpha)/3$; e.g., if $\alpha = 0$ and $\delta = 1/4$, Rexp3 guarantees regret of order $T^{11/12}$ (accurate tuning would have guaranteed order $T^{3/4}$). Since there are no restrictions on the rate at which the variation budget can be spent, an interesting and potentially challenging open problem is to delineate to what extent it is possible to design adaptive policies that do not use prior knowledge of $V_T$, yet guarantee good performance.

Contrasting with traditional (stationary) MAB problems. The characterized minimax regret in the stationary stochastic setting is of order $\sqrt{T}$ when expected rewards can be arbitrarily close to each other, and of order $\log T$ when rewards are "well separated" (see [13] and [29]). Contrasting the minimax regret (of order $V_T^{1/3} T^{2/3}$) we have established in the stochastic non-stationary MAB problem with those established in stationary settings allows one to quantify the "price of non-stationarity," which mathematically captures the added complexity embedded in changing rewards versus stationary ones (as a function of the allowed variation). Clearly, additional complexity is introduced even when the allowed variation is fixed and independent of the horizon length.

Contrasting with other non-stationary MAB instances. The class of MAB problems with non-stationary rewards that is formulated in the current paper extends other MAB formulations that allow rewards to change in a more structured manner. For example, [32] consider a setting where rewards evolve according to a Brownian motion and regret is linear in $T$; our results (when $V_T$ is linear in $T$) are consistent with theirs. Two other representative studies are those of [27], who study a stochastic MAB problem in which expected rewards may change a finite number of times, and [28], who formulate an adversarial MAB problem in which the identity of the best arm may change a finite number of times. Both studies suggest policies that, utilizing the prior knowledge that the number of changes must be finite, achieve regret of order $\sqrt{T}$ relative to the best sequence of actions. However, the performance of these policies can deteriorate to regret that is linear in $T$ when the number of changes is allowed to depend on $T$. When there is a finite variation ($V_T$ is fixed and independent of $T$) but not necessarily a finite number of changes, we establish that the best achievable performance deteriorates to regret of order $T^{2/3}$. In that respect, it is not surprising that the "hard case" used to establish the lower bound in Theorem 1 describes a strategy of nature that allocates variation over a large (as a function of $T$) number of changes in the expected rewards.

Low variation rates. While our formulation focuses on significant variation in the mean rewards, our established bounds also hold for smaller variation scales; when $V_T$ decreases from $O(1)$ to $O(T^{-1/2})$, the minimax regret rate decreases from $T^{2/3}$ to $\sqrt{T}$. Indeed, when the variation scale is $O(T^{-1/2})$ or smaller, the rate of regret coincides with that of the classical stochastic MAB setting.

References

[1] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285-294, 1933.
[2] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 55:527-535, 1952.
[3] M. Zelen. Play the winner rule and the controlled clinical trials. Journal of the American Statistical Association, 64:131-146, 1969.
[4] D. Bergemann and J. Valimaki. Learning and strategic pricing. Econometrica, 64:1125-1149, 1996.
[5] D. Bergemann and U. Hege. The financing of innovation: Learning and stopping. RAND Journal of Economics, 36(4):719-752, 2005.
[6] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), pages 45-53, 2004.
[7] R. D. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 594-605, 2003.
[8] F. Caro and J. Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53:276-292, 2007.
[9] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In SIAM International Conference on Data Mining, 2007.
[10] D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments. Chapman and Hall, 1985.
[11] J. C. Gittins. Multi-Armed Bandit Allocation Indices. John Wiley and Sons, 1989.
[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, UK, 2006.
[13] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
[14] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. North-Holland, 1974.
[15] J. C. Gittins. Bandit processes and dynamic allocation indices (with discussion). Journal of the Royal Statistical Society, Series B, 41:148-177, 1979.
[16] P. Whittle. Arm-acquiring bandits. The Annals of Probability, 9:284-292, 1981.
[17] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25A:287-298, 1988.
[18] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queueing network control. In Structure in Complexity Theory Conference, pages 318-322, 1994.
[19] D. Bertsimas and J. Nino-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80-90, 2000.
[20] S. Guha and K. Munagala. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 483-493, 2007.
[21] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov bandits. In Algorithmic Learning Theory, pages 214-228. Springer Berlin Heidelberg, 2012.
[22] M. G. Azar, A. Lazaric, and E. Brunskill. Stochastic optimization of a locally smooth function under correlated bandit feedback. arXiv preprint arXiv:1402.0562, 2014.
[23] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1-8, 1956.
[24] J. Hannan. Approximation to Bayes risk in repeated plays. In Contributions to the Theory of Games, Volume 3. Princeton University Press, 1957.
[25] D. P. Foster and R. V. Vohra. Regret in the on-line decision problem. Games and Economic Behaviour, 29:7-35, 1999.
[26] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Working paper, 2014.
[27] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, pages 174-188. Springer Berlin Heidelberg, 2011.
[28] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32:48-77, 2002.
[29] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-246, 2002.
[30] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci., 55:119-139, 1997.
[31] C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag. Multi-armed bandit, dynamic environments and meta-bandits. NIPS 2006 workshop on Online Trading between Exploration and Exploitation, Whistler, Canada, 2006.
[32] A. Slivkins and E. Upfal. Adapting to a changing environment: The Brownian restless bandits. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 343-354, 2008.
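For concreteness, the Rexp3 policy of §4 can be sketched in a few lines of code. The following is our own minimal simulation-oriented sketch, not the authors' implementation: `draw_reward(k, t)` is an assumed, caller-supplied interface returning a realized reward in $[0,1]$ for pulling arm `k` at epoch `t`, while the tuning of $\Delta_T$ and $\gamma$ follows Theorem 2.

```python
import math
import random

def rexp3(T, K, V_T, draw_reward):
    """Run Rexp3 for T epochs over K arms with variation budget V_T.

    Restarts an Exp3 subroutine every Delta_T epochs ("forgetting"),
    with Delta_T and gamma tuned as in Theorem 2. Returns the total
    realized reward.
    """
    delta_T = math.ceil((K * math.log(K)) ** (1 / 3) * (T / V_T) ** (2 / 3))
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * delta_T)))
    total_reward = 0.0
    t = 1
    while t <= T:
        w = [1.0] * K                      # restart Exp3: discard old weights
        for _ in range(delta_T):
            if t > T:
                break
            s = sum(w)
            # Mixture of normalized weights and uniform exploration.
            p = [(1 - gamma) * w[k] / s + gamma / K for k in range(K)]
            k0 = random.choices(range(K), weights=p)[0]
            x = draw_reward(k0, t)         # observe reward of the pulled arm
            total_reward += x
            x_hat = x / p[k0]              # importance-weighted estimate
            w[k0] *= math.exp(gamma * x_hat / K)
            t += 1
    return total_reward
```

For instance, `rexp3(10_000, 3, 5.0, env)` would run the policy against an environment `env` whose mean rewards vary by at most 5.0 over the horizon; the periodic restart is what distinguishes this from plain Exp3.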