where the last inequality follows from the relation between the deactivation probabilities q_i and the gap estimates. This proves the claim. Thus, using the fact that the consistency tests are not satisfied at time t, together with the concentration inequalities of Section B.1 and the fact that the deactivation test is not satisfied for active arms at time t, and after straightforward computations, we obtain the adversarial regret bound stated in Theorem 15.

Moreover, the remaining consistency test is also never satisfied: the two estimates of μ_i always remain within the prescribed confidence terms of each other. In conclusion, we proved that Exp3.P is never started in the stochastic model, that is, τ = n. Thus, using the bounds above, we obtain the stochastic regret bound of Theorem 15.

Now remark that, for any arm i, the deactivation test implies a lower bound on the number of plays T_i(τ_i) in terms of log(n) and the gap Δ_i; indeed, if the number of plays were smaller, the confidence term would be too large for the test to be satisfied, a contradiction. The proof is concluded with straightforward computations and by showing that the relevant sum over arms is at most 1 + log K: denote by T_(1) ≤ ... ≤ T_(K) the ordered values of T_1, ..., T_K; then each T_(k) is at least its rank-dependent lower bound, which proves the claim.

B.3. Analysis in the adversarial model

First we show that the estimated cumulative rewards of active arms stay close to one another. Let t ≤ τ; then, by the concentration inequalities and since the deactivation test is not satisfied at time t, the difference between the estimated cumulative rewards of any two active arms is bounded by the corresponding confidence terms.

The inequalities established in Section B.1 can be summarized as follows, for any time t ∈ {1, ..., n} and any arm i: in the stochastic model, |H̄_{i,t} − μ_i| is at most its confidence term; in the adversarial model, |G̃_{i,t} − G_{i,t}|/t is at most its confidence term; in the stochastic model, the analogous bound relating G̃_{i,t}/t to μ_i holds; in both models, the bound of Lemma 19 relating G̃_{i,t}/t to H̄_{i,t} holds; and in the adversarial model, the Exp3.P guarantee of Lemma 20 holds. We will now make a deterministic reasoning on the event that the above inequalities are indeed true.

B.2. Analysis in the stochastic model

First note that, by the concentration inequalities, one of the consistency tests is never satisfied. Remark also that the deactivation test is never satisfied for an optimal arm, since otherwise its estimated suboptimality would exceed what the concentration bounds allow; thus an optimal arm always remains active. Moreover, if an arm i is deactivated at time τ_i, then the deactivation test was satisfied at time τ_i and not satisfied at time τ_i − 1. Thus, using the concentration inequalities, the estimated suboptimality of a deactivated arm can subsequently change only by a bounded amount, so the remaining consistency tests are never satisfied either.

To conclude the proof of Lemma 19: the relevant sequence is a martingale difference sequence with bounded increments, and since the bound on the increments is increasing in t, the total conditional variance is at most of order 1 + log t. Thus, using Lemma 16, we obtain that with probability at least 1 − δ the martingale is bounded by the stated combination of a square-root term and a log(1/δ) term, which implies the claimed inequality.

The next lemma restates the regret guarantee for Exp3.P in terms of our setting. Instead of using the original guarantee from Auer et al. (2002b), we take an improved bound from Bubeck (2010) (namely, Theorem 2.4 therein).

Lemma 20 In the adversarial model, with probability at least 1 − δ, the regret incurred by Exp3.P on the rounds on which it is run is of order √(Kn log(K/δ)).

Putting together the results of Lemmas 17-20, with δ polynomially small in Kn, we obtain that with high probability the stated inequalities hold true simultaneously for every arm i ∈ {1, ..., K} and every time t.

Proof of Lemma 17 (continued). For s ∈ {1, ..., n}, consider the martingale difference sequence obtained from the importance-weighted reward estimates, written in terms of the independent Bernoulli variables Z_{i,s}(p) introduced below. Its increments are bounded and its conditional variances are controlled by the sampling probabilities, so, using Lemma 16, we obtain that with probability at least 1 − δ the deviation is bounded by the stated confidence term. Then, using a union bound over the possible values of T_i(t), we obtain the claimed inequality by taking δ appropriately (with another union bound to get the two-sided inequality).
Next, we analyze the (average) cumulative reward collected by the algorithm. Again, in the stochastic model H̄_{i,t} can be used as an estimate of the true expected reward, and it is not hard to see that it is a reasonably sharp estimate.

Lemma 18 For any arm i ∈ {1, ..., K}, in the stochastic model we have, with probability at least 1 − δ, for any time t ∈ {1, ..., n}: if T_i(t) ≥ 1, then |H̄_{i,t} − μ_i| is at most of order √(log(n/δ)/T_i(t)).

Proof This follows via a union bound over the value of T_i(t) and a standard Hoeffding inequality for independent random variables.

Next we show that, essentially, G̃_{i,t}/t is close to H̄_{i,t}.

Lemma 19 For any i ∈ {1, ..., K} and t ∈ {1, ..., n}, with probability at least 1 − δ, the difference between G̃_{i,t}/t and H̄_{i,t} is at most a confidence term involving (1 + log K), √((1 + log K) log(1/δ)) and log(1/δ), scaled by the appropriate sampling quantities.

Proof Using the notation of the proof of Lemma 17, we write the difference as a sum of martingale differences and apply Lemma 16.

Let us discuss some notation. Recall that we denote by p_{i,t} the probability that the algorithm selects arm i at time t; this probability is denoted by p_i in the description of the algorithm. As in Algorithm 1, q_i will denote the probability of arm i at the moment when this arm was deactivated. Let A_t denote the set of active arms at the end of time step t. We also introduce τ as the last time step before we start Exp3.P, with the convention that τ = n if we never start Exp3.P. Moreover, note that with this notation, if t ≤ τ then the play at time t is governed by the main loop of Algorithm 1. We generalize this notation to arms that are never deactivated in the natural way. For sake of notation, in the following min(τ_i, τ) denotes the minimum between the time when arm i is deactivated and the last time before we start Exp3.P.

B.1. Concentration inequalities

First we derive a version of Bernstein's inequality for martingales that suits our needs.

Lemma 16 Let F = (F_t) be a filtration, and X_1, ..., X_n real random variables such that X_t is F_t-measurable, E[X_t | F_{t-1}] = 0 and |X_t| ≤ b for some b > 0. Let V_n = Σ_{t≤n} E[X_t² | F_{t-1}]. Then, with probability at least 1 − δ, the sum Σ_{t≤n} X_t is at most of order √(V_n log(1/δ)) + b log(1/δ).

Proof The proof follows from Theorem 14 along with a union bound over a grid of values for the variance term. It also uses √a + √b ≤ √(2(a + b)).

Now let us use this martingale inequality to derive the concentration bound for the (average) estimated cumulative rewards. Recall that G̃_{i,t} is an estimator of G_{i,t}, so we want to upper-bound the difference between them; and in the stochastic model H̄_{i,t} is an estimator of the true expected reward μ_i, so we want to upper-bound that difference as well.

Lemma 17 For any arm i ∈ {1, ..., K} and any time t ∈ {1, ..., n}: in the stochastic model, with probability at least 1 − δ, |G̃_{i,t}/t − μ_i| is at most its confidence term; moreover, in the adversarial model, with probability at least 1 − δ, |G̃_{i,t} − G_{i,t}|/t is at most the analogous confidence term.

Proof The proof of the two concentration inequalities is similar, so we restrict our attention to the adversarial model. Let F be the filtration associated to the history of the strategy. We introduce the following sequence of independent random variables: for p ∈ (0, 1], let Z_{i,s}(p) ~ Bernoulli(p). Then the importance-weighted reward estimate of arm i at step s can be rewritten in terms of Z_{i,s}(p_{i,s}), which yields a martingale difference sequence to which Lemma 16 applies.

Algorithm 1 The SAO strategy with parameter β > 0

  A ← {1, ..., K}  (the set of active arms)
  for i = 1, ..., K: τ_i ← n (the time when arm i is deactivated); p_{i,1} ← 1/K (the probability of selecting arm i). end for
  for t = 1, ..., n  (main loop)
    Draw I_t at random from p_t.  (selection of the arm to play)
    for i = 1, ..., K  (test of four properties for arm i)
      Test whether arm i should be deactivated: if the estimated suboptimality of arm i exceeds its confidence term, then A ← A \ {i}, τ_i ← t, and q_i ← p_{i,t}. (q_i denotes the probability of arm i at the moment when it was deactivated.)
      Start Exp3.P, with the parameters described in [Theorem 2.4, Bubeck 2010], if one of the three following properties is satisfied (test whether the stochastic model is still valid for arm i):
        first, test whether the two estimates of μ_i are consistent with each other;
        second, test whether the estimated suboptimality of arm i did not increase too much;
        third, test whether arm i still seems significantly suboptimal.
    end for  (end of testing)
    for i = 1, ..., K  (update of the probability of selecting arm i)
      if arm i is deactivated, set p_{i,t+1} in proportion to q_i τ_i / (t + 1); otherwise, let the active arms share equally the probability mass not taken by the deactivated arms.
    end for
  end for
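The probability update step can be illustrated with a short sketch, assuming that a deactivated arm j is selected with probability q_j τ_j / (t + 1) while the active arms share the remaining mass equally; this is an approximation for illustration only, not necessarily the exact rule of Algorithm 1, and the function name is hypothetical.

    # Minimal sketch of the assumed (re)sampling update (illustrative, not the paper's code).
    def update_probabilities(t, active, q, tau, K):
        """Return p[i] for round t+1.

        active    : set of currently active arms
        q[i], tau[i] : selection probability and time at which arm i was deactivated
        """
        p = [0.0] * K
        inactive_mass = 0.0
        for j in range(K):
            if j not in active:
                p[j] = q[j] * tau[j] / (t + 1)            # decays as t grows
                inactive_mass += p[j]
        for i in active:
            p[i] = (1.0 - inactive_mass) / len(active)    # active arms split the rest equally
        return p

Under this assumed rule, the expected number of plays of a deactivated arm j after its deactivation is about q_j τ_j log(n/τ_j), i.e., only logarithmically larger than what it had accumulated by the time of deactivation.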
Appendix B. SAO (Stochastic and Adversarial Optimal algorithm)

In this section we treat the general case: K arms and an adaptive adversary. The proposed algorithm, SAO, is described precisely in Algorithm 1. On a high level, SAO proceeds similarly to the simplified version in Section 3, but there are a few key differences.

First, the exploration and exploitation phases are now interleaved. Indeed, SAO starts with all arms being "active", and then it successively "deactivates" them as they turn out to be suboptimal. Thus, the algorithm evolves from pure exploration (when all arms are active) to pure exploitation (when all arms but the optimal one are deactivated).

Second, in order to make this evolution smooth, we adopt a more complicated (re)sampling schedule than the one we used in Section 3. Namely, the probability of selecting a given arm continuously increases while this arm stays active, then continuously decreases once it gets deactivated, and the transition between the two regimes is also continuous. For the precise equation, see the probability update in Algorithm 1.

Third, this more subtle behavior of the (re)sampling probabilities in turn necessitates more complicated consistency conditions and a more intricate analysis. The key in the analysis is to obtain good concentration properties for the different estimators, which we accomplish by exhibiting martingale sequences and resorting to Bernstein's inequality for martingales (Theorem 14).

Recall that the crucial parameter for the stochastic model is the minimal gap Δ = min_{i: Δ_i > 0} Δ_i, where Δ_i = (max_j μ_j) − μ_i is the gap of arm i. Our main result is formulated as follows.

Theorem 15 In the stochastic model, SAO satisfies E[R̄_n] = Õ(K/Δ), and in the adversarial model it satisfies E[R_n] = Õ(√(Kn)), where Õ(·) hides polylogarithmic factors in n and K. More precisely, for a suitable choice of the parameter β (depending on K and n), both bounds hold with high probability, with explicit logarithmic factors of the form (1 + log K)(1 + log n).

We divide the proof into three parts. In Section B.1 we propose several concentration inequalities for the different quantities involved in the algorithm. Then we make a deterministic argument conditional on the event that all these concentration inequalities hold true: first, in Section B.2, we analyze stochastic rewards, and then Section B.3 concerns the adversarial rewards.

Proof of Corollary 12 Fix β > 0 and consider two cases, depending on whether β is small or large. In the first case we can take ε proportional to β in both parts of Theorem 11 and obtain the claimed two-sided deviation bound on |X̄ − μ|. In the second case we can still take ε proportional to β in Theorem 11(b) to control the lower deviation, and then take a constant ε in Theorem 11(a) to control the upper deviation. Combining the two bounds completes the proof.
Next we present two important concentration inequalities for martingale sequences that we use in the analysis of the general case (K > 2).

Theorem 13 (Hoeffding-Azuma's inequality for martingales) Let F = (F_t) be a filtration, and X_1, ..., X_n real random variables such that X_t is F_t-measurable, E[X_t | F_{t-1}] = 0 and X_t ∈ [A_t, A_t + c], where A_t is an F_{t-1}-measurable random variable and c is a positive constant. Then, for any ε > 0,
  Pr[ Σ_{t≤n} X_t ≥ ε ] ≤ exp(−2ε² / (n c²)),
or equivalently, for any δ > 0, with probability at least 1 − δ,
  Σ_{t≤n} X_t ≤ c √((n/2) log(1/δ)).

Theorem 14 (Bernstein's inequality for martingales, Freedman 1975) Let F = (F_t) be a filtration, and X_1, ..., X_n real random variables such that X_t is F_t-measurable, E[X_t | F_{t-1}] = 0 and |X_t| ≤ b for some b > 0, and let V_n = Σ_{t≤n} E[X_t² | F_{t-1}]. Then, for any ε > 0 and v > 0,
  Pr[ Σ_{t≤n} X_t ≥ ε and V_n ≤ v ] ≤ exp(−ε² / (2v + 2bε/3)),
and, for any δ > 0, with probability at least 1 − δ, we have either V_n > v or
  Σ_{t≤n} X_t ≤ √(2v log(1/δ)) + (2b/3) log(1/δ).   (16)

Robert Kleinberg and Aleksandrs Slivkins. Sharp Dichotomies for Regret Minimization in Metric Spaces. In 21st ACM-SIAM Symp. on Discrete Algorithms (SODA), 2010.

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-Armed Bandits in Metric Spaces. In 40th ACM Symp. on Theory of Computing (STOC), pages 681-690, 2008.

T. L. Lai and Herbert Robbins. Asymptotically efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4-22, 1985.

Odalric-Ambrym Maillard and Rémi Munos. Adaptive Bandits: Towards the best history-dependent strategy. In 24th Conf. on Learning Theory (COLT), 2011.

Colin McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez, and B. Reed, editors, Probabilistic Methods for Discrete Mathematics, pages 195-248. Springer-Verlag, Berlin, 1998.

V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates. Arxiv preprint arXiv:1110.6084, 2011.

Herbert Robbins. Some Aspects of the Sequential Design of Experiments. Bull. Amer. Math. Soc., 58:527-535, 1952.

Aleksandrs Slivkins. Contextual Bandits with Similarity Information. In 24th Conf. on Learning Theory (COLT), 2011.

G. Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Université Paris-Sud, Orsay, France, May 2005.

Appendix A. Concentration inequalities

Recall that the analysis in Section 3 relies on Chernoff Bounds as stated in Theorem 3. Let us derive Theorem 3 from a version of Chernoff Bounds that can be found in the literature.

Theorem 11 (Chernoff Bounds: Theorem 2.3 in McDiarmid 1998) Consider n i.i.d. random variables X_1, ..., X_n with values in [0, 1]. Let X̄ be their average, and let μ = E[X̄]. Then for any ε > 0 the following two properties hold:
  Pr[ X̄ ≥ (1 + ε) μ ] ≤ exp(−n μ ε² / (2(1 + ε/3))),   Pr[ X̄ ≤ (1 − ε) μ ] ≤ exp(−n μ ε² / 2).

Corollary 12 In the setting of Theorem 11, for any β > 0 we have Pr[ |X̄ − μ| > β max(β, √μ) ] ≤ 2 exp(−Ω(n β²)).

We obtain Theorem 3 by taking β = √(C/n), noting that n β max(β, √μ) ≤ C max(1, √(nμ)) whenever C ≥ 1.
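As an aside, the following generic derivation shows how a statement of the form of Theorem 14 yields a bound that adapts to the realized variance V_n, which is the shape of inequality used in Lemma 16; the constants below are illustrative and need not match the paper's.

% Apply Theorem 14 with v = k b^2 for k = 1, ..., n and take a union bound:
\[
\Pr\Big[\exists\, k \le n:\ V_n \le k b^2 \ \text{ and } \
\sum_{t=1}^{n} X_t \ge \sqrt{2 k b^2 \log\tfrac{n}{\delta}} + \tfrac{2b}{3}\log\tfrac{n}{\delta}\Big] \le \delta .
\]
% Since V_n <= n b^2 always holds, on the complement event take the smallest k with
% V_n <= k b^2, so that k b^2 <= V_n + b^2; with probability at least 1 - delta,
\[
\sum_{t=1}^{n} X_t
\;\le\; \sqrt{2 (V_n + b^2)\log\tfrac{n}{\delta}} + \tfrac{2b}{3}\log\tfrac{n}{\delta}
\;\le\; \sqrt{2 V_n \log\tfrac{n}{\delta}} + b\sqrt{2\log\tfrac{n}{\delta}} + \tfrac{2b}{3}\log\tfrac{n}{\delta}.
\]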
References

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization. In 21st Conf. on Learning Theory (COLT), pages 263-274, 2008.

J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876-1902, 2009.

J.-Y. Audibert and S. Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. J. of Machine Learning Research (JMLR), 11:2785-2836, 2010. A preliminary version has been published in COLT 2009.

P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61:55-65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a. Preliminary version in 15th ICML, 1998.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2002b. Preliminary version in 36th IEEE FOCS, 1995.

Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. Characterizing truthful multi-armed bandit mechanisms. In 10th ACM Conf. on Electronic Commerce (EC), pages 79-88, 2009.

Moshe Babaioff, Robert Kleinberg, and Aleksandrs Slivkins. Truthful mechanisms with implicit payment computation. In 11th ACM Conf. on Electronic Commerce (EC), pages 43-52, 2010. Best Paper Award.

S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université Lille 1, 2010.

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online Optimization in X-Armed Bandits. J. of Machine Learning Research (JMLR), 12:1587-1627, 2011. Preliminary version in NIPS 2008.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge Univ. Press, 2006.

Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66:321-352, 2007. Preliminary version in COLT 2005.

Varsha Dani, Thomas P. Hayes, and Sham Kakade. Stochastic Linear Optimization under Bandit Feedback. In 21st Conf. on Learning Theory (COLT), 2008.

Nikhil Devanur and Sham M. Kakade. The price of truthfulness for pay-per-click auctions. In 10th ACM Conf. on Electronic Commerce (EC), pages 99-106, 2009.

D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3:100-118, 1975.

Aurélien Garivier and Olivier Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In 24th Conf. on Learning Theory (COLT), 2011.

Elad Hazan and Satyen Kale. Better algorithms for benign bandits. In 20th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 38-47, 2009.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

J. Honda and A. Takemura. An asymptotically optimal bandit algorithm for bounded support models. In 23rd Conf. on Learning Theory (COLT), 2010.

are mutually independent. Therefore Chernoff Bounds apply, and w.h.p. the corresponding count is at least a constant fraction of its expectation. Using part (a), it follows that the claimed lower bound holds.
Recall that Claim 5(b) connects the algorithm's estimate G̃_{i,t} and the benchmark average Ḡ_{i,t} (we will re-use this claim later in the proofs). In the stochastic model these two quantities, as well as the algorithm's average H̄_{i,t}, are close to the respective expected reward. The following claim makes this connection precise.

Claim 9 Assume the stochastic model. Then during the exploitation phase, for each arm i and each time t, the following holds with high probability: each of |Ḡ_{i,t} − μ_i|, |G̃_{i,t}/t − μ_i| and |H̄_{i,t} − μ_i| is at most its respective confidence term.

Proof All three inequalities follow from Chernoff Bounds. The first inequality follows immediately. To obtain the other two inequalities, we claim that w.h.p. the average of the first m samples of arm i is within its confidence term of μ_i, for every m. Indeed, note that without loss of generality independent samples from the reward distribution of arm i are drawn in advance, and then the reward from the m-th play of arm i is the m-th sample. Then by Chernoff Bounds the bound holds w.h.p. for each m = T_i(t), and one can take the Union Bound over all m to obtain the claim. Claim proved. Finally, we plug in the lower bounds on T_i(t) from Claim 8.

Now that we have all the groundwork, let us argue that in the stochastic model the consistency conditions in the algorithm are satisfied with high probability.

Corollary 10 Assume the stochastic model. Then in each round of the exploitation phase, with high probability, the estimated gap H̄_{1,t} − H̄_{2,t} is within a constant factor of Δ, and Δ is at least of order C_cr/√t₀. Moreover, the consistency conditions (2) and (3) are satisfied.

Proof The first statement follows simply by combining Claim 5(b) and Claim 9. To obtain the second, we note that by Claim 5(b) and Claim 9, w.h.p. the estimated gap at time t₀ is close to Δ. Recall that Condition (1) holds at time t₀ − 1 and fails at t₀. This, in conjunction with the previous bound, implies the second statement. In turn, the two statements imply Conditions (2) and (3).

To complete the proof of Theorem 2(b), assume we are in the stochastic model with gap Δ. In the rest of the argument, we omit "with high probability". If the exploration phase never ends, it is easy to see that Δ is at most of order the final confidence radius, and we are done since trivially R̄_n ≤ nΔ, which is then at most Õ(1/Δ). Else, by Corollary 10, arm 1 is optimal, the estimated gap is Θ(Δ), and moreover the exploitation phase never ends. Now, by Claim 4, in the exploitation phase the suboptimal arm is played at most the number of times allowed by the re-sampling schedule. Therefore R̄_n ≤ O((1/Δ) log³ n).

Now we are ready for the final computations. We will need to consider three cases, depending on which phase the algorithm is in when it halts (i.e., reaches the time horizon). First, if the exploration phase never ends, then by Claim 5(a) w.h.p. the estimates G̃_{i,n}/n are close to the benchmark averages Ḡ_{i,n} for each arm i, and the exit condition (1) never fails; this implies the claimed regret bound Õ(√n). From here on, let us assume that the exploration phase ends at some time t₀ < n. Define the regret on a time interval [a, b] as

  R_{[a,b]} = max_i Σ_{t=a}^{b} g_{i,t} − Σ_{t=a}^{b} g_{I_t,t}.

Let t₁ be the last round in the exploitation phase. By Corollary 6 and Claim 7, the regret accumulated by round t₁ is at most of order √(t₁) polylog(n). Second, if t₁ = n (i.e., the algorithm halts during exploitation) then we are done. Third, if the algorithm enters the adversarial phase then we can use the regret bound for Exp3.P (Auer et al. 2002b), which states that w.h.p. the regret accumulated in the adversarial phase is at most of order √(n log n).
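The case analysis above implicitly uses the following elementary decomposition of the regret over the three phases, stated here for readability (a generic observation about the interval regret R_{[a,b]} defined above, not an additional claim):

\[
R_n \;=\; \max_i \sum_{t=1}^{n} g_{i,t} \;-\; \sum_{t=1}^{n} g_{I_t,t}
\;\le\; R_{[1,t_0]} \;+\; R_{(t_0,t_1]} \;+\; R_{(t_1,n]},
\]

since the maximum of a sum over {1, ..., n} is at most the sum of the maxima over the three sub-intervals; the three terms are then bounded by the exploration, exploitation, and Exp3.P arguments, respectively.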
Therefore, combining the regret accumulated in the three phases, we obtain R_n ≤ Õ(√n). This completes the proof of Theorem 2(a).

3.4. Analysis: stochastic model

We start with a simple claim that w.h.p. each arm is played sufficiently often during exploration, and arm 2 is played sufficiently often during exploitation. This claim complements Claim 4 from the previous subsection, which states that arm 2 is not played too often during exploitation (we will re-use Claim 4 later in the proofs).

Claim 8 With high probability it holds that: (a) during the exploration phase, each arm is played at least a constant fraction of the rounds; (b) during the exploitation phase, for each time t, arm 2 has been played at least a constant fraction of the number of times prescribed by the re-sampling schedule.

Proof Both parts follow from Chernoff Bounds. The only subtlety is to ensure that we do not condition the summands (in the sum that we apply the Chernoff Bounds to) on a particular value of t₀ or on the fact that arm 1 is chosen for exploitation.

For part (a), without loss of generality assume that fair coins are tossed in advance, so that in the s-th round of exploration we use the s-th coin toss to decide which arm is chosen. Then by Chernoff Bounds, for each s, w.h.p. among the first s coin tosses there are at least the required number of heads and at least this many tails. We take the Union Bound over all s, so in particular this holds for s = t₀. The claim follows because we force exploration to last for at least a prescribed number of rounds.

For part (b), let us analyze the exploitation phase separately. We are interested in the sum Σ_t 1{I_t = 2}, where t ranges over all rounds in the exploitation phase. We will work in the post-exploration probability space. The indicator variables 1{I_t = 2}, for all rounds t during exploitation, are mutually independent in this space, so Chernoff Bounds apply.

From part (a), we have a lower bound of order t₀ on the number of plays of each arm during exploration. Combining Claim 5(b) and Condition (2), we obtain:

Corollary 6 In the exploitation phase, for any round t (except possibly the very last round in the phase), it holds w.h.p. that G_{1,t} ≥ G_{2,t}.

By Corollary 6, the regret accumulated by round t in the exploitation phase is, with high probability, equal to G_{1,t} − (H_{1,t} + H_{2,t}). The following claim upper-bounds this quantity by Õ(√t). The proof of this claim contains our main regret computation.

Claim 7 For any round t in the exploitation phase it holds w.h.p. that G_{1,t} − (H_{1,t} + H_{2,t}) ≤ O(√t log² n).

Proof Throughout this proof, let us assume that the high-probability events in Claim 4 and Claim 5 actually hold; we will omit "with high probability" from here on. Let t be some (but not the last) round in the exploitation phase.

First, we lower-bound the algorithm's average reward on arm 1, upper-bounding the two relevant error terms using, respectively, Claim 5(b) and Condition (3). This shows that the algorithm's average for arm 1, H̄_{1,t}, is not too small compared to the corresponding benchmark average Ḡ_{1,t}; the estimate G̃_{1,t}/t served as an intermediary in the argument.

Similarly, using Condition (2), Condition (3), and Claim 5(b) to upper-bound the three relevant error terms, we obtain that the algorithm's average for arm 2, H̄_{2,t}, is not too small compared to the benchmark average for arm 1, Ḡ_{1,t}. Here we have proved that the algorithm did not do too badly playing arm 2, even though this arm was supposed to be suboptimal. Again, the estimates G̃_{i,t}/t served as intermediaries in the proof.

Finally, let us go from bounds on average rewards to bounds on cumulative rewards (and prove the claim). Combining the two bounds above with Claim 4, we get H_{1,t} + H_{2,t} ≥ T_1(t) Ḡ_{1,t} + T_2(t) Ḡ_{1,t} − O(√t log² n) = G_{1,t} − O(√t log² n).

Proof of Claim 5 For part (a), we are interested in the estimates G̃_{i,t₀}. As in the proof of Claim 4, Chernoff Bounds do not immediately apply since the number of summands is a random variable (and conditioning on a particular value of t₀ tampers with independence between summands). So let us consider an alternative algorithm in which the exploration phase proceeds indefinitely, without the stopping condition, and which uses the same randomness as the original algorithm. Let J_s be the arm selected in round s of this alternative algorithm, and define the corresponding estimates. Then (when run on the same problem instance) both algorithms coincide for any s ≤ t₀, so in particular the estimates coincide at time t₀. Now, the estimate in question is a sum of bounded independent random variables with the correct expectation. Therefore, by Chernoff Bounds, w.h.p. its deviation is at most the stated confidence term
for each arm, which implies the claim.

For part (b), we will analyze the exploitation phase separately. Let us work in the post-exploration probability space. We will consider the alternative algorithm from the proof of Claim 4 (in which exploitation continues indefinitely). This way we do not need to worry that we implicitly condition on the event that a particular round t > t₀ belongs to the exploitation phase. Clearly, it suffices to prove (5) for this alternative algorithm. To facilitate the notation, define the time interval INT = {t₀ + 1, ..., t}, and denote by G_{i,INT} and G̃_{i,INT} the corresponding (estimated) cumulative rewards on INT.

To handle arm 1, note that G̃_{1,INT} is a sum of independent random variables with expectation G_{1,INT}. Since p_{1,s} ≥ 1/2 for any s, the summands are bounded by 2. Therefore, by Chernoff Bounds, with high probability the deviation is at most the corresponding confidence term. From this and part (a), the claimed bound follows for arm 1.

Handling arm 2 requires a little more work, since the summands may be large (they have a small probability in the denominator). We rescale: (1/2t) Σ_{s∈INT} g_{2,s} 1{I_s = 2} / p_{2,s} = Σ_{s∈INT} X_s, where X_s = g_{2,s} 1{I_s = 2} / (2t p_{2,s}) ∈ [0, 1]. In this form, the random variables X_s are mutually independent, and the expectation of their sum is (1/2t) E[G̃_{2,INT}] = G_{2,INT}/(2t), which is at most 1/2. By Chernoff Bounds, w.h.p. the deviation of this sum from its expectation is at most of order C_cr. Going back to the original scale and combining with part (a), we obtain the claimed bound for arm 2.

6. Note that this is not the same "alternative algorithm" as the one in the proof of Claim 5(a).

with high probability (abbreviated w.h.p.), which will mean with probability at least 1 minus a term that is polynomially small in n, where the exact term depends on the context. To parameterize the algorithm, let us fix some C_cr of the form 12 ln(·) such that Theorem 3 ensures the required success probability.

3.3. Analysis: adversarial model

We need to analyze our algorithm in two different reward models. We start with the adversarial model, so that we can re-use some of the claims proved here to analyze the stochastic model. Recall that t₀ denotes the duration of the exploration phase (which in general is a random variable). Following the convention from Section 3.1, whenever the exploration phase ends, the arm chosen for exploitation is arm 1. (Note that we do not assume that arm 1 is the best arm.) We start the analysis by showing that the re-sampling schedule in the exploitation phase does not result in playing arm 2 too often.

Claim 4 During the exploitation phase, w.h.p. arm 2 is played at most a constant multiple of the number of times prescribed in expectation by the re-sampling schedule.

Proof We will work in the post-exploration probability space. We need to bound from above the sum Σ_t 1{I_t = 2}, where t ranges over the exploitation phase. However, Chernoff Bounds do not immediately apply since the number of summands itself is a random variable. Further, if we condition on a specific duration of exploitation then we break independence between summands. We sidestep this issue by considering an alternative algorithm in which exploitation lasts indefinitely (i.e., without the stopping conditions), and which uses the same randomness as the original algorithm. It suffices to bound from above the number of times that arm 2 is played during the exploitation phase in this alternative algorithm; denote this number by N. Letting J_t be the arm selected in round t of the alternative algorithm, we have that N is a sum of 0-1 random variables, and in the post-exploration probability space these variables are independent. Moreover, in this space E[N] equals the sum of the re-sampling probabilities. Therefore, the claim follows from Chernoff Bounds.
Now we connect the estimated cumulative rewards G̃_{i,t} with the benchmark G_{i,t}. More specifically, we will bound from above several expressions of the form |G̃_{i,t} − G_{i,t}|/t. Naturally, the upper bound for arm 1 will be stronger since this arm is played more often during exploitation. To ensure that the bound for arm 2 is strong enough we need to play this arm "sufficiently often" during exploitation. (Whereas Claim 4 ensures that we do not play it "too often".) Here and elsewhere in this analysis, we find it more elegant to express some of the claims in terms of the average cumulative rewards (such as Ḡ_{i,t}, G̃_{i,t}/t, H̄_{i,t}, etc.).

Claim 5 (a) With high probability, |G̃_{i,t₀} − G_{i,t₀}|/t₀ is at most its confidence term for each arm i. (b) For any round t in the exploitation phase, with high probability, |G̃_{1,t} − G_{1,t}|/t and |G̃_{2,t} − G_{2,t}|/t are at most their respective confidence terms, the latter being of order C_cr √n / t.   (5)

The exploration phase is simple: Condition (1) is chosen so that once it fails then (assuming stochastic rewards) the seemingly better arm is indeed the best arm with high probability. In the exploitation phase, we define the re-sampling schedule for arm 2 and a collection of "consistency conditions". The re-sampling schedule should be sufficiently rare to avoid accumulating much regret if arm 2 is indeed a bad arm. The consistency conditions should be sufficiently strong to justify using the stochastic model as an operating assumption while they hold. Namely, an adversary constrained by these conditions should not be able to inflict too much regret on our algorithm in the first two phases. Yet, the consistency conditions should be weak enough so that they hold with high probability in the stochastic model, despite the low sampling probability of arm 2.

It is essential that we use both estimators H̄_{2,t} and G̃_{2,t}/t in the consistency conditions: the interplay of these two estimators allows us to bound regret in the adversarial model. Other than that, the conditions that we use are fairly natural (the surprising part is that they work). Condition (2) checks whether the relation between the two arms is consistent with the outcome of the exploration phase, i.e. whether arm 1 still seems better than arm 2, but not too much better. Condition (3) checks whether, for each arm i, the estimate G̃_{i,t}/t is close to the average H̄_{i,t}. In the stochastic model, both estimate the expected gain μ_i, so we expect them to be not too far apart. However, our definition of "too far" should be consistent with how often a given arm is sampled.

3.2. Concentration inequalities

The "probabilistic" aspect of the analysis is confined to proving that several properties of estimates and sampling times hold with high probability. The rest of the analysis can proceed as if these properties held with probability 1. In particular, we have made our core argument essentially deterministic, which greatly simplifies presentation.

All high-probability results are obtained using an elementary concentration inequality loosely known as Chernoff Bounds. For the sake of simplicity, we use a slightly weaker formulation below (see Appendix A for a proof), which uses just one inequality for all cases.

Theorem 3 (Chernoff Bounds) Let X_1, ..., X_n be independent random variables such that X_t ∈ [0, 1] for each t. Let X be their sum, and let μ = E[X]. Then Pr[ |X − μ| > C max(1, √μ) ] ≤ 2 e^{−cC} for any C ≥ 1, where c > 0 is an absolute constant.

We will often need to apply Chernoff Bounds to sums whose summands depend on some events in the execution of the algorithm and therefore are not mutually independent. However, in all cases these issues are but a minor technical obstacle which can be side-stepped using a slightly more careful setup. In particular, we sometimes find it useful to work in the probability space obtained by conditioning on the outcome of the exploration phase. Specifically, the post-exploration probability space is the probability space obtained by conditioning on the following events: that the exploration phase ends, that it has a specific duration t₀, and that arm 1 is chosen for exploitation.

Throughout the analysis, we will obtain concentration bounds that hold with probability at least 1 minus a polynomially small term. We will often take a Union Bound over all rounds, which preserves this guarantee. To simplify presentation, we will allow a slight abuse of notation when saying "with high probability".
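As a quick numerical illustration of the deviation form in Theorem 3, the snippet below estimates the tail frequency empirically; it is a sanity check only, and the sample sizes and parameters are arbitrary choices, not values from the paper.

    import random

    # Monte Carlo check: for a sum X of n independent [0,1] variables with mean mu = E[X],
    # deviations beyond C * max(1, sqrt(mu)) should become exponentially rare as C grows.
    def tail_frequency(n=1000, p=0.3, C=3.0, trials=20000, seed=0):
        rng = random.Random(seed)
        mu = n * p
        threshold = C * max(1.0, mu ** 0.5)
        exceed = 0
        for _ in range(trials):
            x = sum(1 for _ in range(n) if rng.random() < p)  # Bernoulli(p) summands
            if abs(x - mu) > threshold:
                exceed += 1
        return exceed / trials

    if __name__ == "__main__":
        for C in (1.0, 2.0, 3.0, 4.0):
            print(C, tail_frequency(C=C))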
5. However, the independence issues appear prohibitive for K > 2 arms or if we consider a non-oblivious adversary. So for the general case we resorted to a more complicated analysis via martingale inequalities.

We are looking for the "best-of-both-worlds" feature: Õ(√n) regret in the adversarial model and Õ(1/Δ) regret in the stochastic model, where Δ = |μ_1 − μ_2| is the gap. Our goal in this section is to obtain this feature in the simplest way possible. In particular, we will hide the constants under the O(·) notation and will not attempt to optimize the polylog factors; also, we will assume an oblivious adversary. We will prove the following theorem.

Theorem 2 Consider a MAB problem with two arms. There exists an algorithm such that: (a) against an oblivious adversary, its expected regret is Õ(√n); (b) in the stochastic model, its expected regret satisfies E[R̄_n] ≤ O((1/Δ) polylog(n)). Both regret bounds also hold with high probability.

Note that in the stochastic model, regret trivially cannot be larger than Δn, so part (b) trivially implies regret E[R̄_n] ≤ Õ(√n). Our analysis proceeds via high-probability arguments and directly obtains the high-probability guarantees. The high-probability arguments tend to make the analysis cleaner; we suspect it cannot be made much simpler if we only seek bounds on expected regret.

3.1. A simplified SAO (Stochastic and Adversarial Optimal) algorithm

The algorithm proceeds in three phases: exploration, exploitation, and the adversarial phase. In the exploration phase, we alternate the two arms until one of them appears significantly better than the other. In the exploitation phase, we focus on the better arm, but re-sample the other arm with small probability. We check several consistency conditions which should hold with high probability if the rewards are stochastic. When and if one of these conditions fails, we declare that we are not in the case of stochastic rewards, and switch to running a bandit algorithm for the adversarial model, namely Exp3.P (Auer et al. 2002b). The algorithm is parameterized by C_cr = Θ(log n), which we will choose later in Section 3.2.

The formal description of the three phases is as follows (an illustrative sketch in code follows the description).

(Exploration phase) In each round, pick an arm uniformly at random. Go to the next phase as soon as t exceeds a prescribed initial duration and the following condition fails:
  the empirical averages of the two arms are within their confidence terms of each other.   (1)
Let t₀ be the duration of this phase. Without loss of generality, assume that arm 1 has the larger empirical average at time t₀. This means, informally, that arm 1 is selected for exploitation.

(Exploitation phase) In each round t > t₀, pick arm 2 with a small re-sampling probability p_t, and arm 1 with the remaining probability 1 − p_t. After the round, check the following consistency conditions:
  the estimated cumulative reward of arm 1 still exceeds that of arm 2, but by at most a term of order C_cr √n;   (2)
  for each arm i, the two estimates G̃_{i,t}/t and H̄_{i,t} are within their respective confidence terms of each other.   (3)
If one of these conditions fails, go to the next phase.

(Adversarial phase) Run algorithm Exp3.P (Auer et al. 2002b).
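The following sketch illustrates the three-phase structure just described. It is not the paper's algorithm: the confidence radius, the re-sampling probability, and the consistency check are simple placeholders standing in for conditions (1)-(3), whose exact form is specified by the algorithm, and exp3p_step is a trivial stand-in for Exp3.P. The function pull(t, i) is a hypothetical interface returning the observed reward of arm i at round t.

    import math, random

    # Minimal sketch of the three-phase scheme for K = 2 arms (illustrative only).
    def simplified_sao(pull, n, C=3.0, seed=0):
        rng = random.Random(seed)
        H = [0.0, 0.0]        # reward collected from each arm
        T = [0, 0]            # number of plays of each arm
        Gtilde = [0.0, 0.0]   # importance-weighted cumulative reward estimates

        def conf(m):          # placeholder confidence radius
            return C * math.sqrt(math.log(n + 1) / max(m, 1))

        def exp3p_step(t):    # trivial stand-in for Exp3.P (Auer et al., 2002b)
            return rng.randrange(2)

        t = 0
        # Exploration: alternate the two arms until one looks significantly better.
        while t < n:
            i = t % 2
            g = pull(t, i)
            H[i] += g; T[i] += 1; Gtilde[i] += g / 0.5   # each arm has probability 1/2 here
            t += 1
            if T[0] and T[1]:
                gap = H[0] / T[0] - H[1] / T[1]
                if abs(gap) > conf(T[0]) + conf(T[1]):   # placeholder for condition (1)
                    break
        best = 0 if T[1] == 0 or H[0] / max(T[0], 1) >= H[1] / T[1] else 1

        # Exploitation: mostly play the better arm, re-sample the other one rarely,
        # and fall back to the adversarial phase if a consistency check fails.
        while t < n:
            p_other = min(0.5, C / math.sqrt(t + 1))     # placeholder re-sampling schedule
            i = (1 - best) if rng.random() < p_other else best
            p_i = p_other if i != best else 1.0 - p_other
            g = pull(t, i)
            H[i] += g; T[i] += 1; Gtilde[i] += g / p_i
            t += 1
            if Gtilde[best] < Gtilde[1 - best]:          # placeholder consistency check
                break

        # Adversarial phase: hand the remaining rounds to an Exp3.P-style routine.
        while t < n:
            pull(t, exp3p_step(t))
            t += 1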
The fact that (a) and (b) are not mutually exclusive is surprising and unexpected. More precisely, the consistency conditions should be strong enough to insure us from losing too much in the first two phases even if we are in the adversarial model. We use a specific re-sampling schedule for arm 2 which is rare enough so that we do not accumulate much regret if this is indeed a bad arm, and yet sufficient to check the consistency conditions.

To extend to the K-arm case, we "interleave" exploration and exploitation, "deactivating" arms one by one as they turn out to be suboptimal. The sampling probability of a given arm increases while the arm stays active, and then decreases after it is deactivated, with a smooth transition between the two phases. This complicated behavior (and the fact that we handle general adversaries) in turn necessitates a more delicate analysis.

2. Preliminaries

We consider randomized algorithms, in the sense that I_t (the arm chosen at time t) is drawn from a probability distribution p_t over {1, ..., K}. We denote by p_{i,t} the probability that I_t = i. Given such a randomized algorithm, it is a well-known trick to use (g_{i,t}/p_{i,t}) 1{I_t = i} as an unbiased estimate of the reward g_{i,t}. Now for arm i and time t we introduce:

  G_{i,t} = Σ_{s≤t} g_{i,s}   (fixed-arm cumulative reward of arm i up to time t),
  G̃_{i,t} = Σ_{s≤t} (g_{i,s}/p_{i,s}) 1{I_s = i}   (estimated cumulative reward of arm i up to time t),
  H_{i,t} = Σ_{s≤t} g_{i,s} 1{I_s = i}   (algorithm's cumulative reward from arm i up to time t),
  T_i(t) = Σ_{s≤t} 1{I_s = i}   (sampling time of arm i up to time t).

The corresponding averages are Ḡ_{i,t} = G_{i,t}/t, G̃_{i,t}/t, and H̄_{i,t} = H_{i,t}/T_i(t); here G_{i,t} is the cumulative reward of a "fixed-arm algorithm" that always plays arm i. Recall that our benchmarks are max_i G_{i,n} for the adversarial model and max_i E[G_{i,n}] for the stochastic model. Note that G̃_{i,t}, H_{i,t} and T_i(t) are observed by the algorithm, whereas G_{i,t} is not. Informally, G̃_{i,t}/t and H̄_{i,t} are estimates for the expected reward μ_i in the stochastic model, and G̃_{i,t} is an estimate for the benchmark reward G_{i,t} in the adversarial model.

In the stochastic model we define the gap of arm i as Δ_i = (max_j μ_j) − μ_i, and the minimal gap as Δ = min_{i: Δ_i > 0} Δ_i.

Following the literature, we measure the algorithm's performance in terms of the regrets R_n and R̄_n defined in Figure 1. The two notions of regret are somewhat different; in particular, the "stochastic regret" E[R̄_n] is not exactly equal to the expected "adversarial regret" E[R_n]. However, in the stochastic model they are approximately equal: E[R̄_n] ≤ E[R_n] ≤ E[R̄_n] + Õ(√n).

3. The case of K = 2 arms

We will derive a (slightly weaker version of) the main result for the special case of K = 2 arms and an oblivious adversary, using a simplified algorithm. This version contains most of the ideas from the general case, but can be presented in a more lucid fashion.
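The bookkeeping behind these definitions is straightforward; the sketch below mirrors the notation (the names match the symbols above and are otherwise illustrative).

    # Illustrative per-round bookkeeping for the quantities of Section 2.
    # G[i]      : fixed-arm cumulative reward (not observable by the algorithm)
    # Gtilde[i] : importance-weighted estimate of G[i]
    # H[i]      : algorithm's cumulative reward collected from arm i
    # T[i]      : number of plays of arm i
    def record_round(i_chosen, p, g_all, G, Gtilde, H, T):
        """p[i] = probability of choosing arm i this round; g_all[i] = reward of arm i."""
        for i, g in enumerate(g_all):
            G[i] += g                      # benchmark bookkeeping only
            if i == i_chosen:
                Gtilde[i] += g / p[i]      # importance-weighted estimate
                H[i] += g
                T[i] += 1

Conditional on the past, E[(g_{i,t}/p_{i,t}) 1{I_t = i}] = g_{i,t}, which is the unbiasedness property used throughout the analysis.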
4. This fact is well-known and easy to prove, e.g. see Proposition 34 in Audibert and Bubeck (2010).

Similar results have been obtained for the full-feedback ("experts") version in Cesa-Bianchi et al. (2007); Abernethy et al. (2008). Also, the regret bound for UCB1 depends on the gap Δ, and matches the optimal worst-case bound for the stochastic model (up to logarithmic factors). Moreover, adaptivity to "nice" problem instances is a crucial theme in the work on bandits in metric spaces (Kleinberg et al. 2008; Bubeck et al. 2011; Slivkins 2011), an MAB setting in which some information on similarity between arms is a priori available to an algorithm.

The distinction between polylog(n) and √n regret has been crucial in other MAB settings: bandits with linear rewards (Dani et al. 2008), bandits in metric spaces (Kleinberg and Slivkins 2010), and an extension of MAB to auctions (Babaioff et al. 2009; Devanur and Kakade 2009; Babaioff et al. 2010). Interestingly, here we have four different MAB settings (including the one in this paper) in which this distinction occurs for four different reasons, with no apparent connections.

A proper survey of the literature on multi-armed bandits is beyond the scope of this paper; a reader is encouraged to refer to Cesa-Bianchi and Lugosi (2006) for background. An important high-level distinction is between Bayesian and non-Bayesian MAB formulations. Both have a rich literature; this paper focuses on the latter. The "basic" MAB version defined in this paper has been extended in various papers to include additional information and/or assumptions about rewards.

Most relevant to this paper are the algorithms UCB1 (Auer et al. 2002a) and Exp3 (Auer et al. 2002b). UCB1 has a slightly more refined regret bound than the one that we cited earlier: E[R̄_n] = O(Σ_{i: Δ_i > 0} (log n)/Δ_i), with high probability. A matching lower bound (up to the considerations of the variance and constant factors) is proved in Lai and Robbins (1985). Several recent papers (Auer and Ortner 2010; Honda and Takemura 2010; Audibert et al. 2009; Audibert and Bubeck 2010; Maillard and Munos 2011; Garivier and Cappé 2011; Perchet and Rigollet 2011) improve over UCB1, obtaining algorithms with regret bounds that are even closer to the lower bound. The regret bound for Exp3 is E[R_n] = O(√(Kn log K)), and a version of Exp3 achieves this with high probability (Auer et al. 2002b). There is a nearly matching lower bound of Ω(√(Kn)). Audibert and Bubeck (2010) have shaved off the logarithmic factor, achieving an algorithm with regret O(√(Kn)) in the adversarial model against an oblivious adversary.

High-level ideas. For clarity, let us consider the simplified algorithm for the special case of two arms and an oblivious adversary. The algorithm starts with the assumption that the stochastic model is true, and then proceeds in three phases: "exploration", "exploitation", and the "adversarial phase". In the exploration phase, we alternate the two arms until one of them (say, arm 1) appears significantly better than the other. When and if that happens, we move to the exploitation phase where we focus on arm 1, but re-sample arm 2 with small probability. After each round we check several consistency conditions which should hold with high probability if the rewards are stochastic. When and if one of these conditions fails, we declare that we are not in the case of stochastic rewards, and switch to running a bandit algorithm for the adversarial model (a version of Exp3).

Here we have an incarnation of the "attack-defense" tradeoff mentioned earlier in this section: the consistency conditions should be (a) strong enough to justify using the stochastic model as an operating assumption while the conditions hold, and (b) weak enough so that we can check them despite the low sampling probability of arm 2.

3. The result in Hazan and Kale (2009) does not shed light on the question in the present paper, because the "temporal variation" concerns actual rewards rather than expected rewards. In particular, temporal variation is minimal when the actual reward of each arm is constant over time, and (essentially) maximal in the stochastic model with 0-1 rewards.
We answer the above question affirmatively, with a new algorithm called SAO (Stochastic and Adversarial Optimal). To formulate our result, we need to introduce some notation. In the stochastic model, let μ_i be the expected single-round reward from arm i. A crucial parameter is the minimal gap Δ = min_{i: Δ_i > 0} Δ_i, where Δ_i = (max_j μ_j) − μ_i. With this notation, UCB1 attains regret E[R̄_n] = O((K/Δ) log n) in the stochastic model, where K is the number of arms. We are looking for the following: regret E[R_n] = Õ(√(Kn)) in the adversarial model and regret E[R̄_n] = Õ(K/Δ) in the stochastic model, where Õ(·) hides polylog(n) factors. Our main result is as follows.

Theorem 1 There exists an algorithm SAO for the MAB problem such that: (a) in the adversarial model, SAO achieves regret E[R_n] = Õ(√(Kn)); (b) in the stochastic model, SAO achieves regret E[R̄_n] = Õ(K/Δ).

Moreover, with very little extra work we can obtain the corresponding high-probability versions (see Theorem 15 for a precise statement). It is easier, and more instructive, to explain the main ideas on the special case of two arms and an oblivious adversary. This special case (with a simplified algorithm) is presented in Section 3. Due to the page limit, the general case is fleshed out in the Appendix.

The question raised in this paper touches upon an important theme in Machine Learning, and more generally in the design of algorithms with partially known inputs: how to achieve a good worst-case performance and also take advantage of "nice" problem instances. In the context of MAB it is natural to focus on the distinction between stochastic and adversarial rewards, especially given the prominence of the two models in the MAB literature. Then our "best-of-both-worlds" question is the first-order specific question that one needs to resolve. Also, we provide the first analysis of the same MAB algorithm under both adversarial and stochastic rewards.

Once the "best-of-both-worlds" question is settled, several follow-up questions emerge. Most immediately, it is not clear whether the polylog factors can be improved to match the optimal guarantees for each respective model; a lower bound would indicate that the "attack-defense" tradeoff is fundamentally different from the familiar explore-exploit tradeoffs. A natural direction for further work is rewards that are adversarial on a few short time intervals, but stochastic most of the time. Moreover, it is desirable to adapt not only to the binary distinction between the stochastic and adversarial rewards, but also to some form of continuous tradeoff between the two reward models. Finally, we acknowledge that our solution is no more (and no less) than a theoretical proof of concept. More work, theoretical and experimental, and perhaps new ideas or even new algorithms, are needed for a practical solution. In particular, a practical algorithm should probably go beyond what we accomplish in this paper, along the lines of the two possible extensions mentioned above.

Related work. The general theme of combining worst-case and optimistic performance bounds has received considerable attention in prior work on online learning. A natural incarnation of this theme in the context of MAB concerns proving upper bounds on regret that can be written in terms of some complexity measure of the rewards, and match the optimal worst-case bounds. To this end, a version of Exp3 achieves regret Õ(√(K G_max)), where G_max is the maximal cumulative reward of a single arm, and the corresponding high-probability result was recently proved in Audibert and Bubeck (2010). In Hazan and Kale (2009), the authors obtain regret Õ(√(K V)), where V is the maximal "temporal variation" of the rewards.

2. An oblivious adversary fixes the rewards for all rounds without observing the algorithm's choices.
Known parameters: the number of arms K and the number of rounds n.
Unknown parameters (stochastic model): K probability distributions on [0, 1] with respective means μ_1, ..., μ_K.

For each round t = 1, ..., n:
(1) the algorithm chooses an arm I_t ∈ {1, ..., K}, possibly using external randomization;
(2) the adversary simultaneously selects rewards g_t = (g_{1,t}, ..., g_{K,t}) ∈ [0, 1]^K; in the stochastic model, each reward g_{i,t} is drawn independently from the distribution of arm i;
(3) the forecaster receives (and observes) the reward g_{I_t,t}. He does not observe the rewards from the other arms.

Minimize the regret, defined in the adversarial model by
  R_n = max_{i ∈ {1,...,K}} Σ_{t=1}^{n} g_{i,t} − Σ_{t=1}^{n} g_{I_t,t},
and in the stochastic model by
  R̄_n = Σ_{t=1}^{n} ( max_{i ∈ {1,...,K}} μ_i − μ_{I_t} ).

Figure 1: The MAB framework: adversarial rewards and stochastic rewards.

Both results are essentially optimal. It is worth noting that UCB1 and Exp3 have influenced, and to some extent inspired, a number of follow-up papers on richer MAB settings. However, it is easy to see that UCB1 incurs a trivial regret in the adversarial model, whereas Exp3 incurs Ω(√n) regret even in the stochastic model.¹ This raises a natural question that we aim to resolve in this paper: can we achieve the best of both worlds? Is there a bandit algorithm which matches the performance of Exp3 in the adversarial model, and attains the performance of UCB1 if the rewards are in fact stochastic? A more specific (and slightly milder) formulation is as follows: is there a bandit algorithm that has Õ(√n) regret in the adversarial model and polylog(n) regret in the stochastic model?

We are not aware of any prior work on this question. Intuitively, we introduce a new tradeoff: a bandit algorithm has to balance between attacking the weak adversary (stochastic rewards) and defending itself from a more devious adversary that targets the algorithm's weaknesses, such as being too aggressive if the reward sequence is seemingly stochastic. In particular, while the basic exploration-exploitation tradeoff induces logarithmic regret in the stochastic model and √(Kn)-type regret in the adversarial model, it is not clear a priori what the optimal regret guarantees are for this new attack-defense tradeoff.

1. This is clearly true for the original version of Exp3, with a mixing parameter. However, this mixing is unnecessary against oblivious adversaries. The regret of the resulting algorithm in the stochastic model is unknown.
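The protocol of Figure 1 can be summarized in code as follows; this is an illustrative harness only, with player and adversary as hypothetical interfaces, not anything defined in the paper.

    import random

    # Sketch of the interaction protocol in Figure 1. `player` is a hypothetical object
    # with select(t) -> arm and observe(t, arm, reward); `adversary(t)` returns the reward
    # vector of round t (oblivious: it does not see the player's choices).
    def run_game(player, n, K, model="stochastic", means=None, adversary=None, seed=0):
        rng = random.Random(seed)
        G = [0.0] * K               # fixed-arm cumulative rewards, used only for the benchmark
        alg_reward = 0.0
        for t in range(n):
            if model == "stochastic":
                g = [1.0 if rng.random() < means[i] else 0.0 for i in range(K)]  # i.i.d. rewards
            else:
                g = adversary(t)
            i = player.select(t)            # step (1): the player picks an arm
            player.observe(t, i, g[i])      # step (3): only the chosen reward is revealed
            alg_reward += g[i]
            for j in range(K):
                G[j] += g[j]
        return max(G) - alg_reward          # adversarial regret R_n w.r.t. the best single arm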
JMLR: Workshop and Conference Proceedings vol 23 (2012) 42.1 - 42.23     25th Annual Conference on Learning Theory

The Best of Both Worlds: Stochastic and Adversarial Bandits

Sébastien Bubeck    SBUBECK@PRINCETON.EDU
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, USA

Aleksandrs Slivkins    SLIVKINS@MICROSOFT.COM
Microsoft Research, Mountain View, CA, USA

Editors: Shie Mannor, Nathan Srebro, Robert C. Williamson

Abstract

We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the Õ(√n) worst-case regret of Exp3 (Auer et al. 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al. 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for "nice" problem instances.

1. Introduction

Multi-armed bandits (henceforth, MAB) is a simple model for sequential decision making under uncertainty that captures the crucial tradeoff between exploration (acquiring new information) and exploitation (optimizing based on the information that is currently available). Introduced in the early 1950s (Robbins 1952), it has been studied intensively since then in Operations Research, Electrical Engineering, Economics, and Computer Science.

The "basic" MAB framework can be formulated as a game between the player (i.e., the algorithm) and the adversary (i.e., the environment). The player selects actions ("arms") sequentially from a fixed, finite set of possible options, and receives rewards that correspond to the selected actions. For simplicity, it is customary to assume that the rewards are bounded in [0, 1]. In the adversarial model one makes no other restrictions on the sequence of rewards, while in the stochastic model we assume that the rewards of a given arm form an i.i.d. sequence of random variables. The performance criterion is the so-called regret, which compares the rewards received by the player to the rewards accumulated by a hypothetical benchmark algorithm; a typical, standard benchmark is the best single arm. See Figure 1 for a precise description of this framework.

Adversarial rewards and stochastic rewards are the two main reward models in the MAB literature. Both are now very well understood, in particular thanks to the seminal papers of Lai and Robbins (1985) and Auer et al. (2002a,b). In particular, the algorithm Exp3 from Auer et al. (2002b) attains a regret growing as √n in the adversarial model, where n is the number of rounds, and the algorithm UCB1 from Auer et al. (2002a) attains logarithmic regret in the stochastic model.

© 2012 S. Bubeck & A. Slivkins.