Machine Learning, 47, 235-256, 2002
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Finite-time Analysis of the Multiarmed Bandit Problem*

PETER AUER  pauer@igi.tu-graz.ac.at
University of Technology Graz, A-8010 Graz, Austria

NICOLÒ CESA-BIANCHI  cesa-bianchi@dti.unimi.it
DTI, University of Milan, via Bramante 65, I-26013 Crema, Italy

PAUL FISCHER  fischer@ls2.informatik.uni-dortmund.de
Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany

Editor: Jyrki Kivinen

Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Keywords: bandit problems, adaptive allocation rules, finite horizon regret

* A preliminary version appeared in Proc. of the 15th International Conference on Machine Learning, pages 100-108. Morgan Kaufmann, 1998.

1. Introduction

The exploration versus exploitation dilemma can be described as the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. The simplest instance of this dilemma is perhaps the multi-armed bandit, a problem extensively studied in statistics (Berry & Fristedt, 1985) that has also turned out to be fundamental in different areas of artificial intelligence, such as reinforcement learning (Sutton & Barto, 1998) and evolutionary programming (Holland, 1992).

In its most basic formulation, a K-armed bandit problem is defined by random variables $X_{i,n}$ for $1 \le i \le K$ and $n \ge 1$, where each $i$ is the index of a gambling machine (i.e., the "arm" of a bandit). Successive plays of machine $i$ yield rewards $X_{i,1}, X_{i,2}, \ldots$ which are independent and identically distributed according to an unknown law with unknown expectation $\mu_i$. Independence also holds for rewards across machines; i.e., $X_{i,s}$ and $X_{j,t}$ are independent (and usually not identically distributed) for each $1 \le i < j \le K$ and each $s, t \ge 1$.

A policy, or allocation strategy, $A$ is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards. Let $T_i(n)$ be the number of times machine $i$ has been played by $A$ during the first $n$ plays. Then the regret of $A$ after $n$ plays is defined by

  $\mu^* n - \sum_{j=1}^{K} \mu_j \, \mathrm{E}[T_j(n)], \qquad \text{where } \mu^* = \max_{1 \le i \le K} \mu_i$

and $\mathrm{E}[\cdot]$ denotes expectation. Thus the regret is the expected loss due to the fact that the policy does not always play the best machine.

In their classical paper, Lai and Robbins (1985) found, for specific families of reward distributions (indexed by a single real parameter), policies satisfying

  $\mathrm{E}[T_j(n)] \le \left(\frac{1}{D(p_j \| p^*)} + o(1)\right) \ln n$   (1)

where $o(1) \to 0$ as $n \to \infty$ and

  $D(p_j \| p^*) = \int p_j \ln \frac{p_j}{p^*}$

is the Kullback-Leibler divergence between the reward density $p_j$ of any suboptimal machine $j$ and the reward density $p^*$ of the machine with highest reward expectation $\mu^*$. Hence, under these policies the optimal machine is played exponentially more often than any other machine, at least asymptotically. Lai and Robbins also proved that this regret is the best possible. Namely, for any allocation strategy and for any suboptimal machine $j$,

  $\mathrm{E}[T_j(n)] \ge \frac{\ln n}{D(p_j \| p^*)}$

asymptotically, provided that the reward distributions satisfy some mild assumptions.

These policies work by associating a quantity called upper confidence index to each machine. The computation of this index is generally hard. In fact, it relies on the entire sequence of rewards obtained so far from a given machine. Once the index for each machine is computed, the policy uses it as an estimate for the corresponding reward expectation, picking for the next play the machine with the current highest index. More recently, Agrawal (1995) introduced a family of policies where the index can be expressed as a simple function of the total reward obtained so far from the machine. These policies are thus much easier to compute than Lai and Robbins', yet their regret retains the optimal logarithmic behavior (though with a larger leading constant in some cases).

In this paper we strengthen previous results by showing policies that achieve logarithmic regret uniformly over time, rather than only asymptotically. Our policies are also simple to implement and computationally efficient. In Theorem 1 we show that a simple variant of Agrawal's index-based policy has finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support (a regret with better constants is proven
in Theorem 2 for a more complicated version of this policy). A similar result is shown in Theorem 3 for a variant of the well-known randomized ε-greedy heuristic. Finally, in Theorem 4 we show another index-based policy with logarithmically bounded regret for the natural case when the reward distributions are normally distributed with unknown means and variances.

Throughout the paper, and whenever the distributions of rewards for each machine are understood from the context, we define

  $\Delta_i = \mu^* - \mu_i$

where, we recall, $\mu_i$ is the reward expectation for machine $i$ and $\mu^*$ is any maximal element in the set $\{\mu_1, \ldots, \mu_K\}$.

2. Main results

Our first result shows that there exists an allocation strategy, UCB1, achieving logarithmic regret uniformly over $n$ and without any preliminary knowledge about the reward distributions (apart from the fact that their support is in [0, 1]). The policy UCB1 (sketched in figure 1) is derived from the index-based policy of Agrawal (1995). The index of this policy is the sum of two terms. The first term is simply the current average reward. The second term is related to the size (according to Chernoff-Hoeffding bounds, see Fact 1) of the one-sided confidence interval for the average reward within which the true expected reward falls with overwhelming probability.

Theorem 1. For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \ldots, P_K$ with support in [0, 1], then its expected regret after any number $n$ of plays is at most

  $\left[8 \sum_{i:\mu_i < \mu^*} \frac{\ln n}{\Delta_i}\right] + \left(1 + \frac{\pi^2}{3}\right)\left(\sum_{j=1}^{K} \Delta_j\right)$

where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$.

Figure 1. Sketch of the deterministic policy UCB1 (see Theorem 1). Initialization: play each machine once. Loop: play machine $j$ maximizing $\bar{x}_j + \sqrt{(2 \ln n)/n_j}$, where $\bar{x}_j$ is the average reward obtained from machine $j$, $n_j$ is the number of times machine $j$ has been played so far, and $n$ is the overall number of plays done so far.
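For concreteness, the index rule of figure 1 can be sketched in a few lines of Python. The simulation harness below (the `reward_fns` callables and return values) is our own scaffolding and not part of the paper; only the index formula $\bar{x}_j + \sqrt{(2\ln n)/n_j}$ is taken from the policy itself.

```python
import math

def ucb1(reward_fns, n_plays):
    """Run the UCB1 index policy for n_plays rounds.

    reward_fns: list of zero-argument callables, one per machine, each
    returning a reward in [0, 1] (this harness is illustrative only).
    Returns (total reward, play counts per machine)."""
    K = len(reward_fns)
    counts = [0] * K       # n_j: number of times machine j was played
    sums = [0.0] * K       # cumulative reward obtained from machine j
    total = 0.0
    for t in range(1, n_plays + 1):
        if t <= K:
            j = t - 1      # initialization: play each machine once
        else:
            # index = current average reward + one-sided confidence radius
            j = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = reward_fns[j]()
        counts[j] += 1
        sums[j] += r
        total += r
    return total, counts
```

On a two-machine instance with expected rewards 0.9 and 0.6 (distribution 1 of the experiments section), the suboptimal machine is played only $O(\ln n)$ times, in line with Theorem 1.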
Figure 2. Sketch of the deterministic policy UCB2 (see Theorem 2). Initialization: set $r_j = 0$ for each machine $j$ and play each machine once. Loop: select machine $j$ maximizing $\bar{x}_j + a_{n,r_j}$, play it $\tau(r_j + 1) - \tau(r_j)$ times, and set $r_j \leftarrow r_j + 1$, where $a_{n,r}$ and $\tau(r)$ are defined in (3).

To prove Theorem 1 we show that, for any suboptimal machine $j$,

  $\mathrm{E}[T_j(n)] \le \frac{8}{\Delta_j^2} \ln n$   (2)

plus a small constant. The leading constant $8/\Delta_j^2$ is worse than the corresponding constant $1/D(p_j \| p^*)$ in Lai and Robbins' result (1). In fact, one can show that $D(p_j \| p^*) \ge 2\Delta_j^2$, where the constant 2 is the best possible.

Using a slightly more complicated policy, which we call UCB2 (see figure 2), we can bring the main constant of (2) arbitrarily close to 1. The policy UCB2 works as follows. The plays are divided in epochs. In each new epoch a machine $i$ is picked and then played $\tau(r_i + 1) - \tau(r_i)$ times, where $\tau$ is an exponential function and $r_i$ is the number of epochs played by that machine so far. The machine picked in each new epoch is the one maximizing $\bar{x}_i + a_{n,r_i}$, where $n$ is the current number of plays, $\bar{x}_i$ is the current average reward for machine $i$, and

  $a_{n,r} = \sqrt{\frac{(1+\alpha)\ln(en/\tau(r))}{2\tau(r)}}, \qquad \tau(r) = \lceil (1+\alpha)^r \rceil.$   (3)

In the next result we state a bound on the regret of UCB2. The constant $c_\alpha$, here left unspecified, is defined in (18) in the appendix, where the theorem is also proven.

Theorem 2. For all $K > 1$, if policy UCB2 is run with input $0 < \alpha < 1$ on $K$ machines having arbitrary reward distributions $P_1, \ldots, P_K$ with support in [0, 1], then its expected regret after any number

  $n \ge \max_{i:\mu_i < \mu^*} \frac{1}{2\Delta_i^2}$

of plays is at most

  $\sum_{i:\mu_i < \mu^*} \left(\frac{(1+\alpha)(1+4\alpha)\ln\left(2e\Delta_i^2 n\right)}{2\Delta_i} + \frac{c_\alpha}{\Delta_i}\right)$   (4)

where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$.

By choosing $\alpha$ small, the constant of the leading term in the sum (4) gets arbitrarily close to 1; however, $c_\alpha \to \infty$ as $\alpha \to 0$. The two terms in the sum can be traded off by letting $\alpha = \alpha_n$ be slowly decreasing with the number of plays.

A simple and well-known policy for the bandit problem is the so-called ε-greedy rule (see Sutton & Barto, 1998). This policy prescribes to play with probability $1 - \varepsilon$ the machine with the highest average reward, and with probability $\varepsilon$ a randomly chosen machine. Clearly, the constant exploration probability $\varepsilon$ causes a linear (rather than logarithmic) growth in the regret. The obvious fix is to let $\varepsilon$ go to zero with a certain rate, so that the exploration probability decreases as our estimates for the reward expectations become more accurate. It turns out that a rate of $1/n$, where $n$ is, as usual, the index of the current play, allows to prove a logarithmic bound on the regret. The resulting policy, ε_n-GREEDY, is shown in figure 3.

Theorem 3. For all $K > 1$ and for all reward distributions $P_1, \ldots, P_K$ with support in [0, 1], if policy ε_n-GREEDY is run with input parameters $0 < d \le \min_{i:\mu_i < \mu^*} \Delta_i$ and $c > 0$,

Figure 3. Sketch of the randomized policy ε_n-GREEDY (see Theorem 3). Initialization: let $\varepsilon_t = \min\{1, cK/(d^2 t)\}$. Loop: at play $t$, with probability $1 - \varepsilon_t$ play the machine with the highest current average reward, and with probability $\varepsilon_t$ play a machine chosen uniformly at random.
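The epoch structure of UCB2 can be sketched as follows. The harness (`reward_fns`, truncation at `n_plays`, the `max(1, ...)` guard for epochs of length zero after rounding) is our own scaffolding under the stated definitions of $\tau(r)$ and $a_{n,r}$ in (3); it is an illustrative sketch, not the paper's pseudocode.

```python
import math

def tau(r, alpha):
    """Epoch lengths tau(r) = ceil((1 + alpha)^r) from (3)."""
    return math.ceil((1.0 + alpha) ** r)

def ucb2(reward_fns, n_plays, alpha=0.001):
    """Epoch-based UCB2 (figure 2): in each epoch, pick the machine with
    the largest index x_bar_i + a_{n, r_i} and play it
    tau(r_i + 1) - tau(r_i) times, then increment its epoch counter."""
    K = len(reward_fns)
    counts, sums, epochs = [0] * K, [0.0] * K, [0] * K
    n = 0
    for j in range(K):                      # initialization: play each once
        sums[j] += reward_fns[j]()
        counts[j] += 1
        n += 1
    while n < n_plays:
        def index(i):
            t = tau(epochs[i], alpha)
            a = math.sqrt((1 + alpha) * math.log(math.e * n / t) / (2 * t))
            return sums[i] / counts[i] + a
        j = max(range(K), key=index)
        # play j for tau(r_j + 1) - tau(r_j) plays (at least one, since
        # rounding can make consecutive epoch lengths equal)
        for _ in range(max(1, tau(epochs[j] + 1, alpha) - tau(epochs[j], alpha))):
            if n >= n_plays:
                break
            sums[j] += reward_fns[j]()
            counts[j] += 1
            n += 1
        epochs[j] += 1
    return counts
```

With the small value $\alpha = 0.001$ used in the experiments, the epoch lengths grow very slowly, so the policy behaves almost like a per-play index rule.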
then the probability that after any number $n \ge cK/d$ of plays ε_n-GREEDY chooses a suboptimal machine $j$ is at most

  $\frac{c}{d^2 n} + 2\left(\frac{c}{d^2}\ln\frac{(n-1)d^2 e^{1/2}}{cK}\right)\left(\frac{cK}{(n-1)d^2 e^{1/2}}\right)^{c/(5d^2)} + \frac{4e}{d^2}\left(\frac{cK}{(n-1)d^2 e^{1/2}}\right)^{c/2}.$

For $c$ large enough (e.g. $c > 5$) the above bound is of order $c/(d^2 n) + o(1/n)$, as the second and third terms in the bound are $O(n^{-1-\varepsilon})$ for some $\varepsilon > 0$ (recall that $0 < d < 1$). Note also that this is a result stronger than those of Theorems 1 and 2, as it establishes a bound on the instantaneous regret. However, unlike Theorems 1 and 2, here we need to know a lower bound $d$ on the difference between the reward expectations of the best and the second best machine.

Our last result concerns a special case, i.e. the bandit problem with normally distributed rewards. Surprisingly, we could not find in the literature regret bounds (not even asymptotical) for the case when both the mean and the variance of the reward distributions are unknown. Here, we show that an index-based policy called UCB1-NORMAL, see figure 4, achieves logarithmic regret uniformly over $n$ without knowing means and variances of the reward distributions. However, our proof is based on certain bounds on the tails of the $\chi^2$ and the Student distribution that we could only verify numerically. These bounds are stated as Conjecture 1 and Conjecture 2 in the Appendix.

The choice of the index in UCB1-NORMAL is based, as for UCB1, on the size of the one-sided confidence interval for the average reward within which the true expected reward falls with overwhelming probability. In the case of UCB1, the reward distribution was unknown, and we used Chernoff-Hoeffding bounds to compute the index. In this case we know that

Figure 4. Sketch of the deterministic policy UCB1-NORMAL (see Theorem 4). Loop: if a machine has been played less than $\lceil 8 \log n \rceil$ times, play it; otherwise play machine $j$ maximizing $\bar{x}_j + \sqrt{16 \cdot \frac{q_j - n_j \bar{x}_j^2}{n_j - 1} \cdot \frac{\ln(n-1)}{n_j}}$, where $q_j$ is the sum of squared rewards obtained from machine $j$.
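The ε_n-greedy rule of figure 3 is easily sketched in code. The harness below is ours (hypothetical `reward_fns` callables, the `float('inf')` tie-break that forces each machine to be tried before exploitation); the exploration schedule $\varepsilon_n = \min\{1, cK/(d^2 n)\}$ and the greedy/random split are the policy's.

```python
import math
import random

def eps_n_greedy(reward_fns, n_plays, c, d):
    """Figure 3's epsilon_n-greedy policy: eps_n = min(1, c*K / (d^2 * n)).
    With probability 1 - eps_n play the machine with the best average
    reward so far, otherwise a uniformly random machine.  c and d are the
    input parameters of Theorem 3 (d should lower-bound the gaps Delta_i)."""
    K = len(reward_fns)
    counts, sums = [0] * K, [0.0] * K
    for n in range(1, n_plays + 1):
        eps = min(1.0, c * K / (d * d * n))
        if random.random() < eps:
            j = random.randrange(K)                     # explore
        else:
            # exploit; unplayed machines get priority (harness detail)
            j = max(range(K), key=lambda i:
                    sums[i] / counts[i] if counts[i] else float('inf'))
        r = reward_fns[j]()
        counts[j] += 1
        sums[j] += r
    return counts
```

As Theorem 3 requires, $d$ must be chosen below the true gap; the experiments section notes that a badly tuned $c$ degrades performance rapidly.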
the distribution is normal, and for computing the index we use the sample variance as an estimate of the unknown variance.

Theorem 4. For all $K > 1$, if policy UCB1-NORMAL is run on $K$ machines having normal reward distributions $P_1, \ldots, P_K$, then its expected regret after any number $n$ of plays is at most

  $256 (\log n) \sum_{i:\mu_i < \mu^*} \frac{\sigma_i^2}{\Delta_i} + \left(1 + \frac{\pi^2}{2} + 8 \log n\right)\left(\sum_{j=1}^{K} \Delta_j\right)$

where $\mu_1, \ldots, \mu_K$ and $\sigma_1^2, \ldots, \sigma_K^2$ are the means and variances of the distributions $P_1, \ldots, P_K$.

As a final remark for this section, note that Theorems 1-3 also hold for rewards that are not independent across machines, i.e. $X_{i,s}$ and $X_{j,t}$ might be dependent for any $s, t$, and $i \ne j$. Furthermore, we also do not need that the rewards of a single arm are i.i.d., but only the weaker assumption that $\mathrm{E}[X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}] = \mu_i$ for all $1 \le t \le n$.

3. Proofs

Recall that, for each $1 \le i \le K$, $\mathrm{E}[X_{i,n}] = \mu_i$ for all $n \ge 1$ and $\mu^* = \max_{1 \le i \le K} \mu_i$. Also, for any fixed policy $A$, $T_i(n)$ is the number of times machine $i$ has been played by $A$ in the first $n$ plays. Of course, we always have $\sum_{i=1}^{K} T_i(n) = n$. We also define the r.v.'s $I_1, I_2, \ldots$, where $I_t$ denotes the machine played at time $t$.

For each $1 \le i \le K$ and $n \ge 1$ define

  $\bar{X}_{i,n} = \frac{1}{n}\sum_{t=1}^{n} X_{i,t}.$

Given $\mu_1, \ldots, \mu_K$, we call optimal the machine with the least index $i$ such that $\mu_i = \mu^*$. In what follows, we will always put a superscript "$*$" to any quantity which refers to the optimal machine. For example we write $T^*(n)$ and $\bar{X}^*_n$ instead of $T_i(n)$ and $\bar{X}_{i,n}$, where $i$ is the index of the optimal machine.

Some further notation: For any predicate $\Pi$ we define $\{\Pi(x)\}$ to be the indicator function of the event $\Pi(x)$; i.e., $\{\Pi(x)\} = 1$ if $\Pi(x)$ is true and 0 otherwise. Finally, $\mathrm{Var}[X]$ denotes the variance of the random variable $X$.

Note that the regret after $n$ plays can be written as

  $\sum_{j:\mu_j < \mu^*} \Delta_j \, \mathrm{E}[T_j(n)].$   (5)

So we can bound the regret by simply bounding each $\mathrm{E}[T_j(n)]$. We will make use of the following standard exponential inequalities for bounded random variables (see, e.g., the appendix of Pollard, 1984).
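A single decision step of the UCB1-NORMAL policy of figure 4 can be sketched as follows. The bookkeeping names (`counts`, `sums`, `sumsq`) and the `max(var, 0.0)` guard against floating-point round-off are our own harness choices; the forced-sampling rule and the index formula are taken from the figure.

```python
import math

def ucb1_normal_choose(counts, sums, sumsq, n):
    """One decision step of UCB1-NORMAL (figure 4).
    counts[j], sums[j], sumsq[j]: plays, summed rewards, and summed
    squared rewards q_j of machine j; n: total plays so far.
    Returns the index of the machine to play next."""
    K = len(counts)
    # forced sampling: play any machine played less than ceil(8 log n) times
    for j in range(K):
        if counts[j] < math.ceil(8.0 * math.log(max(n, 2))):
            return j
    def index(j):
        m = sums[j] / counts[j]
        # sample-variance term (q_j - n_j * m^2) / (n_j - 1)
        var = max((sumsq[j] - counts[j] * m * m) / (counts[j] - 1), 0.0)
        return m + math.sqrt(16.0 * var * math.log(n - 1) / counts[j])
    return max(range(K), key=index)
```

The forced-sampling threshold matches the condition $s \ge 8 \ln t$ under which the tail conjectures of Appendix B are applied.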
Fact 1 (Chernoff-Hoeffding bound). Let $X_1, \ldots, X_n$ be random variables with common range [0, 1] and such that $\mathrm{E}[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \cdots + X_n$. Then for all $a \ge 0$

  $\mathrm{P}\{S_n \ge n\mu + a\} \le e^{-2a^2/n} \qquad \text{and} \qquad \mathrm{P}\{S_n \le n\mu - a\} \le e^{-2a^2/n}.$

Fact 2 (Bernstein inequality). Let $X_1, \ldots, X_n$ be random variables with range [0, 1] and

  $\sum_{t=1}^{n} \mathrm{Var}[X_t \mid X_{t-1}, \ldots, X_1] = \sigma^2.$

Let $S_n = X_1 + \cdots + X_n$. Then for all $a \ge 0$

  $\mathrm{P}\{S_n \ge \mathrm{E}[S_n] + a\} \le \exp\left\{-\frac{a^2/2}{\sigma^2 + a/2}\right\}.$

Proof of Theorem 1: Let $c_{t,s} = \sqrt{(2\ln t)/s}$. For any machine $i$, we upper bound $T_i(n)$ on any sequence of plays. More precisely, for each $t \ge 1$ we bound the indicator function of $I_t = i$ as follows. Let $\ell$ be an arbitrary positive integer. Then

  $T_i(n) = 1 + \sum_{t=K+1}^{n} \{I_t = i\} \le \ell + \sum_{t=K+1}^{n} \{I_t = i,\ T_i(t-1) \ge \ell\}$
  $\le \ell + \sum_{t=K+1}^{n} \left\{\bar{X}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \le \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)},\ T_i(t-1) \ge \ell\right\}$
  $\le \ell + \sum_{t=K+1}^{n} \left\{\min_{0 < s < t}\left(\bar{X}^*_s + c_{t-1,s}\right) \le \max_{\ell \le s_i < t}\left(\bar{X}_{i,s_i} + c_{t-1,s_i}\right)\right\}$
  $\le \ell + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=\ell}^{t-1}\left\{\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}\right\}.$   (6)

Now observe that $\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}$ implies that at least one of the following must hold:

  $\bar{X}^*_s \le \mu^* - c_{t,s}$   (7)
  $\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}$   (8)
  $\mu^* < \mu_i + 2c_{t,s_i}.$   (9)

We bound the probability of events (7) and (8) using Fact 1 (Chernoff-Hoeffding bound):

  $\mathrm{P}\left\{\bar{X}^*_s \le \mu^* - c_{t,s}\right\} \le e^{-4\ln t} = t^{-4}$
  $\mathrm{P}\left\{\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}\right\} \le e^{-4\ln t} = t^{-4}.$

For $s_i \ge (8\ln n)/\Delta_i^2$, (9) is false. In fact

  $\mu^* - \mu_i - 2c_{t,s_i} = \mu^* - \mu_i - 2\sqrt{(2\ln t)/s_i} \ge \mu^* - \mu_i - \Delta_i = 0.$

So, taking $\ell = \lceil (8\ln n)/\Delta_i^2 \rceil$, we get

  $\mathrm{E}[T_i(n)] \le \left\lceil\frac{8\ln n}{\Delta_i^2}\right\rceil + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{s_i=\lceil (8\ln n)/\Delta_i^2\rceil}^{t-1} 2t^{-4} \le \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$

which concludes the proof.

Proof of Theorem 3: Recall that, for $n \ge cK/d^2$, $\varepsilon_n = cK/(d^2 n)$. Let

  $x_0 = \frac{1}{2K}\sum_{t=1}^{n}\varepsilon_t.$

The probability that machine $j$ is chosen at time $n$ is at most

  $\frac{\varepsilon_n}{K} + \left(1 - \varepsilon_n\right)\mathrm{P}\left\{\bar{X}_{j,T_j(n-1)} \ge \bar{X}^*_{T^*(n-1)}\right\}$

and

  $\mathrm{P}\left\{\bar{X}_{j,T_j(n-1)} \ge \bar{X}^*_{T^*(n-1)}\right\} \le \mathrm{P}\left\{\bar{X}_{j,T_j(n-1)} \ge \mu_j + \frac{\Delta_j}{2}\right\} + \mathrm{P}\left\{\bar{X}^*_{T^*(n-1)} \le \mu^* - \frac{\Delta_j}{2}\right\}.$   (10)

Now the analysis for both terms on the right-hand side is the same. Let $T^R_j(n)$ be the number of plays in which machine $j$ was chosen at random in the first $n$ plays. Then we have

  $\mathrm{P}\left\{\bar{X}_{j,T_j(n)} \ge \mu_j + \frac{\Delta_j}{2}\right\} = \sum_{t=1}^{n}\mathrm{P}\left\{T_j(n) = t,\ \bar{X}_{j,t} \ge \mu_j + \frac{\Delta_j}{2}\right\}$   (11)
  $\le \sum_{t=1}^{\lfloor x_0\rfloor}\mathrm{P}\left\{T^R_j(n) \le t \,\middle|\, \bar{X}_{j,t} \ge \mu_j + \frac{\Delta_j}{2}\right\}\mathrm{P}\left\{\bar{X}_{j,t} \ge \mu_j + \frac{\Delta_j}{2}\right\} + \sum_{t=\lfloor x_0\rfloor+1}^{n} e^{-\Delta_j^2 t/2}$
  $\le \lfloor x_0\rfloor\,\mathrm{P}\left\{T^R_j(n) \le x_0\right\} + \frac{2}{\Delta_j^2}e^{-\Delta_j^2\lfloor x_0\rfloor/2}$   (12)

where the first sum was bounded by Fact 1 (Chernoff-Hoeffding bound), the second by $\sum_{t \ge x} e^{-\eta t} \le \frac{1}{\eta}e^{-\eta x}$, and where in the last line we dropped the conditioning because each machine is played at random independently of the previous choices of the policy. Since

  $\mathrm{E}\left[T^R_j(n)\right] = \frac{1}{K}\sum_{t=1}^{n}\varepsilon_t = 2x_0 \qquad \text{and} \qquad \mathrm{Var}\left[T^R_j(n)\right] = \sum_{t=1}^{n}\frac{\varepsilon_t}{K}\left(1 - \frac{\varepsilon_t}{K}\right) \le 2x_0,$

by Bernstein's inequality (Fact 2) we get

  $\mathrm{P}\left\{T^R_j(n) \le x_0\right\} \le e^{-x_0/5}.$   (13)

Finally it remains to lower bound $x_0$. For $n \ge n' = cK/d^2$ we have

  $x_0 = \frac{1}{2K}\sum_{t=1}^{n}\varepsilon_t \ge \frac{1}{2K}\left(n' + \sum_{t=n'+1}^{n}\frac{cK}{d^2 t}\right) \ge \frac{c}{d^2}\ln\frac{(n-1)d^2 e^{1/2}}{cK}.$

Thus, using (10)-(13) and the above lower bound on $x_0$, we obtain that the probability that ε_n-GREEDY chooses a suboptimal machine $j$ at time $n$ is at most

  $\frac{c}{d^2 n} + 2\left(\frac{c}{d^2}\ln\frac{(n-1)d^2 e^{1/2}}{cK}\right)\left(\frac{cK}{(n-1)d^2 e^{1/2}}\right)^{c/(5d^2)} + \frac{4e}{d^2}\left(\frac{cK}{(n-1)d^2 e^{1/2}}\right)^{c/2}.$

This concludes the proof.

4. Experiments

For practical purposes, the bound of Theorem 1 can be tuned more finely. We use

  $V_j(s) = \left(\frac{1}{s}\sum_{\tau=1}^{s} X_{j,\tau}^2\right) - \bar{X}_{j,s}^2 + \sqrt{\frac{2\ln t}{s}}$

as an upper confidence bound for the variance of machine $j$. As before, this means that machine $j$, which has been played $s$ times during the first $t$ plays, has a variance that is at most the sample variance plus $\sqrt{(2\ln t)/s}$. We then replace the upper confidence bound $\sqrt{(2\ln n)/n_j}$ of policy UCB1 with

  $\sqrt{\frac{\ln n}{n_j}\min\left\{\frac{1}{4},\ V_j(n_j)\right\}}$
(the factor 1/4 is an upper bound on the variance of a Bernoulli random variable). This variant, which we call UCB1-TUNED, performs substantially better than UCB1 in essentially all of our experiments. However, we are not able to prove a regret bound.

We compared the empirical behaviour of policies UCB1-TUNED, UCB2, and ε_n-GREEDY on Bernoulli reward distributions with different parameters shown in the table below.

          1     2     3     4     5     6     7     8     9    10
   1    0.9   0.6
   2    0.9   0.8
   3    0.55  0.45
  11    0.9   0.6   0.6   0.6   0.6   0.6   0.6   0.6   0.6   0.6
  12    0.9   0.8   0.8   0.8   0.7   0.7   0.7   0.6   0.6   0.6
  13    0.9   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8
  14    0.55  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.45

Rows 1-3 define reward distributions for a 2-armed bandit problem, whereas rows 11-14 define reward distributions for a 10-armed bandit problem. The entries in each row denote the reward expectations (i.e. the probabilities of getting a reward 1, as we work with Bernoulli distributions) for the machines indexed by the columns. Note that distributions 1 and 11 are easy (the reward of the optimal machine has low variance and the differences $\mu^* - \mu_i$ are all large), whereas distributions 3 and 14 are hard (the reward of the optimal machine has high variance and some of the differences $\mu^* - \mu_i$ are small).

We made experiments to test the different policies (or the same policy with different input parameters) on the seven distributions listed above. In each experiment we tracked two performance measures: (1) the percentage of plays of the optimal machine; (2) the actual regret, that is the difference between the reward of the optimal machine and the reward of the machine played. The plot for each experiment shows, on a semi-logarithmic scale, the behaviour of these quantities during 100,000 plays averaged over 100 different runs.

We ran a first round of experiments on distribution 2 to find out good values for the parameters of the policies. If a parameter is chosen too small, then the regret grows linearly (exponentially in the semi-logarithmic plot); if a parameter is chosen too large then the regret grows logarithmically, but with a large leading constant (corresponding to a steep line in the semi-logarithmic plot). Policy UCB2 is relatively insensitive to the choice of its parameter $\alpha$, as long as it is kept relatively small (see figure 5). A fixed value $\alpha = 0.001$ has been used for all the remaining experiments. On the other hand, the choice of $c$ in policy ε_n-GREEDY is difficult as there is no value that works reasonably well for all the distributions that we considered. Therefore, we have roughly searched for the best value of $c$ for each distribution. In the plots, we will also show the performance of ε_n-GREEDY for values of $c$ around this empirically best value. This shows that the performance degrades rapidly if this parameter is not appropriately tuned. Finally, in each experiment the parameter $d$ of ε_n-GREEDY was set to the difference between the expected rewards of the best and the second best machine.

Figure 5. Search for the best value of parameter $\alpha$ of policy UCB2.

4.1. Comparison between policies

We can summarize the comparison of all the policies on the seven distributions as follows (see figures 6-12):

- An optimally tuned ε_n-GREEDY performs almost always best. Significant exceptions are distributions 12 and 14: this is because ε_n-GREEDY explores uniformly over all machines, thus the policy is hurt if there are several nonoptimal machines, especially when their reward expectations differ a lot. Furthermore, if $c$ is not well tuned its performance degrades rapidly (except for distribution 13, on which ε_n-GREEDY performs well for a wide range of values of its parameter).

- In most cases, UCB1-TUNED performs comparably to a well-tuned ε_n-GREEDY. Furthermore, UCB1-TUNED is not very sensitive to the variance of the machines, that is why it performs similarly on distributions 2 and 3, and on distributions 13 and 14.

- Policy UCB2 performs similarly to UCB1-TUNED, but always slightly worse.
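The UCB1-TUNED variant used in these experiments differs from UCB1 only in its exploration term, and can be sketched as follows. The harness (`reward_fns`, return values) is our own illustrative scaffolding; the variance bound $V_j$ and the $\min\{1/4, V_j\}$ index are as defined in Section 4.

```python
import math

def ucb1_tuned(reward_fns, n_plays):
    """UCB1-TUNED: UCB1 with the exploration term
    sqrt((ln t / n_j) * min(1/4, V_j(n_j))), where
    V_j(s) = (mean of squared rewards) - (mean reward)^2 + sqrt(2 ln t / s)
    is an upper confidence bound on the variance of machine j."""
    K = len(reward_fns)
    counts, sums, sumsq = [0] * K, [0.0] * K, [0.0] * K
    for t in range(1, n_plays + 1):
        if t <= K:
            j = t - 1                    # initialization: play each once
        else:
            def index(i):
                avg = sums[i] / counts[i]
                v = sumsq[i] / counts[i] - avg * avg \
                    + math.sqrt(2.0 * math.log(t) / counts[i])
                return avg + math.sqrt((math.log(t) / counts[i])
                                       * min(0.25, v))
            j = max(range(K), key=index)
        r = reward_fns[j]()
        counts[j] += 1
        sums[j] += r
        sumsq[j] += r * r
    return counts
```

Because the variance estimate shrinks the exploration term for low-variance machines, this variant typically plays suboptimal machines less often than plain UCB1, which matches its empirical advantage reported above; recall, however, that no regret bound is proven for it.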
Figure 6. Comparison on distribution 1 (2 machines with parameters 0.9, 0.6).

Figure 7. Comparison on distribution 2 (2 machines with parameters 0.9, 0.8).

Figure 8. Comparison on distribution 3 (2 machines with parameters 0.55, 0.45).

Figure 9. Comparison on distribution 11 (10 machines with parameters 0.9, 0.6, ..., 0.6).

Figure 10. Comparison on distribution 12 (10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6).

Figure 11. Comparison on distribution 13 (10 machines with parameters 0.9, 0.8, ..., 0.8).

Figure 12. Comparison on distribution 14 (10 machines with parameters 0.55, 0.45, ..., 0.45).

5. Conclusions

We have shown simple and efficient policies for the bandit problem that, on any set of reward distributions with known bounded support, exhibit uniform logarithmic regret. Our policies are deterministic and based on upper confidence bounds, with the exception of ε_n-GREEDY, a randomized allocation rule that is a dynamic variant of the ε-greedy heuristic. Moreover, our policies are robust with respect to the introduction of moderate dependencies in the reward processes.

This work can be extended in many ways. A more general version of the bandit problem is obtained by removing the stationarity assumption on reward expectations (see Berry & Fristedt, 1985; Gittins, 1989 for extensions of the basic bandit problem). For example, suppose that a stochastic reward process $\{X_{i,s} : s = 1, 2, \ldots\}$ is associated to each machine $i$. Here, playing machine $i$ at time $t$ yields a reward $X_{i,s}$ and causes the current state $s$ of machine $i$
to change to $s + 1$, whereas the states of the other machines remain frozen. A well-studied problem in this setup is the maximization of the total expected reward in a sequence of $n$ plays. There are methods, like the Gittins allocation indices, that allow to find the optimal machine to play at each time $n$ by considering each reward process independently from the others (even though the globally optimal solution depends on all the processes). However, computation of the Gittins indices for the average (undiscounted) reward criterion used here requires preliminary knowledge about the reward processes (see, e.g., Ishikida & Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed in Duff (1995) for the case of finite-state Markovian reward processes. However, there are no finite-time regret bounds shown for this solution. At the moment, we do not know whether our techniques could be extended to these more general bandit problems.

Appendix A: Proof of Theorem 2

Note that

  $\tau(r) - \tau(r-1) \le 1 + \alpha\,\tau(r-1).$   (14)

Assume $n \ge 1/(2\Delta_j^2)$ for all $j$, and let $\tilde{r}_j$ be the largest integer such that

  $\tau(\tilde{r}_j - 1) \le \frac{(1+4\alpha)\ln\left(2e\Delta_j^2 n\right)}{2\Delta_j^2}.$

Note that $\tilde{r}_j \ge 1$. We have

  $T_j(n) \le \tau(\tilde{r}_j) + \sum_{r > \tilde{r}_j}\left(\tau(r) - \tau(r-1)\right)\{\text{machine } j \text{ finishes its } r\text{-th epoch}\}.$

Now consider the following chain of implications, where $r'$ denotes the number of epochs played so far by the optimal machine:

  machine $j$ finishes its $r$-th epoch
  $\Longrightarrow$ there exists $t \le n$ such that $\bar{x}_{j,\tau(r-1)} + a_{t,r-1} \ge \bar{x}^*_{\tau(r')} + a_{t,r'}$
  $\Longrightarrow$ $\bar{x}_{j,\tau(r-1)} \ge \mu_j + a_{n,r-1}$, or there exists $r' \ge 0$ such that $\bar{x}^*_{\tau(r')} \le \mu^* - a_{n,r'}$, or $\mu^* < \mu_j + a_{n,r-1} + a_{n,r'}$,

where the last implication holds because $a_{t,r}$ is increasing in $t$. For $r > \tilde{r}_j$, the definition of $\tilde{r}_j$ (via the size of $\tau(r-1)$ and hence of $a_{n,r-1}$) rules out the third event, so that

  $\mathrm{E}[T_j(n)] \le \tau(\tilde{r}_j) + \sum_{r > \tilde{r}_j}\left(\tau(r) - \tau(r-1)\right)\left[\mathrm{P}\left\{\bar{x}_{j,\tau(r-1)} \ge \mu_j + a_{n,r-1}\right\} + \sum_{r' \ge 0}\mathrm{P}\left\{\bar{x}^*_{\tau(r')} \le \mu^* - a_{n,r'}\right\}\right].$   (15)

The assumption $n \ge 1/(2\Delta_j^2)$ implies $\ln(2e\Delta_j^2 n) \ge 1$. Therefore, for $r \ge \tilde{r}_j$, we have

  $\tau(r) \ge \frac{(1+4\alpha)\ln\left(2e\Delta_j^2 n\right)}{2\Delta_j^2}$   (16)

and, since $\tau(r) \ge 1/(2\Delta_j^2)$ implies $en/\tau(r) \le 2e\Delta_j^2 n$,

  $a_{n,r} \le \Delta_j\sqrt{\frac{1+\alpha}{1+4\alpha}}.$   (17)

We bound the two sums in (15) in turn using Fact 1 (Chernoff-Hoeffding bound). For the first sum, each probability is at most

  $\exp\left(-2\tau(r-1)\,a_{n,r-1}^2\right) = \exp\left(-(1+\alpha)\ln\frac{en}{\tau(r-1)}\right) = \left(\frac{\tau(r-1)}{en}\right)^{1+\alpha};$

summing over $r > \tilde{r}_j$, and using (14) and the definition of $\tilde{r}_j$, this contributes a term bounded by a constant (depending only on $\alpha$) times $1/\Delta_j^2$. For the second sum, the same exponential bound applies to each term; bounding the resulting series by an integral, after a change of variable, again yields a term of the form (constant depending on $\alpha$) times $1/\Delta_j^2$. Piecing everything together, and using (14) to upper bound $\tau(\tilde{r}_j)$, we find that

  $\mathrm{E}[T_j(n)] \le \frac{(1+\alpha)(1+4\alpha)\ln\left(2e\Delta_j^2 n\right)}{2\Delta_j^2} + \frac{c_\alpha}{\Delta_j^2}$

where $c_\alpha$ collects all the constant contributions above,

  $c_\alpha = 1 + \frac{(1+\alpha)e}{\alpha^2} + \left(\frac{1+\alpha}{\alpha}\right)^{1+\alpha}\left(1 + \frac{11(1+\alpha)}{5\alpha^2\ln(1+\alpha)}\right).$   (18)

Multiplying by $\Delta_j$ and summing over the suboptimal machines, as in (5), gives the bound (4). This concludes the proof.

Appendix B: Proof of Theorem 4

The proof goes very much along the same lines as the proof of Theorem 1. It is based on the two following conjectures which we only verified numerically.

Conjecture 1. Let $X$ be a Student random variable with $s$ degrees of freedom. Then, for all $0 \le a \le \sqrt{2(s+1)}$,

  $\mathrm{P}\{X \ge a\} \le e^{-a^2/4}.$

Conjecture 2. Let $X$ be a $\chi^2$ random variable with $s$ degrees of freedom. Then

  $\mathrm{P}\{X \ge 4s\} \le e^{-(s+1)/2}.$

We now proceed with the proof of Theorem 4. Let $Q_{i,s} = \sum_{t=1}^{s} X_{i,t}^2$. Fix a machine $i$ and, for any $s$ and $t$, set

  $c_{t,s} = \sqrt{16 \cdot \frac{Q_{i,s} - s\bar{X}_{i,s}^2}{s-1} \cdot \frac{\ln t}{s}}.$

Let $c^*_{t,s}$ be the corresponding quantity for the optimal machine. To upper bound $T_i(n)$, we proceed exactly as in the first part of the proof of Theorem 1 obtaining, for any positive integer $\ell$, the analogue of (6) with the three events (7)-(9) expressed in terms of $c_{t,s}$ and $c^*_{t,s}$. The random variable

  $\frac{\bar{X}_{i,s} - \mu_i}{\sqrt{\left(Q_{i,s} - s\bar{X}_{i,s}^2\right)/\left(s(s-1)\right)}}$

has a Student distribution with $s - 1$ degrees of freedom (see, e.g., Wilks, 1962, 8.4.3 page 211). Therefore, using Conjecture 1 with $s - 1$ degrees of freedom and $a = 4\sqrt{\ln t}$, we get

  $\mathrm{P}\left\{\bar{X}_{i,s} \ge \mu_i + c_{t,s}\right\} = \mathrm{P}\left\{\frac{\bar{X}_{i,s} - \mu_i}{\sqrt{\left(Q_{i,s} - s\bar{X}_{i,s}^2\right)/\left(s(s-1)\right)}} \ge 4\sqrt{\ln t}\right\} \le e^{-4\ln t} = t^{-4}$

for all $s \ge 8\ln t$ (this is where the forced sampling of figure 4 is used). The probability of the event $\bar{X}^*_s \le \mu^* - c^*_{t,s}$ is bounded analogously. Finally, since $\left(Q_{i,s} - s\bar{X}_{i,s}^2\right)/\sigma_i^2$ is $\chi^2$-distributed with $s - 1$ degrees of freedom (see, e.g., Wilks, 1962, 8.4.1 page 208), for $s \ge 256\,\sigma_i^2(\ln t)/\Delta_i^2$ the event $\mu^* < \mu_i + 2c_{t,s}$ implies that the sample variance exceeds $4\sigma_i^2$; therefore, using Conjecture 2, we get

  $\mathrm{P}\left\{\mu^* < \mu_i + 2c_{t,s}\right\} \le \mathrm{P}\left\{\frac{Q_{i,s} - s\bar{X}_{i,s}^2}{\sigma_i^2} \ge 4(s-1)\right\} \le e^{-s/2} \le t^{-4}$

where the last inequality holds for $s \ge 8\ln t$. Setting

  $\ell = \left\lceil \max\left\{\frac{256\,\sigma_i^2}{\Delta_i^2},\ 8\right\}\ln n\right\rceil$

completes the proof of the theorem.

Acknowledgments

The support from ESPRIT Working Group EP 27150, Neural and Computational Learning II (NeuroCOLT II), is gratefully acknowledged.

Note

1. Similar extensions of Lai and Robbins' results were also obtained by Yakowitz and Lowe (1991), and by Burnetas and Katehakis (1996).

References

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054-1078.

Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.

Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17, 122-142.

Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209-217).

Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.

Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.

Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113-154.

Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4-22.

Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.

Sutton, R., & Barto, A. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press/Bradford Books.

Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.

Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297-312.

Received September 29, 2000
Revised May 21, 2001
Accepted June 20, 2001
Final manuscript June 20, 2001