
Machine Learning, 47, 235–256, 2002.
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Finite-time Analysis of the Multiarmed Bandit Problem*

PETER AUER (pauer@igi.tu-graz.ac.at), University of Technology Graz, A-8010 Graz, Austria
NICOLÒ CESA-BIANCHI (cesa-bianchi@dti.unimi.it), DTI, University of Milan, via Bramante 65, I-26013 Crema, Italy
PAUL FISCHER (fischer@ls2.informatik.uni-dortmund.de), Lehrstuhl Informatik II, Universität Dortmund, D-44221 Dortmund, Germany

Editor: Jyrki Kivinen

* A preliminary version appeared in Proc. of the 15th International Conference on Machine Learning, pages 100–108. Morgan Kaufmann, 1998.

Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Keywords: bandit problems, adaptive allocation rules, finite horizon regret

1. Introduction

The exploration versus exploitation dilemma can be described as the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. The simplest instance of this dilemma is perhaps the multi-armed bandit, a problem extensively studied in statistics (Berry & Fristedt, 1985) that has also turned out to be fundamental in different areas of artificial intelligence, such as reinforcement learning (Sutton & Barto, 1998) and evolutionary programming (Holland, 1992).

In its most basic formulation, a $K$-armed bandit problem is defined by random variables $X_{i,n}$ for $1 \le i \le K$ and $n \ge 1$, where each $i$ is the index of a gambling machine (i.e., the "arm" of a bandit). Successive plays of machine $i$ yield rewards $X_{i,1}, X_{i,2}, \ldots$ which are independent and identically distributed according to an unknown law with unknown expectation $\mu_i$. Independence also holds for rewards across machines; i.e., $X_{i,s}$ and $X_{j,t}$ are independent (and usually not identically distributed) for each $1 \le i < j \le K$ and each $s, t \ge 1$.

A policy, or allocation strategy, $A$ is an algorithm that chooses the next machine to play based on the sequence of past plays and obtained rewards. Let $T_i(n)$ be the number of times machine $i$ has been played by $A$ during the first $n$ plays. Then the regret of $A$ after $n$ plays is defined by

    \mu^* n - \sum_{j=1}^{K} \mu_j \, \mathrm{E}[T_j(n)],   where   \mu^* = \max_{1 \le i \le K} \mu_i

and $\mathrm{E}[\cdot]$ denotes expectation. Thus the regret is the expected loss due to the fact that the policy does not always play the best machine.

In their classical paper, Lai and Robbins (1985) found, for specific families of reward distributions (indexed by a single real parameter), policies satisfying

    \mathrm{E}[T_j(n)] \le \left( \frac{1}{D(p_j \,\|\, p^*)} + o(1) \right) \ln n    (1)

where $o(1) \to 0$ as $n \to \infty$ and

    D(p_j \,\|\, p^*) = \int p_j \ln \frac{p_j}{p^*}

is the Kullback-Leibler divergence between the reward density $p_j$ of any suboptimal machine $j$ and the reward density $p^*$ of the machine with highest reward expectation. Hence, under these policies the optimal machine is played exponentially more often than any other machine, at least asymptotically. Lai and Robbins also proved that this regret is the best possible. Namely, for any allocation strategy and for any suboptimal machine $j$,

    \mathrm{E}[T_j(n)] \ge \frac{\ln n}{D(p_j \,\|\, p^*)}
asymptotically, provided that the reward distributions satisfy some mild assumptions.

These policies work by associating a quantity called upper confidence index to each machine. The computation of this index is generally hard. In fact, it relies on the entire sequence of rewards obtained so far from a given machine. Once the index for each machine is computed, the policy uses it as an estimate for the corresponding reward expectation, picking for the next play the machine with the current highest index. More recently, Agrawal (1995) introduced a family of policies where the index can be expressed as a simple function of the total reward obtained so far from the machine. These policies are thus much easier to compute than Lai and Robbins', yet their regret retains the optimal logarithmic behavior (though with a larger leading constant in some cases).[1]

In this paper we strengthen previous results by showing policies that achieve logarithmic regret uniformly over time, rather than only asymptotically. Our policies are also simple to implement and computationally efficient. In Theorem 1 we show that a simple variant of Agrawal's index-based policy has finite-time regret logarithmically bounded for arbitrary sets of reward distributions with bounded support (a regret with better constants is proven in Theorem 2 for a more complicated version of this policy). A similar result is shown in Theorem 3 for a variant of the well-known randomized ε-greedy heuristic. Finally, in Theorem 4 we show another index-based policy with logarithmically bounded regret for the natural case when the reward distributions are normally distributed with unknown means and variances. Throughout the paper, and whenever the distributions of rewards for each machine are understood from the context, we define

    \Delta_i = \mu^* - \mu_i

where, we recall, $\mu_i$ is the reward expectation for machine $i$ and $\mu^*$ is any maximal element in the set $\{\mu_1, \ldots, \mu_K\}$.

2. Main results

Our first result shows that there exists an allocation strategy, UCB1, achieving logarithmic regret uniformly over $n$ and without any preliminary knowledge about the reward distributions (apart from the fact that their support is in $[0, 1]$). The policy UCB1 (sketched in figure 1) is derived from the index-based policy of Agrawal (1995). The index of this policy is the sum of two terms. The first term is simply the current average reward. The second term is related to the size (according to Chernoff-Hoeffding bounds, see Fact 1) of the one-sided confidence interval for the average reward within which the true expected reward falls with overwhelming probability.

Theorem 1. For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \ldots, P_K$ with support in $[0, 1]$, then its expected regret after any number $n$ of plays is at most

    \left[ 8 \sum_{i : \mu_i < \mu^*} \frac{\ln n}{\Delta_i} \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{j=1}^{K} \Delta_j \right)

where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$.

Figure 1. Sketch of the deterministic policy UCB1 (see Theorem 1): after playing each machine once, play the machine $j$ maximizing $\bar{x}_j + \sqrt{2 \ln n / n_j}$, where $\bar{x}_j$ is the average reward obtained from machine $j$, $n_j$ is the number of times machine $j$ has been played so far, and $n$ is the overall number of plays done so far.
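The body of figure 1 did not survive the conversion to text. As a substitute, here is a minimal runnable sketch of the UCB1 loop; the callback `pull(i)`, which returns a sampled reward in $[0, 1]$ for machine $i$, is our own scaffolding and not part of the paper:

```python
import math

def ucb1(pull, K, n):
    """Minimal sketch of UCB1 (Theorem 1): after one initial play of each
    machine, always play the machine maximizing
        average reward + sqrt(2 ln t / plays)."""
    counts = [0] * K    # n_j: number of plays of machine j so far
    means = [0.0] * K   # current average reward of machine j

    for j in range(K):  # initialization: play each machine once
        means[j], counts[j] = pull(j), 1

    for t in range(K + 1, n + 1):
        # index = sample mean + one-sided confidence radius (cf. Fact 1)
        j = max(range(K), key=lambda i:
                means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]  # incremental mean update
    return means, counts
```

The radius $\sqrt{2 \ln t / n_j}$ shrinks as a machine is sampled and grows slowly with total time, which is exactly the exploration/exploitation trade-off the index encodes.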
Figure 2. Sketch of the deterministic policy UCB2 (see Theorem 2).

To prove Theorem 1 we show that, for any suboptimal machine $j$,

    \mathrm{E}[T_j(n)] \le \frac{8}{\Delta_j^2} \ln n    (2)

plus a small constant. The leading constant $8/\Delta_j^2$ is worse than the corresponding constant $1/D(p_j \| p^*)$ in Lai and Robbins' result (1). In fact, one can show that $D(p_j \| p^*) \ge 2 \Delta_j^2$, where the constant 2 is the best possible.

Using a slightly more complicated policy, which we call UCB2 (see figure 2), we can bring the main constant of (2) arbitrarily close to 1. The policy UCB2 works as follows. The plays are divided in epochs. In each new epoch a machine $i$ is picked and then played $\tau(r_i + 1) - \tau(r_i)$ times, where $\tau$ is an exponential function and $r_i$ is the number of epochs played by that machine so far. The machine picked in each new epoch is the one maximizing $\bar{x}_i + a_{n, r_i}$, where $n$ is the current number of plays, $\bar{x}_i$ is the current average reward for machine $i$, and

    a_{n,r} = \sqrt{\frac{(1+\alpha)\,\ln(e n / \tau(r))}{2\,\tau(r)}},   where   \tau(r) = \lceil (1+\alpha)^r \rceil.    (3)

(A code sketch of UCB2 appears after Theorem 3 below.) In the next result we state a bound on the regret of UCB2. The constant $c_\alpha$, here left unspecified, is defined in (18) in the appendix, where the theorem is also proven.

Theorem 2. For all $K > 1$, if policy UCB2 is run with input $0 < \alpha < 1$ on $K$ machines having arbitrary reward distributions $P_1, \ldots, P_K$ with support in $[0, 1]$, then its expected regret after any number

    n \ge \max_{i : \mu_i < \mu^*} \frac{1}{2 \Delta_i^2}

of plays is at most

    \sum_{i : \mu_i < \mu^*} \left( \frac{(1+\alpha)(1+4\alpha)\,\ln(2 e \Delta_i^2 n)}{2 \Delta_i} + \frac{c_\alpha}{\Delta_i} \right)    (4)

where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$.

By choosing $\alpha$ small, the constant of the leading term in the sum (4) gets arbitrarily close to 1; however, $c_\alpha \to \infty$ as $\alpha \to 0$. The two terms in the sum can be traded off by letting $\alpha = \alpha_n$ decrease slowly with the number of plays.

A simple and well-known policy for the bandit problem is the so-called ε-greedy rule (see Sutton & Barto, 1998). This policy prescribes to play with probability $1 - \varepsilon$ the machine with the highest average reward, and with probability $\varepsilon$ a randomly chosen machine. Clearly, the constant exploration probability $\varepsilon$ causes a linear (rather than logarithmic) growth in the regret. The obvious fix is to let $\varepsilon$ go to zero with a certain rate, so that the exploration probability decreases as our estimates for the reward expectations become more accurate. It turns out that a rate of $1/n$, where $n$ is, as usual, the index of the current play, allows one to prove a logarithmic bound on the regret. The resulting policy, $\varepsilon_n$-GREEDY, is shown in figure 3.

Theorem 3. For all $K > 1$ and for all reward distributions $P_1, \ldots, P_K$ with support in $[0, 1]$, if policy $\varepsilon_n$-GREEDY is run with input parameters $c > 0$ and $0 < d \le \min_{i : \mu_i < \mu^*} \Delta_i$, then the probability that after any number $n \ge cK/d$ of plays $\varepsilon_n$-GREEDY chooses a suboptimal machine $j$ is at most

    \frac{c}{d^2 n} + 2 \left( \frac{c}{d^2} \ln \frac{(n-1) d^2 e^{1/2}}{c K} \right) \left( \frac{c K}{(n-1) d^2 e^{1/2}} \right)^{c/(5 d^2)} + \frac{4 e}{d^2} \left( \frac{c K}{(n-1) d^2 e^{1/2}} \right)^{c/2}.

For $c$ large enough (e.g. $c > 5$) the above bound is of order $c/(d^2 n) + o(1/n)$, as the second and third terms in the bound are $o(1/n)$ (recall that $0 < d < 1$). Note also that this is a stronger result than those of Theorems 1 and 2, as it establishes a bound on the instantaneous regret. However, unlike Theorems 1 and 2, here we need to know a lower bound $d$ on the difference between the reward expectations of the best and the second best machine.

Figure 3. Sketch of the randomized policy $\varepsilon_n$-GREEDY (see Theorem 3): at play $t$, with $\varepsilon_t = \min\{1, cK/(d^2 t)\}$, play with probability $1 - \varepsilon_t$ the machine with the highest current average reward, and with probability $\varepsilon_t$ a uniformly random machine.
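As promised above, here is a minimal runnable sketch of UCB2's epoch structure, a reconstruction of the lost figure 2 under the same assumed `pull(i)` callback used for the UCB1 sketch (the helper names and the default $\alpha = 0.001$, which mirrors the value used in the experiments of Section 4, are ours):

```python
import math

def tau(r, alpha):
    """Epoch lengths: tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1.0 + alpha) ** r)

def ucb2(pull, K, n, alpha=0.001):
    """Sketch of UCB2 (Theorem 2): plays proceed in epochs; the machine
    chosen for an epoch is played tau(r+1) - tau(r) times in a row."""
    counts, means, epochs = [0] * K, [0.0] * K, [0] * K
    t = 0
    for i in range(K):  # initialization: play each machine once
        means[i], counts[i] = pull(i), 1
        t += 1

    def radius(t_now, r):
        # a_{t,r} = sqrt((1 + alpha) ln(e t / tau(r)) / (2 tau(r)))
        return math.sqrt((1.0 + alpha)
                         * math.log(math.e * t_now / tau(r, alpha))
                         / (2.0 * tau(r, alpha)))

    while t < n:
        j = max(range(K), key=lambda i: means[i] + radius(t, epochs[i]))
        for _ in range(tau(epochs[j] + 1, alpha) - tau(epochs[j], alpha)):
            if t >= n:
                break
            r = pull(j)
            counts[j] += 1
            t += 1
            means[j] += (r - means[j]) / counts[j]
        epochs[j] += 1
    return means, counts
```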
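Likewise, a minimal sketch of $\varepsilon_n$-GREEDY with the schedule $\varepsilon_t = \min\{1, cK/(d^2 t)\}$ from figure 3. The default values of `c` and `d` below are arbitrary placeholders; recall that Theorem 3 requires $d$ to lower-bound the gaps $\Delta_i$:

```python
import random

def eps_greedy(pull, K, n, c=0.1, d=0.1):
    """Sketch of epsilon_n-GREEDY (Theorem 3): explore a uniformly random
    machine with probability eps_t = min(1, c*K/(d*d*t)), otherwise play
    the machine with the highest current average reward."""
    counts, means = [0] * K, [0.0] * K
    for t in range(1, n + 1):
        eps = min(1.0, c * K / (d * d * t))
        if random.random() < eps:
            j = random.randrange(K)                    # explore
        else:
            j = max(range(K), key=lambda i: means[i])  # exploit
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
    return means, counts
```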
Our last result concerns a special case, i.e. the bandit problem with normally distributed rewards. Surprisingly, we could not find in the literature regret bounds (not even asymptotical ones) for the case when both the mean and the variance of the reward distributions are unknown. Here, we show that an index-based policy called UCB1-NORMAL (see figure 4) achieves logarithmic regret uniformly over $n$ without knowing means and variances of the reward distributions. However, our proof is based on certain bounds on the tails of the $\chi^2$ and the Student distribution that we could only verify numerically. These bounds are stated as Conjecture 1 and Conjecture 2 in the Appendix.

The choice of the index in UCB1-NORMAL is based, as for UCB1, on the size of the one-sided confidence interval for the average reward within which the true expected reward falls with overwhelming probability. In the case of UCB1, the reward distribution was unknown, and we used Chernoff-Hoeffding bounds to compute the index. In this case we know that the distribution is normal, and for computing the index we use the sample variance as an estimate of the unknown variance.

Figure 4. Sketch of the deterministic policy UCB1-NORMAL (see Theorem 4).

Theorem 4. For all $K > 1$, if policy UCB1-NORMAL is run on $K$ machines having normal reward distributions $P_1, \ldots, P_K$, then its expected regret after any number $n$ of plays is at most

    256 (\log n) \sum_{i : \mu_i < \mu^*} \frac{\sigma_i^2}{\Delta_i} + \left( 1 + \frac{\pi^2}{2} + 8 \log n \right) \sum_{j=1}^{K} \Delta_j

where $\mu_1, \ldots, \mu_K$ and $\sigma_1^2, \ldots, \sigma_K^2$ are the means and variances of the distributions $P_1, \ldots, P_K$.

As a final remark for this section, note that Theorems 1-3 also hold for rewards that are not independent across machines, i.e. $X_{i,s}$ and $X_{j,t}$ might be dependent for any $s$, $t$, and $i \ne j$. Furthermore, we also do not need the rewards of a single arm to be i.i.d., but only the weaker assumption that $\mathrm{E}[X_{i,t} \mid X_{i,1}, \ldots, X_{i,t-1}] = \mu_i$ for all $1 \le t \le n$.

3. Proofs

Recall that, for each $1 \le i \le K$, $\mathrm{E}[X_{i,n}] = \mu_i$ for all $n \ge 1$ and $\mu^* = \max_{1 \le i \le K} \mu_i$. Also, for any fixed policy $A$, $T_i(n)$ is the number of times machine $i$ has been played by $A$ in the first $n$ plays. Of course, we always have $\sum_{i=1}^{K} T_i(n) = n$. We also define the r.v. $I_t$ denoting the machine played at time $t$.

For each $1 \le i \le K$ and $n \ge 1$ define

    \bar{X}_{i,n} = \frac{1}{n} \sum_{t=1}^{n} X_{i,t}.

Given $\mu_1, \ldots, \mu_K$, we call optimal the machine with the least index $i$ such that $\mu_i = \mu^*$. In what follows, we will always put a superscript "*" to any quantity which refers to the optimal machine. For example we write $T^*(n)$ and $\bar{X}^*_n$ instead of $T_i(n)$ and $\bar{X}_{i,n}$, where $i$ is the index of the optimal machine.

Some further notation: For any predicate $\Pi$ we define $\{\Pi(x)\}$ to be the indicator function of the event $\Pi(x)$; i.e., $\{\Pi(x)\} = 1$ if $\Pi(x)$ is true and $= 0$ otherwise. Finally, $\mathrm{Var}[X]$ denotes the variance of the random variable $X$.

Note that the regret after $n$ plays can be written as

    \sum_{j : \mu_j < \mu^*} \Delta_j \, \mathrm{E}[T_j(n)].    (5)

So we can bound the regret by simply bounding each $\mathrm{E}[T_j(n)]$. We will make use of the following standard exponential inequalities for bounded random variables (see, e.g., the appendix of Pollard, 1984).

Fact 1 (Chernoff-Hoeffding bound). Let $X_1, \ldots, X_n$ be random variables with common range $[0, 1]$ and such that $\mathrm{E}[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \cdots + X_n$. Then for all $a \ge 0$

    \mathrm{P}\{S_n \ge n\mu + a\} \le e^{-2a^2/n}   and   \mathrm{P}\{S_n \le n\mu - a\} \le e^{-2a^2/n}.

Fact 2 (Bernstein inequality). Let $X_1, \ldots, X_n$ be random variables with range $[0, 1]$ and $\sum_{t=1}^{n} \mathrm{Var}[X_t \mid X_{t-1}, \ldots, X_1] = \sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then for all $a \ge 0$

    \mathrm{P}\{S_n \ge \mathrm{E}[S_n] + a\} \le \exp\left( - \frac{a^2/2}{\sigma^2 + a/2} \right).

Proof of Theorem 1: Let $c_{t,s} = \sqrt{(2 \ln t)/s}$. For any machine $i$ with $\mu_i < \mu^*$, we upper bound $T_i(n)$ on any sequence of plays. More precisely, for each $t \ge 1$ we bound the indicator function of $I_t = i$ as follows. Let $\ell$ be an arbitrary positive integer. Then

    T_i(n) = 1 + \sum_{t=K+1}^{n} \{I_t = i\} \le \ell + \sum_{t=K+1}^{n} \{I_t = i, \, T_i(t-1) \ge \ell\}
    \le \ell + \sum_{t=K+1}^{n} \left\{ \bar{X}^*_{T^*(t-1)} + c_{t-1, T^*(t-1)} \le \bar{X}_{i, T_i(t-1)} + c_{t-1, T_i(t-1)}, \, T_i(t-1) \ge \ell \right\}
    \le \ell + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \left\{ \bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i} \right\}.    (6)

Now observe that $\bar{X}^*_s + c_{t,s} \le \bar{X}_{i,s_i} + c_{t,s_i}$ implies that at least one of the following must hold:

    \bar{X}^*_s \le \mu^* - c_{t,s}    (7)
    \bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}    (8)
    \mu^* < \mu_i + 2 c_{t,s_i}.    (9)

We bound the probability of events (7) and (8) using Fact 1 (Chernoff-Hoeffding bound):

    \mathrm{P}\{\bar{X}^*_s \le \mu^* - c_{t,s}\} \le e^{-4 \ln t} = t^{-4},
    \mathrm{P}\{\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}\} \le e^{-4 \ln t} = t^{-4}.

For $s_i \ge \lceil (8 \ln n)/\Delta_i^2 \rceil$, (9) is false. In fact,

    \mu^* - \mu_i - 2 c_{t, s_i} = \mu^* - \mu_i - 2 \sqrt{(2 \ln t)/s_i} \ge \mu^* - \mu_i - \Delta_i = 0

for $s_i \ge (8 \ln n)/\Delta_i^2$. So we get

    \mathrm{E}[T_i(n)] \le \left\lceil \frac{8 \ln n}{\Delta_i^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = \lceil (8 \ln n)/\Delta_i^2 \rceil}^{t-1} 2 t^{-4} \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3},

which concludes the proof.

Proof of Theorem 3: Recall that, for $n \ge cK/d^2$, $\varepsilon_n = cK/(d^2 n)$. Let

    x_0 = \frac{1}{2K} \sum_{t=1}^{n} \varepsilon_t.

The probability that machine $j$ is chosen at time $n$ is

    \mathrm{P}\{I_n = j\} \le \frac{\varepsilon_n}{K} + \left( 1 - \frac{\varepsilon_n}{K} \right) \mathrm{P}\left\{ \bar{X}_{j, T_j(n-1)} \ge \bar{X}^*_{T^*(n-1)} \right\}    (10)

and

    \mathrm{P}\left\{ \bar{X}_{j, T_j(n)} \ge \bar{X}^*_{T^*(n)} \right\} \le \mathrm{P}\left\{ \bar{X}_{j, T_j(n)} \ge \mu_j + \frac{\Delta_j}{2} \right\} + \mathrm{P}\left\{ \bar{X}^*_{T^*(n)} \le \mu^* - \frac{\Delta_j}{2} \right\}.    (11)

Now the analysis for both terms on the right-hand side of (11) is the same. Let $T_j^R(n)$ be the number of plays in which machine $j$ was chosen at random in the first $n$ plays. Splitting on the value of $T_j(n)$, using $T_j(n) \ge T_j^R(n)$ for the plays up to $\lfloor x_0 \rfloor$, and applying Fact 1 (Chernoff-Hoeffding bound) to the remaining plays, we have

    \mathrm{P}\left\{ \bar{X}_{j, T_j(n)} \ge \mu_j + \frac{\Delta_j}{2} \right\} \le x_0 \, \mathrm{P}\{T_j^R(n) \le x_0\} + \sum_{t = \lfloor x_0 \rfloor + 1}^{n} e^{-\Delta_j^2 t / 2} \le x_0 \, \mathrm{P}\{T_j^R(n) \le x_0\} + \frac{2}{\Delta_j^2} e^{-\Delta_j^2 \lfloor x_0 \rfloor / 2}    (12)

since $\sum_{t \ge x+1} e^{-\eta t} \le (1/\eta) e^{-\eta x}$; in the first term we dropped the conditioning because each machine is played at random independently of the previous choices of the policy. Since

    \mathrm{E}[T_j^R(n)] = \frac{1}{K} \sum_{t=1}^{n} \varepsilon_t = 2 x_0   and   \mathrm{Var}[T_j^R(n)] = \sum_{t=1}^{n} \frac{\varepsilon_t}{K} \left( 1 - \frac{\varepsilon_t}{K} \right) \le 2 x_0,

by Bernstein's inequality (Fact 2) we get

    \mathrm{P}\{T_j^R(n) \le x_0\} \le e^{-x_0 / 5}.    (13)

Finally, it remains to lower bound $x_0$. For $n \ge cK/d^2$, a straightforward computation gives

    x_0 = \frac{1}{2K} \sum_{t=1}^{n} \varepsilon_t \ge \frac{c}{d^2} \ln \frac{(n-1) d^2 e^{1/2}}{c K}.

Thus, using (10)-(13), the above lower bound on $x_0$, and $\Delta_j \ge d$, we obtain

    \mathrm{P}\{I_n = j\} \le \frac{\varepsilon_n}{K} + 2 x_0 e^{-x_0/5} + \frac{4}{\Delta_j^2} e^{-\Delta_j^2 x_0 / 2} \le \frac{c}{d^2 n} + 2 \left( \frac{c}{d^2} \ln \frac{(n-1) d^2 e^{1/2}}{c K} \right) \left( \frac{c K}{(n-1) d^2 e^{1/2}} \right)^{c/(5 d^2)} + \frac{4 e}{d^2} \left( \frac{c K}{(n-1) d^2 e^{1/2}} \right)^{c/2}.

This concludes the proof.

4. Experiments

For practical purposes, the bound of Theorem 1 can be tuned more finely. We use

    V_j(s) = \left( \frac{1}{s} \sum_{\tau=1}^{s} X_{j,\tau}^2 \right) - \bar{X}_{j,s}^2 + \sqrt{\frac{2 \ln t}{s}}

as an upper confidence bound for the variance of machine $j$. As before, this means that machine $j$, which has been played $s$ times during the first $t$ plays, has a variance that is at most the sample variance plus $\sqrt{(2 \ln t)/s}$. We then replace the upper confidence bound $\sqrt{(2 \ln n)/n_j}$ of policy UCB1 with

    \sqrt{\frac{\ln n}{n_j} \min\left\{ \frac{1}{4}, V_j(n_j) \right\}}

(the factor $1/4$ is an upper bound on the variance of a Bernoulli random variable). This variant, which we call UCB1-TUNED, performs substantially better than UCB1 in essentially all of our experiments. However, we are not able to prove a regret bound.
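In code, the change from UCB1 to UCB1-TUNED is confined to the confidence radius. A minimal version, keeping a running sum of squared rewards alongside the mean (our own bookkeeping, not the paper's notation):

```python
import math

def tuned_radius(t, n_j, mean_j, sumsq_j):
    """UCB1-TUNED radius: sqrt((ln t / n_j) * min(1/4, V_j(n_j))).
    V_j upper-bounds the variance of machine j by its sample variance
    plus sqrt(2 ln t / n_j); 1/4 bounds any Bernoulli variance."""
    v_j = max(0.0, sumsq_j / n_j - mean_j ** 2) \
          + math.sqrt(2.0 * math.log(t) / n_j)
    return math.sqrt((math.log(t) / n_j) * min(0.25, v_j))
```

Replacing `math.sqrt(2.0 * math.log(t) / counts[i])` in the UCB1 sketch with `tuned_radius(t, counts[i], means[i], sumsq[i])` (where `sumsq` tracks the squared rewards) gives the tuned policy.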
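For the normal case, the details of figure 4 were also lost in conversion; the sketch below restates UCB1-NORMAL from its description and the proof of Theorem 4 (forced exploration of any machine played fewer than $\lceil 8 \log t \rceil$ times, sample variance in the index), again over the assumed `pull(i)` callback:

```python
import math

def ucb1_normal(pull, K, n):
    """Sketch of UCB1-NORMAL (Theorem 4): an index policy for normal
    rewards that estimates the unknown variance by the sample variance."""
    counts = [0] * K   # n_i: plays of machine i
    sums = [0.0] * K   # sum of rewards of machine i
    sumsq = [0.0] * K  # q_i: sum of squared rewards of machine i

    for t in range(1, n + 1):
        # forced exploration: play any machine with fewer than
        # ceil(8 log t) plays (and at least 2, so the sample variance exists)
        needy = [i for i in range(K)
                 if counts[i] < 2 or counts[i] < math.ceil(8 * math.log(t))]
        if needy:
            i = needy[0]
        else:
            def index(j):
                m = sums[j] / counts[j]
                var = max(0.0, (sumsq[j] - counts[j] * m * m) / (counts[j] - 1))
                return m + math.sqrt(16.0 * var * math.log(t - 1) / counts[j])
            i = max(range(K), key=index)
        r = pull(i)
        counts[i] += 1
        sums[i] += r
        sumsq[i] += r * r
    return [sums[i] / counts[i] for i in range(K)], counts
```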
We compared the empirical behaviour of policies UCB1, UCB1-TUNED, UCB2, and $\varepsilon_n$-GREEDY on Bernoulli reward distributions with the different parameters shown in the table below.

    Distribution    1     2     3     4     5     6     7     8     9     10
    1               0.9   0.6
    2               0.9   0.8
    3               0.55  0.45
    11              0.9   0.6   0.6   0.6   0.6   0.6   0.6   0.6   0.6   0.6
    12              0.9   0.8   0.8   0.8   0.7   0.7   0.7   0.6   0.6   0.6
    13              0.9   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8   0.8
    14              0.55  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.45  0.45

Rows 1-3 define reward distributions for a 2-armed bandit problem, whereas rows 11-14 define reward distributions for a 10-armed bandit problem. The entries in each row denote the reward expectations (i.e. the probabilities of getting a reward 1, as we work with Bernoulli distributions) for the machines indexed by the columns. Note that distributions 1 and 11 are "easy" (the reward of the optimal machine has low variance and the differences $\Delta_i$ are all large), whereas distributions 3 and 14 are "hard" (the reward of the optimal machine has high variance and some of the differences $\Delta_i$ are small).

We made experiments to test the different policies (or the same policy with different input parameters) on the seven distributions listed above. In each experiment we tracked two performance measures: (1) the percentage of plays of the optimal machine; (2) the actual regret, that is the difference between the reward of the optimal machine and the reward of the machine played. The plot for each experiment shows, on a semi-logarithmic scale, the behaviour of these quantities during 100,000 plays averaged over 100 different runs.

We ran a first round of experiments on distribution 2 to find out good values for the parameters of the policies. If a parameter is chosen too small, then the regret grows linearly (exponentially in the semi-logarithmic plot); if a parameter is chosen too large then the regret grows logarithmically, but with a large leading constant (corresponding to a steep line in the semi-logarithmic plot). Policy UCB2 is relatively insensitive to the choice of its parameter $\alpha$, as long as it is kept relatively small (see figure 5). A fixed value $\alpha = 0.001$ has been used for all the remaining experiments. On the other hand, the choice of $c$ in policy $\varepsilon_n$-GREEDY is difficult as there is no value that works reasonably well for all the distributions that we considered. Therefore, we have roughly searched for the best value of $c$ for each distribution. In the plots, we will also show the performance of $\varepsilon_n$-GREEDY for values of $c$ around this empirically best value. This shows that the performance degrades rapidly if this parameter is not appropriately tuned. Finally, in each experiment the parameter $d$ of $\varepsilon_n$-GREEDY was set to the gap $\Delta = \mu^* - \max_{i : \mu_i < \mu^*} \mu_i$ between the best and the second-best reward expectation.

Figure 5. Search for the best value of parameter $c$ of policy $\varepsilon_n$-GREEDY.

4.1. Comparison between policies

We can summarize the comparison of all the policies on the seven distributions as follows (see figures 6-12):

- An optimally tuned $\varepsilon_n$-GREEDY performs almost always best. Significant exceptions are distributions 12 and 14: this is because $\varepsilon_n$-GREEDY explores uniformly over all machines, thus the policy is hurt if there are several nonoptimal machines, especially when their reward expectations differ a lot. Furthermore, if $\varepsilon_n$-GREEDY is not well tuned its performance degrades rapidly (except for distribution 13, on which $\varepsilon_n$-GREEDY performs well for a wide range of values of its parameter).
- In most cases, UCB1-TUNED performs comparably to a well-tuned $\varepsilon_n$-GREEDY. Furthermore, UCB1-TUNED is not very sensitive to the variance of the machines, which is why it performs similarly on distributions 2 and 3, and on distributions 13 and 14.
- Policy UCB2 performs similarly to UCB1-TUNED, but always slightly worse.

Figure 6. Comparison on distribution 1 (2 machines with parameters 0.9, 0.6).
Figure 7. Comparison on distribution 2 (2 machines with parameters 0.9, 0.8).
Figure 8. Comparison on distribution 3 (2 machines with parameters 0.55, 0.45).
Figure 9. Comparison on distribution 11 (10 machines with parameters 0.9, 0.6, ..., 0.6).
Figure 10. Comparison on distribution 12 (10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6).
Figure 11. Comparison on distribution 13 (10 machines with parameters 0.9, 0.8, ..., 0.8).
Figure 12. Comparison on distribution 14 (10 machines with parameters 0.55, 0.45, ..., 0.45).
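The plots themselves cannot be reproduced here. As a stand-in, this tiny harness (our scaffolding, not the paper's) runs any of the policy sketches above on a row of the table and reports the average fraction of plays given to the optimal machine; policies with extra parameters fall back on their defaults:

```python
import random

def run_experiment(policy, mus, n=100_000, runs=100):
    """Average fraction of plays of the optimal machine over independent
    runs, in the spirit of the experiments of Section 4.
    Reduce n and runs for a quick check."""
    K = len(mus)
    best = max(range(K), key=lambda i: mus[i])
    total = 0.0
    for _ in range(runs):
        def pull(i):  # Bernoulli machine with success probability mus[i]
            return 1.0 if random.random() < mus[i] else 0.0
        _, counts = policy(pull, K, n)
        total += counts[best] / n
    return total / runs

# Example: distribution 1 from the table (2 machines, parameters 0.9, 0.6)
# print(run_experiment(ucb1, [0.9, 0.6]))
```

Bernoulli rewards fit the $[0, 1]$-support assumption of Theorems 1-3; the UCB1-NORMAL sketch is designed for the normal case and is included in such a comparison only for illustration.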
5. Conclusions

We have shown simple and efficient policies for the bandit problem that, on any set of reward distributions with known bounded support, exhibit uniform logarithmic regret. Our policies are deterministic and based on upper confidence bounds, with the exception of $\varepsilon_n$-GREEDY, a randomized allocation rule that is a dynamic variant of the ε-greedy heuristic. Moreover, our policies are robust with respect to the introduction of moderate dependencies in the reward processes.

This work can be extended in many ways. A more general version of the bandit problem is obtained by removing the stationarity assumption on reward expectations (see Berry & Fristedt, 1985; Gittins, 1989 for extensions of the basic bandit problem). For example, suppose that a stochastic reward process $\{X_{i,s} : s = 1, 2, \ldots\}$ is associated to each machine $i$. Here, playing machine $i$ at time $t$ yields a reward $X_{i,s}$ and causes the current state $s$ of machine $i$ to change to $s + 1$, whereas the states of the other machines remain frozen. A well-studied problem in this setup is the maximization of the total expected reward in a sequence of $n$ plays. There are methods, like the Gittins allocation indices, that allow one to find the optimal machine to play at each time $n$ by considering each reward process independently from the others (even though the globally optimal solution depends on all the processes). However, computation of the Gittins indices for the average (undiscounted) reward criterion used here requires preliminary knowledge about the reward processes (see, e.g., Ishikida & Varaiya, 1994). To overcome this requirement, one can learn the Gittins indices, as proposed in Duff (1995) for the case of finite-state Markovian reward processes. However, there are no finite-time regret bounds shown for this solution. At the moment, we do not know whether our techniques could be extended to these more general bandit problems.

Appendix A: Proof of Theorem 2

Note that, by the definition of $\tau$ in (3),

    \tau(r) \le (1+\alpha)\,\tau(r-1) + 1.    (14)

Assume $n \ge 1/(2\Delta_j^2)$ for all suboptimal $j$, and let $\tilde{r}_j$ be the largest integer such that

    \tau(\tilde{r}_j - 1) \le \frac{(1+4\alpha)\,\ln(2 e \Delta_j^2 n)}{2 \Delta_j^2}.

Note that $\tilde{r}_j \ge 1$. Since machine $j$ is played in complete epochs, we have

    T_j(n) \le 1 + \sum_{r \ge 1} (\tau(r) - \tau(r-1)) \, \{\text{machine } j \text{ finishes its } r\text{-th epoch}\} \le \tau(\tilde{r}_j) + \sum_{r > \tilde{r}_j} (\tau(r) - \tau(r-1)) \, \{\text{machine } j \text{ finishes its } r\text{-th epoch}\}.

Now consider the following chain of implications. If machine $j$ finishes its $r$-th epoch, then at some time $t \le n$ its index exceeded the index of the optimal machine; since $a_{t,r}$ is increasing in $t$ (this gives the last implication in the chain), either

    \bar{X}_{j, \tau(r-1)} \ge \mu_j + a_{n, r-1}

or, for some $i \ge 0$,

    \bar{X}^*_{\tau(i)} \le \mu^* - a_{n, i},

where for this dichotomy we used that $2 a_{n, r-1} \le \Delta_j$ for $r > \tilde{r}_j$, which is derived from the defining property of $\tilde{r}_j$: for $r > \tilde{r}_j$,

    \tau(r-1) \ge \frac{(1+4\alpha)\,\ln(2 e \Delta_j^2 n)}{2 \Delta_j^2}    (16)

(here the assumption on $n$ guarantees $\ln(2 e \Delta_j^2 n) \ge 1$). Hence

    \mathrm{E}[T_j(n)] \le \tau(\tilde{r}_j) + \sum_{r > \tilde{r}_j} (\tau(r) - \tau(r-1)) \left( \mathrm{P}\left\{ \bar{X}_{j,\tau(r-1)} \ge \mu_j + a_{n,r-1} \right\} + \sum_{i \ge 0} \mathrm{P}\left\{ \bar{X}^*_{\tau(i)} \le \mu^* - a_{n,i} \right\} \right).    (15)

By Fact 1 (Chernoff-Hoeffding bound) and the definition (3) of $a_{n,r}$,

    \mathrm{P}\left\{ \bar{X}_{j,\tau(r-1)} \ge \mu_j + a_{n,r-1} \right\} \le e^{-2 \tau(r-1) a_{n,r-1}^2} = \left( \frac{\tau(r-1)}{e n} \right)^{1+\alpha},

and similarly for the terms of the inner sum over $i$. The remainder of the proof is a careful bookkeeping of these exponentially small terms. The first sum in (15) is handled using (14) and (16); the inner series over $i$ is bounded by comparison with an integral and, after a change of variable, maximizing the resulting expression over its argument shows that it contributes a constant depending only on $\alpha$. Piecing everything together, and using (14) to upper bound $\tau(\tilde{r}_j)$, we find that

    \mathrm{E}[T_j(n)] \le \frac{(1+\alpha)(1+4\alpha)\,\ln(2 e \Delta_j^2 n)}{2 \Delta_j^2} + \frac{c_\alpha}{\Delta_j^2},

where $c_\alpha$ collects the constants arising from the two bounds above; this is the definition (18) referred to in the statement of the theorem. Multiplying by $\Delta_j$ and summing over the suboptimal machines (recall (5)) concludes the proof.
Appendix B: Proof of Theorem 4

The proof goes very much along the same lines as the proof of Theorem 1. It is based on the two following conjectures, which we only verified numerically.

Conjecture 1. Let $X$ be a Student random variable with $s$ degrees of freedom. Then, for all $0 \le a \le \sqrt{2(s+1)}$,

    \mathrm{P}\{X \ge a\} \le e^{-a^2/4}.

Conjecture 2. Let $X$ be a $\chi^2$ random variable with $s$ degrees of freedom. Then

    \mathrm{P}\{X \ge 4s\} \le e^{-(s+1)/2}.

We now proceed with the proof of Theorem 4. Let $Q_{i,s} = \sum_{t=1}^{s} X_{i,t}^2$. Fix a machine $i$ and, for any $s$ and $t$, set

    c_{t,s} = \sqrt{16 \cdot \frac{Q_{i,s} - s \bar{X}_{i,s}^2}{s-1} \cdot \frac{\ln t}{s}}.

Let $c^*_{t,s}$ be the corresponding quantity for the optimal machine. To upper bound $T_i(n)$, we proceed exactly as in the first part of the proof of Theorem 1, obtaining, for any positive integer $\ell$, the analogue of (6) with the new indices $c_{t,s}$. The random variable

    \sqrt{s(s-1)} \; \frac{\bar{X}_{i,s} - \mu_i}{\sqrt{Q_{i,s} - s \bar{X}_{i,s}^2}}

has a Student distribution with $s - 1$ degrees of freedom (see, e.g., Wilks, 1962, 8.4.3 page 211). Therefore, using Conjecture 1 with $s - 1$ degrees of freedom and $a = 4\sqrt{\ln t}$, we get

    \mathrm{P}\left\{ \bar{X}_{i,s} \ge \mu_i + c_{t,s} \right\} = \mathrm{P}\left\{ \sqrt{s(s-1)} \, \frac{\bar{X}_{i,s} - \mu_i}{\sqrt{Q_{i,s} - s \bar{X}_{i,s}^2}} \ge 4\sqrt{\ln t} \right\} \le e^{-4 \ln t} = t^{-4}

for all $s \ge 8 \ln t$; the probability of the symmetric event is bounded analogously. Finally, since $(Q_{i,s} - s \bar{X}_{i,s}^2)/\sigma_i^2$ is $\chi^2$-distributed with $s - 1$ degrees of freedom (see, e.g., Wilks, 1962, 8.4.1 page 208), Conjecture 2 gives

    \mathrm{P}\left\{ \frac{Q_{i,s} - s \bar{X}_{i,s}^2}{\sigma_i^2} \ge 4s \right\} \le e^{-(s+1)/2},

so that, with overwhelming probability, $c_{t,s_i}^2 \le 64 \, \sigma_i^2 (\ln t)/(s_i - 1)$ and the analogue of condition (9) is false for

    s_i \ge 256 \, \sigma_i^2 \, \frac{\ln n}{\Delta_i^2}.

Setting $\ell = \lceil \max\{ 256 \, \sigma_i^2 / \Delta_i^2, \; 8 \} \ln n \rceil$ completes the proof of the theorem.

Acknowledgments

The support from ESPRIT Working Group EP 27150, Neural and Computational Learning II (NeuroCOLT II), is gratefully acknowledged.

Note

1. Similar extensions of Lai and Robbins' results were also obtained by Yakowitz and Lowe (1991), and by Burnetas and Katehakis (1996).

References

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.
Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).
Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.
Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
Sutton, R., & Barto, A. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press/Bradford Books.
Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.

Received September 29, 2000
Revised May 21, 2001
Accepted June 20, 2001
Final manuscript June 20, 2001