An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models

Li, Chu, Langford, Moon, and Wang

JMLR Workshop and Conference Proceedings: Online Trading of Exploration and Exploitation
Abstract (excerpt): Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences, but very challenging due to their "partial-label" nature. A common practice is to create a simulator which simulates the ...


1. Introduction

Web-based recommendation and advertising services such as the Yahoo! Today Module (at http://www.yahoo.com) leverage user activities such as clicks to identify the most attractive contents. One inherent challenge is scoring newly generated content such as breaking news, especially when the news first emerges and little data are available. A personalized service which can tailor contents towards individual users is more desirable and challenging. A distinct feature of these applications is their "partial-label" nature: we observe user feedback (click or not) for an article only when this article is displayed. This key challenge, known as the exploration/exploitation tradeoff, is commonly studied in the contextual bandit framework (Langford and Zhang, 2008), which has found successful applications; see, e.g., Agarwal et al. (2009), Graepel et al. (2010), Li et al. (2010), and Moon et al. (2010).

To evaluate a contextual-bandit algorithm reliably, it is ideal to conduct a bucket test, in which we run the algorithm to serve a fraction of live user traffic in the real recommendation system. However, not only is this method expensive, requiring substantial engineering effort for deployment in the real system, but it can also have a negative impact on user experience. Furthermore, it is not easy to guarantee replicable comparisons using bucket tests, as online metrics vary significantly over time. Offline evaluation of contextual-bandit algorithms thus becomes valuable. Although benchmark datasets for supervised learning, such as the UCI repository, have proved valuable for empirical comparison of algorithms, collecting benchmark data for reliable offline evaluation has been difficult in bandit problems, as explained later in Section 3.

The first purpose of the paper is to review a recently proposed evaluation method of Li et al. (2011), which enjoys valuable theoretical guarantees including unbiasedness and accuracy. The effectiveness of the method has also been verified by comparing its evaluation results to online bucket results using a large volume of data recorded from Yahoo! Front Page. Such positive results not only encourage wide use of the proposed method in other Web-based applications, but also suggest a promising solution for creating benchmark datasets from real-world applications for bandit algorithms.

As one application, the next focus of the paper is to use this offline evaluation technique to validate a few new bandit algorithms based on generalized linear models, or GLMs (McCullagh and Nelder, 1989). We argue that GLMs provide a better way to model average reward when the reward signal is binary, compared to the more widely studied linear models, despite the strong theoretical guarantees of the latter (Auer, 2002; Chu et al., 2011). Our experiments with real Yahoo! data provide empirical evidence for the effectiveness of these new algorithms, and encourage future work on developing regret bounds for them or their variants.

The rest of the paper is organized as follows. After reviewing preliminaries in Section 2, we review the offline evaluation technique in Section 3, including unbiasedness and sample complexity results. Section 4 develops algorithms with GLM-based reward models. These algorithms are inspired by existing ones for linear models, and are empirically validated in Section 5 using real data collected from Yahoo! Front Page. Finally, Section 6 concludes the paper. Since our paper consists of two major components, related work is discussed in the appropriate subsections.

2. Notation

The multi-armed bandit problem is a classic and popular model for studying the exploration-exploitation tradeoff (Berry and Fristedt, 1985). This paper considers the problem with contextual information. Following Langford and Zhang (2008), we call it a contextual bandit problem.
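For concreteness, the interaction protocol described above can be summarized by a minimal interface. The following Python sketch is purely illustrative (the class and method names, such as BanditAlgorithm, choose_arm, and update, are our own choices rather than anything defined in the paper); later sketches in this document reuse it.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class BanditAlgorithm(ABC):
    """A contextual-bandit algorithm A: maps (internal history, context) to an arm."""

    @abstractmethod
    def choose_arm(self, context: Sequence[float]) -> int:
        """Select one of the K arms (e.g., articles) to display for this context."""

    @abstractmethod
    def update(self, context: Sequence[float], arm: int, reward: float) -> None:
        """Observe feedback for the displayed arm only (the 'partial label')."""
```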
Each logged event consists of the context information, the displayed news article, and user feedback (click or not). When using data of this form to evaluate a bandit algorithm offline, we will not have user feedback if the algorithm recommends a different news article than the one stored in the log. In other words, data in bandit-style applications only contain user feedback for recommended news articles that were actually displayed to the user, but not for undisplayed ones. This "partial-label" nature raises a difficulty that is the key difference between evaluation of bandit algorithms and supervised learning ones.

Common practice for evaluating a bandit algorithm is to create a simulator and then run the algorithm against it. With this approach, we can evaluate any bandit algorithm without having to run it in a real system. Unfortunately, there are two major drawbacks to this approach. First, creating a simulator can be challenging and time-consuming for practical problems. Second, evaluation results based on artificial simulators may not reflect the actual performance, since simulators are only rough approximations of real problems and unavoidably contain modeling bias. In fact, building a high-quality simulator can be strictly harder than building a high-quality policy (Strehl et al., 2006).

The goal here is to measure the total reward of a bandit algorithm A. Because of the interactive nature of the problem, it would seem that the only way to do this unbiasedly is to actually run the algorithm online on "live" data. However, in practice, this approach is likely to be infeasible due to the serious logistical challenges that it presents. Rather, we may only have offline data available that was collected at a previous time using an entirely different logging policy. Because rewards are only observed for the arms chosen by the logging policy, which are likely to differ from those chosen by the algorithm A being evaluated, it is not at all clear how to evaluate A based only on such logged data. This evaluation problem may be viewed as a special case of the so-called "off-policy evaluation problem" in the reinforcement-learning literature (Precup et al., 2000).

In this section, we summarize our previous work (Li et al., 2011) on a sound technique for carrying out such an evaluation. Interested readers are referred to the original paper for more details. The key assumption of the method is that the individual events are i.i.d., and that the logging policy chose each arm at each time step uniformly at random. Although we omit the details, this latter assumption can be weakened considerably so that any randomized logging policy is allowed, and the algorithm can be modified accordingly using rejection sampling, but at the cost of decreased data efficiency. Furthermore, if A is a stationary policy that does not change over trials, data may be used more efficiently via propensity scoring (Langford et al., 2008; Strehl et al., 2011) and related techniques like doubly robust estimation (Dudík et al., 2011).

Formally, algorithm A is a (possibly randomized) mapping for selecting the arm a_t at time t based on the history h_{t-1} of the t-1 preceding events together with the current context. Algorithms 1 and 2 give two versions of the evaluation technique, one assuming access to a sufficiently long sequence of logged events resulting from the interaction of the logging policy with the world, the other assuming a fixed set of logged interactions. The method takes as input a bandit algorithm A. We then step through the stream of logged events one by one. If, given the current history h_{t-1}, it happens that the policy A chooses the same arm a as the one that was selected by the logging policy, then the event is retained (that is, added to the history), and the total reward Ĝ_A is updated. Otherwise, if the policy A selects a different arm from the one that was taken by the logging policy, then the event is entirely ignored, and the algorithm proceeds to the next event without any change in its state. The process repeats until h_T is reached (Algorithm 1), or until the data are exhausted (Algorithm 2).

Algorithm 1 Policy_Evaluator (with infinite data stream)
0: Inputs: T > 0; bandit algorithm A; stream of events S
1: h_0 ← ∅  {An initially empty history}
2: Ĝ_A ← 0  {An initially zero total reward}
3: for t = 1, 2, 3, ..., T do
4:   repeat
5:     Get next event (x, a, r_a) from S
6:   until A(h_{t-1}, x) = a
7:   h_t ← CONCATENATE(h_{t-1}, (x, a, r_a))
8:   Ĝ_A ← Ĝ_A + r_a
9: end for
10: Output: Ĝ_A / T

Algorithm 2 Policy_Evaluator (with finite data stream)
0: Inputs: bandit algorithm A; stream of events S of length L
1: h_0 ← ∅  {An initially empty history}
2: Ĝ_A ← 0  {An initially zero total reward}
3: T ← 0  {An initially zero counter of valid events}
4: for t = 1, 2, 3, ..., L do
5:   Get the t-th event (x, a, r_a) from S
6:   if A(h_{t-1}, x) = a then
7:     h_t ← CONCATENATE(h_{t-1}, (x, a, r_a))
8:     Ĝ_A ← Ĝ_A + r_a
9:     T ← T + 1
10:  else
11:    h_t ← h_{t-1}
12:  end if
13: end for
14: Output: Ĝ_A / T

Note that, because the logging policy chooses each arm uniformly at random, each event is retained by this algorithm with probability exactly 1/K, independent of everything else. This means that the events which are retained have the same distribution as if they were selected by D.
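As a concrete illustration of Algorithm 2, the following is a minimal Python sketch of this replay procedure under the uniform-logging assumption; the BanditAlgorithm interface and all function names are the illustrative ones introduced earlier, not code from the paper or from the released benchmark.

```python
from typing import Iterable, Sequence, Tuple

Event = Tuple[Sequence[float], int, float]  # (context x, logged arm a, reward r_a)


def replay_evaluate(algo: "BanditAlgorithm", events: Iterable[Event]) -> float:
    """Algorithm 2: offline (replay) evaluation on a finite stream of logged events.

    Assumes each logged arm was chosen uniformly at random among K arms, so every
    event is retained with probability exactly 1/K, independent of everything else.
    """
    total_reward = 0.0
    n_valid = 0  # T in the pseudocode: number of retained (valid) events
    for x, logged_arm, reward in events:
        if algo.choose_arm(x) == logged_arm:
            # The algorithm would have displayed the same article: keep the event.
            algo.update(x, logged_arm, reward)
            total_reward += reward
            n_valid += 1
        # Otherwise the event is ignored and the algorithm's state is unchanged.
    return total_reward / max(n_valid, 1)  # estimated per-trial reward Ĝ_A / T
```

Repeating this procedure on independent random subsamples of the log and averaging the returned per-trial rewards is what yields the confidence intervals discussed next.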
The unbiasedness guarantee thus follows immediately. Hence, by repeating the evaluation procedure multiple times and then averaging the returned per-trial rewards, we can accurately estimate the per-trial reward g_A of any algorithm A and the respective confidence intervals. Furthermore, as the size of the data L increases, the estimation error of Algorithm 2 decreases to 0 at the rate of O(1/√L). This error bound improves a previous result (Langford et al., 2008) for a similar offline evaluation algorithm, and similarly provides a sharpened analysis for the T = 1 special case of policy evaluation in reinforcement learning (Kearns et al., 2000). Details and empirical support of the evaluation method are found in our full paper (Li et al., 2011).

In summary, the unbiased offline evaluation technique provides a reliable method for collecting benchmark data so as to evaluate and compare different bandit algorithms, which was not available before. The first such benchmark has been released to the public (Yahoo!, 2011). Moreover, the technique is quite general; it has been successfully applied to domains like ranking (Moon et al., 2010) as well as to bandit problems with multiple objectives (Agarwal et al., 2011).

For linear and probit models, the posterior mean above can be calculated in closed form:

$$\mathbb{E}_{w_a}[\hat{r}_a(x; w_a)] = \begin{cases} m & \text{(linear)} \\ \Phi\!\left(m/\sqrt{1+v}\right) & \text{(probit)} \end{cases}$$

where m ≝ xᵀμ_a and v ≝ xᵀΣ_a x are the mean and variance of the quantity xᵀw_a. For the logistic model, however, approximation is necessary. Using various approximation techniques (see Appendix C), a few candidates are reasonable and will be compared against each other in the next section:

$$\mathbb{E}_{w_a}[\hat{r}_a(x; w_a)] \approx \begin{cases} \left(1+\exp(-m)\right)^{-1} & (M_0) \\ \left(1+\exp\!\left(-m/\sqrt{1+\pi v/8}\right)\right)^{-1} & (M_1) \\ \left(1+\exp(-m-v/2)\right)^{-1} & (M_2) \\ \exp(m+v/2) & (M_3) \end{cases}$$

4.3. Balancing Exploration and Exploitation

Choosing arms according to the exploitation rule of Equation (2) is desired for maximizing total rewards, but it is at the same time risky: the lack of exploration may prevent collection of the data needed to correct initial errors in parameter estimation. This section discusses a few candidates for online tradeoffs of exploration and exploitation. A generic heuristic is ε-greedy, in which one chooses a greedy arm (according to Equation (2)) with probability 1-ε and a random arm otherwise. This heuristic is simple, completely general, and can be combined with essentially any reward model. Unfortunately, due to the unguided, uniformly random selection of arms for exploration, it is often not the best one can do in practice. For example, the closely related epoch-greedy algorithm (Langford and Zhang, 2008) can only guarantee Õ(T^{2/3}) regret for stochastic bandits, while guided exploration can do significantly better, with Õ(√T) regret even in the presence of an adversary (Auer et al., 2002b; Beygelzimer et al., 2011).
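To make the candidate reward estimates and the ε-greedy rule concrete, here is a small sketch, assuming the projected mean m = xᵀμ_a and variance v = xᵀΣ_a x have already been computed for each arm; the function names are illustrative, not the paper's code.

```python
import math
import random


def logistic_posterior_mean(m: float, v: float, method: str = "M1") -> float:
    """Approximations M0-M3 to E_w[sigmoid(x^T w)] when x^T w ~ N(m, v)."""
    if method == "M0":
        # Plug-in estimate: ignore the posterior variance entirely.
        return 1.0 / (1.0 + math.exp(-m))
    if method == "M1":
        # Probit-style moderation of the mean by the posterior variance.
        return 1.0 / (1.0 + math.exp(-m / math.sqrt(1.0 + math.pi * v / 8.0)))
    if method == "M2":
        # Shift the mean by v/2 inside the sigmoid.
        return 1.0 / (1.0 + math.exp(-m - v / 2.0))
    if method == "M3":
        # Log-normal-style estimate; note this one is not bounded by 1.
        return math.exp(m + v / 2.0)
    raise ValueError(f"unknown approximation {method!r}")


def epsilon_greedy(estimated_means, epsilon: float, rng=random) -> int:
    """Choose the greedy arm with probability 1 - epsilon, else a uniformly random arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_means))
    return max(range(len(estimated_means)), key=estimated_means.__getitem__)
```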
In contrast, UCB-based exploration techniques are explicitly guided towards arms with uncertain reward predictions. They have been found effective in previous studies (Auer, 2002; Auer et al., 2002a; Dorard et al., 2009; Li et al., 2010; Chu et al., 2011). In the context of the present work, calculating the UCB is convenient since we maintain the posterior of each w_a explicitly. Analogous to previous algorithms for linear models, a UCB exploration rule chooses arms with a maximum upper confidence bound of the expected reward, given the parameter posterior N(·; μ_a, Σ_a) and the context:

$$\arg\max_a \bar{r}_a(x, w_a; \alpha) \stackrel{\text{def}}{=} \arg\max_a \left\{ \mathbb{E}_{w_a}[\hat{r}_a(x, w_a)] + \alpha\sqrt{\mathrm{Var}_{w_a}[\hat{r}_a(x, w_a)]} \right\} \qquad (3)$$

with a possibly slowly growing parameter α ∈ R_+. We call this arm selection α-UCB. For linear and probit models, the calculation of r̄_a again can be done in closed form:

$$\bar{r}_a(x, w_a) = \begin{cases} m + \alpha\sqrt{v} & \text{linear (Dorard et al., 2009; Li et al., 2010)} \\ \Phi\!\left(m + \alpha\sqrt{v}\right) & \text{probit} \end{cases}$$

where m ≝ xᵀμ_a and v ≝ xᵀΣ_a x are the mean and variance of the quantity xᵀw_a. In the case of probit, since Φ is monotonic, the greedy arm with respect to Φ(m + α√v) is thus the same as the greedy arm with respect to m + α√v.
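As a sketch of the α-UCB selection of Equation (3) in the closed-form linear and probit cases (the helper names are ours; the logistic UCB approximations U0-U3 discussed later are not reproduced here):

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, for the probit link Phi


def ucb_score(m: float, v: float, alpha: float, link: str = "linear") -> float:
    """Upper confidence bound of the expected reward for one arm (Equation 3)."""
    if link == "linear":
        return m + alpha * math.sqrt(v)
    if link == "probit":
        # Phi is monotonic, so ranking by Phi(m + alpha*sqrt(v)) is the same as
        # ranking by m + alpha*sqrt(v); we return the probability-scale value.
        return _STD_NORMAL.cdf(m + alpha * math.sqrt(v))
    raise ValueError(f"unknown link {link!r}")


def alpha_ucb_choose(arm_stats, alpha: float, link: str = "linear") -> int:
    """Pick argmax_a of the UCB score; arm_stats is a list of (m, v) pairs."""
    scores = [ucb_score(m, v, alpha, link) for m, v in arm_stats]
    return max(range(len(scores)), key=scores.__getitem__)
```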
Figure 1: A comparison of three generalized linear models with 50% subsamples of data. The plots contain nCTR for the learning bucket (left column) and the deployment bucket (right column), using ε-greedy (top row) and UCB (bottom row). Numbers are averaged over 5 runs on random subsamples.

... they can, at best, achieve nCTR of 1.509 and 1.584 in the learning and deployment buckets, respectively. The results were consistent with our previous work (Li et al., 2010), although a different set of features was used. Third, the logistic and probit models clearly outperform linear models, which is expected as their likelihood models better capture the binary reward signals. Since binary rewards are common in Web-based applications (like clicks, conversions, etc.), we anticipate the logistic and probit models to be more effective in general than linear models. That being said, with a large amount of data, the linear model may still be effective (Li et al., 2010; Moon et al., 2010), and it remains a reasonable algorithmic choice given the simplicity of its closed-form update rules.

... than the (unnormalized) CTRs in our problem. But this prior is not optimistic for linear models since the probability mass centers around 0. To fix this problem, we changed the prior mean and variance for the constant feature's weight to 0.5 and 0.01, respectively. Figure 2 plots the nCTR in the learning bucket when the optimistic initialization is used. We omit the deployment bucket's nCTR since they are all similar to the leftmost ones in Figure 1(b,d). The results show that optimistic initialization alone is effective enough without explicit exploration like ε-greedy or UCB. Indeed, the nCTR was highest when such explicit exploration was turned off, that is, when ε = α = 0. In contrast, explicit exploration was still necessary for linear models if the non-optimistic prior N(0, I) was used: when ε = α = 0, the mean nCTR in the learning bucket was as low as 1.258 (with standard deviation 0.061), compared to 1.535 (with standard deviation 0.018) in Figure 2. Hence, we conclude that non-optimistic initialization without explicit exploration may result in convergence to suboptimal policies, and that optimistic initialization can indeed be effective in practice.

Figure 3: A comparison of different approximations for the posterior mean and posterior UCB in logistic models: (a) learning bucket nCTR with various UCB approximations; (b) deployment bucket nCTR with various posterior mean approximations and with U0 for exploration.

5.4. Approximations in the Logistic Model

Given the highly promising performance of logistic models, we next investigate how effective the various approximations are. Figure 3(a) compares the learning bucket's nCTR with different approximations of the UCB score. Here, the formula for exploitation is irrelevant, so we do not show deployment nCTR. The results show that U0, U2, and U3 can be effective with an appropriately tuned parameter α, but U1 is not as satisfactory. It also suggests that the UCB rule (U3) given by Filippi et al. (2011) is too conservative, and consequently the best α value is very small.

Figure 3(b) compares the deployment bucket's nCTR with different approximations of the posterior mean; the UCB formula was fixed to U0. Here, the learning bucket performance is irrelevant, as we are comparing ways to compute the posterior mean while following the same exploration ...

... mode and the Hessian at the mode:

$$\mathbb{E}_{t+1}[\theta] = \hat{\theta}_{t+1} \equiv \arg\max_\theta \left(1+\exp\!\left(-y(\theta\sqrt{v}+m)\right)\right)^{-1} \exp\!\left(-\theta^2/2\right) \qquad (6)$$

$$\mathrm{Var}_{t+1}[\theta] = \hat{\sigma}^2_{t+1} \equiv \left(1 + \frac{v\,\exp\!\left(\sqrt{v}\,\hat{\theta}_{t+1}+m\right)}{\left(1+\exp\!\left(\sqrt{v}\,\hat{\theta}_{t+1}+m\right)\right)^2}\right)^{-1} \qquad (7)$$

Once we have the above, we then use the joint Gaussian assumption of w and θ and have E[w | θ] = μ_t + (Σ_t x/√v)θ and Cov[w | θ] = Σ_t − Σ_t x xᵀΣ_t / v, where Σ_t is assumed to be diagonal. Then, from the iterated expectation formula, we obtain

$$\mu_{t+1} = \mu_t + \frac{\hat{\theta}_{t+1}}{\sqrt{v}}\,\Sigma_t x \qquad (8)$$

$$\Sigma_{t+1} = \Sigma_t + \frac{\hat{\sigma}^2_{t+1}-1}{v}\,\Sigma_t x x^\top \Sigma_t \qquad (9)$$

For a diagonal covariance matrix Σ_t, the above updates can be done efficiently in linear time.

Appendix B. Approximate Inference in Probit Regression

In probit regression, the posterior distribution of the weight vector w is proportional to the product of the probit likelihood and the Gaussian prior distribution, i.e.,

$$p(w) \propto \Phi\!\left(y\,x^\top w\right)\,\mathcal{N}(w;\,\mu_t,\,\Sigma_t),$$

where x denotes a training sample with label y ∈ {−1, +1}. In our experiments, a click signal c ∈ {0, 1} has to be converted to a binary label through 2c − 1. We take a variational approach to approximate the posterior distribution p(w) by a Gaussian distribution. The approximate Bayesian inference technique is known as Assumed Density Filtering (ADF) or Expectation Propagation (EP) (Lawrence et al., 2002; Minka, 2001). Specifically, let N(w; μ_{t+1}, Σ_{t+1}) be the target Gaussian, whose parameters {μ_{t+1}, Σ_{t+1}} are determined by the minimizer of the following Kullback-Leibler divergence:

$$\arg\min_{\mu,\,\Sigma}\; \mathrm{KL}\!\left(\Phi(y\,x^\top w)\,\mathcal{N}(w;\mu_t,\Sigma_t)\;\middle\|\;\mathcal{N}(w;\mu,\Sigma)\right).$$

This optimization problem can be solved analytically by moment matching up to the second order, yielding:

$$\mu_{t+1} = \mu_t + \alpha\,(\Sigma_t x) \qquad (10)$$

$$\Sigma_{t+1} = \Sigma_t - \beta\,(\Sigma_t x)(\Sigma_t x)^\top \qquad (11)$$

where

$$\alpha = \frac{y}{\sqrt{x^\top\Sigma_t x + 1}}\cdot\frac{\mathcal{N}(z)}{\Phi(z)}, \qquad \beta = \frac{1}{x^\top\Sigma_t x + 1}\cdot\frac{\mathcal{N}(z)}{\Phi(z)}\left(\frac{\mathcal{N}(z)}{\Phi(z)}+z\right).$$
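A minimal sketch of the moment-matching update in Equations (10) and (11) follows. Note that the definition of the standardized margin z falls outside the text reproduced above, so the expression used below (the usual ADF choice) should be read as an assumption rather than as the paper's formula; all names are illustrative.

```python
import math
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()  # standard normal: pdf = N(.), cdf = Phi(.)


def probit_adf_update(mu: np.ndarray, Sigma: np.ndarray, x: np.ndarray, y: int):
    """One ADF/EP moment-matching step for probit regression, Equations (10)-(11).

    mu, Sigma parameterize the Gaussian posterior N(mu, Sigma) over w; x is the
    feature vector and y in {-1, +1} the label (a click c in {0, 1} maps to 2c - 1).
    """
    s = float(x @ Sigma @ x) + 1.0  # x^T Sigma_t x + 1
    # ASSUMPTION: the definition of z is cut off in the excerpt above; the
    # standardized margin below is the standard ADF choice.
    z = y * float(x @ mu) / math.sqrt(s)
    ratio = _STD_NORMAL.pdf(z) / _STD_NORMAL.cdf(z)  # N(z) / Phi(z)
    alpha = y * ratio / math.sqrt(s)                 # coefficient in Equation (10)
    beta = ratio * (ratio + z) / s                   # coefficient in Equation (11)
    Sigma_x = Sigma @ x
    mu_new = mu + alpha * Sigma_x                            # Equation (10)
    Sigma_new = Sigma - beta * np.outer(Sigma_x, Sigma_x)    # Equation (11)
    return mu_new, Sigma_new
```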
References

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 19-26, 2011.

Wei Chu, Seung-Taek Park, Todd Beaupre, Nitin Motgi, Amit Phadke, Seinjuti Chakraborty, and Joe Zachariah. A case study of behavior-driven conjoint analysis on Yahoo! Front Page Today Module. In Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097-1104, 2009.

Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 208-214, 2011.

Louis Dorard, Dorota Glowacka, and John Shawe-Taylor. Gaussian process modelling of dependencies in multi-armed bandit problems. In Proceedings of the Tenth International Symposium on Operational Research (SOR-09), pages 721-728, 2009.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML-11), pages 1097-1104, 2011. CoRR abs/1103.4601.

Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 586-594, 2011.

Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13-20, 2010.

Steffen Grünewälder, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), pages 273-280, 2010.

Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems 12, pages 1001-1007, 2000.

John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 1096-1103, 2008.

John Langford, Alexander L. Strehl, and Jennifer Wortman. Exploration scavenging. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML-08), pages 528-535, 2008.

Neil D. Lawrence, Matthias Seeger, and Ralf Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609-616, 2002.