4 Learning, Regret minimization, and Equilibria

A. Blum and Y. Mansour

Abstract

Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive to work each day, or repeated play of a game against an opponent with an unknown strategy. In this chapter we describe learning algorithms with strong guarantees for settings of this type, along with connections to game-theoretic equilibria when all players in a system are simultaneously adapting in such a manner.

We begin by presenting algorithms for repeated play of a matrix game with the guarantee that against any opponent, they will perform nearly as well as the best fixed action in hindsight (also called the problem of combining expert advice or minimizing external regret). In a zero-sum game, such algorithms are guaranteed to approach or exceed the minimax value of the game, and even provide a simple proof of the minimax theorem. We then turn to algorithms that minimize an even stronger form of regret, known as internal or swap regret. We present a general reduction showing how to convert any algorithm for minimizing external regret to one that minimizes this stronger form of regret as well. Internal regret is important because when all players in a game minimize this stronger type of regret, the empirical distribution of play is known to converge to correlated equilibrium.

The third part of this chapter explains a different reduction: how to convert from the full information setting, in which the action chosen by the opponent is revealed after each time step, to the partial information (bandit) setting, where at each time step only the payoff of the selected action is observed (such as in routing), and still maintain a small external regret. Finally, we end by discussing routing games in the Wardrop model, where one can show that if all participants minimize their own external regret, then
overall traffic is guaranteed to converge to an approximate Nash equilibrium. This further motivates price-of-anarchy results.

4.1 Introduction

In this chapter we consider the problem of repeatedly making decisions in an uncertain environment. The basic setting is we have a space of $N$ actions, such as what route to use to drive to work, or the rows of a matrix game like {rock, paper, scissors}. At each time step, the algorithm probabilistically chooses an action (say, selecting what route to take), the environment makes its "move" (setting the road congestions on that day), and the algorithm then incurs the loss for its action chosen (how long its route took). The process then repeats the next day. What we would like are adaptive algorithms that can perform well in such settings, as well as to understand the dynamics of the system when there are multiple players, all adjusting their behavior in such a way.

A key technique for analyzing problems of this sort is known as regret analysis. The motivation behind regret analysis can be viewed as the following: we design a sophisticated online algorithm that deals with various issues of uncertainty and decision making, and sell it to a client. Our algorithm runs for some time and incurs a certain loss. We would like to avoid the embarrassment that our client will come back to us and claim that in retrospect we could have incurred a much lower loss if we had used his simple alternative policy $\pi$. The regret of our online algorithm is the difference between the loss of our algorithm and the loss using $\pi$. Different notions of regret quantify differently what is considered to be a "simple" alternative policy.

External regret, also called the problem of combining expert advice, compares performance to the best single action in retrospect. This implies that the simple alternative policy performs the same action in all time steps, which indeed is quite simple. Nonetheless, external regret provides a general methodology for developing online algorithms whose performance matches that of an optimal static offline algorithm by modeling the possible static solutions as different actions. In the context of machine learning, algorithms with good external regret bounds can be powerful tools for achieving performance comparable to the optimal prediction rule from some large class of hypotheses.

In Section 4.3 we describe several algorithms with particularly strong external regret bounds. We start with the very weak greedy algorithm, and build up to an algorithm whose loss is at most $O(\sqrt{T\log N})$ greater than that of the best action, where $T$ is the number of time steps. That is, the regret per time step drops as $O(\sqrt{(\log N)/T})$. In Section 4.4 we show that in a zero-sum game, such algorithms are guaranteed to approach or exceed the value of the game, and even yield a simple proof of the minimax theorem.

A second category of alternative policies are those that consider the online sequence of actions and suggest a simple modification to it, such as "every time you bought IBM, you should have bought Microsoft instead". While one can study very general classes of modification rules, the most common form, known as internal or swap regret, allows one to modify the online action sequence by changing every occurrence of a given action $i$ to an alternative action $j$. (The distinction between internal and swap regret is that internal regret allows only one action to be replaced by another, whereas swap regret allows any mapping from $\{1,\ldots,N\}$ to $\{1,\ldots,N\}$ and can be up to a factor $N$ larger.) In Section 4.5 we present a simple way to efficiently convert any external regret minimizing algorithm into one that minimizes swap regret with only a factor $N$ increase in the regret term. Using the results for external regret, this achieves a swap regret bound of $O(\sqrt{TN\log N})$. (Algorithms for swap regret have also been developed from first principles; see the Notes section of this chapter for references. However, this procedure gives the best bounds known for efficient algorithms.)

The importance of swap regret is due to its tight connection to correlated equilibria, defined in Chapter 1. In fact, one way to think of a correlated equilibrium is that it is a distribution $Q$ over the joint action space such that every player would have zero internal (or swap) regret when playing it. As we point out in Section 4.4, if each player can achieve swap regret $\epsilon T$, then the empirical distribution of the joint actions of the players will be an $\epsilon$-correlated equilibrium.

We also describe how external regret results can be extended to the partial information model, also called the multi-armed bandit (MAB) problem. In this model, the online algorithm only gets to observe the loss of the action actually selected, and does not see the losses of the actions not chosen. For example, in the case of driving to work, you may only observe the travel time on the route you actually drive, and do not get to find out how long it would have taken had you chosen some alternative route. In Section 4.6 we present a general reduction, showing how to convert an algorithm with low external regret in the full information model to one for the partial information model (though the bounds produced are not the best known bounds for this problem).

Notice that the route-choosing problem can be viewed as a general-sum game: your travel time depends on the choices of the other drivers as well. In Section 4.7 we discuss results showing that in the Wardrop model of
infinitesimal agents (considered in Chapter 18), if each driver acts to minimize external regret, then traffic flow over time can be shown to approach an approximate Nash equilibrium. This serves to further motivate price-of-anarchy results in this context, since it means they apply to the case that participants are using well-motivated self-interested adaptive behavior.

We remark that the results we present in this chapter are not always the strongest known, and the interested reader is referred to the recent book [CBL06], which gives a thorough coverage of many of the topics in this chapter. See also the Notes section for further references.

4.2 Model and Preliminaries

We assume an adversarial online model where there are $N$ available actions $X = \{1,\ldots,N\}$. At each time step $t$, an online algorithm $H$ selects a distribution $p^t$ over the $N$ actions. After that, the adversary selects a loss vector $\ell^t \in [0,1]^N$, where $\ell^t_i \in [0,1]$ is the loss of the $i$-th action at time $t$. In the full information model, the online algorithm $H$ receives the loss vector $\ell^t$ and experiences a loss $\ell^t_H = \sum_{i=1}^N p^t_i \ell^t_i$. (This can be viewed as an expected loss when the online algorithm selects action $i \in X$ with probability $p^t_i$.) In the partial information model, the online algorithm receives $(\ell^t_{k^t}, k^t)$, where $k^t$ is distributed according to $p^t$, and $\ell^t_H = \ell^t_{k^t}$ is its loss. The loss of the $i$-th action during the first $T$ time steps is $L^T_i = \sum_{t=1}^T \ell^t_i$, and the loss of $H$ is $L^T_H = \sum_{t=1}^T \ell^t_H$.

The aim for the external regret setting is to design an online algorithm that will be able to approach the performance of the best algorithm from a given class of algorithms $\mathcal{G}$; namely, to have a loss close to $L^T_{\mathcal{G},\min} = \min_{g\in\mathcal{G}} L^T_g$. Formally, we would like to minimize the external regret $R_{\mathcal{G}} = L^T_H - L^T_{\mathcal{G},\min}$, and $\mathcal{G}$ is called the comparison class. The most studied comparison class $\mathcal{G}$ is the one that consists of all the single actions, i.e., $\mathcal{G} = X$. In this chapter we concentrate on this important comparison class; namely, we want the online algorithm's loss to be close to $L^T_{\min} = \min_i L^T_i$, and let the external regret be $R = L^T_H - L^T_{\min}$.

External regret uses a fixed comparison class $\mathcal{G}$, but one can also envision a comparison class that depends on the online algorithm's actions. We can consider modification rules that modify the actions selected by the online algorithm, producing an alternative strategy which we will want to compete against. A modification rule $F$ has as input the history and the current action selected by the online procedure, and outputs a (possibly different) action. (We denote by $F^t$ the function $F$ at time $t$, including any dependency on the history.) Given a sequence of probability distributions $p^t$ used by an online algorithm $H$, and a modification rule $F$, we define a new sequence of probability distributions $f^t = F^t(p^t)$, where $f^t_i = \sum_{j : F^t(j)=i} p^t_j$. The loss of the modified sequence is $L_{H,F} = \sum_t \sum_i f^t_i \ell^t_i$. Note that at time $t$ the modification rule $F$ shifts the probability that $H$ assigned to action $j$ to action $F^t(j)$. This implies that the modification rule $F$ generates a different distribution, as a function of the online algorithm's distribution $p^t$.

We will focus on the case of a finite set $\mathcal{F}$ of memoryless modification rules (they do not depend on history). Given a sequence of loss vectors, the regret of an online algorithm $H$ with respect to the modification rules $\mathcal{F}$ is
$$R_{\mathcal{F}} = \max_{F\in\mathcal{F}} \{ L^T_H - L^T_{H,F} \}.$$
Note that the external regret setting is equivalent to having a set $\mathcal{F}^{ex}$ of $N$ modification rules $F_i$, where $F_i$ always outputs action $i$. For internal regret, the set $\mathcal{F}^{in}$ consists of $N(N-1)$ modification rules $F_{i,j}$, where $F_{i,j}(i) = j$ and $F_{i,j}(i') = i'$ for $i' \neq i$. That is, the internal regret of $H$ is
$$\max_{F\in\mathcal{F}^{in}} \{ L^T_H - L^T_{H,F} \} = \max_{i,j\in X} \left\{ \sum_{t=1}^T p^t_i (\ell^t_i - \ell^t_j) \right\}.$$
A more general class of memoryless modification rules is swap regret, defined by the class $\mathcal{F}^{sw}$, which includes all $N^N$ functions $F : \{1,\ldots,N\} \to \{1,\ldots,N\}$, where the function $F$ swaps the current online action $i$ with $F(i)$ (which can be the same or a different action). That is, the swap regret of $H$ is
$$\max_{F\in\mathcal{F}^{sw}} \{ L^T_H - L^T_{H,F} \} = \sum_{i=1}^N \max_{j\in X} \left\{ \sum_{t=1}^T p^t_i (\ell^t_i - \ell^t_j) \right\}.$$
Note that since $\mathcal{F}^{ex} \subseteq \mathcal{F}^{sw}$ and $\mathcal{F}^{in} \subseteq \mathcal{F}^{sw}$, both external and internal regret are upper-bounded by swap regret. (See also Exercises 1 and 2.)

4.3 External Regret Minimization

Before describing the external regret results, we begin by pointing out that it is not possible to guarantee low regret with respect to the overall optimal sequence of decisions in hindsight, as is done in competitive analysis [ST85, BEY98]. This will motivate why we will be concentrating on more restricted comparison classes. In particular, let $\mathcal{G}^{all}$ be the set of all functions mapping times $\{1,\ldots,T\}$ to actions $X = \{1,\ldots,N\}$.

Theorem 4.1 For any online algorithm $H$ there exists a sequence of $T$ loss vectors such that the regret $R_{\mathcal{G}^{all}}$ is at least $T(1 - 1/N)$.

Proof The sequence is simply as follows: at each time $t$, the action $i_t$ of lowest probability $p^t_i$ gets a loss of 0, and all the other actions get a loss of 1. Since $\min_i \{p^t_i\} \le 1/N$, this means the loss of $H$ in $T$ time steps is at least $T(1 - 1/N)$. On the other hand, there exists $g \in \mathcal{G}^{all}$, namely $g(t) = i_t$, with a total loss of 0.

The above proof shows that if we consider all possible functions, we have a very large regret. For the rest of the section we will use the comparison class $\mathcal{G}^a = \{g_i : i \in X\}$, where $g_i$ always selects action $i$. Namely, we compare the online algorithm to the best single action.

Warmup: Greedy and Randomized-Greedy Algorithms

In this section, for simplicity we will assume all losses are either 0 or 1 (rather than a real number in $[0,1]$), which will simplify notation and proofs, though everything presented can be easily extended to the general case. Our first attempt to develop a good regret minimization algorithm will be to consider the greedy algorithm. Recall that $L^t_i = \sum_{\tau=1}^t \ell^\tau_i$, namely the cumulative loss up to time $t$ of action $i$. The Greedy algorithm at each time $t$ selects action $x^t = \arg\min_{i\in X} L^{t-1}_i$ (if there are multiple actions with the same cumulative loss, it prefers the action with the lowest index). Formally:

Greedy Algorithm
  Initially: $x^1 = 1$.
  At time $t$: Let $L^{t-1}_{\min} = \min_{i\in X} L^{t-1}_i$, and $S^{t-1} = \{i : L^{t-1}_i = L^{t-1}_{\min}\}$.
  Let $x^t = \min S^{t-1}$.

Theorem 4.2 The Greedy algorithm, for any sequence of losses, has
$$L^T_{Greedy} \le N \cdot L^T_{\min} + (N-1).$$

Proof At each time $t$ such that Greedy incurs a loss of 1 and $L^t_{\min}$ does not increase, at least one action is removed from $S^t$. This can occur at most $N$ times before $L^t_{\min}$ increases by 1. Therefore, Greedy incurs loss at most $N$ between successive increments in $L^t_{\min}$. More formally, this shows inductively that $L^t_{Greedy} \le N - |S^t| + N \cdot L^t_{\min}$.

The above guarantee on Greedy is quite weak, stating only that its loss is at most a factor of $N$ larger than the loss of the best action. The following theorem shows that this weakness is shared by any deterministic online algorithm. (A deterministic algorithm concentrates its entire weight on a single action at each time step.)

Theorem 4.3 For any deterministic algorithm $D$ there exists a loss sequence for which $L^T_D = T$ and $L^T_{\min} = \lfloor T/N \rfloor$.

Note that the above theorem implies that $L^T_D \ge N \cdot L^T_{\min} + (T \bmod N)$, which almost matches the upper bound for Greedy (Theorem 4.2).

Proof Fix a deterministic online algorithm $D$ and let $x^t$ be the action it selects at time $t$. We will generate the loss sequence in the following way. At time $t$, let the loss of $x^t$ be 1 and the loss of any other action be 0. This ensures that $D$ incurs loss 1 at each time step, so $L^T_D = T$. Since there are $N$ different actions, there is some action that algorithm $D$ has selected at most $\lfloor T/N \rfloor$ times. By construction, only the actions selected by $D$ ever have a loss, so this implies that $L^T_{\min} \le \lfloor T/N \rfloor$.
Theorem 4.3 motivates considering randomized algorithms. In particular, one weakness of the greedy algorithm was that it had a deterministic tie breaker. One can hope that if the online algorithm splits its weight between all the currently best actions, better performance could be achieved. Specifically, let Randomized Greedy (RG) be the procedure that assigns a uniform distribution over all those actions with minimum total loss so far. We now will show that this algorithm achieves a significant performance improvement: its loss is at most an $O(\log N)$ factor from the best action, rather than $O(N)$. (This is similar to the analysis of the randomized marking algorithm in competitive analysis.)

Randomized Greedy (RG) Algorithm
  Initially: $p^1_i = 1/N$ for $i \in X$.
  At time $t$: Let $L^{t-1}_{\min} = \min_{i\in X} L^{t-1}_i$, and $S^{t-1} = \{i : L^{t-1}_i = L^{t-1}_{\min}\}$.
  Let $p^t_i = 1/|S^{t-1}|$ for $i \in S^{t-1}$ and $p^t_i = 0$ otherwise.

Theorem 4.4 The Randomized Greedy (RG) algorithm, for any loss sequence, has
$$L^T_{RG} \le (\ln N) + (1 + \ln N) L^T_{\min}.$$

Proof The proof follows from showing that the loss incurred by Randomized Greedy between successive increases in $L^t_{\min}$ is at most $1 + \ln N$. Specifically, let $t_j$ denote the time step at which $L^t_{\min}$ first reaches a loss of $j$, so we are interested in the loss of Randomized Greedy between time steps $t_j$ and $t_{j+1}$. At any time $t$ we have $1 \le |S^t| \le N$. Furthermore, if at time $t \in (t_j, t_{j+1}]$ the size of $S^t$ shrinks by $k$ from some size $n'$ down to $n' - k$, then the loss of the online algorithm RG is $k/n'$, since each such action has weight $1/n'$. Finally, notice that we can upper bound $k/n'$ by $1/n' + 1/(n'-1) + \cdots + 1/(n'-k+1)$. Therefore, over the entire time interval $(t_j, t_{j+1}]$, the loss of Randomized Greedy is at most
$$1/N + 1/(N-1) + 1/(N-2) + \cdots + 1/1 \le 1 + \ln N.$$
More formally, this shows inductively that $L^t_{RG} \le (1/N + 1/(N-1) + \cdots + 1/(|S^t|+1)) + (1 + \ln N) L^t_{\min}$.
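Randomized Greedy is short enough to state in code. The following is a minimal Python sketch (the function name and test harness are ours, not part of the chapter): it spreads probability uniformly over the set of currently-best actions, and by Theorem 4.4 its expected loss on any binary loss sequence is at most $\ln N + (1+\ln N)L^T_{\min}$.

```python
import math
import random

def randomized_greedy(loss_vectors):
    """Play uniformly over all actions whose cumulative loss so far is
    minimal; return the algorithm's expected cumulative loss."""
    n = len(loss_vectors[0])
    cum = [0] * n                 # cumulative loss L^{t-1}_i of each action
    expected_loss = 0.0
    for losses in loss_vectors:
        lmin = min(cum)
        best = [i for i in range(n) if cum[i] == lmin]   # the set S^{t-1}
        expected_loss += sum(losses[i] for i in best) / len(best)
        cum = [c + l for c, l in zip(cum, losses)]
    return expected_loss

# demo: random binary losses for N = 5 actions over T = 300 steps
random.seed(1)
T, N = 300, 5
seq = [[random.randint(0, 1) for _ in range(N)] for _ in range(T)]
alg_loss = randomized_greedy(seq)
best_action_loss = min(sum(row[i] for row in seq) for i in range(N))
```

On any binary sequence the bound of Theorem 4.4 holds deterministically for the expected loss, which the demo confirms.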
Randomized Weighted Majority algorithm

Although Randomized Greedy achieved a significant performance gain compared to the Greedy algorithm, we still have a logarithmic ratio to the best action. Looking more closely at the proof, one can see that the losses are greatest when the sets $S^t$ are small, since the online loss can be viewed as proportional to $1/|S^t|$. One way to overcome this weakness is to give some weight to actions which are currently "near best". That is, we would like the probability mass on some action to decay gracefully with its distance to optimality. This is the idea of the Randomized Weighted Majority algorithm of Littlestone and Warmuth.

Specifically, in the Randomized Weighted Majority algorithm, we give an action $i$ whose total loss so far is $L_i$ a weight $w_i = (1-\eta)^{L_i}$, and then choose probabilities proportional to the weights: $p_i = w_i / \sum_{j=1}^N w_j$. The parameter $\eta$ will be set to optimize certain tradeoffs, but conceptually think of it as a small constant, say 0.01. In this section we will again assume losses in $\{0,1\}$ rather than $[0,1]$ because it allows for an especially intuitive interpretation of the proof (Theorem 4.5). We then relax this assumption in the next section (Theorem 4.6).

Randomized Weighted Majority (RWM) Algorithm
  Initially: $w^1_i = 1$ and $p^1_i = 1/N$, for $i \in X$.
  At time $t$: If $\ell^{t-1}_i = 1$, let $w^t_i = w^{t-1}_i (1-\eta)$; else ($\ell^{t-1}_i = 0$) let $w^t_i = w^{t-1}_i$.
  Let $p^t_i = w^t_i / W^t$, where $W^t = \sum_{i\in X} w^t_i$.

Algorithm RWM and Theorem 4.5 can be generalized to losses in $[0,1]$ by replacing the update rule with $w^t_i = w^{t-1}_i (1-\eta)^{\ell^{t-1}_i}$ (see Exercise 3).

Theorem 4.5 For $\eta \le 1/2$, the loss of Randomized Weighted Majority (RWM) on any sequence of binary $\{0,1\}$ losses satisfies
$$L^T_{RWM} \le (1+\eta) L^T_{\min} + \frac{\ln N}{\eta}.$$
Setting $\eta = \min\{\sqrt{(\ln N)/T},\, 1/2\}$ yields $L^T_{RWM} \le L^T_{\min} + 2\sqrt{T \ln N}$.
(Note: the second part of the theorem assumes $T$ is known in advance. If $T$ is unknown, then a "guess and double" approach can be used to set $\eta$ with just a constant-factor loss in regret. In fact, one can achieve the potentially better bound $L^T_{RWM} \le L^T_{\min} + 2\sqrt{L_{\min} \ln N}$ by setting $\eta = \min\{\sqrt{(\ln N)/L_{\min}},\, 1/2\}$.)

Proof The key to the proof is to consider the total weight $W^t$. What we will show is that any time the online algorithm has significant expected loss, the total weight must drop substantially. We will then combine this with the fact that $W^{T+1} \ge \max_i w^{T+1}_i = (1-\eta)^{L^T_{\min}}$ to achieve the desired bound.

Specifically, let $F^t = (\sum_{i : \ell^t_i = 1} w^t_i)/W^t$ denote the fraction of the weight $W^t$ that is on actions that experience a loss of 1 at time $t$; so, $F^t$ equals the expected loss of algorithm RWM at time $t$. Now, each of the actions experiencing a loss of 1 has its weight multiplied by $(1-\eta)$ while the rest are unchanged. Therefore, $W^{t+1} = W^t - \eta F^t W^t = W^t(1 - \eta F^t)$. In other words, the proportion of the weight removed from the system at each time $t$ is exactly proportional to the expected loss of the online algorithm.

Now, using the fact that $W^1 = N$ and using our lower bound on $W^{T+1}$, we have
$$(1-\eta)^{L^T_{\min}} \le W^{T+1} = W^1 \prod_{t=1}^T (1-\eta F^t) = N \prod_{t=1}^T (1-\eta F^t).$$
Taking logarithms,
$$L^T_{\min} \ln(1-\eta) \le (\ln N) + \sum_{t=1}^T \ln(1-\eta F^t) \le (\ln N) - \eta \sum_{t=1}^T F^t = (\ln N) - \eta L^T_{RWM},$$
using the inequality $\ln(1-z) \le -z$ and the definition of $F^t$. Therefore,
$$L^T_{RWM} \le \frac{-L^T_{\min}\ln(1-\eta)}{\eta} + \frac{\ln N}{\eta} \le (1+\eta)L^T_{\min} + \frac{\ln N}{\eta},$$
using the inequality $-\ln(1-z) \le z + z^2$ for $0 \le z \le \frac{1}{2}$, which completes the proof.
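The update and the guarantee of Theorem 4.5 are easy to check numerically. Below is a minimal Python sketch of RWM for binary losses (the harness and names are ours); the demo verifies the bound $L^T_{RWM} \le (1+\eta)L^T_{\min} + (\ln N)/\eta$ on a random binary sequence.

```python
import math
import random

def rwm(loss_vectors, eta):
    """Randomized Weighted Majority for binary losses: an action with loss 1
    at time t has its weight multiplied by (1 - eta); play proportionally to
    the weights. Returns the algorithm's expected cumulative loss."""
    n = len(loss_vectors[0])
    w = [1.0] * n
    expected_loss = 0.0
    for losses in loss_vectors:
        total_w = sum(w)
        p = [wi / total_w for wi in w]
        expected_loss += sum(pi * li for pi, li in zip(p, losses))  # F^t
        w = [wi * (1 - eta) if li == 1 else wi for wi, li in zip(w, losses)]
    return expected_loss

# demo: random binary losses, eta tuned as in Theorem 4.5
random.seed(0)
T, N = 400, 8
seq = [[random.randint(0, 1) for _ in range(N)] for _ in range(T)]
eta = min(math.sqrt(math.log(N) / T), 0.5)
alg_loss = rwm(seq, eta)
best_action_loss = min(sum(row[i] for row in seq) for i in range(N))
```

Both forms of the bound in Theorem 4.5 hold deterministically for the expected loss on any binary sequence.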
Polynomial Weights algorithm

The Polynomial Weights (PW) algorithm is a natural extension of the RWM algorithm to losses in $[0,1]$ (or even to the case of both losses and gains, see Exercise 4) that maintains the same proof structure as that used for RWM, and in addition performs especially well in the case of small losses.

Polynomial Weights (PW) Algorithm
  Initially: $w^1_i = 1$ and $p^1_i = 1/N$, for $i \in X$.
  At time $t$: Let $w^t_i = w^{t-1}_i (1 - \eta \ell^{t-1}_i)$.
  Let $p^t_i = w^t_i / W^t$, where $W^t = \sum_{i\in X} w^t_i$.

Notice that the only difference between PW and RWM is in the update step. In particular, it is no longer necessarily the case that an action of total loss $L$ has weight $(1-\eta)^L$. However, what is maintained is the property that if the algorithm's loss at time $t$ is $F^t$, then exactly an $\eta F^t$ fraction of the total weight is removed from the system. Specifically, from the update rule we have $W^{t+1} = W^t - \eta\sum_i w^t_i \ell^t_i = W^t(1 - \eta F^t)$, where $F^t = (\sum_i w^t_i \ell^t_i)/W^t$ is the loss of PW at time $t$. We can use this fact to prove the following:

Theorem 4.6 The Polynomial Weights (PW) algorithm, using $\eta \le 1/2$, for any $[0,1]$-valued loss sequence and for any $k$, has
$$L^T_{PW} \le L^T_k + \eta Q^T_k + \frac{\ln N}{\eta},$$
where $Q^T_k = \sum_{t=1}^T (\ell^t_k)^2$. Setting $\eta = \min\{\sqrt{(\ln N)/T},\, 1/2\}$ and noting that $Q^T_k \le T$, we have $L^T_{PW} \le L^T_{\min} + 2\sqrt{T\ln N}$.†

† Again, for simplicity we assume that the number of time steps $T$ is given as a parameter to the algorithm; otherwise one can use a "guess and double" method to set $\eta$.

Proof As noted above, we have $W^{t+1} = W^t(1-\eta F^t)$, where $F^t$ is PW's loss at time $t$. So, as with the analysis of RWM, we have $W^{T+1} = N\prod_{t=1}^T (1-\eta F^t)$ and therefore
$$\ln W^{T+1} = \ln N + \sum_{t=1}^T \ln(1-\eta F^t) \le \ln N - \eta\sum_{t=1}^T F^t = \ln N - \eta L^T_{PW}.$$
Now for the lower bound, we have
$$\ln W^{T+1} \ge \ln w^{T+1}_k = \sum_{t=1}^T \ln(1-\eta\ell^t_k) \ge -\eta\sum_{t=1}^T \ell^t_k - \eta^2\sum_{t=1}^T (\ell^t_k)^2 = -\eta L^T_k - \eta^2 Q^T_k,$$
where the first step uses the recursive definition of the weights and the second uses the inequality $\ln(1-z) \ge -z - z^2$ for $0 \le z \le \frac{1}{2}$. Combining the upper and lower bounds on $\ln W^{T+1}$, we have
$$-\eta L^T_k - \eta^2 Q^T_k \le \ln N - \eta L^T_{PW},$$
which yields the theorem.
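The only change from the RWM sketch is the update line, which now uses real-valued losses. A minimal Python sketch (harness and names are ours), with the demo checking the per-action bound $L^T_{PW} \le L^T_k + \eta Q^T_k + (\ln N)/\eta$ of Theorem 4.6:

```python
import math
import random

def polynomial_weights(loss_vectors, eta):
    """Polynomial Weights: multiplicative update w_i <- w_i * (1 - eta*loss_i)
    for losses in [0,1]; play proportionally to the weights.
    Returns the algorithm's expected cumulative loss."""
    n = len(loss_vectors[0])
    w = [1.0] * n
    expected_loss = 0.0
    for losses in loss_vectors:
        total_w = sum(w)
        p = [wi / total_w for wi in w]
        expected_loss += sum(pi * li for pi, li in zip(p, losses))
        w = [wi * (1 - eta * li) for wi, li in zip(w, losses)]
    return expected_loss

# demo: real-valued losses drawn uniformly from [0,1]
random.seed(2)
T, N = 500, 6
seq = [[random.random() for _ in range(N)] for _ in range(T)]
eta = min(math.sqrt(math.log(N) / T), 0.5)
alg_loss = polynomial_weights(seq, eta)
cum = [sum(row[i] for row in seq) for i in range(N)]          # L^T_k
quad = [sum(row[i] ** 2 for row in seq) for i in range(N)]    # Q^T_k
```

Note how the refined bound $L^T_k + \eta Q^T_k + (\ln N)/\eta$ rewards actions with small (hence small-squared) losses.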
Lower Bounds

An obvious question is whether one can significantly improve the bound in Theorem 4.6. We will show two simple results that imply that the regret bound is near optimal (see Exercise 5 for a better lower bound). The first result shows that one cannot hope to get sublinear regret when $T$ is small compared to $\log N$, and the second shows that one cannot hope to achieve regret $o(\sqrt{T})$ even when $N = 2$.

Theorem 4.7 Consider $T \le \log_2 N$. There exists a stochastic generation of losses such that, for any online algorithm R1, we have $E[L^T_{R1}] \ge T/2$ and yet $L^T_{\min} = 0$.

Proof Consider the following sequence of losses. At time $t = 1$, a random subset of $N/2$ actions get a loss of 0 and the rest get a loss of 1. At time $t = 2$, a random subset of $N/4$ of the actions that had loss 0 at time $t = 1$ get a loss of 0, and the rest (including actions that had a loss of 1 at time 1) get a loss of 1. This process repeats: at each time step, a random subset of half of the actions that have received loss 0 so far get a loss of 0, while all the rest get a loss of 1. Any online algorithm incurs an expected loss of at least 1/2 at each time step, because at each time step $t$ the expected fraction of probability mass $p^t_i$ on actions that receive a loss of 0 is at most 1/2. Yet, for $T \le \log_2 N$ there will always be some action with total loss of 0.

Theorem 4.8 Consider $N = 2$. There exists a stochastic generation of losses such that, for any online algorithm R2, we have $E[L^T_{R2} - L^T_{\min}] = \Omega(\sqrt{T})$.

Proof At time $t$, we flip a fair coin and set $\ell^t = z_1 = (0,1)$ with probability 1/2 and $\ell^t = z_2 = (1,0)$ with probability 1/2. For any distribution $p^t$ the expected loss at time $t$ is exactly 1/2. Therefore any online algorithm R2 has expected loss of $T/2$. Given a sequence of $T$ such losses, with $T/2 + y$ losses $z_1$ and $T/2 - y$ losses $z_2$, we have $T/2 - L^T_{\min} = |y|$. It remains to lower bound $E[|y|]$. Note that the probability of $y$ is $\binom{T}{T/2+y}/2^T$, which is upper bounded by $O(1/\sqrt{T})$ (using a Stirling approximation). This implies that with a constant probability we have $|y| = \Omega(\sqrt{T})$, which completes the proof.
4.4 Regret minimization and game theory

In this section we outline the connection between regret minimization and central concepts in game theory. We start by showing that in a two player constant sum game, a player with external regret sublinear in $T$ will have an average payoff that is at least the value of the game, minus a vanishing error term. For a general game, we will see that if all the players use procedures with sublinear swap regret, then they will converge to an approximate correlated equilibrium. We also show that for a player who minimizes swap regret, the frequency of playing dominated actions is vanishing.

Game theoretic model

We start with the standard definitions of a game (see also Chapter 1). A game $G = \langle M, (X_i), (s_i)\rangle$ has a finite set $M$ of $m$ players. Player $i$ has a set $X_i$ of $N$ actions and a loss function $s_i : X_i \times (\times_{j\neq i} X_j) \to [0,1]$ that maps the action of player $i$ and the actions of the other players to a real number. (We have scaled losses to $[0,1]$.) The joint action space is $X = \times_i X_i$.

We consider a player $i$ that plays a game $G$ for $T$ time steps using an online procedure ON. At time step $t$, player $i$ plays a distribution (mixed action) $P^t_i$, while the other players play the joint distribution $P^t_{-i}$. We denote by $\ell^t_{ON}$ the loss of player $i$ at time $t$, i.e., $E_{x^t\sim P^t}[s_i(x^t)]$, and its cumulative loss is $L^T_{ON} = \sum_{t=1}^T \ell^t_{ON}$.† It is natural to define, for player $i$ at time $t$, the loss vector as $\ell^t = (\ell^t_1,\ldots,\ell^t_N)$, where $\ell^t_j = E_{x^t_{-i}\sim P^t_{-i}}[s_i(x^t_j, x^t_{-i})]$. Namely, $\ell^t_j$ is the loss player $i$ would have observed if at time $t$ it had played action $x_j$. The cumulative loss of action $x_j \in X_i$ of player $i$ is $L^T_j = \sum_{t=1}^T \ell^t_j$, and $L^T_{\min} = \min_j L^T_j$.

† Alternatively, we could consider $x^t_i$ as a random variable distributed according to $P^t_i$, and similarly discuss the expected loss. We prefer the above presentation for consistency with the rest of the chapter.
Constant sum games and external regret minimization

A two player constant sum game $G = \langle \{1,2\}, (X_i), (s_i)\rangle$ has the property that for some constant $c$, for every $x_1 \in X_1$ and $x_2 \in X_2$ we have $s_1(x_1,x_2) + s_2(x_1,x_2) = c$. It is well known that any constant sum game has a well defined value $(v_1, v_2)$ for the game, and player $i \in \{1,2\}$ has a mixed strategy which guarantees that its expected loss is at most $v_i$, regardless of the other player's strategy. (See [Owe82] for more details.) In such games, external regret-minimization procedures provide the following guarantee:

Theorem 4.9 Let $G$ be a constant sum game with game value $(v_1,v_2)$. If player $i \in \{1,2\}$ plays for $T$ steps using a procedure ON with external regret $R$, then its average loss $\frac{1}{T} L^T_{ON}$ is at most $v_i + R/T$.

Proof Let $q$ be the mixed strategy corresponding to the observed frequencies of the actions player 2 has played; that is, $q_j = \sum_{t=1}^T P^t_{2,j}/T$, where $P^t_{2,j}$ is the weight player 2 gives to action $j$ at time $t$. By the theory of constant sum games, for any mixed strategy $q$ of player 2, player 1 has some action $x_k \in X_1$ such that $E_{x_2\sim q}[s_1(x_k,x_2)] \le v_1$ (see [Owe82]). This implies, in our setting, that if player 1 had always played action $x_k$, then its loss would be at most $v_1 T$. Therefore $L^T_{\min} \le L^T_k \le v_1 T$. Now, using the fact that player 1 is playing a procedure ON with external regret $R$, we have that
$$L^T_{ON} \le L^T_{\min} + R \le v_1 T + R.$$

Thus, using a procedure with regret $R = O(\sqrt{T\log N})$ as in Theorem 4.6 will guarantee average loss at most $v_i + O(\sqrt{(\log N)/T})$.

In fact, we can use the existence of external regret minimization algorithms to prove the minimax theorem of two-player zero-sum games. For player 1, let $v^1_{\min} = \max_{z\in\Delta(X_2)} \min_{x_1\in X_1} E_{x_2\sim z}[s_1(x_1,x_2)]$ and $v^1_{\max} = \min_{z\in\Delta(X_1)} \max_{x_2\in X_2} E_{x_1\sim z}[s_1(x_1,x_2)]$. That is, $v^1_{\min}$ is the best loss that player 1 can guarantee for itself if it is told the mixed action of player 2 in advance. Similarly, $v^1_{\max}$ is the best loss that player 1 can guarantee to itself if it has to go first in selecting a mixed action, and player 2's action may then depend on it. The minimax theorem states that $v^1_{\min} = v^1_{\max}$. Since $s_1(x_1,x_2) = -s_2(x_1,x_2)$, we can similarly define $v^2_{\min} = -v^1_{\max}$ and $v^2_{\max} = -v^1_{\min}$.

In the following we give a proof of the minimax theorem based on the existence of external regret algorithms. Assume for contradiction that $v^1_{\max} = v^1_{\min} + \gamma$ for some $\gamma > 0$ (it is easy to see that $v^1_{\max} \ge v^1_{\min}$). Consider both players playing a regret minimization algorithm for $T$ steps, having external regret of at most $R$, such that $R/T < \gamma/2$. Let $L_{ON}$ be the loss of player 1 and note that $-L_{ON}$ is the loss of player 2. Let $L^i_{\min}$ be the cumulative loss of the best action of player $i \in \{1,2\}$. As before, let $q^i$ be the mixed strategy corresponding to the observed frequencies of actions of player $i \in \{1,2\}$. Then $L^1_{\min} \le T v^1_{\min}$, since for $L^1_{\min}$ we select the best action with respect to a specific mixed action, namely $q^2$. Similarly, $L^2_{\min} \le T v^2_{\min}$. The regret minimization algorithms guarantee for player 1 that $L_{ON} \le L^1_{\min} + R$, and for player 2 that $-L_{ON} \le L^2_{\min} + R$. Combining the inequalities we have
$$T v^1_{\max} - R = -T v^2_{\min} - R \le -L^2_{\min} - R \le L_{ON} \le L^1_{\min} + R \le T v^1_{\min} + R.$$
This implies that $v^1_{\max} - v^1_{\min} \le 2R/T < \gamma$, which is a contradiction. Therefore $v^1_{\max} = v^1_{\min}$, which establishes the minimax theorem.

Correlated equilibrium and swap regret minimization

We first define the relevant modification rules and establish the connection between them and equilibrium notions. For $x_1, b_1, b_2 \in X_i$, let $\mathrm{switch}_i(x_1; b_1, b_2)$ be the following modification function of the action $x_1$ of player $i$:
$$\mathrm{switch}_i(x_1; b_1, b_2) = \begin{cases} b_2 & \text{if } x_1 = b_1 \\ x_1 & \text{otherwise.} \end{cases}$$
Given a modification function $f$ for player $i$, we can measure the regret of player $i$ with respect to $f$ as the decrease in its loss, i.e.,
$$\mathrm{regret}_i(x, f) = s_i(x) - s_i(f(x_i), x_{-i}).$$
For example, when we consider $f(x_1) = \mathrm{switch}_i(x_1; b_1, b_2)$, for fixed $b_1, b_2 \in X_i$, then $\mathrm{regret}_i(x, f)$ is measuring the regret player $i$ has for playing action $b_1$ rather than $b_2$, when the other players play $x_{-i}$.

A correlated equilibrium is a distribution $P$ over the joint action space with the following property. Imagine a correlating device draws a vector of actions $x \in X$ using distribution $P$ over $X$, and gives player $i$ the action $x_i$ from $x$. (Player $i$ is not given any other information regarding $x$.) The probability distribution $P$ is a correlated equilibrium if, for each player, it is a best response to play the suggested action, provided that the other players also do not deviate. (For a more detailed discussion of correlated equilibrium see Chapter 1.)

Definition 4.10 A joint probability distribution $P$ over $X$ is a correlated equilibrium if for every player $i$, and any actions $b_1, b_2 \in X_i$, we have that
$$E_{x\sim P}[\mathrm{regret}_i(x, \mathrm{switch}_i(\cdot; b_1, b_2))] \le 0.$$

An equivalent definition that extends more naturally to the case of approximate equilibria is to say that, rather than only switching between a pair of actions, we allow simultaneously replacing every action in $X_i$ with another action in $X_i$ (possibly the same action). A distribution $P$ is a correlated equilibrium iff for any function $F : X_i \to X_i$ we have $E_{x\sim P}[\mathrm{regret}_i(x, F)] \le 0$.

We now define an $\epsilon$-correlated equilibrium. An $\epsilon$-correlated equilibrium is a distribution $P$ such that each player has in expectation at most an $\epsilon$ incentive to deviate. Formally,

Definition 4.11 A joint probability distribution $P$ over $X$ is an $\epsilon$-correlated equilibrium if for every player $i$ and for any function $F_i : X_i \to X_i$, we have $E_{x\sim P}[\mathrm{regret}_i(x, F_i)] \le \epsilon$.

The following theorem relates the empirical distribution of the actions performed by each player, their swap regret, and the distance to correlated equilibrium.

Theorem 4.12 Let $G = \langle M, (X_i), (s_i)\rangle$ be a game and assume that for $T$ time steps every player follows a strategy that has swap regret of at most $R$. Then the empirical distribution $Q$ of the joint actions played by the players is an $(R/T)$-correlated equilibrium.

Proof The empirical distribution $Q$ assigns to every $P^t$ a probability of $1/T$. Fix a function $F : X_i \to X_i$ for player $i$. Since player $i$ has swap regret at most $R$, we have $L^T_{ON} \le L^T_{ON,F} + R$, where $L^T_{ON}$ is the loss of player $i$. By definition of the regret function, we therefore have
$$L^T_{ON} - L^T_{ON,F} = \sum_{t=1}^T E_{x^t\sim P^t}[s_i(x^t)] - \sum_{t=1}^T E_{x^t\sim P^t}[s_i(F(x^t_i), x^t_{-i})] = \sum_{t=1}^T E_{x^t\sim P^t}[\mathrm{regret}_i(x^t, F)] = T\cdot E_{x\sim Q}[\mathrm{regret}_i(x, F)].$$
Therefore, for any function $F_i : X_i \to X_i$ we have $E_{x\sim Q}[\mathrm{regret}_i(x, F_i)] \le R/T$.
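To illustrate the guarantee of Theorem 4.9 concretely, the sketch below (entirely ours: the harness, the fixed opponent strategy, and the function name are illustrative assumptions, not from the chapter) has player 1 run Polynomial Weights in Matching Pennies, a zero-sum game with value 1/2 in losses, against an opponent playing a fixed, suboptimal mixed strategy $q = (0.7, 0.3)$. By Theorem 4.9 the average loss is at most $1/2 + R/T$; here it in fact approaches the best-response loss $\min(q) = 0.3$.

```python
import math

def pw_vs_fixed_opponent(T, q=(0.7, 0.3)):
    """Player 1 runs Polynomial Weights in Matching Pennies against an
    opponent playing the fixed mixed strategy q. Player 1 loses (loss 1)
    when the actions match, so action i's expected loss per step is q[i].
    Returns player 1's average expected loss over T steps."""
    eta = min(math.sqrt(math.log(2) / T), 0.5)
    w = [1.0, 1.0]
    cum_loss = 0.0
    for _ in range(T):
        p = [wi / sum(w) for wi in w]
        l = [q[0], q[1]]                     # expected loss of each action
        cum_loss += sum(pi * li for pi, li in zip(p, l))
        w = [wi * (1 - eta * li) for wi, li in zip(w, l)]
    return cum_loss / T

avg_loss = pw_vs_fixed_opponent(400)
```

Against this opponent the best fixed action has average loss 0.3, so the external regret guarantee pins the average loss to the interval $[0.3,\ 0.3 + 2\sqrt{(\ln 2)/T}]$, well below the game value 1/2.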
The above theorem states that the payoff of each player is its payoff in some approximate correlated equilibrium. In addition, it relates the swap regret to the distance from equilibrium. Note that if the average swap regret vanishes then the procedure converges, in the limit, to the set of correlated equilibria.

Dominated strategies

We say that an action $x_j \in X_i$ is $\epsilon$-dominated by action $x_k \in X_i$ if for any $x_{-i} \in X_{-i}$ we have $s_i(x_j, x_{-i}) \ge \epsilon + s_i(x_k, x_{-i})$. Similarly, action $x_j \in X_i$ is $\epsilon$-dominated by a mixed action $y \in \Delta(X_i)$ if for any $x_{-i} \in X_{-i}$ we have $s_i(x_j, x_{-i}) \ge \epsilon + E_{x_d\sim y}[s_i(x_d, x_{-i})]$.

Intuitively, a good learning algorithm ought to be able to learn not to play actions that are $\epsilon$-dominated by others, and in this section we show that indeed if player $i$ plays a procedure with sublinear swap regret, then it will very rarely play dominated actions. More precisely, let action $x_j$ be $\epsilon$-dominated by action $x_k \in X_i$. Using our notation, this implies that for any $x_{-i}$ we have that $\mathrm{regret}_i(x, \mathrm{switch}_i(\cdot; x_j, x_k)) \ge \epsilon$. Let $D$ be the set of $\epsilon$-dominated actions of player $i$, and let $w$ be the weight that player $i$ puts on actions in $D$, averaged over time, i.e., $w = \frac{1}{T}\sum_{t=1}^T \sum_{j\in D} P^t_{i,j}$. Player $i$'s swap regret is at least $w T \epsilon$ (since we could replace each action in $D$ with the action that dominates it). So, if the player's swap regret is $R$, then $w T \epsilon \le R$. Therefore, the time-average weight that player $i$ puts on the set of $\epsilon$-dominated actions is at most $R/(\epsilon T)$, which tends to 0 if $R$ is sublinear in $T$. That is:

Theorem 4.13 Consider a game $G$ and a player $i$ that uses a procedure of swap regret $R$ for $T$ time steps. Then the average weight that player $i$ puts on the set of $\epsilon$-dominated actions is at most $R/(\epsilon T)$.

We remark that in general the property of having low external regret is not sufficient by itself to give such a guarantee, though the algorithms RWM and PW do indeed have such a guarantee (see Exercise 8).

4.5 Generic reduction from swap to external regret

In this section we give a black-box reduction showing how any procedure $A$ achieving good external regret can be used as a subroutine to achieve good swap regret as well. The high-level idea is as follows (see also Fig. 4.1). We will instantiate $N$ copies $A_1,\ldots,A_N$ of the external-regret procedure. At each time step, these procedures will each give us a probability vector, which we will combine in a particular way to produce our own probability vector $p$. When we receive a loss vector $\ell$, we will partition it among the $N$ procedures, giving procedure $A_i$ a fraction $p_i$ ($p_i$ is our probability mass on action $i$), so that $A_i$'s belief about the loss of action $j$ is $\sum_t p^t_i \ell^t_j$, and matches the cost we would incur putting $i$'s probability mass on $j$. In the proof, procedure $A_i$ will in some sense be responsible for ensuring low regret of the $i \to j$ variety. The key to making this work is that we will be able to define the $p$'s so that the sum of the losses of the procedures $A_i$ on their own loss vectors matches our overall true loss.

[Fig. 4.1. The structure of the swap regret reduction.]

Recall the definition of an $R$ external regret procedure.

Definition 4.14 An $R$ external regret procedure $A$ guarantees that for any sequence of $T$ losses $\ell^t$ and for any action $j \in \{1,\ldots,N\}$, we have
$$L^T_A = \sum_{t=1}^T \ell^t_A \le \sum_{t=1}^T \ell^t_j + R = L^T_j + R.$$

We assume we have $N$ copies $A_1,\ldots,A_N$ of an $R$ external regret procedure. We combine the $N$ procedures into one master procedure $H$ as follows. At each time step $t$, each procedure $A_i$ outputs a distribution $q^t_i$, where $q^t_{i,j}$ is the fraction it assigns action $j$. We compute a single distribution $p^t$ such that $p^t_j = \sum_i p^t_i q^t_{i,j}$. That is, $p^t = p^t Q^t$, where $p^t$ is our distribution and $Q^t$ is the matrix of $q^t_{i,j}$. (We can view $p^t$ as a stationary distribution of the Markov process defined by $Q^t$, and it is well known that such a $p^t$ exists and is efficiently computable.) For intuition into this choice of $p^t$, notice that it implies we can consider action selection in two equivalent ways. The first is simply using the distribution $p^t$ to select action $j$ with probability $p^t_j$. The second is to select procedure $A_i$ with probability $p^t_i$ and then to use $A_i$ to select the action (which produces distribution $p^t Q^t$).

When the adversary returns the loss vector $\ell^t$, we return to each $A_i$ the loss vector $p^t_i \ell^t$. So, procedure $A_i$ experiences loss $(p^t_i \ell^t)\cdot q^t_i = p^t_i (q^t_i \cdot \ell^t)$. Since $A_i$ is an $R$ external regret procedure, for any action $j$ we have
$$\sum_{t=1}^T p^t_i (q^t_i \cdot \ell^t) \le \sum_{t=1}^T p^t_i \ell^t_j + R. \qquad (4.1)$$
If we sum the losses of the $N$ procedures at a given time $t$, we get $\sum_i p^t_i (q^t_i \cdot \ell^t) = p^t Q^t \ell^t$, where $p^t$ is the row-vector of our distribution, $Q^t$ is the matrix of $q^t_{i,j}$, and $\ell^t$ is viewed as a column-vector. By design of $p^t$, we have $p^t Q^t = p^t$. So, the sum of the perceived losses of the $N$ procedures is equal to our actual loss $p^t \ell^t$.

Therefore, summing equation (4.1) over all $N$ procedures, the left-hand side sums to $L^T_H$, where $H$ is our master online procedure. Since the right-hand side of equation (4.1) holds for any $j$, we have that for any function $F : \{1,\ldots,N\} \to \{1,\ldots,N\}$,
$$L^T_H \le \sum_{i=1}^N \sum_{t=1}^T p^t_i \ell^t_{F(i)} + N R = L^T_{H,F} + N R.$$
Therefore we have proven the following theorem.

Theorem 4.15 Given an $R$ external regret procedure, the master online procedure $H$ has the following guarantee. For every function $F : \{1,\ldots,N\} \to \{1,\ldots,N\}$,
$$L_H \le L_{H,F} + N R;$$
i.e., the swap regret of $H$ is at most $N R$.

Using Theorem 4.6 we can immediately derive the following corollary.

Corollary 4.16 There exists an online algorithm $H$ such that for every function $F : \{1,\ldots,N\} \to \{1,\ldots,N\}$, we have that
$$L_H \le L_{H,F} + O(N\sqrt{T\log N});$$
i.e., the swap regret of $H$ is at most $O(N\sqrt{T\log N})$.

Remark: See Exercise 6 for an improvement to $O(\sqrt{NT\log N})$.
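The reduction can be sketched compactly in Python. The sketch below (ours, not from the chapter) uses Polynomial Weights as the external-regret subroutine and approximates the stationary distribution $p = pQ$ by power iteration, an assumption made for simplicity; any method for computing a stationary distribution of the row-stochastic matrix $Q^t$ would do.

```python
import math

def swap_regret_master(loss_vectors, eta, power_iters=100):
    """Master procedure H: N copies of Polynomial Weights, where copy A_i
    is fed the loss vector scaled by p_i, and H plays an (approximate)
    stationary distribution p = p Q of the row-stochastic matrix Q whose
    rows are the copies' distributions. Returns H's expected total loss."""
    n = len(loss_vectors[0])
    W = [[1.0] * n for _ in range(n)]       # W[i] = weight vector of copy A_i
    total = 0.0
    for losses in loss_vectors:
        Q = [[wij / sum(row) for wij in row] for row in W]
        p = [1.0 / n] * n                   # approximate p = p Q
        for _ in range(power_iters):
            p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
        total += sum(p[j] * losses[j] for j in range(n))
        for i in range(n):                  # copy A_i sees the loss p_i * l^t
            W[i] = [W[i][j] * (1 - eta * p[i] * losses[j]) for j in range(n)]
    return total

# demo: action 0 is always best; since swap regret upper-bounds external
# regret, H's total loss should stay within N * R_PW of the best action
T, n = 400, 3
seq = [[0.0, 1.0, 1.0] for _ in range(T)]
eta = min(math.sqrt(math.log(n) / T), 0.5)
total = swap_regret_master(seq, eta)
```

Per Theorem 4.15 with the PW bound, the swap (and hence external) regret of the master is at most $N(\eta T + (\ln N)/\eta)$, up to the tiny power-iteration error.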
4.6 The Partial Information Model

In this section we show, for external regret, a simple reduction from the partial information to the full information model.† The main difference between the two models is that in the full information model, the online procedure has access to the loss of every action. In the partial information model the online procedure receives as feedback only the loss of a single action, the action it performed. This very naturally leads to an exploration versus exploitation tradeoff in the partial information model, and essentially any online procedure will have to somehow explore the various actions and estimate their loss.

† This reduction does not produce the best known bounds for the partial information model (see, e.g., [ACBFS02] for better bounds) but is particularly simple and generic.

The high-level idea of the reduction is as follows. Assume that the number of time steps $T$ is given as a parameter. We will partition the $T$ time steps into $K$ blocks. The procedure will use the same distribution over actions in all the time steps of any given block, except it will also randomly sample each action once (the exploration part). The partial information procedure MAB will pass to the full information procedure FIB the vector of losses received from its exploration steps. The full information procedure FIB will then return a new distribution over actions. The main part of the proof will be to relate the loss of the full information procedure FIB on the loss sequence it observes to the loss of the partial information procedure MAB on the real loss sequence.

We start by considering a full information procedure FIB that partitions the $T$ time steps into $K$ blocks, $B^1, \ldots, B^K$, where $B^\tau = \{(\tau-1)(T/K)+1, \ldots, \tau(T/K)\}$, and uses the same distribution in all the time steps of a block. (For simplicity we assume that $K$ divides $T$.) Consider an $R_K$ external regret minimization procedure FIB (over $K$ time steps), which at the end of block $\tau$ updates the distribution using the average loss vector, i.e., $c^\tau = \sum_{t \in B^\tau} \ell^t / |B^\tau|$. Let $C_i^K = \sum_{\tau=1}^K c_i^\tau$ and $C_{\min}^K = \min_i C_i^K$. Since FIB has external regret at most $R_K$, this implies that the loss of FIB, over the loss sequence $c^\tau$, is at most $C_{\min}^K + R_K$. Since in every block $B^\tau$ the procedure FIB uses a single distribution $p^\tau$, its loss on the entire loss sequence is

    $L_{FIB}^T = \sum_{\tau=1}^K \sum_{t \in B^\tau} p^\tau \cdot \ell^t = \frac{T}{K} \sum_{\tau=1}^K p^\tau \cdot c^\tau \le \frac{T}{K}\left[C_{\min}^K + R_K\right].$

At this point it is worth noting that if $R_K = O(\sqrt{K \log N})$ the overall regret is $O((T/\sqrt{K})\sqrt{\log N})$, which is minimized at $K = T$, namely by having each block be a single time step. However, we will have an additional loss associated with each block (due to the sampling) which will cause the optimization to require that $K \ll T$.

The next step in developing the partial information procedure MAB is to use loss vectors which are not the "true average" but whose expectation is the same. More formally, the feedback to the full information procedure FIB will be a random variable vector $\hat{c}^\tau$ such that for any action $i$ we have $E[\hat{c}_i^\tau] = c_i^\tau$. Similarly, let $\hat{C}_i^K = \sum_{\tau=1}^K \hat{c}_i^\tau$ and $\hat{C}_{\min}^K = \min_i \hat{C}_i^K$. (Intuitively, we will generate the vector $\hat{c}^\tau$ using sampling within a block.) This implies that for any block $B^\tau$ and any distribution $p^\tau$ we have

    $\frac{1}{|B^\tau|} \sum_{t \in B^\tau} p^\tau \cdot \ell^t = p^\tau \cdot c^\tau = \sum_{i=1}^N p_i^\tau c_i^\tau = \sum_{i=1}^N p_i^\tau E[\hat{c}_i^\tau].$    (4.2)

That is, the loss of $p^\tau$ in $B^\tau$ is equal to its expected loss with respect to $\hat{c}^\tau$.

The full information procedure FIB observes the losses $\hat{c}^\tau$, for $\tau \in \{1, \ldots, K\}$. However, since the $\hat{c}^\tau$ are random variables, the distribution $p^\tau$ is also a random variable that depends on the previous losses, i.e., $\hat{c}^1, \ldots, \hat{c}^{\tau-1}$. Still, with respect to any sequence of losses $\hat{c}^\tau$, we have that

    $\hat{C}_{FIB}^K = \sum_{\tau=1}^K p^\tau \cdot \hat{c}^\tau \le \hat{C}_{\min}^K + R_K.$

Since $E[\hat{C}_i^K] = C_i^K$, this implies that

    $E[\hat{C}_{FIB}^K] \le E[\hat{C}_{\min}^K] + R_K \le C_{\min}^K + R_K,$

where we used the fact that $E[\min_i \hat{C}_i^K] \le \min_i E[\hat{C}_i^K]$, and the expectation is over the choices of $\hat{c}^\tau$.

Note that for any sequence of losses $\hat{c}^1, \ldots, \hat{c}^K$, both FIB and MAB will use the same sequence of distributions $p^1, \ldots, p^K$. From (4.2) we have that in any block $B^\tau$ the expected loss of FIB and the loss of MAB are the same, assuming they both use the same distribution $p^\tau$. This implies that $E[C_{MAB}^K] = E[\hat{C}_{FIB}^K]$.

We now need to show how to derive random variables $\hat{c}^\tau$ with the desired property. This will be done by choosing randomly, for each action $i$ and block $B^\tau$, an exploration time $t_i \in B^\tau$. (These do not need to be independent over the different actions, so this can easily be done without collisions.) At time $t_i$ the procedure MAB will play action $i$ (i.e., the probability vector with all probability mass on $i$). This implies that the feedback it receives will be $\ell_i^{t_i}$, and we will then set $\hat{c}_i^\tau$ to be $\ell_i^{t_i}$. This guarantees that $E[\hat{c}_i^\tau] = c_i^\tau$.
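The sampling construction just described can be sketched in code. This is a minimal sketch, not from the chapter: the names `make_hedge` and `bandit_play` are our own, and multiplicative weights is our choice for the FIB procedure, which the chapter leaves abstract.

```python
import math
import random

def make_hedge(n, eta=0.5):
    """Multiplicative weights as the full-information procedure FIB
    (any external-regret procedure could be plugged in instead)."""
    w = [1.0] * n
    def update(c_hat):  # one update per block, on the estimate vector
        nonlocal w
        w = [wi * math.exp(-eta * c) for wi, c in zip(w, c_hat)]
        s = sum(w)
        return [wi / s for wi in w]
    return update

def bandit_play(T, n, fib_update, loss_fn):
    """Block reduction MAB -> FIB: split T steps into K ~ (T/N)^(2/3)
    blocks; within each block play FIB's fixed distribution p, except at
    one random exploration step per action, whose observed loss becomes
    the estimate c_hat_i with E[c_hat_i] = c_i. Returns the total loss.
    loss_fn(t, i) is the adversary's loss for action i at time t."""
    K = max(1, round((T / n) ** (2 / 3)))
    B = T // K                       # block length; assumes n <= B
    p = [1.0 / n] * n
    total = 0.0
    for b in range(K):
        start = b * B
        times = random.sample(range(start, start + B), n)
        explore = {t: i for i, t in enumerate(times)}
        c_hat = [0.0] * n
        for t in range(start, start + B):
            if t in explore:         # exploration step: play action i
                i = explore[t]
                c_hat[i] = loss_fn(t, i)  # only this loss is observed
            else:                    # exploitation step: sample from p
                i = random.choices(range(n), weights=p)[0]
            total += loss_fn(t, i)
        p = fib_update(c_hat)        # full-information update on estimates
    return total
```

Against a fixed adversary this incurs the exploration cost of at most $N$ per block plus the exploitation loss driven by FIB, mirroring the accounting in the text.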
So far we have ignored the loss in the exploration steps. Since the maximum loss is 1, and there are $N$ exploration steps in each of the $K$ blocks, the total loss in all the exploration steps is at most $NK$. Therefore we have:

    $E[L_{MAB}^T] \le NK + \frac{T}{K} E[C_{MAB}^K] \le NK + \frac{T}{K}\left[C_{\min}^K + R_K\right] = L_{\min}^T + NK + \frac{T}{K} R_K.$

By Theorem 4.6, there are external regret procedures that have regret $R_K = O(\sqrt{K \log N})$. By setting $K = (T/N)^{2/3}$, for $T \ge N$, we have the following theorem.

Theorem 4.17  Given an $O(\sqrt{K \log N})$ external regret procedure FIB (for $K$ time steps), there is a partial information procedure MAB that guarantees

    $E[L_{MAB}^T] \le L_{\min}^T + O(T^{2/3} N^{1/3} \log N),$

where $T \ge N$.

4.7 On Convergence of Regret-Minimizing Strategies to Nash Equilibrium in Routing Games

As mentioned earlier, one natural setting for regret-minimizing algorithms is online routing. For example, a person could use such algorithms to select which of $N$ available routes to use to drive to work each morning in such a way that his performance will be nearly as good as the best fixed route in hindsight, even if traffic changes arbitrarily from day to day. In fact, even though in a graph $G$, the number of paths $N$ between two nodes may be exponential in the size of $G$, there are a number of external-regret minimizing algorithms whose running time and regret bounds are polynomial in the graph size. Moreover, a number of extensions have shown how these algorithms can be applied even to the partial-information setting where only the cost of the path traversed is revealed to the algorithm.

In this section we consider the game-theoretic properties of such algorithms in the Wardrop model of traffic flow. In this model, we have a directed network $G = (V, E)$, and one unit flow of traffic (a large population of infinitesimal users that we view as having one unit of volume) wanting to travel between two distinguished nodes $v_{start}$ and $v_{end}$. (For simplicity, we are considering just the single-commodity version of the model.) We assume each edge $e$ has a cost given by a latency function $\ell_e$ that is some non-decreasing function of the amount of traffic flowing on edge $e$. In other
words, the time to traverse each edge $e$ is a function of the amount of congestion on that edge. In particular, given some flow $f$, where we use $f_e$ to denote the amount of flow on a given edge $e$, the cost of some path $P$ is $\sum_{e \in P} \ell_e(f_e)$, and the average travel time of all users in the population can be written as $\sum_{e \in E} \ell_e(f_e) f_e$. A flow $f$ is at Nash equilibrium if all flow-carrying paths $P$ from $v_{start}$ to $v_{end}$ are minimum-latency paths given the flow $f$.

Chapter 18 considers this model in much more detail, analyzing the relationship between latencies in Nash equilibrium flows and those in globally optimum flows (flows that minimize the total travel time averaged over all users). In this section we describe results showing that if the users in such a setting are adapting their paths from day to day using external-regret minimizing algorithms (or even if they just happen to experience low regret, regardless of the specific algorithms used), then flow will approach Nash equilibrium. Note that a Nash equilibrium is precisely a set of static strategies that are all no-regret with respect to each other, so such a result seems natural; however, there are many simple games for which regret-minimizing algorithms do not approach Nash equilibrium and can even perform much worse than any Nash equilibrium.

Specifically, one can show that if each user has regret $o(T)$, or even if just the average regret (averaged over the users) is $o(T)$, then flow approaches Nash equilibrium in the sense that a $1-\epsilon$ fraction of days $t$ have the property that a $1-\epsilon$
fraction of the users that day experience travel time at most $\epsilon$ larger than the best path for that day, where $\epsilon$ approaches 0 at a rate that depends polynomially on the size of the graph, the regret bounds of the algorithms, and the maximum slope of any latency function. Note that this is a somewhat nonstandard notion of convergence to equilibrium: usually for an "$\epsilon$-approximate equilibrium" one requires that all participants have at most $\epsilon$ incentive to deviate. However, since low-regret algorithms are allowed to occasionally take long paths, and in fact algorithms in the MAB model must occasionally explore paths they have not tried in a long time (to avoid regret if the paths have become much better in the meantime), the multiple levels of hedging are actually necessary for a result of this kind.

In this section we present just a special case of this result. Let $\mathcal{P}$ denote the set of all simple paths from $v_{start}$ to $v_{end}$ and let $f^t$ denote the flow on day $t$. Let $C(f) = \sum_{e \in E} \ell_e(f_e) f_e$ denote the cost of a flow $f$. Note that $C(f)$ is a weighted average of costs of paths in $\mathcal{P}$ and in fact is equal to the average cost of all users in the flow $f$. Define a flow $f$ to be $\epsilon$-Nash if $C(f) \le \epsilon + \min_{P \in \mathcal{P}} \sum_{e \in P} \ell_e(f_e)$; that is, the average incentive to deviate over all users is at most $\epsilon$. Let $R(T)$ denote the average regret (averaged over users) up through day $T$, so

    $R(T) = \sum_{t=1}^T \sum_{e \in E} \ell_e(f_e^t) f_e^t - \min_{P \in \mathcal{P}} \sum_{t=1}^T \sum_{e \in P} \ell_e(f_e^t).$

Finally, let $T_\epsilon$ denote the number of time steps $T$ needed so that $R(T) \le \epsilon T$ for all $T \ge T_\epsilon$. For example, the RWM and PW algorithms discussed in Section 4.3 achieve $T_\epsilon = O(\frac{1}{\epsilon^2} \log N)$ if we set $\eta = \epsilon/2$. Then we will show:

Theorem 4.18  Suppose the latency functions $\ell_e$ are linear. Then for $T \ge T_\epsilon$, the time-average flow $\hat{f} = \frac{1}{T}(f^1 + \cdots + f^T)$ is $\epsilon$-Nash.

Proof  From the linearity of the latency functions, we have for all $e$, $\ell_e(\hat{f}_e) = \frac{1}{T} \sum_{t=1}^T \ell_e(f_e^t)$. Since $\ell_e(f_e^t) f_e^t$ is a convex function of the flow, this implies

    $\ell_e(\hat{f}_e) \hat{f}_e \le \frac{1}{T} \sum_{t=1}^T \ell_e(f_e^t) f_e^t.$

Summing over all $e$, we have

    $C(\hat{f}) \le \frac{1}{T} \sum_{t=1}^T C(f^t) \le \epsilon + \min_P \frac{1}{T} \sum_{t=1}^T \sum_{e \in P} \ell_e(f_e^t)$    (by definition of $T_\epsilon$)
    $= \epsilon + \min_P \sum_{e \in P} \ell_e(\hat{f}_e).$    (by linearity)
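As a concrete illustration of Theorem 4.18 (not from the chapter), one can simulate a toy two-link network with linear latencies $\ell_1(f) = f$ and $\ell_2(f) = 0.5 + 0.5f$, whose Wardrop equilibrium splits the flow $2/3$ versus $1/3$. The simplifying assumption here is ours: a single multiplicative-weights learner stands in for the whole population of infinitesimal users, so its distribution over routes is the day's flow split.

```python
import math

# Two parallel edges with linear latencies (as in Theorem 4.18):
#   l1(f) = f   and   l2(f) = 0.5 + 0.5 f.
# Wardrop equilibrium equalizes latencies: f1 = 0.5 + 0.5(1 - f1),
# so f1 = 2/3 and both routes have latency 2/3.
latency = [lambda f: f, lambda f: 0.5 + 0.5 * f]

def average_flow(T=3000, eta=0.05):
    """One multiplicative-weights learner represents the population:
    its distribution over the two routes is the flow split f^t on day t.
    Returns the time-average flow (f^1 + ... + f^T) / T."""
    w = [1.0, 1.0]
    avg = [0.0, 0.0]
    for _ in range(T):
        s = sum(w)
        f = [wi / s for wi in w]                     # today's flow split
        cost = [latency[e](f[e]) for e in range(2)]  # today's latencies
        w = [wi * math.exp(-eta * c) for wi, c in zip(w, cost)]
        avg = [a + fe / T for a, fe in zip(avg, f)]
    return avg
```

Under these assumptions the time-average flow lands near the equilibrium split $(2/3, 1/3)$, consistent with the theorem's guarantee that $\hat{f}$ is approximately Nash.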
This result shows that the time-average flow is an approximate Nash equilibrium. This can then be used to prove that most of the $f^t$ must in fact be approximately Nash. The key idea here is that if the cost of any edge were to fluctuate wildly over time, then that would imply that most of the users of that edge experienced latency substantially greater than the edge's average cost (because more users are using the edge when it is congested than when it is not congested), which in turn implies they experience substantial regret. These arguments can then be carried over to the case of general (non-linear) latency functions.

Current Research Directions

In this section we sketch some current research directions with respect to regret minimization.

Refined Regret Bounds: The regret bounds that we presented depend on the number of time steps $T$, and are independent of the performance of the best action. Such bounds are also called zero-order bounds. More refined first-order bounds depend on the loss of the best action, and second-order bounds depend on the sum of squares of the losses (such as $Q_k^T$ in Theorem 4.6). An interesting open problem is to get an external regret which is proportional to the empirical variance of the best action. Another challenge is to reduce the prior information needed by the regret minimization algorithm. Ideally, it should be able to learn and adapt to parameters such as the maximum and minimum loss. See [CBMS05] for a detailed discussion of those issues.

Large action spaces: In this chapter we assumed the number of actions $N$ is small enough to be able to list them all, and our algorithms work in time proportional to $N$. However, in many settings $N$ is exponential in the natural parameters of the problem. For example, the $N$ actions might be all simple paths between two nodes $s$ and $t$ in an $n$-node graph, or all binary search trees on $\{1, \ldots, n\}$. Since the full information external regret bounds are only logarithmic in $N$, from the point of view of information, we can derive polynomial regret bounds. The challenge is whether in such settings we can produce computationally efficient algorithms. There have recently been several results able to handle broad classes of problems of this type. Kalai and Vempala [KV03] give an efficient algorithm for any problem in which (a) the set $X$ of actions can be viewed as a subset of $\mathbb{R}^n$, (b) the loss vectors $\ell$ are linear functions over $\mathbb{R}^n$ (so the loss of action $x$ is $\ell \cdot x$), and (c) we can efficiently solve the offline optimization problem $\arg\min_{x \in X} [x \cdot \ell]$ for any given loss vector $\ell$. For instance, this setting can model the path and search-tree examples above.† Zinkevich [Zin03] extends this to convex loss functions with a projection oracle, and there is substantial interest in trying to broaden the class of settings that efficient regret-minimization algorithms can be applied to.

† The case of search trees has the additional issue that there is a rotation cost associated with using a different action (tree) at time $t+1$ than that used at time $t$. This is addressed in [KV03] as well.

Dynamics: It is also very interesting to analyze the dynamics of regret minimization algorithms. The classical example is that of swap regret: when all the players play swap regret minimization algorithms, the empirical distribution of play converges to the set of correlated equilibria (Section 4.4). We also saw convergence in two-player zero-sum games to the minimax value of the game (Section 4.4), and convergence to Nash equilibrium in a Wardrop-model routing game (Section 4.7). Further results on convergence to equilibria in other settings would be of substantial interest. At a high level, understanding the dynamics of regret minimization algorithms would allow us to better understand the strengths and weaknesses of using such procedures. For more information on learning in games, see the book [FL98].

Exercises

4.1  Show that swap regret is at most $N$ times larger than internal regret.
4.2  Show an example (even with $N = 3$) where the ratio between the external and swap regret is unbounded.
4.3  Show that the RWM algorithm with update rule $w_i^t = w_i^{t-1}(1-\eta)^{\ell_i^{t-1}}$ achieves the same external regret bound as given in Theorem 4.6 for the PW algorithm, for losses in $[0, 1]$.
4.4  Consider a setting where the payoffs are in the range $[-1, +1]$, and the goal of the algorithm is to maximize its payoff. Derive a modified PW algorithm whose external regret is $O(\sqrt{Q_{\max}^T \log N} + \log N)$, where $Q_{\max}^T \ge Q_k^T$ for every action $k$.
4.5  Show a $\Omega(\sqrt{T \log N})$ lower bound on external regret, for the case that $T \ge N$.
4.6  Improve the swap regret bound to $O(\sqrt{NT \log N})$. Hint: use the observation that the sum of the losses of all the $A_i$ is bounded by $T$.
4.7  (Open Problem) Does there exist an $\Omega(\sqrt{TN \log N})$ lower bound for swap regret?
4.8  Show that if a player plays algorithm RWM (or PW) then it gives $\epsilon$-dominated actions small weight. Also, show that there are cases where the external regret of a player can be small, yet it gives $\epsilon$-dominated actions high weight.

Notes

Hannan [Han57] was the first to develop algorithms with external regret sublinear in $T$. Later, motivated by machine learning settings in which $N$ can be quite large, algorithms that furthermore have only a logarithmic dependence on $N$ were developed in [LW94, FS97, FS99, CBFH+97]. In particular, the Randomized Weighted Majority algorithm and Theorem 4.5 are from [LW94], and the Polynomial Weights algorithm and Theorem 4.6 are from [CBMS05]. Computationally efficient algorithms for generic frameworks that model many settings in which $N$ may be exponential in the natural problem description (such as considering all $s$-$t$ paths in a graph or all binary search trees on $n$ elements) were developed in [KV03, Zin03].

The notion of internal regret and its connection to correlated equilibrium appear in [FV98, HMC00], and more general modification rules were considered in [Leh03]. A number of specific low internal regret algorithms were developed by [FV97, FV98, FV99, HMC00, CBL03, BM05, SL05]. The reduction in Section 4.5 from external to swap regret is from [BM05]. Algorithms with strong external regret bounds for the partial information model are given in [ACBFS02], and algorithms with low internal regret appear in [BM05, CBLS06]. The reduction from full information to partial information in Section 4.6 is in the spirit of algorithms of [AM03, AK04]. Extensions of the algorithm of [KV03] to the partial information setting appear in [AK04, MB04, DH06]. The results in Section 4.7 on approaching Nash equilibria in routing games are from [BEL06].

Bibliography

[ACBFS02] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[AK04] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In STOC, pages 45-53, 2004.
[AM03] Baruch Awerbuch and Yishay Mansour. Adapting to a reliable network path. In PODC, pages 360-367, 2003.
[BEL06] Avrim Blum, Eyal Even-Dar, and Katrina Ligett. Routing without
regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. In PODC, 2006.
[BEY98] Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.
[BM05] Avrim Blum and Yishay Mansour. From external to internal regret. In COLT, 2005.
[CBFH+97] Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
[CBL03] Nicolo Cesa-Bianchi and Gabor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239-261, 2003.
[CBL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.
[CBLS06] Nicolo Cesa-Bianchi, Gabor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research (to appear), 2006.
[CBMS05] Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. In COLT, 2005.
[DH06] Varsha Dani and Thomas P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937-943, 2006.
[FL98] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.
[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1):119-139, 1997.
[FS99] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103, 1999.
[FV97] D. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40-55, 1997.
[FV98] D. Foster and R. Vohra. Asymptotic calibration. Biometrika, 85:379-390, 1998.
[FV99] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7-36, 1999.
[Han57] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97-139. Princeton University Press, 1957.
[HMC00] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127-1150, 2000.
[KV03] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. In COLT, pages 26-40, 2003.
[Leh03] E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42:101-115, 2003.
[LW94] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.
[MB04] H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proc. 17th Annual Conference on Learning Theory (COLT), pages 109-123, 2004.
[Owe82] Guillermo Owen. Game Theory. Academic Press, 1982.
[SL05] Gilles Stoltz and Gabor Lugosi. Internal regret in on-line portfolio selection. Machine Learning Journal, 59:125-159, 2005.
[ST85] D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28:202-208, 1985.
[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. ICML, pages 928-936, 2003.