
4  Learning, Regret Minimization, and Equilibria

A. Blum and Y. Mansour

Abstract

Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive to work each day, or repeated play of a game against an opponent with an unknown strategy. In this chapter we describe learning algorithms with strong guarantees for settings of this type, along with connections to game-theoretic equilibria when all players in a system are simultaneously adapting in such a manner.

We begin by presenting algorithms for repeated play of a matrix game with the guarantee that against any opponent, they will perform nearly as well as the best fixed action in hindsight (also called the problem of combining expert advice or minimizing external regret). In a zero-sum game, such algorithms are guaranteed to approach or exceed the minimax value of the game, and even provide a simple proof of the minimax theorem. We then turn to algorithms that minimize an even stronger form of regret, known as internal or swap regret. We present a general reduction showing how to convert any algorithm for minimizing external regret to one that minimizes this stronger form of regret as well. Internal regret is important because when all players in a game minimize this stronger type of regret, the empirical distribution of play is known to converge to correlated equilibrium.

The third part of this chapter explains a different reduction: how to convert from the full information setting, in which the action chosen by the opponent is revealed after each time step, to the partial information (bandit) setting, where at each time step only the payoff of the selected action is observed (such as in routing), and still maintain a small external regret.

Finally, we end by discussing routing games in the Wardrop model, where one can show that if all participants minimize their own external regret, then
overall traffic is guaranteed to converge to an approximate Nash equilibrium. This further motivates price-of-anarchy results.

4.1 Introduction

In this chapter we consider the problem of repeatedly making decisions in an uncertain environment. The basic setting is we have a space of N actions, such as what route to use to drive to work, or the rows of a matrix game like {rock, paper, scissors}. At each time step, the algorithm probabilistically chooses an action (say, selecting what route to take), the environment makes its "move" (setting the road congestions on that day), and the algorithm then incurs the loss for its action chosen (how long its route took). The process then repeats the next day. What we would like are adaptive algorithms that can perform well in such settings, as well as to understand the dynamics of the system when there are multiple players, all adjusting their behavior in such a way.

A key technique for analyzing problems of this sort is known as regret analysis. The motivation behind regret analysis can be viewed as the following: we design a sophisticated online algorithm that deals with various issues of uncertainty and decision making, and sell it to a client. Our algorithm runs for some time and incurs a certain loss. We would like to avoid the embarrassment that our client will come back to us and claim that in retrospect we could have incurred a much lower loss if we had used his simple alternative policy. The regret of our online algorithm is the difference between the loss of our algorithm and the loss using that alternative policy. Different notions of regret quantify differently what is considered to be a "simple" alternative policy.

External regret, also called the problem of combining expert advice, compares performance to the best single action in retrospect. This implies that the simple alternative policy performs the same action in all time steps, which indeed is quite simple. Nonetheless, external regret provides a general methodology for developing online algorithms whose performance matches that of an optimal static offline algorithm by modeling the possible static solutions as different actions. In the context of machine learning, algorithms with good external regret bounds can be powerful tools for achieving performance comparable to the optimal prediction rule from some large class of hypotheses.

In Section 4.3 we describe several algorithms with particularly strong external regret bounds. We start with the very weak greedy algorithm, and build up to an algorithm whose loss is at most O(√(T log N)) greater than that of the best action, where T is the number of time steps. That is, the regret per time step drops as O(√((log N)/T)). In Section 4.4 we show that in a zero-sum game, such algorithms are guaranteed to approach or exceed the value of the game, and even yield a simple proof of the minimax theorem.

A second category of alternative policies are those that consider the online sequence of actions and suggest a simple modification to it, such as "every time you bought IBM, you should have bought Microsoft instead". While one can study very general classes of modification rules, the most common form, known as internal or swap regret, allows one to modify the online action sequence by changing every occurrence of a given action i to an alternative action j. (The distinction between internal and swap regret is that internal regret allows only one action to be replaced by another, whereas swap regret allows any mapping from {1,...,N} to {1,...,N} and can be up to a factor N larger.) In Section 4.5 we present a simple way to efficiently convert any external regret minimizing algorithm into one that minimizes swap regret with only a factor N increase in the regret term. Using the results for external regret this achieves a swap regret bound of O(√(TN log N)). (Algorithms for swap regret have also been developed from first principles; see the Notes section of this chapter for references. However, this procedure gives the best bounds known for efficient algorithms.)

The importance of swap regret is due to its tight connection to correlated equilibria, defined in Chapter 1. In fact, one way to think of a correlated equilibrium is that it is a distribution Q over the joint action space such that every player would have zero internal (or swap) regret when playing it. As we point out in Section 4.4, if each player can achieve swap regret εT, then the empirical distribution of the joint actions of the players will be an ε-correlated equilibrium.

We also describe how external regret results can be extended to the partial information model, also called the multi-armed bandit (MAB) problem. In this model, the online algorithm only gets to observe the loss of the action actually selected, and does not see the losses of the actions not chosen. For example, in the case of driving to work, you may only observe the travel time on the route you actually drive, and do not get to find out how long it would have taken had you chosen some alternative route. In Section 4.6 we present a general reduction, showing how to convert an algorithm with low external regret in the full information model to one for the partial information model (though the bounds produced are not the best known bounds for this problem).

Notice that the route-choosing problem can be viewed as a general-sum game: your travel time depends on the choices of the other drivers as well. In Section 4.7 we discuss results showing that in the Wardrop model of infinitesimal agents (considered in Chapter 18), if each driver acts to minimize external regret, then traffic flow over time can be shown to approach an approximate Nash equilibrium. This serves to further motivate price-of-anarchy results in this context, since it means they apply to the case that participants are using well-motivated self-interested adaptive behavior.

We remark that the results we present in this chapter are not always the strongest known, and the interested reader is referred to the recent book [CBL06], which gives a thorough coverage of many of the topics in this chapter. See also the Notes section for further references.

4.2 Model and Preliminaries

We assume an adversarial online model where there are N available actions X = {1,...,N}. At each time step t, an online algorithm H selects a distribution p^t over the N actions. After that, the adversary selects a loss vector ℓ^t ∈ [0,1]^N, where ℓ_i^t ∈ [0,1] is the loss of the i-th action at time t. In the full information model, the online algorithm H receives the loss vector ℓ^t and experiences a loss ℓ_H^t = Σ_{i=1}^N p_i^t ℓ_i^t. (This can be viewed as an expected loss when the online algorithm selects action i ∈ X with probability p_i^t.) In the partial information model, the online algorithm receives (ℓ_{k^t}^t, k^t), where k^t is distributed according to p^t, and ℓ_H^t = ℓ_{k^t}^t is its loss. The loss of the i-th action during the first T time steps is L_i^T = Σ_{t=1}^T ℓ_i^t, and the loss of H is L_H^T = Σ_{t=1}^T ℓ_H^t.

The aim for the external regret setting is to design an online algorithm that will be able to approach the performance of the best algorithm from a given class of algorithms G; namely, to have a loss close to L_{G,min}^T = min_{g∈G} L_g^T. Formally we would like to minimize the external regret R_G = L_H^T − L_{G,min}^T, and G is called the comparison class. The most studied comparison class G is the one that consists of all the single actions, i.e., G = X. In this chapter we concentrate on this important comparison class; namely, we want the online algorithm's loss to be close to L_min^T = min_i L_i^T, and let the external regret be R = L_H^T − L_min^T.

External regret uses a fixed comparison class G, but one can also envision a comparison class that depends on the online algorithm's actions. We can consider modification rules that modify the actions selected by the online algorithm, producing an alternative strategy which we will want to compete against. A modification rule F has as input the history and the current action selected by the online procedure, and outputs a (possibly different) action. (We denote by F^t the function F at time t, including any dependency on the history.) Given a sequence of probability distributions p^t used by an online algorithm H, and a modification rule F, we define a new sequence of probability distributions f^t = F^t(p^t), where f_i^t = Σ_{j : F^t(j)=i} p_j^t. The loss of the modified sequence is L_{H,F} = Σ_t Σ_i f_i^t ℓ_i^t. Note that at time t the modification rule F shifts the probability that H assigned to action j to action F^t(j). This implies that the modification rule F generates a different distribution, as a function of the online algorithm's distribution p^t.

We will focus on the case of a finite set 𝓕 of memoryless modification rules (they do not depend on history). Given a sequence of loss vectors, the regret of an online algorithm H with respect to the modification rules 𝓕 is

  R_𝓕 = max_{F∈𝓕} { L_H^T − L_{H,F}^T }.

Note that the external regret setting is equivalent to having a set 𝓕^ex of N modification rules F_i, where F_i always outputs action i. For internal regret, the set 𝓕^in consists of N(N−1) modification rules F_{i,j}, where F_{i,j}(i) = j and F_{i,j}(i′) = i′ for i′ ≠ i. That is, the internal regret of H is

  max_{F∈𝓕^in} { L_H^T − L_{H,F}^T } = max_{i,j∈X} { Σ_{t=1}^T p_i^t (ℓ_i^t − ℓ_j^t) }.

A more general class of memoryless modification rules gives swap regret, defined by the class 𝓕^sw, which includes all N^N functions F : {1,...,N} → {1,...,N}, where the function F swaps the current online action i with F(i) (which can be the same or a different action). That is, the swap regret of H is

  max_{F∈𝓕^sw} { L_H^T − L_{H,F}^T } = Σ_{i=1}^N max_{j∈X} { Σ_{t=1}^T p_i^t (ℓ_i^t − ℓ_j^t) }.

Note that since 𝓕^ex ⊆ 𝓕^sw and 𝓕^in ⊆ 𝓕^sw, both external and internal regret are upper-bounded by swap regret. (See also Exercises 1 and 2.)

4.3 External Regret Minimization

Before describing the external regret results, we begin by pointing out that it is not possible to guarantee low regret with respect to the overall optimal sequence of decisions in hindsight, as is done in competitive analysis [ST85, BEY98]. This will motivate why we will be concentrating on more restricted comparison classes. In particular, let G^all be the set of all functions mapping times {1,...,T} to actions X = {1,...,N}.

Theorem 4.1  For any online algorithm H there exists a sequence of T loss vectors such that the regret R_{G^all} is at least T(1 − 1/N).
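The quantities defined above are easy to compute for a concrete run. As a small illustrative sketch (the function and variable names are ours, not from the chapter), the following computes L_H^T, L_min^T, and the external regret R for a given sequence of distributions and loss vectors:

```python
# Sketch: computing the external regret R = L_H^T - L_min^T of an online
# algorithm H from its distributions p^t and the loss vectors l^t.
# Helper names are illustrative, not from the chapter.

def external_regret(distributions, losses):
    """distributions[t][i] = p_i^t, losses[t][i] = l_i^t (both length-N lists)."""
    n = len(losses[0])
    loss_h = sum(sum(p_i * l_i for p_i, l_i in zip(p, l))
                 for p, l in zip(distributions, losses))          # L_H^T
    loss_actions = [sum(l[i] for l in losses) for i in range(n)]  # L_i^T
    return loss_h - min(loss_actions)                             # R

# Example: two actions, three steps; H plays uniformly throughout.
p = [[0.5, 0.5]] * 3
l = [[1, 0], [1, 0], [0, 1]]   # action 2 is best in hindsight (L_2 = 1)
print(external_regret(p, l))   # L_H = 1.5, L_min = 1, so R = 0.5
```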
Proof  The sequence is simply as follows: at each time t, the action i_t of lowest probability p_{i_t}^t gets a loss of 0, and all the other actions get a loss of 1. Since min_i {p_i^t} ≤ 1/N, this means the loss of H in T time steps is at least T(1 − 1/N). On the other hand, there exists g ∈ G^all, namely g(t) = i_t, with a total loss of 0. □

The above proof shows that if we consider all possible functions, we have a very large regret. For the rest of the section we will use the comparison class G^a = {g_i : i ∈ X}, where g_i always selects action i. Namely, we compare the online algorithm to the best single action.

Warmup: Greedy and Randomized Greedy Algorithms

In this section, for simplicity we will assume all losses are either 0 or 1 (rather than a real number in [0,1]), which will simplify notation and proofs, though everything presented can be easily extended to the general case. Our first attempt to develop a good regret minimization algorithm will be to consider the greedy algorithm. Recall that L_i^t = Σ_{τ=1}^t ℓ_i^τ, namely the cumulative loss up to time t of action i. The Greedy algorithm at each time t selects the action x^t = arg min_{i∈X} L_i^{t−1} (if there are multiple actions with the same cumulative loss, it prefers the action with the lowest index). Formally:

Greedy Algorithm
  Initially: x^1 = 1.
  At time t: Let L_min^{t−1} = min_{i∈X} L_i^{t−1}, and S^{t−1} = {i : L_i^{t−1} = L_min^{t−1}}.
  Let x^t = min S^{t−1}.

Theorem 4.2  The Greedy algorithm, for any sequence of losses, has

  L_Greedy^T ≤ N · L_min^T + (N − 1).

Proof  At each time t such that Greedy incurs a loss of 1 and L_min^t does not increase, at least one action is removed from S^t. This can occur at most N times before L_min^t increases by 1. Therefore, Greedy incurs loss at most N between successive increments in L_min^t. More formally, this shows inductively that L_Greedy^t ≤ N − |S^t| + N · L_min^t. □

The above guarantee on Greedy is quite weak, stating only that its loss is at most a factor of N larger than the loss of the best action. The following theorem shows that this weakness is shared by any deterministic online algorithm. (A deterministic algorithm concentrates its entire weight on a single action at each time step.)
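As a minimal sketch (our own, not from the chapter), Greedy can be implemented in a few lines, and Theorem 4.2's bound can be checked empirically on a random 0/1 loss sequence:

```python
# Sketch of the Greedy algorithm: play an action with minimum cumulative
# loss so far, breaking ties toward the lowest index (actions are 0-based here).
import random

def greedy_play(losses):
    """losses[t][i] = loss of action i at time t (0/1). Returns Greedy's total loss."""
    n = len(losses[0])
    cum = [0] * n                 # L_i^{t-1}
    total = 0
    for l in losses:
        x = cum.index(min(cum))   # arg min, lowest-index tie-breaking
        total += l[x]
        for i in range(n):
            cum[i] += l[i]
    return total

# Empirical check of L_Greedy^T <= N * L_min^T + (N - 1) on a random sequence.
random.seed(0)
N, T = 4, 50
seq = [[random.randint(0, 1) for _ in range(N)] for _ in range(T)]
l_min = min(sum(l[i] for l in seq) for i in range(N))
print(greedy_play(seq) <= N * l_min + (N - 1))   # True: Theorem 4.2's bound
```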
Theorem 4.3  For any deterministic algorithm D there exists a loss sequence for which L_D^T = T and L_min^T = ⌊T/N⌋.

Note that the above theorem implies that L_D^T ≥ N · L_min^T + (T mod N), which almost matches the upper bound for Greedy (Theorem 4.2).

Proof  Fix a deterministic online algorithm D and let x^t be the action it selects at time t. We will generate the loss sequence in the following way. At time t, let the loss of x^t be 1 and the loss of any other action be 0. This ensures that D incurs loss 1 at each time step, so L_D^T = T. Since there are N different actions, there is some action that algorithm D has selected at most ⌊T/N⌋ times. By construction, only the actions selected by D ever have a loss, so this implies that L_min^T ≤ ⌊T/N⌋. □

Theorem 4.3 motivates considering randomized algorithms. In particular, one weakness of the greedy algorithm was that it had a deterministic tie breaker. One can hope that if the online algorithm splits its weight between all the currently best actions, better performance could be achieved. Specifically, let Randomized Greedy (RG) be the procedure that assigns a uniform distribution over all those actions with minimum total loss so far. We now will show that this algorithm achieves a significant performance improvement: its loss is at most an O(log N) factor from the best action, rather than O(N). (This is similar to the analysis of the randomized marking algorithm in competitive analysis.)

Randomized Greedy (RG) Algorithm
  Initially: p_i^1 = 1/N for i ∈ X.
  At time t: Let L_min^{t−1} = min_{i∈X} L_i^{t−1}, and S^{t−1} = {i : L_i^{t−1} = L_min^{t−1}}.
  Let p_i^t = 1/|S^{t−1}| for i ∈ S^{t−1} and p_i^t = 0 otherwise.

Theorem 4.4  The Randomized Greedy (RG) algorithm, for any loss sequence, has

  L_RG^T ≤ (ln N) + (1 + ln N) · L_min^T.

Proof  The proof follows from showing that the loss incurred by Randomized Greedy between successive increases in L_min^t is at most 1 + ln N. Specifically, let t_j denote the time step at which L_min^t first reaches a loss of j, so we are interested in the loss of Randomized Greedy between time steps t_j and t_{j+1}. At any time t we have 1 ≤ |S^t| ≤ N. Furthermore, if at time t ∈ (t_j, t_{j+1}] the size of S^t shrinks by k from some size n′ down to n′ − k, then the loss of the online algorithm RG is k/n′, since each such action has weight 1/n′. Finally,
notice that we can upper bound k/n′ by 1/n′ + 1/(n′−1) + ... + 1/(n′−k+1). Therefore, over the entire time interval (t_j, t_{j+1}], the loss of Randomized Greedy is at most:

  1/N + 1/(N−1) + 1/(N−2) + ... + 1/1 ≤ 1 + ln N.

More formally, this shows inductively that L_RG^t ≤ (1/N + 1/(N−1) + ... + 1/(|S^t|+1)) + (1 + ln N) · L_min^t. □

Randomized Weighted Majority Algorithm

Although Randomized Greedy achieved a significant performance gain compared to the Greedy algorithm, we still have a logarithmic ratio to the best action. Looking more closely at the proof, one can see that the losses are greatest when the sets S^t are small, since the online loss can be viewed as proportional to 1/|S^t|. One way to overcome this weakness is to give some weight to actions which are currently "near best". That is, we would like the probability mass on some action to decay gracefully with its distance to optimality. This is the idea of the Randomized Weighted Majority algorithm of Littlestone and Warmuth.

Specifically, in the Randomized Weighted Majority algorithm, we give an action i whose total loss so far is L_i a weight w_i = (1−η)^{L_i}, and then choose probabilities proportional to the weights: p_i = w_i / Σ_{j=1}^N w_j. The parameter η will be set to optimize certain tradeoffs, but conceptually think of it as a small constant, say 0.01. In this section we will again assume losses in {0,1} rather than [0,1] because it allows for an especially intuitive interpretation of the proof (Theorem 4.5). We then relax this assumption in the next section (Theorem 4.6).

Randomized Weighted Majority (RWM) Algorithm
  Initially: w_i^1 = 1 and p_i^1 = 1/N, for i ∈ X.
  At time t: If ℓ_i^{t−1} = 1, let w_i^t = w_i^{t−1}(1−η); else (ℓ_i^{t−1} = 0) let w_i^t = w_i^{t−1}.
  Let p_i^t = w_i^t / W^t, where W^t = Σ_{i∈X} w_i^t.

Algorithm RWM and Theorem 4.5 can be generalized to losses in [0,1] by replacing the update rule with w_i^t = w_i^{t−1}(1−η)^{ℓ_i^{t−1}} (see Exercise 3).

Theorem 4.5  For η ≤ 1/2, the loss of Randomized Weighted Majority (RWM) on any sequence of binary {0,1} losses satisfies:

  L_RWM^T ≤ (1+η) L_min^T + (ln N)/η.

Setting η = min{√((ln N)/T), 1/2} yields L_RWM^T ≤ L_min^T + 2√(T ln N).
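A minimal sketch of RWM for {0,1} losses (our own code, with illustrative names), together with an empirical check of Theorem 4.5's bound on a small sequence:

```python
# Sketch of Randomized Weighted Majority for {0,1} losses: each action's
# weight is multiplied by (1 - eta) whenever it incurs a loss of 1, and the
# played distribution is proportional to the current weights.
import math

def rwm_expected_loss(losses, eta):
    """losses[t][i] in {0,1}. Returns RWM's total expected loss, sum_t F^t."""
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for l in losses:
        W = sum(w)
        total += sum(w[i] * l[i] for i in range(n)) / W   # F^t, expected loss at t
        w = [w[i] * (1 - eta) if l[i] == 1 else w[i] for i in range(n)]
    return total

# Check L_RWM^T <= (1 + eta) L_min^T + (ln N)/eta on a concrete sequence.
losses = [[1, 0], [0, 1], [1, 0], [1, 0]]
l_min = min(sum(l[i] for l in losses) for i in range(2))   # best action: loss 1
eta = 0.5
print(rwm_expected_loss(losses, eta) <= (1 + eta) * l_min + math.log(2) / eta)  # True
```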
(Note: the second part of the theorem assumes T is known in advance. If T is unknown, then a "guess and double" approach can be used to set η with just a constant-factor loss in regret. In fact, one can achieve the potentially better bound L_RWM^T ≤ L_min^T + 2√(L_min ln N) by setting η = min{√((ln N)/L_min), 1/2}.)

Proof  The key to the proof is to consider the total weight W^t. What we will show is that any time the online algorithm has significant expected loss, the total weight must drop substantially. We will then combine this with the fact that W^{T+1} ≥ max_i w_i^{T+1} = (1−η)^{L_min^T} to achieve the desired bound.

Specifically, let F^t = (Σ_{i : ℓ_i^t = 1} w_i^t) / W^t denote the fraction of the weight W^t that is on actions that experience a loss of 1 at time t; so, F^t equals the expected loss of algorithm RWM at time t. Now, each of the actions experiencing a loss of 1 has its weight multiplied by (1−η) while the rest are unchanged. Therefore, W^{t+1} = W^t − η F^t W^t = W^t (1 − η F^t). In other words, the proportion of the weight removed from the system at each time t is exactly η times the expected loss of the online algorithm. Now, using the fact that W^1 = N and using our lower bound on W^{T+1}, we have:

  (1−η)^{L_min^T} ≤ W^{T+1} = W^1 · Π_{t=1}^T (1 − η F^t) = N · Π_{t=1}^T (1 − η F^t).

Taking logarithms,

  L_min^T ln(1−η) ≤ (ln N) + Σ_{t=1}^T ln(1 − η F^t)
                  ≤ (ln N) − η Σ_{t=1}^T F^t        (using the inequality ln(1−z) ≤ −z)
                  = (ln N) − η L_RWM^T              (by definition of F^t).

Therefore,

  L_RWM^T ≤ (−ln(1−η)/η) L_min^T + (ln N)/η ≤ (1+η) L_min^T + (ln N)/η

(using the inequality −ln(1−z) ≤ z + z² for 0 ≤ z ≤ 1/2), which completes the proof. □
Polynomial Weights Algorithm

The Polynomial Weights (PW) algorithm is a natural extension of the RWM algorithm to losses in [0,1] (or even to the case of both losses and gains, see Exercise 4) that maintains the same proof structure as that used for RWM and in addition performs especially well in the case of small losses.

Polynomial Weights (PW) Algorithm
  Initially: w_i^1 = 1 and p_i^1 = 1/N, for i ∈ X.
  At time t: Let w_i^t = w_i^{t−1}(1 − η ℓ_i^{t−1}).
  Let p_i^t = w_i^t / W^t, where W^t = Σ_{i∈X} w_i^t.

Notice that the only difference between PW and RWM is in the update step. In particular, it is no longer necessarily the case that an action of total loss L has weight (1−η)^L. However, what is maintained is the property that if the algorithm's loss at time t is F^t, then exactly an η F^t fraction of the total weight is removed from the system. Specifically, from the update rule we have W^{t+1} = W^t − η Σ_i w_i^t ℓ_i^t = W^t (1 − η F^t), where F^t = (Σ_i w_i^t ℓ_i^t)/W^t is the loss of PW at time t. We can use this fact to prove the following:

Theorem 4.6  The Polynomial Weights (PW) algorithm, using η ≤ 1/2, for any [0,1]-valued loss sequence and for any k, has

  L_PW^T ≤ L_k^T + η Q_k^T + (ln N)/η,

where Q_k^T = Σ_{t=1}^T (ℓ_k^t)². Setting η = min{√((ln N)/T), 1/2} and noting that Q_k^T ≤ T, we have L_PW^T ≤ L_min^T + 2√(T ln N). (Again, for simplicity we assume that the number of time steps T is given as a parameter to the algorithm; otherwise one can use a "guess and double" method to set η.)

Proof  As noted above, we have W^{t+1} = W^t(1 − η F^t), where F^t is PW's loss at time t. So, as with the analysis of RWM, we have W^{T+1} = N Π_{t=1}^T (1 − η F^t) and therefore:

  ln W^{T+1} = ln N + Σ_{t=1}^T ln(1 − η F^t) ≤ ln N − η Σ_{t=1}^T F^t = ln N − η L_PW^T.

Now for the lower bound, we have:

  ln W^{T+1} ≥ ln w_k^{T+1}
             = Σ_{t=1}^T ln(1 − η ℓ_k^t)                      (using the recursive definition of weights)
             ≥ −η Σ_{t=1}^T ℓ_k^t − η² Σ_{t=1}^T (ℓ_k^t)²     (using the inequality ln(1−z) ≥ −z − z² for 0 ≤ z ≤ 1/2)
             = −η L_k^T − η² Q_k^T.

Combining the upper and lower bounds on ln W^{T+1} we have:

  −η L_k^T − η² Q_k^T ≤ ln N − η L_PW^T,

which yields the theorem. □
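The PW update is a one-line change from RWM. The following sketch (our own code; the loss sequence is an arbitrary toy example) implements it for [0,1]-valued losses and verifies Theorem 4.6's bound against every fixed action k:

```python
# Sketch of Polynomial Weights: multiplicative update w_i <- w_i (1 - eta * l_i)
# for real-valued losses in [0,1]; played probabilities are proportional to weights.
import math

def pw_run(losses, eta):
    """Returns PW's total expected loss on losses[t][i] in [0,1]."""
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for l in losses:
        W = sum(w)
        total += sum(w[i] * l[i] for i in range(n)) / W   # F^t
        w = [w[i] * (1 - eta * l[i]) for i in range(n)]
    return total

# Verify L_PW^T <= L_k^T + eta * Q_k^T + (ln N)/eta for each fixed action k.
losses = [[0.3, 0.9], [0.1, 0.2], [0.8, 0.0], [0.5, 0.4]]
eta = 0.25
for k in range(2):
    L_k = sum(l[k] for l in losses)
    Q_k = sum(l[k] ** 2 for l in losses)
    assert pw_run(losses, eta) <= L_k + eta * Q_k + math.log(2) / eta
print("bound holds for every k")
```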
Lower Bounds

An obvious question is whether one can significantly improve the bound in Theorem 4.6. We will show two simple results that imply that the regret bound is near optimal (see Exercise 5 for a better lower bound). The first result shows that one cannot hope to get sublinear regret when T is small compared to log N, and the second shows that one cannot hope to achieve regret o(√T) even when N = 2.

Theorem 4.7  Consider T ≤ log₂ N. There exists a stochastic generation of losses such that, for any online algorithm R1, we have E[L_{R1}^T] ≥ T/2 and yet L_min^T = 0.

Proof  Consider the following sequence of losses. At time t = 1, a random subset of N/2 actions get a loss of 0 and the rest get a loss of 1. At time t = 2, a random subset of N/4 of the actions that had loss 0 at time t = 1 get a loss of 0, and the rest (including actions that had a loss of 1 at time 1) get a loss of 1. This process repeats: at each time step, a random subset of half of the actions that have received loss 0 so far get a loss of 0, while all the rest get a loss of 1. Any online algorithm incurs an expected loss of at least 1/2 at each time step, because at each time step t the expected fraction of probability mass p_i^t on actions that receive a loss of 0 is at most 1/2. Yet, for T ≤ log₂ N there will always be some action with total loss of 0. □

Theorem 4.8  Consider N = 2. There exists a stochastic generation of losses such that, for any online algorithm R2, we have E[L_{R2}^T − L_min^T] = Ω(√T).

Proof  At time t, we flip a fair coin and set ℓ^t = z₁ = (0,1) with probability 1/2 and ℓ^t = z₂ = (1,0) with probability 1/2. For any distribution p^t the expected loss at time t is exactly 1/2. Therefore any online algorithm R2 has expected loss of T/2. Given a sequence of T such losses, with T/2 + y losses z₁ and T/2 − y losses z₂, we have T/2 − L_min^T = |y|. It remains to lower bound E[|y|]. Note that the probability of y is (T choose T/2+y)/2^T, which is upper bounded by O(1/√T) (using a Stirling approximation). This implies that with a constant probability we have |y| = Ω(√T), which completes the proof. □
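The construction in the proof of Theorem 4.8 is easy to simulate. The following sketch (our own, purely illustrative) draws the fair-coin loss sequence for N = 2 and checks the identity T/2 − L_min^T = |y| used in the proof:

```python
# Sketch: the stochastic lower-bound sequence of Theorem 4.8 for N = 2.
# Each step the loss vector is z1 = (0,1) or z2 = (1,0) with a fair coin.
import random

rng = random.Random(7)
T = 10000
losses = [(0, 1) if rng.random() < 0.5 else (1, 0) for _ in range(T)]
n_z1 = sum(1 for l in losses if l == (0, 1))
y = n_z1 - T // 2                       # excess of z1 draws over T/2
L = [sum(l[i] for l in losses) for i in range(2)]
assert L[0] + L[1] == T                 # exactly one action is charged per step
print(T // 2 - min(L) == abs(y))        # True: the identity used in the proof
```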
4.4 Regret Minimization and Game Theory

In this section we outline the connection between regret minimization and central concepts in game theory. We start by showing that in a two-player constant-sum game, a player with external regret sublinear in T will have an average payoff that is at least the value of the game, minus a vanishing error term. For a general game, we will see that if all the players use procedures with sublinear swap regret, then they will converge to an approximate correlated equilibrium. We also show that for a player who minimizes swap regret, the frequency of playing dominated actions is vanishing.

Game-Theoretic Model

We start with the standard definitions of a game (see also Chapter 1). A game G = ⟨M, (X_i), (s_i)⟩ has a finite set M of m players. Player i has a set X_i of N actions and a loss function s_i : X_i × (×_{j≠i} X_j) → [0,1] that maps the action of player i and the actions of the other players to a real number. (We have scaled losses to [0,1].) The joint action space is X = ×_i X_i.

We consider a player i that plays a game G for T time steps using an online procedure ON. At time step t, player i plays a distribution (mixed action) P_i^t, while the other players play the joint distribution P_{−i}^t. We denote by ℓ_ON^t the loss of player i at time t, i.e., E_{x^t∼P^t}[s_i(x^t)], and its cumulative loss is L_ON^T = Σ_{t=1}^T ℓ_ON^t. (Alternatively, we could consider x_i^t as a random variable distributed according to P_i^t, and similarly discuss the expected loss. We prefer the above presentation for consistency with the rest of the chapter.) It is natural to define, for player i at time t, the loss vector ℓ^t = (ℓ_1^t, ..., ℓ_N^t), where ℓ_j^t = E_{x_{−i}^t ∼ P_{−i}^t}[s_i(x_j, x_{−i}^t)]. Namely, ℓ_j^t is the loss player i would have observed if at time t it had played action x_j. The cumulative loss of action x_j ∈ X_i of player i is L_j^T = Σ_{t=1}^T ℓ_j^t, and L_min^T = min_j L_j^T.
Constant-Sum Games and External Regret Minimization

A two-player constant-sum game G = ⟨{1,2}, (X_i), (s_i)⟩ has the property that for some constant c, for every x₁ ∈ X₁ and x₂ ∈ X₂ we have s₁(x₁,x₂) + s₂(x₁,x₂) = c. It is well known that any constant-sum game has a well-defined value (v₁, v₂), and player i ∈ {1,2} has a mixed strategy which guarantees that its expected loss is at most v_i, regardless of the other player's strategy. (See [Owe82] for more details.) In such games, external regret-minimization procedures provide the following guarantee:

Theorem 4.9  Let G be a constant-sum game with game value (v₁, v₂). If player i ∈ {1,2} plays for T steps using a procedure ON with external regret R, then its average loss (1/T) L_ON^T is at most v_i + R/T.

Proof  Let q be the mixed strategy corresponding to the observed frequencies of the actions player 2 has played; that is, q_j = Σ_{t=1}^T P_{2,j}^t / T, where P_{2,j}^t is the weight player 2 gives to action j at time t. By the theory of constant-sum games, for any mixed strategy q of player 2, player 1 has some action x_k ∈ X₁ such that E_{x₂∼q}[s₁(x_k, x₂)] ≤ v₁ (see [Owe82]). This implies, in our setting, that if player 1 had always played action x_k, then its loss would be at most v₁T. Therefore L_min^T ≤ L_k^T ≤ v₁T. Now, using the fact that player 1 is playing a procedure ON with external regret R, we have that

  L_ON^T ≤ L_min^T + R ≤ v₁T + R. □

Thus, using a procedure with regret R = O(√(T log N)) as in Theorem 4.6 will guarantee average loss at most v_i + O(√((log N)/T)).

In fact, we can use the existence of external regret minimization algorithms to prove the minimax theorem of two-player zero-sum games. For player 1, let

  v₁^min = max_{z₂ ∈ Δ(X₂)} min_{x₁ ∈ X₁} E_{x₂∼z₂}[s₁(x₁,x₂)]  and
  v₁^max = min_{z₁ ∈ Δ(X₁)} max_{x₂ ∈ X₂} E_{x₁∼z₁}[s₁(x₁,x₂)].

That is, v₁^min is the best loss that player 1 can guarantee for itself if it is told the mixed action of player 2 in advance. Similarly, v₁^max is the best loss that player 1 can guarantee to itself if it has to go first in selecting a mixed action, and player 2's action may then depend on it. The minimax theorem states that v₁^min = v₁^max. Since s₁(x₁,x₂) = −s₂(x₁,x₂), we can similarly define v₂^min = −v₁^max and v₂^max = −v₁^min.

In the following we give a proof of the minimax theorem based on the existence of external regret algorithms. Assume for contradiction that v₁^max = v₁^min + γ for some γ > 0 (it is easy to see that v₁^max ≥ v₁^min). Consider both players playing a regret minimization algorithm for T steps having external regret of at most R, such that R/T < γ/2. Let L_ON be the loss of player 1, and note that −L_ON is the loss of player 2. Let L_min^i be the cumulative loss of the best action of player i ∈ {1,2}. As before, let q_i be the mixed strategy corresponding to the observed frequencies of actions of player i ∈ {1,2}. Then, L_min^1 ≤ T v₁^min, since for L_min^1 we select the best action with respect to a specific mixed action, namely q₂. Similarly, L_min^2 ≤ T v₂^min. The regret minimization algorithms guarantee for player 1 that L_ON ≤ L_min^1 + R, and for player 2 that −L_ON ≤ L_min^2 + R. Combining the inequalities we have:

  T v₁^max − R = −T v₂^min − R ≤ −L_min^2 − R ≤ L_ON ≤ L_min^1 + R ≤ T v₁^min + R.

This implies that v₁^max − v₁^min ≤ 2R/T < γ, which is a contradiction. Therefore, v₁^max = v₁^min, which establishes the minimax theorem.

Correlated Equilibrium and Swap Regret Minimization

We first define the relevant modification rules and establish the connection between them and equilibrium notions. For x₁, b₁, b₂ ∈ X_i, let switch_i(x₁; b₁, b₂) be the following modification function of the action x₁ of player i:

  switch_i(x₁; b₁, b₂) = b₂ if x₁ = b₁, and x₁ otherwise.

Given a modification function f for player i, we can measure the regret of player i with respect to f as the decrease in its loss, i.e.,

  regret_i(x, f) = s_i(x) − s_i(f(x_i), x_{−i}).

For example, when we consider f(x₁) = switch_i(x₁; b₁, b₂), for fixed b₁, b₂ ∈ X_i, then regret_i(x, f) is measuring the regret player i has for playing action b₁ rather than b₂, when the other players play x_{−i}.

A correlated equilibrium is a distribution P over the joint action space with the following property. Imagine a correlating device draws a vector of actions x ∈ X using distribution P over X, and gives player i the action x_i from x. (Player i is not given any other information regarding x.) The probability distribution P is a correlated equilibrium if, for each player, it is a best response to play the suggested action, provided that the other players also do not deviate. (For a more detailed discussion of correlated equilibrium see Chapter 1.)

Definition 4.10  A joint probability distribution P over X is a correlated equilibrium if for every player i, and any actions b₁, b₂ ∈ X_i, we have that

  E_{x∼P}[regret_i(x, switch_i(·; b₁, b₂))] ≤ 0.

An equivalent definition that extends more naturally to the case of approximate equilibria is to say that rather than only switching between a pair of actions, we allow simultaneously replacing every action in X_i with another action in X_i (possibly the same action). A distribution P is a correlated equilibrium iff for any function F : X_i → X_i we have E_{x∼P}[regret_i(x, F)] ≤ 0.

We now define an ε-correlated equilibrium: a distribution P such that each player has in expectation at most an ε incentive to deviate. Formally,

Definition 4.11  A joint probability distribution P over X is an ε-correlated equilibrium if for every player i and for any function F_i : X_i → X_i, we have E_{x∼P}[regret_i(x, F_i)] ≤ ε.

The following theorem relates the empirical distribution of the actions performed by each player, their swap regret, and the distance to correlated equilibrium.

Theorem 4.12  Let G = ⟨M, (X_i), (s_i)⟩ be a game and assume that for T time steps every player follows a strategy that has swap regret of at most R. Then, the empirical distribution Q of the joint actions played by the players is an (R/T)-correlated equilibrium.

Proof  The empirical distribution Q assigns to every P^t a probability of 1/T. Fix a function F : X_i → X_i for player i. Since player i has swap regret at most R, we have L_ON^T ≤ L_{ON,F}^T + R, where L_ON^T is the loss of player i. By definition of the regret function, we therefore have:

  L_ON^T − L_{ON,F}^T = Σ_{t=1}^T E_{x^t∼P^t}[s_i(x^t)] − Σ_{t=1}^T E_{x^t∼P^t}[s_i(F(x_i^t), x_{−i}^t)]
                      = Σ_{t=1}^T E_{x^t∼P^t}[regret_i(x^t, F)]
                      = T · E_{x∼Q}[regret_i(x, F)].

Therefore, for any function F_i : X_i → X_i we have E_{x∼Q}[regret_i(x, F_i)] ≤ R/T. □

The above theorem states that the payoff of each player is its payoff in some approximate correlated equilibrium. In addition, it relates the swap regret to the distance from equilibrium. Note that if the average swap regret vanishes, then the procedure converges, in the limit, to the set of correlated equilibria.
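The switch_i condition of Definition 4.10 is directly checkable for a small game. The following sketch (our own toy example, with matching-pennies-style losses for player 1; names are ours) measures the best gain available from any switch deviation under a joint distribution Q:

```python
# Sketch: measuring, for player 1, the maximum expected gain from a
# switch(.; b1, b2) deviation under a joint distribution Q over action
# profiles. If the maximum over all (b1, b2) is at most eps, Q satisfies
# player 1's condition for an eps-correlated equilibrium.

def switch_regret(Q, loss_1, b1, b2):
    """Q: dict {(x1, x2): prob}; loss_1(x1, x2): player 1's loss.
    Expected decrease in player 1's loss from playing b2 whenever told b1."""
    return sum(p * (loss_1(x1, x2) - loss_1(b2 if x1 == b1 else x1, x2))
               for (x1, x2), p in Q.items())

# Uniform distribution over the two "matched" profiles of a toy game in
# which player 1's loss is 1 on a mismatch and 0 on a match.
Q = {(0, 0): 0.5, (1, 1): 0.5}
loss = lambda x1, x2: 1.0 if x1 != x2 else 0.0
worst = max(switch_regret(Q, loss, b1, b2) for b1 in (0, 1) for b2 in (0, 1))
print(worst)   # 0.0: no switch helps player 1, as in Definition 4.10
```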
Dominated Strategies

We say that an action x_j ∈ X_i is ε-dominated by action x_k ∈ X_i if for any x_{−i} ∈ X_{−i} we have s_i(x_j, x_{−i}) ≥ ε + s_i(x_k, x_{−i}). Similarly, action x_j ∈ X_i is ε-dominated by a mixed action y ∈ Δ(X_i) if for any x_{−i} ∈ X_{−i} we have s_i(x_j, x_{−i}) ≥ ε + E_{x_d∼y}[s_i(x_d, x_{−i})].

Intuitively, a good learning algorithm ought to be able to learn not to play actions that are ε-dominated by others, and in this section we show that indeed if player i plays a procedure with sublinear swap regret, then it will very rarely play dominated actions. More precisely, let action x_j be ε-dominated by action x_k ∈ X_i. Using our notation, this implies that for any x_{−i} we have that regret_i(x, switch_i(·; x_j, x_k)) ≥ ε. Let D be the set of ε-dominated actions of player i, and let w be the weight that player i puts on actions in D, averaged over time, i.e., w = (1/T) Σ_{t=1}^T Σ_{j∈D} P_{i,j}^t. Player i's swap regret is at least wTε (since we could replace each action in D with the action that dominates it). So, if the player's swap regret is R, then wTε ≤ R. Therefore, the time-average weight that player i puts on the set of ε-dominated actions is at most R/(εT), which tends to 0 if R is sublinear in T. That is:

Theorem 4.13  Consider a game G and a player i that uses a procedure of swap regret R for T time steps. Then the average weight that player i puts on the set of ε-dominated actions is at most R/(εT).

We remark that in general the property of having low external regret is not sufficient by itself to give such a guarantee, though the algorithms RWM and PW do indeed have such a guarantee (see Exercise 8).

4.5 Generic Reduction from Swap to External Regret

In this section we give a black-box reduction showing how any procedure A achieving good external regret can be used as a subroutine to achieve good swap regret as well. The high-level idea is as follows (see also Fig. 4.1). We will instantiate N copies A₁, ..., A_N of the external-regret procedure. At each time step, these procedures will each give us a probability vector, which we will combine in a particular way to produce our own probability vector p. When we receive a loss vector ℓ, we will partition it among the N procedures, giving procedure A_i a fraction p_i (p_i is our probability mass on action i), so that A_i's belief about the loss of action j is Σ_t p_i^t ℓ_j^t, and matches the cost we would incur putting i's probability mass on j. In the proof, procedure A_i will in some sense be responsible
for ensuring low regret of the i → j variety. The key to making this work is that we will be able to define the p's so that the sum of the losses of the procedures A_i on their own loss vectors matches our overall true loss.

[Fig. 4.1. The structure of the swap regret reduction.]

Recall the definition of an R external regret procedure.

Definition 4.14  An R external regret procedure A guarantees that for any sequence of T losses ℓ^t and for any action j ∈ {1,...,N}, we have

  L_A^T = Σ_{t=1}^T ℓ_A^t ≤ Σ_{t=1}^T ℓ_j^t + R = L_j^T + R.

We assume we have N copies A₁, ..., A_N of an R external regret procedure. We combine the N procedures into one master procedure H as follows. At each time step t, each procedure A_i outputs a distribution q_i^t, where q_{i,j}^t is the fraction it assigns action j. We compute a single distribution p^t such that p_j^t = Σ_i p_i^t q_{i,j}^t. That is, p^t = p^t Q^t, where p^t is our distribution and Q^t is the matrix of q_{i,j}^t. (We can view p^t as a stationary distribution of the Markov process defined by Q^t, and it is well known that such a p^t exists and is efficiently computable.) For intuition into this choice of p^t, notice that it implies we can consider action selection in two equivalent ways. The first is simply using the distribution p^t to select action j with probability p_j^t. The
second is to select procedure A_i with probability p_i^t and then to use A_i to select the action (which produces distribution p^t Q^t).

When the adversary returns the loss vector ℓ^t, we return to each A_i the loss vector p_i^t ℓ^t. So, procedure A_i experiences loss (p_i^t ℓ^t) · q_i^t = p_i^t (q_i^t · ℓ^t). Since A_i is an R external regret procedure, for any action j, we have

  Σ_{t=1}^T p_i^t (q_i^t · ℓ^t) ≤ Σ_{t=1}^T p_i^t ℓ_j^t + R.    (4.1)

If we sum the losses of the N procedures at a given time t, we get Σ_i p_i^t (q_i^t · ℓ^t) = p^t Q^t ℓ^t, where p^t is the row-vector of our distribution, Q^t is the matrix of q_{i,j}^t, and ℓ^t is viewed as a column-vector. By design of p^t, we have p^t Q^t = p^t. So, the sum of the perceived losses of the N procedures is equal to our actual loss p^t ℓ^t.

Therefore, summing equation (4.1) over all N procedures, the left-hand side sums to L_H^T, where H is our master online procedure. Since the right-hand side of equation (4.1) holds for any j, we have that for any function F : {1,...,N} → {1,...,N},

  L_H^T ≤ Σ_{i=1}^N Σ_{t=1}^T p_i^t ℓ_{F(i)}^t + NR = L_{H,F}^T + NR.

Therefore we have proven the following theorem.

Theorem 4.15  Given an R external regret procedure, the master online procedure H has the following guarantee. For every function F : {1,...,N} → {1,...,N},

  L_H ≤ L_{H,F} + NR,

i.e., the swap regret of H is at most NR.

Using Theorem 4.6 we can immediately derive the following corollary.

Corollary 4.16  There exists an online algorithm H such that for every function F : {1,...,N} → {1,...,N}, we have that L_H ≤ L_{H,F} + O(N√(T log N)), i.e., the swap regret of H is at most O(N√(T log N)).

Remark: See Exercise 6 for an improvement to O(√(NT log N)).
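The one computational step the reduction needs at each round is solving p^t = p^t Q^t. A minimal sketch (our own; power iteration is one of several ways to compute a stationary distribution, and names are illustrative):

```python
# Sketch of the master step of the swap-regret reduction: given the matrix
# Q^t whose row i is procedure A_i's distribution q_i^t, compute p^t with
# p^t = p^t Q^t, i.e. a stationary distribution of the Markov chain Q^t.
# Simple power iteration is used here for illustration.

def stationary(Q, iters=10000):
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return p

Q = [[0.9, 0.1],
     [0.5, 0.5]]
p = stationary(Q)
# Check the fixed-point property p = pQ, i.e. p_j = sum_i p_i Q_ij.
assert all(abs(p[j] - sum(p[i] * Q[i][j] for i in range(2))) < 1e-9 for j in range(2))
print([round(x, 3) for x in p])   # [0.833, 0.167]
```

Power iteration converges here because the chain is ergodic; for a general Q^t one can instead solve the linear system for the stationary distribution.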
4.6 The Partial Information Model

In this section we show, for external regret, a simple reduction from the partial information to the full information model.† The main difference between the two models is that in the full information model, the online procedure has access to the loss of every action. In the partial information model the online procedure receives as feedback only the loss of a single action, the action it performed. This very naturally leads to an exploration versus exploitation tradeoff in the partial information model, and essentially any online procedure will have to somehow explore the various actions and estimate their loss.

The high level idea of the reduction is as follows. Assume that the number of time steps T is given as a parameter. We will partition the T time steps into K blocks. The procedure will use the same distribution over actions in all the time steps of any given block, except it will also randomly sample each action once (the exploration part). The partial information procedure MAB will pass to the full information procedure FIB the vector of losses received from its exploration steps. The full information procedure FIB will then return a new distribution over actions. The main part of the proof will be to relate the loss of the full information procedure FIB on the loss sequence it observes to the loss of the partial information procedure MAB on the real loss sequence.

We start by considering a full information procedure FIB that partitions the T time steps into K blocks, B^1,...,B^K, where B^τ = {(τ−1)(T/K)+1, ..., τ(T/K)}, and uses the same distribution in all the time steps of a block. (For simplicity we assume that K divides T.) Consider an R_K external regret minimization procedure FIB (over K time steps), which at the end of block τ updates the distribution using the average loss vector, i.e., c^τ = Σ_{t∈B^τ} ℓ^t / |B^τ|. Let C^K_i = Σ_{τ=1}^K c^τ_i and C^K_min = min_i C^K_i. Since FIB has external regret at most R_K, this implies that the loss of FIB, over the loss sequence c^τ, is at most C^K_min + R_K. Since in every block B^τ the procedure FIB uses a single distribution p^τ, its loss on the entire loss sequence is

L^T_FIB = Σ_{τ=1}^K Σ_{t∈B^τ} p^τ · ℓ^t = (T/K) Σ_{τ=1}^K p^τ · c^τ ≤ (T/K)[C^K_min + R_K].

At this point it is worth noting that if R_K = O(√(K log N)), the overall regret is O((T/√K)√(log N)), which is minimized at K = T, namely by having each block be a single time step. However, we will have an additional loss associated with each block (due to the sampling), which will cause the optimization to require that K ≪ T.

The next step in developing the partial information procedure MAB is to use loss vectors which are not the "true average" but whose expectation is the same. More formally, the feedback to the full information procedure FIB will be a random variable vector ĉ^τ such that for any action i we have E[ĉ^τ_i] = c^τ_i. Similarly, let Ĉ^K_i = Σ_{τ=1}^K ĉ^τ_i and Ĉ^K_min = min_i Ĉ^K_i. (Intuitively, we will generate the vector ĉ^τ using sampling within a block.) This implies that for any block B^τ and any distribution p^τ we have

(1/|B^τ|) Σ_{t∈B^τ} p^τ · ℓ^t = p^τ · c^τ = Σ_{i=1}^N p^τ_i c^τ_i = Σ_{i=1}^N p^τ_i E[ĉ^τ_i].    (4.2)

That is, the loss of p^τ in B^τ is equal to its expected loss with respect to ĉ^τ.

The full information procedure FIB observes the losses ĉ^τ, for τ ∈ {1,...,K}. However, since the ĉ^τ are random variables, the distribution p^τ is also a random variable that depends on the previous losses, i.e., ĉ^1,...,ĉ^{τ−1}. Still, with respect to any sequence of losses ĉ^τ, we have that

Ĉ^K_FIB = Σ_{τ=1}^K p^τ · ĉ^τ ≤ Ĉ^K_min + R_K.

Since E[Ĉ^K_i] = C^K_i, this implies that

E[Ĉ^K_FIB] ≤ E[Ĉ^K_min] + R_K ≤ C^K_min + R_K,

where we used the fact that E[min_i Ĉ^K_i] ≤ min_i E[Ĉ^K_i], and the expectation is over the choices of ĉ^τ. Note that for any sequence of losses ĉ^1,...,ĉ^K, both FIB and MAB will use the same sequence of distributions p^1,...,p^K. From (4.2) we have that in any block B^τ the expected loss of FIB and the loss of MAB are the same, assuming they both use the same distribution p^τ. This implies that E[C^K_MAB] = E[Ĉ^K_FIB].

We now need to show how to derive random variables ĉ^τ with the desired property. This will be done by choosing randomly, for each action i and block B^τ, an exploration time t_i ∈ B^τ. (These do not need to be independent over the different actions, so this can easily be done without collisions.) At time t_i the procedure MAB will play action i (i.e., the probability vector with all probability mass on i). This implies that the feedback that it receives will be ℓ^{t_i}_i, and we will then set ĉ^τ_i to be ℓ^{t_i}_i. This guarantees that E[ĉ^τ_i] = c^τ_i.

† This reduction does not produce the best known bounds for the partial information model (see, e.g., [ACBFS02] for better bounds) but is particularly simple and generic.
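The block-sampling construction just described can be sketched as follows. This is an illustrative sketch only: `bandit_loss` (the environment's loss oracle) and `full_info_update` (the FIB update, e.g. RWM run on the estimated vectors ĉ) are hypothetical placeholder names.

```python
import random

def bandit_reduction(T, K, N, bandit_loss, full_info_update, init_dist):
    """MAB-to-FIB reduction: K blocks of length T//K. Within each block,
    play the current distribution p, except at N random exploration times
    (one per action); the single sampled loss of action i becomes c_hat[i],
    an unbiased estimate of its average loss over the block."""
    block = T // K
    assert T % K == 0 and N <= block
    p = list(init_dist)
    total = 0.0
    for tau in range(K):
        # One exploration time per action inside this block, no collisions.
        explore = {tau * block + s: i
                   for i, s in enumerate(random.sample(range(block), N))}
        c_hat = [0.0] * N
        for t in range(tau * block, (tau + 1) * block):
            if t in explore:
                i = explore[t]              # exploration: play action i
                loss_i = bandit_loss(t, i)
                c_hat[i] = loss_i           # single-sample estimate of c_i
                total += loss_i
            else:                           # exploitation: sample from p
                i = random.choices(range(N), weights=p)[0]
                total += bandit_loss(t, i)
        p = full_info_update(c_hat)         # FIB sees only the estimates
    return total
```

Note that FIB is updated once per block with the estimate vector, matching the analysis: its regret is paid K times, while the exploration steps add at most NK extra loss.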
So far we have ignored the loss in the exploration steps. Since the maximum loss is 1, and there are N exploration steps in each of the K blocks, the total loss in all the exploration steps is at most NK. Therefore we have:

E[L^T_MAB] ≤ NK + (T/K) E[C^K_MAB] ≤ NK + (T/K)[C^K_min + R_K] = L^T_min + NK + (T/K) R_K.

By Theorem 4.6, there are external regret procedures that have regret R_K = O(√(K log N)). By setting K = (T/N)^{2/3}, for T ≥ N, we have the following theorem.

Theorem 4.17 Given an O(√(K log N)) external regret procedure FIB (for K time steps), there is a partial information procedure MAB that guarantees

L^T_MAB ≤ L^T_min + O(T^{2/3} N^{1/3} log N),

where T ≥ N.

4.7 On convergence of regret-minimizing strategies to Nash equilibrium in routing games

As mentioned earlier, one natural setting for regret-minimizing algorithms is online routing. For example, a person could use such algorithms to select which of N available routes to use to drive to work each morning in such a way that his performance will be nearly as good as the best fixed route in hindsight, even if traffic changes arbitrarily from day to day. In fact, even though in a graph G, the number of paths N between two nodes may be exponential in the size of G, there are a number of external-regret minimizing algorithms whose running time and regret bounds are polynomial in the graph size. Moreover, a number of extensions have shown how these algorithms can be applied even to the partial-information setting where only the cost of the path traversed is revealed to the algorithm.

In this section we consider the game-theoretic properties of such algorithms in the Wardrop model of traffic flow. In this model, we have a directed network G = (V,E), and one unit flow of traffic (a large population of infinitesimal users that we view as having one unit of volume) wanting to travel between two distinguished nodes v_start and v_end. (For simplicity, we are considering just the single-commodity version of the model.) We assume each edge e has a cost given by a latency function ℓ_e that is some non-decreasing function of the amount of traffic flowing on edge e. In other
Learning,Regretminimization,andEquilibria25words,thetimetotraverseeachedgeeisafunctionoftheamountofcon-gestiononthatedge.Inparticular,givensome\rowf,whereweusefetodenotetheamountof\rowonagivenedgee,thecostofsomepathPisPe2P`e(fe)andtheaveragetraveltimeofallusersinthepopulationcanbewrittenasPe2E`e(fe)fe.A\rowfisatNashequilibriumifall\row-carryingpathsPfromvstarttovendareminimum-latencypathsgiventhe\rowf.Chapter18considersthismodelinmuchmoredetail,analyzingtherela-tionshipbetweenlatenciesinNashequilibrium\rowsandthoseinglobally-optimum\rows(\rowsthatminimizethetotaltraveltimeaveragedoverallusers).Inthissectionwedescriberesultsshowingthatiftheusersinsuchasettingareadaptingtheirpathsfromdaytodayusingexternal-regretmin-imizingalgorithms(oreveniftheyjusthappentoexperiencelow-regret,regardlessofthespeci calgorithmsused)then\rowwillapproachNashequilibrium.NotethataNashequilibriumispreciselyasetofstaticstrate-giesthatareallno-regretwithrespecttoeachother,sosucharesultseemsnatural;howevertherearemanysimplegamesforwhichregret-minimizingalgorithmsdonotapproachNashequilibriumandcanevenperformmuchworsethananyNashequilibrium.Speci 
cally,onecanshowthatifeachuserhasregreto(T),orevenifjusttheaverageregret(averagedovertheusers)iso(T),then\rowapproachesNashequilibriuminthesensethata1fractionofdaysthavethepropertythata1fractionoftheusersthatdayexperiencetraveltimeatmostlargerthanthebestpathforthatday,whereapproaches0ataratethatdependspolynomiallyonthesizeofthegraph,theregret-boundsofthealgorithms,andthemaximumslopeofanylatencyfunction.Notethatthisisasomewhatnonstandardnotionofconvergencetoequilibrium:usuallyforan\-approximateequilibrium"onerequiresthatallparticipantshaveatmostincentivetodeviate.However,sincelow-regretalgorithmsareallowedtooccasionallytakelongpaths,andinfactalgorithmsintheMABmodelmustoccasionallyexplorepathstheyhavenottriedinalongtime(toavoidregretifthepathshavebecomemuchbetterinthemeantime),themultiplelevelsofhedgingareactuallynecessaryforaresultofthiskind.Inthissectionwepresentjustaspecialcaseofthisresult.LetPdenotethesetofallsimplepathsfromvstarttovendandletftdenotethe\rowondayt.LetC(f)=Pe2E`e(fe)fedenotethecostofa\rowf.NotethatC(f)isaweightedaverageofcostsofpathsinPandinfactisequaltotheaveragecostofallusersinthe\rowf.De nea\rowftobe-NashifC(f)+minP2PPe2P`e(fe);thatis,theaverageincentivetodeviateoverallusersisatmost.LetR(T)denotetheaverageregret(averaged 26A.BlumandY.Mansouroverusers)upthroughdayT,soR(T)TXt=1Xe2E`e(fte)fteminP2PTXt=1Xe2P`e(fte):Finally,letTdenotethenumberoftimestepsTneededsothatR(T)TforallTT.ForexampletheRWMandPWalgorithmsdiscussedinSection4.3achieveT=O(1 2logN)ifweset==2.Thenwewillshow:Theorem4.18Supposethelatencyfunctions`earelinear.ThenforTT,theaverage\row^f=1 T(f1+:::+fT)is-Nash.ProofFromthelinearityofthelatencyfunctions,wehaveforalle,`e(^fe)=1 TPTt=1`e(fte).Since`e(fte)fteisaconvexfunctionofthe\row,thisimplies`e(^fe)^fe1 TTXt=1`e(fte)fte:Summingoveralle,wehaveC(^f)1 TPTt=1C(ft)+minP1 TPTt=1Pe2P`e(fte)(byde nitionofT)=+minPPe2P`e(^fe):(bylinearity) 
Thisresultshowsthetime-average\rowisanapproximateNashequilib-rium.ThiscanthenbeusedtoprovethatmostoftheftmustinfactbeapproximateNash.Thekeyideahereisthatifthecostofanyedgewereto\ructuatewildlyovertime,thenthatwouldimplythatmostoftheusersofthatedgeexperiencedlatencysubstantiallygreaterthantheedge'saveragecost(becausemoreusersareusingtheedgewhenitiscongestedthanwhenitisnotcongested),whichinturnimpliestheyexperiencesubstantialregret.Theseargumentscanthenbecarriedovertothecaseofgeneral(non-linear)latencyfunctions.CurrentResearchDirectionsInthissectionwesketchsomecurrentresearchdirectionswithrespecttoregretminimization. Learning,Regretminimization,andEquilibria27Re nedRegretBounds:TheregretboundsthatwepresenteddependonthenumberoftimestepsT,andareindependentoftheperformanceofthebestaction.Suchboundsarealsocalledzeroorderbounds.Morere ned rstorderboundsdependonthelossofthebestaction,andsecondorderboundsdependonthesumofsquaresofthelosses(suchasQTkinThe-orem4.6).Aninterestingopenproblemistogetanexternalregretwhichisproportionaltotheempiricalvarianceofthebestaction.Anotherchal-lengeistoreducethepriorinformationneededbytheregretminimizationalgorithm.Ideally,itshouldbeabletolearnandadapttoparameterssuchasthemaximumandminimumloss.See[CBMS05]foradetaileddiscussionofthoseissues.Largeactionsspaces:InthischapterweassumedthenumberofactionsNissmallenoughtobeabletolistthemall,andouralgorithmsworkintimeproportionaltoN.However,inmanysettingsNisexponentialinthenaturalparametersoftheproblem.Forexample,theNactionsmightbeallsimplepathsbetweentwonodessandtinann-nodegraph,orallbinarysearchtreesonf1;:::;ng.SincethefullinformationexternalregretboundsareonlylogarithmicinN,fromthepointofviewofinformation,wecanderivepolynomialregretbounds.Thechallengeiswhetherinsuchsettingswecanproducecomputationallyecientalgorithms.Therehaverecentlybeenseveralresultsabletohandlebroadclassesofproblemsofthistype.KalaiandVempala[KV03]giveanecientalgorithmforanyprobleminwhich(a)thesetXofactionscanbeviewedasasubsetofRn
, (b) the loss vectors ℓ are linear functions over R^n (so the loss of action x is ℓ·x), and (c) we can efficiently solve the offline optimization problem argmin_{x∈X} [x·ℓ] for any given loss vector ℓ. For instance, this setting can model the path and search-tree examples above.† Zinkevich [Zin03] extends this to convex loss functions with a projection oracle, and there is substantial interest in trying to broaden the class of settings that efficient regret-minimization algorithms can be applied to.

Dynamics: It is also very interesting to analyze the dynamics of regret minimization algorithms. The classical example is that of swap regret: when all the players play swap regret minimization algorithms, the empirical distribution converges to the set of correlated equilibria (Section 4.4). We also saw convergence in two-player zero sum games to the minimax value of the game (Section 4.4), and convergence to Nash equilibrium in a Wardrop-model routing game (Section 4.7). Further results on convergence to equilibria in other settings would be of substantial interest. At a high level, understanding the dynamics of regret minimization algorithms would allow us to better understand the strengths and weaknesses of using such procedures. For more information on learning in games, see the book [FL98].

† The case of search trees has the additional issue that there is a rotation cost associated with using a different action (tree) at time t+1 than that used at time t. This is addressed in [KV03] as well.

Exercises

4.1 Show that swap regret is at most N times larger than internal regret.
4.2 Show an example (even with N = 3) where the ratio between the external and swap regret is unbounded.
4.3 Show that the RWM algorithm with update rule w^t_i = w^{t−1}_i (1−η)^{ℓ^{t−1}_i} achieves the same external regret bound as given in Theorem 4.6 for the PW algorithm, for losses in [0,1].
4.4 Consider a setting where the payoffs are in the range [−1,+1], and the goal of the algorithm is to maximize its payoff. Derive a modified PW algorithm whose external regret is O(√(Q^T_max log N) + log N), where Q^T_max ≥ Q^T_k for k ∈ X_i.
4.5 Show an Ω(√(T log N)) lower bound on external regret, for the case that T ≥ N.
4.6 Improve the swap regret bound to O(√(NT log N)). Hint: use the observation that the sum of the losses of all the A_i is bounded by T.
4.7 (Open Problem) Does there exist an Ω(√(TN log N)) lower bound for swap regret?
4.8 Show that if a player plays algorithm RWM (or PW) then it gives ε-dominated actions small weight. Also, show that there are cases where the external regret of a player can be small, yet it gives ε-dominated actions high weight.

Notes

Hannan [Han57] was the first to develop algorithms with external regret sublinear in T. Later, motivated by machine learning settings in which N can be quite large, algorithms that furthermore have only a logarithmic dependence on N were developed in [LW94, FS97, FS99, CBFH+97]. In particular, the Randomized Weighted Majority algorithm and Theorem 4.5 are from [LW94], and the Polynomial Weights algorithm and Theorem 4.6 are from [CBMS05]. Computationally efficient algorithms for generic frameworks that model many settings in which N may be exponential in the natural problem description (such as considering all s-t paths in a graph or all binary search trees on n elements) were developed in [KV03, Zin03].

The notion of internal regret and its connection to correlated equilibrium appear in [FV98, HMC00], and more general modification rules were considered in [Leh03]. A number of specific low internal regret algorithms were developed by [FV97, FV98, FV99, HMC00, CBL03, BM05, SL05]. The reduction in Section 4.5 from external to swap regret is from [BM05]. Algorithms with strong external regret bounds for the partial information model are given in [ACBFS02], and algorithms with low internal regret appear in [BM05, CBLS06]. The reduction from full information to partial information in Section 4.6 is in the spirit of algorithms of [AM03, AK04]. Extensions of the algorithm of [KV03] to the partial information setting appear in [AK04, MB04, DH06]. The results in Section 4.7 on approaching Nash equilibria in routing games are from [BEL06].

Bibliography

[ACBFS02] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[AK04] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In STOC, pages 45–53, 2004.
[AM03] Baruch Awerbuch and Yishay Mansour. Adapting to a reliable network path. In PODC, pages 360–367, 2003.
[BEL06] Avrim Blum, Eyal Even-Dar, and Katrina Ligett. Routing without regret: On convergence to Nash equilibria of regret-minimizing algorithms in routing games. In PODC, 2006.
[BEY98] Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.
[BM05] Avrim Blum and Yishay Mansour. From external to internal regret. In COLT, 2005.
[CBFH+97] Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
[CBL03] Nicolo Cesa-Bianchi and Gabor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003.
[CBL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.
[CBLS06] Nicolo Cesa-Bianchi, Gabor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Math of O.R. (to appear), 2006.
[CBMS05] Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. In COLT, 2005.
[DH06] Varsha Dani and Thomas P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937–943, 2006.
[FL98] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.
[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1):119–139, 1997.
[FS99] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[FV97] D. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40–55, 1997.
[FV98] D. Foster and R. Vohra. Asymptotic calibration. Biometrika, 85:379–390, 1998.
[FV99] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7–36, 1999.
[Han57] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97–139. Princeton University Press, 1957.
[HMC00] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
[KV03] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. In COLT, pages 26–40, 2003.
[Leh03] E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42:101–115, 2003.
[LW94] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[MB04] H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proc. 17th Annual Conference on Learning Theory (COLT), pages 109–123, 2004.
[Owe82] Guillermo Owen. Game Theory. Academic Press, 1982.
[SL05] Gilles Stoltz and Gabor Lugosi. Internal regret in on-line portfolio selection. Machine Learning Journal, 59:125–159, 2005.
[ST85] D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28:202–208, 1985.
[Zin03] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. ICML, pages 928–936, 2003.