Optimal Payoff Functions for Members of Collectives*

David H. Wolpert
NASA Ames Research Center, Moffett Field, CA 94035
dhw@ptolemy.arc.nasa.gov

Kagan Tumer
NASA Ames Research Center, Moffett Field, CA 94035
kagan@ptolemy.arc.nasa.gov

October 31, 2001

* Appears in Advances in Complex Systems, Vol. 4, Nos. 2 & 3, pp. 265-279, October 2001.

Abstract

We consider the problem of designing (perhaps massively distributed) collectives of computational processes to maximize a provided "world" utility function. We consider this problem when the behavior of each process in the collective can be cast as striving to maximize its own payoff utility function. For such cases the central design issue is how to initialize/update those payoff utility functions of the individual processes so as to induce behavior of the entire collective having good values of the world utility. Traditional "team game" approaches to this problem simply assign to each process the world utility as its payoff utility function. In previous work we used the "Collective Intelligence" (COIN) framework to derive a better choice of payoff utility functions, one that results in world utility performance up to orders of magnitude superior to that ensuing from use of the team game utility. In this paper we extend these results using a novel mathematical framework. We review the derivation under that new framework of the general class of payoff utility functions that both i) are easy for the individual processes to try to maximize, and ii) have the property that if good values of them are achieved, then we are assured of a high value of world utility. These are the "Aristocrat Utility" and a new variant of the "Wonderful Life Utility" that was introduced in the previous COIN work. We demonstrate experimentally that using these new utility functions can result in significantly improved performance over that of previously investigated COIN payoff utilities, over and above those previous utilities' superiority to the conventional team game utility. These results also illustrate the substantial superiority of these payoff functions to perhaps the most natural version of the economics technique of "endogenizing externalities".

1 Introduction

In this paper we are concerned with large distributed collectives of interacting goal-driven computational processes, where there is a provided 'world utility' function that rates the possible behaviors of that collective [31, 30]. We are particularly concerned with such collectives where the individual computational processes use machine learning techniques (e.g., Reinforcement Learning (RL) [14, 22, 21, 25]) to try to achieve their individual goals. We represent those goals of the individual processes as maximizing an associated 'payoff' utility function, one that in general can differ from the world utility.

In such a system, we are confronted with the following inverse problem: How should one initialize/update the payoff utility functions of the individual processes so that the ensuing behavior of the entire collective achieves large values of the provided world utility? In particular, since in truly large systems detailed modeling of the system is usually impossible, how can we avoid such modeling? Can we instead leverage the simple assumption that our learning algorithms are individually fairly good at what they do to achieve a large world utility value?

This problem is related to work in many other fields, including multi-agent systems (MAS's), computational economics, mechanism design, reinforcement learning, statistical mechanics, computational ecologies, (partially observable) Markov decision processes and game theory. However none of these fields is both applicable in large problems, and directly addresses the general inverse problem, rather than a special instance of it. (See [30] for a detailed discussion of the relationship between these fields, involving hundreds of references.)

For example, the field of mechanism design is not generally applicable, being largely tailored to collectives of human beings, and in particular to the idiosyncrasy of such collectives that their members have hidden variables whose values they "do not want to reveal".

There is other previous work that does consider the general inverse problem, and even has each individual computational process (or "agent") use reinforcement learning [2, 7, 10, 15, 18]. However, in that work in general each process has the world utility function as its payoff utility function (i.e., implements a "team game" or an "exact potential game" [8]). Unfortunately, as expounded below and in previous work, this approach scales extremely poorly to large problems. (Intuitively, the difficulty is that each agent can have a hard time discerning the echo of its behavior on the world utility when the system is large; each agent has a horrible "signal-to-noise" problem.)

Intuitively, we are concerned with payoff utility functions that are "aligned" with the world utility, in that modifications a player might make that would improve its payoff utility also must improve world utility. Fortunately the equivalence class of such payoff utilities extends well beyond team-game utilities. As a particular example, in previous work we used the COllective INtelligence (COIN) framework to derive the 'Wonderful Life Utility' (WLU) payoff function [30] as an alternative to a team-game payoff utility. The WLU is aligned with world utility, as desired. In addition though, WLU overcomes much of the signal-to-noise problem of team game utilities [24, 31, 30, 33].

As an example, in some of our previous work we used the WLU for distributed control of network packet routing [31]. Conventional approaches to packet routing have each router run a shortest path algorithm (SPA), i.e., each router routes its packets in the way that it expects will get those packets to their destinations most quickly. Unlike with a WLU-based collective, with SPA-based routing the routers have no concern for the possible deleterious side-effects of their routing decisions on the global goal (e.g., they have no concern for whether they induce bottlenecks). We ran simulations that demonstrated that a WLU-based collective has substantially better throughputs than does the best possible SPA-based system [31], even though that SPA-based system has information denied the COIN system. In related work we have shown that use of the WLU automatically avoids the infamous Braess' paradox, in which adding new links can actually decrease throughput, a situation that readily ensnares SPA's.

As another example, in [32] we considered the pared-down problem domain of a congestion game, in particular a more challenging variant of Arthur's El Farol bar attendance problem [1], sometimes also known as the "minority game" [6]. In this problem the individual processes making up the collective are explicitly viewed as 'players' involved in a non-cooperative game. Each player has to determine which night in the week to attend a bar. The problem is set up so that if either too few people attend (boring evening) or too many people attend (crowded evening), the total enjoyment of the attending players drops. Our goal is to design the payoff functions of the players so that the total enjoyment across all nights is maximized. In this previous work we showed that use of the WLU can result in performance orders of magnitude superior to that of team game utilities.

In this article we extend this previous work, by investigating the impact of the choice of the single free parameter in the WLU (the 'clamping parameter'), which we simply set to 0 in our previous work. In particular, we employ some of the new mathematics of COINs to derive the 'Aristocrat Utility' (AU) as an optimal utility payoff function. We derive the optimal value of the clamping parameter as the value that gives a "mean-field" approximation to the AU. We then present experimental tests to validate that choice of clamping parameter.

In the next section we review the relevant concepts of the mathematics of COINs. Then we sketch how to use those concepts to derive the optimal clamping parameter. To facilitate comparison with previous work, we chose to conduct our experimental investigations of the performance with this optimal clamping parameter in variations of the Bar Problem. We present those variations in Section 3. Finally we present the results of the experiments in Section 4. Those results corroborate the predicted improvement in performance when using our theoretically derived clamping parameter. This extends the superiority of the COIN-based approach above conventional team-game approaches even further than had been done previously. The results also illustrate the substantial superiority of COIN-based techniques to a natural version of the economics technique of "endogenizing externalities".

2 The Mathematics of Collective Intelligence

Since in this paper we are restricting attention to variants of the bar problem, we view the individual computational processes as players involved in an iterated single-stage game. The full mathematics of the COIN framework extends significantly beyond what is needed to address such games.¹ The restricted version we will call upon here starts with an arbitrary vector space $Z$ whose elements $\zeta$ give the joint move of all players in the collective in some stage. We wish to search for the $\zeta$ that maximizes the provided world utility $G(\zeta)$. In addition to $G$ we are concerned with payoff utility functions $\{g_\eta\}$, one such function for each variable/player $\eta$. We use the notation $\hat{\eta}$ to refer to all players other than $\eta$.

¹ That framework encompasses, for example, arbitrary dynamic redefinitions of the "players" (i.e., dynamic reassignments of how the various subsets of the variables comprising the collective are assigned to players), as well as modification of the players' information sets (i.e., modification of inter-player communication). See [28].

We will need to have a way to "standardize" utility functions so that the numeric value they assign to a $\zeta$ only reflects their ranking of $\zeta$ relative to certain other elements of $Z$. We call such a standardization of some arbitrary utility $U$ for player $\eta$ the "intelligence for $\eta$ at $\zeta$ with respect to $U$". Here we will use intelligences that are equivalent to percentiles:
$$\epsilon_U(\zeta : \eta) \equiv \int d\mu_{\zeta_{\hat{\eta}}}(\zeta')\, \Theta[U(\zeta) - U(\zeta')], \tag{1}$$

where the Heaviside function $\Theta$ is defined to equal 1 when its argument is greater than or equal to 0, and to equal 0 otherwise, and where the subscript on the (normalized) measure $d\mu$ indicates it is restricted to $\zeta'$ sharing the same non-$\eta$ components as $\zeta$.² Intelligence values are always between 0 and 1.

² The measure must reflect the type of system at hand, e.g., whether $Z$ is countable or not, and if not, what coordinate system is being used. Other than that, any convenient choice of measure may be used and the theorems will still hold.

Our uncertainty concerning the behavior of the system is reflected in a probability distribution over $Z$. Our ability to control the system consists of setting the value of some characteristic of the collective, e.g., setting the payoff functions of the players. Indicating that value by $s$, our analysis revolves around the following central equation for $P(G \mid s)$, which follows from Bayes' theorem:

$$P(G \mid s) = \int d\vec{\epsilon}_G\, P(G \mid \vec{\epsilon}_G, s) \int d\vec{\epsilon}_g\, P(\vec{\epsilon}_G \mid \vec{\epsilon}_g, s)\, P(\vec{\epsilon}_g \mid s), \tag{2}$$

where $\vec{\epsilon}_g \equiv (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \cdots)$ is the vector of the intelligences of the players with respect to their associated payoff functions, and $\vec{\epsilon}_G \equiv (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \cdots)$ is the vector of the intelligences of the players with respect to $G$.

Note that $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that player $\eta$ is fully rational at $\zeta$, in that its move maximizes its payoff, given the moves of the other players. In other words, a point $\zeta$ where $\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all players $\eta$ is a game-theory Nash equilibrium.³ On the other hand, a $\zeta$ at which all components of $\vec{\epsilon}_G$ equal 1 is a local maximum of $G$ (or more precisely, a critical point of the $G(\zeta)$ surface).

³ See [9]. Note that consideration of points $\zeta$ at which not all intelligences equal 1 provides the basis for a model-independent formalization of bounded rationality game theory, a formalization that contains variants of many of the theorems of conventional full-rationality game theory. See [27].

If we can choose $s$ so that the third conditional probability in the integrand is peaked around vectors $\vec{\epsilon}_g$ all of whose components are close to 1, then we have likely induced large (payoff function) intelligences. If we can also have the second term be peaked about $\vec{\epsilon}_G$ equal to $\vec{\epsilon}_g$, then $\vec{\epsilon}_G$ will also be large. Finally, if the first term in the integrand is peaked about high $G$ when $\vec{\epsilon}_G$ is large, then our choice of $s$ will likely result in high $G$, as desired.

Intuitively, the requirement that payoff functions have high "signal-to-noise" (an issue not considered in conventional work in mechanism design) arises in the third term. It is in the second term that the requirement that the payoff functions be "aligned with $G$" arises. In this work we concentrate on these two terms, and show how to simultaneously set them to have the desired form.⁴

⁴ Non-game-theory-based function maximization techniques like simulated annealing instead address how to have term 1 have the desired form. They do this by trying to ensure that the local maxima that the underlying system ultimately settles near have high $G$, by "trading off exploration and exploitation". One can combine such term-1-based techniques with the techniques presented here. The resultant hybrid algorithm, addressing all three terms, outperforms simulated annealing by over two orders of magnitude [29].

Figure 1: Intelligence of agent $\eta$ at state $\zeta$ for utility $U$: $\zeta$ is the actual joint move at hand. The x-axis shows agent $\eta$'s alternative possible moves (all states $\zeta'$ having $\zeta$'s values for the moves of all players other than $\eta$). The thick sections of the x-axis show the alternative moves that $\eta$ could have made that would have given $\eta$ a worse value of the utility $U$. The fraction of the full set of $\eta$'s possible moves that lies in those thick sections (which is 0.6 in this example) is the intelligence of agent $\eta$ at $\zeta$ for utility $U$, denoted by $\epsilon_U(\zeta : \eta)$.

Details of the stochastic environment in which the collective operates, together with details of the learning algorithms of the players, are reflected in the distribution $P(\zeta)$ which underlies the distributions appearing in Equation 2. Note though that independent of these considerations, our desired form for the second term in Equation 2 is assured if we have chosen payoff utilities such that $\vec{\epsilon}_g$ equals $\vec{\epsilon}_G$ exactly for all $\zeta$. We call such a system factored. In game-theory language, the Nash equilibria of a factored collective are local maxima of $G$. In addition to this desirable equilibrium behavior, factored collectives also automatically provide appropriate off-equilibrium incentives to the players (an issue rarely considered in the game theory / mechanism design literature).

As a trivial example, any "team game" in which all the payoff functions equal $G$ is factored [8, 16]. However team games often have very poor forms for term 3 in Equation 2, forms which get progressively worse as the size of the collective grows. This is because for such payoff functions each player $\eta$ will usually confront a very poor "signal-to-noise" ratio in trying to discern how its actions affect its payoff $g_\eta = G$, since so many other players' actions also affect $G$ and therefore dilute $\eta$'s effect on its own payoff function.

We now focus on algorithms based on payoff functions $\{g_\eta\}$ that optimize the signal/noise ratio reflected in the third term, subject to the requirement that the system be factored. To understand how these algorithms work, given a measure $d\mu(\zeta_\eta)$, define the opacity at $\zeta$ of utility $U$ as:

$$\Omega_U(\zeta : \eta, s) \equiv \int d\zeta'\, J(\zeta' \mid \zeta)\, \frac{|U(\zeta) - U(\zeta'_{\hat{\eta}}, \zeta_\eta)|}{|U(\zeta) - U(\zeta_{\hat{\eta}}, \zeta'_\eta)|}, \tag{3}$$

where $J$ is defined in terms of the underlying probability distributions,⁵ and $(\zeta'_{\hat{\eta}}, \zeta_\eta)$ is defined as the worldline whose $\hat{\eta}$ components are the same as those of $\zeta'$ while its $\eta$ components are the same as those of $\zeta$.

⁵ Writing it out in full, $J(\zeta' \mid \zeta) \equiv J(\zeta_\eta, \zeta' \mid \zeta_{\hat{\eta}}, s) / P(\zeta_\eta \mid \zeta_{\hat{\eta}}, s)$, with:

$$J(\zeta_\eta, \zeta' \mid \zeta_{\hat{\eta}}, s) \equiv \frac{P(\zeta_\eta \mid \zeta_{\hat{\eta}}, s)\, P(\zeta'_\eta \mid \zeta_\eta, s)\, \mu(\zeta'_\eta)}{2} + \frac{P(\zeta'_\eta \mid \zeta'_{\hat{\eta}}, s)\, P(\zeta_\eta \mid \zeta'_\eta, s)\, \mu(\zeta_\eta)}{2}.$$

The denominator absolute value in the integrand in Equation 3 reflects how sensitive $U(\zeta)$ is to changing $\zeta_\eta$. In contrast, the numerator absolute value reflects how sensitive $U(\zeta)$ is to changing $\zeta_{\hat{\eta}}$. So the smaller the opacity of a payoff function $g_\eta$, the more $g_\eta(\zeta)$ depends only on the move of player $\eta$, i.e., the better the associated signal-to-noise ratio for $\eta$. Intuitively then, lower opacity should mean it is easier for $\eta$ to achieve a large value of its intelligence.

To formally establish this, we use the same measure $d\mu$ to define opacity as the one that defined intelligence. Under this choice expected opacity bounds how close to 1 expected intelligence can be [28]:

$$E(\epsilon_U(\zeta : \eta) \mid s) \le 1 - K, \quad \text{where } K \le E(\Omega_U(\zeta : \eta, s) \mid s). \tag{4}$$

So low expected opacity of utility $g_\eta$ ensures that a necessary condition is met for the third term in Equation 2 to have the desired form for player $\eta$. While low opacity is not, formally speaking, also sufficient for $E(\epsilon_U(\zeta : \eta) \mid s)$ to be close to 1, in practice the bounds in Equation 4 are usually tight.

It is possible to solve for the set of all payoff utilities that are factored with respect to a particular world utility. Unfortunately, in general it is not possible for a collective both to be factored and to have zero opacity for all of its players. However consider difference utilities, which are of the form

$$U(\zeta) = G(\zeta) - \Gamma(f(\zeta)) \tag{5}$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Any difference utility is factored [28]. In addition, under usually benign approximations, $E(\Omega_U \mid s)$ is minimized over the set of difference utilities by choosing

$$\Gamma(f(\zeta)) = E(G \mid \zeta_{\hat{\eta}}, s), \tag{6}$$

up to an overall additive constant. We call the resultant difference utility the Aristocrat utility (AU), loosely reflecting the fact that it measures the difference between a player's actual action and the average action.

If possible, we would like each player $\eta$ to use the associated AU as its payoff function to ensure good form for both terms 2 and 3 in Equation 2. This is not always feasible however. The problem is that to evaluate the expectation value defining its AU each player needs to evaluate the current probabilities of each of its potential moves. However if the player then changes its payoff function to be the associated AU it will in general substantially change its ensuing behavior. (The player now wants to choose moves that maximize a different function from the one it was maximizing before.) In other words, it will change the probabilities of its moves, which means that its new payoff function is in fact not the AU for its actual (new) probabilities.

There are ways around this self-consistency problem, but in practice it is often easier to bypass the entire issue, by giving each $\eta$ a payoff function that does not depend on the
probabilities of $\eta$'s own moves. One such payoff function is the Wonderful Life Utility (WLU). The WLU for player $\eta$ is parameterized by a pre-fixed clamping parameter $\mathrm{CL}_\eta$ chosen from among $\eta$'s possible moves:

$$\mathrm{WLU}_\eta \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, \mathrm{CL}_\eta). \tag{7}$$

WLU is factored no matter what the choice of clamping parameter. Furthermore, while not matching the low opacity of AU, WLU usually has far better opacity than does a team game.

$$
\zeta = \begin{matrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \end{matrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\;\Longrightarrow\; \text{clamp } \eta_2 \text{ to "null":}\quad
(\zeta_{\hat{\eta}_2}, \vec{0}) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
$$

$$
\Longrightarrow\; \text{clamp } \eta_2 \text{ to "average":}\quad
(\zeta_{\hat{\eta}_2}, \vec{a}) = \begin{bmatrix} 1 & 0 & 0 \\ .33 & .33 & .33 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
$$

Figure 2: This example shows the impact of the clamping operation on the joint state of a four-player system where each player has three possible moves, each such move represented by a three-dimensional unary vector. The first matrix represents the joint state of the system $\zeta$ where player 1 has selected action 1, player 2 has selected action 3, player 3 has selected action 1 and player 4 has selected move 2. The second matrix displays the effect of clamping player 2's action to the "null" vector (i.e., replacing $\zeta_{\eta_2}$ with $\vec{0}$). The third matrix shows the effect of instead clamping player 2's move to the "average" action vector $\vec{a} = \{.33, .33, .33\}$, which amounts to replacing that player's move with the "illegal" move of fractionally taking each possible move ($\zeta_{\eta_2} = \vec{a}$).

Figure 2 provides an example of clamping. As in that example, in many circumstances there is a particular choice of clamping parameter for player $\eta$ that is a "null" move for that player, equivalent to removing that player from the system. (Hence the name of this payoff function; cf. the Frank Capra movie.) For such a clamping parameter assigning the associated WLU to $\eta$ as its payoff function is closely related to the economics technique of "endogenizing a player's externalities" [17].

However it is usually the case that using WLU with a clamping parameter that is as close as possible to the expected move defining AU results in far lower opacity than does clamping to the null move. Such a WLU is roughly akin to a mean-field approximation to AU.⁶ For example, in Fig. 2, if the probabilities of player 2 making each of its possible moves was 1/3, then one would expect that a clamping parameter of $\vec{a}$ would be close to optimal. Accordingly, in practice use of such an alternative WLU derived as a "mean-field approximation" to AU almost always results in far better values of $G$ than does the "endogenizing" WLU.

⁶ Formally, our approximation is exact only if the expected value of $G$ equals $G$ evaluated at the expected joint move (both expectations being conditioned on given moves by all players other than $\eta$). In general though, for relatively smooth $G$, we would expect such a mean-field approximation to AU to give good results, even if the approximation does not hold exactly.

Intuitively, one can look at AU and WLU from the perspective of a human company, with $G$ the "bottom line" of the company, the players $\eta$ identified with the employees of that company, and the associated $g_\eta$ given by the employees' performance-based compensation packages. For example, for a "factored company", each employee's compensation package contains incentives designed such that the better the bottom line of the corporation, the greater the employee's compensation. As an example, the CEO of a company wishing to have the payoff utilities of the employees be factored with $G$ may give stock options to the employees. The net effect of this action is to ensure that what is good for the employee is also good for the company. In addition, if the compensation packages are "low opacity", the employees will have a relatively easy time discerning the relationship between their behavior and their compensation. In such a case the employees will both have the incentive to help the company and be able to determine how best to do so. Note that in practice, providing stock options is usually more effective in small companies than in large ones. This makes perfect sense in terms of the COIN formalism, since such options generally have lower opacity in small companies than they do in large companies, in which each employee has a hard time seeing how his/her moves affect the company's stock price.

3 The Bar Problem

Arthur's bar problem [1] can be viewed as a problem in designing COINs. Loosely speaking, in this problem at each time step each player $\eta$ decides whether to attend a bar by predicting, based on its previous experience, whether the bar will be too crowded to be "rewarding" at that time, as quantified by a utility function $G$. The selfish nature of the players frustrates the global goal of maximizing $G$. This is because if most players think the attendance will be low (and therefore choose to attend), the attendance will actually be high, and vice-versa.

Here, we focus on the following six more general variants of the bar problem investigated in [33]: There are $N$ players, each picking one out of seven moves every week. Each variant of the game is parameterized by $\ell \in \{1, 2, 3, 4, 5, 6\}$. In a given variant, each move of an agent corresponds to attending the bar on some particular subset of $\ell$ out of the seven nights of the current week (i.e., given $\ell$, each possible move is an 'attendance profile' vertex of the 7-dimensional unit hypercube having $\ell$ 1's). In each week every player chooses a move. Then the associated payoffs for each player are communicated to that player, and the process is repeated. For simplicity, for each $\ell$ we chose the seven possible attendance profiles so that if the moves are selected randomly uniformly, the expected resultant attendance profile across all seven nights is also uniform. (For example, for $\ell = 2$, those profiles are (1,1,0,0,0,0,0), (0,1,1,0,0,0,0), etc.)

More formally, the world utility in any particular week is:

$$G(\zeta) \equiv \sum_{k=1}^{7} \phi(x_k(\zeta)), \tag{8}$$

where $x_k(\zeta)$ is the total attendance on night $k$; $\zeta_\eta$ is $\eta$'s move in that week; $\phi(y) \equiv y \exp(-y/c)$; and $c$ is a real-valued parameter. Our choice of $\phi(\cdot)$ means that when either too few or too many players attend some night in some week world utility $G$ is low.

Since we wish to concentrate on the effects of the utilities rather than on the RL algorithms that use them, we use (very) simple RL algorithms.⁷ We would expect that even marginally more sophisticated RL algorithms would give better performance. In our algorithm each player $\eta$ has a 7-dimensional vector giving its estimates of the utility it would receive for taking each possible move. At the beginning of each week, each $\eta$ picks the night to attend randomly, using a Boltzmann distribution over the seven components of $\eta$'s estimated utilities vector. For simplicity, temperature does not decay in time. However to reflect the fact that each player operates in a non-stationary environment, utility estimates are formed using exponentially aged data: in any week $t$, the estimate $\eta$ makes for the utility for attending night $i$ is a weighted average of all the utilities it has previously received when it attended that night, with the weights given by an exponential function of how long ago each such utility was. To form the players' initial training set, we had an initial period in which all moves by all players were chosen uniformly randomly, with no learning.

⁷ On the other hand, to use algorithms so patently deficient that they have never even been considered in the RL community (like the algorithms used in most of the bar problem literature) would seriously interfere with our ability to interpret our experiments.

4 Experimental Results

We investigate three choices of clamping parameter: $\vec{0}$, $\vec{1} = (1, 1, 1, 1, 1, 1, 1)$, and the "average" move, $\vec{a} = \frac{\ell \vec{1}}{7}$, where as usual $\ell \in \{1, 2, 3, 4, 5, 6\}$, depending on the problem. (To keep the "congestion" level of the different problems close to one another, for $\ell$ going from 1 to 6, $c = \{3, 6, 8, 10, 12, 15\}$ respectively.) The associated WLU's are distinguished with a superscript. In each of the experiments reported here all players have the same utility function, so from now on we drop the player subscript from the payoff utilities. Writing them out, the three WLU functions are:

$$\mathrm{WLU}^{\vec{0}}(\zeta) \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, \vec{0}) = \phi(x_{d_\eta}(\zeta)) - \phi(x_{d_\eta}(\zeta) - 1)$$

$$\mathrm{WLU}^{\vec{1}}(\zeta) \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, \vec{1}) = \sum_{d \ne d_\eta}^{7} \left( \phi(x_d(\zeta)) - \phi(x_d(\zeta) + 1) \right)$$

$$\mathrm{WLU}^{\vec{a}}(\zeta) \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, \vec{a}) = \sum_{d=1}^{7} \phi(x_d(\zeta)) - \sum_{d \ne d_\eta}^{7} \phi\!\left(x_d(\zeta) + \frac{\ell}{7}\right) - \phi\!\left(x_{d_\eta}(\zeta) - 1 + \frac{\ell}{7}\right)$$

where $d_\eta$ is the night picked by $\eta$.

Based on the analysis presented above, $\mathrm{WLU}^{\vec{a}}$ is the WLU that we would expect to be optimal if the probability of each move by any particular player were 1/7, since in those circumstances it is the mean-field approximation to AU. In practice of course, it is not true that each player has a uniform distribution over its possible moves. Here, to avoid addressing the self-consistency issue associated with evaluating AU, we choose to handicap the algorithms based on the predictions of our mathematics by making the (clearly very crude) approximation that each player's probability distribution is indeed uniform, in addition to making the mean-field approximation.

In that it subtracts from the actual value of $G$ what $G$ would have been if player $\eta$ had never existed, $\mathrm{WLU}^{\vec{0}}$ is akin to the standard economics technique of "endogenizing player $\eta$'s externalities". Note that to evaluate it for player $\eta$ one only needs to know the total attendance on the night(s) attended by $\eta$. In contrast, $G$ (team game utility) and $\mathrm{WLU}^{\vec{a}}$ require centralized communication concerning all 7 nights, and $\mathrm{WLU}^{\vec{1}}$ requires communication concerning 6 nights.

In the first experiment each player had to select one night to attend the bar ($\ell = 1$, so the seven attendance-profile vertices form the identity matrix). In this case $\mathrm{CL}_\eta = \vec{0}$ is equivalent to the player "staying at home". On the other hand, $\mathrm{CL}_\eta = \vec{1}$ corresponds to the player attending every night. Finally, $\mathrm{CL}_\eta = \vec{a} = \frac{\vec{1}}{7}$ is equivalent to the players attending partially on all nights in proportions equivalent to the overall attendance profile of all players across the initial training period. (Note that none of these "moves" are actually available to the players. Rather these fictional moves are used to compute the players' payoff utilities, as described in Section 2.)

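The three WLU expressions for the $\ell = 1$ bar problem can be checked numerically against the clamping definition in Equation 7. The following Python sketch is an illustration only and is not code from the paper: the function names, the attendance vector and the chosen player are hypothetical, and it assumes $\ell = 1$, so that each player's move adds exactly 1 to a single night.

```python
import math

NIGHTS = 7


def phi(y, c):
    """Per-night reward from Equation 8: phi(y) = y * exp(-y / c)."""
    return y * math.exp(-y / c)


def world_utility(attendance, c):
    """World utility G: sum of phi over the nightly attendance totals."""
    return sum(phi(x, c) for x in attendance)


def wlu_by_clamping(attendance, night, clamp, c):
    """Equation 7 for l = 1: G(zeta) minus G with the player's move clamped."""
    clamped = list(attendance)
    clamped[night] -= 1  # take out the player's actual move
    clamped = [x + v for x, v in zip(clamped, clamp)]  # put in the clamp vector
    return world_utility(attendance, c) - world_utility(clamped, c)


def wlu_zero(attendance, night, c):
    """Closed form for clamping to the null vector."""
    x = attendance[night]
    return phi(x, c) - phi(x - 1, c)


def wlu_one(attendance, night, c):
    """Closed form for clamping to the all-ones vector."""
    return sum(phi(x, c) - phi(x + 1, c)
               for d, x in enumerate(attendance) if d != night)


def wlu_avg(attendance, night, c, ell=1):
    """Closed form for clamping to the average vector (l / 7 on every night)."""
    shift = ell / NIGHTS
    others = sum(phi(x + shift, c)
                 for d, x in enumerate(attendance) if d != night)
    own = phi(attendance[night] - 1 + shift, c)
    return world_utility(attendance, c) - others - own


# Hypothetical week: 60 players spread over seven nights, c = 3.
attendance = [12, 3, 9, 8, 10, 7, 11]
c = 3.0
night = 1  # the player under study attends the second night

closed = (wlu_zero(attendance, night, c),
          wlu_one(attendance, night, c),
          wlu_avg(attendance, night, c))
clamped = (wlu_by_clamping(attendance, night, [0.0] * NIGHTS, c),
           wlu_by_clamping(attendance, night, [1.0] * NIGHTS, c),
           wlu_by_clamping(attendance, night, [1.0 / NIGHTS] * NIGHTS, c))

for name, a, b in zip(("WLU^0", "WLU^1", "WLU^a"), closed, clamped):
    print(f"{name}: closed form {a:+.6f}, by clamping {b:+.6f}")
```

Only `wlu_zero` can be evaluated from the attendance on the player's own night; the other two need the attendance on further nights, which matches the communication requirements noted above.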
Figure 3: Effect of payoff utility functions on system performance; $\ell = 1$; ($\mathrm{WLU}^{\vec{a}}$ is ◇; $\mathrm{WLU}^{\vec{0}}$ is +; $\mathrm{WLU}^{\vec{1}}$ is □; $G$ is ×). (Plot of global reward against time.)

Figure 3 graphs world utility against time, averaged over 100 runs, for 60 players and $c = 3$. (Throughout this paper, error bars are too small to depict.) The two straight lines correspond to the optimal performance, and the "baseline" performance given by uniform occupancies across all nights. Systems using $\mathrm{WLU}^{\vec{a}}$ and $\mathrm{WLU}^{\vec{0}}$ rapidly converged to optimal and to quite good performance, respectively. This indicates that for the bar problem the "mild assumptions" mentioned above hold, and that the approximations in the derivation of the optimal clamping parameter are valid.

Figure 4 shows the normalized world utility obtained for the different payoff utilities as a function of $\ell$ (i.e., when players attend the bar on $\ell$ nights in one week). Temperatures varied between .01 and .02 for the three WLU utilities, and between .1 and .2 for the $G$ utility, which provided the respective best performances for each. $\mathrm{WLU}^{\vec{a}}$ performs well for all problems. $\mathrm{WLU}^{\vec{1}}$ on the other hand performs poorly when players only attend on a few nights, but reaches the performance of $\mathrm{WLU}^{\vec{a}}$ when players need to select six nights, a situation where the two clamping vectors are very similar ($\vec{1}$ and $\frac{6 \vec{1}}{7}$, respectively). $\mathrm{WLU}^{\vec{0}}$ shows a slight drop in performance when the number of nights to attend increases. The fact that it always performs worse than does $\mathrm{WLU}^{\vec{a}}$ illustrates the shortcoming of conventional economics techniques which do not consider the kind of signal-to-noise issues that drive opacity. $G$ shows an even more pronounced drop with the number of nights to attend than does $\mathrm{WLU}^{\vec{0}}$. Furthermore, in agreement with our previous results [33], for all problems the poor signal-to-noise when using the payoff function $G$ results in poor values of world utility, despite that payoff function's being factored. Taken together, these results confirm our theoretical prediction of what payoff utility converges fastest to the world utility maximum.

Figure 4: Behavior of payoff utility functions with respect to number of nights to attend ($\ell = 1$). ($\mathrm{WLU}^{\vec{a}}$ is ◇; $\mathrm{WLU}^{\vec{0}}$ is +; $\mathrm{WLU}^{\vec{1}}$ is □; $G$ is ×). (Plot of normalized performance against number of nights to attend.)

Figure 5 shows how $t = 500$ performance scales with $N$ for each of the utility functions. For comparison purposes the performance is normalized: for each utility $U$ we plot $\frac{U - G_{base}}{G_{opt} - G_{base}}$, where $G_{opt}$ and $G_{base}$ are the optimal performance and a canonical baseline performance given by uniform attendance across all nights, respectively. Systems using team game utility ($G$) perform adequately when $N$ is low. As $N$ increases however, it becomes increasingly difficult for the players to extract the information they need from $G$. Because of their lower opacity, systems using the different WLUs overcome this signal-to-noise problem to a great extent. Because the WLUs are based on the difference between the actual state and the state where one player's state is clamped, they are much less affected by the total number of players. However, note that the vector to which players are clamped significantly affects the scaling properties, showing that even among difference utilities, optimizing for opacity provides a significant gain.

We also studied the sensitivity of performance to the internal parameters of the learning algorithms. Figure 6 presents experiments with $\ell = 1$ for a set of different temperatures in the RL algorithms. (The two straight lines correspond to the optimal performance, and the "baseline" performance given by uniform occupancies across all nights.) $\mathrm{WLU}^{\vec{a}}$ is fairly insensitive to the temperature, until it gets so high that players' moves are chosen almost randomly. $\mathrm{WLU}^{\vec{0}}$ depends more than $\mathrm{WLU}^{\vec{a}}$ does on having sufficient exploration and therefore has a narrower range of good temperatures. Both $\mathrm{WLU}^{\vec{1}}$ and $G$ have more serious opacity problems, and therefore have shallower and thinner performance graphs.

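The normalization used in the scaling comparison can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: it assumes $\ell = 1$, takes the baseline to be $G$ evaluated at an exactly uniform attendance of $N/7$ per night, and takes the optimum to be the largest $G$ over all integer ways of splitting $N$ players across seven nights. The function names are hypothetical.

```python
import math

NIGHTS = 7


def phi(y, c):
    """Per-night reward from Equation 8: phi(y) = y * exp(-y / c)."""
    return y * math.exp(-y / c)


def baseline_utility(n_players, c):
    """G at uniform attendance (assumed reading of the 'baseline')."""
    return NIGHTS * phi(n_players / NIGHTS, c)


def optimal_utility(n_players, c):
    """Largest G over integer attendance vectors summing to n_players (l = 1).

    Dynamic programme over nights: best[r] is the best total reward obtainable
    by placing exactly r players on the nights handled so far.
    """
    best = [phi(r, c) for r in range(n_players + 1)]  # one night only
    for _ in range(NIGHTS - 1):  # add the remaining six nights one at a time
        best = [max(best[r - x] + phi(x, c) for x in range(r + 1))
                for r in range(n_players + 1)]
    return best[n_players]


def normalized_performance(g, g_base, g_opt):
    """(G - G_base) / (G_opt - G_base): 0 at baseline, 1 at the optimum."""
    return (g - g_base) / (g_opt - g_base)


# Setting of the first experiment: 60 players, c = 3.
g_base = baseline_utility(60, 3.0)
g_opt = optimal_utility(60, 3.0)
print(f"baseline G = {g_base:.4f}, optimal G = {g_opt:.4f}")
print(f"normalized baseline = {normalized_performance(g_base, g_base, g_opt):.1f}")
print(f"normalized optimum  = {normalized_performance(g_opt, g_base, g_opt):.1f}")
```

Under these assumptions the optimum for 60 players and $c = 3$ places three players on each of six nights and leaves the rest on the seventh, because $\phi$ peaks at $y = c$.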
Figure 5: Scaling properties of the payoff utility functions; $\ell = 1$; $c = \{2, 3, 4, 6, 8, 10, 15\}$ for $N = \{40, 60, 80, 120, 160, 200, 300\}$, respectively. ($\mathrm{WLU}^{\vec{a}}$ is ◇; $\mathrm{WLU}^{\vec{0}}$ is +; $\mathrm{WLU}^{\vec{1}}$ is □; $G$ is ×). (Plot of normalized performance against number of agents.)

Figure 6: Sensitivity of payoff utility functions to internal parameters; $\ell = 1$; $c = 3$. ($\mathrm{WLU}^{\vec{a}}$ is ◇; $\mathrm{WLU}^{\vec{0}}$ is +; $\mathrm{WLU}^{\vec{1}}$ is □; $G$ is ×). (Plot of global reward against log temperature.)

5 Conclusion

In this article we considered how to design large multi-agent systems to meet a pre-specified goal when each agent in the system uses reinforcement learning to choose its actions. We cast this problem as how to initialize/update the individual agents' payoff utility functions so that their collective behavior optimizes a pre-specified world utility function. The mathematics of COINs is specifically concerned with this problem. In previous experiments we showed that systems based on that math far outperformed conventional "team game" systems, in which each agent has the world utility as its private utility function. Moreover, the gain in performance grows with the size of the system, typically reaching orders of magnitude for systems that consist of hundreds of agents.

In those previous experiments the COIN-based private utilities had a free parameter, which we arbitrarily set to 0. However as synopsized in this paper, it turns out that a series of approximations allows one to derive an optimal value for that parameter. Here we have repeated some of our previous computer experiments, only using this new value for the parameter. These experiments confirm that with this new value the system converges to significantly superior world utility values, with less sensitivity to the parameters of the agents' RL algorithms. This makes even stronger the arguments for using a COIN-based system rather than a team-game system.

Future work involves improving the approximations needed to calculate the optimal private utility parameter value. In particular, given that that value varies in time, we intend to investigate having it be calculated in an on-line manner.

References

[1] W. B. Arthur. Complexity in economic theory: Inductive reasoning and bounded rationality. The American Economic Review, 84(2):406-411, May 1994.
[2] C. Boutilier. Multiagent systems: Challenges and opportunities for decision theoretic planning. AI Magazine, 20:35-43, winter 1999.
[3] C. Boutilier, Y. Shoham, and M. P. Wellman. Editorial: Economic principles of multi-agent systems. Artificial Intelligence Journal, 94:1-6, 1997.
[4] J. M. Bradshaw, editor. Software Agents. MIT Press, 1997.
[5] G. Caldarelli, M. Marsili, and Y. C. Zhang. A prototype model of stock exchange. Europhysics Letters, 40:479-484, 1997.
[6] D. Challet and Y. C. Zhang. On the minority game: Analytical and numerical studies. Physica A, 256:514, 1998.
[7] C. Claus and C. Boutilier. The dynamics of reinforcement learning cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 746-752, Madison, WI, June 1998.
[8] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems - 8, pages 1017-1023. MIT Press, 1996.
[9] D. Fudenberg and J. Tirole. Game Theory. MIT Press, Cambridge, MA, 1991.
[10] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242-250, June 1998.
[11] B. A. Huberman and T. Hogg. The behavior of computational ecologies. In The Ecology of Computation, pages 77-115. North-Holland, 1988.
[12] N. R. Jennings, K. Sycara, and M. Wooldridge. A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems, 1:7-38, 1998.
[13] N. F. Johnson, S. Jarvis, R. Jonson, P. Cheung, Y. R. Kwong, and P. M. Hui. Volatility and agent adaptability in a self-organizing market. preprint cond-mat/9802177, February 1998.
[14] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[15] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, pages 157-163, 1994.
[16] D. Monderer and L. S. Shapley. Potential games. Games and Economic Behavior, 14:124-143, 1996.
[17] W. Nicholson. Microeconomic Theory. The Dryden Press, seventh edition, 1998.
[18] T. Sandholm and R. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37:147-166, 1995.
[19] T. Sandholm, K. Larson, M. Anderson, O. Shehory, and F. Tohme. Anytime coalition structure generation with worst case guarantees. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 46-53, 1998.
[20] S. Sen. Multi-Agent Learning: Papers from the 1997 AAAI Workshop (Technical Report WS-97-03). AAAI Press, Menlo Park, CA, 1997.
[21] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[23] K. Sycara. Multiagent systems. AI Magazine, 19(2):79-92, 1998.
[24] K. Tumer and D. H. Wolpert. Collective intelligence and Braess' paradox. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 104-109, Austin, TX, 2000.
[25] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3/4):279-292, 1992.
[26] M. P. Wellman. A market-oriented programming environment and its application to distributed multicommodity flow problems. In Journal of Artificial Intelligence Research, 1993.
[27] D. H. Wolpert. Bounded-rationality game theory. pre-print, 2001.
[28] D. H. Wolpert. The mathematics of collective intelligence. pre-print, 2001.
[29] D. H. Wolpert, E. Bandari, and K. Tumer. Improving simulated annealing by recasting it as a non-cooperative game. Nature, 2001. submitted.
[30] D. H. Wolpert and K. Tumer. An Introduction to Collective Intelligence. Technical Report NASA-ARC-IC-99-63, NASA Ames Research Center, 1999. URL: http://ic.arc.nasa.gov/ic/projects/coinpubs.html. To appear in Handbook of Agent Technology, Ed. J. M. Bradshaw, AAAI/MIT Press.
[31] D. H. Wolpert, K. Tumer, and J. Frank. Using collective intelligence to route internet traffic. In Advances in Neural Information Processing Systems - 11, pages 952-958. MIT Press, 1999.
[32] D. H. Wolpert, K. Wheeler, and K. Tumer. General principles of learning-based multi-agent systems. In Proceedings of the Third International Conference of Autonomous Agents, pages 77-83, 1999.
[33] D. H. Wolpert, K. Wheeler, and K. Tumer. Collective intelligence for control of distributed dynamical systems. Europhysics Letters, 49(6), March 2000.
[34] Y. C. Zhang. Modeling market mechanism with evolutionary games. Europhysics Letters, March/April 1998.