
Learning Payoff Functions in Infinite Games

Yevgeniy Vorobeychik, Michael P. Wellman, and Satinder Singh
University of Michigan
Artificial Intelligence Laboratory
1101 Beal Avenue
Ann Arbor, MI 48109-2110 USA
{yvorobey, wellman, baveja}@umich.edu

Abstract

We consider a class of games with real-valued strategies and payoff information available only in the form of data from a given sample of strategy profiles. Solving such games with respect to the underlying strategy space requires generalizing from the data to a complete payoff-function representation. We address payoff-function learning as a standard regression problem, with provision for capturing known structure (symmetry) in the multiagent environment. To measure learning performance, we consider the relative utility of prescribed strategies, rather than the accuracy of payoff functions per se. We demonstrate our approach and evaluate its effectiveness on two examples: a two-player version of the first-price sealed-bid auction (with known analytical form), and a five-player market-based scheduling game (with no known solution).

1 Introduction

Game-theoretic analysis typically begins with a complete description of strategic interactions, that is, the game. We consider the prior question of determining what the game actually is, given a database of game experience rather than any direct specification. This is one possible target of learning applied to games [Shoham et al., 2003]. When agents have few available actions and outcomes are deterministic, the game can be identified through systematic exploration. For instance, we can ask the agents to play each strategy profile in the entire joint strategy set and record the payoffs for each. If the joint action space is small enough, limited nondeterminism can be handled by sampling. Coordinating exploration of the joint set does pose difficult issues. Brafman and Tennenholtz, for example, address these carefully for the case of common-interest stochastic games [Brafman and Tennenholtz, 2003], as well as the general problem of maintaining an equilibrium among learning algorithms [Brafman and Tennenholtz, 2004].

Further difficulties are posed by intractably large (or infinite) strategy sets. We can make this problem tractable by reducing the number of profiles that agents are allowed to play, but this comes at the cost of transforming the game of interest into a different game entirely. Instead, we seek to identify the full game (or at least a less restrictive game) from limited data, entailing some generalization from observed instances. Approximating payoff functions using supervised learning (regression) methods allows us to deal with continuous agent strategy sets, providing a payoff for an arbitrary strategy profile. In so doing, we adopt functional forms consistent with prior knowledge about the game, and also admit biases toward forms facilitating subsequent game analysis (e.g., equilibrium calculation).

In this paper, we present our first investigation of approximating payoff functions, employing regression to low-degree polynomials. We explore two example games, both with incomplete information and real-valued actions. First is the standard first-price sealed-bid auction, with two players and symmetric value distributions. The solution to this game is well known [Krishna, 2002], and its availability in analytical form proves useful for benchmarking our learning approach. Our second example is a five-player market-based scheduling game [Reeves et al., 2005], where time slots are allocated by simultaneous ascending auctions [Milgrom, 2000]. This game has no known solution, though previous work has identified equilibria on discretized subsets of the strategy space.

2 Preliminaries

2.1 Notation

A generic normal form game is formally expressed as $[I, \{\Delta(S_i)\}, \{u_i(s)\}]$, where $I$ refers to the set of players and $m = |I|$ is the number of players. $S_i$ is the set of strategies available to player $i \in I$, and the set $\Delta(S_i)$ is the simplex of mixed strategies over $S_i$. Finally, $u_i(s) : S_1 \times \cdots \times S_m \to \mathbb{R}$ is the payoff function of player $i$ when all players jointly play $s = (s_1, \ldots, s_m)$, with each $s_j \in S_j$. As is common, we assume von Neumann-Morgenstern utility, allowing an agent $i$'s payoff for a particular mixed strategy profile to be

$$u_i(\sigma) = \sum_{s \in S} [\sigma_1(s_1) \cdots \sigma_m(s_m)]\, u_i(s),$$

where $\sigma_j : S_j \to [0,1]$ is a mixed strategy of player $j$, assigning a probability to each pure strategy $s_j \in S_j$ such that all probabilities over the agent's strategy set add to 1 (i.e., $\sigma_j \in \Delta(S_j)$).

It will often be convenient to refer to the strategy (pure or mixed) of player $i$ separately from that of the remaining players. To accommodate this, we use $s_{-i}$ to denote the joint strategy of all players other than player $i$.

977 MULTI-AGENT SYSTEMS
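As an illustration of the expected-payoff formula in Section 2.1, the following minimal Python sketch enumerates the pure profiles in the support of a mixed profile and weights each payoff by the product of the players' probabilities. The function name `expected_payoff`, the dictionary representation of mixed strategies, and the toy matching game are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product


def expected_payoff(payoff, mixed_profile):
    """Return sum over pure profiles s of [prod_j sigma_j(s_j)] * payoff(s).

    payoff: function mapping a pure profile (tuple) to one player's payoff.
    mixed_profile: one dict per player, mapping pure strategy -> probability.
    """
    supports = [list(sigma.keys()) for sigma in mixed_profile]
    total = 0.0
    for pure in product(*supports):
        prob = 1.0
        for sigma, s_j in zip(mixed_profile, pure):
            prob *= sigma[s_j]
        total += prob * payoff(pure)
    return total


# Hypothetical two-player game: the player is paid 1 when both actions match.
def match_payoff(profile):
    return 1.0 if profile[0] == profile[1] else 0.0


sigma = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
value = expected_payoff(match_payoff, sigma)
```

The enumeration is exponential in the number of players, so a sketch like this is only practical for small finite supports.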


2.2 Nash Equilibrium

In this paper, we are concerned with one-shot normal-form games, in which players make decisions simultaneously and accrue payoffs, upon which the game ends. This single-shot nature may seem to preclude learning from experience, but in fact repeated episodes are allowed, as long as actions cannot affect future opportunities or condition future strategies. Game payoff data may also be obtained from observations of other agents playing the game, or from simulations of hypothetical runs of the game. In any of these cases, learning is relevant despite the fact that the game is to be played only once.

Faced with a one-shot game, an agent would ideally play its best strategy given those played by the other agents. A configuration where all agents play strategies that are best responses to the others constitutes a Nash equilibrium.

Definition 1. A strategy profile $s = (s_1, \ldots, s_m)$ constitutes a (pure-strategy) Nash equilibrium of game $[I, \{S_i\}, \{u_i(s)\}]$ if for every $i \in I$, $s_i' \in S_i$,

$$u_i(s_i, s_{-i}) \ge u_i(s_i', s_{-i}).$$

A similar definition applies when mixed strategies are allowed.

Definition 2. A strategy profile $\sigma = (\sigma_1, \ldots, \sigma_m)$ constitutes a mixed-strategy Nash equilibrium of game $[I, \{\Delta(S_i)\}, \{u_i(s)\}]$ if for every $i \in I$, $\sigma_i' \in \Delta(S_i)$,

$$u_i(\sigma_i, \sigma_{-i}) \ge u_i(\sigma_i', \sigma_{-i}).$$

In this study we devote particular attention to games that exhibit symmetry with respect to payoffs.
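Definition 1 can be checked mechanically in a small finite game by testing every unilateral deviation. The sketch below does this for a hypothetical two-player prisoner's dilemma; the payoff table, the strategy labels `"C"` and `"D"`, and the function name `is_pure_nash` are illustrative assumptions, not part of the paper.

```python
def is_pure_nash(payoffs, profile, strategy_sets):
    """Check Definition 1: no player gains by deviating unilaterally.

    payoffs: list of functions; payoffs[i](profile) is player i's payoff.
    profile: tuple with one pure strategy per player.
    strategy_sets: list with the finite strategy set of each player.
    """
    for i, options in enumerate(strategy_sets):
        current = payoffs[i](profile)
        for alternative in options:
            deviated = profile[:i] + (alternative,) + profile[i + 1:]
            if payoffs[i](deviated) > current:
                return False
    return True


# Hypothetical prisoner's dilemma payoffs, indexed by (row action, column action).
TABLE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
payoffs = [lambda s: TABLE[s][0], lambda s: TABLE[s][1]]
strategy_sets = [["C", "D"], ["C", "D"]]

defect_is_nash = is_pure_nash(payoffs, ("D", "D"), strategy_sets)
cooperate_is_nash = is_pure_nash(payoffs, ("C", "C"), strategy_sets)
```

With infinite strategy sets, as in the games studied here, the inner loop over alternatives has to be replaced by an optimization or a grid search, which is one reason the paper works with low-degree polynomial models.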

Definition 3. A game $[I, \{\Delta(S_i)\}, \{u_i(s)\}]$ is symmetric if $\forall i, j \in I$, (a) $S_i = S_j$ and (b) $u_i(s_i, s_{-i}) = u_j(s_j, s_{-j})$ whenever $s_i = s_j$ and $s_{-i} = s_{-j}$.

Symmetric games have relatively compact descriptions and may present associated computational advantages. Given a symmetric game, we may focus on the subclass of symmetric equilibria, which are arguably most natural [Kreps, 1990], and avoid the need to coordinate on roles. In fairly general settings, symmetric games do possess symmetric equilibria [Nash, 1951].

3 Payoff Function Approximation

3.1 Problem Definition

We are given a set of data points $(s, v)$, each describing an instance where agents played strategy profile $s$ and realized value $v = (v_1, \ldots, v_m)$. For deterministic games of complete information, $v$ is simply $u(s)$. With incomplete information or stochastic outcomes, $v$ is a random variable, more specifically an independent draw from a distribution function of $s$, with expected value $u(s)$.

The payoff function approximation task is to select a function $\hat{u}$ from a candidate set $U$ minimizing some measure of deviation from the true payoff function $u$. Because the true function $u$ is unknown, of course, we must base our selection on evidence provided by the given data points.

Our goal in approximating payoff functions is typically not predicting payoffs themselves, but rather assessing strategic behavior. Therefore, for assessing our results, we measure approximation quality not directly in terms of a distance between $\hat{u}$ and $u$, but rather in terms of the strategies dictated by $\hat{u}$ evaluated with respect to $u$. For this we appeal to the notion of approximate Nash equilibrium.

Definition 4. A strategy profile $\sigma = (\sigma_1, \ldots, \sigma_m)$ constitutes an $\epsilon$-Nash equilibrium of game $[I, \{\Delta(S_i)\}, \{u_i(s)\}]$ if for every $i \in I$, $\sigma_i' \in \Delta(S_i)$,

$$u_i(\sigma_i, \sigma_{-i}) + \epsilon \ge u_i(\sigma_i', \sigma_{-i}).$$

We propose using $\epsilon$ in the above definition as a measure of approximation error of $\hat{u}$, and employ it in evaluating our learning methods. When $u$ is known, we can compute $\epsilon$ in a straightforward manner. Let $s_i^*$ denote $i$'s best-response correspondence, defined by $s_i^*(\sigma_{-i}) = \{x : x \in \arg\max_{s_i} u_i(s_i, \sigma_{-i})\}$. For clarity of exposition, we take $s_i^*(\sigma_{-i})$ to be single-valued. Let $\hat{\sigma}$ be a solution (e.g., a Nash equilibrium) of game $[I, \{\Delta(S_i)\}, \{\hat{u}_i(s)\}]$. Then $\hat{\sigma}$ is an $\epsilon$-Nash equilibrium of the true game $[I, \{\Delta(S_i)\}, \{u_i(s)\}]$, for

$$\epsilon = \max_{i \in I} \left[ u_i(s_i^*(\hat{\sigma}_{-i}), \hat{\sigma}_{-i}) - u_i(\hat{\sigma}_i, \hat{\sigma}_{-i}) \right].$$

Since in general $u$ will either be unknown or not amenable to this analysis, we developed a method for estimating $\epsilon$ from data. We will describe it in some detail below.

For the remainder of this report, we focus on a special case of the general problem, where action sets are real-valued intervals, $S_i = [0,1]$. Moreover, we restrict attention to symmetric games and further limit the number of variables in payoff-function hypotheses by using some form of aggregation of other agents' actions. The assumption of symmetry allows us to adopt the convention for the remainder of the paper that payoff $u(s_i, s_{-i})$ is to the agent playing $s_i$.

3.2 Polynomial Regression

One class of models we consider are the $n$th-degree separable polynomials:

$$u(s_i, \phi(s_{-i})) = a_n s_i^n + \cdots + a_1 s_i + b_n \phi^n(s_{-i}) + \cdots + b_1 \phi(s_{-i}) + d, \quad (1)$$

where $\phi(s_{-i})$ represents some aggregation of the strategies played by agents other than $i$. For two-player games, $\phi$ is simply the identity function. We refer to polynomials of the form (1) as separable, since they lack terms combining $s_i$ and $\phi(s_{-i})$.
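To make the regression step concrete, here is a minimal sketch of fitting the separable form (1) with degree two by ordinary least squares, using only the Python standard library. The normal-equations solver, the function names, and the synthetic coefficients used to generate data are all assumptions for illustration; the paper does not state which fitting routine was used.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_separable_quadratic(samples):
    """Least-squares fit of a2*s^2 + a1*s + b2*phi^2 + b1*phi + d.

    samples: list of (s_i, phi, payoff) triples.
    Returns the coefficients (a2, a1, b2, b1, d).
    """
    rows = [[s * s, s, p * p, p, 1.0] for s, p, _ in samples]
    targets = [v for _, _, v in samples]
    k = 5
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return tuple(solve_linear(xtx, xty))


# Synthetic, noise-free data from a hypothetical separable quadratic.
def true_u(s, p):
    return -2.0 * s * s + 1.5 * s + 0.5 * p * p - 1.0 * p + 0.25


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
samples = [(s, p, true_u(s, p)) for s in grid for p in grid]
coeffs = fit_separable_quadratic(samples)
```

Solving the normal equations directly is adequate for a handful of features on $[0,1]$; for higher-degree polynomials a numerically more careful method (for example a QR-based solver) would be the safer choice.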

We also consider models with such terms, for example, the non-separable quadratic:

$$u(s_i, \phi(s_{-i})) = a_2 s_i^2 + a_1 s_i + b_2 \phi^2(s_{-i}) + b_1 \phi(s_{-i}) + c\, s_i \phi(s_{-i}) + d. \quad (2)$$

Note that (2) and (1) coincide in the case $n = 2$ and $c = 0$. In the experiments described below, we employ a simpler version of the non-separable quadratic that takes $b_1 = b_2 = 0$.

One advantage of the quadratic form is that we can analytically solve for Nash equilibrium. Given a general non-separable quadratic (2), the necessary first-order condition for an interior solution is

$$s_i = -\frac{a_1 + c\, \phi(s_{-i})}{2 a_2}.$$

This reduces to $s_i = -a_1 / (2 a_2)$ in the separable case. For the non-separable case with additive aggregation, $\phi_{sum}(s_{-i}) = \sum_{j \ne i} s_j$, we can derive an explicit first-order condition for symmetric equilibrium:

$$s_i = -\frac{a_1}{2 a_2 + (m - 1) c}.$$

While a pure-strategy equilibrium will necessarily exist for any separable polynomial model, it is only guaranteed to exist in the non-separable case when the learned quadratic is concave. In the experiments that follow, when the learned non-separable quadratic does not have a pure Nash equilibrium, we generate an arbitrary symmetric pure profile as the approximate Nash equilibrium. Another difficulty arises when a polynomial of degree higher than three has more than one Nash equilibrium. In such a case we select an equilibrium arbitrarily.

Footnote 1 (on the restriction, in Section 3.1, to symmetric games with aggregated opponent actions): Although none of these restrictions are inherent in the approach, one must of course recognize the tradeoffs in complexity of the hypothesis space and generalization performance.

3.3 Local Regression

In addition to polynomial models, we explored learning using two local regression methods: locally weighted average and locally weighted quadratic regression [Atkeson et al., 1997]. Unlike model-based methods such as polynomial regression, local methods do not attempt to infer model coefficients from data. Instead, these methods weigh the training data points by distance from the query point and estimate the answer (in our case, the payoff at the strategy profile point) using some function of the weighted data set. We used a Gaussian weight function, $w = e^{-d^2}$, where $d$ is the distance of the training data point from the query point and $w$ is the weight that is assigned to that training point.

In the case of locally weighted average, we simply take the weighted average of the payoffs of the training data points as our payoff at an arbitrary strategy profile. Locally weighted quadratic regression, on the other hand, fits a quadratic regression to the weighted data set for each query point.

3.4 Support Vector Machine Regression

The third category of learning methods we used was Support Vector Machines (SVMs). For details regarding this learning method, we refer an interested reader to [Vapnik, 1995]. In our experiments, we used the SVM light package [Joachims, 1999], which is an open-source implementation of SVM classification and regression algorithms.

3.5 Finding Mixed Strategy Equilibria

In the case of polynomial regression, we were able to find either analytic or simple and robust numeric methods for computing pure Nash equilibria. With local regression and SVM learning we are not so fortunate, as we do not have access to a closed-form description of the function we are learning. Furthermore, we are often interested in mixed strategy approximate equilibria, and our polynomial models and solution methods yield pure strategy equilibria.

When a particular learned model is not amenable to a closed-form solution, we can approximate the learned game with a finite strategy grid and find a mixed-strategy equilibrium of the resulting finite game using a general-purpose finite-game solver. We employed replicator dynamics [Fudenberg and Levine, 1998], which searches for a symmetric mixed equilibrium using an iterative evolutionary algorithm. We treat the result after a fixed number of iterations as an approximate Nash equilibrium of the learned game.

3.6 Strategy Aggregation

As noted above, we consider payoff functions on two-dimensional strategy profiles in the form

$$u(s_i, s_{-i}) = f(s_i, \phi(s_{-i})).$$

As long as $\phi(s_{-i})$ is invariant under different permutations of the same strategies in $s_{-i}$, the payoff function is symmetric. Since the actual payoff functions for our example games are also known to be symmetric, we constrain $\phi(s_{-i})$ to preserve the symmetry of the underlying game.

In our experiments, we compared three variants of $\phi(s_{-i})$. First and most compact is the simple sum, $\phi_{sum}(s_{-i}) = \sum_{j \ne i} s_j$. Second is the ordered pair $(\phi_{sum}, \phi_{ss})$, where $\phi_{ss}(s_{-i}) = \sum_{j \ne i} (s_j)^2$. The third variant, $\phi_{identity}(s_{-i}) = s_{-i}$, simply takes the strategies in their direct, unaggregated form. To enforce the symmetry requirement in this last case, we sort the strategies in $s_{-i}$.

4 First-Price Sealed-Bid Auction

In the standard first-price sealed-bid (FPSB) auction game [Krishna, 2002], agents have private valuations for the good for sale, and simultaneously choose a bid price representing their offer to purchase the good. The bidder naming the highest price gets the good and pays the offered price. Other agents receive and pay nothing. In the classic setup first analyzed by Vickrey [1961], agents have identical valuation distributions, uniform on $[0,1]$, and these distributions are common knowledge. The unique (Bayesian) Nash equilibrium of this game is for agent $i$ to bid $\frac{m-1}{m} x_i$, where $x_i$ is $i$'s valuation for the good.

Note that strategies in this game (and generally for games of incomplete information), $b_i : [0,1] \to [0,1]$, are functions of the agent's private information. We consider a restricted case, where bid functions are constrained to the form

$$b_i(x_i) = k_i x_i, \quad k_i \in [0,1].$$

This constraint transforms the action space to a real interval, corresponding to the choice of parameter $k_i$. We can easily see that the restricted strategy space includes the known equilibrium of the full game, with $s_i = k_i = \frac{m-1}{m}$ for all $i$, which is also an equilibrium of the restricted game in which agents are constrained to strategies of the given form.

We further focus on the special case $m = 2$, with corresponding equilibrium at $s_1 = s_2 = 1/2$. For the two-player FPSB, we can also derive a closed-form description of the actual expected payoff function:

$$u(s_1, s_2) = \begin{cases} 0.25 & \text{if } s_1 = s_2 = 0, \\ \dfrac{(s_1 - 1)\left[(s_2)^2 - 3 (s_1)^2\right]}{6 (s_1)^2} & \text{if } s_1 \ge s_2, \\ \dfrac{s_1 (1 - s_1)}{3 s_2} & \text{otherwise.} \end{cases} \quad (3)$$

The availability of known solutions for this example facilitates analysis of our learning approach. Our results are summarized in Figure 1. For each of our methods (classes of functional forms), we measured average $\epsilon$ for varying training set sizes. For instance, to evaluate the performance of separable quadratic approximation with training size $N$, we independently draw $N$ strategies, $\{s^1, \ldots, s^N\}$, uniformly on $[0,1]$. The corresponding training set comprises $N^2$ points: $((s^i, s^j), u(s^i, s^j))$, for $i, j \in \{1, \ldots, N\}$, with $u$ as given by (3). We find the best separable quadratic fit $\hat{u}$ to these points, and find a Nash equilibrium corresponding to $\hat{u}$. We then calculate the least $\epsilon$ for which this strategy profile is an $\epsilon$-Nash equilibrium with respect to the actual payoff function $u$. We repeat this process 200 times, averaging the results over strategy draws, to obtain each value plotted in Figure 1.

Figure 1: Epsilon versus number of training strategy points for different functional forms. (Plot of average $\epsilon$ against the number of strategies in the training set; series: sample best, separable quadratic, non-separable quadratic, 3rd-degree polynomial, 4th-degree polynomial.)
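The evaluation step for the two-player FPSB game can be sketched as follows: implement the expected payoff (3), then approximate the least $\epsilon$ for which a candidate symmetric profile is an $\epsilon$-Nash equilibrium by searching over unilateral deviations. The grid search, its resolution `grid_size`, and the function names are assumptions of this sketch; the paper does not say how the best response against $u$ was computed.

```python
def fpsb_payoff(s1, s2):
    """Expected payoff (3) to the bidder playing s1 when the opponent plays s2."""
    if s1 == 0.0 and s2 == 0.0:
        return 0.25
    if s1 >= s2:
        return (s1 - 1.0) * (s2 ** 2 - 3.0 * s1 ** 2) / (6.0 * s1 ** 2)
    return s1 * (1.0 - s1) / (3.0 * s2)


def epsilon_of_symmetric_profile(s_hat, grid_size=1001):
    """Approximate the least epsilon making (s_hat, s_hat) an epsilon-Nash equilibrium.

    Deviations are restricted to an evenly spaced grid on [0, 1], so the
    result is a lower bound on the true epsilon up to the grid resolution.
    """
    base = fpsb_payoff(s_hat, s_hat)
    best = max(
        fpsb_payoff(k / (grid_size - 1), s_hat) for k in range(grid_size)
    )
    return max(0.0, best - base)


eps_at_equilibrium = epsilon_of_symmetric_profile(0.5)
eps_off_equilibrium = epsilon_of_symmetric_profile(0.3)
```

At the known equilibrium $s_1 = s_2 = 1/2$ the computed value is zero, while a profile away from it (0.3 in this example) leaves a strictly positive gain from deviation.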

Figure 2: Learned and actual payoff function when the other agent plays 0.5. The learned function is the separable quadratic, for a particular sample. (Plot of payoffs($s_1$, 0.5); series: actual function, learned quadratic.)

As we can see, both second-degree polynomial forms we tried do quite well on this game. For $N < 20$, quadratic regression outperforms the model labeled "sample best", in which the payoff function is approximated by the discrete training set directly. The derived equilibrium in this model is simply a Nash equilibrium over the discrete strategies in the training set. At first, the success of the quadratic model may be surprising, since the actual payoff function (3) is only piecewise differentiable and has a point of discontinuity. However, as we can see from Figure 2, it appears quite smooth and well approximated by a quadratic polynomial. The higher-degree polynomials apparently overfit the data, as indicated by their inferior learning performance displayed in this game.

The results of this game provide an optimistic view of how well regression might be expected to perform compared to discretization. This game is quite easy for learning since the underlying payoff function is well captured by our lower-degree model. Moreover, our experimental setup eliminated the issue of noisy payoff observations, by employing the actual expected payoffs for selected strategies.

5 Market-Based Scheduling Game

The second game we investigate presents a significantly more difficult learning challenge. It is a five-player symmetric game, with no analytic characterization, and no (theoretically) known solution. The game hinges on incomplete information, and training data is available only from a simulator that samples from the underlying distribution.

The game is based on a market-based scheduling scenario [Reeves et al., 2005], where agents bid in simultaneous auctions for time-indexed resources necessary to perform their given jobs. Agents have private information about their job lengths, and values for completing their jobs by various deadlines. Note that the full space of strategies is quite complex: it is dependent on multi-dimensional private information about preferences as well as price histories for all the time slots. As in the FPSB example, we transform this policy space to the real interval by constraining strategies to a parametrized form. In particular, we start from a simple myopic policy, straightforward bidding [Milgrom, 2000], and modify it by a scalar parameter (called "sunk awareness", and denoted by $k$) that controls the agent's tendency to stick with slots that it is currently winning. Although the details and motivation for sunk awareness are inessential to the current study, we note that $k \in [0,1]$, and that the optimal setting of $k$ involves tradeoffs, generally dependent on other agents' behavior.

To investigate learning for this game, we collected data for all strategy profiles over the discrete set of values $k \in \{0, 0.05, \ldots, 1\}$. Accounting for symmetry, this represents 53,130 distinct strategy profiles. For evaluation purposes, we treat the sample averages for each discrete profile as the true expected payoffs on this grid.

The previous empirical study of this game by Reeves et al. [2005] estimated the payoff function over a discrete grid of profiles assembled from the strategies $\{0.8, 0.85, 0.9, 0.95, 1\}$, computing an approximate Nash equilibrium using replicator dynamics. We therefore generated a training set based on the data for these strategies (300000 samples per profile), regressed to the quadratic forms, and calculated empirical $\epsilon$ values with respect to the entire data set by computing the maximum benefit from deviation within the data:

$$\epsilon_{emp} = \max_{i \in I} \max_{s_i \in S_i} \left[ u_i(s_i, \hat{s}_{-i}) - u_i(\hat{s}) \right],$$

where $S_i$ is the strategy set of player $i$ represented within the data set. Since the game is symmetric, the maximum over the players can be dropped, and all the agent strategy sets are identical.

From the results presented in Table 1, we see that the Nash equilibria for the learned functions are quite close to that produced by replicator dynamics, but with $\epsilon$ values quite a bit lower. (Since 0.876 is not a grid point, we determined its $\epsilon$ post hoc, by running further profile simulations with all agents playing 0.876, and where one agent deviates to any of the strategies in $\{0, 0.05, \ldots, 1\}$.)

Method                     Equilibrium              $\epsilon$
Separable quadratic        0.876                    0.0027
Non-separable quadratic    0.876                    0.0027
Replicator dynamics        (0, 0.94, 0.06, 0, 0)    0.0238

Table 1: Values of $\epsilon$ for the symmetric pure-strategy equilibria of games defined by different payoff function approximation methods. The quadratic models were trained on profiles confined to strategies in $\{0.8, 0.85, 0.9, 0.95, 1\}$.

In a more comprehensive trial, we collected 2.2 million additional samples per profile, and ran our learning algorithms on 100 training sets, each uniformly randomly selected from the discrete grid $\{0, 0.05, \ldots, 1\}$. Each training set included profiles generated from between five and ten of the twenty-one agent strategies on the grid. Since in this case the profile of interest does not typically appear in the complete data set, we developed a method for estimating $\epsilon$ for pure symmetric approximate equilibria in symmetric games based on a mixture of neighbor strategies that do appear in the test set. Let us designate the pure symmetric equilibrium strategy of the approximated game by $\hat{s}$. We first determine the closest neighbors to $\hat{s}$ in the symmetric strategy set $S$ represented within the data. Let these neighbors be denoted by $s'$ and $s''$. We define a mixed strategy $\alpha$ over support $\{s', s''\}$ as the probability of playing $s'$, computed based on the relative distance of $\hat{s}$ from its neighbors:

$$\alpha = 1 - \frac{|\hat{s} - s'|}{|s' - s''|}.$$

Note that symmetry allows a more compact representation of a payoff function if agents other than $i$ have a choice of only two strategies. Thus, we define $U(s_i, j)$ as the payoff to a (symmetric) player for playing strategy $s_i \in S$ when $j$ other agents play strategy $s'$. If $m - 1$ agents each independently choose whether to play $s'$ with probability $\alpha$, then the probability that exactly $j$ will choose $s'$ is given by

$$\Pr(\alpha, j) = \binom{m-1}{j} \alpha^j (1 - \alpha)^{m-1-j}.$$

We can thus approximate $\epsilon$ of the mixed strategy $\alpha$ by

$$\max_{s_i \in S} \sum_{j=0}^{m-1} \Pr(\alpha, j) \left( U(s_i, j) - \alpha U(s', j) - (1 - \alpha) U(s'', j) \right).$$

Using this method of estimating $\epsilon$ on the complete data set, we compared results from polynomial regression to the method which simply selects from the training set the pure strategy profile with the smallest value of $\epsilon$. We refer to this method as "sample best", differentiating between the case where we only consider symmetric pure profiles (labeled "sample best (symmetric)") and all pure profiles (labeled "sample best (all)"). It is interesting to observe in Figures 3 and 4 that when we restrict the search for a best pure strategy profile to symmetric profiles, we on average do better in terms of $\epsilon$ than when this restriction is not imposed.

Figure 3: Effectiveness of learning a separable quadratic model with different forms of $\phi(s_{-i})$. (Plot of average $\epsilon$, scale $\times 10^{-3}$, against the number of strategies in the training set; series: separable quadratic with $\phi$: sum, $\phi$: (sum, sum squares), $\phi$: identity; sample best (symmetric); sample best (all).)
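The neighbor-mixture estimate of $\epsilon$ described in Section 5 can be sketched in a few lines. In this sketch the payoff table `U` is a made-up quadratic that ignores how many opponents play $s'$; in the paper it would be read from the simulation data. The function names and the three-point strategy set are likewise assumptions for illustration.

```python
from math import comb


def mixture_weight(s_hat, s_lo, s_hi):
    """alpha: probability of playing s_lo, from the relative distance of s_hat."""
    return 1.0 - abs(s_hat - s_lo) / abs(s_lo - s_hi)


def prob_exactly(alpha, j, m):
    """Pr(alpha, j): exactly j of the m - 1 other agents play s_lo."""
    return comb(m - 1, j) * alpha ** j * (1.0 - alpha) ** (m - 1 - j)


def estimate_epsilon(U, strategies, s_lo, s_hi, alpha, m):
    """Largest expected gain from deviating to a pure strategy in `strategies`.

    U(s_i, j): payoff for playing s_i when j others play s_lo and the
    remaining m - 1 - j others play s_hi.
    """
    def gain(s_i):
        return sum(
            prob_exactly(alpha, j, m)
            * (U(s_i, j) - alpha * U(s_lo, j) - (1.0 - alpha) * U(s_hi, j))
            for j in range(m)
        )
    return max(gain(s_i) for s_i in strategies)


# Hypothetical payoff table: a quadratic peaked at 0.9, independent of j.
def toy_U(s_i, j):
    return -(s_i - 0.9) ** 2


m = 5
alpha = mixture_weight(0.876, 0.85, 0.9)
eps = estimate_epsilon(toy_U, [0.85, 0.9, 0.95], 0.85, 0.9, alpha, m)
```

Because the toy payoff does not depend on $j$, the binomial weights sum out and the estimate reduces to the gap between the best grid strategy and the mixture's own expected payoff; with real data the dependence on $j$ is what makes the binomial weighting matter.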

Figure 4: Effectiveness of learning a non-separable quadratic model with different forms of $\phi(s_{-i})$. (Plot of average $\epsilon$, scale $\times 10^{-3}$, against the number of strategies in the training set; series: non-separable quadratic with $\phi$: sum, $\phi$: (sum, sum squares), $\phi$: identity; sample best (symmetric); sample best (all).)

From Figure 3 we see that regression to a separable quadratic produces a considerably better approximate equilibrium when the size of the training set is relatively small. Figure 4 shows that the non-separable quadratic performs similarly. The results appear relatively insensitive to the degree of aggregation applied to the representation of other agents' strategies.

The polynomial regression methods we employed yield pure-strategy Nash equilibria. We further evaluated four methods that generally produce mixed-strategy equilibria: two local regression learning methods, SVM with a Gaussian radial basis kernel, and direct estimation using the training data. As discussed above, we computed mixed strategy equilibria by applying replicator dynamics to discrete approximations of the learned payoff functions. Since we ensure that the support of any mixed strategy equilibrium produced by these methods is in the complete data set, we can compute $\epsilon$ of the equilibria directly.

Footnote 2 (on the discrete approximations used with replicator dynamics): In the case of direct estimation from training data, the data itself was used as input to the replicator dynamics algorithm. For the other three methods we used a fixed ten-strategy grid as the discretized approximation of the learned game.

As we can see in Figure 5, the locally weighted average method appears to work better than the other three for most data sets that include between five and ten strategies. Additionally, locally weighted regression performs better than replicator dynamics on four of the six data set sizes we considered, and SVM consistently beats replicator dynamics for all six data set sizes.

Footnote 3 (on the scope of this comparison): Note that we do not compare these results to those for the polynomial regression methods. Given noise in the data set, mixed-strategy profiles with larger supports may exhibit lower $\epsilon$ simply due to the smoothing effect of the mixtures.

It is somewhat surprising to see how irregular our results appear for the local regression methods. We cannot explain this irregularity, although of course there is no reason for us to expect otherwise: even though increasing the size of the training data set may improve the quality of fit, improvement in quality of equilibrium approximation does not necessarily follow.

Figure 5: Effectiveness of learning local and SVM regression to estimate mixed strategy symmetric equilibria, with $\phi(s_{-i}) = (\phi_{sum}, \phi_{ss})$. (Plot of average $\epsilon$, scale $\times 10^{-3}$, against the number of strategies in the training set; series: replicator dynamics, locally weighted average, locally weighted regression, SVM.)

6 Conclusion

While there has been much work in game theory attempting to solve particular games defined by some payoff functions, little attention has been given to approximating such functions from data. This work addresses the question of payoff function approximation by introducing regression learning techniques and applying them to representative games of interest. Our results in both the FPSB and market-based scheduling games suggest that when data is sparse, such methods can provide better approximations of the underlying game (at least in terms of $\epsilon$-Nash equilibria) than discrete approximations using the same data set.

Regression or other generalization methods offer the potential to extend game-theoretic analysis to strategy spaces (even infinite sets) beyond directly available experience. By selecting target functions that support tractable equilibrium calculations, we render such analysis analytically convenient. By adopting functional forms that capture known structure of the payoff function (e.g., symmetry), we facilitate learnability. This study provides initial evidence that we can sometimes find models serving all these criteria.

In future work we expect to apply some of the methods developed here to other challenging domains.

References

[Atkeson et al., 1997] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.

[Brafman and Tennenholtz, 2003] Ronen I. Brafman and Moshe Tennenholtz. Learning to coordinate efficiently: A model-based approach. Journal of Artificial Intelligence Research, 19:11–23, 2003.

[Brafman and Tennenholtz, 2004] Ronen I. Brafman and Moshe Tennenholtz. Efficient learning equilibrium. Artificial Intelligence, 159:27–47, 2004.

[Fudenberg and Levine, 1998] D. Fudenberg and D. Levine. The Theory of Learning in Games. MIT Press, 1998.

[Joachims, 1999] Thorsten Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

[Kreps, 1990] David M. Kreps. Game Theory and Economic Modelling. Oxford University Press, 1990.

[Krishna, 2002] Vijay Krishna. Auction Theory. Academic Press, 2002.

[Milgrom, 2000] Paul Milgrom. Putting auction theory to work: The simultaneous ascending auction. Journal of Political Economy, 108:245–272, 2000.

[Nash, 1951] John Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.

[Reeves et al., 2005] Daniel M. Reeves, Michael P. Wellman, Jeffrey K. MacKie-Mason, and Anna Osepayshvili. Exploring bidding strategies for market-based scheduling. Decision Support Systems, 39:67–85, 2005.

[Shoham et al., 2003] Yoav Shoham, Rob Powers, and Trond Grenager. Multi-agent reinforcement learning: A critical survey. Technical report, Stanford University, 2003.

[Vapnik, 1995] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[Vickrey, 1961] William Vickrey. Counterspeculation, auctions, and competitive sealed tenders. Journal of Finance, 16:8–37, 1961.
