Regret Minimization in Bounded Memory Games


Presentation Transcript

Regret Minimization in Bounded Memory Games
Jeremiah Blocki, Nicolas Christin, Anupam Datta, Arunesh Sinha
Work in progress

Motivating Example
Employee Actions: {Behave, Violate}

Audit Process Example

Organization's payoff matrix:
                Behave   Violate
Ignore             0       -5
Investigate       -1       -1

Game proceeds in rounds:

Round   Employee   Organization   Expert        Outcome        Reward
1       Behave     Ignore         Investigate   No Violation      0
2       Violate    Ignore         Investigate   Missed V         -5
3       Violate    Investigate    Investigate   Detected V       -1
…       …          …              …             …                 …

Reward Organization: -6 (over 3 rounds)
Reward of Best Expert (hindsight): -3
Regret: (-3) - (-6) = 3
Average Regret: 1
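As a quick check of the arithmetic above, here is a small Python sketch that recomputes the regret for this three-round example; it treats an expert as a single repeated action (the names and payoffs come from the slide, everything else is illustrative).

```python
# Organization's payoff matrix: reward[organization_action][employee_action]
reward = {
    "Ignore":      {"Behave": 0,  "Violate": -5},
    "Investigate": {"Behave": -1, "Violate": -1},
}

# Three rounds of play from the slide: (employee action, organization action)
play = [("Behave", "Ignore"), ("Violate", "Ignore"), ("Violate", "Investigate")]

org_reward = sum(reward[org][emp] for emp, org in play)        # 0 + (-5) + (-1) = -6

# Best fixed expert in hindsight ("always Investigate" earns -1 per round)
expert_reward = max(
    sum(reward[expert][emp] for emp, _ in play)
    for expert in ("Ignore", "Investigate")
)                                                              # -3

regret = expert_reward - org_reward
print(org_reward, expert_reward, regret, regret / len(play))   # -6 -3 3 1.0
```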

Talk Outline
- Motivation
- Bounded Memory Game Model
- Defining Regret
- Regret Minimization in Bounded Memory Games
  - Feasibility
  - Complexity

Elements of Game Model
Two Players: Adversary (Employee) and Defender (Organization)
Actions:
- Adversary Actions: {Violate, Behave}
- Defender Actions: {Investigate, Ignore}
Repeated Interactions:
- Each interaction has an outcome
- History of game play is a sequence of outcomes
Imperfect Information:
- The organization doesn't always observe the actions of the employee
Could be formalized as a repeated game

Additional Elements of Game Model
Move to a richer game model.
History-dependent Rewards:
- Save money by ignoring
- Reputation possibly damaged if we Ignore and the employee did Violate
- Reputation of the organization depends both on its history and on the current outcome
History-dependent Actions:
- Players' behavior may depend on history
- Defender's behavior may depend on the complete history

Adversary Models
Adversary behavior depends only on the history he remembers:
- Fully Oblivious Adversary – only remembers the round number
- k-Adaptive Adversary – remembers the round number, but history is reset every k turns
- Adaptive Adversary – remembers the complete game history
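To make the three adversary classes concrete, here is a minimal Python sketch of the information each one is allowed to condition on. The class and method names are illustrative, not taken from the paper.

```python
from typing import Callable, List

class ObliviousAdversary:
    """Move may depend only on the round number (a pre-committed sequence)."""
    def __init__(self, moves: List[str]):
        self.moves = moves
    def act(self, round_no: int, history: List[str]) -> str:
        return self.moves[round_no]              # history is ignored

class KAdaptiveAdversary:
    """Move may depend on history, but the visible history is reset
    at the start of every window of k rounds."""
    def __init__(self, k: int, policy: Callable[[int, List[str]], str]):
        self.k, self.policy = k, policy
    def act(self, round_no: int, history: List[str]) -> str:
        window = history[round_no - (round_no % self.k): round_no]
        return self.policy(round_no, window)

class AdaptiveAdversary:
    """Move may depend on the complete game history."""
    def __init__(self, policy: Callable[[int, List[str]], str]):
        self.policy = policy
    def act(self, round_no: int, history: List[str]) -> str:
        return self.policy(round_no, history)
```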

Goal
- Define the game model
- Define notions of regret for the defender in the game model with respect to different adversary models
- Study the complexity of the regret minimization problem in this game model

Prior Work: Regret Minimization for Game Models
- Regret minimization is well studied in repeated games, including imperfect information (bandit model) [AK04, McMahanB04, K05, FKM05, DH06]
- The defender compares his performance to the performance of the best expert in hindsight.
- Traditional view: for the sake of comparison we assume that the adversary was oblivious

Talk Outline
- Motivation
- Bounded Memory Game Model
- Defining Regret
- Regret Minimization in Bounded Memory Games

Two Player Stochastic Games
- Two Players: Defender and Adversary
- Transitions between states can be probabilistic, and may depend on the actions (a, b) taken by Defender and Adversary.
- r(a, b, s) – payoff when actions (a, b) are played at state s

Stochastic Games
- Capture the dependence of rewards on history
- A fixed strategy is a function f mapping S (states) to actions A
- Experts: the set of fixed strategies
- Capture the dependence of the defender's actions on history
Recall: the additional elements of the game model in the motivating example

Notation
- Number of states in the game: n
- Number of actions: |A|
- Number of experts: N = |A|^n
- Total rounds of play: T

Theorem 1: Regret Minimization is Impossible for Stochastic Games
- Oblivious Strategy i: play b_2 for i turns, then play b_1
  Optimal Defender Strategy: play a_1 every round
- Oblivious Strategy: play b_2 every turn
  Optimal Defender Strategy: play a_2 every round

Our Game Model
- Definition of bounded memory games, a subclass of stochastic games
- Memory-m games
- States record the last m outcomes in the history of the game play

Bounded Memory Game: States & Actions
- Four states – record the last outcome (memory-1)
- Defender Actions: {UP, DOWN}
- Adversary Actions: {LEFT, RIGHT}

Bounded Memory Game: Outcomes
Four Outcomes: (Up, Left), (Up, Right), (Down, Left), (Down, Right)
The outcome depends only on the current actions of the defender and adversary. It is independent of the current state.

Bounded Memory Game: States
Four States: (Top, Left), (Top, Right), (Bottom, Left), (Bottom, Right)

Bounded Memory Game: Example Game Play

Round   State            Defender   Adversary   Outcome
1       Top, Left        Down       Left        (Bottom, Left)
2       Bottom, Left     Down       Right       (Bottom, Right)
3       Bottom, Right    …          …           …
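The memory-1 state update in this example is simple enough to transcribe directly: the next state records the most recent outcome, with the defender's action picking the Top/Bottom component and the adversary's action picking the Left/Right component. A short Python sketch that reproduces the two completed rounds above:

```python
def next_state(defender: str, adversary: str) -> tuple:
    """Memory-1 update: the new state is just the last outcome."""
    row = "Top" if defender == "UP" else "Bottom"
    col = "Left" if adversary == "LEFT" else "Right"
    return (row, col)

state = ("Top", "Left")                                  # starting state in Round 1
for defender, adversary in [("DOWN", "LEFT"), ("DOWN", "RIGHT")]:
    state = next_state(defender, adversary)
    print(state)                                         # ('Bottom', 'Left'), then ('Bottom', 'Right')
```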

Bounded Memory Games: Rewards
The defender sees reward 1 if:
- the Adversary plays LEFT from a green state, or
- the Adversary plays RIGHT from a blue state

Traditional Regret in Bounded Memory Games
Adversary strategy:
- plays RIGHT from a green state
- plays LEFT from a blue state
The defender will never see a reward! In hindsight, it may look like the fixed strategy "always play UP" would have received a reward.
What view of regret makes sense?

Traditional Regret in Bounded Memory Games

Actual game play:

Round   State           Defender   Adversary   Reward
One     Top, Left       UP         Right       0
Two     Top, Right      Down       Left        0
Three   Bottom, Left    Down       Left        0
Four    Bottom, Left    UP         Left        0
Five    Top, Left       …          …           …

Traditional Regret in Bounded Memory Games

Hypothetical play of the fixed strategy "always play UP" against the same adversary moves as in the actual game:

Round   State          Defender   Adversary   Reward
One     Top, Left      UP         Right       0
Two     Top, Right     UP         Left        0
Three   Top, Left      UP         Left        1
Four    Top, Left      UP         Left        1

The defender (Player A) will never see a reward! In hindsight, it looks like the fixed strategy "always play UP" would have received reward 2.
What are other ways to compare our performance?

Comparing Performance with the Experts
- Actual Game: Defender vs. the real Adversary
- Hypothetical Game (fixed strategy f): f vs. a hypothetical Adversary
- Compare the performance of the defender in the actual game to the performance of f in the hypothetical game

Regret Models

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious
k-Adaptive                    –
Adaptive                      –           –

(The "–" cells are combinations that are not considered; the remaining cells are filled in by the results later in the talk.)

Hypothetical Oblivious Adversary – hard code the moves played by the actual adversary.
Hypothetical k-Adaptive Adversary – hard code the state of the actual adversary after each window of k moves.

Regret in Bounded Memory Games (Revisited)

Actual game play:

Round   State           Defender   Adversary   Reward
One     Top, Left       Up         Right       0
Two     Top, Right      Down       Left        0
Three   Bottom, Left    Down       Left        0
Four    Top, Left       Down       Left        0
Five    Bottom, Right   Up         Right       0

Hypothetical play of the expert "always play UP" against a hypothetical k-Adaptive adversary (k = 5):

Round   State        Defender   Adversary   Reward
One     Top, Left    Up         Right       0
Two     Top, Right   Up         Left        0
Three   Top, Left    Up         Right       0
Four    Top, Right   Up         Left        0
Five    Top, Left    Up         Right       0

In hindsight, our hypothetical adversary model is k-Adaptive (k = 5). What is our regret now?
Actual Performance: 0
Performance of the Expert: 0
Regret: 0

Measuring Regret in Hindsight
View 1: the hypothetical adversary is oblivious
- The adversary would have played the exact same moves played in the actual game.
- Traditional view of regret: repeated games [Blackwell56, Hannan57, LW89], etc.
- Impossible for bounded memory games (see the example above)
View 2: the hypothetical adversary is k-Adaptive
- The adversary would have used the exact same strategy during each window of k moves.
View 3: the hypothetical adversary is fully adaptive
- The hypothetical adversary is the real adversary.

Regret Minimization Algorithms
Regret Minimization Algorithm: Average Regret → 0 as T → ∞
Examples for repeated games:
- Weighted Majority Algorithm [LW89]: Average Regret O(((log N)/T)^(1/2))
- Bandit setting [ACFS02]: Average Regret O(((N log N)/T)^(1/2))
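For reference, here is a minimal multiplicative-weights sketch in Python of the kind of full-information regret minimizer cited above (weighted majority over N experts). The learning rate eta = sqrt(ln(N)/T) is the textbook choice behind the O(((log N)/T)^(1/2)) bound; this is an illustrative sketch for repeated games, not the construction used later for bounded memory games.

```python
import math
import random

def multiplicative_weights(T: int, N: int, reward_of_expert) -> float:
    """Full-information regret minimization over N experts.

    reward_of_expert(t, i) -> reward in [0, 1] of expert i in round t,
    revealed for every expert after the round (full information).
    Returns the learner's total reward over T rounds.
    """
    eta = math.sqrt(math.log(N) / T)                    # standard learning rate
    weights = [1.0] * N
    total = 0.0
    for t in range(T):
        # Sample an expert with probability proportional to its weight.
        i = random.choices(range(N), weights=weights)[0]
        rewards = [reward_of_expert(t, j) for j in range(N)]
        total += rewards[i]
        # Multiplicative update: experts that did well gain weight.
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return total
```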

Regret Minimization in Repeated Games
Easy consequence of Theorem 2.

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious                     ✓           ✓            ✓
k-Adaptive                    –           ✓            ✓
Adaptive                      –           –            X

Regret Minimization in Stochastic Games
Theorem 1: No regret minimization algorithm exists for the general class of stochastic games.

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious                     X           X            X
k-Adaptive                    –           X            X
Adaptive                      –           –            X

Theorem 1: Regret Minimization is Impossible for Stochastic Games
- Oblivious Strategy i: play b_2 for i turns, then play b_1
  Optimal Defender Strategy: play a_1 every round
- Oblivious Strategy: play b_2 every turn
  Optimal Defender Strategy: play a_2 every round

Regret Minimization in Bounded Memory Games
Theorem 2 (time permitting):

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious                     ✓           X            X
k-Adaptive                    –           ✓            ✓
Adaptive                      –           –            X

Regret Minimization in Bounded Memory Games
Theorem 3: Unless RP = NP there is no efficient regret minimization algorithm for the general class of bounded memory games.

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious                    Hard          X            X
k-Adaptive                    –          Hard         Hard
Adaptive                      –           –             X

Theorem 3
Unless RP = NP there is no efficient regret minimization algorithm for bounded memory games, even against an oblivious adversary.
- Reduction from MAX 3-SAT(7/8 + ε) [Hastad01]
- Similar to the reduction in [EKM05] for MDPs

Theorem 3: Setup
- Defender Actions A: {0,1} × {0,1}
- m = O(log n)
- States: two states for each variable:
  S_0 = {s_1, …, s_n}, S_1 = {s'_1, …, s'_n}
- Intuition: a fixed strategy corresponds to a variable assignment

Theorem 3: Overview
- The adversary picks a clause uniformly at random for the next n rounds
- The defender can earn reward 1 by satisfying this unknown clause during those n rounds
- The game will "remember" whether a reward has already been given, so that the defender cannot earn a reward multiple times during the n rounds

Theorem 3: State Transitions
Adversary Actions B: {0,1} × {0,1,2,3}, with b = (b_1, b_2)
g(a, b) = b_1
f(a, b) = S_1  if a_2 = 1 or b_2 = a_1   (reward already given)
          S_0  otherwise                  (no reward given)

Theorem 3: Rewards
b = (b_1, b_2)
r(a, b, s) =  1  if s ∈ S_0 and a_1 = b_2
             -5  if s ∈ S_1 and f(a, b) = S_0 and b_2 ≠ 3
              0  otherwise
No reward whenever the adversary plays b_2 = 2.
No reward whenever s ∈ S_1.
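Transcribing the transition and reward rules into Python makes them easier to follow. Only the S_0/S_1 component of the state matters for the reward, so the sketch below tracks just that label; reading the reward condition as a_1 = b_2 is an assumption, chosen because it is the reading consistent with the transition rule above.

```python
def transition(a, b) -> str:
    """f(a, b): move to S1 once a reward has been given (or forfeited)."""
    a1, a2 = a
    b1, b2 = b
    return "S1" if (a2 == 1 or b2 == a1) else "S0"

def reward(a, b, s) -> int:
    """r(a, b, s): the defender's reward in the reduction."""
    a1, a2 = a
    b1, b2 = b
    if s == "S0" and a1 == b2:
        return 1        # defender's bit matches the literal encoded by b2
    if s == "S1" and transition(a, b) == "S0" and b2 != 3:
        return -5       # punished for leaving S1 before the block ends
    return 0
```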

Theorem 3: Oblivious Adversary
(d_1, …, d_n) – binary de Bruijn sequence of order n
1. Pick a clause C uniformly at random
2. For i = 1, …, n: play b = (d_i, b_2), where
   b_2 = 1  if x_i ∈ C
         0  if ¬x_i ∈ C
         3  if i = n
         2  otherwise
3. Repeat Step 1
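A Python sketch of this oblivious adversary, following the slide. The de Bruijn sequence d and the clause representation are taken as given (here a clause is a set of signed variable indices, e.g. {3, -7} for x_3 ∨ ¬x_7), and the priority among the cases follows the order listed above; these encoding details are assumptions for illustration.

```python
import random

def oblivious_adversary_block(d, clauses, n):
    """Yield the adversary's moves b = (b1, b2) for one block of n rounds,
    for a clause C drawn uniformly at random."""
    C = random.choice(clauses)                 # Step 1: pick a clause at random
    for i in range(1, n + 1):                  # Step 2: one round per variable x_i
        if i in C:
            b2 = 1                             # x_i appears positively in C
        elif -i in C:
            b2 = 0                             # NOT x_i appears in C
        elif i == n:
            b2 = 3                             # last round of the block
        else:
            b2 = 2                             # x_i does not appear in C
        yield (d[i - 1], b2)                   # b1 is taken from the de Bruijn sequence
```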

Analysis
- The defender can never be rewarded from s ∈ S_1
- Getting a reward forces a transition to s ∈ S_1
- The defender is punished for leaving S_1, unless the adversary plays b_2 = 3 (i.e., when i = n)
Recall:
f(a, b) = S_1 if a_2 = 1 or b_2 = a_1; S_0 otherwise
r(a, b, s) = 1 if s ∈ S_0 and a_1 = b_2; -5 if s ∈ S_1 and f(a, b) = S_0 and b_2 ≠ 3; 0 otherwise

Theorem 3: Analysis
- φ – an assignment satisfying a ρ fraction of the clauses
- f_φ – average score ρ/n
- Claim: no strategy (fixed or adaptive) can obtain an average expected score better than ρ*/n
- Regret Minimization Algorithm: run until the expected average regret is < ε/n; then the expected average score is > (ρ* − ε)/n
- Hence an efficient regret minimization algorithm would give a randomized approximation for MAX 3-SAT beyond the 7/8 threshold of [Hastad01], implying RP = NP.

Open Questions
- How hard is regret minimization against an oblivious adversary when A = {0,1} and m = O(log n)?
- How hard is regret minimization against an oblivious adversary in the complete information model?

Questions?

Regret Minimization in Bounded Memory Games
- We can reduce our Bounded Memory Game to a Repeated Game
- One round of the Repeated Game => m*k*t rounds of the Bounded Memory Game
- Actions of Player A: {f : f is a fixed strategy}
- Action f: "use the fixed strategy f for the next m*k*t rounds"

Regret Minimization in Bounded Memory Games
The reward Player A sees in the Repeated Game for playing f is the actual reward earned by using the fixed strategy f in the Bounded Memory Game for m*k rounds.

Regret Minimization in Bounded Memory Games
- R_i – the actual reward observed when we played f_i, forwarded to the Repeated Game
- Imperfect Information: we only observe the reward for the action f_i
- R_j – the (hidden) reward we would have observed playing f_j in this stage

f_1   f_2   …   f_i   …   f_N
 ?     ?    …   R_i   …    ?
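One standard way to drive a regret minimizer with this kind of bandit feedback is importance weighting, as in the Exp3 family [ACFS02]: the single observed reward is divided by the probability of having chosen that expert, giving an unbiased estimate of the hidden reward vector. A hypothetical single-step sketch (not the paper's construction):

```python
import math

def exp3_step(weights, gamma, observed_reward, chosen):
    """One Exp3-style update with rewards in [0, 1]: only the chosen expert's
    reward is observed, so it is importance-weighted before the update."""
    N = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / N for w in weights]
    estimate = observed_reward / probs[chosen]     # unbiased estimate of the hidden entry
    weights[chosen] *= math.exp(gamma * estimate / N)
    return weights, probs
```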

Modeling Loss

Stage i-1 | Stage i (m*k rounds)
View 1:  … O_{-1} O_0  | O_1  … O_m  …
View 2:  … O_{-1} O_0  | O'_1 … O'_m …
View 3:  … O'_{-1} O'_0 | O'_1 … O'_m …

View 1 – actual game play
View 2 – substitute in strategy f beginning in stage i
View 3 – substitute in strategy f from the beginning
The previous outcomes O'_{-1}, O'_0 may be different!

Modeling Loss

Stage i-1 | Stage i (m*k*t rounds)
View 1:  … O_{-1} O_0  | O_1  … O_m  …
View 2:  … O_{-1} O_0  | O'_1 … O'_m …
View 3:  … O'_{-1} O'_0 | O'_1 … O'_m …

How do we know that O'_1 from View 2 and O'_1 from View 3 must be equal?
- We assume that the adversary is k-Adaptive!
- After m rounds in Stage i, View 1 and View 2 must converge to the same state.

Modeling Loss

Stage i-1 | Stage i (m*k*t rounds)
View 1:  … O_{-1} O_0  | O_1  … O_m  …
View 2:  … O_{-1} O_0  | O'_1 … O'_m …
View 3:  … O'_{-1} O'_0 | O'_1 … O'_m …

Let r_{j,2} denote the reward seen during round j in View 2.
Average Modeling Loss:

Average Regret T’ = T/(kt)Pick t such that kt = T1/4 50

Traditional Regret

Payoff matrix (Player A's reward; rows: Player A, columns: Player B):
            Rock   Paper   Scissors
Rock          0     -1        1
Paper         1      0       -1
Scissors     -1      1        0

Game play:
Player B:   Rock    Scissors   Rock   Paper
Player A:   Paper   Scissors   Rock   Rock
"Expert":   Rock    Rock       Rock   Rock

Reward Player A: 0
Reward Expert: 1
Total Regret: 1
Average Regret: 1/4

Bounded Memory Games with δ-discounting Rewards
Intuition: rewards are most highly influenced by recent outcomes.
Let s_m = (o_1, …, o_m)
R(a, b, s_m) = Σ_i δ^i f(a, b, (o_m, …, o_{m-i}))
with f(·) ∈ [0, 1] and δ ∈ [0, 1)
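A direct Python transcription of the discounted reward above; the per-outcome function f and the outcome encoding are placeholders, and the sum is taken over i = 0, …, m-1 with i = 0 corresponding to the most recent outcome o_m (an assumption, since the slide does not spell out the index range).

```python
def discounted_reward(a, b, s_m, f, delta):
    """R(a, b, s_m) = sum_i delta^i * f(a, b, (o_m, ..., o_{m-i})),
    where s_m = (o_1, ..., o_m) is the current state (the last m outcomes),
    f maps into [0, 1] and 0 <= delta < 1."""
    m = len(s_m)
    total = 0.0
    for i in range(m):
        recent = tuple(reversed(s_m[m - 1 - i:]))    # (o_m, ..., o_{m-i})
        total += (delta ** i) * f(a, b, recent)
    return total
```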

Bounded Memory Games with δ-discounting Rewards: Results
Idea: the experts can be partitioned into a few groups such that experts in the same group perform "the same".

                          Actual Adversary
Hypothetical Adversary    Oblivious   k-Adaptive   Adaptive
Oblivious                     P           X            X
k-Adaptive                    –           P            P
Adaptive                      –           –            X