
Slide 1
Monte-Carlo Tree Search
Alan Fern

Slide 2
Introduction
Rollout does not guarantee optimality or near-optimality; it only guarantees policy improvement (under certain conditions).
Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies, with computation that does NOT depend on the number of states? This was an open theoretical question until the late 90's.
Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?

Slide 3
Look-Ahead Trees
Rollout can be viewed as performing one level of search on top of the base policy.
In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action. Can we generalize this to general stochastic MDPs?
Sparse Sampling is one such algorithm, with strong theoretical guarantees of near-optimality.
[Figure: a root state with candidate actions a1, a2, ..., ak. Maybe we should search for multiple levels.]

Slide 4
Online Planning with Look-Ahead Trees
At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action, then select the action with the highest Q-value estimate.

s = current state
Repeat
    T = BuildLookAheadTree(s)    ;; sparse sampling or UCT
                                 ;; tree provides Q-value estimates for root actions
    a = BestRootAction(T)        ;; action with best Q-value
    Execute action a in the environment
    s = the resulting state
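
To make this loop concrete, here is a minimal Python sketch of the same online planning loop. The environment interface (current_state, done, step) and the planner functions build_lookahead_tree and best_root_action are assumed placeholders standing in for the sparse sampling or UCT routines described later, not anything defined on the slides.

    def online_planning_loop(env, build_lookahead_tree, best_root_action, depth):
        """Re-plan from scratch at every real-world step; nothing is reused across steps."""
        s = env.current_state()
        while not env.done():
            tree = build_lookahead_tree(s, depth)   # sparse sampling or UCT
            a = best_root_action(tree)              # root action with best Q-value estimate
            s = env.step(a)                         # execute a in the real environment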

Slide 5
Planning with Look-Ahead Trees
[Figure: the real-world state/action sequence s, a1, a2, ... runs along the bottom; at each encountered state a look-ahead tree is built, branching on actions a1 and a2 into sampled next states s11, ..., s1w and s21, ..., s2w, with rewards R(s11,a1), R(s11,a2), ..., R(s2w,a2) at the leaves.]

Slide 6
Expectimax Tree (depth 1)
Alternate max & expectation.
After taking each action there is a distribution over next states (nature's moves). The value of an action depends on the immediate reward and the weighted value of those next states.
[Figure: the root s is a max node (max over actions); beneath each action a and b is an expectation node (a weighted average over states), whose children s1, s2, ..., sn-1, sn are weighted by their probability of occurring after taking that action in s.]

Slide 7
Expectimax Tree (depth 2)
Alternate max & expectation.
Max nodes: value equals the max of the values of the child expectation nodes.
Expectation nodes: value is the weighted average of the values of its children.
Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
[Figure: a depth-2 tree rooted at max node s with actions a and b; each expectation node (weighted average over states) branches into next states, which are again max nodes over actions a and b.]

Slide 8
Exact Expectimax Tree
Alternate max & expectation.
In general we can grow the tree to any horizon H; the root then computes V*(s,H) and each root action's value is Q*(s,a,H).
But the size of the exact tree depends on the size of the state space. Bad!
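
For reference, the alternating max/expectation computation corresponds to the finite-horizon optimality recursion; this is a standard restatement rather than something written on the slide, using the discount factor β that appears in the Sparse Sampling pseudocode on Slide 10 and letting (s', r) denote the next state and reward after taking a in s:

    V^*(s, 0) = 0, \qquad
    Q^*(s, a, h) = \mathbb{E}\big[\, r + \beta\, V^*(s', h-1) \mid s, a \,\big], \qquad
    V^*(s, h) = \max_a Q^*(s, a, h)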

Slide 9
Sparse Sampling Tree
Replace each expectation with an average over w samples, where the sampling width w will typically be much smaller than n (the number of possible next states). The root still estimates V*(s,H) and Q*(s,a,H).
The resulting tree has (kw)^H leaves, where k is the number of actions.

Slide 10
Sparse Sampling [Kearns et al. 2002]
The Sparse Sampling algorithm computes the root value via depth-first expansion. It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.

SparseSampleTree(s, h, w)
    If h == 0 Then Return [0, null]
    For each action a in s
        Q*(s,a,h) = 0
        For i = 1 to w
            Simulate taking a in s, resulting in state si and reward ri
            [V*(si,h-1), a*] = SparseSampleTree(si, h-1, w)
            Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
        Q*(s,a,h) = Q*(s,a,h) / w      ;; estimate of Q*(s,a,h)
    V*(s,h) = max_a Q*(s,a,h)          ;; estimate of V*(s,h)
    a* = argmax_a Q*(s,a,h)
    Return [V*(s,h), a*]
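
A runnable Python sketch of the pseudocode above. The generative model simulate(s, a) -> (next_state, reward) and the function actions(s) listing available actions are assumed interfaces, not defined on the slides.

    def sparse_sample_tree(s, h, w, actions, simulate, beta):
        """Depth-first sparse sampling; returns (value estimate of s at horizon h, best action)."""
        if h == 0:
            return 0.0, None
        best_value, best_action = float('-inf'), None
        for a in actions(s):
            total = 0.0
            for _ in range(w):
                s_i, r_i = simulate(s, a)                 # sample one transition for (s, a)
                v_i, _ = sparse_sample_tree(s_i, h - 1, w, actions, simulate, beta)
                total += r_i + beta * v_i
            q = total / w                                 # estimate of Q*(s, a, h)
            if q > best_value:
                best_value, best_action = q, a            # running argmax over actions
        return best_value, best_action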

Slide 11
SparseSample(h=2, w=2)
Alternate max & averaging.
Max nodes: value equals the max of the values of the child average nodes.
Average nodes: value is the average of the values of the w sampled children.
Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
[Figure: the root s with action a; recursive SparseSample(h=1, w=2) calls are about to be expanded below it.]

Slide 12
SparseSample(h=2, w=2), continued
[Figure: the first SparseSample(h=1, w=2) call under action a returns the value 10.]

Slide 13
SparseSample(h=2, w=2), continued
[Figure: the second SparseSample(h=1, w=2) call under action a returns the value 0.]

Slide 14
SparseSample(h=2, w=2), continued
[Figure: averaging the two samples gives action a the value 5.]

Slide 15
SparseSample(h=2, w=2), continued
[Figure: the SparseSample(h=1, w=2) calls under action b return 4 and 2, so b's value is 3.]
Select action a, since its value is larger than b's.
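
Working through the example's arithmetic (assuming, as the figures suggest, zero intermediate rewards and no discounting in this illustration):

    Q^*(s, a, 2) = \tfrac{10 + 0}{2} = 5, \qquad
    Q^*(s, b, 2) = \tfrac{4 + 2}{2} = 3, \qquad
    V^*(s, 2) = \max(5, 3) = 5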

Slide 16
Sparse Sampling (Cont'd)
For a given desired accuracy, how large should the sampling width and depth be? Answered by Kearns, Mansour, and Ng (1999).
Good news: the analysis gives values for w and H that achieve a policy arbitrarily close to optimal, and those values are independent of the state-space size. This was the first near-optimal general MDP planning algorithm whose runtime did not depend on the size of the state space.
Bad news: the theoretical values are typically still intractably large, and exponential in H. Exponential in H is the best we can do in general.
In practice: use a small H and use a heuristic at the leaves.

Slide 17
Sparse Sampling (Cont'd)
In practice: use a small H and evaluate the leaves with a heuristic. For example, if we have a policy, leaves could be evaluated by estimating the policy's value, i.e. the average reward across simulated runs of the policy.
[Figure: a depth-limited tree over actions a1 and a2 whose leaves are evaluated by policy simulations.]
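
A small Python sketch of this leaf-evaluation idea. Here policy(s) picks an action, simulate(s, a) -> (next_state, reward) is the assumed generative model, and the rollout horizon is fixed; all of these are illustrative interfaces rather than anything specified on the slide.

    def evaluate_leaf(s, policy, simulate, horizon, num_runs):
        """Heuristic leaf value: average total reward over num_runs rollouts of the policy."""
        total = 0.0
        for _ in range(num_runs):
            state = s
            for _ in range(horizon):
                state, reward = simulate(state, policy(state))
                total += reward
        return total / num_runs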

Slide 18
Uniform vs. Adaptive Tree Search
Sparse sampling wastes time on bad parts of the tree: it devotes equal resources to each state encountered in the tree.
We would like to focus on the most promising parts of the tree. But how do we control exploration of new parts of the tree vs. exploitation of the promising parts?
We need an adaptive bandit algorithm that explores more effectively.

Slide 19
What now? Adaptive Monte-Carlo Tree Search
UCB Bandit Algorithm
UCT Monte-Carlo Tree Search

Slide 20
Bandits: Cumulative Regret Objective
[Figure: a single state s with k arms/actions a1, a2, ..., ak.]
Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step).
Optimal (in expectation) is to pull the optimal arm n times.
UniformBandit is a poor choice: it wastes time on bad arms. We must balance exploring machines to find good payoffs and exploiting current knowledge.

Slide 21
Cumulative Regret Objective
Theoretical results are often about the "expected cumulative regret" of an arm-pulling strategy.
Protocol: at time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables).
Expected cumulative regret E[Reg_n]: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n.

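
Written out explicitly, with μ* denoting the expected reward of the best arm (a symbol introduced here, not on the slide), the expected cumulative regret after n pulls is:

    \mathbb{E}[\mathrm{Reg}_n] \;=\; n\,\mu^* \;-\; \mathbb{E}\Big[\sum_{t=1}^{n} r_t\Big]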
Slide 22
UCB Algorithm for Minimizing Cumulative Regret
Q(a): average reward for trying action a (in our single state s) so far.
n(a): number of pulls of arm a so far.
Action choice by UCB after n pulls: pull the arm maximizing Q(a) + sqrt(2 ln(n) / n(a)) (the UCB1 rule of the reference below).
Assumes rewards are in [0,1]; we can always normalize if we know the max value.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235-256.
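
A minimal Python sketch of this selection rule, assuming the statistics are kept in dictionaries q (average reward per arm) and counts (pulls per arm); the names and data layout are placeholders.

    import math

    def ucb_select(q, counts, n):
        """UCB1 arm choice: value term plus exploration bonus; untried arms go first."""
        def score(a):
            if counts[a] == 0:
                return float('inf')                       # pull every arm at least once
            return q[a] + math.sqrt(2 * math.log(n) / counts[a])
        return max(q, key=score)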

Slide 23
UCB: Bounded Sub-Optimality
Value term: favors actions that have looked good historically.
Exploration term: actions get an exploration bonus that grows with ln(n).
The expected number of pulls of a sub-optimal arm a is bounded by roughly (8 ln n) / Δ_a^2 plus a constant [Auer et al., 2002], where Δ_a is the sub-optimality gap of arm a.
UCB doesn't waste much time on sub-optimal arms, unlike uniform sampling!

Slide 24
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n).
Is this good? Yes: the average per-step regret is O((log n)/n), which goes to zero as n grows.
Theorem: No algorithm can achieve a better expected regret (up to constant factors).

Slide 25
What now? Adaptive Monte-Carlo Tree Search
UCB Bandit Algorithm
UCT Monte-Carlo Tree Search

Slide 26
UCT Algorithm [Kocsis & Szepesvari, 2006]
UCT is an instance of Monte-Carlo Tree Search that applies the principle of UCB.
It has similar theoretical properties to sparse sampling, but much better anytime behavior.
It is famous for yielding a major advance in computer Go, and has a growing number of success stories.

Slide 27
Monte-Carlo Tree Search
Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy".
During construction each tree node s stores: the state-visitation count n(s), the action counts n(s,a), and the action values Q(s,a).
Repeat until time is up:
    Execute the rollout policy starting from the root until the horizon (this generates a state-action-reward trajectory).
    Add the first node not in the current tree to the tree.
    Update the statistics of each tree node s on the trajectory: increment n(s) and n(s,a) for the selected action a, and update Q(s,a) by the total reward observed after the node.
What is the rollout policy?
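
A compact Python sketch of this construction, under the same assumed generative-model interface as the sparse sampling sketch (simulate(s, a) -> (next_state, reward), actions(s)); states are assumed hashable, and the tree and default policies are passed in as functions. This is one reasonable reading of the slide, not a reference implementation.

    from collections import defaultdict

    class MCTSNode:
        """Per-node statistics: exactly the three quantities listed on this slide."""
        def __init__(self):
            self.n = 0                          # n(s): state-visitation count
            self.n_a = defaultdict(int)         # n(s,a): action counts
            self.q = defaultdict(float)         # Q(s,a): average return after taking a in s

    def mcts(root_state, actions, simulate, horizon, iterations, tree_policy, default_policy):
        tree = {root_state: MCTSNode()}
        for _ in range(iterations):
            s, trajectory, added = root_state, [], False
            for _ in range(horizon):
                acts = actions(s)
                if not acts:                                # terminal state
                    break
                if s in tree:
                    a = tree_policy(tree[s], acts)          # in-tree: bandit-style choice
                elif not added:
                    tree[s] = MCTSNode()                    # add first off-tree node only
                    added = True
                    a = tree_policy(tree[s], acts)
                else:
                    a = default_policy(s, acts)             # off-tree: default rollout
                s2, r = simulate(s, a)
                trajectory.append((s, a, r))
                s = s2
            total = 0.0                                     # back up reward from the tail
            for s, a, r in reversed(trajectory):
                total += r
                if s in tree:
                    node = tree[s]
                    node.n += 1
                    node.n_a[a] += 1
                    node.q[a] += (total - node.q[a]) / node.n_a[a]   # running average
        root = tree[root_state]
        return max(root.q, key=root.q.get)                  # best root action so far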

Slide 28
Rollout Policies
Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy. Rollout policies have two distinct phases:
Tree policy: selects actions at nodes already in the tree (each action must be selected at least once).
Default policy: selects actions after leaving the tree.
Key idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction, rather than uniformly exploring actions at each node.

Slide 29
Example MCTS
For illustration purposes we will assume the MDP is deterministic and that the only non-zero rewards are at terminal/leaf nodes. The algorithm is well-defined without these assumptions.

Slide 30
Iteration 1
Initially the tree is a single leaf (the current world state). At a leaf node the tree policy selects a random action, then executes the default policy. Assume all non-zero reward occurs at terminal nodes.
[Figure: the default-policy rollout reaches a terminal state with reward 1; a new tree node is added and the value 1 is recorded along the path.]

Slide 31
Iteration 2
Each action at a node must be selected at least once.
[Figure: the other root action is tried; the default-policy rollout reaches a terminal state with reward 0, and a new tree node with value 0 is added alongside the existing node with value 1.]

Slide 32
Iteration 3
Each action at a node must be selected at least once.
[Figure: the tree now shows value 1/2 at the root and 1 and 0 at its children.]

Slide 33
Iteration 3 (continued)
Once all of a node's actions have been tried once, select the action according to the tree policy.
[Figure: the tree policy chooses which root action to descend.]

Slide 34
Iteration 3 (continued)
[Figure: the tree policy descends to a leaf node, which selects a random action and then runs the default policy; the rollout returns reward 0 and a new tree node with value 0 is added.]

Slide 35
Iteration 4
Once all of a node's actions have been tried once, select the action according to the tree policy.
[Figure: the backed-up values are now 1/3 and 1/2 along the favored branch, with 0s elsewhere.]

Slide 36
Iteration 4 (continued)
[Figure: the tree policy again descends the branch whose values are 1, 1/2, 1/3; the other nodes show 0.]

Slide 37
[Figure: after further iterations the values along the favored branch are 2/3 and 2/3, with 0, 0 and 1 elsewhere.]
What is an appropriate tree policy? What about the default policy?

Slide 38
UCT Algorithm [Kocsis & Szepesvari, 2006]
Basic UCT uses a random default policy; in practice a hand-coded or learned policy is often used.
The tree policy is based on UCB: at node s, select the action maximizing Q(s,a) + K * sqrt(ln n(s) / n(s,a)).
Q(s,a): average reward received in trajectories so far after taking action a in state s.
n(s,a): number of times action a has been taken in s.
n(s): number of times state s has been encountered.
K: a theoretical constant that, in practice, must be selected empirically.

Slide 39
Once all of a node's actions have been tried once, select the action according to the tree policy.
[Figure: the current tree, with root actions a1 and a2; the backed-up values are 1/3 at the root, 1/2 and 0 at its children, and 1 and 0 deeper down.]

Slide 40
[Figure: the tree policy descends the higher-valued branch.]

Slide 41
[Figure: the rollout from the selected leaf returns reward 0.]

Slide 42
[Figure: the 0 is backed up along the path: the root value drops to 1/4, the branch value to 1/3, and the newly expanded node shows 0/1.]

Slide 43
[Figure: on the next iteration the rollout returns reward 1.]

Slide 44
[Figure: the 1 is backed up: the root value becomes 2/5 and the previously 0-valued child becomes 1/2.]

Slide 45
UCT Recap
To select an action at a state s, build a tree using N iterations of Monte-Carlo tree search: the default policy is uniform random and the tree policy is based on the UCB rule.
Then select the action that maximizes Q(s,a) at the root (note that this final action selection does not take the exploration term into account, just the Q-value estimate).
The more simulations, the more accurate the result.
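
Tying the earlier sketches together, a hypothetical helper matching this recap: run N iterations of MCTS with the UCB tree policy and a uniform-random default policy, then return the root action with the highest Q-value estimate (no exploration term in the final choice).

    import random
    from functools import partial

    def uct_choose_action(s, actions, simulate, horizon, N, K=1.0):
        """One action choice at state s via N UCT iterations (uses mcts and uct_tree_policy above)."""
        return mcts(s, actions, simulate, horizon, N,
                    tree_policy=partial(uct_tree_policy, K=K),
                    default_policy=lambda state, acts: random.choice(acts))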

Slide 46
Some Successes
Computer Go; Klondike Solitaire (wins 40% of games); the General Game Playing Competition; real-time strategy games; combinatorial optimization; crowdsourcing. The list is growing; these applications usually extend UCT in some way.

Slide 47
Some Improvements
Use domain knowledge to handcraft a more intelligent default policy than random, e.g. don't choose obviously stupid actions (in Go a hand-coded default policy is used).
Learn a heuristic function to evaluate positions, and use it to initialize leaf nodes (which are otherwise initialized to zero).

Slide 48
Practical Issues: Selecting K
There is no fixed rule; experiment with different values in your domain.
Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect.
The best value of K may depend on the number of iterations N, but usually a single value of K works well once N is large enough.

Slide 49
Practical Issues
UCT can have trouble building deep trees when actions can result in a large number of possible next states.
Each time we try action a we get a new state (a new leaf) and very rarely resample an existing leaf.
[Figure: a root s whose actions a and b fan out into many distinct sampled next states.]

Slide 50
Practical Issues
Again: each time we try an action we get a new state (a new leaf) and very rarely resample an existing leaf.
[Figure: the fan-out under the root grows with every sample.]

Slide 51
Practical Issues
This degenerates into a depth-1 tree search.
Solution: use a "sparseness parameter" that controls how many new children of an action will be generated. Set the sparseness to something like 5 or 10.
[Figure: the root s with actions a and b, each limited to a bounded set of sampled children.]
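
One way to realize this sparseness parameter, sketched in Python: cap the number of distinct sampled successors stored per (state, action) pair, and resample from the stored children once the cap is reached. This is an illustration of the idea rather than the exact mechanism the slides have in mind; children is an assumed dictionary maintained by the caller.

    import random

    def sample_child(children, s, a, simulate, sparseness=5):
        """Return a successor of (s, a), generating at most `sparseness` distinct children."""
        kids = children.setdefault((s, a), [])
        if len(kids) < sparseness:
            child = simulate(s, a)               # still below the cap: sample a new child
            kids.append(child)
            return child
        return random.choice(kids)               # cap reached: reuse an existing child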