Presentation Transcript

Slide1

The Nonstochastic Multiarmed Bandit Problem

Seminar on Experts and Bandits, Fall 2017-2018

Barak Itkin

Slide2

Problem setup

K slot machines

No experts

Rewards bounded in [0, 1]

Can also be generalized to other bounds

Partial information: you only learn about the arm you pulled

“No assumptions at all” on the slot machines

Not even a fixed distribution

In fact, the rewards can be adversarial

 Slide3

Motivation

Think of packet routing: multiple routes exist to reach the destination

Rewards are the round-trip time

Will change depending on load in different parts of the network

Partial information: you only learn about the packet you sent

“No assumptions at all” on load behavior

Load can change dramatically over time

Slide4

Back to the problem

The arms’ reward assignment is determined in advance

I.e. before the first arm is pulled

Adversarial = the assignment can be picked after the strategy is already known

But still, before the game begins

We want to minimize the regret: “How much better could we have done?”

We’ll start with the “weak” regret: comparing to always pulling one arm (the “best” arm)

Slide5

Our Goal

We would like to show an algorithm that has the following bound on the “weak” regret:

is the return of the best arm

This should work for any setup of arms

Including random/adversarial setups

 Slide6
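The regret formulas on this slide did not survive the transcript. In the standard formulation this talk presumably follows (Auer, Cesa-Bianchi, Freund and Schapire), the weak regret and the target bound are roughly the following; the exact symbols are assumptions:

```latex
% Weak regret of an algorithm A over T trials, and the target bound
% (constant as in the paper's Corollary 3.2):
G_{\max}(T) = \max_{j}\sum_{t=1}^{T} x_j(t), \qquad
G_{\max}(T) - \mathbb{E}\!\left[G_A(T)\right]
  \;\le\; 2\sqrt{e-1}\,\sqrt{G_{\max}\, K \ln K}
  \;=\; O\!\left(\sqrt{G_{\max}\, K \ln K}\right)
```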

Difference from last week?

In the second half we saw:

experts, each loses a certain fraction of what they get

We can split between multiple experts

We always learn about all the experts

Regret is compared to the single best expert (“weak” regret)

Where’s the difference?

Solo investment – we choose only one expert

Partial information – we learn only about the chosen expert

Bounds:

Last week:

Ours:

 Slide7

Notations

– The time step. Also known as “trial”

The reward of arm

at trial

– an algorithm, choosing the arm

at trial

The input at time

is the arms chosen and their rewards till now

Formally -

 Slide8

Notations++

– The return of algorithm

at time horizon

Will be abbreviated as

where the

is obvious

– The best single-arm reward at time horizon

Will also be abbreviated as

 Slide9
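The symbols on the two notation slides were dropped. The notation presumably matches the paper; treat the symbols below as assumptions:

```latex
% t = 1, 2, ...   -- the trial (time step)
% K               -- the number of arms
% x_i(t) in [0,1] -- the reward of arm i at trial t
% i_t             -- the arm chosen by algorithm A at trial t
G_A(T) = \sum_{t=1}^{T} x_{i_t}(t), \qquad
G_{\max}(T) = \max_{j} \sum_{t=1}^{T} x_j(t)
```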

The naïve approach

Recalling our baseline

Slide10

Simplistic approach

Explore

Check each arm

times

Exploit

Continue pulling only the best arm

Profit?

Only with a “fixed distribution” (no significant changes over time)

May fail with arbitrary/adversarial rewards

 Slide11
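A minimal sketch of this explore-then-exploit baseline, just to make the two phases concrete; the arm interface (callables returning a reward in [0, 1]) and all names are illustrative, not from the talk:

```python
def explore_then_exploit(arms, pulls_per_arm, horizon):
    """Naive baseline: pull every arm a fixed number of times, then commit
    to the arm with the best empirical mean for the rest of the horizon."""
    means, total = [], 0.0
    for arm in arms:                                       # exploration phase
        rewards = [arm() for _ in range(pulls_per_arm)]
        total += sum(rewards)
        means.append(sum(rewards) / pulls_per_arm)
    best = max(range(len(arms)), key=lambda i: means[i])
    for _ in range(horizon - pulls_per_arm * len(arms)):   # exploitation phase
        total += arms[best]()
    return total
```

As the slide says, this can only work when the reward behavior does not shift after the exploration phase; an adversary can make the explored “best” arm worthless afterwards.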

Multiplicative Weights approach

Maintain weights

for arm

at trial

Start with uniform weights

Use the weights to define a distribution

Typically

Pick an arm by sampling the distribution

Update weights based on rewards of each action

When rewards are known, multiply by a function

of the reward:

We’ll discuss partial information setup today

 Slide12
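The distribution and update formulas were dropped from this slide; the generic multiplicative-weights scheme being described is presumably the standard one, sketched below. The learning-rate symbol η is my own, not from the talk:

```latex
% Sample from the weight-proportional distribution, then multiply each weight
% by a function f of its (fully observed) reward -- typically an exponential:
p_i(t) = \frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)}, \qquad
w_i(t+1) = w_i(t)\, f\!\big(x_i(t)\big), \quad \text{typically } f(x) = e^{\eta x}
```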

The Exp3 algorithm

First attempt

Slide13

Exp3 – Exploration

Start with uniform weights

No surprise here

Always encourage the algorithm to explore more arms

Add an “exploration factor” to the probability

Each arm will always have a probability of at least

The exploration factor is controlled by

Can be fine-tuned later

The exploration factor does not change over time

Rewards may be arbitrary or even adversarial

So, we must always continue exploring

 Slide14
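The probability formula with the exploration factor was dropped; in the standard Exp3 it is presumably the following, with exploration rate γ:

```latex
% Mix the weight-based distribution with uniform exploration:
p_i(t) = (1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\gamma}{K},
\qquad \text{so that } p_i(t) \ge \frac{\gamma}{K} \text{ for every arm } i
```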

Estimation rationale

Question (story time):

You randomly visit a shop on 20% of the days

Only on those days do you know its profit

However, you need to give a profit estimate for every day

What do you do?

Answer:

On days you visit, estimate

On other days, estimate

More generally:

if seen

otherwise

 Slide15
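The estimates in the shop story were dropped; the answer being described is presumably the importance-weighted one: divide what you observed by the probability of observing it:

```latex
% Shop story, with observation probability 0.2:
\widehat{\text{profit}}(d) =
\begin{cases}
\text{profit}(d)/0.2 & \text{if the shop was visited on day } d\\
0 & \text{otherwise}
\end{cases}
```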

Exp3 – Weight updates

To update the weights, use “estimated rewards” instead of the actual rewards

For actions that weren’t chosen, consider as if

This will create an unbiased estimator:

This equality holds for all actions – not just the one chosen

This helps with the weight updates on partial information

 Slide16
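The estimator and the unbiasedness computation were dropped; in the standard Exp3 they are presumably:

```latex
% Importance-weighted reward estimate (zero for arms that were not pulled):
\hat{x}_i(t) =
\begin{cases}
x_i(t)/p_i(t) & \text{if } i = i_t\\
0 & \text{otherwise}
\end{cases}
% Unbiasedness, conditioned on the past -- holds for every action i:
\mathbb{E}\big[\hat{x}_i(t) \,\big|\, \text{past}\big]
  = p_i(t)\cdot\frac{x_i(t)}{p_i(t)} + \big(1 - p_i(t)\big)\cdot 0 = x_i(t)
```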

Exp3

Initialize:

Set

for all actions

For each :

Sample

from

Receive reward

Update the weights with

for all but the selected action

Practically, we only update

 Slide17
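A short runnable sketch of the algorithm as just described, following the standard Exp3 formulation; the function name and the reward interface reward_fn(arm, t) are assumptions:

```python
import math
import random

def exp3(K, gamma, reward_fn, horizon):
    """Exp3 sketch: sample from the gamma-mixed distribution, observe only
    the pulled arm, and update only that arm's weight via x_hat = x / p."""
    w = [1.0] * K                                        # uniform initial weights
    total = 0.0
    for t in range(horizon):
        total_w = sum(w)
        p = [(1 - gamma) * wi / total_w + gamma / K for wi in w]
        arm = random.choices(range(K), weights=p)[0]     # sample an arm
        x = reward_fn(arm, t)                            # reward in [0, 1]
        total += x
        x_hat = x / p[arm]                               # estimated reward
        w[arm] *= math.exp(gamma * x_hat / K)            # unchosen arms: x_hat = 0, weight unchanged
    return total
```

Only the selected arm’s weight actually changes, which matches the “practically, we only update” remark above.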

Bounds

Since we have a probabilistic argument, we care about the

expected

“weak” regret:

Theorem:

No, this is not what we wanted (

)

We’ll fix that later

 Slide18
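The theorem’s formula was dropped; the statement being quoted is presumably the paper’s Theorem 3.1:

```latex
% For any K, any \gamma \in (0, 1], any reward assignment and any T > 0:
G_{\max} - \mathbb{E}\!\left[G_{\text{Exp3}}\right]
  \;\le\; (e-1)\,\gamma\, G_{\max} + \frac{K \ln K}{\gamma}
```

For a fixed γ this is linear in G_max, which is why it is not yet the bound we wanted.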

Bound intuition

is the exploration factor

As

, we stop exploiting our weights

Instead we prefer “randomly” looking around

Thus we come closer and closer to

As

, we exploit our weights

too much

We check the other arms for changes less often

Becomes worse when we have more arms (

)

We must find the “sweet spot”

 Slide19

The sweet spot

For some

, consider

This yields

Which is similar to the bound we wanted (

)

 Slide20
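The “sweet spot” choice and the resulting bound were dropped; presumably, as in the paper’s Corollary 3.2, for any upper bound g ≥ G_max:

```latex
\gamma = \min\!\left\{1,\; \sqrt{\frac{K \ln K}{(e-1)\,g}}\right\}
\quad\Longrightarrow\quad
G_{\max} - \mathbb{E}\!\left[G_{\text{Exp3}}\right] \;\le\; 2\sqrt{e-1}\,\sqrt{g\, K \ln K}
```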

Proof

Enough talking, let’s start the proof!

We assume

The bound trivially holds for

(as

)

Furthermore, we’ll want to rely on some facts

As outlined in the next slides

 Slide21

Fact #1

Irrelevant to our cause

We just want the numbering (starting at 2) to be consistent with the paper

Slide22

Fact #2

Proof:

Since

, we get:

Since

, we get:

 Slide23

Fact #3

Since

, the sum becomes:

We defined

for all unobserved actions

And so we remain only with

 Slide24

Fact #4

From the previous slide (fact #3) we get

Since

we get

As before, since unobserved actions cancel out, we can write:

 Slide25
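The facts themselves were dropped from these slides, so they cannot be matched to the numbering with certainty; the inequalities used at this point in the standard Exp3 analysis are presumably the following:

```latex
% (a) Since x_i(t) \le 1 and p_i(t) \ge \gamma/K:
\hat{x}_i(t) \;\le\; \frac{1}{p_i(t)} \;\le\; \frac{K}{\gamma}
% (b) Only the chosen arm i_t has a nonzero estimate:
\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t) \;=\; x_{i_t}(t) \;\le\; 1
% (c) Likewise for the squared estimates:
\sum_{i=1}^{K} p_i(t)\,\hat{x}_i(t)^2 \;=\; \hat{x}_{i_t}(t)\, x_{i_t}(t) \;\le\; \sum_{i=1}^{K} \hat{x}_i(t)
```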

Proof - outline

Denote

Establish a relation

between

and

Establish a relation

between

and

Obtain with

Establish a relation

between

and

Obtain by summing over

Compute a direct bound over

Apply bound to

 Slide26

Proof

We begin

with the definition of the weight sums:

Recall

the weight update definition

 Slide27

Proof

Recall

the

probability definition:

Some algebra and we get:

 Slide28

Proof

Note the following inequality

for

(No, it’s not a famous one. Yes it works, I tested it)

 Slide29
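The inequality itself was dropped; the one used at this step of the standard analysis is presumably:

```latex
e^{a} \;\le\; 1 + a + (e-2)\,a^{2} \qquad \text{for } a \le 1
```

It is presumably applied with a = (γ/K)·x̂_i(t), which is at most 1 by inequality (a) above.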

Proof

Opening the square brackets, we get (just ugly math)

 Slide30

Proof

Combining it all, we have

Using fact #3 (

) and

Using fact #4 (

)

 Slide31

Proof

Taking the

log

and using

(specifically

):

Summing over

from

to

we get:

 Slide32
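The summed inequality was dropped; with W_t denoting the sum of the weights at trial t, the standard analysis presumably arrives at:

```latex
% W_t = \sum_i w_i(t),  G_{Exp3} = \sum_{t=1}^{T} x_{i_t}(t)
\ln\frac{W_{T+1}}{W_1}
  \;\le\; \frac{\gamma/K}{1-\gamma}\, G_{\text{Exp3}}
  \;+\; \frac{(e-2)\,(\gamma/K)^2}{1-\gamma} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)
```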

Proof

We’ll get back to this equation in a second

 Slide33

Proof

Choosing some action

and since

, note that:

Recall that

and so we can

expand the log recursively

:

Recall

that

and

(uniform initialization

) and so:

 Slide34
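The lower-bound chain was dropped; for any fixed action j, expanding w_j(T+1) recursively and using the uniform initialization W_1 = K, it presumably reads:

```latex
\ln\frac{W_{T+1}}{W_1} \;\ge\; \ln\frac{w_j(T+1)}{W_1}
  \;=\; \frac{\gamma}{K} \sum_{t=1}^{T} \hat{x}_j(t) \;-\; \ln K
  \;=\; \frac{\gamma}{K}\,\hat{G}_j \;-\; \ln K
```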

Proof

Combining

and

we get:

 Slide35

Proof

Recall we saw that

.

Taking

expectation on

both sides

we get:

Since

for any

, and since we have a

minus sign

, we can do:

 Slide36

Proof

Note this was true for any action

, which means including the maximal action!

Q.E.D.(!)

 Slide37
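The concluding formulas were dropped; combining the two bounds on ln(W_{T+1}/W_1), the standard proof presumably finishes as follows:

```latex
% Rearranging, for every action j:
G_{\text{Exp3}} \;\ge\; (1-\gamma)\,\hat{G}_j
  - \frac{K \ln K}{\gamma}
  - (e-2)\,\frac{\gamma}{K} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)
% Taking expectations, E[\hat{G}_j] = G_j and
% E[\sum_t \sum_i \hat{x}_i(t)] = \sum_t \sum_i x_i(t) \le K\,G_{\max};
% picking j as the best action gives the theorem:
G_{\max} - \mathbb{E}\!\left[G_{\text{Exp3}}\right]
  \;\le\; (e-1)\,\gamma\, G_{\max} + \frac{K \ln K}{\gamma}
```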

Break(?)

Slide38

Is it enough?

Maybe?

If it was a clear “yes”, I wouldn’t have asked it :P

What was our problem?

Slide39

Room for improvement

We assumed an upper limit

is known

We then selected

This obtained the bound

If we don’t know

, we may over-shoot

For example, setting

This would yield a less tight result :/

 Slide40

New goal

Maintain the same bound over the weak regret

Don’t assume a known

In fact, do better!

Maintain this bound uniformly throughout the execution*

* A formal definition will follow

 Slide41

Notations# (better than ++++)

– The return of action

till (exclusive) time horizon

– The

estimated return

of action

till

time

horizon

– The

maximal return

of

any single action till

time horizon

 Slide42
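The definitions on this slide were dropped; presumably, with the cut-off exclusive as stated:

```latex
G_j(t) = \sum_{s=1}^{t-1} x_j(s), \qquad
\hat{G}_j(t) = \sum_{s=1}^{t-1} \hat{x}_j(s), \qquad
\hat{G}_{\max}(t) = \max_{j} \hat{G}_j(t)
```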

Re-stating our goal

At

every time-step

, the weak regret till that point should maintain the bound

This bound will hold all the way!

This also gives a hint on how to do this:

Previously we guessed

Instead, let’s maintain

Update

and

as

grows

Effectively – we are searching for the right

!

 Slide43

Exp3.1 (creative names!)

Initialize:

Set

For each “epoch” do:

(Re-)Initialize Exp3

with:

While

do:

Do one step with Exp3

(Update our tracking of

for all actions)

 Slide44
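A sketch of the epoch scheme just described, assuming the standard Exp3.1 parameters g_r = (K ln K)/(e-1) · 4^r and the matching γ_r. All names, the reward interface, the explicit horizon, and K ≥ 2 are assumptions:

```python
import math
import random

def exp3_step(w, gamma, reward_fn, t, G_hat):
    """One Exp3 round that also tracks the estimated returns G_hat."""
    K = len(w)
    total_w = sum(w)
    p = [(1 - gamma) * wi / total_w + gamma / K for wi in w]
    arm = random.choices(range(K), weights=p)[0]
    x = reward_fn(arm, t)
    x_hat = x / p[arm]
    G_hat[arm] += x_hat                       # other arms' estimates stay unchanged (add 0)
    w[arm] *= math.exp(gamma * x_hat / K)
    return x

def exp3_1(K, reward_fn, horizon):
    """Each epoch r guesses g_r as a bound on G_max, tunes gamma_r accordingly,
    restarts Exp3, and moves on once the estimated returns outgrow the guess."""
    G_hat = [0.0] * K
    total, t, r = 0.0, 0, 0
    while t < horizon:
        g_r = (K * math.log(K) / (math.e - 1)) * 4 ** r
        gamma_r = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g_r)))
        w = [1.0] * K                         # (re-)initialize Exp3
        while t < horizon and max(G_hat) <= g_r - K / gamma_r:
            total += exp3_step(w, gamma_r, reward_fn, t, G_hat)
            t += 1
        r += 1                                # an epoch may even be empty
    return total
```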

Bounds (Theorem 4.1)

Theorem 4.1:

Proof outline:

Lemma 4.2: Bound the weak regret per “epoch”

By giving a lower bound on the actual reward

Lemma

4.3: Bound the number of epochs for a finite horizon

Combine both to obtain the final bound

 Slide45
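The statement of Theorem 4.1 was dropped; up to the paper’s exact constants, the bound it gives on the weak regret of Exp3.1 has the form below, and per the re-stated goal it is maintained uniformly over time without knowing G_max in advance:

```latex
G_{\max} - \mathbb{E}\!\left[G_{\text{Exp3.1}}\right]
  \;=\; O\!\left(\sqrt{G_{\max}\, K \ln K} \;+\; K \ln K \;+\; K\right)
```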

Epoch notations

Let

be the overall number of steps

Needed only for the proof

We don’t need to know this in advance

Let be the overall number of epochs

Same here

Let and

be the first and last steps (

) of epoch

,

An epoch may be empty

– in that case

 Slide46

Lemma 4.2

For any action

and for every epoch

Holds trivially for empty epochs

We’ll prove for

 Slide47

Proof

Some (read: many) slides back, we proved the following:

Translating to our current notations, this becomes

 Slide48

Proof

Merging the estimated reward also with all previous epochs:

From

the

termination condition, we know that

for any action

, at time

, in epoch

:

From fact #2, we know that

Together we get that for any action

 Slide49

Proof

Recall that

Substitute and

Q.E.D

 Slide50

Lemma 4.3

Denote

So that

Then the number of epochs

satisfies the inequality

Holds trivially for

We’ll prove for

 Slide51

Proof

Regardless of the lemma, we know that:

is the termination step, so we know the epoch continuation condition was violated

 Slide52

Proof

So, we’ve shown that

is

a variable

,

as we are trying to bound

This function is a parabola of

The minimum is obtained at

(Reminder: Extremum of

is at

It increases monotonically for

 Slide53

Proof

Now, suppose the claim is false

Reversing the original inequality, we get

and the expression in purple are both larger than

This is in the increasing part of

:

 Slide54

Proof

Assuming the claim is false, we got

But, before we proved that

Q.E.D.

 Slide55

Theorem 4.1

Our target was:

Let’s begin

 Slide56

Proof

Lemma 4.2 says that for any action

Then this

should also hold for the maximal action

 Slide57

Proof

 Slide58

Proof

Using lemma 4.3 we obtain

 Slide59

Finishing the proof

Using Jensen’s inequality (where the variable is

), we complete the proof and take the expectation

Full details in the paper

We don’t have time at this point anyhow :P

 Slide60

Questions?