Slide1
The Nonstochastic Multiarmed Bandit Problem
Seminar on Experts and Bandits, Fall 2017-2018
Barak Itkin
Slide2
Problem setup
$K$ slot machines ("arms")
No experts
Rewards bounded in $[0, 1]$
Can also be generalized to other bounds
Partial information
You only learn about the arm you pulled
"No assumptions at all" on the slot machines
Not even a fixed distribution
In fact, rewards can be adversarial
Slide3
Motivation
Think of packet routing
Multiple routes exist to reach the destination
Rewards are based on the round-trip time
Will change depending on load in different parts of the network
Partial information
You only learn about the packet you sent
"No assumptions at all" on load behavior
Load can change dramatically over time
Slide4
Back to the problem
The reward assignment (for all arms and trials) is determined in advance
I.e. before the first arm is pulled
Adversarial = the assignment can be picked after the strategy is already known
But still, before the game begins
We want to minimize the regret
"How much better could we have done?"
We'll start with the "weak" regret
Comparing to always pulling one arm (the "best" arm)
Slide5
Our Goal
We would like to show an algorithm $A$ that has the following bound on the "weak" regret:
$G_{\max} - \mathbb{E}[G_A] = O\left(\sqrt{G_{\max} \cdot K \ln K}\right)$
where $G_{\max}$ is the return of the best arm and $K$ is the number of arms
This should work for any setup of arms
Including random/adversarial setups
Slide6
Difference from last week?
In the second half we saw:
$N$ experts, each loses a certain fraction of what they get
We can split our investment between multiple experts
We always learn about all the experts
Regret compared to the single best expert ("weak" regret)
Where's the difference?
Solo investment – we choose only one expert
Partial information – we learn only about the chosen expert
Bounds:
Last week (full information): $O(\sqrt{T \ln N})$
Ours: $O(\sqrt{G_{\max} K \ln K}) \le O(\sqrt{T K \ln K})$ – roughly a $\sqrt{K}$ factor, the price of partial information
Slide7
Notations
$t$ – The time step. Also known as "trial"
$x_i(t)$ – The reward of arm $i$ at trial $t$
$A$ – an algorithm, choosing the arm $i_t$ at trial $t$
The input at time $t$ is the arms chosen and their rewards till now
Formally – $i_t = A\left(i_1, x_{i_1}(1), \ldots, i_{t-1}, x_{i_{t-1}}(t-1)\right)$
Slide8
Notations++
$G_A(T) = \sum_{t=1}^{T} x_{i_t}(t)$ – The return of algorithm $A$ at time horizon $T$
Will be abbreviated as $G_A$ where the $T$ is obvious
$G_{\max}(T) = \max_j \sum_{t=1}^{T} x_j(t)$ – The best single-arm reward at time horizon $T$
Will also be abbreviated as $G_{\max}$
Slide9
The naïve approach
Recalling our baseline
Slide10
Simplistic approach
Explore
Check each arm a fixed number of times
Exploit
Continue pulling only the best arm
Profit?
Only with a "fixed distribution" (no significant changes over time)
May fail with arbitrary/adversarial rewards
A minimal sketch of this baseline follows below
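As an illustration (my addition, not from the deck), here is a minimal sketch of this explore-then-exploit baseline; the reward oracle `pull(arm)` returning values in [0, 1] and the exploration count `m` are assumptions of the example:

```python
import numpy as np

def explore_then_exploit(pull, K, m, T):
    """Naive baseline: try each of the K arms m times, then commit
    to the empirically best arm for the remaining trials.
    `pull(arm)` is assumed to return a reward in [0, 1]."""
    totals = np.zeros(K)
    # Exploration phase: m pulls per arm
    for arm in range(K):
        for _ in range(m):
            totals[arm] += pull(arm)
    best = int(np.argmax(totals))  # empirically best arm so far
    # Exploitation phase: commit to that arm forever
    reward = totals.sum()
    for _ in range(T - K * m):
        reward += pull(best)
    return reward
```

Against adversarial rewards this fails exactly as the slide warns: the adversary can make one arm look best during exploration and then starve it for the rest of the game.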
Slide11
Multiplicative Weights approach
Maintain a weight $w_i(t)$ for arm $i$ at trial $t$
Start with uniform weights
Use the weights to define a distribution
Typically $p_i(t) = \frac{w_i(t)}{\sum_j w_j(t)}$
Pick an arm by sampling the distribution
Update weights based on rewards of each action
When rewards are known, multiply by a function of the reward: $w_i(t+1) = w_i(t) \cdot f(x_i(t))$
We'll discuss the partial-information setup today
A generic sketch of this template follows below
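A short sketch of the full-information template (my illustration, not from the deck; the choice $f(x) = e^{\eta x}$, the step size `eta`, and the `(T, K)` rewards array are assumptions of the example):

```python
import numpy as np

def multiplicative_weights(rewards, eta=0.1):
    """Generic full-information template: after each trial we see the
    rewards of ALL arms and multiply every weight by exp(eta * reward).
    `rewards` is a (T, K) array with entries in [0, 1]."""
    T, K = rewards.shape
    rng = np.random.default_rng()
    w = np.ones(K)                      # uniform initial weights
    choices = []
    for t in range(T):
        p = w / w.sum()                 # distribution from the weights
        choices.append(rng.choice(K, p=p))  # sample an arm
        w *= np.exp(eta * rewards[t])   # update ALL weights (full info)
    return choices
```

Contrast this with the bandit setting below, where only the pulled arm's reward is observed, so only one weight can be updated per trial.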
Slide12
The Exp3 algorithm
First attempt
Slide13
Exp3 – Exploration
Start with uniform weights
No surprise here
Always encourage the algorithm to explore more arms
Add an "exploration factor" to the probability
Each arm will always have a probability of at least $\frac{\gamma}{K}$
The exploration factor is controlled by $\gamma \in (0, 1]$
Can be fine-tuned later
The exploration factor does not change over time
Rewards may be arbitrary or even adversarial
So, we must always continue exploring
Slide14
Estimation rationale
Question (story time):
You randomly visit a shop on 20% of the days
Only on those days you know their profit
However, you need to give a profit estimate for every day
What do you do?
Answer:
On days you visit, estimate $\frac{\text{profit}}{0.2}$ (i.e. 5x the observed profit)
On other days, estimate $0$
More generally:
$\hat{x} = \frac{x}{p}$ if seen
$\hat{x} = 0$ otherwise
A quick check of this estimator follows below
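To see why this works, here is the one-line expectation check for the shop story (my addition, not an original slide; $x$ denotes the day's true profit):

$$\mathbb{E}[\hat{x}] = \underbrace{0.2 \cdot \frac{x}{0.2}}_{\text{visit}} + \underbrace{0.8 \cdot 0}_{\text{no visit}} = x$$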
Slide15
Exp3 – Weight updates
To update the weights, use the "estimated rewards" $\hat{x}_i(t)$ instead of the actual rewards
For actions that weren't chosen, consider as if $\hat{x}_i(t) = 0$
This will create an unbiased estimator:
$\mathbb{E}[\hat{x}_i(t) \mid i_1, \ldots, i_{t-1}] = x_i(t)$
This equality holds for all actions – not just the one chosen
This helps with the weight updates under partial information
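The same computation as in the shop story gives the general claim (a sketch using the definitions above; the conditioning is on the history that determines $p_i(t)$):

$$\mathbb{E}\left[\hat{x}_i(t) \mid i_1, \ldots, i_{t-1}\right] = p_i(t) \cdot \frac{x_i(t)}{p_i(t)} + \left(1 - p_i(t)\right) \cdot 0 = x_i(t)$$

Note that this holds for every arm $i$, chosen or not.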
Slide16
Exp3
Initialize:
Set $w_i(1) = 1$ for all actions $i$
For each $t = 1, 2, \ldots$:
Sample $i_t$ from $p(t)$, where $p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_j w_j(t)} + \frac{\gamma}{K}$
Receive reward $x_{i_t}(t)$
Update the weights with $w_i(t+1) = w_i(t) \cdot \exp\left(\frac{\gamma \hat{x}_i(t)}{K}\right)$
Since $\hat{x}_i(t) = 0$ for all but the selected action, practically we only update $w_{i_t}$
A runnable sketch of this loop follows below
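A minimal Python sketch of the loop above (my illustration, not from the deck; the reward oracle `pull(arm, t)` returning values in [0, 1] is an assumption of the example):

```python
import numpy as np

def exp3(pull, K, T, gamma):
    """Exp3: exponential-weight algorithm for exploration and exploitation.
    Follows the update rule from the slides: w_i <- w_i * exp(gamma * xhat_i / K)."""
    rng = np.random.default_rng()
    w = np.ones(K)                        # uniform initial weights
    total_reward = 0.0
    for t in range(1, T + 1):
        # Mix the weight distribution with uniform exploration
        p = (1 - gamma) * w / w.sum() + gamma / K
        i = rng.choice(K, p=p)            # sample an arm
        x = pull(i, t)                    # observe only this arm's reward
        total_reward += x
        xhat = x / p[i]                   # importance-weighted estimate
        w[i] *= np.exp(gamma * xhat / K)  # only the pulled arm's weight changes
    return total_reward
```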
Slide17
Bounds
Since the algorithm is probabilistic, we care about the expected "weak" regret: $G_{\max} - \mathbb{E}[G_{\text{Exp3}}]$
Theorem: $G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\gamma G_{\max} + \frac{K \ln K}{\gamma}$
No, this is not what we wanted ($O(\sqrt{G_{\max} K \ln K})$)
We'll fix that later
Slide18
Bound intuition
$\gamma$ is the exploration factor
As $\gamma \to 1$, we stop exploiting our weights
Instead we prefer "randomly" looking around
Thus the first term, $(e-1)\gamma G_{\max}$, grows and we come closer and closer to playing uniformly at random
As $\gamma \to 0$, we exploit our weights too much
We check for changes in the other arms less often, and the second term, $\frac{K \ln K}{\gamma}$, blows up
Becomes worse when we have more arms (larger $K$)
We must find the "sweet spot"
Slide19
The sweet spot
For some $g \ge G_{\max}$, consider $\gamma = \min\left\{1, \sqrt{\frac{K \ln K}{(e-1) g}}\right\}$
This yields $G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le 2\sqrt{e-1} \cdot \sqrt{g K \ln K}$
Which is similar to the bound we wanted ($O(\sqrt{g K \ln K})$)
A short derivation of this choice follows below
Slide20
Proof
Enough talking, let's start the proof!
We assume $\gamma < 1$
The bound trivially holds for $\gamma = 1$ (as then $(e-1)\gamma G_{\max} \ge G_{\max} \ge G_{\max} - \mathbb{E}[G_{\text{Exp3}}]$)
Furthermore, we'll want to rely on some facts
As outlined in the next slides
Slide21
Fact #1
Irrelevant to our cause
We just want the numbering (starting at 2) to be consistent with the paper
Slide22
Fact #2
$\hat{x}_i(t) \le \frac{1}{p_i(t)} \le \frac{K}{\gamma}$
Proof:
Since $x_i(t) \le 1$, we get: $\hat{x}_i(t) = \frac{x_i(t)}{p_i(t)} \le \frac{1}{p_i(t)}$
Since $p_i(t) \ge \frac{\gamma}{K}$, we get: $\frac{1}{p_i(t)} \le \frac{K}{\gamma}$
Slide23
Fact #3
$\sum_{i=1}^{K} p_i(t) \hat{x}_i(t) = x_{i_t}(t) \le 1$
Proof:
We defined $\hat{x}_i(t) = 0$ for all unobserved actions
Since $\hat{x}_{i_t}(t) = \frac{x_{i_t}(t)}{p_{i_t}(t)}$, the sum becomes: $p_{i_t}(t) \cdot \frac{x_{i_t}(t)}{p_{i_t}(t)}$
And so we remain only with $x_{i_t}(t)$
Slide24
Fact #4
$\sum_{i=1}^{K} p_i(t) \hat{x}_i(t)^2 \le \sum_{i=1}^{K} \hat{x}_i(t)$
Proof:
From the previous slide (fact #3) we get $\sum_{i=1}^{K} p_i(t) \hat{x}_i(t)^2 = x_{i_t}(t) \cdot \hat{x}_{i_t}(t)$
Since $x_{i_t}(t) \le 1$ we get $x_{i_t}(t) \cdot \hat{x}_{i_t}(t) \le \hat{x}_{i_t}(t)$
As before, since unobserved actions cancel out we can do: $\hat{x}_{i_t}(t) = \sum_{i=1}^{K} \hat{x}_i(t)$
Slide25
Proof - outline
Denote $W_t = \sum_{i=1}^{K} w_i(t)$
Establish a relation between $W_{t+1}$ and $W_t$
Establish a relation between $\ln\frac{W_{t+1}}{W_t}$ and the rewards at trial $t$
Obtain a relation between $\ln\frac{W_{T+1}}{W_1}$ and $G_{\text{Exp3}}$ by summing over $t$
Compute a direct lower bound on $\ln\frac{W_{T+1}}{W_1}$ via $\hat{G}_j$ for any fixed action $j$
Apply the bound to the best action
Slide26
Proof
We begin with the definition of the weight sums: $W_{t+1} = \sum_{i=1}^{K} w_i(t+1)$
Recall the weight update definition: $w_i(t+1) = w_i(t) \cdot \exp\left(\frac{\gamma \hat{x}_i(t)}{K}\right)$
And so: $W_{t+1} = \sum_{i=1}^{K} w_i(t) \cdot \exp\left(\frac{\gamma \hat{x}_i(t)}{K}\right)$
Slide27
Proof
Recall the probability definition: $p_i(t) = (1-\gamma)\frac{w_i(t)}{W_t} + \frac{\gamma}{K}$
Some algebra and we get: $\frac{w_i(t)}{W_t} = \frac{p_i(t) - \gamma/K}{1-\gamma}$, and so
$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma} \exp\left(\frac{\gamma \hat{x}_i(t)}{K}\right)$
Slide28
Proof
Note the following inequality for $a \le 1$:
$e^a \le 1 + a + (e-2) a^2$
(No, it's not a famous one. Yes it works, I tested it)
By fact #2, $\frac{\gamma}{K}\hat{x}_i(t) \le 1$, so we may apply it to each exponent
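In the spirit of "I tested it", a quick numeric check of the inequality on a grid (my addition, illustrative only):

```python
import numpy as np

# Check e^a <= 1 + a + (e-2)*a^2 for a <= 1 on a dense grid
a = np.linspace(-10, 1, 1_000_001)
lhs = np.exp(a)
rhs = 1 + a + (np.e - 2) * a**2
assert np.all(lhs <= rhs + 1e-12), "inequality violated somewhere"
print("slack at a=1 (equality point, should be ~0):", rhs[-1] - lhs[-1])
```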
Slide29
Proof
Opening the square brackets, we get (just ugly math):
$\frac{W_{t+1}}{W_t} \le \sum_{i=1}^{K} \frac{p_i(t) - \gamma/K}{1-\gamma} \left[1 + \frac{\gamma}{K}\hat{x}_i(t) + (e-2)\left(\frac{\gamma}{K}\right)^2 \hat{x}_i(t)^2\right] \le 1 + \frac{\gamma/K}{1-\gamma} \sum_{i=1}^{K} p_i(t)\hat{x}_i(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma} \sum_{i=1}^{K} p_i(t)\hat{x}_i(t)^2$
Slide30
Proof
Combining it all, we have
$\frac{W_{t+1}}{W_t} \le 1 + \frac{\gamma/K}{1-\gamma} x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma} \sum_{i=1}^{K} \hat{x}_i(t)$
Using fact #3 ($\sum_i p_i(t)\hat{x}_i(t) = x_{i_t}(t)$) and using fact #4 ($\sum_i p_i(t)\hat{x}_i(t)^2 \le \sum_i \hat{x}_i(t)$)
Slide31
Proof
Taking the log and using $1 + x \le e^x$ (specifically $\ln(1+x) \le x$):
$\ln\frac{W_{t+1}}{W_t} \le \frac{\gamma/K}{1-\gamma} x_{i_t}(t) + \frac{(e-2)(\gamma/K)^2}{1-\gamma} \sum_{i=1}^{K} \hat{x}_i(t)$
Summing over $t$ from $1$ to $T$ we get (the left side telescopes):
$\ln\frac{W_{T+1}}{W_1} \le \frac{\gamma/K}{1-\gamma} G_{\text{Exp3}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$
Slide32
Proof
We’ll get back to this equation in a second
Slide33
Proof
Choosing some action $j$, and since $w_j(T+1) \le W_{T+1}$, note that:
$\ln\frac{W_{T+1}}{W_1} \ge \ln\frac{w_j(T+1)}{W_1}$
Recall that $w_j(t+1) = w_j(t) \cdot \exp\left(\frac{\gamma \hat{x}_j(t)}{K}\right)$ and so we can expand the log recursively:
$\ln w_j(T+1) = \ln w_j(1) + \frac{\gamma}{K} \sum_{t=1}^{T} \hat{x}_j(t)$
Recall that $w_j(1) = 1$ and $W_1 = K$ (uniform initialization) and so:
$\ln\frac{W_{T+1}}{W_1} \ge \frac{\gamma}{K} \hat{G}_j - \ln K$, where $\hat{G}_j = \sum_{t=1}^{T} \hat{x}_j(t)$
Slide34
Proof
Combining the upper and lower bounds on $\ln\frac{W_{T+1}}{W_1}$, we get:
$\frac{\gamma}{K}\hat{G}_j - \ln K \le \frac{\gamma/K}{1-\gamma} G_{\text{Exp3}} + \frac{(e-2)(\gamma/K)^2}{1-\gamma} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$
Rearranging:
$G_{\text{Exp3}} \ge (1-\gamma)\hat{G}_j - \frac{K \ln K}{\gamma} - (e-2)\frac{\gamma}{K} \sum_{t=1}^{T}\sum_{i=1}^{K} \hat{x}_i(t)$
Slide35
Proof
Recall we saw that $\mathbb{E}[\hat{x}_i(t)] = x_i(t)$, and hence $\mathbb{E}[\hat{G}_j] = G_j$
Taking expectation on both sides we get:
$\mathbb{E}[G_{\text{Exp3}}] \ge (1-\gamma) G_j - \frac{K \ln K}{\gamma} - (e-2)\frac{\gamma}{K} \sum_{t=1}^{T}\sum_{i=1}^{K} x_i(t)$
Since $\sum_{t=1}^{T} x_i(t) \le G_{\max}$ for any action $i$, and since we have a minus sign, we can do:
$\mathbb{E}[G_{\text{Exp3}}] \ge (1-\gamma) G_j - \frac{K \ln K}{\gamma} - (e-2)\gamma G_{\max}$
Slide36
Proof
Note this was true for any action $j$, which means including the maximal action!
$\mathbb{E}[G_{\text{Exp3}}] \ge (1-\gamma) G_{\max} - \frac{K \ln K}{\gamma} - (e-2)\gamma G_{\max}$
Rearranging gives $G_{\max} - \mathbb{E}[G_{\text{Exp3}}] \le (e-1)\gamma G_{\max} + \frac{K \ln K}{\gamma}$
Q.E.D. (!)
Slide37
Break(?)
Slide38
Is it enough?
Maybe?
If it was a clear "yes", I wouldn't have asked it :P
What was our problem?
Slide39
Room for improvement
We assumed an upper limit $g \ge G_{\max}$ is known
We then selected $\gamma = \min\left\{1, \sqrt{\frac{K \ln K}{(e-1) g}}\right\}$
This obtained the bound $2\sqrt{e-1} \cdot \sqrt{g K \ln K}$
If we don't know $G_{\max}$, we may over-shoot
For example, setting $g = T$ (the largest possible return)
This would yield a less tight result :/
Slide40
New goal
Maintain the same bound on the weak regret
Don't assume a known $g$
In fact, do better!
Maintain this bound uniformly throughout the execution*
* A formal definition will follow
Slide41
Notations# (better than ++++)
$G_j(T) = \sum_{t < T} x_j(t)$ – The return of action $j$ till (exclusive) time horizon $T$
$\hat{G}_j(T) = \sum_{t < T} \hat{x}_j(t)$ – The estimated return of action $j$ till time horizon $T$
$\hat{G}_{\max}(T) = \max_j \hat{G}_j(T)$ – The maximal estimated return of any single action till time horizon $T$
Slide42
Re-stating our goal
At every time-step $T$, the weak regret till that point should maintain the bound
$G_{\max}(T) - \mathbb{E}[G_{\text{Exp3.1}}(T)] = O\left(\sqrt{G_{\max}(T) \cdot K \ln K}\right)$
This bound will hold all the way!
This also gives a hint on how to do this:
Previously we guessed $g \ge G_{\max}$
Instead, let's maintain $g \ge \hat{G}_{\max}$
Update $g$ and $\gamma$ as $\hat{G}_{\max}$ grows
Effectively – we are searching for the right $g$!
Slide43
Exp3.1 (creative names!)
Initialize:
Set $\hat{G}_i(1) = 0$ for all actions $i$
For each "epoch" $r = 0, 1, 2, \ldots$ do:
(Re-)Initialize Exp3 with: $g_r = \frac{K \ln K}{e-1} \cdot 4^r$ and $\gamma_r = \min\left\{1, \sqrt{\frac{K \ln K}{(e-1) g_r}}\right\}$
While $\hat{G}_{\max} \le g_r - \frac{K}{\gamma_r}$ do:
Do one step with Exp3
(Update our tracking of $\hat{G}_i$ for all actions)
A runnable sketch of this epoch loop follows below
Slide44
Bounds (Theorem 4.1)
Theorem 4.1: $G_{\max} - \mathbb{E}[G_{\text{Exp3.1}}] \le 8\sqrt{e-1} \cdot \sqrt{G_{\max} K \ln K} + 8(e-1)K + 2K \ln K$
Proof outline:
Lemma 4.2: Bound the weak regret per "epoch"
By giving a lower bound on the actual reward
Lemma 4.3: Bound the number of epochs for a finite horizon
Combine both to obtain the final bound
Slide45
Epoch notations
Let $T$ be the overall number of steps
Needed only for the proof
We don't need to know this in advance
Let $R$ be the overall number of epochs
Same here
Let $S_r$ and $T_r$ be the first and last steps of epoch $r$ (i.e. epoch $r$ covers $S_r \le t \le T_r$), $r = 0, \ldots, R$
An epoch may be empty – in that case $T_r = S_r - 1$
Slide46
Lemma 4.2
For any action $j$ and for every epoch $r$:
$\sum_{t=S_r}^{T_r} x_{i_t}(t) \ge \left(\hat{G}_j(T_r+1) - \hat{G}_j(S_r)\right) - 2\sqrt{e-1} \cdot \sqrt{g_r K \ln K}$
Holds trivially for empty epochs
We'll prove for $S_r \le T_r$
Slide47
Proof
Some (many) slides back, we proved the following:
$G_{\text{Exp3}} \ge (1-\gamma)\hat{G}_j - \frac{K \ln K}{\gamma} - (e-2)\frac{\gamma}{K} \sum_{t}\sum_{i} \hat{x}_i(t)$
Translating to our current notations (restricting all sums to epoch $r$, with $\gamma = \gamma_r$), this becomes
$\sum_{t=S_r}^{T_r} x_{i_t}(t) \ge (1-\gamma_r)\left(\hat{G}_j(T_r+1) - \hat{G}_j(S_r)\right) - \frac{K \ln K}{\gamma_r} - (e-2)\frac{\gamma_r}{K} \sum_{t=S_r}^{T_r}\sum_{i=1}^{K} \hat{x}_i(t)$
Slide48
Proof
Merging the estimated reward of the epoch also with all previous epochs: $\hat{G}_j(T_r+1) - \hat{G}_j(S_r) \le \hat{G}_j(T_r+1)$
From the termination condition, we know that for any action $i$, at time $T_r$ (just before the last step of epoch $r$ was taken, the continuation condition still held): $\hat{G}_i(T_r) \le g_r - \frac{K}{\gamma_r}$
From fact #2, we know that $\hat{x}_i(T_r) \le \frac{K}{\gamma_r}$
Together we get that for any action $i$: $\hat{G}_i(T_r+1) \le g_r$
Slide49
Proof
Recall that $\gamma_r = \min\left\{1, \sqrt{\frac{K \ln K}{(e-1) g_r}}\right\}$
Substitute $\hat{G}_i(T_r+1) \le g_r$ and $\gamma_r$, and the negative terms become $(e-1)\gamma_r g_r + \frac{K \ln K}{\gamma_r} = 2\sqrt{e-1} \cdot \sqrt{g_r K \ln K}$
Q.E.D.
Slide50
Lemma 4.3
Denote $c = \frac{K \ln K}{e-1}$ and $z = \frac{K}{2c}$
So that $g_r = c \cdot 4^r$ and $\frac{K}{\gamma_r} = K \cdot 2^r$
Then the number of epochs $R$ satisfies the inequality
$2^{R-1} \le z + \sqrt{z^2 + \frac{\hat{G}_{\max}(T+1)}{c}}$
Holds trivially for $R = 0$
We'll prove for $R \ge 1$
Slide51
Proof
Regardless of the lemma, we know that: $\hat{G}_{\max}(T+1) > g_{R-1} - \frac{K}{\gamma_{R-1}} = c \cdot 4^{R-1} - K \cdot 2^{R-1}$
$T_{R-1}$ is the termination step of epoch $R-1$, so we know the epoch continuation condition was violated (that is why epoch $R$ started)
Slide52
Proof
So, we've shown that $\hat{G}_{\max}(T+1) > f(x)$, where $f(x) = c x^2 - K x$ and $x = 2^{R-1}$ is a variable, as we are trying to bound $R$
This function is a parabola of $x$
The minimum is obtained at $x = \frac{K}{2c} = z$
(Reminder: Extremum of $a x^2 + b x$ is at $x = -\frac{b}{2a}$)
It increases monotonically for $x \ge z$
Slide53
Proof
Now, suppose the claim is false
Reversing the original inequality, we get $2^{R-1} > z + \sqrt{z^2 + \frac{\hat{G}_{\max}(T+1)}{c}}$
$2^{R-1}$ and the expression in purple are both larger than $z$
This is in the increasing part of $f$, so:
$f(2^{R-1}) > f\left(z + \sqrt{z^2 + \frac{\hat{G}_{\max}(T+1)}{c}}\right) = \hat{G}_{\max}(T+1)$
Slide54
Proof
Assuming the claim is false, we got $f(2^{R-1}) > \hat{G}_{\max}(T+1)$
But, before, we proved that $\hat{G}_{\max}(T+1) > f(2^{R-1})$ – a contradiction
Q.E.D.
Slide55
Theorem 4.1
Our target was: $G_{\max} - \mathbb{E}[G_{\text{Exp3.1}}] \le 8\sqrt{e-1} \cdot \sqrt{G_{\max} K \ln K} + 8(e-1)K + 2K \ln K$
Let's begin
Slide56
Proof
Lemma 4.2 says that for any action $j$, the actual reward in epoch $r$ is at least $\left(\hat{G}_j(T_r+1) - \hat{G}_j(S_r)\right) - 2\sqrt{e-1}\sqrt{g_r K \ln K}$
Then this should also hold for the maximal action
Summing over all epochs $r = 0, \ldots, R$, the per-epoch estimated gains telescope:
$G_{\text{Exp3.1}} \ge \hat{G}_{\max}(T+1) - \sum_{r=0}^{R} 2\sqrt{e-1}\sqrt{g_r K \ln K}$
Slide57
Proof
Computing the geometric sum (recall $g_r = c \cdot 4^r$, so $\sqrt{g_r K \ln K} = 2^r \cdot \frac{K \ln K}{\sqrt{e-1}}$):
$\sum_{r=0}^{R} 2\sqrt{e-1}\sqrt{g_r K \ln K} = 2 K \ln K \sum_{r=0}^{R} 2^r = 2 K \ln K \left(2^{R+1} - 1\right) = 8 K \ln K \cdot 2^{R-1} - 2 K \ln K$
And so: $G_{\text{Exp3.1}} \ge \hat{G}_{\max}(T+1) + 2 K \ln K - 8 K \ln K \cdot 2^{R-1}$
Slide58
Proof
Using lemma 4.3 we obtain ($8 K \ln K \cdot z = 4(e-1)K$, and $\sqrt{z^2 + a} \le z + \sqrt{a}$):
$8 K \ln K \cdot 2^{R-1} \le 8(e-1)K + 8\sqrt{e-1}\sqrt{\hat{G}_{\max}(T+1) \cdot K \ln K}$
And so: $G_{\text{Exp3.1}} \ge \hat{G}_{\max}(T+1) + 2 K \ln K - 8(e-1)K - 8\sqrt{e-1}\sqrt{\hat{G}_{\max}(T+1) \cdot K \ln K}$
Slide59
Finishing the proof
Using Jensen's inequality (where the variable is $\hat{G}_{\max}(T+1)$), we take the expectation and complete the proof
Full details in the paper
We don't have time at this point anyhow :P
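For reference, my one-line summary of that step (not an original slide): the lower bound on $G_{\text{Exp3.1}}$ is a convex function of $\hat{G}_{\max}(T+1)$, so Jensen's inequality gives

$$\mathbb{E}\left[\hat{G}_{\max} - 8\sqrt{e-1}\sqrt{\hat{G}_{\max} K \ln K}\right] \ge \mathbb{E}[\hat{G}_{\max}] - 8\sqrt{e-1}\sqrt{\mathbb{E}[\hat{G}_{\max}] \cdot K \ln K}$$

and since each $\hat{G}_j$ is unbiased, $\mathbb{E}[\hat{G}_{\max}] \ge \max_j \mathbb{E}[\hat{G}_j] = G_{\max}$, which turns the bound into one about $G_{\max}$.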
Slide60
Questions?