Beat the Mean Bandit

Uploaded by pasty-toler on 2017-11-16



Presentation Transcript


Beat the Mean Bandit

Yisong Yue (CMU) & Thorsten Joachims (Cornell)

Team Draft Interleaving (Comparison Oracle for Search)

Ranking A
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Valley Wineries - Plan your wine... www.napavalley.com/wineries
3. Napa Valley College www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine www.napavintners.com
6. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging... www.napavalley.com
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
5. NapaValley.org www.napavalley.org
6. The Napa Valley Marathon www.napavalleymarathon.org

Presented Ranking
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Country, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
6. Napa Valley College www.napavalley.edu/homex.asp
7. NapaValley.org www.napavalley.org

The users' clicks land on results contributed by Ranking B, so B wins! [Radlinski et al. 2008]

Dueling Bandits Problem

Given K bandits b1, …, bK.
Each iteration: compare (duel) two bandits (e.g., interleave two retrieval functions).
Cost function (regret): with (bt, bt') the two bandits chosen at iteration t and b* the overall best one, each comparison incurs regret

  [P(b* > bt) − ½] + [P(b* > bt') − ½]

(the fraction of users who prefer the best bandit over each of the chosen ones).

[Yue et al. 2009]
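The per-duel regret can be made concrete using the preference table from the Example Pairwise Preferences slide. A minimal sketch (indices 0–5 stand for bandits A–F, with A as the best bandit b*):

```python
# eps[i][j] = P(bandit i beats bandit j) - 0.5, from the
# "Example Pairwise Preferences" slide (indices 0..5 = bandits A..F).
EPS = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]

BEST = 0  # bandit A is the overall best, b*

def duel_regret(i, j):
    """Regret for one duel (b_t, b_t'): how strongly users prefer b*
    over each of the two bandits actually compared."""
    return EPS[BEST][i] + EPS[BEST][j]

print(round(duel_regret(4, 5), 2))  # E vs F -> 0.22
print(round(duel_regret(3, 5), 2))  # D vs F -> 0.15
print(round(duel_regret(0, 1), 2))  # A vs B -> 0.05
```

These reproduce the three "Compare …" examples given alongside the preference table.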

Example Pairwise Preferences

        A      B      C      D      E      F
A    0.00   0.05   0.05   0.04   0.11   0.11
B   -0.05   0.00   0.05   0.06   0.08   0.10
C   -0.05  -0.05   0.00   0.04   0.01   0.06
D   -0.04  -0.04  -0.04   0.00   0.04   0.00
E   -0.11  -0.08  -0.01  -0.04   0.00   0.01
F   -0.11  -0.10  -0.06  -0.00  -0.01   0.00

Values are Pr(row > col) – 0.5, derived from interleaving experiments on http://arXiv.org.

Compare E & F: P(A > E) = 0.61, P(A > F) = 0.61. Incurred regret = 0.22.

Compare D & F: P(A > D) = 0.54, P(A > F) = 0.61. Incurred regret = 0.15.

Compare A & B: P(A > A) = 0.50, P(A > B) = 0.55. Incurred regret = 0.05.

Violation of internal consistency! For strong stochastic transitivity:
-- P(A > D) − 0.5 should be at least 0.06 (the table gives 0.04)
-- P(C > E) − 0.5 should be at least 0.04 (the table gives 0.01)
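The transitivity check can be automated. A sketch that scans every ordered triple of the slide's preference matrix for strong-stochastic-transitivity violations (for i > j > k in the true ranking, ε(i,k) should be at least max(ε(i,j), ε(j,k))); besides the two violations called out above, it also flags A > D through C and D > F through E:

```python
# eps[i][j] = P(i beats j) - 0.5 from the slide (indices 0..5 = A..F).
EPS = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]

def sst_violations(eps):
    """Triples (i, j, k), with i > j > k in the true ranking, where strong
    stochastic transitivity eps[i][k] >= max(eps[i][j], eps[j][k]) fails."""
    n = len(eps)
    bad = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                need = max(eps[i][j], eps[j][k])
                if eps[i][k] < need:
                    bad.append((i, j, k, eps[i][k], need))
    return bad

for i, j, k, got, need in sst_violations(EPS):
    print(f"{'ABCDEF'[i]} > {'ABCDEF'[k]}: is {got:.2f}, "
          f"should be at least {need:.2f} (via {'ABCDEF'[j]})")
```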

Score of each bandit against the mean bandit (each cell: wins/total against that column's bandit; Mean = total wins / total comparisons, with Lower/Upper confidence bounds):

          A      B      C      D      E      F    Mean  Total  Lower  Upper
A wins  13/25  16/24  11/22  16/28  20/30  13/21   0.59    150   0.49   0.69
B wins  14/30  15/30  13/19  15/20  17/26  20/25   0.63    150   0.53   0.73
C wins  12/28  10/22  13/23  15/28  20/24  13/25   0.55    150   0.45   0.65
D wins   9/20  15/28  10/21  11/23  15/28  15/30   0.50    150   0.40   0.60
E wins   8/24  11/25   6/22  14/29  14/31  10/19   0.42    150   0.32   0.52
F wins  11/29   4/25  10/18  12/25  14/30  13/23   0.43    150   0.33   0.53
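The Mean, Total, and bound columns can be reproduced from the wins/total cells. A sketch using row A; the confidence radius here is a generic Hoeffding-style term with an illustrative δ, not the exact constant from the paper:

```python
import math

# Row A of the table: wins and totals against columns A..F.
wins   = [13, 16, 11, 16, 20, 13]
totals = [25, 24, 22, 28, 30, 21]

def score_vs_mean(wins, totals, active):
    """Empirical score against the mean bandit, counting only the
    columns of still-active bandits."""
    w = sum(wins[i] for i in active)
    n = sum(totals[i] for i in active)
    return w / n, n

def conf_radius(n, delta=0.05):
    # Hoeffding-style radius (illustrative; the paper's constant differs).
    return math.sqrt(math.log(1.0 / delta) / (2 * n))

score, n = score_vs_mean(wins, totals, active=range(6))
print(round(score, 2), n)  # 0.59 150, matching A's Mean/Total

# After E is eliminated, its column is simply dropped from the sums:
score2, n2 = score_vs_mean(wins, totals, active=[0, 1, 2, 3, 5])
print(n2)  # 120, matching A's Total once E is removed
```

With δ = 0.05 the radius at n = 150 comes out near 0.10, the same half-width as the slide's interval (0.49 to 0.69 around 0.59), though that agreement is only suggestive.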

Optimizing Information Retrieval Systems

-- Increasingly reliant on user feedback (e.g., clicks on search results)
-- Online learning is a popular modeling tool (especially partial-information (bandit) settings)
-- Our focus: learning from relative preferences
-- Motivated by recent work on interleaved retrieval evaluation

(E has been eliminated: its greyed-out column no longer counts toward the active bandits' totals, and E's row is frozen.)

          A      B      C      D      E      F    Mean  Total  Lower  Upper
A wins  13/25  16/24  11/22  16/28  20/30  13/21   0.58    120   0.49   0.67
B wins  14/30  15/30  13/19  15/20  15/26  20/25   0.62    124   0.51   0.73
C wins  12/28  10/22  13/23  15/28  20/24  13/25   0.50    126   0.39   0.61
D wins   9/20  15/28  10/21  11/23  15/28  15/30   0.49    122   0.38   0.60
E wins   8/24  11/25   6/22  14/29  14/31  10/19   0.42    150   0.32   0.52
F wins  11/29   4/25  10/18  12/25  14/30  13/23   0.42    120   0.31   0.53

(Later state, after more comparisons: E and F have been eliminated, and their rows are frozen at the point of elimination.)

          A      B      C      D      E      F    Mean  Total  Lower  Upper
A wins  15/30  19/29  14/28  18/33  23/30  15/25   0.55    120   0.43   0.67
B wins  15/33  17/34  15/24  20/27  15/26  23/27   0.56    118   0.44   0.68
C wins  13/31  11/28  14/29  15/30  20/24  16/27   0.45    118   0.33   0.57
D wins  11/26  17/31  12/26  14/29  15/28  17/33   0.48    112   0.36   0.60
E wins   8/24  11/25   6/22  14/29  14/31  10/19   0.42    150   0.32   0.52
F wins  12/32   7/30  13/26  13/28  14/30  15/29   0.41    145   0.31   0.51

(End of the walkthrough: every bandit but A has been eliminated, each greyed-out row frozen at the point of its elimination.)

          A      B      C      D      E      F    Mean  Total  Lower  Upper
A wins  41/80  44/75  38/70  42/75  23/30  15/25   0.51     80   0.38   0.64
B wins  31/69  38/78  47/78  51/75  15/26  23/27   0.52    147   0.45   0.59
C wins  33/77  31/77  35/70  39/76  20/24  16/27   0.33    225   0.24   0.42
D wins  30/76  27/77  35/74  35/73  15/28  17/33   0.42    300   0.35   0.49
E wins   8/24  11/25   6/22  14/29  14/31  10/19   0.42    150   0.32   0.52
F wins  12/32   7/30  13/26  13/28  14/30  15/29   0.41    145   0.31   0.51

Regret Guarantee

Playing against the mean bandit calibrates preference scores:
-- Estimates for all (active) bandits are directly comparable
-- One estimate per active bandit = a linear number of estimates
We can bound the number of comparisons needed to remove the worst bandit:
-- Varies smoothly with the transitivity parameter γ
-- High-probability bound
We can bound the regret incurred by each comparison:
-- Varies smoothly with the transitivity parameter γ
So we can bound the total regret with high probability:
-- γ is typically close to 1
We also have a similar PAC guarantee.
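The "bound on comparisons" can be illustrated with a back-of-the-envelope Hoeffding calculation (a sketch only; the paper's bound has different constants and a γ dependence): separating two bandits whose win rates differ by ε takes on the order of log(1/δ)/ε² duels.

```python
import math

def comparisons_needed(gap, delta=0.05):
    """Hoeffding-style sample count to separate two win rates differing
    by `gap` with confidence 1 - delta (illustrative constant)."""
    return math.ceil(2 * math.log(1.0 / delta) / gap ** 2)

# e.g. the A-vs-B gap of 0.05 from the example preference table:
print(comparisons_needed(0.05))  # 2397
```

The quadratic dependence on 1/gap is why close pairs (like A and B) dominate the comparison budget, while clearly worse bandits (like E and F) are removed quickly.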

Assumptions

Assumptions on preference behavior (required for the theoretical analysis):
-- P(bi > bj) = ½ + εij (distinguishability)

Conclusions

Online learning approach using pairwise feedback:
-- Well-suited for optimizing information retrieval systems from user feedback
-- Models the exploration/exploitation tradeoff
-- Models violations of preference transitivity
Algorithm: Beat-the-Mean
-- Regret linear in the number of bandits and logarithmic in the number of iterations
-- Degrades smoothly with transitivity violations
-- Stronger guarantees than previous work
-- Also has PAC guarantees
-- Empirically supported

Empirical Results

Stochastic Triangle Inequality
For three bandits b* > bj > bk:

  ε(b*, bk) ≤ ε(b*, bj) + ε(bj, bk)

(a diminishing-returns property, writing ε(b, b') = P(b > b') − ½)

Simulation experiment with γ = 1. Light: Beat-the-Mean; dark: Interleaved Filter [Yue et al. 2009]. Beat-the-Mean exhibits lower variance.

Simulation experiment with γ = 1.3. Light: Beat-the-Mean; dark: Interleaved Filter [Yue et al. 2009]. Interleaved Filter has quadratic regret in the worst case.

Relaxed Stochastic Transitivity
For three bandits b* > bj > bk:

  γ · ε(b*, bk) ≥ max{ ε(b*, bj), ε(bj, bk) }

(an internal-consistency property, writing ε(b, b') = P(b > b') − ½)

γ = 1 is required in previous work, and must hold for all bandit triplets.
γ = 1.5 suffices for the Example Pairwise Preferences table shown earlier.
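The smallest γ satisfying relaxed stochastic transitivity can be computed directly from the example matrix (a sketch; only triples involving the best bandit b* = A are constrained):

```python
# eps[i][j] = P(i beats j) - 0.5 from the Example Pairwise Preferences
# slide (indices 0..5 = A..F).
EPS = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]

def required_gamma(eps, best=0):
    """Smallest gamma with gamma * eps[best][k] >= max(eps[best][j], eps[j][k])
    over all triples best > j > k (relaxed stochastic transitivity)."""
    n = len(eps)
    g = 1.0
    for j in range(best + 1, n):
        for k in range(j + 1, n):
            need = max(eps[best][j], eps[j][k])
            if eps[best][k] > 0:
                g = max(g, need / eps[best][k])
    return g

print(round(required_gamma(EPS), 2))  # 1.5, as stated on the slide
```

The binding triple is A > B > D: ε(A, D) = 0.04 must be inflated by γ = 1.5 to cover ε(B, D) = 0.06.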

Beat-the-Mean

-- Each bandit (row) maintains a score against the mean bandit
-- The mean bandit is the average of all active bandits (averaging over columns A–F)
-- Upper/lower confidence bounds are maintained for each score (last two columns)
-- When one bandit dominates another (its lower bound exceeds the other's upper bound), the dominated bandit is removed (greyed out)
-- The removed bandit's comparisons are dropped from the remaining score estimates (greyed-out columns no longer count)
-- The remaining scores then estimate performance against the new mean bandit (the mean of the remaining active bandits)
-- Continue until one bandit remains
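The walkthrough above can be sketched as a simulation. This is a simplified rendition (uniform duel scheduling, generic Hoeffding intervals, illustrative δ and horizon, not the paper's exact schedule or constants), run on the slide's preference matrix:

```python
import math
import random

# eps[i][j] = P(i beats j) - 0.5, from the Example Pairwise Preferences slide.
EPS = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]

def beat_the_mean(eps, rounds=20000, delta=0.01, seed=0):
    rng = random.Random(seed)
    n = len(eps)
    active = set(range(n))
    wins = [[0] * n for _ in range(n)]
    plays = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        if len(active) == 1:
            break
        # each active bandit duels one uniformly chosen active opponent
        for i in active:
            j = rng.choice(sorted(active))
            plays[i][j] += 1
            if rng.random() < 0.5 + eps[i][j]:
                wins[i][j] += 1
        # score each active bandit against the mean of the active set;
        # columns of eliminated bandits are implicitly dropped from the sums
        stats = {}
        for i in active:
            m = sum(plays[i][j] for j in active)
            p = sum(wins[i][j] for j in active) / m
            c = math.sqrt(math.log(1.0 / delta) / (2 * m))
            stats[i] = (p, c)
        # remove the worst bandit once it is dominated: its upper bound
        # falls below the best bandit's lower bound
        worst = min(active, key=lambda i: stats[i][0])
        best = max(active, key=lambda i: stats[i][0])
        if stats[worst][0] + stats[worst][1] < stats[best][0] - stats[best][1]:
            active.remove(worst)
    # sole survivor, or best-scoring active bandit at the horizon
    def score(i):
        m = sum(plays[i][j] for j in active)
        return sum(wins[i][j] for j in active) / m
    return max(active, key=score)

print("ABCDEF"[beat_the_mean(EPS)])
```

Dropping eliminated bandits' columns from the sums is what recalibrates every remaining score against the new, smaller mean bandit, exactly as in the table walkthrough.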

This is not possible with previous work!