Beat the Mean Bandit
Yisong Yue (CMU) & Thorsten Joachims (Cornell)
Team Draft Interleaving (Comparison Oracle for Search)
Ranking A
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
3. Napa Valley College www.napavalley.edu/homex.asp
4. Been There | Tips | Napa Valley www.ivebeenthere.co.uk/tips/16681
5. Napa Valley Wineries and Wine www.napavintners.com
6. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley

Ranking B
1. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
2. Napa Valley – The authority for lodging... www.napavalley.com
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
5. NapaValley.org www.napavalley.org
6. The Napa Valley Marathon www.napavalleymarathon.org
Presented Ranking
1. Napa Valley – The authority for lodging... www.napavalley.com
2. Napa County, California – Wikipedia en.wikipedia.org/wiki/Napa_Valley
3. Napa: The Story of an American Eden... books.google.co.uk/books?isbn=...
4. Napa Valley Wineries – Plan your wine... www.napavalley.com/wineries
5. Napa Valley Hotels – Bed and Breakfast... www.napalinks.com
6. Napa Valley College www.napavalley.edu/homex.asp
7. NapaValley.org www.napavalley.org

Two clicks: B wins! [Radlinski et al. 2008]
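The interleaving scheme above can be sketched in code. This is a minimal Python sketch of team-draft interleaving as described by Radlinski et al. (2008); the function names and the coin-flip handling are my own, not the authors' implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Team-draft interleaving: each round a coin flip decides which
    ranking picks first; each side then contributes its highest-ranked
    result not yet shown. Returns the interleaved list plus, for each
    result, the team that contributed it (used to credit clicks)."""
    rng = rng or random.Random()
    interleaved, credit, seen = [], {}, set()

    def pick(ranking, team):
        # Contribute this team's best result that is not already shown.
        for doc in ranking:
            if doc not in seen:
                seen.add(doc)
                interleaved.append(doc)
                credit[doc] = team
                return True
        return False

    universe = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(universe):
        order = [(ranking_a, 'A'), (ranking_b, 'B')]
        if rng.random() < 0.5:
            order.reverse()
        for ranking, team in order:
            pick(ranking, team)
    return interleaved, credit

def score_clicks(credit, clicked):
    """Credit each click to the team whose ranking contributed it."""
    wins = {'A': 0, 'B': 0}
    for doc in clicked:
        wins[credit[doc]] += 1
    return wins
```

Clicks on the interleaved list are credited to the team that contributed the clicked result; the retrieval function whose team collects more clicks wins the duel.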
Dueling Bandits Problem
Given K bandits b_1, ..., b_K.
Each iteration: compare (duel) two bandits (e.g., by interleaving two retrieval functions).
Cost function (regret): with (b_t, b_t') the two bandits chosen at iteration t and b* the overall best one,
Δ_T = Σ_{t=1}^{T} [ε(b*, b_t) + ε(b*, b_t')], where ε(b_i, b_j) = P(b_i > b_j) - 1/2
(the fraction of users who prefer the best bandit over the chosen ones).
[Yue et al. 2009]
Example Pairwise Preferences
        A      B      C      D      E      F
 A    0      0.05   0.05   0.04   0.11   0.11
 B   -0.05   0      0.05   0.06   0.08   0.10
 C   -0.05  -0.05   0      0.04   0.01   0.06
 D   -0.04  -0.04  -0.04   0      0.04   0.00
 E   -0.11  -0.08  -0.01  -0.04   0      0.01
 F   -0.11  -0.10  -0.06  -0.00  -0.01   0

Values are P(row > col) - 0.5, derived from interleaving experiments on http://arXiv.org
Compare E & F: P(A > E) = 0.61, P(A > F) = 0.61 → incurred regret = 0.22
Compare D & F: P(A > D) = 0.54, P(A > F) = 0.61 → incurred regret = 0.15
Compare A & B: P(A > A) = 0.50, P(A > B) = 0.55 → incurred regret = 0.05
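The three computations above follow mechanically from the preference table. A small Python sketch (matrix values transcribed from the slide; the helper name `incurred_regret` is mine):

```python
# eps[i][j] = P(i > j) - 0.5, from the interleaving experiments on arXiv.org
bandits = ['A', 'B', 'C', 'D', 'E', 'F']
rows = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]
eps = {bi: {bj: rows[i][j] for j, bj in enumerate(bandits)}
       for i, bi in enumerate(bandits)}

def incurred_regret(best, b1, b2):
    """Regret of one duel: how strongly users prefer the best bandit
    over the two bandits actually compared."""
    return eps[best][b1] + eps[best][b2]

print(incurred_regret('A', 'E', 'F'))  # 0.22
print(incurred_regret('A', 'D', 'F'))  # 0.15
print(incurred_regret('A', 'A', 'B'))  # 0.05
```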
Violation of internal consistency! Strong stochastic transitivity would require:
-- ε(A, D) ≥ 0.06 (observed: 0.04)
-- ε(C, E) ≥ 0.04 (observed: 0.01)
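These violations can be found mechanically. A sketch that scans every ordered triple for strong-stochastic-transitivity failures (the ordering A > ... > F and the helper names are my assumptions):

```python
bandits = ['A', 'B', 'C', 'D', 'E', 'F']  # ordered best to worst
rows = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]
eps = {bi: {bj: rows[i][j] for j, bj in enumerate(bandits)}
       for i, bi in enumerate(bandits)}

def sst_violations(eps, order):
    """Strong stochastic transitivity demands, for every b_i > b_j > b_k,
    eps(b_i, b_k) >= max(eps(b_i, b_j), eps(b_j, b_k)).
    Returns {(b_i, b_k): (required, observed)} for each violated pair."""
    out = {}
    n = len(order)
    for i in range(n):
        for k in range(i + 2, n):
            bi, bk = order[i], order[k]
            required = max(max(eps[bi][order[j]], eps[order[j]][bk])
                           for j in range(i + 1, k))
            if eps[bi][bk] < required:
                out[(bi, bk)] = (required, eps[bi][bk])
    return out
```

On this matrix the scan reports the two pairs called out above, (A, D) needing at least 0.06 and (C, E) needing at least 0.04, plus the borderline pair (D, F), where ε(D, F) = 0.00 falls short of the 0.04 implied by D > E > F.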
Each bandit's record against the mean bandit (wins/total per opponent column):

       vs A    vs B    vs C    vs D    vs E    vs F    Mean   Total  Lower  Upper
 A    13/25   16/24   11/22   16/28   20/30   13/21    0.59    150   0.49   0.69
 B    14/30   15/30   13/19   15/20   17/26   20/25    0.63    150   0.53   0.73
 C    12/28   10/22   13/23   15/28   20/24   13/25    0.55    150   0.45   0.65
 D     9/20   15/28   10/21   11/23   15/28   15/30    0.50    150   0.40   0.60
 E     8/24   11/25    6/22   14/29   14/31   10/19    0.42    150   0.32   0.52
 F    11/29    4/25   10/18   12/25   14/30   13/23    0.43    150   0.33   0.53
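The Mean and Total columns in the table are just pooled wins over pooled comparisons across each row, restricted to still-active opponents. A sketch (counts transcribed from the table; the `score` helper is mine):

```python
# (wins, total) of each bandit against opponents A..F, from the table above
counts = {
    'A': [(13, 25), (16, 24), (11, 22), (16, 28), (20, 30), (13, 21)],
    'B': [(14, 30), (15, 30), (13, 19), (15, 20), (17, 26), (20, 25)],
    'C': [(12, 28), (10, 22), (13, 23), (15, 28), (20, 24), (13, 25)],
    'D': [( 9, 20), (15, 28), (10, 21), (11, 23), (15, 28), (15, 30)],
    'E': [( 8, 24), (11, 25), ( 6, 22), (14, 29), (14, 31), (10, 19)],
    'F': [(11, 29), ( 4, 25), (10, 18), (12, 25), (14, 30), (13, 23)],
}

def score(row, active_cols=None):
    """Pooled score against the mean bandit: total wins over total
    comparisons, counting only columns of still-active opponents."""
    cols = range(len(row)) if active_cols is None else active_cols
    wins = sum(row[j][0] for j in cols)
    total = sum(row[j][1] for j in cols)
    return wins, total, wins / total
```

`score(counts['A'])` pools to 89 wins out of 150, the 0.59 in the Mean column; dropping column E (`score(counts['A'], [0, 1, 2, 3, 5])`) yields 69/120, matching the recalibrated table shown after E is eliminated.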
Optimizing Information Retrieval Systems
-- Increasingly reliant on user feedback (e.g., clicks on search results)
-- Online learning is a popular modeling tool (especially partial-information (bandit) settings)
-- Our focus: learning from relative preferences
-- Motivated by recent work on interleaved retrieval evaluation
E is eliminated (its upper bound 0.52 falls below B's lower bound 0.53); comparisons against E (in parentheses) no longer count, and E's own row is frozen:

       vs A    vs B    vs C    vs D    vs E     vs F    Mean   Total  Lower  Upper
 A    13/25   16/24   11/22   16/28  (20/30)   13/21    0.58    120   0.49   0.67
 B    14/30   15/30   13/19   15/20  (15/26)   20/25    0.62    124   0.51   0.73
 C    12/28   10/22   13/23   15/28  (20/24)   13/25    0.50    126   0.39   0.61
 D     9/20   15/28   10/21   11/23  (15/28)   15/30    0.49    122   0.38   0.60
 E     8/24   11/25    6/22   14/29  (14/31)   10/19    0.42    150   0.32   0.52
 F    11/29    4/25   10/18   12/25  (14/30)   13/23    0.42    120   0.31   0.53
After further comparisons F is eliminated as well; A-D are now scored only against the active set {A, B, C, D}:

       vs A    vs B    vs C    vs D    vs E     vs F    Mean   Total  Lower  Upper
 A    15/30   19/29   14/28   18/33  (23/30)  (15/25)   0.55    120   0.43   0.67
 B    15/33   17/34   15/24   20/27  (15/26)  (23/27)   0.56    118   0.44   0.68
 C    13/31   11/28   14/29   15/30  (20/24)  (16/27)   0.45    118   0.33   0.57
 D    11/26   17/31   12/26   14/29  (15/28)  (17/33)   0.48    112   0.36   0.60
 E     8/24   11/25    6/22   14/29  (14/31)   10/19    0.42    150   0.32   0.52
 F    12/32    7/30   13/26   13/28  (14/30)   15/29    0.41    145   0.31   0.51
Later still, with accumulated counts for the remaining active bandits:

       vs A    vs B    vs C    vs D    vs E     vs F    Mean   Total  Lower  Upper
 A    41/80   44/75   38/70   42/75  (23/30)  (15/25)   0.50    180   0.38   0.64
 B    31/69   38/78   47/78   51/75  (15/26)  (23/27)   0.52    147   0.45   0.59
 C    33/77   31/77   35/70   39/76  (20/24)  (16/27)   0.33    225   0.24   0.42
 D    30/76   27/77   35/74   35/73  (15/28)  (17/33)   0.42    300   0.35   0.49
 E     8/24   11/25    6/22   14/29  (14/31)   10/19    0.42    150   0.32   0.52
 F    12/32    7/30   13/26   13/28  (14/30)   15/29    0.41    145   0.31   0.51
Regret Guarantee
Playing against the mean bandit calibrates preference scores:
-- Estimates of (active) bandits are directly comparable
-- One estimate per active bandit = linear number of estimates
We can bound the number of comparisons needed to remove the worst bandit:
-- Varies smoothly with transitivity parameter γ
-- High-probability bound
We can bound the regret incurred by each comparison:
-- Varies smoothly with transitivity parameter γ
We can bound the total regret with high probability:
-- γ is typically close to 1
We also have a similar PAC guarantee.
Assumptions
Assumptions on preference behavior (required for the theoretical analysis):
-- P(b_i > b_j) = 1/2 + ε_ij (distinguishability)
Conclusions
Online learning approach using pairwise feedback:
-- Well-suited for optimizing information retrieval systems from user feedback
-- Models the exploration/exploitation tradeoff
-- Models violations of preference transitivity
Algorithm: Beat-the-Mean
-- Regret linear in the number of bandits and logarithmic in the number of iterations
-- Degrades smoothly with transitivity violation
-- Stronger guarantees than previous work
-- Also has PAC guarantees
-- Empirically supported
Empirical Results
Stochastic Triangle Inequality
For three bandits b* > b_j > b_k: ε(b*, b_k) ≤ ε(b*, b_j) + ε(b_j, b_k) (a diminishing-returns property)
Simulation experiment with γ = 1 (light: Beat-the-Mean; dark: Interleaved Filter [Yue et al. 2009]). Beat-the-Mean exhibits lower variance.
Simulation experiment with γ = 1.3 (light: Beat-the-Mean; dark: Interleaved Filter [Yue et al. 2009]). Interleaved Filter has quadratic regret in the worst case.
Relaxed Stochastic Transitivity
For three bandits b* > b_j > b_k: γ · ε(b*, b_k) ≥ max{ε(b*, b_j), ε(b_j, b_k)} (an internal-consistency property)
-- γ = 1 was required in previous work, and had to hold for all bandit triplets
-- γ = 1.5 in the Example Pairwise Preferences shown earlier
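The quoted γ = 1.5 can be recomputed from the Example Pairwise Preferences matrix. A sketch (matrix transcribed from that slide; A is taken as b*, and triples with non-positive ε are skipped, which is my reading of the condition):

```python
bandits = ['A', 'B', 'C', 'D', 'E', 'F']  # ordered best (b*) to worst
rows = [
    [ 0.00,  0.05,  0.05,  0.04,  0.11,  0.11],
    [-0.05,  0.00,  0.05,  0.06,  0.08,  0.10],
    [-0.05, -0.05,  0.00,  0.04,  0.01,  0.06],
    [-0.04, -0.04, -0.04,  0.00,  0.04,  0.00],
    [-0.11, -0.08, -0.01, -0.04,  0.00,  0.01],
    [-0.11, -0.10, -0.06, -0.00, -0.01,  0.00],
]
eps = {bi: {bj: rows[i][j] for j, bj in enumerate(bandits)}
       for i, bi in enumerate(bandits)}

def required_gamma(eps, order):
    """Smallest gamma satisfying relaxed stochastic transitivity:
    gamma * eps(b*, b_k) >= max(eps(b*, b_j), eps(b_j, b_k))
    for all triples b* > b_j > b_k, where b* is the best bandit."""
    best = order[0]
    gamma = 1.0
    for j in range(1, len(order)):
        for k in range(j + 1, len(order)):
            bj, bk = order[j], order[k]
            if eps[best][bk] <= 0 or eps[bj][bk] <= 0:
                continue  # condition only constrains strictly ordered triples
            need = max(eps[best][bj], eps[bj][bk]) / eps[best][bk]
            gamma = max(gamma, need)
    return gamma

print(round(required_gamma(eps, bandits), 2))  # 1.5 on this matrix
```

The binding triple is A > B > D: max(ε(A, B), ε(B, D)) = 0.06 against ε(A, D) = 0.04 forces γ = 1.5.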
Beat-the-Mean
-- Each bandit (row) maintains a score against the mean bandit
-- The mean bandit is the average of all active bandits (averaging over columns A-F)
-- Upper/lower confidence bounds are maintained (last two columns)
-- When one bandit dominates another (lower bound > upper bound), the dominated bandit is removed (greyed out)
-- Comparisons against removed bandits are dropped from each score estimate (greyed-out columns do not count)
-- The remaining scores then estimate performance against the new mean bandit (of the remaining active bandits)
-- Continue until one bandit remains
← This is not possible with previous work!
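The elimination loop above can be sketched as a simulation. This is a simplified reading of Beat-the-Mean, not the paper's pseudocode: the confidence radius, the uniform opponent sampling, and all names are my assumptions.

```python
import math
import random

def beat_the_mean(bandits, prob_win, budget, delta=0.05, rng=None):
    """Simplified Beat-the-Mean: every active bandit accumulates wins
    against the 'mean bandit' (a uniformly drawn active opponent).
    prob_win(i, j) is P(bandit i beats bandit j) in one duel.
    When one bandit's lower confidence bound exceeds another's upper
    bound, the dominated bandit is removed, and all comparisons that
    involved it are discarded so the surviving scores stay calibrated
    against the new, smaller mean bandit."""
    rng = rng or random.Random()
    active = list(bandits)
    wins = {b: {o: 0 for o in bandits} for b in bandits}
    plays = {b: {o: 0 for o in bandits} for b in bandits}

    def stats(b):
        # Score against the mean bandit, counting only active opponents.
        n = sum(plays[b][o] for o in active)
        if n == 0:
            return 0.5, 0.0, 1.0
        mean = sum(wins[b][o] for o in active) / n
        radius = math.sqrt(math.log(1.0 / delta) / n)
        return mean, mean - radius, mean + radius

    for _ in range(budget):
        if len(active) == 1:
            break
        # Duel the least-played active bandit against the mean bandit.
        b = min(active, key=lambda x: sum(plays[x][o] for o in active))
        o = rng.choice(active)
        plays[b][o] += 1
        if rng.random() < prob_win(b, o):
            wins[b][o] += 1
        # Drop any bandit whose upper bound is dominated.
        best_lower = max(stats(x)[1] for x in active)
        active = [x for x in active if stats(x)[2] >= best_lower]
    # Return the empirically best remaining bandit.
    return max(active, key=lambda x: stats(x)[0])
```

With a synthetic preference model in which lower-numbered bandits are strictly better, the loop reliably converges on bandit 0 given a reasonable budget.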