Katja Hofmann
Learning to Interact: Towards “Self-learning” Search Solutions
Presenting work by various authors, and own work in collaboration with colleagues at Microsoft and the University of Amsterdam
@katjahofmann
Motivation
Example task: Find the best news articles based on user context; optimize click-through rate.
Example task: Tune ad display parameters (e.g., mainline reserve) to optimize revenue.
Example task: Improve the ranking of query auto-completion (QAC) to optimize suggestion usage.
Typical approach: lots of offline tuning + AB testing.
AB Testing
= controlled experiment (often at large scale) with (at least) 2 conditions [Kohavi et al. ’09, ’12]
Example: which search interface results in higher revenue?
Limitations of AB testing
High manual effort: need to carefully design / tune each treatment.
Few tested alternatives: typically compare 2-5 options.
Large required sample size: depending on effect size and variance, thousands to millions of impressions are required to detect statistically significant differences.
Result: slow development cycles (e.g., weeks).
Can any of this be automated to speed up innovation?
Towards “Self-learning” Search Solutions
Contextual Bandits
Counterfactual Reasoning
Online Learning to Rank
Contextual bandits
Image adapted from: https://www.flickr.com/photos/prayitnophotography/4464000634
Why bandits?
Interactive systems only observe user feedback (reward) on the items (actions) they present to their users.
Exploration-exploitation trade-off
Formalized as a (contextual) bandit problem
(Interaction loop: the user submits a query and interacts with result lists; the system generates results and interprets feedback.)
Bandits
Address the key challenge of balancing exploration and exploitation: explore to learn, exploit to benefit from what has been learned.
= Reinforcement learning problem where actions do not affect future states
Bandits: Example

Arm:                 A      B     C
Successes so far:    100    50    10
Arm pulls so far:    1000   100   20

A’s success rate (10%) is well estimated from 1000 pulls. B and C have the same empirical success rate (50%), but C’s estimate is based on far fewer pulls: both arms are promising, with higher uncertainty for C.
Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.
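The example above maps directly onto index policies such as UCB1, which score each arm by its empirical mean plus an uncertainty bonus that shrinks with the number of pulls. (UCB1 is a standard approach shown here as a sketch; the slide does not name a specific policy.)

```python
import math

def ucb1_scores(successes, pulls):
    """UCB1 index per arm: empirical mean + sqrt(2 ln N / n_i),
    where N is the total number of pulls and n_i the arm's pulls."""
    total = sum(pulls)
    return [s / n + math.sqrt(2 * math.log(total) / n)
            for s, n in zip(successes, pulls)]

# The arms from the slide: A, B, C.
scores = ucb1_scores(successes=[100, 50, 10], pulls=[1000, 100, 20])
# B and C share the same empirical mean (0.5), but C's bonus is larger,
# so UCB1 would pull C next.
best = max(range(3), key=lambda i: scores[i])
```

With these counts, C (index 2) gets the highest index: same estimated payoff as B, but more to learn.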
Adding context
Goal: take the best action based on context information (e.g., topics in user history).
Contextual ε-greedy [Li et al. ’12]
Idea 1: Use a simple exploration approach (here: ε-greedy).
Idea 2: Explore efficiently in a small action space, but use machine learning to optimize over a large context space.
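A minimal sketch of the two ideas combined (the per-action reward models here are toy stand-ins, not the generalized linear models of Li et al.): with probability ε take a random action to keep gathering training data; otherwise take the action whose context-conditional model predicts the highest reward.

```python
import random

def epsilon_greedy_action(models, context, epsilon=0.1):
    """Contextual epsilon-greedy: explore uniformly at random with
    probability epsilon, otherwise exploit the per-action reward model."""
    if random.random() < epsilon:
        return random.randrange(len(models))      # explore
    preds = [m(context) for m in models]          # exploit
    return max(range(len(models)), key=lambda a: preds[a])

# Toy per-action models: predicted click probability as a function of
# a one-dimensional context feature.
models = [lambda x: 0.1 * x, lambda x: 1.0 - 0.1 * x]
action = epsilon_greedy_action(models, context=3.0, epsilon=0.0)
```

With ε = 0 this is pure exploitation; in practice a small positive ε keeps every action's model supplied with fresh data.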
Contextual bandits
Example application: news recommendation [Li et al. ’12].
Li et al. propose to learn generalized linear models using contextual ε-greedy.
Balancing exploration and exploitation is crucial for good results.
Summary: Contextual Bandits
Key ideas
Balance exploration and exploitation, to ensure continued learning while applying what has been learned.
Explore in a small action space, but learn in a large contextual space.
Counterfactual Reasoning
Image: “Illustrated Sutra of Cause and Effect” (E inga kyo) by unknown artist. Woodblock reproduction, published in 1941 by Sinbi-Shoin Co., Tokyo. Licensed under public domain via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:E_innga_kyo.jpg#mediaviewer/File:E_innga_kyo.jpg
Example: ad placement
Problem: estimate the effects of mainline reserve changes. [Bottou et al. ’13]
Counterfactual analysis [Bottou et al. ’13]
(Figure: controlled experiment vs. counterfactual reasoning.)
Answering “what-if” questions [Bottou et al. ’13, Precup et al. ’00]
Key idea: estimate what would have happened if a different system (a different distribution P* over parameter values) had been used, using importance sampling.
Step 1: factorize the data distribution based on the known causal graph, so that P and P* differ only in the factor the system controls.
Step 2: compute estimates using importance sampling:

  Y* = ∫ ℓ(ω) P*(ω) dω = ∫ ℓ(ω) [P*(ω) / P(ω)] P(ω) dω ≈ (1/n) Σᵢ ℓ(ωᵢ) P*(ωᵢ) / P(ωᵢ)

This works because the ratio P*/P reduces to a ratio of the known, controlled factors; all unknown factors in the factorization cancel.
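The Step 2 estimator can be sketched in a few lines (function and variable names are illustrative, not from the paper): reweight each logged reward by the ratio of the counterfactual to the logging probability of the randomized decision that was actually taken.

```python
def counterfactual_estimate(rewards, p_logged, p_counterfactual):
    """Importance-sampling estimate of expected reward under the
    counterfactual distribution P*, using data logged under P."""
    n = len(rewards)
    return sum(r * (p_star / p)
               for r, p, p_star in zip(rewards, p_logged, p_counterfactual)) / n

# Logged rewards, with the probability each randomized decision had under
# the deployed system (P) and would have had under the proposed one (P*).
est = counterfactual_estimate(rewards=[1, 0, 1, 1],
                              p_logged=[0.5, 0.5, 0.25, 0.25],
                              p_counterfactual=[0.25, 0.25, 0.5, 0.5])
```

The estimator is unbiased as long as P puts probability mass everywhere P* does; bounding the weights is what makes the error analysis on the next slides possible.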
Example result [Bottou et al. ’13]
Counterfactual reasoning allows analysis over a continuous range of parameter values.
Summary: Counterfactual Reasoning
Key ideas
Leverage known causal structure and importance sampling to reason about “alternative realities”.
Bound estimator error to distinguish between uncertainty due to low sample size and exploration coverage.
Online Learning to Rank
Interleaved Comparison Methods [Joachims et al. ’05, Chapelle et al. ’12, Hofmann et al. ’13a]
Compare two rankings:
Generate an interleaved (combined) ranking
Observe user clicks
Credit clicks to the original rankers to infer the outcome
(Figure: two input rankings, e.g. (document 1, 2, 3, 4) and (document 2, 3, 4, 1), and the interleaved list shown to the user.)
Example: optimize QAC ranking
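As one concrete instance of the steps above, a sketch of team-draft interleaving (one common scheme; the slides do not fix a particular variant): the two rankers take turns contributing their highest-ranked not-yet-picked document, and clicks are credited to whichever ranker contributed the clicked document.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.random):
    """Team-draft interleaving: in each round a coin flip decides which
    ranker picks first; each pick is that ranker's best unpicked document."""
    interleaved, team = [], {}
    n = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < n:
        order = ("A", "B") if rng() < 0.5 else ("B", "A")
        for label in order:
            source = ranking_a if label == "A" else ranking_b
            doc = next((d for d in source if d not in team), None)
            if doc is not None:
                interleaved.append(doc)
                team[doc] = label
    return interleaved, team

def credit(clicked_docs, team):
    """Credit each click to the ranker whose pick was clicked."""
    wins = {"A": 0, "B": 0}
    for d in clicked_docs:
        wins[team[d]] += 1
    return wins

interleaved, team = team_draft_interleave(["d1", "d2", "d3", "d4"],
                                          ["d2", "d3", "d4", "d1"])
```

Comparing click credit over many impressions gives a relative, low-variance signal about which ranker is better.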
Learning from relative feedback [Yue & Joachims ’09]
Dueling bandit gradient descent (DBGD) optimizes a weight vector for weighted-linear combinations of ranking features.
Learning approach: keep the current best weight vector; sample the unit sphere to generate a randomly perturbed candidate ranker (illustrated over feature 1 / feature 2).
Relative listwise feedback is obtained using interleaving.
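A hedged sketch of one DBGD iteration (the interleaved comparison is abstracted into a `candidate_wins` callback, and the step sizes `delta` and `alpha` are illustrative): perturb the current weights along a random unit vector, compare the two rankers via interleaving, and step toward the candidate only if it wins.

```python
import math
import random

def random_unit_vector(dim):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dbgd_step(w, candidate_wins, delta=1.0, alpha=0.1):
    """One DBGD iteration: propose w' = w + delta * u; if the interleaved
    comparison favors w', move w a small step alpha * u toward it."""
    u = random_unit_vector(len(w))
    w_candidate = [wi + delta * ui for wi, ui in zip(w, u)]
    if candidate_wins(w, w_candidate):   # outcome of an interleaved comparison
        w = [wi + alpha * ui for wi, ui in zip(w, u)]
    return w

# If the candidate never wins, the weights stay put; if it always wins,
# the weights move by exactly alpha along the sampled direction.
w_static = dbgd_step([0.0, 0.0], candidate_wins=lambda cur, cand: False)
w_moved = dbgd_step([0.0, 0.0], candidate_wins=lambda cur, cand: True, alpha=0.1)
```

Because only a win/loss outcome is needed, the noisy per-click signal never enters the update directly.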
Improving sample efficiency
Approach: candidate pre-selection (CPS) [Hofmann et al. ’13c]
Idea 1: Generate many candidate rankers, and select the most promising one by running a tournament on historical data.
Idea 2: Use probabilistic interleave and importance sampling to estimate ranker comparison outcomes during the tournament (illustrated over feature 1 / feature 2).
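The pre-selection step can be sketched as follows (a heavily simplified sketch: the historical-data estimator is abstracted into a `historical_score` callback, omitting the probabilistic-interleave and importance-sampling machinery): generate several candidates around the current weights, score each on logged interactions, and only the winner goes on to a live interleaved comparison.

```python
def preselect_candidate(w, n_candidates, perturb, historical_score):
    """Candidate pre-selection: sample several candidate rankers and keep
    the one that scores best on historical interaction data."""
    candidates = [perturb(w) for _ in range(n_candidates)]
    return max(candidates, key=historical_score)

# Illustrative use: three hypothetical perturbations of a 1-d weight,
# scored by closeness to a made-up target weight of 2.0.
offsets = iter([1.0, 2.0, 3.0])
best_candidate = preselect_candidate([0.0], 3,
                                     perturb=lambda w: [w[0] + next(offsets)],
                                     historical_score=lambda c: -abs(c[0] - 2.0))
```

The live comparison budget is then spent only on the most promising candidate, which is where the sample-efficiency gain comes from.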
Analysis: Speed of Learning [Hofmann et al. ’13b, Hofmann et al. ’13c]
(Figure: learning curves under an informational click model.)
From earlier work: learning from relative listwise feedback is robust to noise.
Here: adding structure further dramatically improves performance.
Summary: Online Learning to Rank
Key ideas
Avoid the combinatorial action space by exploring in parameter space.
Reduce variance using relative feedback.
Leverage known structures for sample-efficient learning.
Summary
Optimizing interactive systems: slow with manually designed alternatives and AB testing. How can we automate?
Contextual bandits: a systematic approach to balancing exploration and exploitation; contextual bandits explore in a small action space but optimize in a large context space.
Counterfactual reasoning: leverages causal structure and importance sampling for “what if” analyses.
Online learning to rank: avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.
What’s next?
Research: measuring reward, low-risk and low-variance exploration schemes, new learning mechanisms.
Applications: assess action and solution spaces in a given application, collect and learn from exploration data, increase experimental agility.
Try this (at home): try the open-source code samples; the Living Labs challenge allows experimentation with online learning and evaluation methods.
Challenge: http://living-labs.net/challenge/
Code: https://bitbucket.org/ilps/lerot
References and further reading

A/B testing
[Kohavi et al. ’09] R. Kohavi, R. Longbotham, D. Sommerfield, R. M. Henne: Controlled experiments on the web: survey and practical guide (Data Mining and Knowledge Discovery 18, 2009).
[Kohavi et al. ’12] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, Y. Xu: Trustworthy online controlled experiments: five puzzling outcomes explained (KDD 2012).

Contextual bandits
[Li et al. ’11] L. Li, W. Chu, J. Langford, X. Wang: Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms (WSDM 2011).
[Li et al. ’12] L. Li, W. Chu, J. Langford, T. Moon, X. Wang: An Unbiased Offline Evaluation of Contextual Bandit Algorithms based on Generalized Linear Models (ICML 2011 Workshop on Online Trading of Exploration and Exploitation).

Counterfactual reasoning
[Bottou et al. ’13] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, E. Snelson: Counterfactual reasoning and learning systems: the example of computational advertising (Journal of Machine Learning Research 14(1), 2013).
[Precup et al. ’00] D. Precup, R. S. Sutton, S. Singh: Eligibility Traces for Off-Policy Policy Evaluation (ICML 2000).

Interleaving
[Chapelle et al. ’12] O. Chapelle, T. Joachims, F. Radlinski, Y. Yue: Large Scale Validation and Analysis of Interleaved Search Evaluation (ACM Transactions on Information Systems 30(1): 6, 2012).
[Hofmann et al. ’13a] K. Hofmann, S. Whiteson, M. de Rijke: Fidelity, Soundness, and Efficiency of Interleaved Comparison Methods (ACM Transactions on Information Systems 31(4): 17, 2013).
[Radlinski et al. ’08] F. Radlinski, M. Kurup, T. Joachims: How does clickthrough data reflect retrieval quality? (CIKM 2008).

Online learning to rank
[Yue & Joachims ’09] Y. Yue, T. Joachims: Interactively optimizing information retrieval systems as a dueling bandits problem (ICML 2009).
[Hofmann et al. ’13b] K. Hofmann, A. Schuth, S. Whiteson, M. de Rijke: Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (WSDM 2013).
[Hofmann et al. ’13c] K. Hofmann, S. Whiteson, M. de Rijke: Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval (Information Retrieval 16, 2013).
© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.