Slide1
Learning to Rank: from heuristics to theoretic approaches
Guest Lecture by Hongning Wang (wang296@illinois.edu)

Slide2
Congratulations! Job offer: design the ranking module for Bing.com
Guest Lecture for Learning to Rank

Slide3
How should I rank documents?
Answer: Rank by relevance!
Slide4
Relevance ?!
Slide5
The Notion of Relevance
- Similarity between Rep(q) and Rep(d) — different representations & similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
  - …
- Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  - Regression model (Fuhr, 89)
  - Generative model:
    - Document generation — classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation — LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Learning to Rank (Joachims, 02; Burges et al., 05)
- Probabilistic inference, P(d→q) or P(q→d) — different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
- Div. from randomness (Amati & Rijsbergen, 02)
- Relevance constraints [Fang et al., 04]

(Lecture notes from CS598CXZ)
Slide6
Relevance Estimation
- Query matching: language model, BM25, vector space cosine similarity
- Document importance: PageRank, HITS
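These estimators are easy to sketch. Below is a minimal Okapi BM25 scorer with the usual k1/b parameterization; it is an illustrative toy, not the deck's implementation, and the in-memory term lists are assumptions for the example.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25: sum over query terms of IDF times a saturated TF component."""
    score = 0.0
    dl = len(doc_terms)  # document length
    for t in query_terms:
        df = doc_freqs.get(t, 0)  # number of documents containing t
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc_terms.count(t)
        # length-normalized, saturating term-frequency component
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```

A query term absent from the document contributes nothing, so a document sharing no terms with the query scores zero.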
Slide7
Did I do a good job of ranking documents?
IR evaluation metrics
- Precision
- MAP
- NDCG

(Candidate rankers being compared: PageRank, BM25)
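The two metrics used throughout the deck can be sketched in a few lines; this is a standard formulation (binary relevance for AP, exponential gain for NDCG), not code from the lecture.

```python
import math

def average_precision(rels):
    """rels: binary relevance labels in ranked order."""
    hits, total, ap = 0, sum(rels), 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            ap += hits / i  # precision at each relevant position
    return ap / total if total else 0.0

def ndcg_at_k(rels, k):
    """rels: graded relevance labels in ranked order."""
    def dcg(rs):
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    ideal = dcg(sorted(rels, reverse=True))  # best possible ordering
    return dcg(rels) / ideal if ideal else 0.0
```

A perfect ordering gives NDCG of 1.0; any inversion of a relevant document below a less relevant one lowers it.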
Slide8
Take advantage of the different relevance estimators? Ensemble the cues!
- Linear? Non-linear?
- Decision-tree-like:

  BM25 > 0.5?
  ├── True:  LM > 0.1?       → True: r = 1.0 | False: r = 0.7
  └── False: PageRank > 0.3? → True: r = 0.4 | False: r = 0.1
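Such a hand-built tree is just nested thresholds. The sketch below encodes one plausible reading of the slide's figure (the exact branch structure is an assumption); the feature dictionary keys are hypothetical names for the three cues.

```python
def tree_score(features):
    """Hand-built decision-tree combination of relevance cues.

    `features` maps cue name -> value; thresholds follow the slide's example,
    and the branch layout is one plausible reading of the figure.
    """
    if features["BM25"] > 0.5:
        # strong lexical match: language model decides the final grade
        return 1.0 if features["LM"] > 0.1 else 0.7
    else:
        # weak lexical match: fall back on document importance
        return 0.4 if features["PageRank"] > 0.3 else 0.1
```

With thousands of features, writing such trees by hand is hopeless, which motivates learning them automatically.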
Slide9
What if we have thousands of features? Is there any way I can do better?
- Optimize the metrics automatically!
- How do we determine those thresholds?
- Where do we find those tree structures?
Slide10
Rethink the task
- Given: (query, document) pairs represented by a set of relevance estimators, a.k.a. features
- Needed: a way of combining the estimators into an ordered list (Key!)
- Criterion: optimize IR metrics (P@k, MAP, NDCG, etc.)

DocID  BM25  LM   PageRank  Label
0001   1.6   1.1  0.9       0
0002   2.7   1.9  0.2       1
Slide11
Machine Learning
- Input: a training set {(x_i, y_i)}, where x_i is a feature vector and y_i its target
- Objective function: O(f; X, Y)
- Output: f*, such that f* = argmax_f O(f; X, Y)
- Examples: Classification (http://en.wikipedia.org/wiki/Statistical_classification), Regression (http://en.wikipedia.org/wiki/Regression_analysis)

NOTE: We will only talk about supervised learning.
Slide12
Learning to Rank
General solution in an optimization framework:
- Input: features for (query, document) pairs, as in the table below
- Objective: O = {P@k, MAP, NDCG}
- Output: a scoring function f, s.t. the ranking induced by f optimizes O

DocID  BM25  LM   PageRank  Label
0001   1.6   1.1  0.9       0
0002   2.7   1.9  0.2       1
Slide13
Challenge: how to optimize?
Evaluation metric recap:
- Average Precision: AP = (1/|relevant|) Σ_k P@k · rel(k)
- DCG: DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1)
Order is essential: f(X) → order → metric, and the metric is not continuous with respect to f(X)!
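The discontinuity is easy to demonstrate: an arbitrarily small change in the scores can flip the order and make the metric jump. A self-contained sketch (AP re-implemented inline so the snippet stands alone):

```python
def average_precision(ranked_rels):
    """AP over binary relevance labels in ranked order."""
    hits, ap = 0, 0.0
    total = sum(ranked_rels)
    for i, r in enumerate(ranked_rels, 1):
        if r:
            hits += 1
            ap += hits / i
    return ap / total if total else 0.0

def ap_of_scores(scores, labels):
    """Rank by descending score, then evaluate AP on the induced order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return average_precision([labels[i] for i in order])

labels = [1, 0]                              # one relevant, one irrelevant doc
ap_hi = ap_of_scores([0.51, 0.50], labels)   # relevant doc ranked first
ap_lo = ap_of_scores([0.49, 0.50], labels)   # a 0.02 score change flips the order
```

A 0.02 perturbation in one score moves AP from 1.0 to 0.5; between the two orderings the metric is flat, so its gradient with respect to the scores is zero almost everywhere.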
Slide14
Approximating the Objective!
- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order
Slide15
Pointwise Learning to Rank
Ideally, perfect relevance prediction leads to perfect ranking: score → order → metric
Reducing the ranking problem to:
- Regression: "Subset Ranking using Regression," D. Cossock and T. Zhang, COLT 2006
- (Multi-)classification: "Ranking with Large Margin Principles," A. Shashua and A. Levin, NIPS 2002
Slide16
Subset Ranking using Regression
D. Cossock and T. Zhang, COLT 2006
- Fit relevance labels via regression (http://en.wikipedia.org/wiki/Regression_analysis)
- Emphasize relevant documents more: a weight on each document, largest on the most relevant ones
Slide17
Ranking with Large Margin Principles
A. Shashua and A. Levin, NIPS 2002
Goal: correctly place the documents in the corresponding category (Y = 0, 1, 2) and maximize the margin; reduce the violations.
Slide18
Ranking with Large Margin Principles
A. Shashua and A. Levin, NIPS 2002
Alternative formulation: maximize the sum of margins between the categories (Y = 0, 1, 2).
Slide19
Ranking with Large Margin Principles
A. Shashua and A. Levin, NIPS 2002
Ranking loss consistently decreases with more training data.
Slide20
What did we learn?
- Machine learning helps! Derive something optimizable
- More efficient and guided
Slide21
There is always a catch
- Cannot directly optimize IR metrics: predicting (0→1, 2→0) is worse for ranking than (0→−2, 2→4), even though its regression loss is smaller
- Positions of documents are ignored: the penalty on documents at higher positions should be larger
- Favors the queries with more documents
Slide22
Pairwise Learning to Rank
Ideally, a perfect partial order leads to perfect ranking: partial order → order → metric
- Ordinal regression
- The relative ordering between different documents is what is significant: e.g., (0→−2, 2→4) is better than (0→1, 2→0)
- Large body of work
Slide23
Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD'02
- RankingSVM: minimize the number of mis-ordered pairs
- The scoring function is a linear combination of features
- Constraints keep the relative orders
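The soft-margin form of those ordering constraints amounts to a hinge loss on the score difference of each preference pair. A dependency-free sketch of that loss (a toy illustration of the idea, not Joachims' SVMlight implementation, and without the regularizer or the actual optimizer):

```python
def pairwise_hinge_loss(w, pairs):
    """RankingSVM-style loss: hinge on the score margin of each preference pair.

    w: weight vector; pairs: list of (x_pref, x_other) feature vectors,
    where x_pref should be ranked above x_other.
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    # each violated (or insufficiently separated) pair incurs a linear penalty
    return sum(max(0.0, 1.0 - (dot(w, xp) - dot(w, xn))) for xp, xn in pairs)
```

A pair separated by a margin of at least 1 costs nothing; a reversed pair costs in proportion to how badly it is reversed.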
Slide24
Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD'02
How to use it? The learned weights turn features into scores, and the scores induce the order.
Slide25
Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD'02
What did it learn from the data? Linear correlations:
- Positively correlated features
- Negatively correlated features
Slide26
How good is it? Tested on a real system.
(Optimizing Search Engines using Clickthrough Data, Thorsten Joachims, KDD'02)
Slide27
An Efficient Boosting Algorithm for Combining Preferences
Y. Freund, R. Iyer, et al., JMLR 2003
Smooth the 0/1 loss on mis-ordered pairs with a surrogate:
- exponential loss
- hinge loss (RankingSVM)
- square loss
(figure from Pattern Recognition and Machine Learning, p. 337)
Slide28
RankBoost: optimize via boosting
An Efficient Boosting Algorithm for Combining Preferences
Y. Freund, R. Iyer, et al., JMLR 2003
- Vote by a committee of ranking features (e.g., BM25, PageRank, cosine)
- Iteratively update the credibility of each committee member
(figure from Pattern Recognition and Machine Learning, p. 658)
Slide29
How good is it?
(An Efficient Boosting Algorithm for Combining Preferences, Y. Freund, R. Iyer, et al., JMLR 2003)
Slide30
A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments
Zheng et al., SIGIR'07
- Non-linear ensemble of features
- Objective optimized by gradient boosting trees: use regression trees to minimize the residuals (each weak learner is a decision tree over features such as BM25, LM, and PageRank, like the one shown earlier)
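The residual-fitting loop at the heart of gradient boosting fits in a short sketch. This toy version uses one-dimensional inputs and depth-1 trees (stumps) with squared loss; it illustrates the mechanism only, not the paper's pairwise objective or feature set.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for s in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, lmean, rmean)
    _, s, lmean, rmean = best
    return lambda x: lmean if x <= s else rmean

def boost(xs, ys, rounds=10, lr=0.5):
    """Gradient boosting for squared loss: each round fits the current residuals."""
    trees = []
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]  # negative gradient of squared loss
        t = fit_stump(xs, residuals)
        trees.append(t)
        pred = [p + lr * t(x) for x, p in zip(xs, pred)]
    return lambda x: sum(lr * t(x) for t in trees)
```

Each round shrinks the remaining error geometrically, so the ensemble's predictions converge toward the targets.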
Slide31
A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments
Zheng et al., SIGIR'07
Non-linear vs. linear: comparing with RankingSVM
Slide32
Where do we get the relative orders?
- Human annotations: small scale, expensive to acquire
- Clickthroughs: large amount, easy to acquire
Slide33
Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims, et al., SIGIR'05
Position bias: your click may reflect a document's position, not its relevance!
Slide34
Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims, et al., SIGIR'05
Controlled experiment: users over-trust the top-ranked positions.
Slide35
Pairwise preference matters
- Click: examined and clicked document
- Skip: examined but non-clicked document
- Inferred preference: Click > Skip
(Accurately Interpreting Clickthrough Data as Implicit Feedback, Thorsten Joachims, et al., SIGIR'05)
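The "click > skip above" extraction rule can be sketched directly; this is a minimal reading of Joachims' strategy (document IDs and the cascade-style examination assumption are illustrative).

```python
def click_skip_pairs(ranked_docs, clicked):
    """Joachims-style preference extraction: a clicked document is preferred
    over every skipped (examined-but-not-clicked) document ranked above it."""
    prefs = []
    for i, d in enumerate(ranked_docs):
        if d in clicked:
            for skipped in ranked_docs[:i]:  # docs above were presumably examined
                if skipped not in clicked:
                    prefs.append((d, skipped))  # d preferred over skipped
    return prefs
```

The resulting pairs feed directly into any pairwise learner, e.g. RankingSVM.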
Slide36
What did we learn?
- Predicting relative order gets closer to the nature of ranking
- Promising performance in practice
- Pairwise preferences can be obtained from click-throughs
Slide37
Listwise Learning to Rank
Can we directly optimize the ranking? order → metric
Tackle the challenge: optimization without a gradient.
Slide38
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges, 2010
Does minimizing mis-ordered pairs maximize IR metrics? Not necessarily: in the slide's example, a ranking with 6 mis-ordered pairs has higher AP and DCG than one with only 4. Position is crucial!
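A concrete instance of this mismatch is easy to construct (the two example rankings below are my own illustration, not the ones on the slide): fewer mis-ordered pairs does not imply higher DCG.

```python
import math

def misordered_pairs(rels):
    """Count pairs where a less relevant doc is ranked above a more relevant one."""
    return sum(1 for i in range(len(rels)) for j in range(i + 1, len(rels))
               if rels[i] < rels[j])

def dcg(rels):
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

a = [1, 0, 0, 0, 0, 1, 0]  # a relevant doc at the very top, another buried: 4 bad pairs
b = [0, 1, 1, 0, 0, 0, 0]  # only 2 bad pairs, but nothing relevant at rank 1
```

Ranking `b` has half the mis-ordered pairs of `a`, yet its DCG is lower, because `a` puts a relevant document in the position that the log discount rewards most.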
Slide39
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges, 2010
Weight the mis-ordered pairs? Some pairs are more important to place in the right order.
- Inject the weight into the objective function, or directly into the gradient
- The gradient with respect to the approximated objective (i.e., the loss on mis-ordered pairs) is scaled by the change in the original objective (e.g., NDCG) if we switch documents i and j, leaving the other documents unchanged
- This weight depends on the rankings of documents i and j in the whole list
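A LambdaRank-style pair gradient can be sketched as the RankNet pair gradient scaled by |ΔNDCG| of swapping the two documents. This is a simplified rendering of the idea from Burges' overview, not his exact formulation; `sigma` is the usual sigmoid scale parameter.

```python
import math

def delta_ndcg(rels, scores, i, j):
    """|Change in NDCG| if documents i and j swap their current ranks."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {doc: pos for pos, doc in enumerate(order)}
    def gain(r, pos):
        return (2**r - 1) / math.log2(pos + 2)
    ideal = sum(gain(r, p) for p, r in enumerate(sorted(rels, reverse=True)))
    before = gain(rels[i], rank[i]) + gain(rels[j], rank[j])
    after = gain(rels[i], rank[j]) + gain(rels[j], rank[i])
    return abs(after - before) / ideal

def lambda_ij(rels, scores, i, j, sigma=1.0):
    """Lambda gradient for a pair where doc i is more relevant than doc j:
    RankNet's sigmoid pair gradient times the pairwise change in NDCG."""
    rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
    return -sigma * rho * delta_ndcg(rels, scores, i, j)
```

The lambda pushes the more relevant document up (negative gradient on its score loss), and pushes hardest on swaps that would move a highly relevant document into a top position.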
Slide40
Lambda functions
- A gradient? Yes, it meets the necessary and sufficient condition of being a partial derivative
- Does it lead to the optimal solution of the original problem? Empirically, yes
(From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, 2010)
Slide41
Evolution
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges, 2010

- RankNet — Objective: cross entropy over the pairs; Gradient: gradient of cross entropy; Optimization: neural network, stochastic gradient descent
- LambdaRank — Objective: unknown; Gradient (λ function): gradient of cross entropy times pairwise change in target metric; Optimization: neural network, stochastic gradient descent
- LambdaMART — Objective: unknown; Gradient (λ function): gradient of cross entropy times pairwise change in target metric; Optimization: Multiple Additive Regression Trees (MART)

Notes: the boosting view is as we discussed in RankBoost; the lambda methods optimize solely by the gradient; MART gives a non-linear combination.
Slide42
A Lambda tree
(From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, 2010)
Each node splits on a single feature; the tree ensemble yields a non-linear combination of features.
Slide43
AdaRank: a boosting algorithm for information retrieval
Jun Xu & Hang Li, SIGIR'07
- Loss defined by IR metrics; target metrics: MAP, NDCG, MRR
- Optimized by boosting: iteratively update the credibility of each committee member (ranking feature, e.g., BM25, PageRank, cosine)
(figure from Pattern Recognition and Machine Learning, p. 658)
Slide44
A Support Vector Machine for Optimizing Average Precision
Yisong Yue, et al., SIGIR'07
- RankingSVM: minimizes the pairwise loss, defined on the number of mis-ordered document pairs
- SVM-MAP: minimizes the structural loss, defined on the quality of the whole list of ordered documents
MAP difference

Slide45
Max margin principle
- Push the ground-truth ranking far away from any mistakes you might make
- Requires finding the most violated constraints
(A Support Vector Machine for Optimizing Average Precision, Yisong Yue, et al., SIGIR'07)
Slide46
Finding the most violated constraints
A Support Vector Machine for Optimizing Average Precision
Yisong Yue, et al., SIGIR'07
- MAP is invariant to permutations within the (ir)relevant documents
- Maximize the right-hand side of the constraints over a series of swaps between relevant and irrelevant documents: a greedy solution
Start from the reverse order of the ideal ranking.

Slide47
Experiment results
(A Support Vector Machine for Optimizing Average Precision, Yisong Yue, et al., SIGIR'07)
Slide48
Other listwise solutions
- Soften the metrics to make them differentiable: Michael Taylor et al., "SoftRank: optimizing non-smooth rank metrics," WSDM'08
- Minimize a loss function defined on permutations: Zhe Cao et al., "Learning to rank: from pairwise approach to listwise approach," ICML'07
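The permutation-based idea can be sketched via ListNet's top-one probability: compare the softmax distributions induced by predicted and ground-truth scores with cross entropy. A minimal illustrative version (not the paper's full neural-network training loop):

```python
import math

def top_one_prob(scores):
    """ListNet's top-one probability: softmax over the list's scores."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(pred_scores, true_scores):
    """Cross entropy between the top-one distributions of truth and prediction."""
    p_true = top_one_prob(true_scores)
    p_pred = top_one_prob(pred_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))
```

Unlike the raw metric, this loss is smooth in the predicted scores, so it can be minimized by ordinary gradient descent; it is smallest when the predicted distribution matches the ground-truth one.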
Slide49
What did we learn?
- Taking a list of documents as a whole: positions are visible to the learning algorithm
- Directly optimizing the target metric
- Limitation: the search space is huge!
Slide50
Summary
Learning to rank: automatic combination of ranking features for optimizing IR evaluation metrics.
Approaches:
- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order
Slide51
Experimental Comparisons
Ranking performance
Slide52
Experimental Comparisons
Winning count over seven different data sets
Slide53
Experimental Comparisons
My experiments: 1.2K queries, 45.5K documents with 1890 features; 800 queries for training, 400 queries for testing.

Method       MAP     P@1     ERR     MRR     NDCG@5
ListNET      0.2863  0.2074  0.1661  0.3714  0.2949
LambdaMART   0.4644  0.4630  0.2654  0.6105  0.5236
RankNET      0.3005  0.2222  0.1873  0.3816  0.3386
RankBoost    0.4548  0.4370  0.2463  0.5829  0.4866
RankingSVM   0.3507  0.2370  0.1895  0.4154  0.3585
AdaRank      0.4321  0.4111  0.2307  0.5482  0.4421
pLogistic    0.4519  0.3926  0.2489  0.5535  0.4945
Logistic     0.4348  0.3778  0.2410  0.5526  0.4762

Slide54
Analysis of the Approaches
What are they really optimizing? There is a gap between the surrogate objectives and the IR metrics.
Slide55
Pointwise Approaches
- Regression based: the regression loss bounds the metric gap through the discount coefficients in DCG
- Classification based: the classification loss bounds the metric gap through the discount coefficients in DCG
Slide56
Pairwise Approaches
- Analogous bounds relate the pairwise losses to the metric gap through the discount coefficients in DCG

Slide58
Listwise Approaches
- No general analysis; it is method dependent
- Key concerns: directness and consistency

Slide59
Connection with Traditional IR
- People foresaw this topic a long time ago
- It fits nicely in the risk minimization framework

Slide60
Applying Bayesian Decision Theory
- Observed: query q, document collection C, source S; hidden: user U
- Choices: (D_1, π_1), (D_2, π_2), …, (D_n, π_n) — a selected document set D with an order π
- Loss L(D, π; q, U): the metric to be optimized, computed over the available ranking features
- Pick the choice (D, π) minimizing the Bayes risk, i.e., the expected loss — RISK MINIMIZATION
Slide61
Traditional Solution
- Set-based models (choose D) vs. ranking models (choose π)
- Independent loss:
  - Relevance-based loss: Boolean model, probabilistic relevance model, Generative Relevance Theory; vector-space model, two-stage LM, KL-divergence model (cf. pointwise)
  - Distance-based loss (cf. pairwise/listwise)
- Dependent loss: MMR loss, MDR loss — subtopic retrieval model
- All unsupervised!
Slide62
Traditional Notion of Relevance
- Similarity between Rep(q) and Rep(d) — different representations & similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
  - …
- Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  - Regression model (Fuhr, 89)
  - Generative model:
    - Document generation — classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation — LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Learning to Rank (Joachims, 02; Burges et al., 05)
- Probabilistic inference, P(d→q) or P(q→d) — different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
- Div. from randomness (Amati & Rijsbergen, 02)
- Relevance constraints [Fang et al., 04]
Slide63
Broader Notion of Relevance
- Traditional view: content-driven — vector space model, probabilistic relevance model, language model; query-document specific, unsupervised
- Modern view: anything related to the quality of the document — clicks/views, link structure, visual structure, social network, …; query, document, and query-document specific, supervised
Slide64
Broader Notion of Relevance
(figure: the features between query and documents grow from content matching — BM25, language model, cosine — to linkage structure and query relations, and further to clicks/views, visual structure, and social networks)
Slide65
Future
- Tighter bounds
- Faster solutions
- Larger scale
- Wider application scenarios
Slide66
Resources
Books
- Liu, Tie-Yan. Learning to Rank for Information Retrieval. Vol. 13. Springer, 2011.
- Li, Hang. "Learning to rank for information retrieval and natural language processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.
Helpful pages
- http://en.wikipedia.org/wiki/Learning_to_rank
Packages
- RankingSVM: http://svmlight.joachims.org/
- RankLib: http://people.cs.umass.edu/~vdang/ranklib.html
Data sets
- LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor//
- Yahoo! Learning to Rank Challenge: http://learningtorankchallenge.yahoo.com/
Slide67
References
- Liu, Tie-Yan. "Learning to rank for information retrieval." Foundations and Trends in Information Retrieval 3.3 (2009): 225-331.
- Cossock, David, and Tong Zhang. "Subset ranking using regression." Learning Theory (2006): 605-619.
- Shashua, Amnon, and Anat Levin. "Ranking with large margin principle: Two approaches." Advances in Neural Information Processing Systems 15 (2003): 937-944.
- Joachims, Thorsten. "Optimizing search engines using clickthrough data." Proceedings of the eighth ACM SIGKDD. ACM, 2002.
- Freund, Yoav, et al. "An efficient boosting algorithm for combining preferences." The Journal of Machine Learning Research 4 (2003): 933-969.
- Zheng, Zhaohui, et al. "A regression framework for learning ranking functions using relative relevance judgments." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
Slide68
References
- Joachims, Thorsten, et al. "Accurately interpreting clickthrough data as implicit feedback." Proceedings of the 28th annual international ACM SIGIR. ACM, 2005.
- Burges, C. "From RankNet to LambdaRank to LambdaMART: An overview." Learning 11 (2010): 23-581.
- Xu, Jun, and Hang Li. "AdaRank: a boosting algorithm for information retrieval." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
- Yue, Yisong, et al. "A support vector method for optimizing average precision." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
- Taylor, Michael, et al. "SoftRank: optimizing non-smooth rank metrics." Proceedings of the international conference WSDM. ACM, 2008.
- Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th ICML. ACM, 2007.
Slide69
Thank you!
Q&A
Slide70
Recap of last lecture
Goal: design the ranking module for Bing.com
Slide71
Basic Search Engine Architecture
(The anatomy of a large-scale hypertextual Web search engine. Sergey Brin and Lawrence Page. Computer Networks and ISDN Systems 30.1 (1998): 107-117.)
Your Job

Slide72
Learning to Rank
- Given: (query, document) pairs represented by a set of relevance estimators, a.k.a. features
- Needed: a way of combining the estimators into an ordered list (Key!)
- Criterion: optimize IR metrics (P@k, MAP, NDCG, etc.)

QueryID  DocID  BM25  LM   PageRank  Label
0001     0001   1.6   1.1  0.9       0
0001     0002   2.7   1.9  0.2       1

Slide73
Challenge: how to optimize?
Order is essential: score → order → metric
Evaluation metrics are not continuous and not differentiable.
Slide74
Approximating the Objective!
- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order
Slide75
Pointwise Learning to Rank
Ideally, perfect relevance prediction leads to perfect ranking: score → order → metric
Reduce the ranking problem to:
- Regression (http://en.wikipedia.org/wiki/Regression_analysis)
- Classification (http://en.wikipedia.org/wiki/Statistical_classification)

Slide76
Deficiency
- Cannot directly optimize IR metrics: predicting (0→1, 2→0) is worse for ranking than (0→−2, 2→4)
- Positions of documents are ignored: the penalty on documents at higher positions should be larger
- Favors the queries with more documents

Slide77
Pairwise Learning to Rank
Ideally, a perfect partial order leads to perfect ranking: partial order → order → metric
- Ordinal regression
- The relative ordering between different documents is what is significant

Slide78
RankingSVM
Thorsten Joachims, KDD'02
- Minimizing the number of mis-ordered pairs
- The score is a linear combination of features; constraints keep the relative orders

QueryID  DocID  BM25  PageRank  Label
0001     0001   1.6   0.9       4
0001     0002   2.7   0.2       2
0001     0003   1.3   0.2       1
0001     0004   1.2   0.7       0

Slide79
RankingSVM
Thorsten Joachims, KDD'02
Minimizing the number of mis-ordered pairs
Slide80
General Idea of Pairwise Learning to Rank
For any pair of documents, penalize a mis-ordering with a surrogate of the 0/1 loss:
- exponential loss
- hinge loss (RankingSVM)
- square loss (GBRank)