Learning to Rank - PowerPoint Presentation

Presentation Transcript

Slide 1

Learning to Rank: from heuristics to theoretic approaches

Guest Lecture by Hongning Wang (wang296@illinois.edu)

Slide 2

Congratulations! Job offer: design the ranking module for Bing.com

Slide 3

How should I rank documents?

Answer: Rank by relevance!

Slide 4

Relevance ?!

Slide 5

The Notion of Relevance

[Figure: taxonomy of relevance models, from the CS598CXZ lecture notes]

- Relevance as Similarity(Rep(q), Rep(d)), with different representations and similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
- Relevance as probability of relevance P(r=1|q,d), r in {0,1}:
  - Regression model (Fuhr, 89)
  - Generative models: document generation (classical prob. model; Robertson & Sparck Jones, 76) or query generation (LM approach; Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Learning to Rank (Joachims, 02; Burges et al., 05)
- Relevance as probabilistic inference P(d->q) or P(q->d), with different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
- Also: Divergence from Randomness (Amati & van Rijsbergen, 02) and relevance constraints [Fang et al., 04]

Slide 6

Relevance Estimation

- Query matching: language model, BM25, vector space cosine similarity
- Document importance: PageRank, HITS

Slide 7

Did I do a good job of ranking documents?

IR evaluation metrics: Precision, MAP, NDCG
(computed over the rankings produced by estimators such as PageRank and BM25)

Slide 8

Take advantage of the different relevance estimators? Ensemble the cues!

Linear? Non-linear? E.g., a decision-tree-like ensemble (a code sketch of this tree follows below):

- BM25 > 0.5?
  - True: LM > 0.1?
    - True: r = 1.0
    - False: r = 0.7
  - False: PageRank > 0.3?
    - True: r = 0.4
    - False: r = 0.1
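To make the ensemble concrete, here is a minimal sketch of the tree above as code; the thresholds and r scores are the illustrative values from the slide, not learned parameters.

```python
def tree_score(bm25: float, lm: float, pagerank: float) -> float:
    """Score a (query, document) pair with the slide's toy decision tree."""
    if bm25 > 0.5:
        return 1.0 if lm > 0.1 else 0.7    # strong query match: check LM
    return 0.4 if pagerank > 0.3 else 0.1  # weak match: fall back to PageRank
```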

 

 

 

Slide 9

What if we have thousands of features? Is there any way to do better?

Optimize the metrics automatically!

How do we determine those weights and thresholds? Where do we find those tree structures?

Slide 10

Rethink the task

- Given: (query, document) pairs represented by a set of relevance estimators, a.k.a. features
- Needed: a way of combining the estimators into an ordered list of documents (this is the key!)
- Criterion: optimize IR metrics: P@k, MAP, NDCG, etc.

DocID | BM25 | LM  | PageRank | Label
0001  | 1.6  | 1.1 | 0.9      | 0
0002  | 2.7  | 1.9 | 0.2      | 1

A sketch of this data layout in code follows below.
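As a minimal sketch, the table above maps to one feature vector and one graded label per (query, document) pair, in the LETOR-style layout; the numbers are the two rows of the table.

```python
import numpy as np

X = np.array([[1.6, 1.1, 0.9],   # doc 0001: BM25, LM, PageRank
              [2.7, 1.9, 0.2]])  # doc 0002
y = np.array([0, 1])             # relevance labels
# goal: learn f so that sorting documents by f(X) optimizes the IR metric
```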

Slide 11

Machine Learning

- Input: {(x_i, y_i)}, where x_i is a feature vector and y_i its label
- Objective function: O(f; {(x_i, y_i)})
- Output: a model f from the hypothesis space M, such that f maximizes O

Examples: classification (http://en.wikipedia.org/wiki/Statistical_classification) and regression (http://en.wikipedia.org/wiki/Regression_analysis).

NOTE: We will only talk about supervised learning.

Slide 12

Learning to Rank

General solution in an optimization framework:

- Input: {(x_i, y_i)}, where x_i is the feature vector of a (query, document) pair and y_i its relevance label
- Objective: O = {P@k, MAP, NDCG}
- Output: f(q, d), such that ranking documents by f maximizes O

DocID | BM25 | LM  | PageRank | Label
0001  | 1.6  | 1.1 | 0.9      | 0
0002  | 2.7  | 1.9 | 0.2      | 1

Slide 13

Challenge: how to optimize?

Evaluation metric recap:

- Average Precision: AP = (sum over ranks k of P@k * rel(k)) / (# relevant documents)
- DCG: DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)

Order is essential: scores induce an order, and the metric is computed on the order only. The metrics are therefore not continuous with respect to f(X)!
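A small sketch of why the metric is a step function of the scores: AP and DCG see only the induced order, so an infinitesimal score change that swaps two documents makes the metric jump. The two-document example below is illustrative.

```python
import math

def average_precision(ranked_rels):
    """AP over a ranked list of binary relevance labels."""
    hits, score = 0, 0.0
    for k, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            score += hits / k
    return score / max(hits, 1)

def dcg(ranked_rels):
    """DCG with gain (2^rel - 1) and discount 1/log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(ranked_rels, start=1))

labels = [0, 1]                               # document B (index 1) is relevant
for scores in ([0.31, 0.30], [0.30, 0.31]):   # a tiny nudge flips the pair
    order = sorted(range(2), key=lambda i: -scores[i])
    rels = [labels[i] for i in order]
    print(average_precision(rels), dcg(rels)) # jumps from (0.5, 0.63) to (1.0, 1.0)
```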

Slide 14

Approximating the objectives!

- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order

Slide 15

Pointwise Learning to Rank

Ideally, perfect relevance prediction leads to perfect ranking: score -> order -> metric.

Reduces the ranking problem to:

- Regression: Subset Ranking using Regression, D. Cossock and T. Zhang, COLT 2006
- (Multi-)Classification: Ranking with Large Margin Principles, A. Shashua and A. Levin, NIPS 2002

Slide 16

Subset Ranking using Regression (D. Cossock and T. Zhang, COLT 2006)

Fit relevance labels via regression (http://en.wikipedia.org/wiki/Regression_analysis), with weights on each document that put more emphasis on the most relevant documents (a small sketch follows below).
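A minimal sketch of the idea as weighted least squares; the particular weighting w = 1 + 4*label is an assumption for illustration, not the paper's scheme.

```python
import numpy as np

X = np.array([[1.6, 1.1, 0.9],           # BM25, LM, PageRank
              [2.7, 1.9, 0.2]])
y = np.array([0.0, 1.0])                 # relevance labels
w = 1.0 + 4.0 * y                        # emphasize the relevant document
W = np.diag(w)
# closed-form weighted least squares: theta = (X'WX)^+ X'Wy
theta = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y)
print(X @ theta)                         # predicted relevance; sort to rank
```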

Slide 17

Ranking with Large Margin Principles (A. Shashua and A. Levin, NIPS 2002)

Goal: correctly place the documents into the corresponding ordinal categories (Y=0, Y=1, Y=2), maximize the margin between categories, and reduce the violations.

Slide 18

Ranking with Large Margin Principles (A. Shashua and A. Levin, NIPS 2002)

An alternative formulation: maximize the sum of margins between adjacent categories (Y=0, Y=1, Y=2).

Slide 19

Ranking with Large Margin Principles (A. Shashua and A. Levin, NIPS 2002)

Ranking loss decreases consistently with more training data.

Slide 20

What did we learn? Machine learning helps!

- Derive something optimizable
- More efficient and guided

Slide 21

There is always a catch

- Pointwise methods cannot directly optimize IR metrics: the predictions (0->1, 2->0) get a lower regression loss than (0->-2, 2->4), yet the former flips the correct order while the latter preserves it
- Positions of documents are ignored: the penalty on documents at higher positions should be larger
- Favors the queries with more documents

Slide 22

Pairwise Learning to Rank

Ideally, a perfect partial order leads to perfect ranking: partial order -> order -> metric.

Ordinal regression: the relative ordering between different documents is what matters. E.g., (0->-2, 2->4) is better than (0->1, 2->0). Large body of work!

Slide 23

Optimizing Search Engines using Clickthrough Data (Thorsten Joachims, KDD'02)

RankingSVM: minimize the number of mis-ordered pairs with a linear combination of features f(x) = w'x, keeping the relative orders:

min_w (1/2) ||w||^2 + C * sum_{(i,j)} xi_{ij}
s.t. w'x_i >= w'x_j + 1 - xi_{ij}, xi_{ij} >= 0, for every pair where document i should rank above document j for the same query

(A subgradient-descent sketch of this objective follows below.)
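A minimal sketch of the objective above (not Joachims' SVMlight solver): subgradient descent on the L2-regularized hinge loss over preference pairs, where pairs holds (i, j) with document i preferred over document j.

```python
import numpy as np

def ranking_svm_sgd(X, pairs, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum hinge(w'(x_i - x_j))."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = w.copy()                        # gradient of the regularizer
        for i, j in pairs:                     # i should rank above j
            if w @ (X[i] - X[j]) < 1:          # margin violated
                grad -= C * (X[i] - X[j])
        w -= lr * grad
    return w

X = np.array([[1.6, 0.9], [2.7, 0.2], [1.3, 0.2]])  # BM25, PageRank
w = ranking_svm_sgd(X, pairs=[(1, 0), (0, 2), (1, 2)])
print(np.argsort(-(X @ w)))              # rank documents by score w'x
```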

Slide 24

Optimizing Search Engines using Clickthrough Data (Thorsten Joachims, KDD'02)

How to use it? Score each document with f(x) = w'x, then sort by score to obtain the order.

Slide 25

Optimizing Search Engines using Clickthrough Data (Thorsten Joachims, KDD'02)

What did it learn from the data? Linear correlations: positively correlated features and negatively correlated features.

Slide 26

How good is it? Tested on a real system.

Optimizing Search Engines using Clickthrough Data (Thorsten Joachims, KDD'02)

Slide 27

An Efficient Boosting Algorithm for Combining Preferences (Y. Freund, R. Iyer, et al., JMLR 2003)

Smooth the 0/1 loss on mis-ordered pairs with a surrogate (evaluated in the sketch below):

- exponential loss (RankBoost)
- hinge loss (RankingSVM)
- square loss

(Loss curves: from Pattern Recognition and Machine Learning, p. 337.)
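The surrogate losses written as functions of the pairwise margin z = f(x_i) - f(x_j), for a pair where i should rank above j; a sketch, with the square loss labeled GBRank-style per the recap on Slide 80.

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)    # the target loss
def hinge(z):       return np.maximum(0.0, 1.0 - z)  # RankingSVM
def exponential(z): return np.exp(-z)                # RankBoost
def square(z):      return (1.0 - z) ** 2            # GBRank-style

z = np.linspace(-2.0, 2.0, 5)            # pairwise margins f(x_i) - f(x_j)
for loss in (zero_one, hinge, exponential, square):
    print(loss.__name__, loss(z))
```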

Slide 28

RankBoost: optimize via boosting (Y. Freund, R. Iyer, et al., JMLR 2003)

Vote by a committee: each committee member is a ranking feature (BM25, PageRank, Cosine, ...), weighted by its credibility. In each round, update the distribution over document pairs (sketch below). (Committee figure: from Pattern Recognition and Machine Learning, p. 658.)
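A minimal sketch of the pair-weight update, assuming the usual RankBoost form D_{t+1}(i, j) proportional to D_t(i, j) * exp(-alpha_t * (h_t(x_i) - h_t(x_j))): pairs the weak ranker orders correctly are down-weighted, violated pairs are up-weighted.

```python
import math

def rankboost_update(D, h_scores, alpha):
    """Re-weight crucial pairs (i should rank above j) after weak ranker h."""
    new_D = {(i, j): w * math.exp(-alpha * (h_scores[i] - h_scores[j]))
             for (i, j), w in D.items()}
    Z = sum(new_D.values())              # normalization constant
    return {pair: w / Z for pair, w in new_D.items()}

D = {(0, 1): 0.5, (2, 1): 0.5}           # doc 0 and doc 2 should beat doc 1
h = [0.9, 0.4, 0.2]                      # one weak ranker, e.g. BM25 scores
print(rankboost_update(D, h, alpha=1.0)) # pair (2, 1), which h gets wrong,
                                         # now carries most of the weight
```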

Slide 29

How good is it?

An Efficient Boosting Algorithm for Combining Preferences (Y. Freund, R. Iyer, et al., JMLR 2003)

Slide 30

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments (Zheng et al., SIGIR'07)

Non-linear ensemble of features via a gradient descent boosting tree: at each stage, fit a regression tree to minimize the residuals (sketch below). The result is a decision-tree ensemble like the one from Slide 8:

- BM25 > 0.5?
  - True: LM > 0.1?
    - True: r = 1.0
    - False: r = 0.7
  - False: PageRank > 0.3?
    - True: r = 0.4
    - False: r = 0.1
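A minimal sketch of the boosting-tree idea under squared loss: each stage fits a shallow regression tree to the residuals of the current ensemble. The data, depth, and learning rate are illustrative, and scikit-learn's DecisionTreeRegressor stands in for the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.6, 1.1, 0.9], [2.7, 1.9, 0.2],
              [0.4, 0.3, 0.8], [2.1, 1.5, 0.1]])   # BM25, LM, PageRank
y = np.array([0.0, 1.0, 0.0, 1.0])                 # relevance labels

pred, lr = np.zeros_like(y), 0.5
for _ in range(10):
    residual = y - pred                  # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)
print(pred)                              # ensemble scores used for ranking
```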

Slide 31

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments (Zheng et al., SIGIR'07)

Non-linear vs. linear: comparison with RankingSVM.

Slide 32

Where do we get the relative orders?

- Human annotations: small scale, expensive to acquire
- Clickthroughs: large amounts, easy to acquire

Slide 33

Accurately Interpreting Clickthrough Data as Implicit Feedback (Thorsten Joachims, et al., SIGIR'05)

Position bias: you may click not because of a result's relevance, but because of its position!

Slide 34

Accurately Interpreting Clickthrough Data as Implicit Feedback (Thorsten Joachims, et al., SIGIR'05)

Controlled experiment: users over-trust the top-ranked positions.

Slide 35

Pairwise preference matters

- Click: examined and clicked document
- Skip: examined but non-clicked document
- Preference signal: Click > Skip

Accurately Interpreting Clickthrough Data as Implicit Feedback (Thorsten Joachims, et al., SIGIR'05)

Slide 36

What did we learn?

- Predicting relative order gets closer to the nature of ranking
- Promising performance in practice
- Pairwise preferences can be mined from clickthroughs

Slide 37

Listwise Learning to Rank

Can we directly optimize the ranking? order -> metric

Tackle the challenge: optimization without a gradient.

Slide 38

From RankNet to LambdaRank to LambdaMART: An Overview (Christopher J.C. Burges, 2010)

Does minimizing mis-ordered pairs maximize IR metrics? Not necessarily: a ranking with 6 mis-ordered pairs can have higher AP and DCG than one with only 4, because the metrics discount errors lower in the list. Position is crucial!

Slide 39

From RankNet to LambdaRank to LambdaMART: An Overview (Christopher J.C. Burges, 2010)

Weight the mis-ordered pairs: some pairs are more important to place in the right order. Either inject the weight into the objective function, or inject it directly into the gradient (a one-pair sketch follows below):

lambda_{ij} = (gradient with respect to the approximated objective, i.e., the pairwise loss on mis-ordered pairs) * (change in the original objective, e.g., |Delta NDCG|, if documents i and j are swapped, leaving the other documents unchanged)

The second factor depends on the ranks of documents i and j in the whole list.
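A minimal sketch of a lambda for one pair, assuming the RankNet-style logistic gradient scaled by |Delta NDCG| as in Burges' overview; the gains, discounts, and example numbers are illustrative.

```python
import math

def delta_ndcg(rels, ranks, i, j, idcg):
    """|NDCG change| if documents i and j swap rank positions."""
    gain = lambda r: 2 ** r - 1
    disc = lambda pos: 1.0 / math.log2(pos + 1)
    before = gain(rels[i]) * disc(ranks[i]) + gain(rels[j]) * disc(ranks[j])
    after = gain(rels[i]) * disc(ranks[j]) + gain(rels[j]) * disc(ranks[i])
    return abs(after - before) / idcg

def lambda_ij(s_i, s_j, dndcg, sigma=1.0):
    """RankNet-style pairwise gradient, scaled by |delta NDCG|."""
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j))) * dndcg

# relevance-2 doc at rank 1 vs. relevance-0 doc at rank 4: a high-impact pair
print(lambda_ij(1.2, 0.8, delta_ndcg([2, 0], [1, 4], 0, 1, idcg=3.0)))
```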

Slide 40

Lambda functions

- Is it a gradient? Yes: it meets the sufficient and necessary condition of being a partial derivative.
- Does it lead to an optimal solution of the original problem? Empirically, yes.

From RankNet to LambdaRank to LambdaMART: An Overview (Christopher J.C. Burges, 2010)

Slide 41

Evolution (From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, 2010)

- RankNet: objective = cross entropy over the pairs; gradient (lambda function) = gradient of cross entropy; optimization = neural network, stochastic gradient descent.
- LambdaRank: objective = unknown, optimized solely via the gradient; gradient = gradient of cross entropy times the pairwise change in the target metric; optimization = neural network, stochastic gradient descent.
- LambdaMART: objective = unknown; gradient = gradient of cross entropy times the pairwise change in the target metric; optimization = Multiple Additive Regression Trees (MART), a non-linear combination, as we discussed in RankBoost.

Slide 42

A Lambda tree (From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, 2010)

Each split in the tree is on a combination of features.

Slide 43

AdaRank: a boosting algorithm for information retrieval (Jun Xu & Hang Li, SIGIR'07)

Loss defined directly by IR metrics (target metrics: MAP, NDCG, MRR), optimized by boosting. As in RankBoost, a committee of ranking features (BM25, PageRank, Cosine, ...) votes, each member weighted by its credibility, and the weights are updated every round (a re-weighting sketch follows below). (Committee figure: from Pattern Recognition and Machine Learning, p. 658.)
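A minimal sketch of the query re-weighting step, assuming the usual AdaRank form where a query's weight is proportional to exp(-E_q), with E_q the target metric on query q: hard queries gain weight for the next round.

```python
import math

def adarank_reweight(metric_per_query, weights):
    """One AdaRank-style update: new weight ~ old_weight * exp(-metric)."""
    raw = [w * math.exp(-m) for w, m in zip(weights, metric_per_query)]
    Z = sum(raw)                      # normalization constant
    return [r / Z for r in raw]

# queries the current ranker serves poorly (low MAP/NDCG) gain weight
print(adarank_reweight([0.9, 0.2, 0.5], [1/3, 1/3, 1/3]))
```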

Slide 44

A Support Vector Machine for Optimizing Average Precision (Yisong Yue, et al., SIGIR'07)

- RankingSVM: minimizes the pairwise loss, defined on the number of mis-ordered document pairs
- SVM-MAP: minimizes a structural loss, defined on the quality (the MAP difference) of the whole list of ordered documents

Slide 45

Max margin principle: push the ground-truth ranking far away from any mistake you might make, by finding the most violated constraints.

A Support Vector Machine for Optimizing Average Precision (Yisong Yue, et al., SIGIR'07)

Slide 46

Finding the most violated constraints

- MAP is invariant to permutations within the relevant set and within the irrelevant set
- So search over a series of swaps between relevant and irrelevant documents to maximize the right-hand side of the constraints
- Greedy solution: start from the reverse order of the ideal ranking (a toy sketch follows below)

A Support Vector Machine for Optimizing Average Precision (Yisong Yue, et al., SIGIR'07)
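A toy sketch of the search for the most violated constraint. The real algorithm is greedy and interleaves relevant documents into the irrelevant block; for clarity this version simply enumerates interleavings, and the scalar pair_score margin is a stand-in for the model's score term.

```python
from itertools import combinations

def average_precision(rels):
    hits, s = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            s += hits / k
    return s / max(hits, 1)

def most_violated(n_rel, n_irr, pair_score):
    """Return the interleaving maximizing violation = loss - margin."""
    best, best_val = None, float("-inf")
    n = n_rel + n_irr
    for rel_pos in combinations(range(n), n_rel):
        rels = [1 if i in rel_pos else 0 for i in range(n)]
        # number of (relevant above irrelevant) pairs in this interleaving
        correct = sum(n_irr - sum(1 for j in range(p) if j not in rel_pos)
                      for p in rel_pos)
        val = (1.0 - average_precision(rels)) - pair_score * correct
        if val > best_val:
            best, best_val = rels, val
    return best

# with 2 relevant and 3 irrelevant docs, the most violated ranking is the
# reverse of the ideal one: all irrelevant documents first
print(most_violated(2, 3, pair_score=0.01))   # [0, 0, 0, 1, 1]
```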

Slide 47

Experiment results

A Support Vector Machine for Optimizing Average Precision (Yisong Yue, et al., SIGIR'07)

Slide 48

Other listwise solutions

- Soften the metrics to make them differentiable: Michael Taylor et al., SoftRank: optimizing non-smooth rank metrics, WSDM'08
- Minimize a loss function defined on permutations: Zhe Cao et al., Learning to rank: from pairwise approach to listwise approach, ICML'07

Slide 49

What did we learn?

- Taking a list of documents as a whole: positions are visible to the learning algorithm
- Directly optimizing the target metric
- Limitation: the search space is huge!

Slide 50

Summary

Learning to rank: automatic combination of ranking features to optimize IR evaluation metrics.

Approaches:
- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order

Slide 51

Experimental Comparisons: ranking performance

Slide 52

Experimental Comparisons: winning count over seven different data sets

Slide 53

Experimental Comparisons: my experiments

1.2k queries, 45.5K documents with 1890 features; 800 queries for training, 400 queries for testing.

Method     | MAP    | P@1    | ERR    | MRR    | NDCG@5
ListNET    | 0.2863 | 0.2074 | 0.1661 | 0.3714 | 0.2949
LambdaMART | 0.4644 | 0.4630 | 0.2654 | 0.6105 | 0.5236
RankNET    | 0.3005 | 0.2222 | 0.1873 | 0.3816 | 0.3386
RankBoost  | 0.4548 | 0.4370 | 0.2463 | 0.5829 | 0.4866
RankingSVM | 0.3507 | 0.2370 | 0.1895 | 0.4154 | 0.3585
AdaRank    | 0.4321 | 0.4111 | 0.2307 | 0.5482 | 0.4421
pLogistic  | 0.4519 | 0.3926 | 0.2489 | 0.5535 | 0.4945
Logistic   | 0.4348 | 0.3778 | 0.2410 | 0.5526 | 0.4762

Slide 54

Analysis of the Approaches

What are they really optimizing? How does it relate to the IR metrics? There is a gap.

Slide 55

Pointwise Approaches

- Regression based: the regression loss bounds the DCG gap through the discount coefficients in DCG
- Classification based: the classification loss bounds the DCG gap through the discount coefficients in DCG

Slide 56

[Bound derivation figure; content lost in transcription.]

Slide 57

[Bound derivation figure, continued; the discount coefficients in DCG appear in the bound.]

Slide 58

Listwise Approaches

No general analysis; it is method dependent. The key properties are directness and consistency.

Slide 59

Connection with Traditional IR

People foresaw this topic a long time ago: it fits nicely into the risk minimization framework.

Slide 60

Applying Bayesian Decision Theory

[Figure: the risk minimization framework. Observed: query q from user U, document collection C from source S; hidden: the models that generated them. Choices: (D_1, pi_1), (D_2, pi_2), ..., (D_n, pi_n), each a set of documents D with a presentation order pi, each incurring a loss L. Select the choice minimizing the Bayes risk for (D, pi).]

In learning-to-rank terms, the loss L is the metric to be optimized, computed from the available ranking features.

Slide 61

Traditional Solution

Set-based models (choose D) and ranking models (choose pi); all unsupervised!

- Independent loss: relevance-based loss (the pointwise analogue: Boolean model, probabilistic relevance model, Generative Relevance Theory, vector-space model, two-stage LM, KL-divergence model) and distance-based loss (the pairwise/listwise analogue)
- Dependent loss: MMR loss (subtopic retrieval model) and MDR loss

Slide 62

Traditional Notion of Relevance

[Figure: the same taxonomy of relevance models as in Slide 5: similarity-based models (vector space model, Salton et al., 75; prob. distr. model, Wong & Yao, 89); probability-of-relevance models (regression model, Fuhr, 89; document generation: classical prob. model, Robertson & Sparck Jones, 76; query generation: LM approach, Ponte & Croft, 98, Lafferty & Zhai, 01a; Learning to Rank, Joachims, 02, Burges et al., 05); probabilistic inference with different inference systems (prob. concept space model, Wong & Yao, 95; inference network model, Turtle & Croft, 91); plus Divergence from Randomness (Amati & van Rijsbergen, 02) and relevance constraints (Fang et al., 04).]

Slide 63

Broader Notion of Relevance

- Traditional view: content-driven (vector space model, probability relevance model, language model); query-document specific; unsupervised
- Modern view: anything related to the quality of the document (clicks/views, link structure, visual structure, social network, ...); query, document, and query-document specific; supervised

Slide 64

Broader Notion of Relevance

[Figure: the query-document feature set grows from content matching alone (BM25, language model, cosine) to also include linkage structure, query relations, likes, clicks/views, visual structure, and the social network.]

Slide 65

Future

- Tighter bounds
- Faster solutions
- Larger scale
- Wider application scenarios

Slide 66

Resources

Books:
- Liu, Tie-Yan. Learning to Rank for Information Retrieval. Vol. 13. Springer, 2011.
- Li, Hang. "Learning to rank for information retrieval and natural language processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.

Helpful pages:
- http://en.wikipedia.org/wiki/Learning_to_rank

Packages:
- RankingSVM: http://svmlight.joachims.org/
- RankLib: http://people.cs.umass.edu/~vdang/ranklib.html

Data sets:
- LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor//
- Yahoo! Learning to Rank Challenge: http://learningtorankchallenge.yahoo.com/

Slide 67

References

- Liu, Tie-Yan. "Learning to rank for information retrieval." Foundations and Trends in Information Retrieval 3.3 (2009): 225-331.
- Cossock, David, and Tong Zhang. "Subset ranking using regression." Learning Theory (2006): 605-619.
- Shashua, Amnon, and Anat Levin. "Ranking with large margin principle: Two approaches." Advances in Neural Information Processing Systems 15 (2003): 937-944.
- Joachims, Thorsten. "Optimizing search engines using clickthrough data." Proceedings of the eighth ACM SIGKDD. ACM, 2002.
- Freund, Yoav, et al. "An efficient boosting algorithm for combining preferences." The Journal of Machine Learning Research 4 (2003): 933-969.
- Zheng, Zhaohui, et al. "A regression framework for learning ranking functions using relative relevance judgments." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.

Slide 68

References

- Joachims, Thorsten, et al. "Accurately interpreting clickthrough data as implicit feedback." Proceedings of the 28th annual international ACM SIGIR. ACM, 2005.
- Burges, C. "From RankNet to LambdaRank to LambdaMART: An overview." Learning 11 (2010): 23-581.
- Xu, Jun, and Hang Li. "AdaRank: a boosting algorithm for information retrieval." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
- Yue, Yisong, et al. "A support vector method for optimizing average precision." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
- Taylor, Michael, et al. "SoftRank: optimizing non-smooth rank metrics." Proceedings of the international conference WSDM. ACM, 2008.
- Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th ICML. ACM, 2007.

Slide 69

Thank you!

Q&A

Slide 70

Recap of last lecture

Goal: design the ranking module for Bing.com

Slide 71

Basic Search Engine Architecture

"The anatomy of a large-scale hypertextual Web search engine." Sergey Brin and Lawrence Page. Computer Networks and ISDN Systems 30.1 (1998): 107-117.

Your job: the ranking module.

Slide 72

Learning to Rank

- Given: (query, document) pairs represented by a set of relevance estimators, a.k.a. features
- Needed: a way of combining the estimators into an ordered list of documents (this is the key!)
- Criterion: optimize IR metrics: P@k, MAP, NDCG, etc.

QueryID | DocID | BM25 | LM  | PageRank | Label
0001    | 0001  | 1.6  | 1.1 | 0.9      | 0
0001    | 0002  | 2.7  | 1.9 | 0.2      | 1

Slide 73

Challenge: how to optimize?

Order is essential: order -> metric. Evaluation metrics are not continuous and not differentiable.

Slide 74

Approximating the objectives!

- Pointwise: fit the relevance labels individually
- Pairwise: fit the relative orders
- Listwise: fit the whole order

Slide 75

Pointwise Learning to Rank

Ideally, perfect relevance prediction leads to perfect ranking: score -> order -> metric.

Reduces the ranking problem to:
- Regression (http://en.wikipedia.org/wiki/Regression_analysis)
- Classification (http://en.wikipedia.org/wiki/Statistical_classification)

Slide 76

Deficiency

- Cannot directly optimize IR metrics: the predictions (0->1, 2->0) get a lower regression loss than (0->-2, 2->4), yet the former flips the correct order while the latter preserves it
- Positions of documents are ignored: the penalty on documents at higher positions should be larger
- Favors the queries with more documents

Slide 77

Pairwise Learning to Rank

Ideally, a perfect partial order leads to perfect ranking: partial order -> order -> metric.

Ordinal regression: the relative ordering between different documents is what matters. E.g., (0->-2, 2->4) is better than (0->1, 2->0).

Slide 78

RankingSVM (Thorsten Joachims, KDD'02)

Minimizing the number of mis-ordered pairs with a linear combination of features, keeping the relative orders.

QueryID | DocID | BM25 | PageRank | Label
0001    | 0001  | 1.6  | 0.9      | 4
0001    | 0002  | 2.7  | 0.2      | 2
0001    | 0003  | 1.3  | 0.2      | 1
0001    | 0004  | 1.2  | 0.7      | 0

Slide 79

RankingSVM (Thorsten Joachims, KDD'02)

Minimizing the number of mis-ordered pairs.

Slide 80

General Idea of Pairwise Learning to Rank

For any pair of documents where one should rank above the other, penalize a mis-ordering with a surrogate loss on the pairwise margin:

- 0/1 loss
- exponential loss
- hinge loss (RankingSVM)
- square loss (GBRank)