Term Weighting and Ranking Models
Presentation Transcript

1. Term Weighting and Ranking Models
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata

2. Parametric and zone indexes (Sec. 6.1)
Thus far, a doc has been a sequence of terms
In fact, documents have multiple parts, some with special semantics:
- Author
- Title
- Date of publication
- Language
- Format
- etc.
These constitute the metadata about a document

3. Fields (Sec. 6.1)
We sometimes wish to search by these metadata
- E.g., find docs authored by William Shakespeare in the year 1601, containing "alas poor Yorick"
- Year = 1601 is an example of a field
- Also, author last name = shakespeare, etc.
Field or parametric index: postings for each field value
- Sometimes build range trees (e.g., for dates)
Field query typically treated as conjunction (doc must be authored by shakespeare)

4. Zone (Sec. 6.1)
A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
- Title
- Abstract
- References ...
Build inverted indexes on zones as well to permit querying
- E.g., "find docs with merchant in the title zone and matching the query gentle rain"

5. Example zone indexes (Sec. 6.1)
Encode zones in the dictionary vs. in the postings.

6. Basics of Ranking
Boolean retrieval models simply return documents satisfying the Boolean condition(s)
Among those, are all documents equally "good"? No
- Consider the case of a single-term query
- Not all documents containing the term are equally associated with that term
From the Boolean model to term weighting
- Weight of a term in a document is 1 or 0 in the Boolean model
- Use more granular term weighting
- Weight of a term in a document: represents how important the term is in the document, and vice versa

7. Term weighting – TF.iDF
How important is a term t in a document d?
- Intuition 1: The more times a term is present in a document, the more important it is → term frequency (TF)
- Intuition 2: If a term is present in many documents, it is less important to any one of them → document frequency (DF)
Combining the two: TF.iDF (term frequency × inverse document frequency)
Many variants exist for both TF and DF

8. Term frequency (TF)
Variants of TF(t, d):
- Simplest term frequency: number of times t occurs in d: freq(t, d)
  If a term a is present 10 times and b is present 2 times, is a 5 times more important than b in that document?
- Logarithmically scaled frequency: TF(t, d) = 1 + log(freq(t, d)) for all t in d; 0 otherwise
  Still, long documents on the same topic would have higher frequencies for the same terms
- Augmented frequency: avoid bias towards longer documents
  TF(t, d) = 0.5 + 0.5 · freq(t, d) / max_{t'} freq(t', d) for all t in d; 0 otherwise
  Half the score for just being present; the rest is a function of frequency
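As a rough illustration of the three TF variants above, a minimal Python sketch (the augmented form is written here as the standard 0.5 + 0.5·freq/max-freq definition, since the slide's formula image did not survive extraction):

```python
import math

def tf_raw(freq):
    """Raw TF: number of times t occurs in d."""
    return freq

def tf_log(freq):
    """Log-scaled TF: 1 + log(freq(t, d)) if the term occurs, 0 otherwise."""
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_augmented(freq, max_freq_in_doc):
    """Augmented TF: 0.5 just for being present, plus 0.5 scaled by the
    most frequent term in the document, to dampen the bias towards long documents."""
    return 0.5 + 0.5 * freq / max_freq_in_doc if freq > 0 else 0.0
```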

9. (Inverse) document frequency (iDF)
Inverse document frequency of t: iDF(t) = log(N / DF(t))
- where N = total number of documents
- DF(t) = number of documents in which t occurs
Distribution of terms
- Zipf's law: Let T = {t1, ..., tm} be the terms, sorted by decreasing order of the number of documents in which they occur. Then DF(ti) ≈ c / i for some constant c
[Figure: Zipf's law fit for the Reuters RCV1 collection]
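Putting the two together, a small sketch of TF.iDF with log-scaled TF and iDF(t) = log(N / DF(t)); the helper names and the tokenized-document representation are illustrative, not from the slides:

```python
import math
from collections import Counter

def document_frequencies(tokenized_docs):
    """DF(t): the number of documents in which each term occurs."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df

def tf_idf(term, tokens, df, n_docs):
    """TF.iDF weight of a term in one tokenized document."""
    freq = tokens.count(term)
    if freq == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log(freq)) * math.log(n_docs / df[term])
```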

10. Vector space model
Each term represents a dimension
Documents are vectors in the term-space
Term-document matrix: a very sparse matrix
- Entries are scores of the terms in the documents (Boolean → Count → Weight)
Query is also a vector in the term-space

              d1    d2    d3    d4    d5    q
  diwali      0.5   0     0     0     0     0
  india       0.2   0.2   0     0.2   0.1   1
  flying      0     0.4   0     0     0     0
  population  0     0     0     0.5   0     0
  autumn      0     0     1     0     0     0
  statistical 0     0.3   0     0     0.2   1

Vector similarity: inverse of "distance" – Euclidean distance?

11. Problem with Euclidean distance
Problem
- Topic-wise, d3 is closer to d1 than d2 is
- Euclidean-distance-wise, d2 is closer to d1
Dot product solves this problem
[Figure: d1, d2 and d3 plotted in a two-dimensional term space (Term 1, Term 2)]

12. Vector space model
Each term represents a dimension
Documents are vectors in the term-space
Term-document matrix: a very sparse matrix
- Entries are scores of the terms in the documents (Boolean → Count → Weight)
Query is also a vector in the term-space

(same term-document matrix as before)

Vector similarity: dot product

13. Problem with dot product
Problem
- Topic-wise, d2 is closer to d1 than d3 is
- Dot product of d3 and d1 is greater because of the length of d3
Consider the angle instead
- Cosine of the angle between two vectors
- Same direction: 1 (similar)
- Orthogonal: 0 (unrelated)
- Opposite direction: −1 (opposite)
[Figure: d1, d2 and d3 in a two-dimensional term space (Term 1, Term 2)]

14. Vector space model
Each term represents a dimension
Documents are vectors in the term-space
Term-document matrix: a very sparse matrix
- Entries are scores of the terms in the documents (Boolean → Count → Weight)
Query is also a vector in the term-space

              d1    d2    d3    d4    d5    q
  diwali      0.5   0     0     0     0     0
  india       0.2   0     0     0.2   0.4   1
  flying      0     0.4   0     0     0     0
  population  0     0     0     0.5   0     0
  autumn      0     0     1     0     0     0
  statistical 0     0.3   0     0     0.6   1

Vector similarity: cosine of the angle between the vectors
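A minimal sketch of cosine similarity over sparse term-weight vectors (documents as dicts mapping term to weight); the example weights loosely follow the reconstructed matrix above:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

q  = {"india": 1.0, "statistical": 1.0}
d2 = {"flying": 0.4, "statistical": 0.3}
d5 = {"india": 0.4, "statistical": 0.6}
print(cosine(q, d5), cosine(q, d2))   # d5 ranks above d2 for this query
```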

15. Query processing
How to compute cosine similarity and find the top-k?
- Similar to a merge union or intersection using the inverted index
- The aggregation function changes to compute the cosine
- We are interested only in the ranking, so there is no need to normalize by query length
- If the query words are unweighted, this becomes very similar to a merge union
- Need to find the top-k results after computing the union list with cosine scores
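A rough term-at-a-time sketch of the merge-union-like processing described above: each query term's postings (assumed here to be (doc_id, weight) pairs) contribute to a per-document accumulator, and the top-k are picked at the end; query-length normalization is skipped since it does not change the ranking:

```python
import heapq
from collections import defaultdict

def score_query(query_terms, postings, k=10):
    """Accumulate per-document scores over the query terms' postings lists,
    then return the k highest-scoring (doc_id, score) pairs."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight        # aggregation: sum of term-document weights
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```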

16. Partial sort using a heap for selecting the top k (Sec. 7.1)
Usual sort takes n log n (n may be 1 million)
Partial sort with a (max-)heap
- Binary tree in which each node's value > the values of its children
- Takes 2n operations to construct, then each of the top k is read off in 2 log n steps
- For n = 1 million, k = 100, this is about 10% of the cost of sorting
[Figure: example binary max-heap of scores]
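The slide's heap construction (heapify all n scores, then pop the top k) can be approximated in Python with heapq.nlargest, which instead keeps a small heap of size k while scanning, roughly O(n log k); either way, the point is that selecting the top k is much cheaper than fully sorting:

```python
import heapq
import random

scores = [random.random() for _ in range(1_000_000)]    # e.g. one cosine score per doc

top_sorted = sorted(scores, reverse=True)[:100]          # full sort: O(n log n)
top_heap = heapq.nlargest(100, scores)                   # partial selection via a heap

assert top_sorted == top_heap
```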

17. Length normalization and query–document similarity
Term–document scores
- TF.iDF or similar scoring; may already apply some normalization to eliminate the bias towards long documents
Document length normalization
- Ranking by cosine similarity is equivalent to further normalizing the term–document scores by document length. Query length normalization is redundant.
Similarity measure and aggregation
- Cosine similarity
- Sum the scores with union
- Sum the scores with intersection
- Other possible approaches

18. Top-k algorithms
If there are millions of documents in the lists
- Can the ranking be done without accessing the lists fully?
Exact top-k algorithms (used more in databases)
- Family of threshold algorithms (Ronald Fagin et al.)
  - Threshold Algorithm (TA)
  - No Random Access algorithm (NRA) [we will discuss, as an example]
  - Combined Algorithm (CA)
- Other follow-up works

19. NRA (No Random Access) Algorithm
Lists sorted by score:
  List 1: doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 17 (0.3), doc 21 (0.2), doc 91 (0.1)
  List 2: doc 17 (0.6), doc 38 (0.6), doc 14 (0.6), doc 5 (0.6), doc 83 (0.5), doc 21 (0.3), doc 44 (0.1)
  List 3: doc 83 (0.9), doc 17 (0.7), doc 61 (0.3), doc 81 (0.2), doc 65 (0.1), doc 10 (0.1)
Fagin's NRA Algorithm: read one doc from every list in each round

20. NRA (No Random Access) Algorithm
Fagin's NRA Algorithm: round 1 – read one doc from every list (lists as above)
Candidates [current score, best-score]:
  doc 83 [0.9, 2.1]
  doc 17 [0.6, 2.1]
  doc 25 [0.6, 2.1]
min top-2 score: 0.6
maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1
min-top-2 < best-score of candidates, so continue

21. NRA (No Random Access) Algorithm
Fagin's NRA Algorithm: round 2 – read one doc from every list
Candidates [current score, best-score]:
  doc 17 [1.3, 1.8]
  doc 83 [0.9, 2.0]
  doc 25 [0.6, 1.9]
  doc 38 [0.6, 1.8]
  doc 78 [0.5, 1.8]
min top-2 score: 0.9
maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8
min-top-2 < best-score of candidates, so continue

22. NRA (No Random Access) Algorithm
Fagin's NRA Algorithm: round 3 – read one doc from every list
Candidates [current score, best-score]:
  doc 83 [1.3, 1.9]
  doc 17 [1.3, 1.7]
  doc 25 [0.6, 1.5]
  doc 78 [0.5, 1.4]
min top-2 score: 1.3
maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3
No more new docs can get into the top-2, but extra candidates are left in the queue
min-top-2 < best-score of candidates, so continue

23. NRA (No Random Access) Algorithm
Fagin's NRA Algorithm: round 4 – read one doc from every list
Candidates [current score, best-score]:
  doc 17: 1.6 (final score)
  doc 83 [1.3, 1.9]
  doc 25 [0.6, 1.4]
min top-2 score: 1.3
maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1
No more new docs can get into the top-2, but extra candidates are left in the queue
min-top-2 < best-score of candidates, so continue

24. NRA (No Random Access) Algorithm
Fagin's NRA Algorithm: round 5 – read one doc from every list
Candidates:
  doc 83: 1.8 (final score)
  doc 17: 1.6 (final score)
min top-2 score: 1.6
maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8
No extra candidate in the queue – Done!
More approaches:
- Periodically also perform random accesses on documents to reduce uncertainty (CA)
- Sophisticated scheduling on lists
- Crude approximation: NRA may take a lot of time to stop. Just stop after a while with an approximate top-k – who cares if the results are perfect according to the scores?
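A compact sketch of the NRA rounds walked through above. It assumes each list is already sorted by score descending and uses only sorted accesses; in every round it reads one entry per list, maintains [worst, best] score bounds for the seen documents, and stops once the k-th best worst-case score can no longer be beaten by any other candidate or by an unseen document:

```python
import heapq

def nra_top_k(lists, k=2):
    """Fagin's No-Random-Access algorithm (sketch).
    `lists`: one postings list per query term, each a list of (doc_id, score)
    sorted by score descending. Returns the top-k (doc_id, worst-case score)."""
    m = len(lists)
    seen = {}                                    # doc -> per-list scores (None = not yet seen)
    worst = {}

    for depth in range(max(len(lst) for lst in lists)):
        # one sorted access per list per round
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, [None] * m)[i] = score

        # the score at the current read position bounds every unseen entry of that list
        frontier = [lst[depth][1] if depth < len(lst) else 0.0 for lst in lists]

        worst = {d: sum(s for s in v if s is not None) for d, v in seen.items()}
        best = {d: sum(s if s is not None else frontier[i] for i, s in enumerate(v))
                for d, v in seen.items()}

        top = heapq.nlargest(k, worst, key=worst.get)
        if len(top) < k:
            continue
        threshold = worst[top[-1]]               # min worst-case score among the current top-k
        unseen_best = sum(frontier)              # best score a completely unseen doc could reach
        others_best = max((best[d] for d in seen if d not in top), default=0.0)
        if threshold >= unseen_best and threshold >= others_best:
            return [(d, worst[d]) for d in top]

    return heapq.nlargest(k, worst.items(), key=lambda kv: kv[1])

# The three lists from the walkthrough above; the result is doc 83, then doc 17.
lists = [
    [(25, .6), (78, .5), (83, .4), (17, .3), (21, .2), (91, .1)],
    [(17, .6), (38, .6), (14, .6), (5, .6), (83, .5), (21, .3), (44, .1)],
    [(83, .9), (17, .7), (61, .3), (81, .2), (65, .1), (10, .1)],
]
print(nra_top_k(lists, k=2))
```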

25. Inexact top-k retrieval
Does the exact top-k matter?
- How sure are we that the 101st-ranked document is less important than the 100th-ranked one?
- All the scores are simplified models for what information may be associated with the documents
It suffices to retrieve k documents with
- many of them from the exact top-k
- the others having scores close to the top-k

26. Champion lists (Sec. 7.1.3)
Precompute for each dictionary term t the r docs of highest score in t's posting list
- Ideally k < r << n (n = size of the posting list)
- Champion list for t (also called fancy list or top docs for t)
- Note: r has to be chosen at index build time, so it is possible that r < k
At query time, only compute scores for docs in the champion list of some query term
- Pick the k top-scoring docs from amongst these
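A small sketch of both halves: building the champion lists offline (the choice of r happens here) and restricting query-time scoring to them. Postings are again assumed to be (doc_id, weight) pairs:

```python
import heapq

def build_champion_lists(postings, r):
    """Index-build time: keep for each term only the r highest-weighted docs
    from its postings list (the term's champion / fancy / top-docs list)."""
    return {term: heapq.nlargest(r, plist, key=lambda p: p[1])
            for term, plist in postings.items()}

def champion_query(query_terms, champions, k):
    """Query time: score only docs occurring in some query term's champion list.
    If r < k, fewer than k results may come back."""
    scores = {}
    for term in query_terms:
        for doc, w in champions.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + w
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```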

27. Static quality scores (Sec. 7.1.4)
We want top-ranking documents to be both relevant and authoritative
- Relevance is being modeled by cosine scores
- Authority is typically a query-independent property of a document
Examples of authority signals
- Wikipedia among websites
- Articles in certain newspapers
- A paper with many citations
- (PageRank)

28. Modeling authority (Sec. 7.1.4)
Assign a query-independent quality score in [0,1] to each document d
- Denote this by g(d)
Consider a simple total score combining cosine relevance and authority
- Net-score(q,d) = g(d) + cosine(q,d)
- Can use some other linear combination
- Indeed, any function of the two "signals" of user happiness – more later
Now we seek the top k docs by net score

29. Top k by net score – fast methods (Sec. 7.1.4)
First idea: order all postings by g(d)
- Key: this is a common ordering for all postings
- Thus, we can concurrently traverse the query terms' postings for
  - postings intersection
  - cosine score computation
Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop the postings traversal early
- Short of computing scores for all docs in the postings

30. Champion lists in g(d)-ordering (Sec. 7.1.4)
Can combine champion lists with g(d)-ordering
- Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)
- Seek top-k results from only the docs in these champion lists

31. High and low lists (Sec. 7.1.4)
For each term, we maintain two postings lists called high and low
- Think of high as the champion list
When traversing postings for a query, only traverse the high lists first
- If we get more than k docs, select the top k and stop
- Else proceed to get docs from the low lists
Can be used even for simple cosine scores, without a global quality score g(d)
A means for segmenting the index into two tiers

32. Impact-ordered postings (Sec. 7.1.5)
We only want to compute scores for docs d for which wf(t,d), for some query term t, is high enough
Sort each postings list by wf(t,d)
Now: not all postings are in a common order!
How do we compute scores in order to pick the top k?
Two ideas follow

33. Idea 1: Early termination (Sec. 7.1.5)
When traversing t's postings, stop early after either
- a fixed number of docs r, or
- wf(t,d) drops below some threshold
Take the union of the resulting sets of docs
- one set from the postings of each query term
Compute the scores only for docs in this union
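A rough sketch of this first idea over a single wf-ordered postings list; both cut-offs (the doc count r and the weight threshold) are illustrative parameters:

```python
def early_terminated_docs(postings_by_wf, r=1000, wf_threshold=0.1):
    """Scan a postings list sorted by wf(t, d) descending and stop after r docs
    or once the weight drops below the threshold; the union of these sets over
    the query terms is then the only set of docs that gets scored."""
    docs = []
    for doc_id, wf in postings_by_wf:
        if len(docs) >= r or wf < wf_threshold:
            break
        docs.append(doc_id)
    return set(docs)
```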

34. Idea 2: iDF-ordered terms (Sec. 7.1.5)
When considering the postings of query terms
- Look at them in order of decreasing iDF
  - High-iDF terms are likely to contribute most to the score
- As we update the score contribution from each query term
  - Stop if the doc scores are relatively unchanged
Can apply to cosine or some other net scores

35. Cluster pruning (Sec. 7.1.6)
Preprocessing
- Pick √N documents at random: call these leaders
- For every other document, pre-compute its nearest leader
  - Documents attached to a leader: its followers
  - Likely: each leader has ~ √N followers
Query processing
- Given query Q, find its nearest leader L
- Seek the k nearest docs from among L's followers
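A minimal sketch of cluster pruning with cosine similarity as the "nearness" measure; documents are sparse term-weight dicts, and the √N sampling and single-leader lookup follow the slide:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(doc_vectors):
    """Preprocessing: pick ~sqrt(N) random leaders and attach every document
    to its nearest leader (its followers)."""
    n = len(doc_vectors)
    leaders = random.sample(list(doc_vectors), max(1, round(math.sqrt(n))))
    followers = {l: [] for l in leaders}
    for doc_id, vec in doc_vectors.items():
        nearest = max(leaders, key=lambda l: cosine(vec, doc_vectors[l]))
        followers[nearest].append(doc_id)
    return leaders, followers

def cluster_pruned_search(query_vec, doc_vectors, leaders, followers, k):
    """Query time: find the nearest leader, then the k nearest docs among its followers."""
    best_leader = max(leaders, key=lambda l: cosine(query_vec, doc_vectors[l]))
    pool = followers[best_leader]
    return sorted(pool, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)[:k]
```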

36. Cluster pruning (Sec. 7.1.6)
[Figure: query, leaders and followers in the document space]

37. Cluster pruning (Sec. 7.1.6)
Why random sampling?
- Fast
- Leaders reflect the data distribution
Variants:
- Have each follower attached to b1 = 3 (say) nearest leaders
- From the query, find b2 = 4 (say) nearest leaders and their followers
- Can recurse on the leader/follower construction

38. Tiered indexes (Sec. 7.2.1)
Break postings up into a hierarchy of lists
- Most important ... least important
- Can be done by g(d) or another measure
The inverted index is thus broken up into tiers of decreasing importance
At query time, use the top tier unless it fails to yield k docs
- If so, drop to lower tiers
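A small sketch of the query-time fallback: tiers are tried in decreasing order of importance, and lower tiers are consulted only while fewer than k documents have been found. Each tier is assumed to be its own inverted index of (doc_id, weight) postings:

```python
def tiered_search(query_terms, tiers, k):
    """`tiers` is a list of inverted indexes ordered from most to least important.
    Accumulate scores tier by tier, stopping as soon as k docs have been found."""
    scores = {}
    for tier in tiers:
        for term in query_terms:
            for doc, w in tier.get(term, []):
                scores[doc] = scores.get(doc, 0.0) + w
        if len(scores) >= k:
            break
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```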

39. Example tiered index (Sec. 7.2.1)

40. Putting it all together (Sec. 7.2.4)

41. Sources and Acknowledgements
- IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/
- Several slides are adapted from the slides by Prof. Nayak and Prof. Raghavan for their course at Stanford University