Faster TF-IDF


Presentation Transcript

Slide 1

Faster TF-IDF

David Kauchak
CS160, Fall 2009
adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt

Slide 2

Administrative
Assignment 1
Assignment 2
Look at the assignment by Wed!
New turn-in procedure
Class participation

Slide 3

Stoplist and dictionary size

Slide 4

Recap: Queries as vectors
Represent the queries as vectors
Represent the documents as vectors
proximity = similarity of vectors
What do the entries in the vector represent in the tf-idf scheme?

Slide 5

Recap: tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
For each document, there is one entry for every term in the vocabulary.
Each entry in that vector is the tf-idf weight above.
How do we calculate the similarity?
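The slide leaves the formula implicit; in the log-weighted scheme used in the CS276 notes this deck adapts, it is

\[ w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}\frac{N}{\mathrm{df}_t} \]

where tf_{t,d} is the term's frequency in the document, df_t is the number of documents containing the term, and N is the collection size.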

Slide 6

Recap: cosine(query, document)

\[ \cos(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}} \]

the dot product of the unit vectors for q and d. cos(q, d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.

Slide 7

Outline
Calculating tf-idf score
Faster ranking
Static quality scores
Impact ordering
Cluster pruning

Slide 8

Calculating cosine similarity
Traverse the tf-idf entries, calculating the products
Accumulate the vector lengths and divide at the end
How can we do it faster if we have a sparse representation? (see the sketch below)

(figure: postings lists of tf-idf entries)
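A minimal sketch (not from the slides) of the sparse version: store each vector as a term-to-weight dict, iterate only the shorter dict's nonzero entries, and skip every zero product.

    import math

    def cosine(q, d):
        """Cosine similarity for sparse vectors stored as {term: tf-idf} dicts."""
        if len(q) > len(d):
            q, d = d, q                        # iterate the shorter vector
        dot = sum(w * d[t] for t, w in q.items() if t in d)
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0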

Slide 9

Calculating cosine tf-idf from the index
What should we store in the index?
How do we construct the index?
How do we calculate the document ranking?

(figure: query terms w1, w2, w3 looked up in the index)

Slide 10

Index construction: collect documentIDs

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Slide 11

Index construction: sort dictionary
Sort based on terms

Slide 12

Index construction: create postings lists
Create postings lists from identical entries

(figure: dictionary entries word 1, word 2, ..., word n pointing to their postings lists)

Do we have all the information we need? (a sketch of the full pipeline follows)
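A minimal sketch of the three steps on the two example documents, assuming a toy tokenizer (the text is pre-lowercased and just split on whitespace):

    from collections import defaultdict

    docs = {
        1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
        2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
    }

    # 1. collect (term, docID) pairs; 2. sort by term; 3. merge identical
    # entries into postings lists, recording the term frequency as we go
    pairs = sorted((term, doc_id)
                   for doc_id, text in docs.items()
                   for term in text.split())

    index = defaultdict(dict)                  # term -> {docID: tf}
    for term, doc_id in pairs:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

    print(index["caesar"])                     # {1: 1, 2: 2}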

Slide 13

Obtaining tf-idf weights
Store the tf initially in the index
In addition, store the number of documents each term occurs in (its df)
How do we get the idfs?
We can either compute them on the fly from the stored df,
or make another pass through the index and update the weights for each entry
Pros and cons of each approach? (a sketch of the on-the-fly option follows)
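A sketch of the on-the-fly option, using one standard log-weighted scheme (N, tf and df are assumed to come from the index):

    import math

    N = 1_000_000                              # total number of documents (assumed)

    def tf_idf(tf, df):
        """Compute w_{t,d} at query time from the stored tf and df."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)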

Slide 14

Do we have everything we need?
Still need the document lengths
Store these in a separate data structure
Make another pass through the data and update the weights
Benefits/drawbacks? (a sketch of the length pass follows)
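A sketch of the separate-structure option: one extra pass over an index of the shape built above, accumulating each document's vector length (the weights here are illustrative):

    import math
    from collections import defaultdict

    index = {"caesar": {1: 0.3, 2: 0.6}, "brutus": {1: 0.3, 2: 0.3}}

    doc_length = defaultdict(float)
    for term, plist in index.items():
        for doc_id, w in plist.items():
            doc_length[doc_id] += w * w        # sum of squared weights
    doc_length = {d: math.sqrt(s) for d, s in doc_length.items()}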

Slide 15

Computing cosine scores
Similar to the merge operation
Accumulate scores for each document:

    float scores[N] = 0
    for each query term t:
        calculate w_t,q
        for each entry in t's postings list: docID, w_t,d
            scores[docID] += w_t,q * w_t,d
    return top k components of scores
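A minimal runnable version of the pseudocode (assuming postings that already hold tf-idf weights; length normalization is left out, as on the slide):

    import heapq
    from collections import defaultdict

    def cosine_scores(query_weights, postings, k):
        """query_weights: {term: w_t,q}; postings: {term: {docID: w_t,d}}."""
        scores = defaultdict(float)
        for t, w_tq in query_weights.items():
            for doc_id, w_td in postings.get(t, {}).items():
                scores[doc_id] += w_tq * w_td
        return heapq.nlargest(k, scores.items(), key=lambda e: e[1])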

Slide 16

Efficiency
What are the inefficiencies here?
Only want the scores for the top k, but we are calculating all the scores
Sort to obtain the top k?

    float scores[N] = 0
    for each query term t:
        calculate w_t,q
        for each entry in t's postings list: docID, w_t,d
            scores[docID] += w_t,q * w_t,d
    return top k components of scores

Slide 17

Outline
Calculating tf-idf score
Faster ranking
Static quality scores
Impact ordering
Cluster pruning

Slide 18

Efficient cosine ranking
What we're doing in effect: solving the K-nearest-neighbor problem for a query vector
In general, we do not know how to do this efficiently for high-dimensional spaces
Two simplifying assumptions:
Queries are short!
Assume no weighting on query terms and that each query term occurs only once
Then, for ranking, we don't need to normalize the query vector

Slide 19

Computing cosine scores
Assume no weighting on query terms and that each query term occurs only once:

    float scores[N] = 0
    for each query term t:
        for each entry in t's postings list: docID, w_t,d
            scores[docID] += w_t,d
    return top k components of scores

Slide 20

Selecting top K
We could sort the scores and then pick the top K
What is the runtime of this approach? O(N log N)
Can we do better?
Use a heap (i.e. a priority queue):
Build a heap out of the scores
Get the top K scores from the heap
Running time? O(N + K log N)
For N = 1M, K = 100, this is about 10% of the cost of sorting

(figure: a binary max-heap over the scores 1, .9, .3, .8, .3, .1, .1)
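A sketch of the heap-based selection on the slide's example scores (heapify is O(N), each pop is O(log N), so top-K costs O(N + K log N)):

    import heapq

    scores = [1, .9, .3, .8, .3, .1, .1]

    heap = [-s for s in scores]                # max-heap via negated values
    heapq.heapify(heap)                        # O(N)
    top_k = [-heapq.heappop(heap) for _ in range(3)]
    print(top_k)                               # [1, 0.9, 0.8]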

Slide 21

Inexact top K
What if we don't return exactly the top K, but a set close to the top K?
The user has a task and a query formulation
Cosine is a proxy for matching this task/query
If we get a list of K docs "close" to the top K by the cosine measure, it should still be OK

Slide 22

Current approach:
Documents -> Score documents -> Pick top K

Slide 23

Approximate approach:
Documents -> Select A candidates (K < A << N) -> Score documents in A -> Pick top K in A

Slide 24

Exact vs. approximate
Depending on how A is selected and how large A is, we can get different results
Can think of it as pruning the initial set of docs
How might we pick A?

Slide 25

Docs containing many query terms
So far, we consider any document with at least one query term in it
For multi-term queries, only compute scores for docs containing several of the query terms
Say, at least 3 out of 4
This imposes a "soft conjunction" on queries, as seen in web search engines (early Google)
Easy to implement in postings traversal

Slide 26

3 of 4 query terms

Antony:    3 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Brutus:    2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar:    1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Calpurnia: 13 -> 16 -> 32

Scores are only computed for docs 8, 16 and 32.
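A sketch of the soft-conjunction filter over the postings above: count, for each doc, how many query terms' lists it appears in, and keep the docs that reach the threshold.

    from collections import Counter

    postings = {
        "antony":    [3, 4, 8, 16, 32, 64, 128],
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
        "calpurnia": [13, 16, 32],
    }

    def candidates(query_terms, min_hits=3):
        """Docs appearing in at least min_hits of the query terms' postings."""
        counts = Counter(doc for t in query_terms for doc in postings.get(t, []))
        return sorted(doc for doc, n in counts.items() if n >= min_hits)

    print(candidates(["antony", "brutus", "caesar", "calpurnia"]))  # [8, 16, 32]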

Slide 27

High-idf query terms only
For a query such as "catcher in the rye", only accumulate scores from catcher and rye
Intuition: "in" and "the" contribute little to the scores and don't alter the rank ordering much
Benefit: postings of low-idf terms have many docs -> these (many) docs get eliminated from A
Can we calculate this efficiently from the index?

Slide 28

Champion lists
Precompute for each dictionary term the r docs of highest weight in the term's postings
Call this the champion list for a term
(aka fancy list or top docs for a term)
This must be done at index time

(figure: the full postings lists -- Antony: 3, 4, 8, 16, 32, 64, 128; Brutus: 2, 4, 8, 16, 32, 64, 128; Caesar: 1, 2, 3, 5, 8, 13, 21, 34)

Slide 29

Champion lists
At query time, only compute scores for docs in the champion list of some query term
Pick the K top-scoring docs from amongst these
Are we guaranteed to always get K documents?

(figure: champion lists of r = 3 docs per term, e.g. Brutus: 8, 16, 128; Caesar: 8, 32, 128; Antony: 1, 16, 128)

Slide 30

High and low lists
For each term, we maintain two postings lists, called high and low
Think of high as the champion list
When traversing postings on a query, only traverse the high lists first
If we get more than K docs, select the top K and stop
Else proceed to get docs from the low lists
A means of segmenting the index into two tiers

Slide 31

Tiered indexes
Break postings up into a hierarchy of lists: most important ... least important
The inverted index is thus broken up into tiers of decreasing importance
At query time, use the top tier unless it fails to yield K docs
If so, drop to lower tiers

Slide 32

Example tiered index

Slide 33

Quick review
Rather than selecting the best K scores from all N documents:
Initially filter the documents to a smaller set
Select the K best scores from this smaller set
Methods for selecting this smaller set:
Documents with more than one query term
Terms with high idf
Documents with the highest weights

Slide 34

Discussion
How can champion lists be implemented in an inverted index?
How do we modify the data structure? (one option is sketched below)

(figure: the full postings lists -- Antony: 3, 4, 8, 16, 32, 64, 128; Brutus: 2, 4, 8, 16, 32, 64, 128; Caesar: 1, 2, 3, 5, 8, 13, 21, 34)
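One possible answer (an assumption, not the slides' own): keep the champion list as an extra, weight-sorted list stored next to each term's full postings, built once at index time.

    import heapq

    R = 3                                      # champion list size

    # term -> {docID: w_t,d}; illustrative weights, not from the slides
    postings = {
        "brutus": {2: 1.2, 4: 0.3, 8: 2.1, 16: 0.7, 32: 1.9, 64: 0.2, 128: 1.5},
    }

    champions = {
        term: heapq.nlargest(R, weights, key=weights.get)
        for term, weights in postings.items()
    }

    print(champions["brutus"])                 # [8, 32, 128]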

Slide 35

Outline
Calculating tf-idf score
Faster ranking
Static quality scores
Impact ordering
Cluster pruning

Slide 36

Static quality scores
We want top-ranking documents to be both relevant and authoritative
Relevance is being modeled by cosine scores
Authority is typically a query-independent property of a document
What are some examples of authority signals?
Wikipedia among websites
Articles in certain newspapers
A paper with many citations
Many diggs, Y!buzzes or del.icio.us marks
PageRank

Slide 37

Modeling authority
Assign to each document a query-independent quality score in [0, 1], denoted g(d)
Thus, a quantity like the number of citations is scaled into [0, 1]
Google PageRank

Slide 38

Net score
We want a total score that combines cosine relevance and authority:
net-score(q, d) = g(d) + cosine(q, d)
Can use some other linear combination than an equal weighting
Indeed, any function of the two "signals" of user happiness
Now we seek the top K docs by net score
Doing this exactly is similar to incorporating document length normalization
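A sketch of the weighted variant mentioned above (alpha is an assumed mixing parameter; alpha = 0.5 reproduces the slide's equal weighting up to a constant factor):

    def net_score(g_d, cos_qd, alpha=0.5):
        """Linear combination of static quality g(d) and cosine relevance."""
        return alpha * g_d + (1 - alpha) * cos_qd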

Slide 39

Top K by net score - fast methods
Order all postings by g(d)
Is this OK? Does it change our merge/traversal algorithms?
Key: this is still a common ordering for all postings

(figure: postings for Antony, Brutus and Caesar over docs 1, 2, 3, with g(1) = 0.5, g(2) = 0.25, g(3) = 1)

Slide 40

Why order postings by g(d)?
Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal
In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop the postings traversal early

(figure: the same g(d)-ordered postings, with g(1) = 0.5, g(2) = 0.25, g(3) = 1)

Slide 41

Champion lists in g(d)-ordering
We can still use the notion of champion lists...
Combine champion lists with g(d)-ordering:
Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_t,d
Seek the top-K results from only the docs in these champion lists

Slide 42

Outline
Calculating tf-idf score
Faster ranking
Static quality scores
Impact ordering
Cluster pruning

Slide 43

Impact-ordered postings
Why do we need a common ordering of the postings lists?
It allows us to easily traverse the postings lists and check for intersection
Is that required for our tf-idf traversal algorithm?

    float scores[N] = 0
    for each query term t:
        for each entry in t's postings list: docID, w_t,d
            scores[docID] += w_t,d
    return top k components of scores

Slide 44

Impact-ordered postings
The ordering no longer plays a role
Our algorithm for computing document scores "accumulates" scores for each document
Idea: sort each postings list by w_t,d
Only compute scores for docs for which w_t,d is high enough
Given this ordering, how might we construct A when processing a query?

Slide 45

Impact-ordering: early termination
When traversing a term's postings list, stop early after either:
a fixed number of r docs, or
w_t,d drops below some threshold
Take the union of the resulting sets of docs (one from the postings of each query term)
Compute scores only for docs in this union (see the sketch below)
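A sketch of both stopping rules over impact-ordered lists (the weights are illustrative):

    postings = {                               # term -> weight-sorted (docID, w_t,d)
        "antony": [(8, 1.1), (3, 0.4), (16, 0.2)],
        "brutus": [(2, 0.9), (8, 0.5), (64, 0.1)],
    }

    def candidate_set(query_terms, r=10, threshold=0.3):
        A = set()
        for t in query_terms:
            for i, (doc, w) in enumerate(postings.get(t, [])):
                if i >= r or w < threshold:    # lists are weight-sorted: safe to stop
                    break
                A.add(doc)
        return A

    print(sorted(candidate_set(["antony", "brutus"])))   # [2, 3, 8]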

Slide 46

Impact-ordering: idf-ordered terms
When considering the postings of query terms, look at them in order of decreasing idf
High-idf terms are likely to contribute most to the score
As we update the score contribution from each query term:
Stop if doc scores are relatively unchanged
Can apply to cosine or other net scores

Slide 47

Outline
Calculating tf-idf score
Faster ranking
Static quality scores
Impact ordering
Cluster pruning

Slide 48

Cluster pruning: preprocessing
Pick √N docs, call these leaders
For every other doc, pre-compute its nearest leader
Docs attached to a leader are called followers
Likely: each leader has ~√N followers

Slide 49

Cluster pruning: query processing
Process a query as follows:
Given query Q, find its nearest leader L
Seek the K nearest docs from among L's followers (see the sketch below)
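A minimal sketch of both phases, assuming some similarity(a, b) function (e.g. the cosine above) and treating docs as opaque ids:

    import math, random

    def preprocess(docs, similarity):
        """Pick sqrt(N) random leaders; attach each other doc to its nearest leader."""
        leaders = random.sample(list(docs), int(math.sqrt(len(docs))))
        followers = {lead: [] for lead in leaders}
        for d in docs:
            if d not in followers:
                nearest = max(leaders, key=lambda lead: similarity(d, lead))
                followers[nearest].append(d)
        return followers

    def search(q, followers, similarity, k):
        """Find the nearest leader, then rank only that leader and its followers."""
        lead = max(followers, key=lambda l: similarity(q, l))
        pool = [lead] + followers[lead]
        return sorted(pool, key=lambda d: -similarity(q, d))[:k]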

Slide 50

Visualization

(figure: a query point among leaders and their followers)

Slide 51

Cluster pruning variants
Have each follower attached to b1 (e.g. 2) nearest leaders
From the query, find b2 (e.g. 3) nearest leaders and their followers

(figure: query, leaders and followers as before)

Slide 52

Can Microsoft's Bing, or Anyone, Seriously Challenge Google?
Will it ever be possible to dethrone Google as the leader in web search?
What would a search engine need to improve upon the model Google offers?
Is Bing a serious threat to Google's dominance?