Presentation Transcript

Slide 1: Vector Space Model for IR: Implementation Notes

CSC 575: Intelligent Information Retrieval

These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

Slide 2: Implementation Issues for Ranking Systems

Recall the use of inverted files for the storage of index terms:
- The index is split into a dictionary and a postings file.
- The dictionary contains the terms, but will now also contain statistics about each term (e.g., the number of postings) and its IDF value.
- The postings file contains the record ids and the weights for all occurrences of each term.

[Diagram: a dictionary entry pointing to a postings list of (Doc, Freq) pairs.]

Slide 3: Implementation Issues for Ranking Systems

Four major options for storing weights in the postings file:
1. Store the raw frequency
   - slow, but flexible (term weights can be changed without changing the index)
2. Store the normalized frequency
   - the normalization is done during the creation of the final dictionary and postings files
   - the normalized frequency is inserted into the postings file instead of the raw term frequency
3. Store the completely weighted term
   - allows simple addition of term weights during the search, rather than first multiplying by the IDF of the term
   - very fast, but any change to the index requires changing all postings because the IDF changes; relevance feedback reweighting is also difficult
4. Store no within-record frequency
   - postings records do not store weights
   - all processing is done at search time
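As a rough illustration of the trade-off among the first three options (a sketch only; the Posting class and its field names are hypothetical, not from these notes), the options differ in what number, if any, each posting stores:

// Hypothetical posting record illustrating the weight-storage options above.
public class Posting {
    public int docId;          // record id of the document
    // Option 1: raw term frequency; flexible, weighting happens at search time.
    public int rawTf;
    // Option 2: frequency normalized while building the index (e.g., tf / maxTf).
    public double normalizedTf;
    // Option 3: fully weighted term (e.g., tf * idf); fastest to search, but the
    // postings must be rebuilt whenever the IDF values change.
    public double tfIdfWeight;
    // Option 4 stores none of these; all weighting is done at search time.
}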

Slide 4: Implementation Issues for Ranking Systems

Searching the inverted file (the diagram below illustrates the basic process; a small code sketch follows it):
- If the stem is found in the dictionary, the address of the postings list is returned along with the IDF and the number of postings.
- In each posting, the record ids are read; the total term weight for each record id is added to a unique accumulator for that id (set up as a hash table).
- Depending on which option for storing weights was used, additional normalization or IDF computation may be necessary.
- As each query term is processed, its postings cause further additions to the accumulators.

[Diagram: terms flow from the Query into a Parser, then to Dictionary Lookup (using binary search), which returns a Dictionary Entry; the record numbers for the terms are used to Get Weights from the Postings File; an Accumulator collects record numbers with their total weights, which are then sorted by weight.]
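A minimal sketch of the accumulator step (built on the hypothetical Posting class above; illustrative, not the notes' own code):

import java.util.List;
import java.util.Map;

public class Accumulators {
    // Add one query term's postings into the per-record score accumulators.
    static void accumulate(Map<Integer, Double> accumulator,
                           double queryTermWeight,
                           List<Posting> postings) {
        for (Posting p : postings) {
            // merge() adds to an existing partial score or starts a new one.
            accumulator.merge(p.docId, queryTermWeight * p.tfIdfWeight, Double::sum);
        }
    }
}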

Slide 5: VSR Implementation: Inverted Index

[Diagram: the index file maps each index term (system, computer, database, science) to its document frequency (df values: 3, 2, 4, 1) and to a postings list of (Dj, tfj) pairs, e.g. (D2, 4), (D5, 2), (D1, 3), (D7, 4).]

Slide 6: TokenInfo Class

import java.util.ArrayList;

public class TokenInfo {
    // The inverse document frequency (IDF) of this token.
    public double idf;

    // A list of TokenOccurrences giving the documents where this token occurs.
    public ArrayList<TokenOccurrence> occList;

    // Create an initially empty data structure.
    public TokenInfo() {
        occList = new ArrayList<TokenOccurrence>();
        idf = 0.0;
    }
}

Slide 7: TokenOccurrence Class

// A lightweight object for storing information about an
// occurrence of a token (a.k.a. word, term) in a Document.
public class TokenOccurrence {
    // A reference to the Document where it occurs.
    public DocumentReference docRef = null;

    // The number of times it occurs in the Document.
    public int count = 0;

    // Create an occurrence with these values.
    public TokenOccurrence(DocumentReference docRef, int count) {
        this.docRef = docRef;
        this.count = count;
    }
}

Slide 8: DocumentReference Class

import java.io.File;

// A simple data structure for storing a reference to a
// document file that includes information on the length of
// its document vector. This is a lightweight object to store
// in the inverted index without storing an entire Document object.
public class DocumentReference {
    // The file where the referenced document is stored.
    public File file = null;

    // The length of the corresponding Document vector.
    public double length = 0.0;

    public DocumentReference(File file, double length) {
        this.file = file;
        this.length = length;
    }
    . . .
}

Slide 9: Representing a Document Vector

HashMapVector:
- A data structure for the term vector of a document, stored as a HashMap that maps each token to the weight of that token in the document.
- Needed as an efficient, indexed representation of sparse document vectors.

Possible operations associated with a HashMapVector:
- Increment the weight of a given token
- Multiply the vector by a constant factor
- Return the max weight of any token in the vector
- Others: copy, dot product, add, etc.
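A minimal sketch of such a class (the field and method names here are illustrative assumptions, not necessarily those of the actual course code):

import java.util.HashMap;
import java.util.Map;

// Sketch of a sparse term vector: token -> weight.
public class HashMapVector {
    public Map<String, Double> hashMap = new HashMap<String, Double>();

    // Increment the weight of a given token (adding it if absent).
    public double increment(String token, double amount) {
        double newWeight = hashMap.getOrDefault(token, 0.0) + amount;
        hashMap.put(token, newWeight);
        return newWeight;
    }

    // Multiply every weight in the vector by a constant factor.
    public void multiply(double factor) {
        hashMap.replaceAll((token, weight) -> weight * factor);
    }

    // Return the maximum weight of any token in the vector (0 if empty).
    public double maxWeight() {
        double max = 0.0;
        for (double w : hashMap.values()) max = Math.max(max, w);
        return max;
    }
}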

Slide 10: Creating an Inverted Index

Create an empty HashMap, H (this will hold the dictionary);
For each document, D (i.e., each file in an input directory):
    Create a HashMapVector, V, for D;
    For each token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurrence for T in D and add it to the occList in the TokenInfo for T;
Compute the IDF for all tokens in H;
Compute the vector lengths for all documents in H;
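In code, the indexing loop might look like the following sketch (tokenizeIntoVector() is a hypothetical helper, stubbed out here, that counts a file's tokens into a HashMapVector; this is not the notes' own code):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class Indexer {
    // Build the dictionary H from the files in a directory.
    static Map<String, TokenInfo> indexDirectory(File dir) {
        Map<String, TokenInfo> H = new HashMap<String, TokenInfo>();
        for (File f : dir.listFiles()) {
            DocumentReference docRef = new DocumentReference(f, 0.0); // length set later
            HashMapVector V = tokenizeIntoVector(f);
            for (Map.Entry<String, Double> e : V.hashMap.entrySet()) {
                TokenInfo info = H.get(e.getKey());
                if (info == null) {          // first occurrence of this token anywhere
                    info = new TokenInfo();
                    H.put(e.getKey(), info);
                }
                // Record that the token occurs e.getValue() times in this document.
                info.occList.add(new TokenOccurrence(docRef, e.getValue().intValue()));
            }
        }
        return H;
    }

    // Hypothetical tokenizer stub: a real system would read the file
    // and count its tokens into the vector.
    static HashMapVector tokenizeIntoVector(File f) {
        return new HashMapVector();
    }
}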

Slide 11: Computing IDF

Let N be the total number of documents;
For each token, T, in H:
    Determine the total number of documents, M, in which T occurs
        (the length of T's occList);
    Set the IDF for T to log(N/M);
Note that this requires a second pass through all the tokens after all documents have been indexed.
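Under the same assumptions as the sketches above (and assuming the natural log; this is not the notes' own code), the second pass is short:

import java.util.Map;

public class IdfPass {
    // Compute the IDF of every token after all documents have been indexed.
    static void computeIDFs(Map<String, TokenInfo> H, int n) {
        for (TokenInfo info : H.values()) {
            int m = info.occList.size();   // number of documents containing the token
            info.idf = Math.log((double) n / m);
        }
    }
}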

Slide 12: Document Vector Length

- Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
- Remember that the weight of a token is: TF * IDF.
- Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
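Written out (a restatement of the two bullets above, nothing more), the length of document D is:

    |D| = sqrt( sum over tokens T of (TF(T,D) * IDF(T))^2 )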

Slide 13: Computing Document Lengths

Assume the lengths of all document vectors are initialized to 0.0;
For each token, T, in H:
    Let I be the IDF weight of T;
    For each TokenOccurrence of T in document D:
        Let C be the count of T in D;
        Increment the length of D by (I*C)^2;
For each document D in H:
    Set the length of D to be the square root of the current stored length;
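A sketch of this pass in Java (again illustrative, built on the classes above rather than taken from the notes):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LengthPass {
    // Accumulate squared token weights into each document's length field,
    // then take the square root of each accumulated sum.
    static void computeLengths(Map<String, TokenInfo> H) {
        Set<DocumentReference> docs = new HashSet<DocumentReference>();
        for (TokenInfo info : H.values()) {
            for (TokenOccurrence occ : info.occList) {
                double w = info.idf * occ.count;   // token weight I*C in this document
                occ.docRef.length += w * w;        // increment by (I*C)^2
                docs.add(occ.docRef);
            }
        }
        for (DocumentReference d : docs) {
            d.length = Math.sqrt(d.length);        // finalize the vector length
        }
    }
}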

Slide 14: Time Complexity of Indexing

- The complexity of creating the vector and indexing a document of n tokens is O(n).
- So indexing m such documents is O(m n).
- Computing token IDFs for a vocabulary V is O(|V|).
- Computing vector lengths is also O(m n).
- Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

Slide 15: Retrieval with an Inverted Index

- Tokens that are not in both the query and the document do not affect the dot product in the computation of cosine similarity.
- Usually the query is fairly short, and therefore its vector is extremely sparse.
- Use the inverted index to find the limited set of documents that contain at least one of the query words.

Slide 16: Inverted Query Retrieval Efficiency

Assume that, on average, a query word appears in B documents. Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval, which examines all N documents at O(|V| N) cost, because |Q| << |V| and B << N.

Q = q1 q2 … qn
q1 → D11 … D1B
q2 → D21 … D2B
…
qn → Dn1 … DnB

Slide 17: Processing the Query

- Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
- To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where the DocumentReference is the key and the partial accumulated score is the value.

Slide 18: Inverted-Index Retrieval Algorithm

Create a HashMapVector, Q, for the query.
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurrences of T from H;
    For each TokenOccurrence, O, in L:
        Let D be the document of O, and C be the count of O (the tf of T in D);
        If D is not already in R (i.e., D was not previously retrieved),
            then add D to R and initialize its score to 0.0;
        Increment D's score by W * I * C (the product of T's weights in Q and D);

Slide 19: Retrieval Algorithm (cont.)

Compute the length, L, of the vector Q (the square root of the sum of the squares of its weights).
For each retrieved document D in R:
    Let S be the current accumulated score of D
        (S is the dot product of D and Q);
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D's final score to S / (L * Y);
Sort the retrieved documents in R by final score;
Return the results in an array.
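Putting slides 18 and 19 together, a compact sketch (again built on the illustrative classes above; not the notes' own code, and it simply skips query tokens that appear in no document):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Retriever {
    // Score, normalize, and rank documents for the query vector Q
    // against the dictionary H.
    static List<Map.Entry<DocumentReference, Double>> retrieve(
            Map<String, TokenInfo> H, HashMapVector Q) {
        Map<DocumentReference, Double> R = new HashMap<DocumentReference, Double>();
        double queryLengthSq = 0.0;
        for (Map.Entry<String, Double> e : Q.hashMap.entrySet()) {
            TokenInfo info = H.get(e.getKey());
            if (info == null) continue;            // token occurs in no document
            double w = e.getValue() * info.idf;    // W = K * I (query-side weight)
            queryLengthSq += w * w;
            for (TokenOccurrence occ : info.occList) {
                // Increment D's score by W * I * C.
                R.merge(occ.docRef, w * info.idf * occ.count, Double::sum);
            }
        }
        double queryLength = Math.sqrt(queryLengthSq);
        // Normalize each accumulated dot product by the query length and the
        // document length stored in its DocumentReference, then rank.
        List<Map.Entry<DocumentReference, Double>> results =
                new ArrayList<Map.Entry<DocumentReference, Double>>(R.entrySet());
        for (Map.Entry<DocumentReference, Double> entry : results) {
            entry.setValue(entry.getValue() / (queryLength * entry.getKey().length));
        }
        results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return results;
    }
}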