1
Vector Space Model for IR: Implementation Notes
CSC 575 Intelligent Information Retrieval
These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.
2
Implementation Issues for Ranking Systems
Recall the use of inverted files for the storage of index terms:
- the inverted file is split into a dictionary and a postings file
- the dictionary contains the terms, but will now also contain statistics about each term (e.g., number of postings) and its IDF value
- the postings file will contain the record ids and the weights for all occurrences of the term
[Figure: postings file records, each with a Doc field and a Freq field]
3
Implementation Issues for Ranking Systems
Four major options for storing weights in the postings file:
- Store the raw frequency
  - slow, but flexible (term weights can be changed without changing the index)
- Store the normalized frequency
  - the normalization would be done during the creation of the final dictionary and postings files
  - the normalized frequency would be inserted into the postings file instead of the raw term frequency
- Store the completely weighted term
  - allows simple addition of term weights during the search, rather than first multiplying by the IDF of the term
  - very fast, but changes to the index require changing all postings because the IDF changes; relevance feedback reweighting is also difficult
- No within-record frequency
  - postings records do not store weights
  - all processing will be done at search time
4
Implementation Issues for Ranking Systems
Searching the inverted file
- the diagram illustrates the basic process
- if the stem is found in the dictionary, the address of the postings list is returned along with the IDF and the number of postings
- in each posting, the record ids are read; the total term weight for each record id is added to a unique accumulator for that id (set up as a hash table)
- depending on which option for storing weights was used, additional normalization or IDF computation may be necessary
- as each query term is processed, its postings cause further additions to the accumulators
[Figure: the search process. Stages: Query; Parser; Dictionary Lookup Using Binary Search; Get Weights from Postings File; Accumulator; Sort by Weight. Data passed between stages: "Terms"; "Dictionary Entry"; "Rec. Numbers for Terms"; "Rec. Numbers; Total Weights".]
5
VSR Implementation: Inverted Index
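[Figure: inverted index. The index file lists the index terms (system, computer, database, science), each with its document frequency df (values shown: 3, 2, 4, 1). Each entry points to a postings list of (Dj, tfj) pairs; pairs shown: (D2, 4), (D5, 2), (D1, 3), (D7, 4).]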
6
TokenInfo Class
import java.util.ArrayList;

public class TokenInfo {
    // The IDF weight of this token
    public double idf;

    // A list of TokenOccurrences giving documents where
    // this token occurs
    public ArrayList<TokenOccurrence> occList;

    // Create an initially empty data structure
    public TokenInfo() {
        occList = new ArrayList<>();
        idf = 0.0;
    }
}
7
TokenOccurrence Class
// A lightweight object for storing information about an
// occurrence of a token (a.k.a. word, term) in a Document
public class TokenOccurrence {
    // A reference to the Document where it occurs
    public DocumentReference docRef = null;

    // The number of times it occurs in the Document
    public int count = 0;

    // Create an occurrence with these values
    public TokenOccurrence(DocumentReference docRef, int count) {
        this.docRef = docRef;
        this.count = count;
    }
}
8
DocumentReference Class
import java.io.File;

// A simple data structure for storing a reference to a
// document file that includes information on the length of
// its document vector. This is a lightweight object to store
// in the inverted index without storing an entire Document object.
public class DocumentReference {
    // The file where the referenced document is stored.
    public File file = null;

    // The length of the corresponding Document vector.
    public double length = 0.0;

    public DocumentReference(File file, double length) {
        this.file = file;
        this.length = length;
    }
    . . .
}
9
Representing a Document Vector
HashMapVector
- A data structure for a term vector of a document, stored as a HashMap that maps tokens to the weight of that token in the document
- Needed as an efficient, indexed representation of sparse document vectors
Possible operations associated with a HashMapVector
- Increment the weight of a given token
- Multiply the vector by a constant factor
- Return the maximum weight of any token in the vector
- Others: copy, dot product, add, etc.
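The operations listed above can be sketched as follows. This is a minimal illustration, not the course's actual HashMapVector: the class name SparseTermVector and all method names are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a sparse term vector (names are assumptions,
// not the API of the HashMapVector class used in the course code).
class SparseTermVector {
    // Maps each token to its weight; absent tokens have implicit weight 0.
    final Map<String, Double> weights = new HashMap<>();

    // Increment the weight of a token by the given amount.
    void increment(String token, double amount) {
        weights.merge(token, amount, Double::sum);
    }

    // Multiply every weight in the vector by a constant factor.
    void multiply(double factor) {
        weights.replaceAll((token, w) -> w * factor);
    }

    // Return the largest weight of any token (0.0 for an empty vector;
    // assumes weights are non-negative, as term counts are).
    double maxWeight() {
        double max = 0.0;
        for (double w : weights.values()) {
            max = Math.max(max, w);
        }
        return max;
    }

    // Dot product: only tokens present in both vectors contribute,
    // so it is enough to iterate over the smaller map.
    double dot(SparseTermVector other) {
        boolean thisIsSmaller = weights.size() <= other.weights.size();
        Map<String, Double> small = thisIsSmaller ? weights : other.weights;
        Map<String, Double> large = thisIsSmaller ? other.weights : weights;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }
}
```

Storing only the non-zero entries keeps each operation proportional to the number of distinct tokens in the document rather than to the vocabulary size.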
10
Creating an Inverted Index
Create an empty HashMap, H (will hold the dictionary);
For each document, D (i.e., file in an input directory):
    Create a HashMapVector, V, for D;
    For each token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurrence for T in D and add it to the occList in the TokenInfo for T;
Compute IDF for all tokens in H;
Compute vector lengths for all documents in H;
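The first loop of the procedure above might look like the sketch below. To stay self-contained it indexes in-memory strings rather than files, identifies documents by their list position, and uses a simple lowercase/non-word-character tokenizer; the names IndexSketch, Occurrence, termCounts, and build are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of building the dictionary H (token -> occurrence list).
class IndexSketch {
    // Minimal stand-in for TokenOccurrence: a document id plus a raw count.
    static class Occurrence {
        final int docId;  // position of the document in the input list
        final int count;  // number of times the token occurs in that document

        Occurrence(int docId, int count) {
            this.docId = docId;
            this.count = count;
        }
    }

    // Count the tokens of one document (plays the role of the vector V).
    // Tokenization here is an assumption: lowercase, split on non-word chars.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // For each document, add one Occurrence per distinct token to H,
    // creating the token's entry the first time it is seen.
    static Map<String, List<Occurrence>> build(List<String> docs) {
        Map<String, List<Occurrence>> index = new HashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (Map.Entry<String, Integer> e : termCounts(docs.get(d)).entrySet()) {
                index.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                     .add(new Occurrence(d, e.getValue()));
            }
        }
        return index;
    }
}
```

Because documents are processed in order, each occurrence list ends up sorted by document id, and its length is the token's document frequency.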
11
Computing IDF
Let N be the total number of Documents;
For each token, T, in H:
    Determine the total number of documents, M, in which T occurs (the length of T’s occList);
    Set the IDF for T to log(N/M);
Note this requires a second pass through all the tokens after all documents have been indexed.
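A minimal sketch of this second pass is shown below. The slide does not state the base of the logarithm; the natural log (Math.log) is used here as an assumption. The class name IdfSketch and the use of plain document-id lists as postings are also illustrative choices.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the IDF pass over the dictionary.
class IdfSketch {
    // postings maps each token to the ids of the documents that contain it;
    // totalDocs is N, the number of documents indexed.
    static Map<String, Double> computeIdf(Map<String, List<Integer>> postings, int totalDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            int m = e.getValue().size();  // M: number of documents containing the token
            // Cast to double so N/M is not truncated by integer division.
            idf.put(e.getKey(), Math.log((double) totalDocs / m));
        }
        return idf;
    }
}
```

A token that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to any score.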
12
Document Vector Length
- Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
- Remember that the weight of a token is TF * IDF.
- Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
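Written as a formula, the definition above reads as follows. The symbols are assumptions introduced for this illustration: tf_{ij} is the count of token i in document j, idf_i is the IDF of token i, and M_i is the number of documents containing token i (N as defined for the IDF computation).

```latex
\[
\lVert d_j \rVert = \sqrt{\sum_{i} \left( \mathit{tf}_{ij} \cdot \mathit{idf}_i \right)^{2}},
\qquad
\mathit{idf}_i = \log \frac{N}{M_i}
\]
```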
13
Computing Document Lengths
Assume the lengths of all document vectors are initialized to 0.0;
For each token T in H:
    Let I be the IDF weight of T;
    For each TokenOccurrence of T in document D:
        Let C be the count of T in D;
        Increment the length of D by (I*C)^2;
For each document D in H:
    Set the length of D to be the square root of the current stored length;
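The two loops above can be sketched as below. Documents are identified by integer ids and the lengths are kept in a separate map rather than inside DocumentReference objects; those choices, and the names LengthSketch and Occurrence, are assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of computing document vector lengths from the index.
class LengthSketch {
    // Minimal stand-in for TokenOccurrence: a document id plus a raw count.
    static class Occurrence {
        final int docId;
        final int count;

        Occurrence(int docId, int count) {
            this.docId = docId;
            this.count = count;
        }
    }

    // index: token -> occurrences; idf: token -> IDF weight.
    // Returns docId -> Euclidean length of the document's TF*IDF vector.
    static Map<Integer, Double> documentLengths(
            Map<String, List<Occurrence>> index, Map<String, Double> idf) {
        Map<Integer, Double> lengths = new HashMap<>();
        // First loop: accumulate the sum of squared weights per document.
        for (Map.Entry<String, List<Occurrence>> e : index.entrySet()) {
            double i = idf.get(e.getKey());
            for (Occurrence occ : e.getValue()) {
                double w = i * occ.count;
                lengths.merge(occ.docId, w * w, Double::sum);
            }
        }
        // Second loop: replace each sum of squares by its square root.
        lengths.replaceAll((doc, sumSq) -> Math.sqrt(sumSq));
        return lengths;
    }
}
```

Note that a document whose tokens never appear in the index would have no entry in the returned map; a fuller implementation would initialize every document to 0.0 first, as the pseudocode assumes.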
14
Time Complexity of Indexing
- Complexity of creating a vector and indexing a document of n tokens is O(n).
- So indexing m such documents is O(m n).
- Computing token IDFs for a vocabulary V is O(|V|).
- Computing vector lengths is also O(m n).
- Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.
15
Retrieval with an Inverted Index
- Tokens that are not in both the query and the document do not affect the dot products in the computation of cosine similarities.
- Usually the query is fairly short, and therefore its vector is extremely sparse.
- Use the inverted index to find the limited set of documents that contain at least one of the query words.
16
Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents. Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all N documents, O(|V| N), because |Q| << |V| and B << N.
[Figure: each word of the query Q = q1 q2 … qn points to its own list of B documents]
q1: D11 … D1B
q2: D21 … D2B
…
qn: Dn1 … DnB
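To get a feel for the gap between the two bounds, here is a rough comparison with made-up numbers. The values |Q| = 3, B = 1,000, |V| = 10^5, and N = 10^6 are purely hypothetical and are not taken from the notes.

```latex
\[
|Q| \cdot B = 3 \times 10^{3},
\qquad
|V| \cdot N = 10^{5} \times 10^{6} = 10^{11},
\qquad
\frac{|V| \cdot N}{|Q| \cdot B} \approx 3.3 \times 10^{7}
\]
```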
17
Processing the Query
- Incrementally compute the cosine similarity of each indexed document as query words are processed one by one.
- To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where the DocumentReference is the key and the partial accumulated score is the value.
18
Inverted-Index Retrieval Algorithm
Create a HashMapVector, Q, for the query.
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurrences of T from H;
    For each TokenOccurrence, O, in L:
        Let D be the document of O, and C be the count of O (tf of T in D);
        If D is not already in R (D was not previously retrieved)
            Then add D to R and initialize score to 0.0;
        Increment D’s score by W * I * C; (product of T-weight in Q and D)
19
Retrieval Algorithm (cont)
Compute the length, L, of the vector Q (square root of the sum of the squares of its weights).
For each retrieved document D in R:
    Let S be the current accumulated score of D; (S is the dot-product of D and Q)
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D’s final score to S/(L * Y);
Sort retrieved documents in R by final score;
Return results in an array.
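Putting both halves of the retrieval algorithm together, a minimal sketch might look like the following. It assumes the index, IDF values, and document lengths have already been computed and are passed in as maps keyed by token or integer document id; it returns a list rather than an array. Query tokens missing from the dictionary are skipped, which is an assumption, since the pseudocode does not say how to handle them. The names RetrievalSketch, Occurrence, and retrieve are illustrative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of inverted-index retrieval with cosine normalization.
class RetrievalSketch {
    // Minimal stand-in for TokenOccurrence: a document id plus a raw count.
    static class Occurrence {
        final int docId;
        final int count;

        Occurrence(int docId, int count) {
            this.docId = docId;
            this.count = count;
        }
    }

    // Returns (docId, cosine score) pairs sorted by descending score.
    static List<Map.Entry<Integer, Double>> retrieve(
            Map<String, Integer> queryCounts,
            Map<String, List<Occurrence>> index,
            Map<String, Double> idf,
            Map<Integer, Double> docLengths) {
        Map<Integer, Double> scores = new HashMap<>();  // R: docId -> partial dot product
        double querySumSq = 0.0;
        for (Map.Entry<String, Integer> q : queryCounts.entrySet()) {
            List<Occurrence> occs = index.get(q.getKey());
            if (occs == null) {
                continue;  // assumption: ignore tokens that are not in the dictionary
            }
            double i = idf.get(q.getKey());
            double w = q.getValue() * i;  // W = K * I, weight of the token in the query
            querySumSq += w * w;
            for (Occurrence o : occs) {
                // W * (I * C): query weight times document weight of the token
                scores.merge(o.docId, w * i * o.count, Double::sum);
            }
        }
        double queryLength = Math.sqrt(querySumSq);  // length of Q
        List<Map.Entry<Integer, Double>> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : scores.entrySet()) {
            double denom = queryLength * docLengths.get(e.getKey());
            double score = denom == 0.0 ? 0.0 : e.getValue() / denom;  // S / (L * Y)
            ranked.add(new AbstractMap.SimpleEntry<Integer, Double>(e.getKey(), score));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```

Only documents that share at least one token with the query ever receive an accumulator, so the work done is proportional to the postings touched rather than to the size of the collection.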