Advanced Query Processing
Dr. Susan Gauch
Query Term Weights

The vector space model matches queries to documents with the inner product / cosine similarity measure:
Query vector · Document vector (inner product)
Normalized_q_vector, Normalized_doc_vector
Sum over all terms (i ∈ t in the vector space) of nwt_q_i * nwt_d_ij

We implement this with:
For all terms i with non-zero query weight
  For all documents j that contain term i
    Sum(nwt_d_ij)
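The nested loop above can be sketched with a toy inverted index and an accumulator dictionary. This is a minimal illustration, not the course's actual data structures; the postings layout (term → {doc id: nwt_d_ij}) and the document ids are assumptions.

```python
# Toy inverted index: term -> {doc_id: normalized document term weight nwt_d_ij}
# (assumed layout for illustration)
index = {
    "dog": {1: 0.8, 2: 0.3},
    "cat": {1: 0.2, 3: 0.9},
}

def score(query_terms, index):
    """Sum nwt_d_ij into an accumulator for every document that
    contains a query term (all query weights implicitly 1)."""
    acc = {}
    for term in query_terms:                           # all terms i with non-zero query weight
        for doc, nwt in index.get(term, {}).items():   # all documents j that contain term i
            acc[doc] = acc.get(doc, 0.0) + nwt
    return acc

print(score(["dog", "cat"], index))
```

Note that only documents containing at least one query term ever get an accumulator bucket, which is what makes the inverted-index formulation efficient.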
Query Term Weights

Where did the query term weights go? Essentially, we assume that all query terms are weighted 1.

If a term occurs twice in a query, e.g., "dog cat dog", we process "dog" twice and add the postings for "dog" twice, so we effectively have a q_wt of 2 for "dog".

Can do this more efficiently by preprocessing the query using a…. HASHTABLE! — used to count the term frequencies in the query:
  dog (2), cat (1)
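In Python, the hashtable counting step is a one-liner with `collections.Counter` (a dict subclass, i.e., a hashtable):

```python
from collections import Counter

# Count query term frequencies with a hashtable
q_wt = Counter("dog cat dog".split())
print(q_wt)   # Counter({'dog': 2, 'cat': 1})
```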
Query Term Weights – Simple Implementation
For all terms i with non-zero query weight
  For all documents j that contain term i
    Sum(q_wt_i * nwt_d_ij)
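Combining the hashtable preprocessing with the accumulator loop gives a weighted version of the earlier sketch; again, the index layout and document ids are illustrative assumptions.

```python
from collections import Counter

# Toy inverted index: term -> {doc_id: nwt_d_ij} (assumed layout)
index = {
    "dog": {1: 0.8, 2: 0.3},
    "cat": {1: 0.2, 3: 0.9},
}

def score(query, index):
    """Accumulate q_wt_i * nwt_d_ij for each document."""
    q_wt = Counter(query.split())                      # hashtable of query term frequencies
    acc = {}
    for term, w in q_wt.items():                       # terms i with non-zero query weight
        for doc, nwt in index.get(term, {}).items():   # documents j that contain term i
            acc[doc] = acc.get(doc, 0.0) + w * nwt
    return acc

print(score("dog cat dog", index))
```

Because "dog" has q_wt 2, its postings now count double instead of being processed twice, which is the efficiency gain the slide describes.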
Query Term Weights – Proper Implementation
Can change the query syntax to allow users to specify weights:
  Dog (2) Cat (1)
  Dog 0.7 Cat 0.3
Needs better query parsing
Can be tied to interface controls (sliders)
However, users are poor at selecting weights and often get worse retrieval, not better, so this is infrequently implemented
Query Term Weights – Document Similarity

Where are query term weights actually used? When trying to locate "similar" documents.

Consider: how do you find the documents most similar to document d_k?

Applications: plagiarism detection, document clustering/classification (unsupervised/supervised learning)

Simple implementation:
  Treat d_k as a query
  The top results are the most similar documents
Document Similarity

For all terms i with non-zero weight in d_k
  For all documents j that contain term i
    Sum(nwt_d_ik * nwt_d_ij)

What is the weight d_ik? The tf*idf of term i in document k.
We would need to store this, or start with the document and calculate it on the fly using the idf stored in the dict file.

Efficiency:
  Linear in the number of terms
  Very slow for long documents
  Instead: calculate tf*idf for all terms in document k, sort, and use the top n weighted terms (n ~ 10 .. 50)
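The top-n pruning step can be sketched as follows. The term frequencies and idf values below are made-up illustrations, and `top_n_terms` is a hypothetical helper name, not part of any real system.

```python
from collections import Counter

# Assumed inputs for illustration: term frequencies of document k,
# and a dict-file-style idf table.
tf_k = Counter({"retrieval": 12, "the": 40, "vector": 7, "query": 9})
idf = {"retrieval": 2.3, "the": 0.01, "vector": 1.9, "query": 1.5}

def top_n_terms(tf, idf, n=10):
    """Compute tf*idf for every term in the document, sort by weight,
    and keep only the top n terms to use as the 'query'."""
    weighted = {t: f * idf.get(t, 0.0) for t, f in tf.items()}
    return sorted(weighted, key=weighted.get, reverse=True)[:n]

print(top_n_terms(tf_k, idf, n=2))
```

Note how the very frequent term "the" drops out despite its high tf, because its low idf gives it a tiny weight; this is why tf*idf, not raw tf, drives the pruning.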
~Boolean Queries

The vector space model merely sums the weights of the query terms in each document, so the top document may not contain all of the query terms.

How do we implement quasi-Boolean retrieval?
  "+canine feline -teeth"
  Results must contain "canine", may contain "feline", and must not contain "teeth"

We need to expand the accumulator buckets to keep track of the number of required terms contributing to the weights and the number of excluded terms.
~Boolean Queries

Accumulator fields: Total, Num-Required, Num-Excluded

For regular terms (no + or -):
  Just add to Total (nothing new)
For required terms (+):
  Add to Total
  Add to Num-Required
~Boolean Queries

For excluded terms (-):
  Subtract from Total
  Add to Num-Excluded

Presenting results:
  First (only) show results where Num_required in the accumulator == Num_required in the query && Num_excluded == 0
  Sort by weight
  Can expand the results shown by later showing groups of results with:
    High weights, but missing 1 or more required terms
    High weights, but including 1 or more excluded terms
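The expanded accumulator can be sketched as below. The query representation (term, flag) and the postings layout are assumptions for illustration only.

```python
# Toy inverted index: term -> {doc_id: nwt_d_ij} (assumed layout)
index = {
    "canine": {1: 0.6, 2: 0.4},
    "feline": {1: 0.3, 3: 0.7},
    "teeth":  {2: 0.5},
}

def boolean_score(query, index):
    """Quasi-Boolean retrieval. query is a list of (term, flag) pairs,
    where flag is '+' (required), '-' (excluded), or '' (regular)."""
    acc = {}  # doc -> [Total, Num-Required, Num-Excluded]
    n_required = sum(1 for _, flag in query if flag == "+")
    for term, flag in query:
        for doc, nwt in index.get(term, {}).items():
            entry = acc.setdefault(doc, [0.0, 0, 0])
            if flag == "-":
                entry[0] -= nwt        # excluded term: subtract from Total
                entry[2] += 1          # bump Num-Excluded
            else:
                entry[0] += nwt        # regular or required: add to Total
                if flag == "+":
                    entry[1] += 1      # bump Num-Required
    # First (only) show results with all required terms and no excluded terms
    hits = [(doc, e[0]) for doc, e in acc.items()
            if e[1] == n_required and e[2] == 0]
    return sorted(hits, key=lambda h: h[1], reverse=True)

# "+canine feline -teeth"
print(boolean_score([("canine", "+"), ("feline", ""), ("teeth", "-")], index))
```

Relaxing the filter in the final list comprehension (e.g., allowing Num-Required < n_required) produces the secondary result groups the slide describes.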