
Advanced Query Processing - PowerPoint Presentation

Uploaded On 2018-03-21



Presentation Transcript

Slide1

Advanced Query Processing

Dr. Susan Gauch

Slide2

Query Term Weights

The vector space model matches queries to documents with the inner product/cosine similarity measure

Query vector * Document vector (inner product)

Normalized_q_vector * Normalized_doc_vector

Sum over all terms i ∈ t (t in vector space):

$$\sum_{i \in t} nwt\_q_i \cdot nwt\_d_{ij}$$

We implement this with:

For all terms i with non-zero query weight
    For all documents j that contain term i
        Sum(nwt_d_ij)

Slide3
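As a rough illustration of the Slide2 loop, the sketch below sums document weights into one accumulator per document. The toy `INDEX`, its weights, and the function name `score` are assumptions made for the example, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d) postings,
# where nwt_d is the normalized weight of the term in that document.
INDEX = {
    "dog": [("d1", 0.5), ("d2", 0.25)],
    "cat": [("d2", 0.5), ("d3", 0.75)],
}

def score(query_terms, index):
    """Sum nwt_d_ij into one accumulator per document j, term by term."""
    accumulators = defaultdict(float)
    for term in query_terms:                       # terms i in the query
        for doc_id, nwt_d in index.get(term, []):  # documents j containing i
            accumulators[doc_id] += nwt_d
    # Highest accumulated weight first.
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

ranking = score(["dog", "cat"], INDEX)
```

Only documents that appear in some posting list ever get an accumulator, so the cost depends on the length of the posting lists touched rather than on the size of the collection.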

Query term weights

Where did the query term weights go?

Essentially, we assume that all query terms are weighted “1”

If a term occurs twice in a query

E.g., “dog cat dog”

Process “dog” twice, add the postings for “dog” twice, so we effectively have a q_wt of 2 for “dog”

Can do this more efficiently by preprocessing the query with a hashtable that counts the term frequencies in the query

Dog (2) cat (1)

Slide4

Query Term Weights – Simple Implementation

Can do this more efficiently by preprocessing the query with a hashtable that counts the term frequencies in the query

Dog (2) cat (1)

For all terms i with non-zero query weight
    For all documents j that contain term i
        Sum(q_wt_i * nwt_d_ij)

Slide5
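The hashtable preprocessing and weighted loop from Slide4 might look like the following minimal sketch. It uses Python's `collections.Counter` as the hashtable; the toy `INDEX` and the name `score_weighted` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d) postings.
INDEX = {
    "dog": [("d1", 0.5), ("d2", 0.25)],
    "cat": [("d2", 0.5), ("d3", 0.75)],
}

def score_weighted(query, index):
    """Count query term frequencies once, then read each posting list once."""
    q_wt = Counter(query.lower().split())          # e.g. dog -> 2, cat -> 1
    accumulators = defaultdict(float)
    for term, weight in q_wt.items():              # terms i with q_wt_i > 0
        for doc_id, nwt_d in index.get(term, []):  # documents j containing i
            accumulators[doc_id] += weight * nwt_d
    return dict(accumulators)

scores = score_weighted("dog cat dog", INDEX)
```

The result matches processing "dog" twice, but the posting list for "dog" is traversed only once.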

Query Term Weights – Proper Implementation

Can change query syntax to allow users to specify weights:

Dog (2) Cat (1)

Dog 0.7 Cat 0.3

Need better query parsing

Can tie to interfaces (sliders)

Users are poor at selecting weights and often get worse retrieval, not better, so this is infrequently implemented

Slide6
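To make the "better query parsing" point from Slide5 concrete, here is a small sketch that accepts both example syntaxes, "Dog (2) Cat (1)" and "Dog 0.7 Cat 0.3". The grammar, the regular expression, and the name `parse_weighted_query` are assumptions; the slides do not define an exact syntax, and this pattern does not reject malformed input such as an unbalanced parenthesis.

```python
import re

# Assumed grammar: an alphabetic term, optionally followed by a weight that is
# written either bare ("Dog 0.7") or in parentheses ("Dog (2)").
TOKEN = re.compile(r"([A-Za-z]+)(?:\s*\(?(\d+(?:\.\d+)?)\)?)?")

def parse_weighted_query(query):
    """Return {term: weight}; terms without an explicit weight count as 1."""
    weights = {}
    for term, weight in TOKEN.findall(query):
        term = term.lower()
        value = float(weight) if weight else 1.0
        weights[term] = weights.get(term, 0.0) + value  # repeats accumulate
    return weights

parsed = parse_weighted_query("Dog 0.7 Cat 0.3")
```

The resulting dictionary can serve as the q_wt table in the weighted accumulation loop.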

Query Term Weights – Document Similarity

Where are query term weights actually used?

When trying to locate “similar” documents

Consider: how do you find the documents most similar to document d_k?

Applications: plagiarism detection, document clustering/classification (unsupervised/supervised learning)

Simple implementation:

Treat d_k as a query

Top results are the most similar documents

Slide7

Document Similarity

For all terms i with non-zero weight in d_k
    For all documents j that contain term i
        Sum(nwt_d_ik * nwt_d_ij)

What is the weight d_ik?

The tf*idf of term i in d_k

We would need to store this

Or, start with the document and calculate it on the fly using the idf stored in the dict file

Efficiency

Linear in number of terms

Very slow for long documents

Calculate tf*idf for all terms in document k

Sort and use top n weighted terms (n ~ 10 .. 50)

Slide8
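The Slide7 shortcut (compute tf*idf for the terms of document k, keep the top n, and run them as a weighted query) could be sketched as below. The `IDF` table stands in for the idf values stored in the dict file; all data, the `exclude` parameter, and the name `similar_documents` are assumptions, and the query-side weights are left unnormalized for brevity.

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical data: idf per term (as stored in the dict file) and
# postings of (doc_id, nwt_d) per term.
IDF = {"dog": 1.0, "cat": 2.0, "the": 0.1}
INDEX = {
    "dog": [("d1", 0.5), ("d2", 0.25)],
    "cat": [("d2", 0.5), ("d3", 0.75)],
    "the": [("d1", 0.1), ("d2", 0.1), ("d3", 0.1)],
}

def similar_documents(doc_tokens, index, idf, n=2, exclude=None):
    """Use the top-n tf*idf terms of a document as a weighted query."""
    tf = Counter(doc_tokens)
    weights = {t: tf[t] * idf[t] for t in tf if t in idf}
    top_terms = heapq.nlargest(n, weights.items(), key=lambda kv: kv[1])
    accumulators = defaultdict(float)
    for term, w_k in top_terms:
        for doc_id, nwt_d in index.get(term, []):
            if doc_id != exclude:  # skip the source document itself
                accumulators[doc_id] += w_k * nwt_d
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

result = similar_documents(["the", "cat", "the", "dog", "cat"],
                           INDEX, IDF, n=2, exclude="d3")
```

Here the low-idf term "the" drops out of the top 2, so its long posting list is never read, which is the efficiency gain the slide describes.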

~Boolean Queries

Vector space model merely sums the weights of the query terms in each document

Top document may not have all query terms in it

How do we implement quasi-Boolean retrieval?

“+canine feline -teeth”

Results must have “canine”, may have “feline”, must not have “teeth”

Need to expand accumulator buckets to keep track of the number of required terms contributing to the weights and the number of excluded terms

Slide9
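Before the accumulators can track required and excluded terms, the query from Slide8 has to be split by prefix. The sketch below is one possible way to do that; the function name `parse_boolean_query` and the returned dictionary layout are assumptions.

```python
def parse_boolean_query(query):
    """Split a query into required (+), excluded (-), and regular terms."""
    parsed = {"required": [], "excluded": [], "regular": []}
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed["excluded"].append(token[1:])
        else:
            parsed["regular"].append(token)
    return parsed

parsed_query = parse_boolean_query("+canine feline -teeth")
```

The length of the required list is the value that each accumulator's required-term count is later compared against.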

~Boolean Queries

Accumulator:

Total

Num-Required

Num-Excluded

For regular terms (no + or -)

Just add to Total (nothing new)

For required terms (+)

Add to Total

Add to Num-Required

Slide10

Slide11

~Boolean Queries

For excluded terms (-)

Subtract from total

Add to Num-Excluded

Presenting results:

First (only) show results where Num_required in Accumulator == Num_required in query && Num_excluded == 0

Sort by weight

Can expand the results shown by later showing groups of results with

High weights, but missing 1 or more required terms

High weight, but including 1 or more excluded terms
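Putting the quasi-Boolean slides together, a minimal sketch of the expanded accumulator and the first-pass result filter might look like this. The `Accumulator` field names mirror Total, Num-Required, and Num-Excluded; the toy `INDEX`, the function name `boolean_search`, and the assumption that required terms are distinct are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Accumulator:
    total: float = 0.0
    num_required: int = 0
    num_excluded: int = 0

# Hypothetical toy inverted index: term -> list of (doc_id, nwt_d) postings.
INDEX = {
    "canine": [("d1", 0.5), ("d2", 0.25)],
    "feline": [("d1", 0.25), ("d3", 0.75)],
    "teeth": [("d2", 0.5)],
}

def boolean_search(required, regular, excluded, index):
    """Rank documents that contain every required term and no excluded term."""
    accs = {}

    def postings(term):
        # Yield each document's accumulator (created on demand) and weight.
        for doc_id, nwt_d in index.get(term, []):
            yield accs.setdefault(doc_id, Accumulator()), nwt_d

    for term in regular:
        for acc, weight in postings(term):
            acc.total += weight
    for term in required:
        for acc, weight in postings(term):
            acc.total += weight
            acc.num_required += 1
    for term in excluded:
        for acc, weight in postings(term):
            acc.total -= weight
            acc.num_excluded += 1

    hits = [(doc_id, acc.total) for doc_id, acc in accs.items()
            if acc.num_required == len(required) and acc.num_excluded == 0]
    return sorted(hits, key=lambda kv: kv[1], reverse=True)

results = boolean_search(["canine"], ["feline"], ["teeth"], INDEX)
```

Because the counts stay in the accumulators, the later result groups (missing a required term, or containing an excluded one) can be produced by relaxing the filter without rescoring.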