Ranked Retrieval – Part I. March 16, 2022. Mohammad Hammoud, Carnegie Mellon University in Qatar.
Slide 1
AI for Medicine
Lecture 15: Ranked Retrieval – Part I
March 16, 2022
Mohammad Hammoud
Carnegie Mellon University in Qatar
Slide 2: Today…
Last Wednesday’s Session:
SVMs – Part III
Today’s Session:
Ranked Retrieval – Part I
Announcement:
Project: Each group will
  Meet with me on the 20th or 21st of March
  Present their idea to the class on the 23rd of March
Slide 3: Information Retrieval
Information Retrieval (IR) is concerned with searching over and ranking (large) data collections
Slide 4: Information Retrieval
More precisely, Information Retrieval (IR) is concerned with:
Finding material (usually documents, e.g., medical notes written by clinicians during patient-clinician encounters) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Example: Which notes include “thirst and fatigue but not dizziness”?
One way of doing this is to exhaustively search the whole collection, looking up each document and figuring out which document(s) contain thirst and fatigue but not dizziness
Slides 5-6: Exhaustively Searching the Text of Each Doc
[Figure: the query “thirst and fatigue but not dizziness” is looked up in every document of a large document collection (e.g., with grep in Unix)]
Very slow!
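The exhaustive lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three note texts, the `tokens` helper, and the `exhaustive_search` function are hypothetical and not part of the lecture. The point is that every document is re-read for every query, which is why this approach scales poorly.

```python
import re

# Hypothetical notes standing in for a (much larger) document collection.
notes = {
    "Note 1": "Patient reports thirst, fatigue and dizziness.",
    "Note 2": "Fever and headache since yesterday.",
    "Note 4": "Persistent thirst and fatigue, otherwise well.",
}

def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def exhaustive_search(collection, must_have, must_not_have):
    """Scan every document; keep those with all required and no excluded terms."""
    hits = []
    for name, text in collection.items():
        words = tokens(text)  # each document is re-read for every query
        if must_have <= words and not (must_not_have & words):
            hits.append(name)
    return hits

matches = exhaustive_search(notes, {"thirst", "fatigue"}, {"dizziness"})
print(matches)  # ['Note 4']
```

The cost of one query grows with the total amount of text in the collection, which motivates precomputing a term-document structure instead.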
Slide 7: Binary Term-Document Incidence Matrix
Alternatively, we can use a binary term-document incidence matrix (rows are terms, columns are documents)

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     1       0       0       1       0       0
Fatigue    1       0       0       1       0       0
Dizziness  1       0       0       0       1       0
Fever      0       1       0       0       0       0
Cough      0       0       0       0       0       1
Headache   0       1       1       0       0       0

0 if the note (e.g., Note 4) does not contain the term (e.g., Headache)
1 if the note (e.g., Note 2) contains the term (e.g., Headache)
Slides 8-12: Binary Term-Document Incidence Matrix
(same binary term-document incidence matrix as on the previous slide)
Query: Which notes include thirst and fatigue but not dizziness?
Answer: Apply bitwise AND on the vectors of thirst, fatigue, and dizziness (complemented)
Incidence vectors:
100100 (thirst) AND 100100 (fatigue) AND 011101 (dizziness, complemented) = 000100
Only the fourth bit is 1 in all three vectors, so Note 4 is the only note that matches
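The bitwise computation above can be reproduced with Python integers. This is a minimal sketch: the vectors are the Thirst, Fatigue, and Dizziness rows of the incidence matrix, while the names `incidence`, `MASK`, and `complement` are illustrative choices, not something the slides define.

```python
# Incidence vectors from the matrix, one bit per note (leftmost bit = Note 1).
NUM_NOTES = 6
incidence = {
    "thirst":    0b100100,
    "fatigue":   0b100100,
    "dizziness": 0b100010,
}

MASK = (1 << NUM_NOTES) - 1  # 0b111111, keeps the complement to 6 bits

def complement(vector):
    """Flip every bit of a 6-bit incidence vector."""
    return ~vector & MASK

result = incidence["thirst"] & incidence["fatigue"] & complement(incidence["dizziness"])
print(format(result, "06b"))  # 000100

# Translate the set bits back into note names.
matching = [f"Note {i + 1}" for i in range(NUM_NOTES)
            if result & (1 << (NUM_NOTES - 1 - i))]
print(matching)  # ['Note 4']
```

The mask matters because Python integers have unbounded width, so `~vector` alone would yield a negative number rather than a 6-bit complement.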
Slide 13: Boolean Queries
The “thirst and fatigue but not dizziness” query is a Boolean query
Documents either match or do not match
Boolean queries often produce too few or too many results (1000s)
AND tends to give too few; OR tends to give too many
This is not good for the majority of users, who usually do not want to wade through 1000s of results
To address this issue, ranked retrieval models can be utilized
Slide 14: Ranked Retrieval Models
With ranked retrieval models, a ranked list of documents is returned
Only the most relevant results (e.g., the top 10) can be shown so as to avoid overwhelming users
How can we rank documents in a collection with respect to a query?
We can assign a score (say, between 0 and 1) to each document-query pair
The score will measure how well a document and a query match
For a query with one term:
The score should be 0 if the term does not occur in the document
The more frequently the term appears in the document, the higher the score should be
Slides 15-16: Ranked Retrieval Models: Jaccard Coefficient
Jaccard coefficient measures the overlap between two sets, say, A & B
$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Jaccard(A, A) = 1
Jaccard(A, B) = 0 if $A \cap B = \emptyset$
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1
Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever
Jaccard(Query, Document 1) = 1/6
Jaccard(Query, Document 2) = 1/8
Document 1 will appear before Document 2 in the ranked list!
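The two scores in this example can be checked with a short Python sketch. The tokenization (lowercasing and splitting on whitespace, with "and" counted as a query term) is an assumption chosen because it reproduces the 1/6 and 1/8 above; the helper names are illustrative.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets score 0 to avoid dividing by zero
    return len(a & b) / len(a | b)

def terms(text):
    """Assumed tokenization: lowercase and split on whitespace."""
    return text.lower().split()

query = terms("Cough and fever")
doc1 = terms("The patient has fever")
doc2 = terms("The patient does not have fever")

print(jaccard(query, doc1))  # 1/6, about 0.1667
print(jaccard(query, doc2))  # 1/8 = 0.125
```

Because both inputs are converted to sets, repeated words in a document have no effect on the score.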
Slides 17-18: Ranked Retrieval Models: Jaccard Coefficient
Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever today but had fever and cough in the past four days
Jaccard(Query, Document 1) = 1/6
Jaccard(Query, Document 2) = 2/16 = 1/8
Document 1 will still appear before Document 2 in the ranked list, but it is not clear whether Document 1 is more relevant to the Query than Document 2!
Slide 19: Ranked Retrieval Models: Jaccard Coefficient
Jaccard(query, document) does not consider term frequency (i.e., how many times a term appears in a document)
A more sophisticated way is also needed to normalize the documents by length (different documents have different lengths)
Slide 20: Term-Document Count Matrix
To account for term frequency, we can use a term-document count matrix
Our previous binary term-document incidence matrix:

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     1       0       0       1       0       0
Fatigue    1       0       0       1       0       0
Dizziness  1       0       0       0       1       0
Fever      0       1       0       0       0       0
Cough      0       0       0       0       0       1
Headache   0       1       1       0       0       0
Slides 21-22: Term-Document Count Matrix
To account for term frequency, we can use a term-document count matrix
A term-document count matrix:

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     4       0       0       3       0       0
Fatigue    1       0       0       1       0       0
Dizziness  2       0       0       0       4       0
Fever      0       5       0       0       0       0
Cough      0       0       0       0       0       5
Headache   0       2       3       0       0       0

Headache appeared 3 times in Note 3
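As a rough illustration of how such a count matrix could be built from raw text, here is a sketch using `collections.Counter`. The note texts and the vocabulary below are hypothetical and chosen only to keep the example small; they are not the lecture's notes.

```python
from collections import Counter

# Hypothetical note texts (not from the lecture), used only to show the construction.
notes = {
    "Note 1": "thirst thirst fatigue dizziness",
    "Note 2": "headache fever headache",
    "Note 3": "headache headache headache",
}
vocabulary = ["thirst", "fatigue", "dizziness", "fever", "headache"]

# One Counter per note maps each term to its number of occurrences.
counts = {name: Counter(text.lower().split()) for name, text in notes.items()}

# Rows are terms, columns are notes; a Counter returns 0 for absent terms.
count_matrix = {term: [counts[name][term] for name in notes] for term in vocabulary}
# Binary incidence only records whether the count is non-zero.
incidence_matrix = {term: [int(c > 0) for c in row] for term, row in count_matrix.items()}

print(count_matrix["headache"])      # [0, 2, 3]
print(incidence_matrix["headache"])  # [0, 1, 1]
```

The incidence matrix is recoverable from the count matrix but not the other way around, which is why the count matrix carries strictly more information.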
Slide 23: Term Frequency
The term frequency, $\mathrm{tf}_{t,d}$, of term t in document d is defined as the number of times that t occurs in d
But how to use $\mathrm{tf}_{t,d}$ when computing query-document match scores?
A document with 10 occurrences of a term is more relevant than a document with 1 occurrence of the term, but not necessarily 10 times more relevant!
Relevance does not increase proportionally with term frequency
To this end, we can use log-frequency weighting
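Slide 24: Log-Frequency Weighting
The log-frequency weight, $w_{t,d}$, of term t in document d is:
$w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
Subsequently, the score of a document-query pair becomes the sum of the weights of the terms t that appear in both query q and document d, as follows:
$\mathrm{Score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$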
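The weight and score definitions above can be sketched as two small Python functions. The base-10 logarithm is an assumption that agrees with the matrix values used later in the lecture (for example, a count of 4 maps to about 1.60206); the function names and the sample document are illustrative.

```python
import math

def log_frequency_weight(tf):
    """Return 1 + log10(tf) for tf > 0, and 0 for tf == 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_counts):
    """Sum the log-frequency weights of the query terms found in the document."""
    return sum(log_frequency_weight(doc_counts.get(t, 0)) for t in set(query_terms))

# Weights for the raw counts 0..5: the growth flattens as the count rises.
for tf in range(6):
    print(tf, round(log_frequency_weight(tf), 5))

# Hypothetical document with 3 occurrences of "thirst" and 1 of "fatigue".
doc = {"thirst": 3, "fatigue": 1}
print(round(score(["thirst", "fatigue", "dizziness"], doc), 5))  # 2.47712
```

Going from 1 to 10 occurrences only doubles the weight (from 1 to 2), which captures the idea that relevance does not grow proportionally with term frequency.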
Slide 25: Log-Frequency Weighting: Example
Here is our previous term-document count matrix

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     4       0       0       3       0       0
Fatigue    1       0       0       1       0       0
Dizziness  2       0       0       0       4       0
Fever      0       5       0       0       0       0
Cough      0       0       0       0       0       5
Headache   0       2       3       0       0       0
Slides 26-33: Log-Frequency Weighting: Example
Here is the corresponding term-document log frequency matrix

Term       Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
Thirst     1.60205999  0           0           1.47712125  0           0
Fatigue    1           0           0           1           0           0
Dizziness  1.30103     0           0           0           1.60205999  0
Fever      0           0           0           0           1           0
Cough      0           1.69897     0           0           0           1.69897
Headache   0           1.30103     1.47712125  0           0           0

Query: Cough and headache
Rank: Note 2, Note 6, Note 3, Note 1, Note 4, Note 5
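The ranking above can be reproduced with a small Python sketch. The weights are copied from the log frequency matrix as printed; note that its Fever and Cough rows do not line up with the count matrix shown earlier, so the sketch takes the printed weights at face value rather than recomputing them from the counts. The `rank` helper is an illustrative name, and treating "and" as a connective rather than a term is an assumption.

```python
NOTES = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

# Log-frequency weights copied from the matrix as printed (rows not needed
# for this query are omitted).
log_weights = {
    "cough":    [0, 1.69897, 0, 0, 0, 1.69897],
    "headache": [0, 1.30103, 1.47712125, 0, 0, 0],
}

def rank(query_terms, weights, notes):
    """Score each note by summing the weights of the query terms, then sort."""
    scores = [sum(weights[t][i] for t in query_terms if t in weights)
              for i in range(len(notes))]
    # sorted() is stable, so notes with equal scores keep their original order.
    order = sorted(range(len(notes)), key=lambda i: -scores[i])
    return [(notes[i], round(scores[i], 5)) for i in order]

ranking = rank(["cough", "headache"], log_weights, NOTES)
for note, s in ranking:
    print(note, s)
```

Note 2 comes first because it is the only note containing both query terms; Notes 1, 4, and 5 all score 0, so their relative order is arbitrary and here simply follows the column order.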
Slide 34: Rare vs. Common Terms
But rare terms are more informative than frequent terms
E.g., stop words like “a”, “the”, “to”, “of”, etc., are frequent but not very informative
E.g., a document containing the word “arrhythmias” is very likely to be relevant to the query “arrhythmias”
We want higher weights for rare terms than for more frequent terms
This can be achieved using document frequency
Slides 35-37: Document Frequency
The document frequency, $\mathrm{df}_t$, of term t is defined as the number of documents in the given collection that contain t
Thus, $\mathrm{df}_t \le N$, where $N$ is the number of documents in the collection
$\mathrm{df}_t$ is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is)
As such, we can define an inverse document frequency, $\mathrm{idf}_t$, as:
$\mathrm{idf}_t = \log_{10} \frac{N}{\mathrm{df}_t}$
If $\mathrm{df}_t = 1$, $\mathrm{idf}_t = \log_{10} N$
If $\mathrm{df}_t = N$, $\mathrm{idf}_t = 0$
The logarithm is used to “dampen” the effect of $N / \mathrm{df}_t$
Note: There is only one value of $\mathrm{idf}_t$ for each t in the collection
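A minimal Python sketch of the inverse document frequency follows. The base-10 logarithm is an assumption consistent with the TF.IDF values in the lecture's example, and the function name and the guard against out-of-range document frequencies are illustrative additions.

```python
import math

def inverse_document_frequency(num_docs, doc_freq):
    """idf_t = log10(N / df_t); assumes 1 <= df_t <= N."""
    if not 1 <= doc_freq <= num_docs:
        raise ValueError("document frequency must be between 1 and N")
    return math.log10(num_docs / doc_freq)

N = 6  # number of notes in the running example
for df in (1, 2, 3, 6):
    print(df, round(inverse_document_frequency(N, df), 8))
```

With N = 6, a term found in 2 notes gets an idf of about 0.4771, while a term found in all 6 notes gets 0 and therefore contributes nothing to a score.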
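Slides 38-39: TF.IDF (or TF-IDF) Weighting
To get a better weighting scheme, we can combine term frequency, $\mathrm{tf}_{t,d}$, with inverse document frequency, $\mathrm{idf}_t$, as follows:
$\mathrm{tf.idf}_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} \frac{N}{\mathrm{df}_t}$
Clearly, $\mathrm{tf.idf}_{t,d}$ increases with:
The number of occurrences of term t in document d
And the rarity of t in the collection
Subsequently, the score of a document-query pair becomes the sum of the TF.IDF weights of the terms t in both query q and document d as follows:
$\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$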
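The combined weight and the resulting score can be sketched as follows. This assumes the log-frequency form of term frequency and a base-10 logarithm, which together reproduce the values in the lecture's TF.IDF matrix; the function names and argument layout are illustrative.

```python
import math

def tf_idf(tf, df, num_docs):
    """(1 + log10 tf) * log10(N / df); 0 when the term is absent from the document."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(num_docs / df)

def score(query_terms, doc_counts, doc_freqs, num_docs):
    """Sum TF.IDF weights over the query terms that occur in the document."""
    return sum(tf_idf(doc_counts[t], doc_freqs[t], num_docs)
               for t in set(query_terms) if doc_counts.get(t, 0) > 0)

# A term occurring 4 times in a note and present in 2 of 6 notes.
print(round(tf_idf(4, 2, 6), 8))  # 0.76437687
```

The printed value matches the Thirst entry for Note 1 in the lecture's TF.IDF matrix, where Thirst occurs 4 times and appears in 2 of the 6 notes.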
Slide 40: TF.IDF: Example
Here is the term-document count matrix of our earlier example

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     4       0       0       3       0       0
Fatigue    1       0       0       1       0       0
Dizziness  2       0       0       0       4       0
Fever      0       5       0       0       0       0
Cough      0       0       0       0       0       5
Headache   0       2       3       0       0       0
Slide 41: TF.IDF: Example
And here is the corresponding term-document log frequency matrix

Term       Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
Thirst     1.60205999  0           0           1.47712125  0           0
Fatigue    1           0           0           1           0           0
Dizziness  1.30103     0           0           0           1.60205999  0
Fever      0           0           0           0           1           0
Cough      0           1.69897     0           0           0           1.69897
Headache   0           1.30103     1.47712125  0           0           0
Slides 42-43: TF.IDF: Example
And here is the respective term-document TF.IDF matrix

Term       Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
Thirst     0.76437687  0           0           0.70476595  0           0
Fatigue    0.47712125  0           0           0.47712125  0           0
Dizziness  0.62074906  0           0           0           0.76437687  0
Fever      0           0           0           0           0.77815125  0
Cough      0           0.8106147   0           0           0           0.8106147
Headache   0           0.62074906  0.70476595  0           0           0

Query: Cough and headache
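One way to score this query is sketched below. It starts from the Cough and Headache rows of the log frequency matrix as printed, derives each term's document frequency by counting non-zero entries, and scales by a base-10 idf. The helper `to_tf_idf` and the choice to show only the top three notes are illustrative assumptions, not the lecture's procedure.

```python
import math

NOTES = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

# Rows of the log-frequency matrix as printed, for the two query terms only.
log_weights = {
    "cough":    [0, 1.69897, 0, 0, 0, 1.69897],
    "headache": [0, 1.30103, 1.47712125, 0, 0, 0],
}

def to_tf_idf(row):
    """Scale a log-frequency row by idf = log10(N / df), with df = non-zero entries."""
    df = sum(1 for w in row if w > 0)
    idf = math.log10(len(row) / df)
    return [w * idf for w in row]

tf_idf = {term: to_tf_idf(row) for term, row in log_weights.items()}

# Score of a note = sum of the TF.IDF weights of the query terms.
scores = {note: sum(tf_idf[t][i] for t in tf_idf) for i, note in enumerate(NOTES)}
top = sorted(scores, key=scores.get, reverse=True)[:3]
for note in top:
    print(note, round(scores[note], 4))
```

Under these assumptions the top three are Note 2 (about 1.4314), Note 6 (about 0.8106), and Note 3 (about 0.7048). The order is the same as with log-frequency weighting alone because both query terms appear in 2 of the 6 notes and therefore share the same idf.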
Next Wednesday’s Class…
Project Presentations
Ranked Retrieval – Part II
Slide 45: References
Schütze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to information retrieval. Vol. 39. Cambridge: Cambridge University Press, 2008.