Presentation Transcript

Slide1

AI for Medicine
Lecture 15: Ranked Retrieval – Part I

March 16, 2022
Mohammad Hammoud
Carnegie Mellon University in Qatar

Slide2

Today…

Last Wednesday’s Session:
SVMs – Part III

Today’s Session:
Ranked Retrieval – Part I

Announcement:
Project: Each group will
Meet with me on the 20th or 21st of March
Present their idea to the class on the 23rd of March

Slide3

Information Retrieval

Information Retrieval (IR) is concerned with searching over and ranking (large) data collections.

Slide4

Information Retrieval

More precisely, Information Retrieval (IR) is concerned with finding material (usually documents, e.g., medical notes written by clinicians during patient-clinician encounters) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Example: Which notes include “thirst and fatigue but not dizziness”?

One way of doing this is to exhaustively search the whole collection, looking up each document and figuring out which document(s) contain thirst and fatigue but not dizziness.

Slide5

Exhaustively Searching the Text of Each Doc

Query: thirst and fatigue but not dizziness
Large Document Collection

Slide6

Exhaustively Searching the Text of Each Doc

Query: thirst and fatigue but not dizziness
Large Document Collection
Lookup (e.g., grep in Unix): Very Slow!
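This brute-force lookup can be sketched in a few lines of Python; the notes and query terms below are hypothetical stand-ins for a real collection:

```python
# Naive exhaustive search: scan every document's full text for each query.
# Cost grows with the total size of the collection, which is why it is slow.
notes = {
    "Note 1": "patient reports thirst fatigue and dizziness",
    "Note 2": "fever and headache noted",
    "Note 4": "ongoing thirst and fatigue",
}

def grep_like_search(notes, must_have, must_not_have):
    """Return IDs of notes containing all of must_have and none of must_not_have."""
    hits = []
    for note_id, text in notes.items():
        words = set(text.split())
        if all(t in words for t in must_have) and not any(t in words for t in must_not_have):
            hits.append(note_id)
    return hits

print(grep_like_search(notes, ["thirst", "fatigue"], ["dizziness"]))  # ['Note 4']
```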

Slide7

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix:

            Note 1   Note 2   Note 3   Note 4   Note 5   Note 6
Thirst         1        0        0        1        0        0
Fatigue        1        0        0        1        0        0
Dizziness      1        0        0        0        1        0
Fever          0        1        0        0        0        0
Cough          0        0        0        0        0        1
Headache       0        1        1        0        0        0

Rows are terms and columns are documents. A cell is 1 if the note (e.g., Note 2) contains the term (e.g., Headache), and 0 if the note (e.g., Note 4) does not contain the term.

Slide8

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix (the same matrix as on the previous slide).

Query: Which notes include thirst and fatigue but not dizziness?

Answer: Apply bitwise AND on the vectors of thirst, fatigue, and dizziness (complemented)

Slide9

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix (the same matrix as above).

Query: Which notes include thirst and fatigue but not dizziness?

Incidence vector: 100100 (thirst)

Slide10

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix (the same matrix as above).

Query: Which notes include thirst and fatigue but not dizziness?

Incidence vectors: 100100 (thirst) AND 100100 (fatigue)

Slide11

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix (the same matrix as above).

Query: Which notes include thirst and fatigue but not dizziness?

Incidence vectors: 100100 (thirst) AND 100100 (fatigue) AND 011101 (complemented dizziness)

Slide12

Binary Term-Document Incidence Matrix

Alternatively, we can use a binary term-document incidence matrix (the same matrix as above).

Query: Which notes include thirst and fatigue but not dizziness?

Incidence vectors: 100100 (thirst) AND 100100 (fatigue) AND 011101 (complemented dizziness) = 000100

Only the fourth bit is 1, so Note 4 is the only matching note.
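A sketch of this bitwise-AND trick in Python, packing each term's row of the incidence matrix into a 6-bit integer (leftmost bit = Note 1):

```python
# Each term's incidence vector from the matrix, packed into an integer.
incidence = {
    "thirst":    0b100100,
    "fatigue":   0b100100,
    "dizziness": 0b100010,
    "fever":     0b010000,
    "cough":     0b000001,
    "headache":  0b011000,
}
NUM_NOTES = 6
MASK = (1 << NUM_NOTES) - 1  # keeps the complement within 6 bits

# Query: thirst AND fatigue AND NOT dizziness
result = incidence["thirst"] & incidence["fatigue"] & (~incidence["dizziness"] & MASK)
print(f"{result:06b}")  # 000100 -> only Note 4 matches
```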

Slide13

Boolean Queries

The type of the “thirst and fatigue but not dizziness” query is Boolean: documents either match or do not match.

Boolean queries often produce too few or too many results (1000s): AND gives too few; OR gives too many. This is not good for the majority of users, who usually do not want to wade through 1000s of results.

To address this issue, ranked retrieval models can be utilized.

Slide14

Ranked Retrieval Models

With ranked retrieval models, a ranked list of documents is returned. Only the top most relevant (e.g., 10) results are shown, so as to avoid overwhelming users.

How can we rank documents in a collection with respect to a query? We can assign a score (say, between 0 and 1) to each document-query pair; the score measures how well the document and the query match.

For a query with one term:
The score should be 0 if the term does not occur in the document
The more frequently the term appears in the document, the higher the score should be

Slide15

Ranked Retrieval Models: Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets, say, A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1

Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever

Jaccard(Query, Document 1) = 1/6
Jaccard(Query, Document 2) = 1/8

Slide16

Ranked Retrieval Models: Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets, say, A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1

Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever

Document 1 will appear before Document 2 in the ranked list!

Slide17

Ranked Retrieval Models: Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets, say, A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1

Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever today but had fever and cough in the past four days

Jaccard(Query, Document 1) = 1/6
Jaccard(Query, Document 2) = 2/16 = 1/8

Slide18

Ranked Retrieval Models: Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets, say, A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1

Example:
Query: Cough and fever
Document 1: The patient has fever
Document 2: The patient does not have fever today but had fever and cough in the past four days

Document 1 will still appear before Document 2 in the ranked list, but it is not clear whether Document 1 is more relevant to the Query than Document 2!

Slide19

Ranked Retrieval Models: Jaccard Coefficient

The Jaccard coefficient measures the overlap between two sets, say, A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard(A, A) = 1
Jaccard(A, B) = 0 if A ∩ B = ∅
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1

However, Jaccard(query, document) does not consider term frequency (i.e., how many times a term appears in a document).
A more sophisticated way is also needed to normalize the documents by length (different documents have different lengths).
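The set-based Jaccard score can be sketched in Python; reusing the slide's first example (“cough and fever” vs. “the patient has fever”) gives 1/6:

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    if not (a | b):  # both empty: define the overlap as 0
        return 0.0
    return len(a & b) / len(a | b)

query = "cough and fever".split()
doc1 = "the patient has fever".split()
doc2 = "the patient does not have fever".split()

print(jaccard(query, doc1))  # 1/6: intersection {fever}, union of 6 distinct terms
print(jaccard(query, doc2))  # 1/8
```

Note that the word counts are discarded when the texts are turned into sets, which is exactly the term-frequency blindness the slide points out.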

 

Slide20

Term-Document Count Matrix

To account for term frequency, we can use a term-document count matrix. For reference, here is our previous binary term-document incidence matrix, which records only presence or absence:

            Note 1   Note 2   Note 3   Note 4   Note 5   Note 6
Thirst         1        0        0        1        0        0
Fatigue        1        0        0        1        0        0
Dizziness      1        0        0        0        1        0
Fever          0        1        0        0        0        0
Cough          0        0        0        0        0        1
Headache       0        1        1        0        0        0

Slide21

Term-Document Count Matrix

To account for term frequency, we can use a term-document count matrix:

            Note 1   Note 2   Note 3   Note 4   Note 5   Note 6
Thirst         4        0        0        3        0        0
Fatigue        1        0        0        1        0        0
Dizziness      2        0        0        0        4        0
Fever          0        5        0        0        0        0
Cough          0        0        0        0        0        5
Headache       0        2        3        0        0        0

Slide22

Term-Document Count Matrix

To account for term frequency, we can use a term-document count matrix (the same count matrix as on the previous slide). For example, Headache appeared 3 times in Note 3.

Slide23

Term Frequency

The term frequency, tf(t,d), of term t in document d is defined as the number of times that t occurs in d.

But how do we use tf(t,d) when computing query-document match scores? A document with 10 occurrences of a term is more relevant than a document with 1 occurrence of the term, but not necessarily 10 times more relevant! Relevance does not increase proportionally with term frequency.

To this end, we can use log-frequency weighting.

Slide24

Log-Frequency Weighting

The log-frequency weight, w(t,d), of term t in document d is:

w(t,d) = 1 + log10(tf(t,d))   if tf(t,d) > 0
w(t,d) = 0                    otherwise

Subsequently, the score of a document-query pair becomes the sum over the terms t appearing in both query q and document d, as follows:

score(q, d) = Σ over t in (q ∩ d) of w(t,d)
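A minimal Python sketch of log-frequency scoring, assuming the standard base-10 form 1 + log10(tf); base 10 is what reproduces the matrix values on the following slides (e.g., tf = 4 gives 1.60205999):

```python
import math

def log_freq_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum the log-frequency weights of the query terms the document contains."""
    return sum(log_freq_weight(doc_tf.get(t, 0)) for t in query_terms)

# Term counts of Note 1, taken from the count matrix on the earlier slides.
note1_tf = {"thirst": 4, "fatigue": 1, "dizziness": 2}
print(round(log_freq_weight(4), 8))            # 1.60205999
print(score(["thirst", "fatigue"], note1_tf))  # ≈ 2.60206
```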

 

 

 

Slide25

Log-Frequency Weighting: Example

Here is our previous term-document count matrix:

            Note 1   Note 2   Note 3   Note 4   Note 5   Note 6
Thirst         4        0        0        3        0        0
Fatigue        1        0        0        1        0        0
Dizziness      2        0        0        0        4        0
Fever          0        5        0        0        0        0
Cough          0        0        0        0        0        5
Headache       0        2        3        0        0        0

Slide26

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix:

            Note 1       Note 2       Note 3       Note 4       Note 5       Note 6
Thirst      1.60205999   0            0            1.47712125   0            0
Fatigue     1            0            0            1            0            0
Dizziness   1.30103      0            0            0            1.60205999   0
Fever       0            0            0            0            1            0
Cough       0            1.69897      0            0            0            1.69897
Headache    0            1.30103      1.47712125   0            0            0

Slide27

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as on the previous slide).

Query: Cough and headache

Rank (so far): Note 1

Slide28

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 1) = 0 + 0 = 0

Rank (so far): Note 1

Slide29

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 2) = 1.69897 + 1.30103 = 3

Rank (so far): Note 2, Note 1

Slide30

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 3) = 0 + 1.47712125 = 1.47712125

Rank (so far): Note 2, Note 3, Note 1

Slide31

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 4) = 0 + 0 = 0

Rank (so far): Note 2, Note 3, Note 1, Note 4

Slide32

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 5) = 0 + 0 = 0

Rank (so far): Note 2, Note 3, Note 1, Note 4, Note 5

Slide33

Log-Frequency Weighting: Example

Here is the corresponding term-document log frequency matrix (the same matrix as above).

Query: Cough and headache
Score(q, Note 6) = 1.69897 + 0 = 1.69897

Final rank: Note 2, Note 6, Note 3, Note 1, Note 4, Note 5
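This ranking can be reproduced with a short Python sketch that sums, for each note, the log-frequency weights of the two query terms (weights copied from the matrix; Python's stable sort keeps the zero-scoring notes in note order):

```python
# Log-frequency weights of the two query terms per note, from the matrix.
weights = {
    "cough":    [0, 1.69897, 0, 0, 0, 1.69897],
    "headache": [0, 1.30103, 1.47712125, 0, 0, 0],
}
query = ["cough", "headache"]

# Score each note, then sort by score descending (stable sort breaks ties).
scores = [(f"Note {i + 1}", sum(weights[t][i] for t in query)) for i in range(6)]
ranking = [note for note, s in sorted(scores, key=lambda pair: -pair[1])]
print(ranking)  # ['Note 2', 'Note 6', 'Note 3', 'Note 1', 'Note 4', 'Note 5']
```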

Slide34

Rare vs. Common Terms

But, rare terms are more informative than frequent terms

E.g., Stop words like “a”, “the”, “to”, “of”, etc., are frequent but not very informative

E.g., A document containing the word “arrhythmias” is very likely to be relevant to the query “arrhythmias”

We want higher weights for rare terms than for more frequent terms

This can be achieved using document frequency.

Slide35

Document Frequency

The document frequency, df(t), of term t is defined as the number of documents in the given collection that contain t. Thus, df(t) ≤ N, where N is the number of documents in the collection.

df(t) is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is).

As such, we can define an inverse document frequency, idf(t), as:

idf(t) = log10(N / df(t))

If df(t) = 1, idf(t) = log10(N), its maximum value
If df(t) = N, idf(t) = log10(1) = 0, its minimum value

Slide36

Document Frequency

The document frequency, df(t), of term t is defined as the number of documents in the given collection that contain t. Thus, df(t) ≤ N, where N is the number of documents in the collection.

df(t) is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is).

As such, we can define an inverse document frequency, idf(t), as:

idf(t) = log10(N / df(t))

We take the log of N / df(t), rather than using the raw ratio, to “dampen” the effect of idf(t).

Slide37

Document Frequency

The document frequency, df(t), of term t is defined as the number of documents in the given collection that contain t. Thus, df(t) ≤ N, where N is the number of documents in the collection.

df(t) is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is).

As such, we can define an inverse document frequency, idf(t), as:

idf(t) = log10(N / df(t))

Note: There is only one value of idf(t) for each term t in the collection (it does not depend on any particular document).
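A sketch of idf in Python, using the base-10 form above; N = 6 matches the six-note collection of the running example, where log10(6) = 0.77815125 is the largest possible idf:

```python
import math

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

N = 6  # number of notes in the running example
print(round(idf(1, N), 8))  # 0.77815125: a term in only one note is maximally informative
print(idf(N, N))            # 0.0: a term in every note carries no information
```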

 

Slide38

TF.IDF (or TF-IDF) Weighting

To get the best weighting scheme, we can combine term frequency, tf(t,d), with inverse document frequency, idf(t), as follows:

tf.idf(t,d) = (1 + log10(tf(t,d))) × log10(N / df(t))

Clearly, tf.idf(t,d) increases with:
The number of occurrences of term t in document d
And the rarity of t in the collection

Slide39

TF.IDF (or TF-IDF) Weighting

Subsequently, the score of a document-query pair becomes the sum over the terms t appearing in both query q and document d, as follows:

score(q, d) = Σ over t in (q ∩ d) of tf.idf(t,d) = Σ over t in (q ∩ d) of (1 + log10(tf(t,d))) × log10(N / df(t))
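Combining both pieces gives a small Python sketch of a single TF.IDF weight; the spot-check below reproduces the thirst-in-Note-1 entry of the TF.IDF matrix shown later (tf = 4, and thirst appears in 2 of the 6 notes):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF.IDF weight: (1 + log10(tf)) * log10(N / df); 0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# "thirst" occurs 4 times in Note 1 and appears in 2 of the 6 notes.
print(round(tf_idf(4, 2, 6), 8))  # 0.76437687
```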

 

 

Slide40

TF.IDF: Example

Here is the term-document count matrix of our earlier example:

            Note 1   Note 2   Note 3   Note 4   Note 5   Note 6
Thirst         4        0        0        3        0        0
Fatigue        1        0        0        1        0        0
Dizziness      2        0        0        0        4        0
Fever          0        5        0        0        0        0
Cough          0        0        0        0        0        5
Headache       0        2        3        0        0        0

Slide41

TF.IDF: Example

And, here is the corresponding term-document log frequency matrix:

            Note 1       Note 2       Note 3       Note 4       Note 5       Note 6
Thirst      1.60205999   0            0            1.47712125   0            0
Fatigue     1            0            0            1            0            0
Dizziness   1.30103      0            0            0            1.60205999   0
Fever       0            0            0            0            1            0
Cough       0            1.69897      0            0            0            1.69897
Headache    0            1.30103      1.47712125   0            0            0

Slide42

TF.IDF: Example

And, here is the respective term-document TF.IDF matrix:

            Note 1       Note 2       Note 3       Note 4       Note 5       Note 6
Thirst      0.76437687   0            0            0.70476595   0            0
Fatigue     0.47712125   0            0            0.47712125   0            0
Dizziness   0.62074906   0            0            0            0.76437687   0
Fever       0            0            0            0            0.77815125   0
Cough       0            0.8106147    0            0            0            0.8106147
Headache    0            0.62074906   0.70476595   0            0            0

Slide43

TF.IDF: Example

And, here is the respective term-document TF.IDF matrix (the same matrix as on the previous slide).

Query: Cough and headache
Score(q, Note 2) = 0.8106147 + 0.62074906 = 1.43136376, the highest score in the collection

Slide44

Next Wednesday’s Class…

Project Presentations

Ranked Retrieval – Part II

Slide45

References

Schütze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.