AI for Medicine, Lecture 04: Ranked Retrieval, Part I. January 30, 2023. Mohammad Hammoud, Carnegie Mellon University in Qatar.
1. AI for Medicine
Lecture 04: Ranked Retrieval, Part I
January 30, 2023
Mohammad Hammoud
Carnegie Mellon University in Qatar
2. Today…
Last Lecture: AI Applications in Medicine, Part II
Today's Lecture: Ranked Retrieval, Part I
Announcement: Assignment 1 is due today by midnight
3. Information Retrieval
Information Retrieval (IR) is concerned with searching over and ranking (large) data collections.
4. Information Retrieval
More precisely, Information Retrieval (IR) is concerned with finding material (usually documents, e.g., medical notes written by clinicians during patient-clinician encounters) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Example: Which notes include “thirst and fatigue but not dizziness”?
One way of doing this is to exhaustively search the whole collection, reading each document and checking whether it contains thirst and fatigue but not dizziness.
5-6. Exhaustively Searching the Text of Each Doc
Query: thirst and fatigue but not dizziness
The query is looked up (e.g., with grep in Unix) in the text of every document of a large document collection.
Very slow!
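To make the cost of exhaustive search concrete, here is a minimal Python sketch of a linear scan. The note texts and the helper name `linear_scan` are hypothetical and invented for illustration; the point is only that every document's text is read for every query.

```python
# Hypothetical toy collection; these note texts are invented for illustration.
notes = {
    "Note 1": "patient reports thirst, fatigue and dizziness",
    "Note 2": "fever and headache since yesterday",
    "Note 3": "recurring headache",
    "Note 4": "thirst and fatigue for two weeks",
}

def linear_scan(collection, required, excluded):
    """Return the names of documents that contain every required term
    and none of the excluded terms, by reading each document in turn."""
    matches = []
    for name, text in collection.items():
        # Crude tokenization: lowercase, drop commas, split on whitespace.
        words = set(text.lower().replace(",", " ").split())
        has_required = all(term in words for term in required)
        has_excluded = any(term in words for term in excluded)
        if has_required and not has_excluded:
            matches.append(name)
    return matches

result = linear_scan(notes, required=["thirst", "fatigue"], excluded=["dizziness"])
print(result)
```

The work grows with the total amount of text in the collection and is repeated for each query, which is why this approach becomes slow on large collections.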
7-12. Binary Term-Document Incidence Matrix
Alternatively, we can use a binary term-document incidence matrix, with terms as rows and documents as columns:

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     1       0       0       1       0       0
Fatigue    1       0       0       1       0       0
Dizziness  1       0       0       0       1       0
Fever      0       1       0       0       0       0
Cough      0       0       0       0       0       1
Headache   0       1       1       0       0       0

An entry is 1 if the note contains the term (e.g., Note 2 contains Headache) and 0 if it does not (e.g., Note 4 does not contain Headache).
Query: Which notes include thirst and fatigue but not dizziness?
Answer: Apply bitwise AND on the incidence vectors of thirst, fatigue, and dizziness (complemented):
  100100 (Thirst) AND 100100 (Fatigue) AND 011101 (NOT Dizziness) = 000100
Only Note 4 matches the query.
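The bitwise computation on the incidence vectors can be sketched in Python as follows. Storing each row as an integer bit pattern, with the leftmost bit standing for Note 1, is an assumption made for illustration; `MASK` and `complement` are hypothetical helper names.

```python
NUM_NOTES = 6

# Rows of the binary term-document incidence matrix as bit patterns.
# The leftmost bit corresponds to Note 1, the rightmost to Note 6.
incidence = {
    "thirst":    0b100100,
    "fatigue":   0b100100,
    "dizziness": 0b100010,
    "fever":     0b010000,
    "cough":     0b000001,
    "headache":  0b011000,
}

# Python integers are unbounded, so the complement is masked to 6 bits.
MASK = (1 << NUM_NOTES) - 1

def complement(vector):
    """Flip every bit of a 6-bit incidence vector."""
    return ~vector & MASK

# thirst AND fatigue AND NOT dizziness
answer = incidence["thirst"] & incidence["fatigue"] & complement(incidence["dizziness"])
answer_bits = format(answer, f"0{NUM_NOTES}b")
matching_notes = [f"Note {i + 1}" for i, bit in enumerate(answer_bits) if bit == "1"]

print(answer_bits, matching_notes)
```

Each Boolean operator maps to one machine-level bit operation per word of the vector, which is what makes this faster than scanning the text of every note.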
13. Boolean Queries
The “thirst and fatigue but not dizziness” query is a Boolean query: documents either match or do not match.
Boolean queries often produce too few or too many results (1000s): AND gives too few; OR gives too many.
This is not good for the majority of users, who usually do not want to wade through 1000s of results.
To address this issue, ranked retrieval models can be utilized.
14. Ranked Retrieval Models
With ranked retrieval models, a ranked list of documents is returned.
Only the most relevant results (e.g., the top 10) need to be shown, so as to avoid overwhelming users.
How can we rank documents in a collection with respect to a query?
We can assign a score (say, between 0 and 1) to each document-query pair. The score measures how well a document and a query match.
For a query with one term:
  The score should be 0 if the term does not occur in the document.
  The more frequently the term appears in the document, the higher the score should be.
15-16. Ranked Retrieval Models: Jaccard Coefficient
The Jaccard coefficient measures the overlap between two sets, say, A and B:
  $\mathrm{Jaccard}(A, B) = |A \cap B| / |A \cup B|$
  $\mathrm{Jaccard}(A, A) = 1$
  $\mathrm{Jaccard}(A, B) = 0$ if $A \cap B = \emptyset$
A and B do not have to be of the same size.
Jaccard(query, document) always produces a score between 0 and 1.
Example:
  Query: Cough and fever
  Document 1: The patient has fever
  Document 2: The patient does not have fever
  Jaccard(Query, Document 1) = 1/6 (1 shared term out of 6 distinct terms)
  Jaccard(Query, Document 2) = 1/8 (1 shared term out of 8 distinct terms)
Document 1 will appear before Document 2 in the ranked list!
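The two scores in this example can be checked with a short Python sketch. Treating each text as a set of lowercased, whitespace-separated words (so that "and" counts as a query term) is an assumption that reproduces 1/6 and 1/8; the helper names `tokens` and `jaccard` are hypothetical.

```python
from fractions import Fraction

def tokens(text):
    """Turn a text into a set of lowercased, whitespace-separated words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| as an exact fraction.
    Returning 0 for two empty sets is a convention chosen here."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a & b), len(union))

query = tokens("Cough and fever")
doc1 = tokens("The patient has fever")
doc2 = tokens("The patient does not have fever")

score1 = jaccard(query, doc1)
score2 = jaccard(query, doc2)
print(score1, score2)
```

Because the inputs are sets, repeating a word in a document does not change its score, which is one of the limitations of this measure.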
17-18. Ranked Retrieval Models: Jaccard Coefficient
Example:
  Query: Cough and fever
  Document 1: The patient has fever
  Document 2: The patient has fever and cough today, but yesterday and the day before she was totally okay, whereby she had no fever or cough or any symptom
  Jaccard(Query, Document 1) = 1/6
  Jaccard(Query, Document 2) = 3/21 = 1/7 (3 shared terms out of 21 distinct terms)
Document 1 will still appear before Document 2 in the ranked list, but it is not clear whether Document 1 is more relevant to the Query than Document 2!
19. Ranked Retrieval Models: Jaccard Coefficient
Jaccard(query, document) does not consider term frequency (i.e., how many times a term appears in a document).
A more sophisticated way of normalizing documents by length is also needed (different documents have different lengths).
20-22. Term-Document Count Matrix
To account for term frequency, we can use a term-document count matrix. Our previous binary term-document incidence matrix records only whether a term occurs in a note; a term-document count matrix records how many times it occurs:

Term       Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
Thirst     4       0       0       3       0       0
Fatigue    1       0       0       1       0       0
Dizziness  2       0       0       0       4       0
Fever      0       5       0       0       0       0
Cough      0       0       0       0       0       5
Headache   0       2       3       0       0       0

For example, Headache appeared 3 times in Note 3.
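As a rough illustration of how such a matrix might be built, the Python sketch below counts term occurrences per note with `collections.Counter`. The note texts and the vocabulary are hypothetical, invented for this example, and are not the notes behind the matrix above.

```python
from collections import Counter

# Hypothetical note texts, invented for illustration only.
notes = [
    "thirst thirst fatigue dizziness",
    "fever headache fever",
    "headache headache headache",
]
vocabulary = ["thirst", "fatigue", "dizziness", "fever", "headache"]

# One Counter per note; a Counter returns 0 for terms it has not seen.
counts_per_note = [Counter(text.split()) for text in notes]

# Rows are terms, columns are notes.
count_matrix = {
    term: [counts[term] for counts in counts_per_note] for term in vocabulary
}

# The binary incidence matrix is recovered by thresholding the counts.
incidence_matrix = {
    term: [1 if n > 0 else 0 for n in row] for term, row in count_matrix.items()
}

for term in vocabulary:
    print(f"{term:<10}", count_matrix[term], incidence_matrix[term])
```

The count matrix carries strictly more information than the incidence matrix, since the latter can always be derived from the former.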
23. Term Frequency
The term frequency, $\mathrm{tf}_{t,d}$, of term t in document d is defined as the number of times that t occurs in d.
But how should we use $\mathrm{tf}_{t,d}$ when computing query-document match scores?
A document with 10 occurrences of a term is more relevant than a document with 1 occurrence of the term, but not necessarily 10 times more relevant!
Relevance does not increase proportionally with term frequency.
To this end, we can use log-frequency weighting.
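24. Log-Frequency Weighting
The log-frequency weight, $w_{t,d}$, of term t in document d is:
  $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
Subsequently, the score of a document-query pair becomes the sum of these weights over the terms t that appear in both query q and document d:
  $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$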
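A minimal Python sketch of log-frequency weighting follows, assuming a base-10 logarithm and a weight of 0 for absent terms. The function names and the example counts are hypothetical.

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def log_frequency_score(query_terms, doc_counts):
    """Sum the log-frequency weights of the query terms found in the document.
    doc_counts maps each term to its number of occurrences in the document."""
    return sum(log_tf_weight(doc_counts.get(term, 0)) for term in set(query_terms))

# Ten occurrences weigh twice as much as one occurrence, not ten times as much.
print(log_tf_weight(1), log_tf_weight(10))

# Hypothetical document in which "cough" occurs 10 times and "headache" once.
example_doc = {"cough": 10, "headache": 1}
example_score = log_frequency_score(["cough", "headache", "fever"], example_doc)
print(example_score)
```

Terms that do not occur in the document, such as "fever" here, contribute nothing to the score.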
25-33. Log-Frequency Weighting: Example
Starting from our previous term-document count matrix, here is the corresponding term-document log-frequency matrix:

Term       Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
Thirst     1.60205999  0           0           1.47712125  0           0
Fatigue    1           0           0           1           0           0
Dizziness  1.30103     0           0           0           1.60205999  0
Fever      0           0           0           0           1           0
Cough      0           1.69897     0           0           0           1.69897
Headache   0           1.30103     1.47712125  0           0           0

Query: Cough and headache
Each note's score is the sum of its Cough and Headache entries: Note 2 scores 1.69897 + 1.30103 = 3, Note 6 scores 1.69897, Note 3 scores 1.47712125, and Notes 1, 4, and 5 score 0.
Rank (highest score first): Note 2, Note 6, Note 3, Note 1, Note 4, Note 5
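The ranking for the query can be reproduced with the Python sketch below. One assumption needs to be stated loudly: the Fever and Cough count rows are inferred from the log-frequency matrix (Fever once in Note 5; Cough five times in Notes 2 and 6), because those are the counts that yield the log-frequency values and the ranking shown. They differ from the Fever and Cough rows of the count matrix as printed on the earlier slides. The word "and" in the query is treated as a connective, not as a term.

```python
import math

notes = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

# Term counts per note. The fever and cough rows are ASSUMED: they are
# inferred from the log-frequency matrix, not copied from the count matrix.
counts = {
    "thirst":    [4, 0, 0, 3, 0, 0],
    "fatigue":   [1, 0, 0, 1, 0, 0],
    "dizziness": [2, 0, 0, 0, 4, 0],
    "fever":     [0, 0, 0, 0, 1, 0],
    "cough":     [0, 5, 0, 0, 0, 5],
    "headache":  [0, 2, 3, 0, 0, 0],
}

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def rank(query_terms):
    """Score every note against the query and sort by descending score.
    Ties keep their original note order because Python's sort is stable."""
    scores = [
        sum(log_tf_weight(counts[term][j]) for term in query_terms)
        for j in range(len(notes))
    ]
    order = sorted(range(len(notes)), key=lambda j: -scores[j])
    return [(notes[j], scores[j]) for j in order]

ranking = rank(["cough", "headache"])
for name, score in ranking:
    print(f"{name}: {score:.5f}")
```

Notes 1, 4, and 5 all score 0, so their relative order in the list reflects only the order in which the notes were processed.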
34. Rare vs. Common Terms
However, rare terms are more informative than frequent terms.
  E.g., stop words like “a”, “the”, “to”, “of”, etc., are frequent but not very informative.
  E.g., a document containing the word “arrhythmias” is very likely to be relevant to the query “arrhythmias”.
We want higher weights for rare terms than for more frequent terms.
This can be achieved using document frequency.
35-37. Document Frequency
The document frequency, $\mathrm{df}_t$, of term t is defined as the number of documents in the given collection that contain t.
Thus, $\mathrm{df}_t \le N$, where N is the number of documents in the collection.
$\mathrm{df}_t$ is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is).
As such, we can define an inverse document frequency, $\mathrm{idf}_t$, as:
  $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
  If $\mathrm{df}_t = 1$, $\mathrm{idf}_t = \log_{10} N$
  If $\mathrm{df}_t = N$, $\mathrm{idf}_t = 0$
The logarithm is used to “dampen” the effect of idf (that is, of the raw ratio $N / \mathrm{df}_t$).
Note: There is only one value of $\mathrm{idf}_t$ for each t in the collection.
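The two boundary cases can be checked with a small Python sketch, assuming a base-10 logarithm and the six-note collection used in the running example. The function name `inverse_document_frequency` is hypothetical, and the Thirst row is taken from the term-document count matrix.

```python
import math

def inverse_document_frequency(df, n_docs):
    """idf = log10(N / df). The term must occur in at least one document."""
    if not 1 <= df <= n_docs:
        raise ValueError("df must satisfy 1 <= df <= n_docs")
    return math.log10(n_docs / df)

n_docs = 6

# Document frequency is the number of notes with a non-zero count.
thirst_row = [4, 0, 0, 3, 0, 0]
df_thirst = sum(1 for count in thirst_row if count > 0)

idf_thirst = inverse_document_frequency(df_thirst, n_docs)   # log10(6 / 2)
idf_rare = inverse_document_frequency(1, n_docs)             # log10(6)
idf_everywhere = inverse_document_frequency(n_docs, n_docs)  # log10(1) = 0

print(df_thirst, idf_thirst, idf_rare, idf_everywhere)
```

A term that appears in every document gets an idf of 0, so it cannot influence a ranking that uses idf as a multiplicative factor.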
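38. TF.IDF (or TF-IDF) Weighting
To get a better weighting scheme, we can combine the log-frequency weight of the term frequency, $\mathrm{tf}_{t,d}$, with the inverse document frequency, $\mathrm{idf}_t$, as follows:
  $w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
Clearly, $w_{t,d}$ increases with:
  The number of occurrences of term t in document d
  The rarity of t in the collection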
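39. TF.IDF (or TF-IDF) Weighting
Subsequently, the score of a document-query pair becomes the sum of the TF-IDF weights over the terms t that appear in both query q and document d:
  $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$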
40-43. TF.IDF: Example
Starting from the term-document count matrix of our earlier example and the corresponding term-document log-frequency matrix, here is the respective term-document TF.IDF matrix:

Term       Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
Thirst     0.76437687  0           0           0.70476595  0           0
Fatigue    0.47712125  0           0           0.47712125  0           0
Dizziness  0.62074906  0           0           0           0.76437687  0
Fever      0           0           0           0           0.77815125  0
Cough      0           0.8106147   0           0           0           0.8106147
Headache   0           0.62074906  0.70476595  0           0           0

Query: Cough and headache
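The TF.IDF matrix and the scores for this query can be sketched in Python as follows. As a loudly stated assumption, the Fever and Cough count rows are inferred from the TF.IDF matrix itself (Fever once in Note 5; Cough five times in Notes 2 and 6), since those counts reproduce the values shown; they differ from the Fever and Cough rows of the count matrix as printed on the earlier slides. The resulting ranking is computed by this sketch and is not stated on the slide.

```python
import math

notes = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]
n_docs = len(notes)

# Term counts per note. The fever and cough rows are ASSUMED: they are
# inferred from the TF.IDF matrix, not copied from the count matrix.
counts = {
    "thirst":    [4, 0, 0, 3, 0, 0],
    "fatigue":   [1, 0, 0, 1, 0, 0],
    "dizziness": [2, 0, 0, 0, 4, 0],
    "fever":     [0, 0, 0, 0, 1, 0],
    "cough":     [0, 5, 0, 0, 0, 5],
    "headache":  [0, 2, 3, 0, 0, 0],
}

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(row):
    """log10(N / df), where df is the number of notes with a non-zero count.
    Every row above has df >= 1, so the division is safe here."""
    df = sum(1 for count in row if count > 0)
    return math.log10(n_docs / df)

# TF.IDF weight of every term in every note.
tfidf = {
    term: [log_tf_weight(count) * idf(row) for count in row]
    for term, row in counts.items()
}

# Score each note as the sum of the TF.IDF weights of the query terms.
query = ["cough", "headache"]
scores = {
    notes[j]: sum(tfidf[term][j] for term in query) for j in range(n_docs)
}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.8f}")
```

With these weights Note 2 scores highest because it contains both query terms, followed by Note 6 (Cough only) and Note 3 (Headache only).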
44. Next Class…
Ranked Retrieval, Part II
45. References
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.