
AI for Medicine Lecture - PowerPoint Presentation

Uploaded by heavin (@heavin) on 2023-11-08


Presentation Transcript

1. AI for Medicine
Lecture 04: Ranked Retrieval – Part I
January 30, 2023
Mohammad Hammoud
Carnegie Mellon University in Qatar

2. Today…
Last Lecture: AI Applications in Medicine – Part II
Today’s Lecture: Ranked Retrieval – Part I
Announcement: Assignment 1 is due today by midnight

3. Information Retrieval
Information Retrieval (IR) is concerned with searching over and ranking (large) data collections

4. Information Retrieval
More precisely, Information Retrieval (IR) is concerned with:
  Finding material (usually documents, e.g., medical notes written by clinicians during patient-clinician encounters)
  of an unstructured nature (usually text)
  that satisfies an information need
  from within large collections (usually stored on computers)
Example: Which notes include “thirst and fatigue but not dizziness”?
One way of doing this is to exhaustively search the whole collection, looking up each document and figuring out which documents contain thirst and fatigue but not dizziness

5. Exhaustively Searching the Text of Each Doc
Query: thirst and fatigue but not dizziness
[Figure: the query is run against a large document collection]

6. Exhaustively Searching the Text of Each Doc
Lookup (e.g., grep in Unix) of the query in every document of the large collection
Very slow!
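
The exhaustive lookup described above can be sketched in a few lines of Python. This is only an illustration: the toy `notes` collection and the helper names `matches` and `linear_scan` are assumptions, not part of the lecture.

```python
def matches(text):
    """True if the note mentions thirst and fatigue but not dizziness."""
    words = set(text.lower().split())
    return "thirst" in words and "fatigue" in words and "dizziness" not in words

# Hypothetical toy collection; real notes would be read from storage.
notes = {
    "Note 1": "thirst fatigue dizziness",
    "Note 2": "fever headache",
    "Note 4": "thirst fatigue",
}

def linear_scan(collection):
    # Every query re-reads every document, so the cost grows with
    # the total amount of text in the collection.
    return [name for name, text in collection.items() if matches(text)]

print(linear_scan(notes))
```

Because the whole collection is scanned for each query, this approach becomes slow as the collection grows, which is the motivation for building an index or matrix once and reusing it.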

7. Binary Term-Document Incidence Matrix
Alternatively, we can use a binary term-document incidence matrix (rows are terms, columns are documents):

              Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
  Thirst         1       0       0       1       0       0
  Fatigue        1       0       0       1       0       0
  Dizziness      1       0       0       0       1       0
  Fever          0       1       0       0       0       0
  Cough          0       0       0       0       0       1
  Headache       0       1       1       0       0       0

An entry is 1 if the note (e.g., Note 2) contains the term (e.g., Headache)
An entry is 0 if the note (e.g., Note 4) does not contain the term (e.g., Headache)

8. Binary Term-Document Incidence Matrix
Query: Which notes include thirst and fatigue but not dizziness?
Answer: Apply bitwise AND on the vectors of thirst, fatigue, and dizziness (complemented)

9-12. Binary Term-Document Incidence Matrix
Incidence vectors (one bit per note, Note 1 first):
  Thirst                      100100
  Fatigue                     100100
  Dizziness (complemented)    011101
100100 AND 100100 AND 011101 = 000100
Only Note 4 satisfies the query
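
The bitwise AND over incidence vectors can be sketched as follows. The vectors are taken from the matrix above (leftmost bit is Note 1); the names `incidence`, `MASK`, and `boolean_query` are illustrative assumptions.

```python
NOTES = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

# Incidence vectors from the matrix, leftmost bit = Note 1.
incidence = {
    "thirst":    0b100100,
    "fatigue":   0b100100,
    "dizziness": 0b100010,
}

MASK = (1 << len(NOTES)) - 1  # keeps complements to 6 bits

def boolean_query(must, must_not):
    """AND the vectors of required terms with the complements of excluded terms."""
    result = MASK
    for term in must:
        result &= incidence[term]
    for term in must_not:
        result &= ~incidence[term] & MASK
    return result

bits = boolean_query(["thirst", "fatigue"], ["dizziness"])
hits = [n for i, n in enumerate(NOTES) if (bits >> (len(NOTES) - 1 - i)) & 1]
print(format(bits, "06b"), hits)
```

The mask matters because Python integers are unbounded: without it, complementing a vector would yield a negative number rather than a 6-bit pattern.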

13. Boolean Queries
The “thirst and fatigue but not dizziness” query is a Boolean query
Documents either match or do not match
Boolean queries often produce too few or too many results (1000s)
  AND gives too few; OR gives too many
This is not good for the majority of users, who usually do not want to wade through 1000s of results
To address this issue, ranked retrieval models can be utilized

14. Ranked Retrieval Models
With ranked retrieval models, a ranked list of documents is returned
Only the most relevant results (e.g., the top 10) need to be shown, so as to avoid overwhelming users
How can we rank documents in a collection with respect to a query?
We can assign a score (say, between 0 and 1) to each document-query pair
The score will measure how well a document and a query match
For a query with one term:
  The score should be 0 if the term does not occur in the document
  The more frequently the term appears in the document, the higher the score should be

15. Ranked Retrieval Models: Jaccard Coefficient
The Jaccard coefficient measures the overlap between two sets, say, A and B:
  $\mathrm{Jaccard}(A, B) = |A \cap B| / |A \cup B|$
  $\mathrm{Jaccard}(A, A) = 1$
  $\mathrm{Jaccard}(A, B) = 0$ if $A \cap B = \emptyset$
A and B do not have to be of the same size
Jaccard(query, document) always produces a score between 0 and 1
Example:
  Query: Cough and fever
  Document 1: The patient has fever
  Document 2: The patient does not have fever
  Jaccard(Query, Document 1) = 1/6
  Jaccard(Query, Document 2) = 1/8

16. Ranked Retrieval Models: Jaccard Coefficient
Document 1 will appear before Document 2 in the ranked list!

17. Ranked Retrieval Models: Jaccard Coefficient
Example:
  Query: Cough and fever
  Document 1: The patient has fever
  Document 2: The patient has fever and cough today, but yesterday and the day before she was totally okay, whereby she had no fever or cough or any symptom
  Jaccard(Query, Document 1) = 1/6
  Jaccard(Query, Document 2) = 3/20

18. Ranked Retrieval Models: Jaccard Coefficient
Document 1 will still appear before Document 2 in the ranked list, but it is not clear whether Document 1 is more relevant to the Query than Document 2!

19. Ranked Retrieval Models: Jaccard Coefficient
Jaccard(query, document) does not consider term frequency (i.e., how many times a term appears in a document)
A more sophisticated way is also needed to normalize the documents by length (different documents have different lengths)
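
The Jaccard scores in the examples can be checked with a short sketch. The tokenizer here (lowercase, split on whitespace) is an assumption, as are the names `jaccard` and `doc2_long`; exact fractions are used so the results can be compared directly with the slides.

```python
from fractions import Fraction

def jaccard(a, b):
    """Jaccard coefficient of the word sets of two strings."""
    # Assumed tokenizer: lowercase, then split on whitespace.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return Fraction(0)
    return Fraction(len(set_a & set_b), len(union))

query = "Cough and fever"
doc1 = "The patient has fever"
doc2 = "The patient does not have fever"
doc2_long = ("The patient has fever and cough today, but yesterday and the day "
             "before she was totally okay, whereby she had no fever or cough "
             "or any symptom")

for name, doc in [("Document 1", doc1), ("Document 2", doc2),
                  ("Document 2 (long)", doc2_long)]:
    print(name, jaccard(query, doc))
```

Under this tokenizer the first two scores come out as 1/6 and 1/8, matching the slides. The long Document 2 has 21 distinct words, giving 3/21 = 1/7, slightly below the 3/20 shown on the slide (which may count words differently); either way Document 1 still ranks first.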

20. Term-Document Count Matrix
To account for term frequency, we can use a term-document count matrix
Our previous binary term-document incidence matrix:

              Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
  Thirst         1       0       0       1       0       0
  Fatigue        1       0       0       1       0       0
  Dizziness      1       0       0       0       1       0
  Fever          0       1       0       0       0       0
  Cough          0       0       0       0       0       1
  Headache       0       1       1       0       0       0

21. Term-Document Count Matrix
A term-document count matrix:

              Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
  Thirst         4       0       0       3       0       0
  Fatigue        1       0       0       1       0       0
  Dizziness      2       0       0       0       4       0
  Fever          0       5       0       0       0       0
  Cough          0       0       0       0       0       5
  Headache       0       2       3       0       0       0

22. Term-Document Count Matrix
For example, Headache appeared 3 times in Note 3
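
As a rough illustration of how such a matrix could be built from raw text, here is a minimal sketch. The two toy notes, the whitespace tokenizer, and the function names `count_matrix` and `incidence_matrix` are all assumptions made for the example.

```python
from collections import Counter

def count_matrix(notes, vocabulary):
    """Map each term to its list of per-note counts, in note order."""
    per_note = [Counter(text.lower().split()) for text in notes.values()]
    return {term: [c[term] for c in per_note] for term in vocabulary}

def incidence_matrix(counts):
    """Collapse counts to 0/1 to recover the binary incidence matrix."""
    return {term: [1 if n > 0 else 0 for n in row] for term, row in counts.items()}

# Hypothetical toy notes, not taken from the lecture.
notes = {
    "Note A": "headache cough headache",
    "Note B": "cough",
}
counts = count_matrix(notes, ["headache", "cough", "fever"])
print(counts)
print(incidence_matrix(counts))
```

The incidence matrix is simply the count matrix with every nonzero entry replaced by 1, which is why the count matrix carries strictly more information.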

23. Term Frequency
The term frequency, $\mathrm{tf}_{t,d}$, of term t in document d is defined as the number of times that t occurs in d
But how do we use $\mathrm{tf}_{t,d}$ when computing query-document match scores?
A document with 10 occurrences of a term is more relevant than a document with 1 occurrence of the term, but not necessarily 10 times more relevant!
Relevance does not increase proportionally with term frequency
To this end, we can use log-frequency weighting

24. Log-Frequency Weighting
The log-frequency weight, $w_{t,d}$, of term t in document d is:
  $w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise
Subsequently, the score of a document-query pair becomes the sum of the weights of the terms t that appear in both query q and document d:
  $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} (1 + \log_{10} \mathrm{tf}_{t,d})$
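
A minimal sketch of the log-frequency weight and the resulting score, assuming base-10 logarithms (which is what the numbers in the lecture's matrices correspond to). The function names and the `thirst_counts` list, copied from the Thirst row of the count matrix, are illustrative.

```python
import math

def log_tf_weight(tf):
    """1 + log10(tf) for tf > 0, and 0 otherwise."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_counts):
    """Sum the weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_counts.get(t, 0)) for t in set(query_terms))

# Thirst row of the count matrix: Notes 1..6.
thirst_counts = [4, 0, 0, 3, 0, 0]
thirst_weights = [round(log_tf_weight(tf), 5) for tf in thirst_counts]
print(thirst_weights)

# Score of a document with 4 x thirst and 1 x fatigue for the query "thirst fatigue".
print(score(["thirst", "fatigue"], {"thirst": 4, "fatigue": 1}))
```

Going from 1 to 10 occurrences raises the weight only from 1 to 2, which captures the idea that relevance grows with term frequency but not proportionally.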

25. Log-Frequency Weighting: Example
Here is our previous term-document count matrix

              Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
  Thirst         4       0       0       3       0       0
  Fatigue        1       0       0       1       0       0
  Dizziness      2       0       0       0       4       0
  Fever          0       5       0       0       0       0
  Cough          0       0       0       0       0       5
  Headache       0       2       3       0       0       0

26. Log-Frequency Weighting: Example
Here is the corresponding term-document log frequency matrix

              Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
  Thirst      1.60205999  0           0           1.47712125  0           0
  Fatigue     1           0           0           1           0           0
  Dizziness   1.30103     0           0           0           1.60205999  0
  Fever       0           0           0           0           1           0
  Cough       0           1.69897     0           0           0           1.69897
  Headache    0           1.30103     1.47712125  0           0           0

Note: the Fever and Cough rows correspond to Fever occurring once in Note 5 and Cough occurring 5 times in Notes 2 and 6, which differs from the count matrix on slide 25; the values are reproduced as presented.

27-33. Log-Frequency Weighting: Example
Query: Cough and headache
Rank (each score is the sum of the Cough and Headache weights of the note):
  1. Note 2 (1.69897 + 1.30103 = 3.0)
  2. Note 6 (1.69897)
  3. Note 3 (1.47712125)
  4. Note 1 (0)
  5. Note 4 (0)
  6. Note 5 (0)
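
The ranking for the query "Cough and headache" can be reproduced with a short sketch. It uses the Cough and Headache rows of the log-frequency matrix exactly as printed on the slides; the names `log_tf`, `scores`, and `ranking` are assumptions for the example.

```python
NOTES = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

# Rows of the log-frequency matrix as printed on the slides (Notes 1..6).
log_tf = {
    "cough":    [0, 1.69897, 0, 0, 0, 1.69897],
    "headache": [0, 1.30103, 1.47712125, 0, 0, 0],
}

query = ["cough", "headache"]

# Score of each note = sum of the weights of the query terms it contains.
scores = {note: sum(log_tf[t][i] for t in query) for i, note in enumerate(NOTES)}

# sorted() is stable, so notes with equal scores keep their original order.
ranking = sorted(NOTES, key=lambda n: -scores[n])
print(ranking)
```

Notes 1, 4, and 5 all score 0 because they contain neither query term, so their relative order is arbitrary; here they simply stay in note order.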

34. Rare vs. Common Terms
But rare terms are more informative than frequent terms
  E.g., stop words like “a”, “the”, “to”, “of”, etc., are frequent but not very informative
  E.g., a document containing the word “arrhythmias” is very likely to be relevant to the query “arrhythmias”
We want higher weights for rare terms than for more frequent terms
This can be achieved using document frequency

35. Document Frequency
The document frequency, $\mathrm{df}_t$, of term t is defined as the number of documents in the given collection that contain t
Thus, $\mathrm{df}_t \le N$, where $N$ is the number of documents in the collection
$\mathrm{df}_t$ is an inverse measure of the informativeness of t (the smaller the number of documents that contain t, the more informative t is)
As such, we can define an inverse document frequency, $\mathrm{idf}_t$, as:
  $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$
If $\mathrm{df}_t = 1$, $\mathrm{idf}_t = \log_{10} N$
If $\mathrm{df}_t = N$, $\mathrm{idf}_t = 0$

36. Document Frequency
The logarithm is used to “dampen” the effect of idf

37. Document Frequency
Note: There is only one value of $\mathrm{idf}_t$ for each t in the collection
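
A minimal sketch of the inverse document frequency, assuming base-10 logarithms and a collection of N = 6 notes as in the running example. The function name `idf` and its argument checks are illustrative choices.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency log10(N / df) of a term found in df of N documents."""
    if not 1 <= df <= n_docs:
        raise ValueError("df must be between 1 and the number of documents")
    return math.log10(n_docs / df)

N = 6  # number of notes in the running example
for df in (1, 2, 6):
    print(df, round(idf(df, N), 8))
```

A term that appears in every document gets an idf of 0 and therefore contributes nothing to a score, while a term found in a single document gets the maximum value of log10(N).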

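38. TF.IDF (or TF-IDF) Weighting
To get the best weighting scheme, we can combine term frequency, $\mathrm{tf}_{t,d}$, with inverse document frequency, $\mathrm{idf}_t$, as follows:
  $\mathrm{tf.idf}_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$
Clearly, $\mathrm{tf.idf}_{t,d}$ increases with:
  The number of occurrences of term t in document d
  And the rarity of t in the collection

39. TF.IDF (or TF-IDF) Weighting
Subsequently, the score of a document-query pair becomes the sum of the TF.IDF weights of the terms t that appear in both query q and document d:
  $\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$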

40. TF.IDF: Example
Here is the term-document count matrix of our earlier example

              Note 1  Note 2  Note 3  Note 4  Note 5  Note 6
  Thirst         4       0       0       3       0       0
  Fatigue        1       0       0       1       0       0
  Dizziness      2       0       0       0       4       0
  Fever          0       5       0       0       0       0
  Cough          0       0       0       0       0       5
  Headache       0       2       3       0       0       0

41. TF.IDF: Example
And, here is the corresponding term-document log frequency matrix

              Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
  Thirst      1.60205999  0           0           1.47712125  0           0
  Fatigue     1           0           0           1           0           0
  Dizziness   1.30103     0           0           0           1.60205999  0
  Fever       0           0           0           0           1           0
  Cough       0           1.69897     0           0           0           1.69897
  Headache    0           1.30103     1.47712125  0           0           0

42-43. TF.IDF: Example
And, here is the respective term-document TF.IDF matrix

              Note 1      Note 2      Note 3      Note 4      Note 5      Note 6
  Thirst      0.76437687  0           0           0.70476595  0           0
  Fatigue     0.47712125  0           0           0.47712125  0           0
  Dizziness   0.62074906  0           0           0           0.76437687  0
  Fever       0           0           0           0           0.77815125  0
  Cough       0           0.8106147   0           0           0           0.8106147
  Headache    0           0.62074906  0.70476595  0           0           0

Query: Cough and headache
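
The TF.IDF weight and the score for the query "Cough and headache" can be sketched as below. The `tf_idf` function follows the log-frequency and idf definitions with base-10 logarithms; the rows in `tfidf_rows` are copied from the TF.IDF matrix as printed, and all names are assumptions for the example. The slides do not show a ranking for this query, so the ordering is simply what these printed values imply.

```python
import math

N_DOCS = 6
NOTES = ["Note 1", "Note 2", "Note 3", "Note 4", "Note 5", "Note 6"]

def tf_idf(tf, df, n_docs=N_DOCS):
    """(1 + log10 tf) * log10(N / df), or 0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Thirst occurs 4 times in Note 1 and appears in 2 of the 6 notes.
print(round(tf_idf(4, 2), 8))

# Rows of the TF.IDF matrix as printed on the slide (Notes 1..6).
tfidf_rows = {
    "cough":    [0, 0.8106147, 0, 0, 0, 0.8106147],
    "headache": [0, 0.62074906, 0.70476595, 0, 0, 0],
}

def score(query_terms):
    """Sum the TF.IDF weights of the query terms for every note."""
    return {note: sum(tfidf_rows[t][i] for t in query_terms)
            for i, note in enumerate(NOTES)}

scores = score(["cough", "headache"])
top = sorted(NOTES, key=lambda n: -scores[n])[:3]
print(top)
```

With these values Note 2 scores highest because it contains both query terms, followed by Note 6 and Note 3, which each contain one.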

44. Next Class…
Ranked Retrieval – Part II

45. References
Schütze, Hinrich, Christopher D. Manning, and Prabhakar Raghavan. Introduction to Information Retrieval. Vol. 39. Cambridge: Cambridge University Press, 2008.