Slide1
Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity
Bridget McInnes, Ted Pedersen, Ying Liu, Genevieve B. Melton, Serguei Pakhomov
Slide2
Objective of this work
Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System
Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures for Word Sense Disambiguation (WSD)
Slide3
Word Sense Disambiguation
Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.
TERM: tolerance
Drug Tolerance
Immune Tolerance
Slide4
Word Sense Disambiguation
Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.
Buspirone attenuates tolerance to morphine in mice with skin cancer
Drug Tolerance
Immune Tolerance
Slide5
Sense inventory: Unified Medical Language System
Unified Medical Language System (UMLS)
Semantic Network
Metathesaurus
~1.7 million biomedical and clinical concepts; integrated semi-automatically
CUIs (Concept Unique Identifiers), linked:
Hierarchical: PAR/CHD and RB/RN
Non-hierarchical: SIB, RO
Sources viewed together or independently
Medical Subject Headings (MSH)
SPECIALIST Lexicon
Biomedical and clinical terms, including variants
Slide6
Word Sense Disambiguation
Buspirone attenuates tolerance to morphine in mice with skin cancer
Drug Tolerance: C0013220
Immune Tolerance: C0020963
Concept Unique Identifiers: CUIs
Slide7
SenseRelate algorithm
Each possible sense of a target word is assigned a score [the sum of the similarity between it and its surrounding terms]
Assign the target word the sense with the highest score
Proposed by Patwardhan and Pedersen 2003 using WordNet
UMLS::SenseRelate is a modification of this algorithm using information from the UMLS
NEXT UP: an example
Slide15
SenseRelate Example
Buspirone attenuates tolerance to morphine in mice with skin cancer
Drug Tolerance: C0013220
Immune Tolerance: C0020963
Buspirone: C0006462
Morphine: C0026549
Mice: C0026809
Skin cancer: C0007114
Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
Immune Tolerance Score = 0.09 + 0.09 + 0.05 + 0.04 = 0.27
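The scoring step of the example can be sketched in a few lines of Python. This is only an illustration, not the UMLS::SenseRelate implementation: the similarity values are the ones shown on the example diagram (0.09, 0.09, 0.16, 0.11 for Drug Tolerance and 0.09, 0.09, 0.05, 0.04 for Immune Tolerance), and the function names are hypothetical.

```python
# Similarity values read off the example diagram. Which value belongs to which
# surrounding term is not recoverable from the slide, so each candidate sense
# simply keeps a list of its four values.
similarities = {
    "Drug Tolerance (C0013220)": [0.09, 0.09, 0.16, 0.11],
    "Immune Tolerance (C0020963)": [0.09, 0.09, 0.05, 0.04],
}


def sense_scores(sims_by_sense):
    """Sum the similarity values of each candidate sense."""
    return {sense: sum(values) for sense, values in sims_by_sense.items()}


def best_sense(sims_by_sense):
    """Return the candidate sense with the highest summed score."""
    scores = sense_scores(sims_by_sense)
    return max(scores, key=scores.get)


scores = sense_scores(similarities)
for sense, score in scores.items():
    print(f"{sense}: {score:.2f}")
print("Selected sense:", best_sense(similarities))
```

With these values the sums are 0.45 and 0.27, so Drug Tolerance is the sense assigned to "tolerance".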
Slide16
SenseRelate Assumption
An ambiguous word is often used in the sense that is most similar to the sense of the terms that surround it
Slide17
SenseRelate Components
Identifying the concepts of surrounding terms
Calculating semantic similarity
Slide21
Identifying the concepts of the surrounding terms
Use the SPECIALIST Lexicon to identify the terms, then map each term to a concept by a string match against the MRCONSO table in the UMLS
Buspirone attenuates tolerance to morphine in mice with skin cancer
SPECIALIST Lexicon:
...
skin cancer
skin grafting
skin disease
...
MRCONSO:
...
skin cancer C0007114
skin grafting C0037297
skin disease C0037274
...
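A minimal sketch of this two-step look-up, assuming tiny in-memory stand-ins for the two resources. The greedy longest-match strategy, the function name map_terms, and the max_len parameter are assumptions made for illustration; the slides do not say how the actual system segments terms.

```python
# Toy stand-in for the SPECIALIST Lexicon: the set of known terms.
LEXICON = {
    "buspirone", "morphine", "mice",
    "skin cancer", "skin grafting", "skin disease",
}

# Toy stand-in for MRCONSO: exact string -> CUI (values taken from the slides).
MRCONSO = {
    "buspirone": "C0006462",
    "morphine": "C0026549",
    "mice": "C0026809",
    "skin cancer": "C0007114",
    "skin grafting": "C0037297",
    "skin disease": "C0037274",
}


def map_terms(text, lexicon=LEXICON, mrconso=MRCONSO, max_len=3):
    """Find lexicon terms (longest match first) and look up their CUIs."""
    tokens = text.lower().split()
    mapped = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                # The CUI is None when the term has no exact match in MRCONSO.
                mapped.append((candidate, mrconso.get(candidate)))
                i += n
                break
        else:
            i += 1  # no lexicon term starts here; move on
    return mapped


sentence = "Buspirone attenuates tolerance to morphine in mice with skin cancer"
for term, cui in map_terms(sentence):
    print(term, cui)
```

Words that are not in the toy lexicon (here "attenuates", "tolerance", "to", "in", "with") are skipped.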
Slide22
Semantic Similarity Measures
Path-based measures
Path
Wu and Palmer
Leacock and Chodorow
Nguyen and Al-Mubaid
Information content (IC)-based measures
Resnik
Lin
Jiang and Conrath
Slide27
Path-based similarity measures
Use only the path information obtained from a taxonomy
Path measure
sim(c1,c2) = 1 / minpath(c1,c2)
where minpath is the shortest path between the two concepts
Leacock and Chodorow, 1998
sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
where D is the total depth of the taxonomy
Wu and Palmer, 1994
sim(c1,c2) = (2*depth(LCS(c1,c2))) / (depth(c1)+depth(c2))
where LCS is the least common subsumer of the two concepts
Nguyen and Al-Mubaid, 2006
sim(c1,c2) = log( (2 + minpath(c1,c2) - 1) * (D - depth(LCS(c1,c2))) )
Slide28
Path-based Similarity Measures
Use only the path information obtained from a taxonomy
[Figure: taxonomy fragment containing the following concepts]
Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114
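The path, Wu and Palmer, and Leacock and Chodorow measures can be sketched on a small taxonomy as follows. The parent links are an assumption: the figure names these seven concepts, but the exact hierarchy is not recoverable from the transcript, so the tree below is a guess used only to make the formulas runnable. Counting depth and path length in nodes (root depth 1) is also an assumption.

```python
import math

# ASSUMED parent links over the concepts named in the figure.
PARENT = {
    "Disease": None,
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def lcs(c1, c2):
    """Least common subsumer: the deepest concept that subsumes both."""
    seen = set(ancestors(c1))
    for candidate in ancestors(c2):  # walks upward, so the first hit is deepest
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no ancestor")


def minpath(c1, c2):
    """Shortest path counted in nodes (identical concepts give 1)."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


MAX_DEPTH = max(depth(c) for c in PARENT)  # D in the formulas


def sim_path(c1, c2):
    return 1 / minpath(c1, c2)


def sim_wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def sim_lch(c1, c2):
    return -math.log(minpath(c1, c2) / (2 * MAX_DEPTH))


pair = ("Drug Tolerance", "Skin cancer")
for name, measure in [("path", sim_path), ("wup", sim_wup), ("lch", sim_lch)]:
    print(name, round(measure(*pair), 4))
```

Under the assumed tree, Drug Tolerance and Skin cancer meet only at the root, so all three measures give a low similarity for this pair.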
Slide30
Information content-based Measures
Incorporate the probability of the concepts
IC = -log(P(concept))
P(concept)
Calculated by summing the probability of the concept and the probability of its descendants
Probabilities are obtained from an external corpus
Slide33
Information content-based Measures
Incorporate the probability of the concepts
IC = -log(P(concept))
Resnik, 1995
sim(c1,c2) = IC(LCS(c1,c2))
Jiang and Conrath, 1997
sim(c1,c2) = 1 / (IC(c1) + IC(c2) - 2*IC(LCS(c1,c2)))
Lin, 1998
sim(c1,c2) = (2*IC(LCS(c1,c2))) / (IC(c1) + IC(c2))
Slide34
IC-based similarity measures
Path information + probability of concepts (from an external corpus)
[Figure: taxonomy fragment containing the following concepts]
Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114
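A sketch of the IC computation and the Resnik, Jiang and Conrath, and Lin measures. Both the parent links and the corpus counts below are invented for illustration: the figure names the concepts but the hierarchy is not recoverable from the transcript, and no frequencies are given. The guards for a zero denominator are also an added assumption.

```python
import math

# ASSUMED parent links over the concepts named in the figure.
PARENT = {
    "Disease": None,
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}

# HYPOTHETICAL corpus frequencies; a real run would take these from a corpus.
COUNT = {
    "Disease": 5,
    "Drug Related Disorder": 10,
    "Drug Tolerance": 25,
    "Neoplasm": 10,
    "Neoplastic Disease": 10,
    "Malignant Neoplasm": 20,
    "Skin cancer": 20,
}
TOTAL = sum(COUNT.values())


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def probability(concept):
    """P(concept): its own frequency plus that of all its descendants."""
    mass = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return mass / TOTAL


def ic(concept):
    return -math.log(probability(concept))


def lcs(c1, c2):
    """Least common subsumer: the deepest concept that subsumes both."""
    seen = set(ancestors(c1))
    for candidate in ancestors(c2):
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no ancestor")


def sim_res(c1, c2):
    return ic(lcs(c1, c2))


def sim_jcn(c1, c2):
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    return math.inf if distance == 0 else 1 / distance


def sim_lin(c1, c2):
    total_ic = ic(c1) + ic(c2)
    return 0.0 if total_ic == 0 else 2 * ic(lcs(c1, c2)) / total_ic


pair = ("Skin cancer", "Malignant Neoplasm")
for name, measure in [("res", sim_res), ("jcn", sim_jcn), ("lin", sim_lin)]:
    print(name, round(measure(*pair), 4))
```

Because the root subsumes every counted occurrence, its probability is 1 and its IC is 0, so any pair whose least common subsumer is the root gets a Resnik and Lin similarity of 0.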
Slide35
Experimental Framework
Use the open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm
Path information: parent/child relations in the MSH source
Information content: calculated using the UMLSonMedline dataset created by NLM
Consists of concepts from the 2009AB UMLS and the frequency with which they occurred in Medline, using the Essie Search Engine (Ide et al 2007)
Medline: database of citations of biomedical/clinical articles
Slide36
Evaluation Data: MSH-WSD
MSH-WSD dataset (Jimeno-Yepes et al 2011)
203 target words (ambiguous words) from Medline
106 terms, e.g. tolerance
88 acronyms, e.g. CA (calcium, California)
9 mixtures, e.g. bat (brown adipose tissue)
Each target word has ~187 instances (Medline abstracts)
abstract = ~500 words
Each target word in the instances is assigned a concept from MSH by exploiting the MSH concepts manually assigned to the abstract
Average of 2.08 possible senses per target word
Majority sense over all the target words is 54.5%
Slide37
Results
[Figure: accuracy of the baseline, the path-based measures (path, lch, wup, nam), and the IC-based measures (res, jcn, lin)]
Slide38
Comparison across subsets of MSH-WSD
[Figure: accuracy across subsets of MSH-WSD]
Slide43
Window sizes
Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70
Buspirone attenuates tolerance to morphine in mice with skin_cancer
WINDOW SIZE = 2
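A small sketch of selecting the context for a given window size. It assumes the window counts positions on each side of the target word and that multi-word terms have already been joined into single tokens (as in skin_cancer); the function name context_window is hypothetical.

```python
def context_window(tokens, target_index, size):
    """Return up to `size` tokens on each side of the target, excluding it."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


tokens = "Buspirone attenuates tolerance to morphine in mice with skin_cancer".split()
target = tokens.index("tolerance")
print(context_window(tokens, target, 2))
```

For a window size of 2 this keeps "Buspirone", "attenuates", "to" and "morphine"; a larger window would also reach "mice" and "skin_cancer".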
Slide44
Comparison of window sizes for lin
[Figure: accuracy versus window size]
Slide45
Surrounding terms
Not all terms have a concept in the UMLS
Therefore, not all surrounding terms in the window are mapped to CUIs
Slide46
Window sizes versus mapped terms
[Figure: number of mappings versus window size]
Slide47
Future work: mapping terms
Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap
Obtain the terms from MetaMap and do a dictionary look-up in MRCONSO
Hypothesis: the terms obtained by MetaMap are more accurate than those obtained using the SPECIALIST Lexicon
Obtain the CUIs from MetaMap
Hypothesis: the CUIs obtained by MetaMap will be more accurate than the dictionary look-up
Slide48
Objective #1
Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS
UMLS::SenseRelate obtains statistically significantly higher disambiguation accuracy than the baseline
On par with previous unsupervised methods for terms
Slide49
Objective #2
Evaluate the efficacy of IC-based similarity measures over path-based measures on a secondary task
There is no statistically significant difference between the accuracies obtained by the IC-based measures
There is a statistically significant difference between the IC-based measures and the path-based measures
Slide50
Take home message:
An ambiguous word is often used in the sense that is most similar to the sense of the concepts of the terms that surround it
Slide51
Resources
Software:
UMLS::SenseRelate
http://search.cpan.org/dist/UMLS-SenseRelate/
UMLS::Similarity
http://search.cpan.org/dist/UMLS-Similarity/
Data:
MSH-WSD
http://wsd.nlm.nih.gov/collaboration.shtml