Semantic similarity, vector space models and word-sense disambiguation
Corpora and Statistical Methods
Lecture 6

Word sense disambiguation
Part 2
What are word senses?
Cognitive definition:
mental representation of meaning
used in psychological experiments
relies on introspection (notoriously deceptive)
Dictionary-based definition:
adopt sense definitions in a dictionary
most frequently used resource is WordNet
WordNet
Taxonomic representation of words (“concepts”)
Each word belongs to a synset, which contains near-synonyms
Each synset has a gloss
Words with multiple senses (polysemy) belong to multiple synsets
Synsets are organised by hyponymy (IS-A) relations
How many senses?
Example:
interest
pay 3% interest on a loan
showed interest in something
purchased an interest in a company.
the national interest…
have X’s best interest at heart
have an interest in word senses
The economy is run by business interests
WordNet entry for interest (noun)
1. a sense of concern with and curiosity about someone or something … (Synonym: involvement)
2. the power of attracting or holding one’s interest… (Synonym: interestingness)
3. a reason for wanting something done (Synonym: sake)
4. a fixed charge for borrowing money…
5. a diversion that occupies one’s time and thoughts … (Synonym: pastime)
6. a right or legal share of something; a financial involvement with something (Synonym: stake)
7. (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)
Some issues
Are all these really distinct senses? Is WordNet too fine-grained?
Would native speakers distinguish all these as different?
Cf. the distinction between sense ambiguity and underspecification (vagueness):
one could argue that there are fewer senses, but these are underspecified out of context
Translation equivalence
Many WSD applications rely on translation equivalence
Given: parallel corpus (e.g. English-German)
if word w in English has n translations in German, then each translation represents a sense
e.g. German translations of interest:
Zins: financial charge (WN sense 4)
Anteil: stake in a company (WN sense 6)
Interesse: all other senses
Some terminology
WSD task: given an ambiguous word, find the intended sense in context
Sense tagging: task of labelling words as belonging to one sense or another
needs some a priori characterisation of the senses of each relevant word
Discrimination: distinguishes between occurrences of words based on senses
not necessarily explicit labelling
Some more terminology
Two types of WSD task:
Lexical sample task
: focuses on disambiguating a small set of target words, using an inventory of the senses of those words.
All-words task
: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated
Serious data sparseness problems!
Approaches to WSD
All methods rely on training data. Basic idea:
Given word w in context c, learn how to predict sense s of w based on various features of w
Supervised learning: training data is labelled with correct senses
can do sense tagging
Unsupervised learning: training data is unlabelled
but many other knowledge sources are used
cannot do sense tagging, since this requires a priori senses
Supervised learning
Words in training data labelled with their senses
She pays 3% interest/INTEREST-MONEY on the loan.
He showed a lot of interest/INTEREST-CURIOSITY in the painting.
Similar to POS tagging:
given a corpus tagged with senses
define features that indicate one sense over another
learn a model that predicts the correct sense given the features
Features (e.g. plant)
Neighbouring words:
plant life
manufacturing plant
assembly plant
plant closure
plant species
Content words in a larger window:
animal
equipment
employee
automatic
Other features
Syntactically related words
e.g. object, subject….
Topic of the text
is it about SPORT? POLITICS?
Part-of-speech tag, surrounding part-of-speech tags
Some principles proposed (Yarowsky 1995)
One sense per discourse:
typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. same document)
One sense per collocation:
nearby words provide clues as to the sense, depending on the distance and syntactic relationship
e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant
Training data
SENSEVAL: shared-task competition
datasets available for WSD, among other things
annotated corpora in many languages
Pseudo-words:
create a training corpus by artificially conflating words
e.g. replace all occurrences of man and hammer with man-hammer
easy way to create training data
Multilingual parallel corpora:
translated texts aligned at the sentence level
translation indicates sense
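A pseudo-word training set is easy to build programmatically. A minimal sketch (the function name and the toy corpus are illustrative, not from the original material):

```python
def make_pseudoword_corpus(sentences, word_a, word_b):
    """Conflate every occurrence of word_a or word_b into one artificial
    ambiguous token, keeping the original word as the gold sense label."""
    pseudo = f"{word_a}-{word_b}"
    labelled = []
    for sent in sentences:
        tokens, senses = [], []
        for tok in sent.split():
            if tok in (word_a, word_b):
                senses.append(tok)      # the original word = the "true" sense
                tokens.append(pseudo)
            else:
                tokens.append(tok)
        labelled.append((" ".join(tokens), senses))
    return labelled

corpus = ["the man walked home", "she hit the nail with a hammer"]
data = make_pseudoword_corpus(corpus, "man", "hammer")
# data[0] == ("the man-hammer walked home", ["man"])
```

A disambiguation system is then trained to recover the original word from the context of each man-hammer occurrence, with accuracy measured against the recorded labels.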
Data representation
Example sentence: An electric guitar and bass player stand off to one side...
Target word: bass
Possible senses: fish, musical instrument...
Relevant features are represented as vectors, e.g.:
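Such a feature vector can be sketched as a bag of words in a small window around the target plus the immediate left/right collocates; everything below (function name, window size, feature labels) is an illustrative assumption:

```python
def context_features(tokens, target_index, window=3):
    """Bag-of-words features from a +/- window around the target word,
    plus positional features for the immediate neighbours."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    feats = {}
    for i in range(lo, hi):
        if i != target_index:
            feats[f"word={tokens[i]}"] = 1
    # collocational features: word immediately to the left/right
    if target_index > 0:
        feats[f"left={tokens[target_index - 1]}"] = 1
    if target_index + 1 < len(tokens):
        feats[f"right={tokens[target_index + 1]}"] = 1
    return feats

sent = "an electric guitar and bass player stand off to one side".split()
fv = context_features(sent, sent.index("bass"))
# fv contains e.g. "word=guitar", "left=and", "right=player"
```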
Supervised methods
Naïve Bayes Classifier
Identify the features (F):
e.g. surrounding words
other cues apart from surrounding context
Combine evidence from all features
Decision rule: decide on sense s' iff s' = argmax_k P(s_k) ∏_{f ∈ F} P(f | s_k)
Example: drug. F = words in context:
medication sense: price, prescription, pharmaceutical
illegal substance sense: alcohol, illicit, paraphernalia
Using Bayes’ rule
We usually don’t know P(s_k | f), but Bayes’ rule, P(s_k | f) = P(f | s_k) P(s_k) / P(f), lets us compute it from quantities estimated on training data: P(s_k) (the prior) and P(f | s_k).
P(f) can be eliminated, because it is constant for all senses in the corpus.
The independence assumption
It’s called “naïve” because it assumes P(F | s_k) = ∏_{f ∈ F} P(f | s_k), i.e. all features are assumed to be independent of one another given the sense.
Obviously, this is often not true:
e.g. finding illicit in the context of drug may not be independent of finding pusher
cf. our discussion of collocations!
Also, topics often constrain word choice.
Training the naïve Bayes classifier
We need to compute, from the sense-labelled training data:
P(s) for all senses s of w
P(f|s) for all features f
Using maximum-likelihood estimates: P(s) = C(s)/C(w) and P(f|s) = C(f,s)/C(s), where C(·) are counts in the training corpus.
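Putting the training quantities and the decision rule together, a minimal naïve Bayes WSD sketch (the add-one smoothing and the toy drug examples are assumptions added for illustration, not part of the original slides):

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled):
    """labelled: list of (context_words, sense) pairs.
    Collects the counts needed for P(s) and P(f|s)."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in labelled:
        sense_counts[sense] += 1
        for w in words:
            feat_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, feat_counts, vocab

def classify(words, sense_counts, feat_counts, vocab):
    """Decision rule: argmax over senses of log P(s) + sum log P(f|s),
    with add-one smoothing so unseen features don't zero out a sense."""
    total = sum(sense_counts.values())
    best, best_lp = None, float("-inf")
    for s, c in sense_counts.items():
        lp = math.log(c / total)                             # log P(s)
        denom = sum(feat_counts[s].values()) + len(vocab)
        for w in words:
            lp += math.log((feat_counts[s][w] + 1) / denom)  # log P(f|s)
        if lp > best_lp:
            best, best_lp = s, lp
    return best

train = [(["price", "prescription", "pharmaceutical"], "medication"),
         (["prescription", "pharmacy"], "medication"),
         (["alcohol", "illicit", "paraphernalia"], "illegal"),
         (["illicit", "pusher"], "illegal")]
model = train_nb(train)
print(classify(["illicit", "alcohol"], *model))  # -> illegal
```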
Information-theoretic measures
Find the single, most informative feature to predict a sense.
E.g. using a parallel corpus: prendre (FR) can translate as take or make:
prendre une décision: make a decision
prendre une mesure: take a measure [to…]
Informative feature in this case: the direct object:
mesure indicates take
décision indicates make
Problem: need to identify the correct value of the feature that indicates a specific sense.
Brown et al’s algorithm
Given: translations T of word w
Given: values X of a useful feature (e.g. mesure, décision as values of the direct object)
Step 1: random partition P of T
While improving, do:
create a partition Q of X that maximises I(P;Q)
find a partition P of T that maximises I(P;Q)
Comment: relies on mutual information to find clusters of translations mapping to clusters of feature values.
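The objective I(P;Q) in that loop can be estimated directly from observed (translation, feature-value) pairs. A sketch, using a toy reconstruction of the prendre data (the partitions and observations are illustrative):

```python
import math
from collections import Counter

def mutual_information(pairs, P, Q):
    """I(P;Q) for a partition P over translations and a partition Q over
    feature values, estimated from observed (translation, value) pairs.
    P and Q map each item to a cluster id."""
    n = len(pairs)
    joint = Counter((P[t], Q[x]) for t, x in pairs)
    p_marg = Counter(P[t] for t, _ in pairs)
    q_marg = Counter(Q[x] for _, x in pairs)
    mi = 0.0
    for (i, j), c in joint.items():
        pij = c / n  # joint probability of cluster pair (i, j)
        mi += pij * math.log2(pij * n * n / (p_marg[i] * q_marg[j]))
    return mi

# prendre translated as make/take, with its direct object as the feature
pairs = [("make", "décision"), ("make", "décision"),
         ("take", "mesure"), ("take", "mesure")]
P = {"make": 0, "take": 1}          # partition of the translations
Q = {"décision": 0, "mesure": 1}    # partition of the feature values
score = mutual_information(pairs, P, Q)
print(score)  # -> 1.0: the partitions align perfectly
```

The flip-flop algorithm alternates between re-partitioning T and X so as to drive this score up, stopping when it no longer improves.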
Using dictionaries and thesauri
Lesk (1986): one of the first to exploit dictionary definitions
the definition corresponding to a sense can contain words which are good indicators for that sense
Method:
Given: ambiguous word w with senses s1…sn and glosses g1…gn
Given: the word w in context c
compute the overlap between c and each gloss
select the maximally matching sense
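A simplified version of this overlap computation can be sketched as follows (the stopword list is an assumption, and the glosses are abridged from the WordNet entry shown earlier):

```python
def lesk(context_words, glosses,
         stopwords=frozenset({"a", "an", "the", "of", "to"})):
    """Simplified Lesk: pick the sense whose gloss has the largest
    word overlap with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best, best_overlap = None, -1
    for sense, gloss in glosses.items():
        gloss_words = {w.lower() for w in gloss.split()} - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

glosses = {
    "INTEREST-MONEY": "a fixed charge for borrowing money",
    "INTEREST-CURIOSITY": "a sense of concern with and curiosity about someone",
}
ctx = "she pays three percent interest on the bank loan for borrowing".split()
print(lesk(ctx, glosses))  # -> INTEREST-MONEY
```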
Expanding a dictionary
Problem with Lesk:
often dictionary definitions don’t contain sufficient information
not all words in dictionary definitions are good informants
Solution: use a thesaurus with subject/topic categories
e.g. Roget’s thesaurus: http://www.gutenberg.org/cache/epub/22/pg22.html.utf8
Using topic categories
Suppose every sense s_k of word w has subject/topic t_k
w can be disambiguated by identifying the words related to t_k in the thesaurus
Problems:
general-purpose thesauri don’t list domain-specific topics
several potentially useful words can be left out
e.g. …Navratilova plays great tennis…
the proper name here is useful as an indicator of the topic SPORT
Expanding a thesaurus: Yarowsky 1992
Given: context c and topic t
For all contexts and topics, compute p(c|t) using Naïve Bayes, by comparing the words pertaining to t in the thesaurus with the words in c
if p(c|t) > α, then assign topic t to context c
For all words in the vocabulary, update the list of contexts in which the word occurs:
assign topic t to each word in c
Finally, compute p(w|t) for all w in the vocabulary
this gives the “strength of association” of w with t
Yarowsky 1992: some results
SENSE                   ROGET TOPIC    ACCURACY
star: space object      UNIVERSE       96%
star: celebrity         ENTERTAINER    95%
star: shape             INSIGNIA       82%
sentence: punishment    LEGAL_ACTION   99%
sentence: set of words  GRAMMAR        98%
Bootstrapping
Yarowsky (1995) suggested the one-sense-per-discourse and one-sense-per-collocation constraints.
Yarowsky’s method:
select the strongest collocational feature in a specific context
disambiguate based only on this feature
(similar to the information-theoretic method discussed earlier)
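The "disambiguate on the single strongest feature" step can be sketched as a decision list: collocates are ranked by strength of evidence, and the first one found in the context decides the sense. The ranked list below is hypothetical, standing in for one learned from data:

```python
def decision_list_classify(context_words, decision_list, default=None):
    """decision_list: (collocate, sense) pairs, strongest evidence first.
    The first collocate present in the context decides the sense."""
    context = set(context_words)
    for collocate, sense in decision_list:
        if collocate in context:
            return sense
    return default

# hypothetical ranked collocates for 'plant'
dlist = [("life", "PLANT-LIVING"), ("manufacturing", "PLANT-FACTORY"),
         ("species", "PLANT-LIVING"), ("closure", "PLANT-FACTORY")]
print(decision_list_classify("the manufacturing plant shut down".split(), dlist))
# -> PLANT-FACTORY
```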
One sense per collocation
For each sense s of w, initialise F, the collocations found in s’s dictionary definition.
One sense per collocation:
identify the set of contexts containing collocates of s
for each sense s of w, update F to contain those collocates f such that P(s|f) / P(s'|f) > α for all s' ≠ s (where α is a threshold)
One sense per discourse
For each document:
find the majority sense of w out of those found in the previous step
assign all occurrences of w the majority sense
This is implemented as a post-processing step. It reduces the error rate by ca. 27%.
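This post-processing step can be sketched directly (the sense labels are illustrative):

```python
from collections import Counter

def one_sense_per_discourse(tagged_occurrences):
    """Re-tag every occurrence of the target word in one document
    with the document's majority sense."""
    majority = Counter(tagged_occurrences).most_common(1)[0][0]
    return [majority] * len(tagged_occurrences)

doc_tags = ["PLANT-FACTORY", "PLANT-FACTORY", "PLANT-LIVING"]
print(one_sense_per_discourse(doc_tags))
# -> ['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-FACTORY']
```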
Unsupervised disambiguation
Preliminaries
Recall: unsupervised learning can do sense discrimination not tagging
akin to clustering occurrences with the same sense
e.g. Brown et al. 1991: cluster translations of a word
this is akin to clustering senses
Brown et al’s method
Preliminary categorisation:
1. Set P(w|s) randomly for all words w and senses s of w.
2. Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s.
Use (1) and (2) as a preliminary estimate; re-estimate iteratively (an EM procedure) to find the best fit to the corpus.
Characteristics of unsupervised disambiguation
Can adapt easily to new domains, not covered by a dictionary or pre-labelled corpus
Very useful for information retrieval
If there are many senses (e.g. 20 senses for word w), the algorithm will split contexts into fine-grained sets
NB: can go awry with infrequent senses
Some issues with WSD
The task definition
The WSD task traditionally assumes that a word has one and only one sense in a context.
Is this true?
Kilgarriff (1993) argues that co-activation (one word displaying more than one sense) is frequent:
this would bring competition to the licensed trade
competition = “act of competing” and “people/organisations who are competing”
Systematic polysemy
Not all senses are so easy to distinguish, e.g. competition in the “agent competing” vs. “act of competing” sense.
The polysemy here is systematic.
Compare bank/bank, where the senses are utterly distinct (and most linguists wouldn’t consider this a case of polysemy, but homonymy).
Can translation equivalence help here?
depends on whether the polysemy is systematic in all languages
Logical metonymy
Metonymy = usage of a word to stand for something else
e.g. the pen is mightier than the sword (pen = the press)
Logical metonymy arises due to systematic polysemy:
good cook vs. good book
enjoy the paper vs. enjoy the cake
Should WSD distinguish these? How could a system do this?
Which words/usages count?
Many proper names are identical to common nouns (cf. Brown, Bush, …)
This presents a WSD algorithm with systematic ambiguity and reduces performance.
Also, names are good indicators of the senses of neighbouring words, but this requires a priori categorisation of names:
Brown’s green stance vs. the cook’s green curry