Semantic similarity, vector space models and word-sense disambiguation
Corpora and Statistical Methods
Lecture 6

Semantic similarity
Part 1

Synonymy
Synonyms are different phonological/orthographic words with highly related meanings:
- sofa / couch
- boy / lad
Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence salva veritate (preserving its truth value).
Is this ever the case? Can we substitute one word for another and keep the meaning of our sentence identical?

The importance of text genre & register
With near-synonyms, there are often register-governed conditions of use.
E.g. naive vs. gullible vs. ingenuous:
- "You're so bloody gullible […]"
- "[…] outside on the pavement trying to entice gullible idiots in […]"
- "You're so ingenuous. You tackle things the wrong way."
- "The commentator's ingenuous query could just as well have been prompted […]"
- "However, it is ingenuous to suppose that peace process […]"
(source: BNC)

Synonymy vs. similarity
The contextual theory of synonymy is based on the work of Wittgenstein (1953) and Firth (1957):
"You shall know a word by the company it keeps" (Firth 1957)
Under this view, perfect synonyms might not exist. But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.

Synonymy vs. similarity: example
Miller & Charles (1991), the weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.
E.g. snake is similar to (resp. a synonym of) serpent to the extent that we find snake and serpent in the same linguistic contexts. It is much more likely that snake/serpent will occur in similar contexts than snake/toad.
NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment
Subjects were given sentences with missing words, and were asked to place words they felt were acceptable in each context.
Method to compare words A and B:
- find sentences containing A
- find sentences containing B
- delete A and B from the sentences and shuffle them
- ask people to choose which sentences to place A and B in
Results: people tend to put similar words in the same contexts, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity
"Similar" is a much broader concept than "synonymous":
- contextually related, though differing in meaning: man / woman, boy / girl, master / pupil
- contextually related, but with opposite meanings: big / small, clever / stupid

Uses of similarity
Assumption: semantically similar words behave in similar ways.
- Information retrieval: query expansion with related terms.
- k-nearest-neighbours classification, e.g.:
  given: a set of elements, each assigned to some topic
  task: classify an unknown word w by topic
  method: find the topic that is most prevalent among w's semantic neighbours (a sketch follows below)

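A minimal sketch of this classification step, assuming we already have a labelled lexicon and some word-similarity function; the `char_overlap` stand-in and all of the data below are invented for illustration:

```python
from collections import Counter

def knn_topic(w, labelled, similarity, k=3):
    """Assign w the most prevalent topic among its k most similar labelled words."""
    neighbours = sorted(labelled, key=lambda v: similarity(w, v), reverse=True)[:k]
    return Counter(labelled[v] for v in neighbours).most_common(1)[0][0]

# Stand-in similarity for the demo (a real system would use one of the
# vector-space or probabilistic measures introduced below):
def char_overlap(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))

lexicon = {"sofa": "FURNITURE", "couch": "FURNITURE",
           "boy": "PERSON", "lad": "PERSON"}
print(knn_topic("couches", lexicon, char_overlap, k=3))  # FURNITURE
```
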
Common approaches
Vector-space approaches:
- represent word w as a vector containing the words (or other features) in the context of w
- compare the vectors of w1 and w2, using one of various available vector-distance measures
Information-theoretic measures:
- w1 is similar to w2 to the extent that knowing about w1 increases my knowledge of (decreases my uncertainty about) w2

Vector-space models

Basic data structure
A matrix M, where M_ij = the number of times w_i co-occurs with w_j (within some window). We can also use a document * word matrix.
We can treat matrix cells as boolean: if M_ij > 0, then w_i co-occurs with w_j; otherwise it does not. A sketch of the construction follows below.

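A minimal sketch of building such a matrix from a token stream; the window size and the toy sentence are invented for illustration:

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """M[w1][w2] = number of times w2 occurs within `window` tokens of w1."""
    M = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[w][tokens[j]] += 1
    return M

tokens = "the cosmonaut completed a spacewalk outside the station".split()
M = cooccurrence_matrix(tokens, window=2)
print(M["cosmonaut"]["completed"])       # 1
print(M["cosmonaut"]["spacewalk"] > 0)   # False: more than 2 tokens away
```

The boolean view on the slide is then just the test `M[wi][wj] > 0`.
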
Distance measures
Many measures take a set-theoretic perspective. Vectors can be:
- binary (indicating co-occurrence or not)
- real-valued (indicating frequency, or probability)
Similarity is then a function of what two vectors have in common.

Classic similarity/distance measures

                      Boolean vectors (sets X, Y)    Real-valued vectors x, y
Dice coefficient      2|X ∩ Y| / (|X| + |Y|)         2 Σ_i min(x_i, y_i) / Σ_i (x_i + y_i)
Jaccard coefficient   |X ∩ Y| / |X ∪ Y|              Σ_i min(x_i, y_i) / Σ_i max(x_i, y_i)

Dice vs. Jaccard
Dice(car, truck) on the boolean matrix: (2 * 2) / (4 + 2) = 0.66
Jaccard(car, truck) on the boolean matrix: 2 / 4 = 0.5
Dice is more "generous"; Jaccard penalises lack of overlap more.

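A minimal sketch over boolean context sets; the sets for car and truck are invented so as to reproduce the numbers above:

```python
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

# Hypothetical context sets: |car| = 4, |truck| = 2, overlap = 2.
car = {"drive", "road", "wheel", "red"}
truck = {"drive", "road"}

print(dice(car, truck))     # (2 * 2) / (4 + 2) = 0.66...
print(jaccard(car, truck))  # 2 / 4 = 0.5
```
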
Classic similarity/distance measures

                      Boolean vectors (sets X, Y)    Real-valued vectors x, y
Cosine similarity     |X ∩ Y| / √(|X| · |Y|)         Σ_i x_i y_i / (√(Σ_i x_i²) · √(Σ_i y_i²))

(= the cosine of the angle between the two vectors)

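A minimal sketch of cosine similarity over real-valued vectors; the frequency vectors below are invented for illustration:

```python
import math

def cosine(x, y):
    """Dot product of x and y, normalised by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Hypothetical frequency vectors over the contexts (drive, road, wheel, red):
car, truck = [3, 2, 2, 1], [2, 1, 0, 0]
print(cosine(car, truck))  # ~0.84; 1.0 would mean identical direction
```
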
Probabilistic approaches

Turning counts into probabilities
Dividing each count in a row by the row total turns the row into a conditional distribution, e.g.:
P(spacewalking|cosmonaut) = 1/2 = 0.5
P(red|car) = 1/4 = 0.25
NB: this transforms each row into a probability distribution corresponding to a word.

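A minimal sketch of the row normalisation; the count matrix is invented so that its rows reproduce the two probabilities above:

```python
import numpy as np

words = ["cosmonaut", "car"]
contexts = ["spacewalking", "red", "old", "orbit"]
# Hypothetical counts: row sums of 2 and 4 reproduce the slide's figures.
counts = np.array([[1, 0, 0, 1],
                   [0, 1, 2, 1]])

P = counts / counts.sum(axis=1, keepdims=True)  # each row now sums to 1
print(P[0, 0])  # P(spacewalking|cosmonaut) = 1/2 = 0.5
print(P[1, 1])  # P(red|car) = 1/4 = 0.25
```
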
Probabilistic measures of distance
KL-divergence: treat w1's distribution p as an approximation of w2's distribution q:
D(p || q) = Σ_i p(i) log( p(i) / q(i) )
Problems:
- asymmetric: D(p||q) ≠ D(q||p)
- not so useful for word-word similarity
- if the denominator q(i) = 0 for some i where p(i) > 0, then D(p||q) is undefined

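A minimal sketch, with invented distributions, showing both problems at once:

```python
import numpy as np

def kl(p, q):
    """D(p||q) = sum_i p(i) * log2(p(i)/q(i)); terms with p(i) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.8, 0.1, 0.1], [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))  # 0.5 vs 0.6: asymmetric

# The zero-denominator problem: q assigns no mass to an outcome p cares about.
print(kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf (with a divide-by-zero warning)
```
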
Probabilistic measures of distance
Information radius (a.k.a. Jensen-Shannon divergence) compares the total divergence between p and q to their average m = (p + q)/2:
IRad(p, q) = D(p || m) + D(q || m)
It is symmetric, and always defined, since m is non-zero wherever p or q is.
Dagan et al. (1997) showed this measure to be superior to KL-divergence when applied to a word sense disambiguation task.

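A minimal sketch with the same invented distributions as in the KL example, now giving a symmetric, always-defined score:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def irad(p, q):
    """Information radius: D(p||m) + D(q||m), where m is the average of p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2  # m is non-zero wherever p or q is, so both terms are defined
    return kl(p, m) + kl(q, m)

p, q = [0.8, 0.1, 0.1], [0.4, 0.4, 0.2]
print(irad(p, q) == irad(q, p))  # True: symmetric
print(irad([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # ~0.62: defined despite the zeros
```
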
Some characteristics of vector-space measures
- Very simple conceptually.
- Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.
- Vectors can be arbitrarily large, representing wide context windows.
- Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998)
Intuition: the similarity of any two things (words, documents, people, plants) is a function of the information gained by having a joint description of a and b in terms of what they have in common, compared to describing a and b separately.
E.g. do we gain more from a joint description of:
- apple and chair (both THINGS...)?
- apple and banana (both FRUIT: more specific)?

Lin's definition cont/d
Essentially, we compare the information content of the "common" description to the information content of the "separate" descriptions. In Lin (1998) this is formalised as:
sim(A, B) = log P(common(A, B)) / log P(description(A, B))
NB: essentially mutual information!

An application to corpora
From a corpus-based point of view, what do words have in common? Context, obviously. But how do we define context?
- just a "bag of words" (typical of vector-space models)
- something more grammatically sophisticated

Kilgarriff's (2003) application
Definition of the notion of context, following Lin:
- define F(w) as the set of grammatical contexts in which w occurs
- a context is a triple <rel, w, w'>, where rel is a grammatical relation, w is the word of interest, and w' is the other word in rel
Grammatical relations can be obtained using a dependency parser, as in the sketch below.

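A minimal sketch of triple extraction, using spaCy as the dependency parser and a two-relation inventory; both choices are assumptions for illustration, not Kilgarriff's actual setup:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

def triples(text):
    """Extract <rel, w, w'> triples from a dependency parse."""
    out = []
    for tok in nlp(text):
        if tok.dep_ == "nsubj":      # w is the subject of its head w'
            out.append(("subject-of", tok.lemma_, tok.head.lemma_))
        elif tok.dep_ == "dobj":     # w is the direct object of its head w'
            out.append(("object-of", tok.lemma_, tok.head.lemma_))
    return out

print(triples("The cell absorbs oxygen and the virus attacks the cell."))
# e.g. [('subject-of', 'cell', 'absorb'), ('object-of', 'oxygen', 'absorb'),
#       ('subject-of', 'virus', 'attack'), ('object-of', 'cell', 'attack')]
```
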
Grammatical co-occurrence matrix for cell
[Slide shows a matrix of grammatical contexts and counts for the word cell.]
Source: Jurafsky & Martin (2009), after Lin (1998)

Example with w = cell
Example triples:
<subject-of, cell, absorb>
<object-of, cell, attack>
<nmod-of, cell, architecture>
Observe that each triple f consists of the relation r, the second word in the relation w', and the word of interest w.
We can now compute the level of association between the word w and each of its triples f, using an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information. A PMI-style sketch follows below.

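A minimal sketch using plain pointwise mutual information as the association score (Lin's actual measure generalises PMI); the observations below are invented:

```python
import math
from collections import Counter

# Hypothetical (word, feature) observations; a feature is a (rel, w') pair.
observations = [
    ("cell", ("subject-of", "absorb")),
    ("cell", ("object-of", "attack")),
    ("cell", ("nmod-of", "architecture")),
    ("virus", ("subject-of", "attack")),
    ("virus", ("object-of", "attack")),
]

N = len(observations)
word_counts = Counter(w for w, f in observations)
feat_counts = Counter(f for w, f in observations)
pair_counts = Counter(observations)

def assoc(w, f):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) )."""
    return math.log2((pair_counts[(w, f)] / N) /
                     ((word_counts[w] / N) * (feat_counts[f] / N)))

print(assoc("cell", ("subject-of", "absorb")))  # ~0.74
```
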
Calculating similarity
Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of:
- the triples they have in common
- the triples that are unique to each
I.e. the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has (see the sketch below).

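A minimal sketch of this ratio, assuming a single information score per triple (Lin's original formulation sums word-specific scores); the feature sets and uniform weights below are invented:

```python
def lin_similarity(f1, f2, info):
    """2 * (info shared by w1 and w2) / (info of w1 + info of w2)."""
    shared = sum(info[f] for f in f1 & f2)
    total = sum(info[f] for f in f1) + sum(info[f] for f in f2)
    return 2 * shared / total

# Hypothetical feature sets and uniform information scores:
master = {("subject-of", "read"), ("modifier", "good"), ("subject-of", "ask")}
pupil = {("subject-of", "read"), ("modifier", "good"), ("pp_at", "school")}
info = {f: 1.0 for f in master | pupil}

print(lin_similarity(master, pupil, info))  # 2*2 / (3+3) = 0.66...
```
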
Sample results: master & pupil
Common:
- subject-of: read, sit, know
- modifier: good, form
- possession: interest
master only:
- subject-of: ask
- modifier: past (cf. past master)
pupil only:
- subject-of: make, find
- PP_at-p: school

Concrete implementation
The online SketchEngine gives the grammatical relations of words, plus a thesaurus which ranks words by similarity to a headword. This is based on the Lin (1998) model.

Limitations (or characteristics)
- Only applicable as a measure of similarity between words of the same category: it makes no sense to compare the grammatical relations of words of different categories.
- Does not distinguish between near-synonyms and merely "similar" words: student ~ pupil, but also master ~ pupil.
- MI is sensitive to low frequencies: a relation which occurs only once in the corpus can come out as highly significant.