
Semantic similarity, vector space models and word-sense disambiguation

Corpora and Statistical Methods

Lecture 6

Semantic similarity

Part 1

Synonymy

Different phonological/orthographic words with highly related meanings:

sofa / couch

boy / lad

Traditional definition:

w1 is synonymous with w2 if w1 can replace w2 in a sentence salva veritate (i.e. preserving truth).

Is this ever the case? Can we replace one word with another and keep the meaning of the sentence unchanged?

The importance of text genre & register

With near-synonyms, there are often register-governed conditions of use.

E.g. naive vs. gullible vs. ingenuous:

You're so bloody gullible […]

[…] outside on the pavement trying to entice gullible idiots in […]

You're so ingenuous. You tackle things the wrong way.

The commentator's ingenuous query could just as well have been prompted […]

However, it is ingenuous to suppose that peace process […]

(source: BNC)

Synonymy vs. Similarity

The contextual theory of synonymy:

based on the work of Wittgenstein (1953) and Firth (1957)

“You shall know a word by the company it keeps” (Firth 1957)

Under this view, perfect synonyms might not exist.

But words can be judged as highly similar if people put them into the same linguistic contexts and judge the change to be slight.

Synonymy vs. similarity: example

Miller & Charles 1991:

Weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.

E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts.

It is much more likely that snake/serpent will occur in similar contexts than snake/toad.

NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment

Subjects were given sentences with missing words and asked to place words they felt were OK in each context.

Method to compare words A and B:

find sentences containing A

find sentences containing B

delete A and B from the sentences and shuffle them

ask people to choose which sentences to place A and B in

Results: people tend to put similar words in the same contexts, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity

“Similar” is a much broader concept than “synonymous”:

“Contextually related, though differing in meaning”:

man / woman

boy / girl

master / pupil

“Contextually related, but with opposite meanings”:

big / small

clever / stupid

Uses of similarity

Assumption: semantically similar words behave in similar ways

Information retrieval: query expansion with related terms

K nearest neighbours, e.g.:

given: a set of elements, each assigned to some topic

task: classify an unknown word w by topic

method: find the topic that is most prevalent among w’s semantic neighbours
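To make the K-nearest-neighbours idea above concrete, here is a minimal Python sketch, assuming some word-similarity function is already available; the words, topics and similarity scores are invented purely for illustration.

```python
from collections import Counter

def knn_topic(word, labelled_words, similarity, k=3):
    """Assign `word` the topic most prevalent among its k most similar labelled words.

    labelled_words: dict mapping known words to topic labels.
    similarity: function (word1, word2) -> score, higher = more similar.
    """
    neighbours = sorted(labelled_words,
                        key=lambda w: similarity(word, w),
                        reverse=True)[:k]
    topic_counts = Counter(labelled_words[w] for w in neighbours)
    return topic_counts.most_common(1)[0][0]

# Toy example with hand-crafted similarities (purely illustrative).
labelled = {"sofa": "furniture", "couch": "furniture", "chair": "furniture",
            "snake": "animal", "serpent": "animal", "toad": "animal"}
toy_sims = {("lizard", w): s for w, s in
            [("snake", 0.8), ("serpent", 0.7), ("toad", 0.6),
             ("sofa", 0.1), ("couch", 0.1), ("chair", 0.2)]}
sim = lambda a, b: toy_sims.get((a, b), 0.0)

print(knn_topic("lizard", labelled, sim))   # -> "animal"
```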

Common approaches

Vector-space approaches:

represent word w as a vector containing the words (or other features) in the context of w

compare the vectors of w1 and w2

various vector-distance measures are available

Information-theoretic measures:

w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2

Vector-space models

Basic data structure

Matrix M, where M_ij = the number of times w_i co-occurs with w_j (in some window).

Can also have a Document × Word matrix.

We can treat matrix cells as boolean: if M_ij > 0, then w_i co-occurs with w_j; else it does not.
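A minimal sketch (not from the lecture) of how such a word–word matrix M might be built from a tokenised corpus using a symmetric context window; the toy corpus and window size are invented.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Build M where M[wi][wj] = number of times wj occurs within
    `window` tokens of wi (a word-by-word co-occurrence matrix)."""
    M = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, wi in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[wi][tokens[j]] += 1
    return M

# Toy corpus (invented).
corpus = [["the", "cosmonaut", "completed", "a", "spacewalk"],
          ["the", "red", "car", "overtook", "the", "truck"]]
M = cooccurrence_matrix(corpus, window=2)
print(M["car"]["red"])       # raw count
print(M["car"]["red"] > 0)   # boolean view: does "car" co-occur with "red"?
```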

Distance measures

Many measures take a set-theoretic perspective:

vectors can be:

binary (indicating co-occurrence or not)

real-valued (indicating frequency, or probability)

similarity is a function of what two vectors have in common

Classic similarity/distance measures

Boolean vector (sets): Dice coefficient, Jaccard coefficient

Real-valued vector: Dice coefficient, Jaccard coefficient

Dice vs. Jaccard

Dice(car, truck): on the boolean matrix, (2 * 2)/(4 + 2) = 0.66

Jaccard(car, truck): on the boolean matrix, 2/4 = 0.5

Dice is more “generous”; Jaccard penalises lack of overlap more.
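A small sketch of the two set-based measures; the context sets for car and truck below are hypothetical, chosen only so that the counts match the example above (four contexts for car, two for truck, two shared).

```python
def dice(x, y):
    # Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    # Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
    return len(x & y) / len(x | y)

# Hypothetical boolean co-occurrence rows, viewed as sets of context words.
car   = {"drive", "red", "engine", "road"}   # 4 contexts
truck = {"drive", "road"}                    # 2 contexts, both shared with car

print(round(dice(car, truck), 2))     # (2 * 2) / (4 + 2) ≈ 0.67
print(round(jaccard(car, truck), 2))  # 2 / 4 = 0.5
```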

Classic similarity/distance measures

Boolean vector (sets): cosine similarity

Real-valued vector: cosine similarity (= the cosine of the angle between the 2 vectors)
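A minimal sketch of cosine similarity over real-valued count vectors; the frequency vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity = dot(u, v) / (|u| * |v|),
    i.e. the cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented frequency vectors over the same context words.
car   = [3, 1, 0, 2]
truck = [2, 0, 1, 2]
print(round(cosine(car, truck), 3))
```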

Probabilistic approaches

Turning counts to probabilities

P(spacewalking | cosmonaut) = ½ = 0.5

P(red | car) = ¼ = 0.25

NB: this transforms each row into a probability distribution corresponding to a word.

Probabilistic measures of distance

KL-Divergence:

treat the distribution of w1 as an approximation of the distribution of w2

Problems:

asymmetric: D(p||q) ≠ D(q||p)

not so useful for word–word similarity

if the denominator is 0, then D(v||w) is undefined
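A sketch of KL-divergence over two such word distributions, illustrating the asymmetry and the zero-denominator problem noted above; the distributions are invented.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    Undefined (here: infinite) when q(x) = 0 for some x with p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                  # 0 * log(0/q) is taken as 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")       # the "denominator = 0" problem
        total += px * math.log2(px / qx)
    return total

p = {"red": 0.5, "drive": 0.5}
q = {"red": 0.25, "drive": 0.5, "engine": 0.25}

print(kl_divergence(p, q))   # 0.5: not equal to kl_divergence(q, p) -> asymmetric
print(kl_divergence(q, p))   # inf, because p("engine") = 0
```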

Probabilistic measures of distance

Information radius (aka Jensen-Shannon divergence):

compares the total divergence between p and q to the average of p and q

symmetric!

Dagan et al. (1997) showed this measure to be superior to KL-divergence when applied to a word sense disambiguation task.
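A minimal sketch of the information radius: each distribution is compared against the average (mixture) distribution m, which makes the measure symmetric and always defined; the distributions are the same invented ones as before.

```python
import math

def kl(p, q):
    # D(p || q); assumes q(x) > 0 wherever p(x) > 0 (true when q is the mixture m).
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def jensen_shannon(p, q):
    """Information radius: JSD(p, q) = 0.5 * D(p || m) + 0.5 * D(q || m),
    where m is the average distribution m(x) = (p(x) + q(x)) / 2."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = {"red": 0.5, "drive": 0.5}
q = {"red": 0.25, "drive": 0.5, "engine": 0.25}
print(round(jensen_shannon(p, q), 3))                 # finite, even though D(q||p) was not
print(jensen_shannon(p, q) == jensen_shannon(q, p))   # True: symmetric
```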

Some characteristics of vector-space measures

Very simple conceptually;

Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.;

Vectors can be arbitrarily large, representing wide context windows;

Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998)

Intuition:

The similarity of any two things (words, documents, people, plants) is a function of the information gained by having:

a joint description of a and b in terms of what they have in common

compared to describing a and b separately

E.g. do we gain more by a joint description of:

apple and chair (both THINGS…)

apple and banana (both FRUIT: more specific)?

Lin’s definition cont’d

Essentially, we compare the info content of the “common” definition to the info content of the “separate” definition

NB: essentially mutual information!

An application to corpora

From a corpus-based point of view, what do words have in common?

context, obviously

How to define context?

just “bag-of-words” (typical of vector-space models)

more grammatically sophisticated

Kilgarriff’s (2003) application

Definition of the notion of context, following Lin:

define F(w) as the set of grammatical contexts in which w occurs

a context is a triple <rel, w, w’>:

rel is a grammatical relation

w is the word of interest

w’ is the other word in rel

Grammatical relations can be obtained using a dependency parser.
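A small sketch of how F(w) might be stored once a dependency parser has produced relation instances; the parsed relations below are hard-coded stand-ins for real parser output (the cell triples also appear on a later slide).

```python
from collections import defaultdict

# Hard-coded stand-ins for dependency-parser output:
# each item is a grammatical relation instance (rel, word, other_word).
parsed_relations = [
    ("subject-of", "cell", "absorb"),
    ("object-of",  "cell", "attack"),
    ("nmod-of",    "cell", "architecture"),
    ("subject-of", "pupil", "read"),
]

def grammatical_contexts(relations):
    """F(w): map each word of interest w to its set of context triples <rel, w, w'>."""
    F = defaultdict(set)
    for rel, w, w_prime in relations:
        F[w].add((rel, w, w_prime))
    return F

F = grammatical_contexts(parsed_relations)
print(F["cell"])
```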


Grammatical co-occurrence matrix for cell

Source: Jurafsky & Martin (2009), after Lin (1998)

Example with w = cell

Example triples:

<subject-of, cell, absorb>

<object-of, cell, attack>

<nmod-of, cell, architecture>

Observe that each triple f consists of: the relation r, the second word in the relation w’, and the word of interest w.

We can now compute the level of association between the word w and each of its triples f: an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information.
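The exact association measure is not reproduced in this transcript; as a stand-in, the sketch below scores each triple with plain pointwise mutual information between the word w and the feature f = (rel, w’), estimated from invented triple counts.

```python
import math
from collections import Counter

# Invented counts of (rel, w, w') triples extracted from a parsed corpus.
triples = Counter({
    ("subject-of", "cell", "absorb"): 6,
    ("object-of",  "cell", "attack"): 3,
    ("nmod-of",    "cell", "architecture"): 1,
    ("subject-of", "pupil", "read"): 4,
    ("object-of",  "teacher", "attack"): 2,
})
N = sum(triples.values())

def pmi(w, feature):
    """Plain PMI between word w and feature f = (rel, w'):
    log2( P(w, f) / (P(w) * P(f)) ), estimated from the triple counts."""
    joint = sum(c for (rel, word, wp), c in triples.items()
                if word == w and (rel, wp) == feature)
    p_w = sum(c for (rel, word, wp), c in triples.items() if word == w) / N
    p_f = sum(c for (rel, word, wp), c in triples.items() if (rel, wp) == feature) / N
    if joint == 0:
        return float("-inf")
    return math.log2((joint / N) / (p_w * p_f))

print(round(pmi("cell", ("subject-of", "absorb")), 2))
```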

Calculating similarity

Given that we have grammatical triples for our words of interest, similarity of w1 and w2 is a function of:

the triples they have in common

the triples that are unique to each

I.e.: the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has.
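A minimal sketch of this computation, following the description above: the association weights (e.g. PMI scores, as in the previous sketch) of the features two words share, divided by the total weight of each word’s own features. The feature weights below are invented.

```python
def lin_similarity(weights_w1, weights_w2):
    """sim(w1, w2) = sum of weights of shared features (counted for both words),
    divided by the sum of the weights of all features of w1 and of w2."""
    common = set(weights_w1) & set(weights_w2)
    shared = sum(weights_w1[f] + weights_w2[f] for f in common)
    total = sum(weights_w1.values()) + sum(weights_w2.values())
    return shared / total if total else 0.0

# Invented association weights over grammatical-context features (rel, w').
master = {("subject-of", "read"): 2.1, ("modifier", "good"): 1.4,
          ("subject-of", "ask"): 1.8, ("modifier", "past"): 2.5}
pupil  = {("subject-of", "read"): 1.9, ("modifier", "good"): 1.2,
          ("subject-of", "make"): 1.1, ("pp_at-p", "school"): 2.0}

print(round(lin_similarity(master, pupil), 2))
```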

Sample results: master & pupil

common:

Subject-of: read, sit, know

Modifier: good, form

Possession: interest

master only:

Subject-of: ask

Modifier: past (cf. past master)

pupil only:

Subject-of: make, find

PP_at-p: school

Concrete implementation

The online SketchEngine gives grammatical relations of words, plus a thesaurus which rates words by similarity to a head word.

This is based on the Lin (1998) model.

Limitations (or characteristics)

Only applicable as a measure of similarity between words of the same category

it makes no sense to compare the grammatical relations of words belonging to different categories

Does not distinguish between near-synonyms and “similar” words:

student ~ pupil

master ~ pupil

MI is sensitive to low-frequency events: a relation which occurs only once in the corpus can come out as highly significant.