
Semantic similarity, vector space models and word-sense disambiguation

Corpora and Statistical Methods

Lecture 6

Word sense disambiguation

Part 2

What are word senses?

Cognitive definition:

mental representation of meaning

used in psychological experiments

relies on introspection (notoriously deceptive)

Dictionary-based definition:

adopt the sense definitions in a dictionary

the most frequently used resource is WordNet

WordNet

Taxonomic representation of words (“concepts”)

Each word belongs to a synset, which contains its near-synonyms

Each synset has a gloss

Words with multiple senses (polysemy) belong to multiple synsets

Synsets are organised by hyponymy (IS-A) relations
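These structures can be inspected programmatically. A minimal sketch using NLTK's WordNet interface (an assumption: the lecture does not prescribe a toolkit, and NLTK plus its WordNet data must be installed):

# Minimal sketch: inspecting synsets, glosses and hyponymy via NLTK.
# Assumes: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("interest", pos=wn.NOUN):
    print(synset.name())                                  # e.g. interest.n.01
    print("  near-synonyms:", [l.name() for l in synset.lemmas()])
    print("  gloss:", synset.definition())
    # IS-A (hyponymy/hypernymy) links organise synsets into a taxonomy.
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])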

How many senses?

Example: interest

pay 3% interest on a loan

showed interest in something

purchased an interest in a company.

the national interest…

have X’s best interest at heart

have an interest in word senses

The economy is run by business interests

WordNet entry for interest (noun)

1. a sense of concern with and curiosity about someone or something … (Synonym: involvement)

2. the power of attracting or holding one’s interest… (Synonym: interestingness)

3. a reason for wanting something done (Synonym: sake)

4. a fixed charge for borrowing money…

5. a diversion that occupies one’s time and thoughts … (Synonym: pastime)

6. a right or legal share of something; a financial involvement with something (Synonym: stake)

7. (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)

Some issues

Are all of these really distinct senses? Is WordNet too fine-grained?

Would native speakers distinguish all of these as different senses?

Cf. the distinction between sense ambiguity and underspecification (vagueness): one could argue that there are fewer senses, but that these are underspecified out of context

Translation equivalence

Many WSD applications rely on translation equivalence.

Given: a parallel corpus (e.g. English-German)

if a word w in English has n translations in German, then each translation represents a sense

e.g. the German translations of interest:

Zins: financial charge (WN sense 4)

Anteil: stake in a company (WN sense 6)

Interesse: all other senses

Some terminology

WSD task: given an ambiguous word, find the intended sense in context.

Sense tagging: the task of labelling words as belonging to one sense or another; needs some a priori characterisation of the senses of each relevant word.

Discrimination: distinguishes between occurrences of words based on their senses, without necessarily labelling them explicitly.

Some more terminology

Two types of WSD task:

Lexical sample task: focuses on disambiguating a small set of target words, using an inventory of the senses of those words.

All-words task: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated. This raises serious data sparseness problems!

Approaches to WSD

All methods rely on training data. Basic idea:

Given: word w in context c

learn how to predict sense s of w based on various features of w

Supervised learning: the training data is labelled with the correct senses

can do sense tagging

Unsupervised learning: the training data is unlabelled

but many other knowledge sources can be used

cannot do sense tagging, since this requires a priori senses

Supervised learning

Words in training data labelled with their senses

She pays 3% interest/INTEREST-MONEY on the loan.

He showed a lot of interest/INTEREST-CURIOSITY in the painting.

Similar to POS tagging:

given a corpus tagged with senses

define features that indicate one sense over another

learn a model that predicts the correct sense given the features

Features (e.g. plant)

Neighbouring words:

plant life

manufacturing plant

assembly plant

plant closure

plant species

Content words in a larger window:

animal

equipment

employee

automatic

Other features

Syntactically related words

e.g. the object or subject of the target word

Topic of the text

is it about SPORT? POLITICS?

Part-of-speech tag of the target, and the surrounding part-of-speech tags
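A minimal sketch of extracting the first two feature types for a target word: the immediately neighbouring words plus content words in a larger window (POS-tag and syntactic features are omitted; the window size and stop-word list are illustrative choices, not from the lecture):

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def extract_features(tokens, target_index, window=10):
    features = {}
    # Collocational features: the immediately neighbouring words.
    if target_index > 0:
        features["prev_word"] = tokens[target_index - 1]
    if target_index + 1 < len(tokens):
        features["next_word"] = tokens[target_index + 1]
    # Bag-of-words features: content words in a larger window.
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index and tokens[i] not in STOP_WORDS:
            features["context:" + tokens[i]] = True
    return features

tokens = "the manufacturing plant closure affected every employee".split()
print(extract_features(tokens, tokens.index("plant")))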

Some principles proposed (Yarowsky 1995)

One sense per discourse:

typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. same document)

One sense per collocation:

nearby words provide clues as to the sense, depending on the distance and syntactic relationship

e.g. plant life: all (?) occurrences of plant + life will indicate the botanic sense of plant

Training data

SENSEVAL: a shared-task competition

datasets available for WSD, among other things

annotated corpora in many languages

Pseudo-words:

create a training corpus by artificially conflating words

e.g. replace all occurrences of man and hammer with man-hammer

an easy way to create training data (see the sketch below)

Multilingual parallel corpora:

translated texts aligned at the sentence level

the translation indicates the sense
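A minimal sketch of pseudo-word corpus creation, using the man/hammer example above (whitespace tokenisation is a simplifying assumption):

# Conflate every occurrence of two unrelated words into an artificial
# ambiguous pseudo-word; the original word serves as the gold sense label.
def make_pseudo_word_corpus(sentences, word1, word2):
    pseudo = f"{word1}-{word2}"
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in (word1, word2):
                conflated = tokens[:i] + [pseudo] + tokens[i + 1:]
                examples.append((" ".join(conflated), token))
    return examples

sentences = ["the man walked home", "hit the nail with a hammer"]
for text, sense in make_pseudo_word_corpus(sentences, "man", "hammer"):
    print(sense, "->", text)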

Data representation

Example sentence: An electric guitar and bass player stand off to one side...

Target word: bass

Possible senses: fish, musical instrument...

The relevant features of each occurrence are represented as a vector, e.g.:
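The example vector itself is not preserved in the transcript; the following is an illustrative reconstruction of the kind of representation such a slide typically shows, with collocational and bag-of-words features for the bass instance (all feature names and values are hypothetical):

# One training instance for the target word "bass", paired with its label.
instance = {
    "prev_word": "and",
    "next_word": "player",
    "prev_pos": "CC",          # POS tag of the preceding word
    "next_pos": "NN",          # POS tag of the following word
    "context:electric": 1,
    "context:guitar": 1,
    "context:player": 1,
    "context:stand": 1,
}
label = "bass-music"           # vs. "bass-fish"
print(instance, "->", label)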

Supervised methods

Naïve Bayes Classifier

Identify the features F:

e.g. surrounding words

other cues apart from the surrounding context

Combine the evidence from all features.

Decision rule: decide on sense s' iff s' = argmax_k P(s_k | F)

Example: drug, with F = the words in context

medication sense: price, prescription, pharmaceutical

illegal-substance sense: alcohol, illicit, paraphernalia

Using Bayes’ rule

We usually don’t know P(s_k | f), but we can compute it from training data via Bayes’ rule:

P(s_k | f) = P(f | s_k) P(s_k) / P(f)

where P(s_k) (the prior) and P(f | s_k) are estimated from the training data.

P(f) can be eliminated, because it is constant for all senses in the corpus.

The independence assumption

It’s called “naïve” because it assumes:

P(F | s_k) = ∏_{f ∈ F} P(f | s_k)

i.e. all features are assumed to be conditionally independent given the sense.

Obviously, this is often not true:

e.g. finding illicit in the context of drug may not be independent of finding pusher (cf. our discussion of collocations!)

also, topics often constrain word choice.

Training the naïve Bayes classifier

We need to compute:

P(s) for all senses s of w

P(f | s) for all features f

Both can be estimated by maximum likelihood from the sense-tagged corpus, e.g. P(f | s) = C(f, s) / C(s); a sketch follows.
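A minimal self-contained sketch of the whole pipeline, assuming the training data comes as (context tokens, sense) pairs for one ambiguous word; add-one smoothing and log probabilities are standard practical choices not spelled out on the slides:

import math
from collections import Counter, defaultdict

def train(data):
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for tokens, sense in data:
        sense_counts[sense] += 1
        for f in tokens:
            feature_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feature_counts, vocab

def disambiguate(tokens, sense_counts, feature_counts, vocab):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        # log P(s) + sum over f of log P(f|s), with add-one smoothing
        score = math.log(count / total)
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for f in tokens:
            score += math.log((feature_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

data = [
    ("pays 3% interest on the loan".split(), "INTEREST-MONEY"),
    ("charged interest on the mortgage".split(), "INTEREST-MONEY"),
    ("showed interest in the painting".split(), "INTEREST-CURIOSITY"),
]
model = train(data)
print(disambiguate("interest rate on a loan".split(), *model))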

Information-theoretic measures

Find the single most informative feature to predict a sense.

E.g. using a parallel corpus: prendre (FR) can translate as take or make

prendre une décision: make a decision

prendre une mesure: take a measure [to…]

The informative feature in this case is the direct object:

mesure indicates take

décision indicates make

Problem: we need to identify the correct value of the feature that indicates a specific sense.

Brown et al.’s algorithm

Given: translations T of word w

Given: values X of a useful feature (e.g. mesure and décision as values of the direct object)

Step 1: start from a random partition P of T

While improving, do:

create a partition Q of X that maximises I(P; Q)

find a partition P of T that maximises I(P; Q)

Comment: the algorithm relies on mutual information to find clusters of translations that map to clusters of feature values (see the sketch below).
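A minimal sketch of the objective being maximised, I(P; Q), computed from co-occurrence counts between translation clusters and feature-value clusters (the counts in the example are invented for illustration):

import math

def mutual_information(counts):
    """counts[i][j] = co-occurrences of translation cluster i
    with feature-value cluster j."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue
            p_ij = n / total
            p_i = row_totals[i] / total
            p_j = col_totals[j] / total
            mi += p_ij * math.log2(p_ij / (p_i * p_j))
    return mi

# e.g. P = {take}, {make}; Q = {mesure}, {décision}
print(mutual_information([[47, 3], [2, 48]]))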

Using dictionaries and thesauri

Lesk (1986): one of the first to exploit dictionary definitions

the definition corresponding to a sense can contain words which are good indicators for that sense

Method:

Given: an ambiguous word w with senses s_1 … s_n and glosses g_1 … g_n

Given: the word w in a context c

compute the overlap between c and each gloss

select the maximally matching sense
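A minimal sketch of the overlap computation, using WordNet glosses via NLTK in place of Lesk's original machine-readable dictionary (an assumption; NLTK also ships its own implementation as nltk.wsd.lesk):

from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_tokens):
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        # Overlap between the context and this sense's gloss.
        gloss_tokens = set(synset.definition().split())
        overlap = len(context & gloss_tokens)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

sense = simplified_lesk("bass", "an electric guitar and bass player stand".split())
print(sense.name(), "-", sense.definition())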

Expanding a dictionary

Problem with Lesk:

often, dictionary definitions don’t contain sufficient information

not all words in dictionary definitions are good informants

Solution: use a thesaurus with subject/topic categories, e.g. Roget’s thesaurus:

http://www.gutenberg.org/cache/epub/22/pg22.html.utf8

Using topic categories

Suppose every sense s_k of word w has a subject/topic t_k.

Then w can be disambiguated by identifying the words related to t_k in the thesaurus.

Problems:

general-purpose thesauri don’t list domain-specific topics

several potentially useful words can be left out

e.g. in …Navratilova plays great tennis, the proper name is a useful indicator of the topic SPORT

Expanding a thesaurus: Yarowsky 1992

Given: context c and topic t

For all contexts and topics, compute p(c|t) using Naïve Bayes, by comparing the words pertaining to t in the thesaurus with the words in c.

If p(c|t) > α, then assign topic t to context c.

For all words in the vocabulary, update the list of contexts in which the word occurs: assign topic t to each word in c.

Finally, compute p(w|t) for all words w in the vocabulary; this gives the “strength of association” of w with t.

Yarowsky 1992: some results

WORD       SENSE           ROGET TOPIC     ACCURACY
star       space object    UNIVERSE        96%
star       celebrity       ENTERTAINER     95%
star       shape           INSIGNIA        82%
sentence   punishment      LEGAL_ACTION    99%
sentence   set of words    GRAMMAR         98%

Bootstrapping

Yarowsky (1995) suggested the one-sense-per-discourse and one-sense-per-collocation constraints.

Yarowsky’s method:

select the strongest collocational feature in a specific context

disambiguate based only on this feature

(similar to the information-theoretic method discussed earlier)

One sense per collocation

For each sense s of w, initialise F, the collocations found in s’s dictionary definition.

One sense per collocation:

identify the set of contexts containing collocates of s

for each sense s of w, update F to contain those collocates f such that

P(s | f) / P(s' | f) > α for all s' ≠ s

(where α is a threshold)
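A minimal sketch of this update step, assuming collocate/sense counts collected from the contexts labelled so far (the data structure and the threshold value are illustrative):

def reliable_collocates(counts, sense, alpha=2.0):
    """counts[f][s] = number of labelled contexts in which collocate f
    co-occurs with sense s."""
    kept = []
    for f, per_sense in counts.items():
        n_s = per_sense.get(sense, 0)
        if n_s == 0:
            continue
        # P(s|f) / P(s'|f) reduces to a count ratio for a fixed f.
        if all(n_s / n > alpha
               for s2, n in per_sense.items() if s2 != sense and n > 0):
            kept.append(f)
    return kept

counts = {"life": {"plant-botany": 40, "plant-factory": 1},
          "closure": {"plant-botany": 2, "plant-factory": 30}}
print(reliable_collocates(counts, "plant-botany"))  # -> ['life']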

One sense per discourse

For each document:

find the majority sense of w out of those found in the previous step

assign all occurrences of w the majority sense

This is implemented as a post-processing step (see the sketch below); it reduces the error rate by ca. 27%.
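A minimal sketch of the post-processing step, assuming the per-document sense labels from the previous step are available as a list:

from collections import Counter

def one_sense_per_discourse(doc_labels):
    """doc_labels: senses assigned to the occurrences of w in one
    document; returns the relabelled list."""
    if not doc_labels:
        return doc_labels
    # Relabel every occurrence with the document-majority sense.
    majority = Counter(doc_labels).most_common(1)[0][0]
    return [majority] * len(doc_labels)

print(one_sense_per_discourse(
    ["plant-botany", "plant-botany", "plant-factory"]))
# -> ['plant-botany', 'plant-botany', 'plant-botany']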

Unsupervised disambiguation

Preliminaries

Recall: unsupervised learning can do sense discrimination, not tagging

akin to clustering occurrences with the same sense

e.g. Brown et al. 1991: cluster the translations of a word, which is akin to clustering senses

Brown et al.’s method

Preliminary categorisation:

1. Set P(w|s) randomly, for all words w and senses s of w.

2. Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s.

Use (1) and (2) as a preliminary estimate; re-estimate iteratively to find the best fit to the corpus.
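A minimal sketch of this re-estimation loop as EM over a Naïve Bayes mixture of senses, heavily simplified (fixed iteration count, bag-of-words contexts, add-one smoothing); it follows the shape of steps (1)-(2) above rather than Brown et al.'s exact procedure:

import math
import random
from collections import defaultdict

def discriminate(contexts, n_senses=2, iterations=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    prior = [1.0 / n_senses] * n_senses
    # Step 1: set P(word | sense) randomly, then normalise.
    p_word = [{w: rng.random() + 0.5 for w in vocab} for _ in range(n_senses)]
    for s in range(n_senses):
        z = sum(p_word[s].values())
        for w in vocab:
            p_word[s][w] /= z
    for _ in range(iterations):
        # E-step (step 2 above): P(c | s) for each context, turned into
        # a normalised posterior P(s | c).
        posteriors = []
        for c in contexts:
            log_p = [math.log(prior[s]) +
                     sum(math.log(p_word[s][w]) for w in c)
                     for s in range(n_senses)]
            m = max(log_p)
            weights = [math.exp(lp - m) for lp in log_p]
            z = sum(weights)
            posteriors.append([wgt / z for wgt in weights])
        # M-step: re-estimate the prior and P(word | sense) from the
        # soft assignments (add-one smoothing keeps probabilities nonzero).
        for s in range(n_senses):
            prior[s] = sum(p[s] for p in posteriors) / len(contexts)
            counts = defaultdict(float)
            for c, p in zip(contexts, posteriors):
                for w in c:
                    counts[w] += p[s]
            z = sum(counts.values()) + len(vocab)
            for w in vocab:
                p_word[s][w] = (counts[w] + 1.0) / z
    return posteriors

contexts = [c.split() for c in [
    "water the plant leaves", "the plant flowered",
    "plant closure hit workers", "the assembly plant closed"]]
for c, post in zip(contexts, discriminate(contexts)):
    print(c, "->", max(range(len(post)), key=post.__getitem__))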

Characteristics of unsupervised disambiguation

Can adapt easily to new domains that are not covered by a dictionary or a pre-labelled corpus

Very useful for information retrieval

If there are many senses (e.g. 20 senses for a word w), the algorithm will split the contexts into fine-grained sets

NB: can go awry with infrequent senses

Some issues with WSD

The task definition

The WSD task traditionally assumes that a word has one and only one sense in a context.

Is this true?

Kilgarriff (1993) argues that co-activation (one word displaying more than one sense) is frequent:

this would bring competition to the licensed trade

competition = “act of competing” and “people/organisations who are competing”

Systematic polysemy

Not all senses are as easy to distinguish. E.g. competition in the “agent competing” vs. “act of competing” sense: the polysemy here is systematic.

Compare bank/bank, where the senses are utterly distinct (and most linguists wouldn’t consider this a case of polysemy, but of homonymy).

Can translation equivalence help here? It depends on whether the polysemy is systematic in all languages.

Logical metonymy

Metonymy = the use of a word to stand for something else

e.g. the pen is mightier than the sword (pen = the press)

Logical metonymy arises due to systematic polysemy:

good cook vs. good book

enjoy the paper vs. enjoy the cake

Should WSD systems distinguish these? How could they do so?

Which words/usages count?

Many proper names are identical to common nouns (cf. Brown, Bush, …)

This presents a WSD algorithm with systematic ambiguity and reduces performance.

Also, names are good indicators of the senses of neighbouring words, but this requires an a priori categorisation of names:

Brown’s green stance vs. the cook’s green curry