
Presentation Transcript


Albert Gatt

Corpora and Statistical Methods – Lecture 3

Zipf’s law and the Zipfian distribution

Part 1

Identifying words

Words

Levels of identification:

Graphical word (a token)

Dependent on surface properties of text

Underlying word (stem, root…)

Dependent on some form of morphological analysis

Practical definition: A word…

is an indivisible (!) sequence of characters

carries elementary meaning

is reusable in different contexts

Indivisibility

Words can have compositional meaning from parts that are either words themselves, or prefixes and suffixes:

colour + -less = colourless (derivation)

data + base = database (compounding)

The notion of "atomicity" or "indivisibility" is a matter of degree.

Problems with indivisibility

Definite article in Maltese:

il-kelb (DEF-dog, "the dog")

phonologically dependent on the following word

German compounding:

Lebensversicherungsgesellschaftsangestellter ("life insurance company employee")

Arabic conjunctions: waliy

One possible gloss: "and I follow" (w- is "and")

Reusability

Words become part of the lexicon of a language, and can be reused.

But some words can be formed on the fly using productive morphological processes.

Many words are used very rarely.

A large majority of the lexicon is inaccessible to native speakers.

Approximately 50% of the words in a novel will be used only once within that novel (hapax legomena).

The graphic definition

Many corpora, starting with Brown, use a definition of a graphic word:

sequence of letters/numbers

possibly some other symbols

separated by whitespace or punctuation

But even here, there are exceptions.

Not much use for tokenisation of languages like Arabic.

Non-alphanumeric characters

Numbers such as 22.5

in word frequency counts, typically mapped to a single type ##

Other characters:

Abbreviations: U.S.A.

Apostrophes: O'Hara vs. John's

Whitespace: New Delhi (a problem for tokenisation)

Hyphenated compounds: so-called, A-1-plus vs. aluminum-export industry

How many words do we have here?

Tokenisation

Task of breaking up running text into component words.

Crucial for most NLP tasks, as parameters typically estimated based on words.

Can be statistical or rule-based. Often, simple regular expressions will go a long way (see the sketch below).

Some practical problems:

Whitespace: very useful in Indo-European languages. In others (e.g. East Asian languages, ancient Greek) no space is used.

Non-alphanumeric symbols: need to decide if these are part of a word or not.
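As an illustration of the rule-based approach, here is a minimal Python sketch; the regular expression is illustrative only, not the Brown corpus's actual rule set:

```python
import re

# A toy graphic-word tokeniser: runs of letters/digits, optionally joined
# by internal apostrophes or hyphens. Illustrative, not a real rule set.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenise(text):
    """Return the list of graphic words in `text`."""
    return TOKEN_RE.findall(text)

print(tokenise("John's so-called A-1-plus plan for New Delhi"))
# ["John's", 'so-called', 'A-1-plus', 'plan', 'for', 'New', 'Delhi']
```

Note that New Delhi still comes out as two tokens, so the whitespace problem listed above remains.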

Types and tokens

Running example

Throughout this lecture, data is taken from a corpus of Maltese texts:

ca. 51,000 words

all from Maltese-language newspapers

various topics and article types

Compared to data from English corpora taken from Baroni (2007).

Definitions (I)

token = any word in the corpus (also counting words that occur more than once)

type = all the individual, different words in the corpus (grouping words together as representatives of a single type)

Example: I spoke to the chap who spoke to the child

10 tokens

7 types (I, spoke, to, the, chap, who, child)
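The same count, as a short Python sketch (assuming simple whitespace tokenisation and no case folding):

```python
# Tokens vs. types for the example sentence.
sentence = "I spoke to the chap who spoke to the child"
tokens = sentence.split()   # whitespace tokenisation
types = set(tokens)         # distinct words only

print(len(tokens))  # 10 tokens
print(len(types))   # 7 types: I, spoke, to, the, chap, who, child
```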

Definitions (II)

The number of tokens in the corpus is an estimate of overall

corpus size

Maltese corpus: 51,000 tokens

The number of types is an estimate of

vocabulary size

gives an idea of the lexical richness of the corpus

Maltese corpus: 8193 types

Relative measures of frequency

Relative frequency of a type: no. of occurrences of the type / corpus size

In very large corpora, this is typically multiplied by a constant

e.g. multiplying by 1 million gives frequency per million
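A one-line calculation in Python, using the Maltese-corpus figures from these slides (51,000 tokens; aħħar occurs 97 times):

```python
# Relative frequency of a type, scaled to frequency per million tokens.
corpus_size = 51_000   # tokens in the Maltese corpus
freq = 97              # occurrences of the type, e.g. ahhar ("last")

rel_freq = freq / corpus_size          # ~0.0019
per_million = rel_freq * 1_000_000     # ~1902 per million tokens
print(rel_freq, per_million)
```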

Type/token ratio

Ratio varies enormously depending on corpus size!

If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.

With 4 million words, it’s more likely to be in the region of 2%.

Reasons:

vocabulary size grows with corpus size, but much more slowly

large corpora will contain a lot of types that occur many times
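This effect is easy to reproduce. A toy Python simulation (not the Maltese data: tokens are sampled from an invented Zipf-like vocabulary) shows the TTR falling as the sample grows:

```python
import random

# Sample tokens from a Zipf-like distribution over an invented vocabulary,
# then measure the type/token ratio on growing prefixes of the "corpus".
random.seed(0)
vocab = [f"w{i}" for i in range(5_000)]
weights = [1 / r for r in range(1, 5_001)]   # Zipf-ish: weight 1/rank
tokens = random.choices(vocab, weights=weights, k=100_000)

def ttr(toks):
    return len(set(toks)) / len(toks)

for n in (1_000, 10_000, 100_000):
    print(n, round(ttr(tokens[:n]), 3))
# The ratio falls steadily as n grows: tokens accumulate faster than new types.
```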

Frequency lists (BNC)

type              frequency
the               6054231
in                1931797
time              149487
year              73167
man               57699
monarch           744
cumin             51
prestidigitation  3

A simple list, pairing each word with its frequency.
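Building such a list is a one-liner with Python's Counter (a sketch; the tokens would come from a tokeniser like the one above):

```python
from collections import Counter

# Pair each word with its frequency, most frequent first.
tokens = "the cat sat on the mat and the dog sat too".split()
freq_list = Counter(tokens)

for word, freq in freq_list.most_common():
    print(word, freq)
# the 3
# sat 2
# cat 1 ... (the remaining words all have frequency 1)
```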

Frequency lists (MT)

type                    frequency
aħħar ("last")          97
jkun ("be.IMPERF.3SG")  96
ukoll ("also")          93
bħala ("as")            91
dak ("that.SGM")        86
tat- ("of.DEF")         86

Frequency ranks

Word counts can get very big.

The most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small).

Raw frequency lists can be hard to process.

Useful to represent words in terms of rank:

count the words

sort by frequency (most frequent first)

assign a rank to the words: rank 1 = most frequent, rank 2 = next most frequent, …
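In Python, ranking is just a sort (a sketch; the counts are the BNC figures from the earlier slide). Ties are broken arbitrarily by sort order, exactly as the next slide notes:

```python
# Assign ranks: sort types by descending frequency, then number them from 1.
freqs = {"the": 6054231, "in": 1931797, "time": 149487}
ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
# 1 the 6054231
# 2 in 1931797
# 3 time 149487
```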

Rank/frequency profile (BNC)

rank 1 goes to the most frequent type

all ranks are unique

ties in frequency are given arbitrary rank

rank (r)  freq (f)
1         6054231
2         1931797
3         149487

Note the large differences in frequency from one rank to another.

Rank-frequency profile (MT)

Rank (r)  Frequency (f)
1         2195
2         2080
3         1277
4         1264

Differences in frequency from one rank to another are smaller than in the BNC.

Frequency spectrum (MT)

A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency  types
1          4382
2          1253
3          661
4          356
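Computed in Python, the spectrum is a Counter over the values of another Counter (a sketch with a toy corpus):

```python
from collections import Counter

# For each frequency value, count the number of types with that frequency.
tokens = "a a a b b c d d e f".split()
freq_of_type = Counter(tokens)             # type -> frequency
spectrum = Counter(freq_of_type.values())  # frequency -> number of types

for f in sorted(spectrum):
    print(f, spectrum[f])
# 1 3   (c, e, f are hapax legomena)
# 2 2   (b, d)
# 3 1   (a)
```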

Word distributions (few giants, many midgets)

Non-linguistic case study

Suppose we are interested in measuring people’s height.

population = adult, male/female, European

sample: N people from the relevant population

measure height of each person in the sample

Results:

person 1: 1.6 m

person 2: 1.5 m

…

Measures of central tendency

Given the height of individuals in our sample, we can calculate some summary statistics:

mean (“average”): sum of all heights in sample, divided by N

mode: most frequent value

What are your expectations?

will most people be extremely tall?

extremely short?

more or less average?

Plotting height/frequency

Observations:

1. Extreme values are less frequent.

2. Most people fall close to the mean.

3. The mode is approximately the same as the mean.

4. Bell-shaped curve (the "normal" distribution).

Distributions of words

Out of 51,000 tokens in the Maltese corpus:

8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)

ca. 15% of our corpus size is made up of only 5 different words!

Out of 8193 types:

4382 are hapax legomena, occurring only once (bottom ranks)

1253 occur only twice

In this data, the mean won't tell us very much: it hides huge variations!

Ranks and frequencies (MT)

[Chart: frequency by rank in the MT corpus; top frequencies 2195, 2080, 1277, …, falling to 1 at the bottom ranks]

Among the top ranks, frequency drops very dramatically (but this depends on corpus size).

Among the bottom ranks, frequency drops very gradually.

General observations

There are always a few very high-frequency words, and many low-frequency words.

Among the top ranks, frequency differences are big.

Among the bottom ranks, frequency differences are very small.

So what are the high-frequency words?

Top-ranked words in the Maltese data:

li ("that"), l- (DEF), il- (DEF), u ("and"), ta' ("of"), tal- ("of the")

Bottom-ranked words:

żona ("zone")                 f = 1
yankee                        f = 1
żwieten ("Żejtun residents")  f = 1
xortih ("luck.3SGM")          f = 1
widnejhom ("ear.POSS.3PL")    f = 1

Frequency distributions in corpora

The top few frequency ranks are taken up by function words.

In the Brown corpus, the 10 top-ranked words make up 23% of total corpus size (Baroni, 2007).

Bottom-ranked words display lots of ties in frequency.

Lots of words occurring only once (hapax legomena).

In Brown, ca. ½ of vocabulary size is made up of words that occur only once.

Implications

The mean or average frequency hides huge deviations.

In Brown, average frequency of a type is 19 tokens. But:

the mean is inflated by a few very frequent types

most words will have frequency well below the mean

Mean will therefore be higher than median (the middle value)

not a very meaningful indicator of central tendency

Mode (most frequent frequency value) is usually 1.

This is typical of most large corpora. The same happens if we look at n-grams rather than words.
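A toy illustration in Python (invented counts, not the actual Brown figures) of how a few giants inflate the mean while the median and mode stay at 1:

```python
from statistics import mean, median, mode

# 3 very frequent types plus 997 hapax legomena.
freqs = [60_000, 30_000, 20_000] + [1] * 997

print(round(mean(freqs), 1))  # 111.0  (inflated by the three giants)
print(median(freqs))          # 1.0    (the middle value)
print(mode(freqs))            # 1      (the most frequent frequency value)
```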

Typical shape of a rank/frequency curve

Actual example (MT)

A few high-frequency, low-rank words

Thousands of low-frequency, high-rank words

Zipf’s law

Observation: frequency decreases non-linearly with rank.

Zipf's law: f(w) = C / r(w)^a

where C is a constant determined from the data (roughly the frequency of the most frequent word), and a is a constant, also determined from the data.

Suppose a = 1 and C = 60,000. The model predicts:

2nd most frequent word: C/2 = 30,000

3rd most frequent: C/3 = 20,000

20th most frequent: C/20 = 3,000

So frequency decreases very rapidly as rank increases: a power-law decay.
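The predictions above, computed directly (a minimal sketch):

```python
# Zipf's law: f(w) = C / r(w) ** a, with the slide's values a = 1, C = 60,000.
C, a = 60_000, 1

for r in (1, 2, 3, 20):
    print(r, C / r ** a)
# 1 60000.0
# 2 30000.0
# 3 20000.0
# 20 3000.0
```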

Things to note

The law doesn’t predict frequency ties

there are no ties among ranks

The

law is a

power law

: frequency is a function of negative power of rank

Taking

the log of both sides gives us a linear function:

Basically a straight line plot.
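A quick numerical check (a sketch): on log-log axes, points generated by Zipf's law with a = 1 lie on a line of slope -1.

```python
import math

# log f(w) = log C - a * log r(w): a straight line with slope -a.
C, a = 60_000, 1
points = [(math.log10(r), math.log10(C / r ** a)) for r in range(1, 1001)]

(x1, y1), (x2, y2) = points[0], points[-1]
slope = (y2 - y1) / (x2 - x1)
print(round(slope, 6))  # -1.0, i.e. -a
```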


Log-log plot for MT data (a=1)

Deviation from prediction for high frequencies

Deviation from prediction for low frequencies

Log-log plot for data from Baroni (2007)

Some observations

Empirical work has shown that the law doesn’t perfectly predict frequencies:

at the bottom ranks (low frequencies), actual frequency drops more rapidly than predicted

at the top ranks (high frequencies), the model predicts higher frequencies than actually attested

Mandelbrot’s law

Mandelbrot proposed a generalised version of Zipf's law:

f(w) = C / (r(w) + b)^a

(Note: Zipf's original law is Mandelbrot's law with b = 0.)

If b is a small value, it makes the frequency of items ranked at the top (rank 1, 2, etc.) significantly smaller, but hardly affects the lower ranks.

Comparison

Let C = 60,000, a = 1 and b = 1.

For a word of rank 1:

Zipf's law predicts f(w) = 60,000/1 = 60,000

Mandelbrot's law predicts f(w) = 60,000/(1+1) = 30,000

For a word of rank 1000:

Zipf predicts f(w) = 60,000/1000 = 60

Mandelbrot predicts f(w) = 60,000/1001 ≈ 59.94

So the differences are bigger at the top than at the bottom.
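The same comparison in Python (a sketch using the slide's constants):

```python
# Zipf: f(w) = C / r ** a; Mandelbrot: f(w) = C / (r + b) ** a.
C, a, b = 60_000, 1, 1

def zipf(r):
    return C / r ** a

def mandelbrot(r):
    return C / (r + b) ** a

for r in (1, 1000):
    print(r, zipf(r), round(mandelbrot(r), 2))
# 1 60000.0 30000.0    <- big difference at the top
# 1000 60.0 59.94      <- negligible difference at the bottom
```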

Linear version of Mandelbrot

log f(w) = log C - a log(r(w) + b)

Note: on a log-log plot this is no longer a straight line, so it should fit our data better.

Consequences of the law

Data sparseness: no matter how big your corpus, most of the words in it will be of very low frequency.

You can't exhaust the vocabulary of a language: new words will crop up as corpus size increases.

Implication: you can't compare the vocabulary richness of corpora of different sizes.

Explanation for Zipfian distributions

Zipf’s own explanation (“least effort” principle):

Speaker’s goal is to minimise effort by using a few distinct words as frequently as possible

Hearer's goal is to maximise clarity by having as large a vocabulary as possible.

Other Zipfian distributions

Zipf’s

law crops up in other domains (e.g. distribution of incomes)

Even

randomly generated character strings show the same pattern!

short strings will be few, but likely to crop up by chance

more long strings, but each one less likely individually
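A small simulation (a sketch; the alphabet and text length are arbitrary choices) makes the point: "words" produced by a random character source already show a few giants and many midgets.

```python
import random
from collections import Counter

# Generate random characters over a 3-letter alphabet plus space, then
# treat the space-separated chunks as "words" and count them.
random.seed(0)
text = "".join(random.choice("abc ") for _ in range(100_000))
freqs = Counter(text.split())

for rank, (word, f) in enumerate(freqs.most_common(8), start=1):
    print(rank, word, f)
# Short strings dominate the top ranks; longer strings are individually rare,
# so the rank/frequency profile is roughly Zipfian.
```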