Albert Gatt
Corpora and Statistical Methods – Lecture 3
Zipf’s law and the Zipfian distribution
Part 1
Identifying words
Words
Levels of identification:
Graphical word (a token)
Dependent on surface properties of text
Underlying word (stem, root…)
Dependent on some form of morphological analysis
Practical definition: A word…
is an indivisible (!) sequence of characters
carries elementary meaning
is reusable in different contexts
Indivisibility
Words can have compositional meaning from parts that are either words themselves, or prefixes and suffixes:
colour + -less = colourless (derivation)
data + base = database (compounding)
The notion of “atomicity” or “indivisibility” is a matter of degree.
Problems with indivisibility
Definite article in Maltese:
il-kelb
DEF-dog
“the dog”
The article is phonologically dependent on the following word.
German compounding:
Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
Arabic conjunctions:
waliy
One possible gloss: “and I follow” (w- is “and”)
Reusability
Words become part of the lexicon of a language, and can be reused.
But some words can be formed on the fly using productive morphological processes.
Many words are used very rarely:
a large majority of the lexicon is inaccessible to native speakers
approximately 50% of the words in a novel will be used only once within that novel (hapax legomena)
The graphic definition
Many corpora, starting with Brown, use a definition of a graphic word:
a sequence of letters/numbers
possibly some other symbols
separated by whitespace or punctuation
But even here, there are exceptions.
Not much use for tokenisation of languages like Arabic.
Non-alphanumeric characters
Numbers such as 22.5
in word frequency counts, typically mapped to a single type ##
Other characters:
Abbreviations: U.S.A.
Apostrophes: O’Hara vs. John’s
Whitespace: New Delhi (a problem for tokenisation)
Hyphenated compounds: so-called, A-1-plus vs. aluminum-export industry
How many words do we have here?
Tokenisation
Task of breaking up running text into component words.
Crucial for most NLP tasks, as parameters are typically estimated based on words.
Can be statistical or rule-based. Often, simple regular expressions will go a long way (see the sketch below).
Some practical problems:
Whitespace: very useful in Indo-European languages. In others (e.g. East Asian languages, ancient Greek) no space is used.
Non-alphanumeric symbols: need to decide if these are part of a word or not.
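To illustrate the rule-based option, here is a minimal regular-expression tokeniser in Python (a sketch only: the pattern, its ordering and the example sentence are illustrative choices, not part of the lecture):

```python
import re

# A deliberately simple graphic-word tokeniser. Alternatives are tried in
# order, so abbreviations and numbers are matched before plain words.
TOKEN_PATTERN = re.compile(r"""
      (?:[A-Z]\.)+          # abbreviations such as U.S.A.
    | \d+(?:\.\d+)?         # numbers such as 22.5
    | \w+(?:[-']\w+)*       # words, clitics (John's), compounds (so-called)
""", re.VERBOSE)

def tokenise(text):
    """Break running text into graphic words."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("The so-called A-1-plus rating of the U.S.A. fell 22.5 points."))
# ['The', 'so-called', 'A-1-plus', 'rating', 'of', 'the', 'U.S.A.',
#  'fell', '22.5', 'points']
```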
Types and tokens
Running example
Throughout this lecture, data is taken from a corpus of Maltese texts:
ca. 51,000 words
all from Maltese-language newspapers
various topics and article types
Compared to data from English corpora taken from Baroni (2007)
Definitions (I)
token = any word in the corpus (also counting words that occur more than once)
type = all the individual, different words in the corpus (grouping words together as representatives of a single type)
Example: I spoke to the chap who spoke to the child
10 tokens
7 types (I, spoke, to, the, chap, who, child)
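In Python terms, the token/type distinction is just the difference between a list and a set (a minimal sketch using whitespace splitting on the toy sentence):

```python
from collections import Counter

text = "I spoke to the chap who spoke to the child"
tokens = text.split()   # every occurrence counts
types = set(tokens)     # duplicates are grouped into one type

print(len(tokens))              # 10 tokens
print(len(types))               # 7 types
print(Counter(tokens)["spoke"]) # 'spoke' occurs twice
```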
Definitions (II)
The number of tokens in the corpus is an estimate of overall corpus size
Maltese corpus: 51,000 tokens
The number of types is an estimate of vocabulary size
gives an idea of the lexical richness of the corpus
Maltese corpus: 8193 types
Relative measures of frequency
Relative frequency of a type:
no. of occurrences of the type / corpus size
In very large corpora, this is typically multiplied by a constant
e.g. multiplying by 1 million gives frequency per million
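A sketch of the per-million computation. The count for “the” comes from the BNC frequency list later in these slides; the corpus size of roughly 100 million tokens is my assumption about the BNC, not a figure from the lecture:

```python
def per_million(count, corpus_size):
    """Relative frequency, scaled to occurrences per million tokens."""
    return count / corpus_size * 1_000_000

# 'the' in the BNC: 6,054,231 occurrences in ca. 100 million tokens
print(per_million(6_054_231, 100_000_000))  # ca. 60,542 per million
```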
Type/token ratio
Ratio varies enormously depending on corpus size!
If the corpus is 1000 words, it’s easy to see a TTR of, say, 40%.
With 4 million words, it’s more likely to be in the region of 2%.
Reasons:
vocabulary size grows with corpus size, but
large corpora will contain a lot of types that occur many times
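A sketch of the computation, reusing the toy sentence from earlier; note how a tiny sample inflates the ratio:

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "I spoke to the chap who spoke to the child".split()
print(type_token_ratio(tokens))  # 0.7 -- tiny corpora give very high TTRs
```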
Frequency lists (BNC)
type               frequency
the                6,054,231
in                 1,931,797
time                 149,487
year                  73,167
man                   57,699
…
monarch                  744
cumin                     51
prestidigitation           3

A simple list, pairing each word with its frequency
Frequency lists (MT)

type                      frequency
aħħar (“last”)                   97
jkun (“be.IMPERF.3SG”)           96
ukoll (“also”)                   93
bħala (“as”)                     91
dak (“that.SGM”)                 86
tat- (“of.DEF”)                  86
Frequency ranks
Word counts can get very big.
most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
Raw frequency lists can be hard to process.
Useful to represent words in terms of rank:
count the words
sort by frequency (most frequent first)
assign a rank to the words:
rank 1 = most frequent
rank 2 = next most frequent
…
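These steps translate directly into Python (a sketch: Counter.most_common sorts by descending frequency, so tied types receive arbitrary but unique ranks):

```python
from collections import Counter

def rank_frequency_profile(tokens):
    """Return (rank, type, frequency) triples, rank 1 = most frequent."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

tokens = "I spoke to the chap who spoke to the child".split()
for rank, word, freq in rank_frequency_profile(tokens):
    print(rank, word, freq)
# 1 spoke 2 / 2 to 2 / 3 the 2 / 4 I 1 / ...
```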
Rank/frequency profile (BNC)
rank 1 goes to the most frequent type
all ranks are unique
ties in frequency are given arbitrary rank

rank (r)    freq (f)
1          6,054,231
2          1,931,797
3            149,487
…

Note the large differences in frequency from one rank to another.
Rank-frequency profile (MT)

Rank (r)    Frequency (f)
1           2195
2           2080
3           1277
4           1264

Differences in frequency from one rank to another are smaller than in the BNC.
Frequency spectrum (MT)
A representation that shows, for each frequency value, the number of different types that occur with that frequency.

frequency    types
1            4382
2            1253
3             661
4             356
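The spectrum is just a count of counts (a sketch, reusing the toy sentence):

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    type_frequencies = Counter(tokens)         # type -> frequency
    return Counter(type_frequencies.values())  # frequency -> number of types

tokens = "I spoke to the chap who spoke to the child".split()
print(sorted(frequency_spectrum(tokens).items()))
# [(1, 4), (2, 3)]: four hapax legomena, three types occurring twice
```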
Word distributions (few giants, many midgets)
Non-linguistic case study
Suppose we are interested in measuring people’s height.
population = adult, male/female, European
sample: N people from the relevant population
measure height of each person in the sample
Results:
person 1: 1.6 m
person 2: 1.5 m
…
Measures of central tendency
Given the height of individuals in our sample, we can calculate some summary statistics:
mean (“average”): sum of all heights in sample, divided by N
mode: most frequent value
What are your expectations?
will most people be extremely tall?
extremely short?
more or less average?
Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall close to the mean.
3. The mode is approximately the same as the mean.
4. The curve is bell-shaped (the “normal” distribution).
Distributions of words
Out of 51,000 tokens in the Maltese corpus:
8016 tokens belong to just the 5 most frequent types (the types at ranks 1–5)
ca. 15% of our corpus size is made up of only 5 different words!
Out of 8193 types:
4382 are hapax legomena, occurring only once (bottom ranks)
1253 occur only twice
…
In this data, the mean won’t tell us very much: it hides huge variations!
Ranks and frequencies (MT)
freq (f): 2195, 2080, 1277, …, 1, 1, …
Among the top ranks, frequency drops very dramatically (but this depends on corpus size).
Among the bottom ranks, frequency drops very gradually.
General observations
There are always a few very high-frequency words, and many low-frequency words.
Among the top ranks, frequency differences are big.
Among the bottom ranks, frequency differences are very small.
So what are the high-frequency words?
Top-ranked words in the Maltese data:
li (“that”), l- (DEF), il- (DEF), u (“and”), ta’ (“of”), tal- (“of the”)
Bottom-ranked words:
żona (“zone”)                  f = 1
yankee                         f = 1
żwieten (“Zejtun residents”)   f = 1
xortih (“luck.3SGM”)           f = 1
widnejhom (“ear.POSS.3PL”)     f = 1
Frequency distributions in corpora
The top few frequency ranks are taken up by function words.
In the Brown corpus, the 10 top-ranked words make up 23% of total corpus size (Baroni, 2007).
Bottom-ranked words display lots of ties in frequency.
Lots of words occurring only once (hapax legomena).
In Brown, ca. ½ of vocabulary size is made up of words that occur only once.
Implications
The mean or average frequency hides huge deviations.
In Brown, average frequency of a type is 19 tokens. But:
the mean is inflated by a few very frequent types
most words will have frequency well below the mean
Mean will therefore be higher than median (the middle value)
not a very meaningful indicator of central tendency
Mode (most frequent frequency value) is usually 1.
This is typical of most large corpora. The same happens if we look at n-grams rather than words.
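A toy illustration of why the mean misleads here (the frequency values below are invented, loosely shaped like the Maltese counts quoted earlier):

```python
from statistics import mean, median, mode

# A few very frequent types and a long tail of hapax legomena
freqs = [2195, 2080, 1277, 5, 3, 2, 1, 1, 1, 1, 1, 1]

print(mean(freqs))    # 464 -- inflated by the three giants
print(median(freqs))  # 1.5 -- most types sit far below the mean
print(mode(freqs))    # 1   -- the most frequent frequency value
```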
Typical shape of a rank/frequency curve
Actual example (MT)
A few high-frequency, low-rank words
Hundreds of low-frequency, high-rank words
Zipf’s law
Observation: frequency decreases non-linearly with rank. Zipf’s law models this as:
f(w) = C / r(w)^a
where:
r(w) = the frequency rank of word w
C = a constant, determined from data, roughly the frequency of the most frequent word
a = a constant, determined from data
Suppose a = 1, and C = 60,000. The model predicts:
2nd most frequent word will have frequency C/2 = 30,000
3rd most frequent: C/3 = 20,000
20th most frequent: C/20 = 3,000
So frequency decreases very rapidly (as a power of rank) as rank increases.
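The worked example above as runnable code (a sketch using the constants from this slide):

```python
def zipf_frequency(rank, C=60_000, a=1):
    """Frequency predicted by Zipf's law: f = C / rank**a."""
    return C / rank ** a

for rank in (1, 2, 3, 20):
    print(rank, zipf_frequency(rank))
# 1 60000.0 / 2 30000.0 / 3 20000.0 / 20 3000.0
```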
Things to note
The law doesn’t predict frequency ties:
there are no ties among ranks
The law is a power law: frequency is a function of a negative power of rank:
f(w) = C · r(w)^(-a)
Taking the log of both sides gives us a linear function:
log f(w) = log C - a · log r(w)
Basically a straight line plot.
Log-log plot for MT data (a = 1)
[plot shows deviation from the prediction at high frequencies and at low frequencies]
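A plot of this kind can be reproduced with matplotlib, assuming the corpus frequencies are available as a list sorted in descending order (a sketch; only the top four Maltese frequencies are shown here, a real plot would use the full list):

```python
import matplotlib.pyplot as plt

def zipf_plot(freqs, C, a=1):
    """Observed frequencies vs. the Zipf prediction, on log-log axes."""
    ranks = range(1, len(freqs) + 1)
    plt.loglog(ranks, freqs, ".", label="observed")
    plt.loglog(ranks, [C / r ** a for r in ranks], "-", label="Zipf, a=%g" % a)
    plt.xlabel("rank (log scale)")
    plt.ylabel("frequency (log scale)")
    plt.legend()
    plt.show()

# With the full descending frequency list of the corpus; here just the top 4:
zipf_plot([2195, 2080, 1277, 1264], C=2195)
```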
Log-log plot for data from Baroni (2007)
Some observations
Empirical work has shown that the law doesn’t perfectly predict frequencies:
at the bottom ranks (low frequencies), actual frequency drops more rapidly than predicted
at the top ranks (high frequencies), the model predicts higher frequencies than actually attested
Mandelbrot’s law
Mandelbrot proposed a generalised version of Zipf’s law:
f(w) = C / (r(w) + b)^a
(Note: Zipf’s original law is Mandelbrot’s law with b = 0)
If b is a small value, it will make the frequency of items ranked at the top (rank 1, 2, etc.) significantly smaller, but won’t affect the lower ranks.
Comparison
Let C = 60,000, a = 1 and b = 1.
Then, for a word of rank 1:
Zipf’s law predicts f(w) = 60,000/1 = 60,000
Mandelbrot’s law predicts f(w) = 60,000/(1+1) = 30,000
For a word of rank 1000:
Zipf predicts: f(w) = 60,000/1000 = 60
Mandelbrot: f(w) = 60,000/1001 ≈ 59.94
So differences are bigger at the top than at the bottom.
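The same comparison as runnable code (a sketch using the constants above):

```python
def zipf(rank, C=60_000, a=1):
    return C / rank ** a

def mandelbrot(rank, C=60_000, a=1, b=1):
    # Zipf's law is the special case b = 0
    return C / (rank + b) ** a

for rank in (1, 1000):
    print(rank, zipf(rank), mandelbrot(rank))
# rank 1:    60000.0 vs 30000.0 -- a big difference at the top
# rank 1000: 60.0 vs 59.94...   -- almost none at the bottom
```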
Linear version of Mandelbrot
Taking the log of both sides:
log f(w) = log C - a · log(r(w) + b)
Note: as a function of log r(w), this is no longer a straight line, so it should fit our data better.
Consequences of the law
Data sparseness: no matter how big your corpus, most of the words in it will be of very low frequency.
You can’t exhaust the vocabulary of a language: new words will crop up as corpus size increases.
Implication: you can’t compare vocabulary richness of corpora of different sizes.
Explanation for Zipfian distributions
Zipf’s own explanation (“least effort” principle):
Speaker’s goal is to minimise effort by using a few distinct words as frequently as possible.
Hearer’s goal is to maximise clarity by having as large a vocabulary as possible.
Other Zipfian distributions
Zipf’s law crops up in other domains (e.g. the distribution of incomes).
Even randomly generated character strings show the same pattern!
there are few possible short strings, but each is likely to crop up by chance
there are many more long strings, but each one is individually less likely