Genre in a Frequency Dictionary
Adam Kilgarriff & Carole Tiberius

Outline
Three problems
Our solutions

Routledge Frequency Dictionaries
Ten languages/volumes so far
Series editors: Mark Davies, Paul Rayson
“5000 most frequently used words”
Genre/text type?
Some marking
Like traditional dictionaries

The corpus linguist’s dilemma
We know that everything depends on text type
Usually: ignore it, and pretend our corpus is representative (or that we even knew what that meant)
Frequency dictionary: specially painful

Poetic interlude
As many texts as stars in the sky
As many domains as constellations
As many genres as stories to tell about them
Represent them?
Shucks

A tiny step in the direction of respecting the importance of genre
Instead of just one list
One list per genre

The Whelks Problem
Rare word, but a book about whelks uses it hundreds of times
Solution: document frequency

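To make the whelks problem concrete, here is a minimal Python sketch on invented toy data. The function names `raw_frequency` and `document_frequency` and the sample contents are illustrative assumptions, not taken from the talk. It contrasts a raw token count with a count of the samples that contain the word.

```python
from collections import Counter

def raw_frequency(samples, word):
    """Total number of occurrences of `word` across all samples."""
    return sum(Counter(sample)[word] for sample in samples)

def document_frequency(samples, word):
    """Number of samples in which `word` occurs at least once."""
    return sum(1 for sample in samples if word in sample)

# Toy data (invented): one sample from a book about whelks, nine others.
whelk_book = ["whelks"] * 300 + ["the"] * 50
others = [["the", "sea", "shore"] for _ in range(9)]
samples = [whelk_book] + others

whelks_raw = raw_frequency(samples, "whelks")      # 300 tokens, all in one sample
whelks_df = document_frequency(samples, "whelks")  # 1 sample
shore_raw = raw_frequency(samples, "shore")        # 9 tokens
shore_df = document_frequency(samples, "shore")    # 9 samples
```

On raw counts "whelks" looks far more frequent than "shore"; on document frequency the ranking reverses, which is the effect the slide is after.
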
The genius of Brown
Fixed sample size
500 x 2000-word samples
Makes the maths easy
Frequencies directly comparable
Document frequency works
No need to compensate for different sample lengths
Contra Sinclair, Hanks
Different goals
Brown: very widely used, replicated

A Frequency Dictionary of Dutch
In the Routledge series
Publication later this year

Dutch
Written and Spoken
The Netherlands and Flanders
pinpas
betaalkaart
gij, ge
jij, je

The Corpus
Fiction: 25 books per year, 1970-2009
Newspapers: from SONAR corpus, 1993-2005
Spoken: from Corpus Gesproken Nederlands
Web: from SONAR corpus; includes blogs, discussion lists, e-magazines, press releases, websites and Wikipedia

Corpus preparation
Tagging
Lemmatisation
Slice corpora into 2000-word samples
http://ilk.uvt.nl/frog/

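The slicing step can be sketched as follows, assuming the corpus has already been tokenised into a flat list of tokens. The function name `slice_into_samples` is hypothetical, and dropping a short trailing remainder is an assumption made here so that every sample has the same length; the slides do not say how leftovers were handled.

```python
def slice_into_samples(tokens, sample_size=2000):
    """Cut a token list into consecutive fixed-size samples.

    Assumption: a trailing remainder shorter than `sample_size` is dropped,
    so that every sample has exactly the same length.
    """
    n_full = len(tokens) // sample_size
    return [tokens[i * sample_size:(i + 1) * sample_size] for i in range(n_full)]

# Usage with a toy "corpus" of 4500 placeholder tokens.
tokens = [f"w{i}" for i in range(4500)]
samples = slice_into_samples(tokens)
```

With 4500 tokens this yields two samples of 2000 tokens each, and the last 500 tokens are discarded.
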
How many lists?
One list per genre
But – overlap?
Core:
4 genres:
General:

Which words to include
Which list(s) to put them in
Throughout: document frequency, implemented as the percentage of samples that the word occurs in

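Document frequency as a percentage of samples, computed separately per genre, might look like the sketch below. The function name `document_frequency_pct`, the representation of each sample as a set of lemmas, and the toy data are all assumptions for illustration; each genre is assumed to have at least one sample.

```python
def document_frequency_pct(samples_by_genre, lemma):
    """Percentage of samples per genre that contain `lemma` at least once."""
    return {
        genre: 100.0 * sum(1 for s in samples if lemma in s) / len(samples)
        for genre, samples in samples_by_genre.items()
    }

# Toy data (invented); each sample is represented as a set of lemmas.
samples_by_genre = {
    "fiction": [{"ham", "egg"}, {"ham"}, {"egg"}, {"tea"}],
    "news": [{"egg"}, {"tax"}, {"tax"}, {"vote"}],
}
ham = document_frequency_pct(samples_by_genre, "ham")
```

Because every sample has the same length, these percentages are directly comparable across genres of different total size.
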
Inclusion
Include if average across four genres > 1.125
5000 words

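A minimal sketch of the inclusion test, assuming each word comes with one document-frequency percentage per genre. The names `GENRES` and `is_included` and the example scores are hypothetical. Note that an average above 1.125 over four genres is arithmetically the same as a sum above 4.5.

```python
GENRES = ("fiction", "news", "spoken", "web")

def is_included(scores, threshold=1.125):
    """True if the mean score over the four genres exceeds `threshold`."""
    return sum(scores[g] for g in GENRES) / len(GENRES) > threshold

# Invented scores (percentage of samples containing the word, per genre).
common = {"fiction": 3.0, "news": 1.0, "spoken": 0.5, "web": 0.5}  # mean 1.25
rare = {"fiction": 4.0, "news": 0.0, "spoken": 0.0, "web": 0.0}    # mean 1.0
```

Here `is_included(common)` is true and `is_included(rare)` is false, even though `rare` has the higher score in a single genre.
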
Core Vocabulary
Words that are used across all kinds of language
Implemented as: words with frequency > x in all genres
4.5 mark gives 943 core-vocab words
In core-vocab only; not in other lists

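The core-vocabulary rule (score above x in every genre) can be sketched as below, together with a small helper that reports how long the core list would be for several candidate values of x. The helper, the function names and the scores are invented for illustration; the slides give only the outcome that a threshold of 4.5 yields 943 words, not how that value was chosen.

```python
def core_vocabulary(table, x=4.5):
    """Words whose score exceeds `x` in every genre."""
    return sorted(w for w, scores in table.items() if min(scores.values()) > x)

def core_size_by_threshold(table, thresholds):
    """Size of the core list for each candidate threshold."""
    return {x: len(core_vocabulary(table, x)) for x in thresholds}

# Invented scores (percentage of samples containing the word, per genre).
table = {
    "de": {"fiction": 100.0, "news": 100.0, "spoken": 99.0, "web": 100.0},
    "huis": {"fiction": 30.0, "news": 12.0, "spoken": 8.0, "web": 6.0},
    "pinpas": {"fiction": 0.5, "news": 5.0, "spoken": 1.0, "web": 6.0},
}
core = core_vocabulary(table)
sizes = core_size_by_threshold(table, [4.5, 7.0, 50.0])
```

Taking the minimum across genres means a single low-scoring genre is enough to keep a word out of the core list.
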
Which list(s)? The problem
Word    Fiction  News  Spoken  Web
Ham     20       5     4       3
Egg     20       18    4       3
Cheese  20       18    19      3

Our solution
Word    Fiction  News  Spoken  Web  Lists
Ham     20       5     4       3    Fiction
Egg     20       18    4       3    Fiction, News
Cheese  20       18    19      3    General

Algorithm
Minimum > 4.5
The complication is that some words will occur in two, three or four of the lists generated in this way, and for such cases we have to decide whether they go in:
just one list
more than one list
the general list.
Our strategy is to say there should be some cases of each, as follows:
if the highest frequency is at least double the next highest, list in that genre only
if two are high and two are low, that is, the first- and second-highest are both more than double the other two, list in both the top two
else list in general.

Algorithm
Min > 4.5?
  Core-vocab
Else
  If highest-score > 2 x second-highest-score
    Highest-score-genre
  Else if second-highest-score > 2 x third-highest-score
    Highest-score-genre and second-highest-score-genre
  Else
    General

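The decision procedure above can be sketched in Python as follows, using the ham, egg and cheese figures from the earlier slides as input. The function name `assign_lists` and the list labels are hypothetical. The sketch follows the strict "greater than" comparisons of this slide (the prose version says "at least double"), and ties between genres are broken by dictionary order, which is an assumption.

```python
CORE_MIN = 4.5  # core-vocabulary threshold given on the slides

def assign_lists(scores, core_min=CORE_MIN):
    """Return the list name(s) for one word.

    `scores` maps genre name -> document-frequency percentage.
    """
    if min(scores.values()) > core_min:
        return ["Core"]
    # Genres ordered from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    first, second, third = (scores[g] for g in ranked[:3])
    if first > 2 * second:
        return [ranked[0]]   # one genre clearly dominates
    if second > 2 * third:
        return ranked[:2]    # two high, two low
    return ["General"]

table = {
    "Ham": {"Fiction": 20, "News": 5, "Spoken": 4, "Web": 3},
    "Egg": {"Fiction": 20, "News": 18, "Spoken": 4, "Web": 3},
    "Cheese": {"Fiction": 20, "News": 18, "Spoken": 19, "Web": 3},
}
result = {word: assign_lists(s) for word, s in table.items()}
```

For these inputs the sketch places Ham in Fiction, Egg in Fiction and News, and Cheese in General, matching the table on the "Our solution" slide.
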
The ‘genre’ lists

Observations
Fiction: broadest vocabulary, longest list
Spoken: smallest, shortest
Spoken and web: much overlap
Fiction and news: some overlap

In sum
Everything depends on genre
Not easy to handle well in any dictionary
Specially hard in a frequency dictionary
It helps to use fixed sample size and document frequencies (as percentages)
A modest attempt to pay genre due respect
Routledge Frequency Dictionary of Dutch, 2013

Poetic interlude
As many texts as stars in the sky
As many domains as constellations
As many genres as stories to tell about them
Represent them?
Shucks