Genre in a - PowerPoint Presentation

karlyn-bohler
Uploaded On 2018-01-15

Presentation Transcript

Slide1

Genre in a Frequency Dictionary

Adam Kilgarriff & Carole Tiberius

Slide2

Outline

Three problems
Our solutions

Slide3

Routledge Frequency Dictionaries

Ten languages/volumes so far
Series editors: Mark Davies, Paul Rayson
“5000 most frequently used words”
Genre/text type?
  Some marking
  Like traditional dictionaries

Slide4

The corpus linguist’s dilemma

We know that
  Everything depends on text type
Usually
  Ignore
  Pretend our corpus is representative
  (or we even knew what it meant)
Frequency dictionary
  Specially painful

Slide5

Poetic interlude

As many texts as stars in the sky
As many domains as constellations
As many genres as stories to tell about them

Represent them?

Shucks

Slide6

A tiny step in the direction of respecting the importance of genre

Instead of just one list
One list per genre

Slide7

The Whelks Problem

Rare word, but a book about whelks uses it hundreds of times
Solution: document frequency

Slide8
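The whelks problem from the preceding slide can be made concrete with a small sketch. The toy corpus and the function names `raw_frequency` and `document_frequency` are hypothetical illustrations, not taken from the presentation; the point is only that one document using a word hundreds of times inflates its raw count but not its document frequency.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
whelk_book = ["whelk"] * 200 + ["the", "sea"]
documents = [
    whelk_book,
    ["the", "cat", "sat"],
    ["a", "cat", "slept"],
    ["the", "dog", "barked"],
]


def raw_frequency(docs):
    """Count every occurrence of every token across all documents."""
    return Counter(tok for doc in docs for tok in doc)


def document_frequency(docs):
    """Count the number of documents in which each token occurs at least once."""
    return Counter(tok for doc in docs for tok in set(doc))


raw = raw_frequency(documents)
df = document_frequency(documents)

# Raw counts rank "whelk" far above "cat"; document counts reverse the order.
print(raw["whelk"], raw["cat"])  # 200 2
print(df["whelk"], df["cat"])    # 1 2
```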

The genius of Brown

Fixed sample size
  500 x 2000-word samples
Makes the maths easy
  Frequencies directly comparable
  Document frequency works
  No need to compensate for different sample length
Contra Sinclair, Hanks
  Different goals
Brown: very widely used, replicated

Slide9

A Frequency Dictionary of Dutch

In the Routledge series
Publication later this year

Slide10

Dutch

Written and Spoken

The Netherlands and Flanders
  pinpas
  betaalkaart
  gij, ge
  jij, je

Slide11

The Corpus

Fiction
  25 books per year, 1970-2009
Newspapers
  From SONAR corpus, 1993-2005
Spoken
  From Corpus Gesproken Nederlands
Web
  From SONAR corpus, includes blogs, discussion lists, e-magazines, press releases, websites and Wikipedia

Slide12

Corpus preparation

Tagging
Lemmatisation
Slice corpora into 2000-word samples
http://ilk.uvt.nl/frog/

Slide13
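The slicing step from the preceding slide might look roughly like the sketch below. It assumes the corpus is already tokenised into a flat list; the function name `slice_into_samples` is hypothetical, and dropping a trailing remainder shorter than 2000 words is an assumption, since the slides do not say how leftovers were handled.

```python
def slice_into_samples(tokens, sample_size=2000):
    """Split a token list into consecutive samples of exactly sample_size tokens.

    Assumption: a trailing remainder shorter than sample_size is discarded,
    so that every sample has the same length.
    """
    n_full = len(tokens) // sample_size
    return [tokens[i * sample_size:(i + 1) * sample_size] for i in range(n_full)]


# Hypothetical input: 4500 dummy tokens yield two full samples.
tokens = [f"w{i}" for i in range(4500)]
samples = slice_into_samples(tokens)
print(len(samples), [len(s) for s in samples])  # 2 [2000, 2000]
```

With equal-length samples, the share of samples containing a word can be compared directly across genres without any length normalisation.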

How many lists?

One list per genre
But – overlap?
Core:
4 genres:
General:

Slide14

Which words to include

Which list(s) to put them in
Throughout: document frequency, implemented as the percentage of samples that the word occurs in

Slide15

Inclusion

Include if average across four genres > 1.125

5000 words

Slide16
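The inclusion rule from the preceding slide can be sketched as follows, assuming each genre's samples are stored as sets of lemmas and scores are percentages of samples containing the word. The names `doc_freq_percent` and `is_included`, the genre keys, and the example scores are hypothetical; only the 1.125 threshold and the averaging over four genres come from the slides.

```python
GENRES = ("fiction", "news", "spoken", "web")


def doc_freq_percent(word, samples):
    """Percentage of samples (sets of lemmas) in which the word occurs."""
    if not samples:
        return 0.0
    hits = sum(1 for sample in samples if word in sample)
    return 100.0 * hits / len(samples)


def is_included(scores, threshold=1.125):
    """Include a word if its mean score across the four genres exceeds the threshold."""
    average = sum(scores[g] for g in GENRES) / len(GENRES)
    return average > threshold


# Hypothetical scores: the first word averages 1.25, the second 1.0.
print(is_included({"fiction": 3.0, "news": 1.0, "spoken": 0.5, "web": 0.5}))  # True
print(is_included({"fiction": 2.0, "news": 1.0, "spoken": 0.5, "web": 0.5}))  # False
```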

Core Vocabulary

Words that are used across all kinds of language
Implemented as: words with frequency > x in all genres
4.5 mark gives 943 core-vocab words
In core-vocab only; not in other lists

Slide17

Which list(s)? The problem

Word     Fiction  News  Spoken  Web
Ham      20       5     4       3
Egg      20       18    4       3
Cheese   20       18    19      3

Slide18

Our solution

Word     Fiction  News  Spoken  Web  Lists
Ham      20       5     4       3    Fiction
Egg      20       18    4       3    Fiction, News
Cheese   20       18    19      3    General

Slide19

Algorithm

Minimum > 4.5

The complication is that some words will occur in two, three or four of the lists generated in this way, and for such cases we have to decide whether they go in:
  just one list
  more than one list
  the general list.

Our strategy is to say there should be some cases of each, as follows:
  if the highest frequency is at least double the next highest, list in that genre only
  if two are high and two are low, that is, the first- and second-highest are both more than double the other two, list in both the top two
  else list in general.
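The strategy above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function name `assign_lists` and the "Core" and "General" labels are hypothetical, the core check uses the 4.5 mark quoted for core vocabulary, and the comparisons are strict (more than double), whereas the prose says "at least double". The scores are the Ham, Egg and Cheese figures from the example table.

```python
def assign_lists(scores, core_threshold=4.5):
    """Decide which list(s) a word goes in, given its per-genre scores."""
    # Rank genres from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_genre, top), (second_genre, second), (_, third) = ranked[:3]

    # A word scoring above the threshold in every genre is core vocabulary.
    if min(scores.values()) > core_threshold:
        return ["Core"]
    # One genre clearly dominates: list the word there only.
    if top > 2 * second:
        return [top_genre]
    # Two genres are high and the rest are low: list the word in both.
    if second > 2 * third:
        return [top_genre, second_genre]
    return ["General"]


ham = {"Fiction": 20, "News": 5, "Spoken": 4, "Web": 3}
egg = {"Fiction": 20, "News": 18, "Spoken": 4, "Web": 3}
cheese = {"Fiction": 20, "News": 18, "Spoken": 19, "Web": 3}

print(assign_lists(ham))     # ['Fiction']
print(assign_lists(egg))     # ['Fiction', 'News']
print(assign_lists(cheese))  # ['General']
```

Because the genres are sorted by score, comparing the second-highest score with the third-highest is enough to establish that both top genres are more than double the remaining two.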

Slide20

Algorithm

Min > 4.5?
  Core-vocab
Else
  If highest-score > 2 x second-highest-score
    Highest-score-genre
  Else if second-highest-score > 2 x third-highest-score
    Highest-score-genre and second-highest-score-genre
  Else
    General

Slide21

The ‘genre’ lists

Slide22

Observations

Fiction
  Broadest vocabulary, longest list
Spoken
  Smallest, shortest
Spoken and web: much overlap
Fiction and news: some overlap

Slide23

In sum

Everything depends on genre
  Not easy to handle well in any dictionary
  Specially hard in a frequency dictionary
It helps to use
  Fixed sample size
  Document frequencies (as percentages)
A modest attempt to pay genre due respect
  Routledge Frequency Dictionary of Dutch, 2013

Slide24

Poetic interlude

As many texts as stars in the sky
As many domains as constellations
As many genres as stories to tell about them

Represent them?

Shucks