
Text Corpora and Lexical Resources

Chapter 2 of Natural Language Processing with Python

So far --

We have learned the basics of Python

Reading and writing – interactive and files

Control structures

if, while, for, function and class definitions

Important data structures:

lists, tuples, numeric (int and float)

Basic natural language processing techniques

Tonight

Expanding the scope of textual information we can access

Additional language constructions for working with text

Reintroduce some Python structures for organizing programs

Text corpora

A collection of text entities

Usually there is some unifying characteristic, but not always

Typical examples

All issues of a newspaper for a period of time

A collection of reports from a particular industry or standards body

More recent examples

The whole collection of posts to Twitter

All the entries in a blog or set of blogs

Check it out

Go to http://www.gutenberg.org/

Take a few minutes to explore the site.

Look at yesterday's top 100 downloads.

Can you characterize them? What do you think of this list?

Corpora in NLTK

NLTK includes part of the Gutenberg collection.

Find out which ones with:

>>> nltk.corpus.gutenberg.fileids()

These are the texts of the Gutenberg collection that are downloaded with the nltk package.

Accessing other texts

We will explore the files loaded with nltk. You may want to explore other texts also.

From help(nltk.corpus):

If C{item} is one of the unique identifiers listed in the corpus module's C{items} variable, then the corresponding document will be loaded from the NLTK corpus package. If C{item} is a filename, then that file will be read.

For now – just a note that we can use these tools on other texts that we download or acquire from any source.
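As a preview, a minimal sketch of pulling an arbitrary local text into the same word-list form (the filename document.txt is a hypothetical placeholder, and nltk.word_tokenize needs the punkt models from nltk.download):

>>> import nltk
>>> raw = open('document.txt').read()   # one long string
>>> tokens = nltk.word_tokenize(raw)    # list of words and punctuation
>>> text = nltk.Text(tokens)            # usable with the Chapter 1 tools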

Using the tools we saw before

The particular texts we saw in chapter 1 were accessed through aliases that simplified the interaction.

Now, in the more general case, we have to do a little more.

To get the list of words in a text:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')

Now we have the form we had for the texts of Chapter 1 and can use the tools found there. Try:

>>> len(emma)
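To use the Chapter 1 methods such as concordance(), wrap the word list in nltk.Text first (this is how the book itself does it):

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance('surprize')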

Note the frequent use of the Jane Austen books as examples. Why might that be?

Shortened reference

Global context

Instead of citing the gutenberg corpus for each resource:

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')

So,

nltk.corpus.gutenberg.words('austen-emma.txt')

becomes just

gutenberg.words('austen-emma.txt')

Other access options

gutenberg.words('austen-emma.txt')

the words of the text

gutenberg.raw('austen-emma.txt')

the original text, with no separation into tokens (words). One long string.

gutenberg.sents('austen-emma.txt')

the text divided into sentences
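For example, sents() supports questions about sentence structure; a short sketch adapted from the book's Macbeth example:

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences[1037]
>>> longest_len = max(len(s) for s in macbeth_sentences)
>>> [s for s in macbeth_sentences if len(s) == longest_len]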

Some code to run

Enter and run the code for counting characters, words, sentences and finding the lexical diversity score of each text in the corpus.

import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(int(num_chars/num_words), int(num_words/num_sents),
          int(num_words/num_vocab), fileid)

Short, simple code – but already showing some noticeable time to execute.

Modify the code

Simple change – print out the total number of characters, words, and sentences for each text.
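One possible solution – a minimal sketch that prints the raw counts instead of the ratios:

import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # characters
    num_words = len(gutenberg.words(fileid))  # words
    num_sents = len(gutenberg.sents(fileid))  # sentences
    print(num_chars, num_words, num_sents, fileid)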

The text corpus

Take a look at your directory of nltk_data to see the variety of text materials accessible to you.

Some are not plain text and we cannot use them yet – but we will.

Of the plain text, note the diversity:

Classic published materials

News feeds, movie reviews

Overheard conversations, internet chat

All categories of language are needed to understand the language as it is defined and as it is used.

The Brown Corpus

The first million-word electronic corpus of English

Explore – what are the categories?

Access words or sentences from one or more categories or fileids:

>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories="<choose>")
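For example (from the book), words and sentences can be pulled out by category or by file:

>>> brown.words(categories='news')
>>> brown.words(fileids=['cg22'])
>>> brown.sents(categories=['news', 'editorial', 'reviews'])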

Stylistics

Enter that code and run it.

What does it give you?

What does it mean?

>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')

Spot check

Repeat the previous code, but look for the use of those same words in the religion and government categories.

Now analyze the use of the “wh” words in the news category and one other of your choice (who, what, where, when, why). A sketch follows below.
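A minimal sketch for the wh-word part, assuming the news category (swap in the second category of your choice):

>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> for m in ['who', 'what', 'where', 'when', 'why']:
...     print(m + ':', fdist[m])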

One step comparison

Consider the following code:

import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

Enter and run it. What does it do?

Other corpora

There is some information about the Reuters and Inaugural address corpora also. Take a look at them on the online site. (5 minutes or so)
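Some calls worth trying while you explore, taken from the book's Reuters and Inaugural examples:

>>> from nltk.corpus import reuters
>>> reuters.categories()
>>> reuters.categories('training/9865')
>>> reuters.fileids('barley')
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()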

Spot Check

Take a look at Table 2-2 for a list of some of the material available from the NLTK project. (I cannot fit it on a slide in any meaningful way.)

Confirm that you have downloaded all of these (when you ran nltk.download(), if you selected all).

Find them in your directory and explore.

How many languages are represented?

How would you describe the variety of content? What do you find most interesting/unusual/strange/fun?

Languages

The Universal Declaration of Human Rights is available in over 300 languages.

>>> from nltk.corpus import udhr
>>> udhr.fileids()
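For example, to peek at one language's version:

>>> udhr.words('English-Latin1')[:15]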

Organization of Corpora

The organization will vary according to the type of corpus. Knowing the organization may be important for using the corpus.

Example – Description

fileids() – the files of the corpus
fileids([categories]) – the files of the corpus corresponding to these categories
categories() – the categories of the corpus
categories([fileids]) – the categories of the corpus corresponding to these files
raw() – the raw content of the corpus
raw(fileids=[f1,f2,f3]) – the raw content of the specified files
raw(categories=[c1,c2]) – the raw content of the specified categories
words() – the words of the whole corpus
words(fileids=[f1,f2,f3]) – the words of the specified fileids
words(categories=[c1,c2]) – the words of the specified categories
sents() – the sentences of the whole corpus
sents(fileids=[f1,f2,f3]) – the sentences of the specified fileids
sents(categories=[c1,c2]) – the sentences of the specified categories
abspath(fileid) – the location of the given file on disk
encoding(fileid) – the encoding of the file (if known)
open(fileid) – open a stream for reading the given corpus file
root() – the path to the root of the locally installed corpus
readme() – the contents of the README file of the corpus

Table 2.3 – Basic Corpus Functionality in NLTK

from help(nltk.corpus.reader):

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

- I{corpus}.words(): list of str
- I{corpus}.sents(): list of (list of str)
- I{corpus}.paras(): list of (list of (list of str))
- I{corpus}.tagged_words(): list of (str,str) tuple
- I{corpus}.tagged_sents(): list of (list of (str,str))
- I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
- I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
- I{corpus}.parsed_sents(): list of (Tree with str leaves)
- I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
- I{corpus}.xml(): A single xml ElementTree
- I{corpus}.raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use C{nltk.corpus.brown.words()}:

>>> from nltk.corpus import brown
>>> print(brown.words())

Types of information returned from typical functions

Spot check

Choose a corpus and exercise some of the functions.

Look at raw, words, sents, categories, fileids, encoding.

Repeat for a source in a different language.

Work in pairs and talk about what you find, what you might want to look for.

Report out briefly.
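A starting point, using the Brown corpus (substitute whichever corpus you chose):

>>> from nltk.corpus import brown
>>> brown.fileids()[:5]
>>> brown.categories()
>>> brown.words()[:10]
>>> brown.sents()[0]
>>> brown.raw()[:200]
>>> brown.encoding('ca01')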

Working with your own sources

NLTK provides a wealth of resources, but you will certainly want to access your own collections – other books you download, files you create, etc.

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

You could get the list of files in any directory this way.

Other Corpus readers

There are a number of different readers for different types of corpora.

Many files in corpora are “marked up” in various ways and the reader needs to understand the markings to return meaningful results.

We will stick to the PlaintextCorpusReader for now.

Conditional Frequency Distribution

When texts in a corpus are divided into categories, we may want to look at the characteristics by category – word use by author or over time, for example

Figure 2.4: Counting Words Appearing in a Text Collection (a conditional frequency distribution)

Frequency Distributions

A frequency distribution counts some occurrence, such as the use of a word or phrase.

A conditional frequency distribution counts some occurrence separately for each of a number of conditions (author, date, genre, etc.).

For example:

>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

Think about this. What exactly is happening?

What are those 170,576 things? Run the code, then enter just >>> genre_word

For each genre (‘news’, ‘romance’)

loop over every word in that genre

produce the pairs showing the genre and the word

What type of data is genre_word?

>>> genre_word = [(genre, word)
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)]
>>> len(genre_word)
170576

Spot check

Refining the result

When you displayed genre_word, you may have noticed that some of the words are not words at all. They are punctuation marks.

Refine this code to eliminate the entries in genre_word in which the word is not all alphabetic.

Remove duplicate words that differ only in capitalization.

Work together. Talk about what you are doing. Share your ideas and insights. One possible approach appears below.
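One possible refinement – a sketch that keeps only purely alphabetic tokens and lowercases them so capitalization variants collapse together:

>>> from nltk.corpus import brown
>>> genre_word = [(genre, word.lower())
...     for genre in ['news', 'romance']
...     for word in brown.words(categories=genre)
...     if word.isalpha()]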

Conditional Frequency Distribution

From the list of pairs we created, we can generate a conditional frequency distribution of words by genre

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()

Run these. Look at the results.

Look at the conditional distributions

>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193

Presenting the results

Plotting and tabulating – concise representations of the frequency distributions

Tabulate

With no parameters, it simply tabulates all the conditions against all the values:

>>> cfd.tabulate()

Look closely

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))

Callouts from the slide: get the text; the two axes; all the words in each file; narrow the word choice. Remember list comprehension?

Three elements

For a conditional frequency distribution:

Two axes

condition or event, something of interest

some connected characteristic – a year, a place, an author, anything that is related in some way to the event

Something to count

For the condition and the characteristic, what are we counting? Words? Actions? What?

From the previous example:

inaugural addresses

specific words

count the number of times that a form of either of those words occurred in that address

Spot check

Run the code on the previous example.

How many times was some version of “citizen” used in the 1909 inaugural address?

How many times was “america” mentioned in 2009?

Play with the code. What can you leave off and still get some meaningful output?
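To read individual answers out of the distribution, index by condition (the target word) and then by sample (the year as a four-character string):

>>> cfd['citizen']['1909']
>>> cfd['america']['2009']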

Another case

Somewhat simpler specification

Distribution of word lengths in several languages, with a restriction on which languages:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))

Now tabulate

Choose to tabulate only some of the results:

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...     samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

Note – so far, I cannot do plots. I hope to get that fixed. If you can do plots, do try some of the examples.
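If plotting does work on your machine (it requires matplotlib), the same distribution can be drawn rather than tabulated; a sketch:

>>> cfd.plot(conditions=['English', 'German_Deutsch'],
...          samples=range(10), cumulative=True)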

Common methods for Conditional Frequency Distributions

cfdist = ConditionalFreqDist(pairs) – create a conditional frequency distribution from a list of pairs
cfdist.conditions() – alphabetically sorted list of conditions
cfdist[condition] – the frequency distribution for this condition
cfdist[condition][sample] – frequency for the given sample for this condition
cfdist.tabulate() – tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) – tabulation limited to the specified samples and conditions
cfdist.plot() – graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) – graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 – test if samples in cfdist1 occur less frequently than in cfdist2

References

This set of slides comes very directly from the book Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper (O'Reilly Media, 2009).

www.nltk.org