Accessing files with NLTK

liane-varnes
Uploaded On 2016-07-03

Presentation Transcript

Slide1

Accessing files with NLTK

Regular Expressions

Slide2

Accessing additional files

Python has tools for accessing files in local directories and also for obtaining files from the web.

We have seen the tools for reading any file from a local directory.

Now, let’s see how to obtain files from the web.

Slide3

Reminder, file access

file(filename[, mode])

filename.close()
File no longer available.

filename.fileno()
Returns the file descriptor; not usually needed.

filename.read([size])
Read at most size bytes. If size is not specified, read to end of file.

filename.readline([size])
Read one line. If size is provided, read at most that many bytes. An empty string is returned if EOF is encountered immediately.

filename.readlines([sizehint])
Return a list of lines. If sizehint is present, read whole lines totalling approximately that many bytes, possibly rounding up to fill a buffer.

filename.write(string)

Where filename is the internal name of the file object.

Mode is ‘r’ for read only, ‘w’ for write only, ‘r+’ for read and write, ‘a’ for append.

Slide4
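The file-access reminder above can be tried out with a short sketch. It is written for Python 3, where open() replaces the Python 2 file() call shown on the slide; the file name sample.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# "sample.txt" is a hypothetical name, not a file from the slides.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:              # 'w' = write only
    out.write("first line\nsecond line\nthird line\n")

with open(path, "r") as source:           # 'r' = read only (the default)
    head = source.read(5)                 # at most 5 characters
    rest_of_line = source.readline()      # remainder of the current line
    remaining = source.readlines()        # list of the lines that are left
    at_eof = source.read()                # empty string once EOF is reached

print(repr(head), repr(rest_of_line), remaining, repr(at_eof))
```

The with statement closes the file automatically, so no explicit close() call is needed.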

Python module for web access

urllib2

Note – this is for Python 2.x, not Python 3

Python 3 splits the urllib2 materials over several modules

import urllib2

urllib2.urlopen(url[, data][, timeout])

Establish a link with the server identified in the url and send either a GET or POST request to retrieve the page.

The optional data field provides data to send to the server as part of the request. If the data field is present, the HTTP request used is POST instead of GET.

Use it to fetch content that is behind a form, perhaps a login page.

If used, the data must be encoded properly for inclusion in an HTTP request. See http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1

timeout defines the time in seconds to be used for blocking operations such as the connection attempt. If it is not provided, the system-wide default value is used.
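For readers on Python 3, here is a minimal sketch of the same call using urllib.request, one of the modules that took over from urllib2. The helper names build_request and fetch, the example.com URLs, and the form fields are assumptions made for illustration. The sketch only builds the requests, so it runs without a network connection.

```python
import urllib.parse
import urllib.request

def build_request(url, form=None):
    """Return a Request object; supplying form data turns GET into POST."""
    data = None
    if form is not None:
        # Form data must be URL-encoded and passed as bytes.
        data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(url, data=data)

def fetch(url, form=None, timeout=10):
    """Open the URL (needs network access) and return (status code, body bytes)."""
    with urllib.request.urlopen(build_request(url, form), timeout=timeout) as page:
        return page.getcode(), page.read()

# Hypothetical URLs and form fields, used only to show the GET/POST switch.
get_req = build_request("http://example.com/")
post_req = build_request("http://example.com/login", {"user": "me", "pw": "secret"})
print(get_req.get_method(), post_req.get_method())
print(post_req.data)
```

Calling fetch("http://example.com/") would perform the actual request; it is left uncalled here so the sketch has no side effects.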

http://docs.python.org/library/urllib2.html

Slide5

URL fetch and use

urlopen returns a file-like object with methods:

Same as for files: read(), readline(), readlines(), fileno(), close()

New for this class:
info() – returns meta information about the document at the URL
getcode() – returns the HTTP status code sent with the response (ex: 200, 404)
geturl() – returns the URL of the page, which may be different from the URL requested if the server redirected the request

Slide6

Short example file read

filename = raw_input('File to read: ')
source = file(filename)  # Access is read-only
for line in source:
    print line

Recall what this does:

Open the file for read access (default when no option specified)
Step through the file, one line at a time (“for line in source”)
Print each line

Slide7

URL fetch

import urllib2

url = raw_input("Enter the URL of the page to fetch: ")
# "http://" is seven characters long, so compare against the first seven
if "http://" not in url[0:7]:
    url = "http://" + url
print "Attempting to open ", url
try:
    linecount = 0
    page = urllib2.urlopen(url)
    result = page.getcode()
    if result == 200:
        for line in page:
            print line
            linecount += 1
        print "Page Information \n ", page.info()
        print "Result code = ", page.getcode()
        print "Page contains ", linecount, " lines."
except:
    print "\nCould not open URL: ", url

Slide8

URL info

info() provides the header information that HTTP returns when the HEAD request is used.

ex:

>>> print mypage.info()
Date: Mon, 12 Sep 2011 14:23:44 GMT
Server: Apache/1.3.27 (Unix)
Last-Modified: Tue, 02 Sep 2008 21:12:03 GMT
ETag: "2f0d4-215f-48bdac23"
Accept-Ranges: bytes
Content-Length: 8543
Connection: close
Content-Type: text/html

Slide9

URL status and code

>>> print mypage.getcode()
200
>>> print mypage.geturl()
http://www.csc.villanova.edu/~cassel/

Slide10

Messy HTML

HTML is not always perfect.

Browsers may be forgiving.

Human and computerized html generators make mistakes.

Tools for dealing with imperfect html include Beautiful Soup.

http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."

Slide11

The NLP pipeline

Slide12

import nltk
import urllib2

fail = False
url = raw_input("Enter the URL of the page to fetch: ")
if "http://" not in url[0:7]:
    url = "http://" + url
print "Attempting to open ", url
try:
    linecount = 0
    page = urllib2.urlopen(url)
except:
    print "\nCould not open URL: ", url
    fail = True
if not fail:
    for line in page:
        raw = nltk.clean_html(line)
        print raw

File: /Users/lcassel/pythonwork/classexample/url-fetch-clean.py

Slide13

Tokenizing

import re, nltk, urllib2, pprint

filename = raw_input('File to read: ')
infile = file(filename)  # Access is read-only
print "File chosen:", filename
source = infile.read(1000)
tokens = nltk.wordpunct_tokenize(source)
tokens = tokens[20:200]
text = nltk.Text(tokens)
words = [w.lower() for w in text]
vocab = sorted(set(words))
print vocab

File: /Users/lcassel/pythonwork/classexamples/openfile.py

Slide14

Output from previous code:

['#', "'", '***', ',', '-', '.', '15', '2006', '2011', '2554', '28', ':', ';', '[', ']', 'a', 'about', 'almost', 'and', 'anywhere', 'at', 'author', 'away', 'bickers', 'but', 'by', 'children', 'constance', 'copy', 'cost', 'crime', 'dagny', 'date', 'deeply', 'doctor', 'dostoevsky', 'ebook', 'english', 'evenings', 'father', 'few', 'five', 'fyodor', 'garnett', 'give', 'gutenberg', 'hard', 'help', 'himself', 'his', 'in', 'included', 'it', 'john', 'language', 'last', 'license', 'lived', 'march', 'may', 'mother', 'no', 'november', 'of', 'online', 'only', 'or', 'org', 'parents', 'people', 'poor', 'preface', 'produced', 'project', 'punishment', 're', 'reader', 'release', 'religious', 'restrictions', 'rooms', 's', 'so', 'son', 'spent', 'start', 'terms', 'that', 'the', 'their', 'they', 'this', 'title', 'to', 'translated', 'translator', 'two', 'under', 'understand', 'updated', 'use', 'very', 'was', 'were', 'whatsoever', 'with', 'words', 'work', 'working', 'www', 'you']

Input file was “Crime and Punishment” as a local txt file, since crawling Gutenberg does not seem to work.

Slide15

Spot check

Fetch a web page.

Print out the lines of the page as they are, and also as cleaned by nltk.

Compare the two versions. What is removed and what is retained? Is all html removed? If anything is left, what is it and why do you think it is retained?

Tokenize the text of the page.

Print the vocabulary.

Slide16

Character encoding

ASCII, Unicode

American Standard Code for Information Interchange

Everything stored in the computer must be expressed as a bit pattern.

For numbers, easy – convert to binary

For integers, direct conversion

For real numbers, floating point: a somewhat arbitrary choice of how to represent where the decimal point is, how much precision for the whole number part, how much for the exponent.

For non-numeric characters, some arbitrary choice of what bit pattern to assign to each character

Slide17

Coding considerations

If the numeric interpretation of the bit string assigned to one character is less than that for another character, the first will sort to an earlier position.

Thus, assign the codes in the sort order desired.

Clearly, A before B
A before or after a?
8 before or after A?
* before or after A, 8?

Once the choices are made and the code is constructed, sort order is determined. Any need to change will have to be dealt with in individual applications.

Slide18
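The sort-order questions from the coding considerations above can be answered by asking Python for the numeric code of each character. This is a small Python 3 sketch; the sample characters are the ones the slide asks about, plus B.

```python
# Characters from the coding-considerations questions, plus 'B'.
chars = ["a", "A", "8", "*", "B"]

# ord() gives the numeric code assigned to each character.
codes = {ch: ord(ch) for ch in chars}
for ch in chars:
    print(repr(ch), codes[ch])

# Default string sorting follows those numeric codes.
in_order = sorted(chars)
print(in_order)   # ['*', '8', 'A', 'B', 'a']
```

Punctuation sorts before digits, digits before upper case, and upper case before lower case, because that is the order in which the codes were assigned.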

Representing the bit patterns

All the encodings can be represented as numeric values. Example: the ASCII code for “K” is one byte, shown here as two groups of four bits: 0100 1011

Decimal 75
familiar, but not really convenient for representing bits.

Hexadecimal 4B
one character for each four bits.

Octal 113 (_01 001 011)
one character for each three bits, from the right

Slide19
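The representations of “K” from the bit-pattern discussion above can be reproduced with Python's built-in formatting. This is a short Python 3 sketch; the variable names are illustrative.

```python
code = ord("K")                  # numeric value of the character

as_decimal = str(code)
as_binary = format(code, "08b")  # eight bits, zero padded
as_hex = format(code, "X")       # one hex digit per four bits
as_octal = format(code, "o")     # one octal digit per three bits, from the right

print(as_decimal, as_binary, as_hex, as_octal)   # 75 01001011 4B 113
```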

The ASCII code

Slide20

Limitations of ASCII

Original ASCII used only 7 of the available 8 bits
last bit kept for parity checking

Limited the number of characters that can be represented.

Extended – use the 8th bit
There are several variations
See http://www.ascii-code.com/

Slide21

Source:

http://www.cdrummond.qc.ca/cegep/informat/Professeurs/Alain/files/ascii.htm

Extended ASCII Hex 80 to FF

Some additional language characters, such as é and à and æ and the Greek alphabet. Many more missing.

Slide22

Unicode

ASCII is just one encoding example

ASCII, even extended, does not have enough space for all needed encodings.

Different schemes in use present potential conflict – different codes for the same symbol, different symbols with the same code if you deal with more than one scheme.

Enter Unicode. See unicode.org

Slide23

From unicode.org

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Slide24

Unicode

There are three encoding forms: 8, 16, 32 bits

UTF-8 includes the ASCII codes

UTF-16 all commonly used symbols in one 16-bit unit, other symbols available in pairs of 16-bit units

UTF-32 when size is not an issue. All symbols in a 32-bit string of bits

Slide25
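To make the three Unicode encoding forms above concrete, the following Python 3 sketch counts the bytes each form needs for a few sample characters. The choice of characters and their labels are illustrative assumptions. The "-le" codec variants are used so that no byte-order mark is added to the counts.

```python
# Sample characters, written as escapes so the source stays plain ASCII.
samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "euro sign": "\u20ac",
    "musical G clef": "\U0001D11E",
}

sizes = {}
for name, ch in samples.items():
    # Bytes needed in UTF-8, UTF-16 and UTF-32 respectively.
    sizes[name] = tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )
    print("U+%04X %-15s %s" % (ord(ch), name, sizes[name]))
```

The clef lies outside the range covered by a single 16-bit unit, so UTF-16 needs a pair of units (four bytes) for it, while UTF-32 always uses four bytes.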

Using unicode

Slide26

Regular Expressions

Processing text often involves selecting for specific characteristics

Regular expressions
powerful tool for describing the characteristics of interest

Access in python:
import re

Raw string notation: precede a string with r
r’\n’ means backslash then n, not new line

Slide27

Regular Expression special characters – pt 1

‘^’ (Caret) matches the start of the string
‘$’ matches the end of the string, or just before the newline at the end of a string
‘.’ matches any single character (except a newline, by default)
‘*’ matches 0 or more repetitions of the preceding re. 0*1 matches any number of 0s followed by 1: 1, 01, 001, 0001, etc.
‘+’ matches 1 or more repetitions. 0+1 matches 01, 001, 0001, etc., but not 1
‘?’ matches 0 or 1 repetitions. 0?1 matches 1 and 01 only
{m,n} matches between m and n repetitions. {m}, with no n specified, matches exactly m repetitions. 0{2,4}1 matches 001, 0001, 00001; 0{3}1 matches only 0001

Slide28
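The repetition operators listed above can be checked against a handful of strings. This is a Python 3 sketch; the helper accepted and the candidate list are made up for illustration, and re.fullmatch is used so that each pattern has to cover the whole string.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["1", "01", "001", "0001", "00001", "000001"]

star = accepted(r"0*1", candidates)        # zero or more 0s
plus = accepted(r"0+1", candidates)        # one or more 0s
optional = accepted(r"0?1", candidates)    # at most one 0
ranged = accepted(r"0{2,4}1", candidates)  # two to four 0s
exact = accepted(r"0{3}1", candidates)     # exactly three 0s

for name, result in [("0*1", star), ("0+1", plus), ("0?1", optional),
                     ("0{2,4}1", ranged), ("0{3}1", exact)]:
    print(name, result)
```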

Regular Expression special characters – pt 2

{m,n}? match as few as possible of these. 0{2,4}?1 will match 001 if it is available, or 0001 if no 001 is available, or 00001 if no shorter string is available.
\ escapes a special character, so you can search for * or ? etc.
[ ] used to indicate a set of characters. [abc] will match a or b or c
range: [0-9A-Za-z] will match any digit or letter, upper or lower case
Special characters lose their meaning in a set: [*?] matches * or ?. A backslash still escapes inside a set, so [\*] matches only *
^ = negate the set: [^0-9] will match anything except a digit
| means “or”: A|B means the character A or the character B. Options are tested left to right and the search quits when a match is found. This gives priority to the symbol listed first.

Slide29
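A few one-line experiments illustrate the second group of special characters above: greedy versus non-greedy repetition, character sets, negated sets, and the left-to-right priority of alternation. This is a Python 3 sketch with made-up test strings.

```python
import re

# Greedy repetition takes as many 0s as allowed; the ? suffix takes as few.
greedy = re.search(r"0{2,4}", "00000").group()
lazy = re.search(r"0{2,4}?", "00000").group()

# Inside a set, * and ? are ordinary characters.
in_set = re.findall(r"[*?]", "a*b?c")

# A leading ^ inside the set negates it.
non_digits = re.findall(r"[^0-9]", "a1b2")

# Alternatives are tried left to right; the first one that matches wins.
first_wins = re.search(r"cat|category", "category").group()

print(greedy, lazy, in_set, non_digits, first_wins)
```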

Python re

import nltk
import re

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
print [w for w in wordlist if re.search('ed$', w)]

matches all words in the list that end in ed

Take it step by step:

(Get all the English words in the wordlist -- )

wordlist = [w for w in nltk.corpus.words.words('en')]
print wordlist[0:200]

['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate', 'abacination', 'abaciscus', 'abacist', 'aback', 'abactinal', 'abactinally', 'abaction', 'abactor', 'abaculus', 'abacus', 'Abadite', 'abaff', 'abaft', 'abaisance',

Slide30

from __future__ import division
import nltk, re, pprint

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
print wordlist[0:200]

Restrict to lower case words

['a', 'aa', 'aal', 'aalii', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate', 'abacination', 'abaciscus', 'abacist', 'aback', 'abactinal', 'abactinally', 'abaction', 'abactor',

from __future__ import division
import nltk, re, pprint

wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
wordlist = wordlist[0:200]
print [w for w in wordlist if re.search('ed$', w)]

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed']

Slide31

Wildcard . matches any single character

Crossword match example:

[w for w in wordlist if re.search('^..j..t..$', w)]

^ Word beginning
. Single character
j, t Specific letter
$ Word end

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

Slide32

Spot check

Your Turn: The caret symbol ^ matches the start of a string, just like the $ matches the end. What results do we get with the above example if we leave out both of these, and search for «..j..t..»?

Think about it first. What do you expect?

Then run it.

Crossword match example:

['abjectedness', 'abjection', 'abjective', 'abjectly', 'abjectness', 'adjection', 'adjectional', 'adjectival', 'adjectivally', 'adjective', 'adjectively', 'adjectivism', 'adjectivitis', 'adjustable', 'adjustably', 'adjustage', 'adjustation', 'adjuster', 'adjustive', 'adjustment', 'antejentacular', 'antiprojectivity', 'bijouterie', 'coadjustment', 'cojusticiar', 'conjective', 'conjecturable', 'conjecturably', 'conjectural', 'conjecturalist', 'conjecturality', 'conjecturally', 'conjecture', 'conjecturer', 'coprojector', 'counterobjection', 'dejected', 'dejectedly', 'dejectedness', 'dejectile', 'dejection', …

There will always be two letters before j and two letters between j and t and two letters after t. Nothing else specified.

Slide33

? as optional character

? indicates 0 or 1 occurrences

^e-?mail$
matches either email or e-mail

^[Ee]-?mail$
allows either upper or lower case E

Note that [^Ee] matches anything that is not E or e: the negation is inside the [ ]

Slide34
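The optional-character patterns above can be tried on a short word list. This is a Python 3 sketch; the list of words is made up for illustration.

```python
import re

words = ["email", "e-mail", "Email", "E-mail", "e--mail", "gmail"]

# The hyphen is optional; the first letter must be a lower-case e.
lower_only = [w for w in words if re.search(r"^e-?mail$", w)]

# The set [Ee] allows either case for the first letter.
either_case = [w for w in words if re.search(r"^[Ee]-?mail$", w)]

# With ^ inside the brackets the set is negated: anything except E or e.
not_e = re.findall(r"[^Ee]", "Eerie")

print(lower_only, either_case, not_e)
```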

Texting example

First letter from ghi, second from mno, then jlk, then def

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

Take away the ^ and $

'tinkerlike', 'tinkerly', 'tinkershire', 'tinkershue', 'tinkerwise', 'tinlet', 'titleholder', 'toolholder', 'toolholding', 'touchhole', 'trainless', 'traphole', 'trinkerman', 'trinket', 'trinketer', 'trinketry', 'trinkety', 'triole', 'trioleate', 'triolefin', 'trioleic', …

Slide35

Python use of re

re.search(pattern, string[, flags])
Scan through string looking for pattern. Return None if not found.

re.match(pattern, string)
If zero or more characters at the beginning of string match the re pattern, return a corresponding MatchObject instance. Return None if string does not match the pattern.

re.split(pattern, string)
Split string by occurrences of pattern.

from: http://docs.python.org/library/re.html (some options not included)

Slide36
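The difference between the re functions described above shows up when the pattern does not occur at the very start of the string. This is a small Python 3 sketch; the sample text and variable names are illustrative.

```python
import re

text = "Words, words, words."

# search scans the whole string, so it finds the lower-case word later on.
found_anywhere = re.search(r"words", text)

# match only looks at the beginning of the string.
found_at_start = re.match(r"words", text)
capital_at_start = re.match(r"Words", text)

# split cuts the string at every occurrence of the pattern.
pieces = re.split(r",\s*", text)

print(found_anywhere.group(), found_at_start, capital_at_start.group(), pieces)
```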

Some shortened forms

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

\w = word class: equivalent to [a-zA-Z0-9_]

\W = complement of \w – all characters other than letters, digits, and the underscore

“If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.” – thus, the split is on the non-alphanumeric characters, but those characters are included in the resulting list.

Ref: http://docs.python.org/library/re.html

Slide37

re.findall(pattern, string[, flags])

Return all non-overlapping matches of pattern in string, as a list of strings. String scanned left-to-right. Matches returned in order found.

Slide38

Applications of re

Extract word pieces

>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()

vu50390:ch3 lcassel$ python re2.py
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ('oe', 15), ('iu', 14), ('ae', 11), ('eau', 10), ('uo', 8), ('ao', 6), ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4), ('aa', 3), ('ieu', 3), ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1), ('eea', 1), ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1), ('uu', 1)]

Slide39

Spot check

Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

[int(n) for n in re.findall(?, '2009-12-31')]

Slide40

Processing some text

Noting redundancy in English and eliminating internal word vowels:

>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

Slide41
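The vowel-dropping idea from the text-processing example above also works without the NLTK corpus. The sketch below is Python 3 and uses the same regular expression as the slide; the sample sentence is typed in by hand and split on spaces, which is a simplification of the corpus-based original.

```python
import re

# Keep leading vowels, trailing vowels, and every non-vowel; drop the rest.
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    """Remove the vowels that are internal to a word."""
    return ''.join(re.findall(regexp, word))

sentence = "Universal Declaration of Human Rights"
compressed = ' '.join(compress(w) for w in sentence.split())
print(compressed)   # Unvrsl Dclrtn of Hmn Rghts
```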

Tabulating combinations

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words
...        for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

Rotokas is an East Papuan language

Slide42

Inspecting the words behind the numbers

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                  for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']

Slide43

Stemming

Simple approach:

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word

Slide44

Building a stemmer

Build a disjunction of all suffixes

Take a look. What do we have here?

r – raw string. Interpret everything just as what you see.
^ from the beginning
. match anything
* repeat the match anything 0 or more times
(ing|ly|ed|ious|ies|ive|es|s|ment) – look for one of these
$ at the end of the string
‘processing’ -- the string

result = re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

Slide45

To get the whole word

Need to add ?:

>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

Slide46

Split the word into stem and suffix

Some subtleties involved

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

Looks ok, but

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]

The * is a greedy operator. It takes as much as it can get.

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]

*? is the non-greedy version.

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]

? makes the suffix list optional, matches when none present

Slide47

A stemming function

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Note some strange “words” returned as the stem: basi from basis, and deriv and execut, etc.

Slide48
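The regular-expression stemmer from the stemming function above can be run without NLTK by splitting a phrase on spaces instead of calling nltk.word_tokenize. This is a Python 3 sketch under that simplification; the constant name SUFFIX_PATTERN is illustrative.

```python
import re

# Non-greedy stem followed by an optional suffix from the list.
SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def stem(word):
    """Return the word with one listed suffix removed, if it has one."""
    stem_part, suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem_part

phrase = "women lying in ponds distributing swords is no basis"
stems = [stem(w) for w in phrase.split()]
print(stems)
```

The output shows the same over-stemming that the slide points out: lying becomes ly, is becomes i, and basis becomes basi.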

The Porter Stemmer

Official home:
http://tartarus.org/martin/PorterStemmer/index-old.html

The python version
http://tartarus.org/martin/PorterStemmer/python.txt

Slide49

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la

( ) means only that part is returned

Slide50

nltk.re_show

import nltk, re
sent = "Colorless green ideas sleep furiously"
nltk.re_show('l', sent)
nltk.re_show('gree', sent)

Co{l}or{l}ess green ideas s{l}eep furious{l}y
Colorless {gree}n ideas sleep furiously

Slide51

Word patterns

>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals

Slide52

Spot Check

How would you find all instances of the pattern as x as y?

example: as easy as pie

Can you handle this: as pretty as a picture

Slide53

More on Stemming

>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']

The lemmatizer only removes an affix if the resulting word is in its dictionary.

Slide54

Tokenizing

We have done split, but it was not very complete.

Built in re abbreviation for any kind of white space: \s

>>> re.split(r'\s+', raw)
['Dennis:', 'Listen,', 'strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses,', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony.']

Slide55

Tokenizing

Split on anything other than a word character (A-Za-z0-9_)

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',
'']

Note: I’M became I M

re.findall(r'\w+', raw)
Matches the words, instead of splitting on the separators

«\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

Slide56

Getting there

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Now get internal marks – ‘M and ‘t

Slide57
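The three tokenizing attempts from the slides above can be compared side by side on one short string. This is a Python 3 sketch; the sample sentence is a shortened, made-up variant of the text used on the slides.

```python
import re

raw = "I won't have any pepper--Maybe it's hot-tempered."

# 1. Split on whitespace: punctuation stays glued to the words.
by_space = re.split(r"\s+", raw)

# 2. Split on non-word characters: punctuation is lost and an empty
#    string appears where the text ends with a separator.
by_nonword = re.split(r"\W+", raw)

# 3. Find word runs, or one non-space character plus following word characters.
tokens = re.findall(r"\w+|\S\w*", raw)

print(by_space)
print(by_nonword)
print(tokens)
```

The third approach keeps the punctuation but still breaks won't into won and 't, which is the remaining problem the slide mentions.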

Regular expression symbols

Summary

Symbol  Function
\b      Word boundary (zero width)
\d      Any decimal digit (equivalent to [0-9])
\D      Any non-digit character (equivalent to [^0-9])
\s      Any whitespace character (equivalent to [ \t\n\r\f\v])
\S      Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w      Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W      Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t      The tab character
\n      The newline character

Slide58

Tokenizer in Python

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Slide59

Spot Check

☼ Describe the class of strings matched by the following regular expressions.

[a-zA-Z]+
[A-Z][a-z]*
p[aeiou]{,2}t
\d+(\.\d+)?
([^aeiou][aeiou][^aeiou])*
\w+|[^\w\s]+

Test your answers using nltk.re_show().

Slide60

Exercises

For next week:

Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

For two weeks from now:

★ Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (nltk.corpus.abc). Use Punkt to perform sentence segmentation.