Introduction to Natural Language Processing
Source: Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit
We have had three weeks of Object-Oriented Programming in Python
Simple I/O, File I/O
, and their methods
Numeric types and operations
Control structures: if, for, while
Function definition and use
Parameters for defining the function, arguments for calling the function
Applying what we have
The first chapter of the NLTK book repeats much of what we have seen
Now in the context of an application domain: Natural Language Processing
Note: there are similar packages for other domains
Book examples in chapter 1 are all done with the interactive python shell
What can we achieve by combining simple programming techniques with large quantities of text?
How can we automatically extract key words and phrases that sum up the style and content of a text?
What tools and techniques does the Python programming language provide for such work?
What are some of the interesting challenges of natural language processing?
Quote from nltk book
Since text can cover any subject area, it is a general interest area to explore in some depth.
We will not have time to explore all of them, but this gives a full list for further exploration.
) is True for all values
) is True for any
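A small sketch of the two built-ins on a made-up token list (the list is an illustrative assumption, not taken from the slides): all() is True only when the test holds for every value, any() when it holds for at least one.

```python
# Hypothetical sample data, chosen only to illustrate the two built-ins.
words = ["Call", "me", "Ishmael", "."]

# all(...) is True only if the test holds for every item.
every_token_is_alpha = all(w.isalpha() for w in words)

# any(...) is True if the test holds for at least one item.
some_token_is_alpha = any(w.isalpha() for w in words)

print(every_token_is_alpha, some_token_is_alpha)
```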
Using the NLTK
opens a window showing this:
Do it now
Getting data from the downloaded files
Previously, we used
from math import pi
to get something specific from a module
Now, from the
, we will get the text files we will use
Import the data files
>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Do it now.
Then type sent1 at a Python prompt to see the first sentence of Moby Dick.
Repeat for sent2 .. sent9 to see the first sentence of each text.
Take note of the collection of texts. There is great variety, and different ones will be useful for different types of exploration.
What type of data is each first sentence?
Searching the texts
>>> text9.concordance("sunset")
Building index...
Displaying 14 of 14 matches:
E suburb of Saffron Park lay on the sunset side of London , as red and ragged
n , as red and ragged as a cloud of sunset . It was built of a bright brick th
bered in that place for its strange sunset . It looked like the end of the wor
ival ; it was upon the night of the sunset that his solitude suddenly ended .
he Embankment once under a dark red sunset . The red river reflected the red s
st seemed of fiercer flame than the sunset it mirrored . It looked like a stre
he passionate plumage of the cloudy sunset had been swept away , and a naked m
der the sea . The sealed and sullen sunset behind the dark dome of St . Paul '
ming with the colour and quality of sunset . The Colonel suggested that , befo
gold . Up this side street the last sunset light shone as sharp and narrow as
of gas , which in the full flush of sunset seemed coloured like a sunset cloud
sh of sunset seemed coloured like a sunset cloud . " After all ," he said , "
y and quietly , like a long , low , sunset cloud , a long , low house , mellow
house , mellow in the mild light of sunset . All the six friends compared note
A concordance shows a word in context
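To make the idea concrete, here is a minimal sketch of a concordance in plain Python. It is not NLTK's implementation: the function name, the token-based context width, and the sample sentence are all assumptions made for illustration.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical token list standing in for an NLTK text.
tokens = "the red sunset faded and the sunset cloud drifted away".split()
for line in concordance(tokens, "sunset"):
    print(line)
```

NLTK's own concordance() aligns the matches in fixed-width columns of characters; this sketch only shows the underlying idea of collecting the neighbours of every match.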
Same word in different texts
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
>>> text2.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and
Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company
usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei
>>>
Sense and Sensibility
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
>>>
>>> text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast
>>>
Note different sense of the word in the two texts.
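A rough sketch of the idea behind similar(): two words count as similar when they occur between the same neighbouring words. This is a simplification written for illustration, not NLTK's algorithm (which also ranks the results); the function name and sample sentence are assumptions.

```python
from collections import defaultdict

def similar(tokens, target):
    """Return words sharing at least one (previous, next) context with target."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word.lower()].add((prev.lower(), nxt.lower()))
    target_contexts = contexts[target.lower()]
    return sorted(w for w, ctx in contexts.items()
                  if w != target.lower() and ctx & target_contexts)

# Hypothetical sentence mixing the two senses seen in the slides.
tokens = ("I am monstrous glad and I am very glad and "
          "a monstrous size and a great size").split()
print(similar(tokens, "monstrous"))
```

Here "very" shares the context "am _ glad" and "great" shares "a _ size", which mirrors how the same word picks up different neighbours in different texts.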
Choose a word and generate a concordance for it in two or three texts.
Do you see any difference in meaning?
Look for similar terms in the texts.
Not sure what words are in what texts?
will return true or false
Look at the first sentence to get some words that are in the text.
Guess. ex: “money” appears in all but text6 and text8
Looking at vocabulary
>>> len(set(text3))
2789
>>> len(set(text2))
6833
>>>
Total number of tokens, includes non words and repeated words
What do these numbers mean?
What does this tell us?
On average, a word is used > 20 times
A rough measure of lexical richness
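The measure can be sketched as the number of tokens divided by the number of distinct tokens. The function name and the toy sentence below are assumptions for illustration; with an NLTK text, the same expression would be applied to the text object itself.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

# Hypothetical token list: 8 tokens, 5 distinct.
tokens = "the cat and the dog and the bird".split()
print(lexical_diversity(tokens))
```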
>>> from __future__ import division
>>> 100*text2.count("money")/len(text2)
0.018364694581002431
>>>
Note two ways to get floating point results when dividing integers
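For comparison, a minimal sketch assuming Python 3, where the / operator always performs true division and // performs floor division, so no extra import or conversion is needed. The helper name is hypothetical.

```python
def percentage(count, total):
    """Share of `count` in `total`, as a percentage (true division)."""
    return 100 * count / total

true_result = percentage(1, 8)    # keeps the fractional part
floor_result = 100 * 1 // 8       # floor division discards it
print(true_result, floor_result)
```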
These are the 50 most common tokens in the text of Moby Dick. Many of these are not useful in characterizing the text. We call them “stop words” and will see how to eliminate them from consideration later.
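A small sketch of counting the most common tokens and then ignoring stop words. It uses collections.Counter as a stand-in for NLTK's FreqDist, and the stop-word set is a tiny made-up list, not NLTK's.

```python
from collections import Counter

# Hypothetical token list and stop-word list, for illustration only.
tokens = "the whale and the sea and the ship of the whale".split()
STOP_WORDS = {"the", "and", "of"}

counts = Counter(tokens)
top_all = counts.most_common(2)

# Count again, skipping the stop words.
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)
top_content = content_counts.most_common(1)

print(top_all)
print(top_content)
```

With stop words included, the top of the list is dominated by "the"; once they are filtered out, a content word rises to the top.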
More precise specification
Consider the mathematical expression {w | w ∈ V and P(w)}
Python implementation is [w for w in V if p(w)]
>>> AustenVoc=set(text2)
>>> long_words_2=[w for w in AustenVoc if len(w) >15]
>>> long_words_2
['incomprehensible', 'disqualifications', 'disinterestedness', 'companionableness']
>>>
List comprehension – we saw it first last week
Add to the condition
>>> fdist2=FreqDist(text2)
>>> long_words_2=sorted([w for w in AustenVoc if len(w) >12 and fdist2[w]>5])
>>> long_words_2
['Somersetshire', 'accommodation', 'circumstances', 'communication', 'consciousness', 'consideration', 'disappointment', 'distinguished', 'embarrassment', 'encouragement', 'establishment', 'extraordinary', 'inconvenience', 'indisposition', 'neighbourhood', 'unaccountable', 'uncomfortable', 'understanding', 'unfortunately']
) can be as complex as we need
Find all the words longer than 12 characters, which occur at least 5 times, in each of the texts.
How well do they give you a sense of the texts?
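One possible shape for the exercise, sketched with collections.Counter instead of FreqDist so it runs without NLTK. The function name, parameters, and sample tokens are assumptions; "at least" is implemented as >=.

```python
from collections import Counter

def long_frequent_words(tokens, min_length, min_count):
    """Sorted words longer than min_length that occur at least min_count times."""
    counts = Counter(tokens)
    return sorted(w for w in set(tokens)
                  if len(w) > min_length and counts[w] >= min_count)

# Hypothetical token list standing in for one of the texts.
tokens = ("circumstances alter cases and circumstances change "
          "understanding grows with understanding and time").split()
print(long_frequent_words(tokens, min_length=12, min_count=2))
```

With the NLTK texts loaded, the same function could be called in a loop over text1 to text9 with min_length=12 and min_count=5.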
Collocations and Bigrams
Sometimes a word by itself is not representative of its role in a text. It is only with a companion word that we get the intended sense.
sign of hope
Bigrams are two word combinations
not all bigrams are useful, of course
len(bigrams(text2)) == 141575
including “and among”, “they could” , …
Collocations are bigrams that occur more often than the frequencies of their individual words would predict. They tend to include uncommon words, which might be significant in the text.
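A sketch of building bigrams and counting them in plain Python. The helper mirrors what a bigram function does (pair each token with its successor); ranking by raw count is a simplification and not the scoring NLTK uses for collocations. The sample tokens are made up.

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical token list for illustration.
tokens = "red wine and red wine or white wine".split()
pairs = bigrams(tokens)
pair_counts = Counter(pairs)

print(len(pairs))                  # always one fewer than the number of tokens
print(pair_counts.most_common(1))  # the most frequent pair
```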
There are 28,839 3-letter words and 334 13-letter words in Sense and Sensibility
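Counts like these come from a frequency distribution over word lengths rather than over the words themselves. A minimal sketch with collections.Counter and a made-up sentence (both assumptions; the slides use FreqDist on the full text):

```python
from collections import Counter

# Hypothetical token list for illustration.
tokens = "Call me Ishmael . Some years ago".split()

# Count how many tokens there are of each length.
length_counts = Counter(len(t) for t in tokens)
print(sorted(length_counts.items()))
```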
Table 1.2 – FreqDist functions
create a frequency distribution containing the given samples
increment the count for this sample
count of the number of times a given sample occurred
frequency of a given sample
total number of samples
The samples sorted in order of decreasing frequency
for sample in
iterate over the samples, in order of decreasing frequency
with the greatest count
tabulate the frequency distribution
graphical plot of the frequency
cumulative plot of the frequency distribution
test if samples in fdist1 occur less frequently than in fdist2
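Several of the operations described in the table can be sketched with the standard library's collections.Counter. This is a stand-in chosen so the example runs without NLTK, not FreqDist itself, and the sample data is made up.

```python
from collections import Counter

# Hypothetical samples for illustration.
samples = "a rose is a rose is a rose".split()

fdist = Counter(samples)              # distribution containing the given samples
fdist["rose"] += 1                    # increment the count for one sample
count = fdist["rose"]                 # number of times a given sample occurred
total = sum(fdist.values())           # total number of samples
freq = count / total                  # relative frequency of a given sample
most_common_sample = fdist.most_common(1)[0][0]   # sample with the greatest count

for sample, n in fdist.most_common():  # iterate in order of decreasing frequency
    print(sample, n)
```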
Function         Meaning
s.startswith(t)  test if s starts with t
s.endswith(t)    test if s ends with t
t in s           test if t is contained inside s
s.islower()      test if all cased characters in s are lowercase
s.isupper()      test if all cased characters in s are uppercase
s.isalpha()      test if all characters in s are alphabetic
s.isalnum()      test if all characters in s are alphanumeric
s.isdigit()      test if all characters in s are digits
s.istitle()      test if s is titlecased (all words in s have initial capitals)
We have seen conditionals and loop statements. These are some special functions for work on text
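A short sketch of these string tests used as conditions in list comprehensions. The token list is made up for illustration; the methods themselves are standard Python string methods.

```python
# Hypothetical tokens covering several cases.
tokens = ["Call", "me", "Ishmael", ".", "1851", "NLTK", "stock-index"]

titlecased = [t for t in tokens if t.istitle()]       # initial capital only
uppercase = [t for t in tokens if t.isupper()]        # all cased characters uppercase
digits = [t for t in tokens if t.isdigit()]           # digits only
hyphenated = [t for t in tokens if "-" in t]          # substring test
ends_with_ex = [t for t in tokens if t.endswith("ex")]

print(titlecased, uppercase, digits, hyphenated, ends_with_ex)
```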
>>> sorted([w for w in set(text7) if '-' in w and 'index' in w])
>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
>>> sorted([w for w in set(sent7) if not w.islower()])
>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t])
From the NLTK book: Run the following examples and explain what is happening. Then make up some tests of your own.
Ending the double count of words
The count of words from the various texts was flawed. How?
We had
What’s the problem? How do we fix it?
>>> len(text1)
260819
>>> len(set(text1))
19317
>>> len(set([word.lower() for word in text1]))
17231
>>>
() for word in text1 if
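The effect of each normalization step can be sketched on a tiny made-up token list: lowercasing merges "The" with "the", and an isalpha() filter drops punctuation. The variable names are assumptions.

```python
# Hypothetical tokens containing case variants and punctuation.
tokens = ["The", "whale", ",", "the", "Whale", "!", "This", "this"]

raw_vocab = set(tokens)                                     # counts case variants separately
lower_vocab = set(t.lower() for t in tokens)                # merges case variants
word_vocab = set(t.lower() for t in tokens if t.isalpha())  # also drops punctuation

print(len(raw_vocab), len(lower_vocab), len(word_vocab))
```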
Nested loops and loops with conditions
Follow what happens.
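>>> for token in sent1:
...     if token.islower():
...         print token, 'is a lowercase word'
...     elif token.istitle():
...         print token, 'is a titlecase word'
...     else:
...         print token, 'is punctuation'
...
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation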
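The same decision logic can be sketched as a Python 3 function, which makes the three branches easy to test on their own. The function name and return strings are assumptions for illustration.

```python
def describe(token):
    """Classify a token using the string tests from the loop."""
    if token.islower():
        return "lowercase word"
    elif token.istitle():
        return "titlecase word"
    else:
        # Catch-all branch: punctuation, but also digits or all-caps tokens.
        return "punctuation"

sent = ["Call", "me", "Ishmael", "."]
for token in sent:
    print(token, "is", describe(token))
```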
>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])
>>> for word in tricky:
...     print word,
ancient ceiling conceit conceited conceive conscience
a. The lost children were found by the searchers (agentive)
b. The lost children were found by the mountain (locative)
c. The lost children were found by the afternoon (temporal)
a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.
>>> text4.generate()
Building ngram index...
Fellow - Citizens : Under Providence I have given freedom new reach ,
and maintain lasting peace -- based on righteousness and justice .
There was this reason only why the cotton - producing States should be
promoted by just and abundant society , on just principles . These
later years have elapsed , and civil war . More than this , we affirm
a new beginning is a destiny . May Congress prohibit slavery in the
workshop , in translating humanity ' s strongest , but we have adopted
, and fear of God . And , in each
>>>
An inaugural address??
-- MIT hoax – conference submission
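Text like this can be produced by repeatedly picking a word that has followed the current word somewhere in the source. The sketch below uses a simple bigram table in plain Python; it is an illustration of the general idea, not how NLTK implements generate(), and all names and the sample tokens are assumptions.

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each token to the list of tokens that follow it in the source."""
    model = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        model[current].append(following)
    return model

def generate(model, start, length, seed=0):
    """Random walk over the bigram table, at most `length` tokens long."""
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:       # dead end: the word was only seen last
            break
        word = rng.choice(followers)
        output.append(word)
    return output

# Hypothetical source text for illustration.
tokens = "we hold these truths and we hold this hope and we keep this faith".split()
model = build_bigram_model(tokens)
generated = generate(model, "we", 8)
print(" ".join(generated))
```

Every adjacent pair in the output occurred in the source, which is why the result sounds locally plausible while making no sense overall.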
Babel> How long before the next flight to Alice Springs?
0> How long before the next flight to Alice Springs?
2> How long before the following flight to Alice jump?
4> How long before the following flight to Alice do you jump?
6> How long, before the following flight to Alice does, do you jump?
8> How long before the following flight to Alice does, do you jump?
10> How long, before the following flight does to Alice, do do you jump?
12> How long before the following flight does leap to Alice, does you?