Natural Language Processing
Lecture 5—1/27/2015
Susan W. Brown
Today
Big picture
What do you need to know?
What are finite state methods good for?
Review morphology
Review finite state methods
How this fits with morphology
Epsilon transitions
Begin N-grams
Words
Finite-state methods are particularly useful in dealing with large lexicons
That is, big bunches of words
Often infinite sized bunches
Many devices, some with limited memory resources, need access to large lists of words
And they need to perform fairly sophisticated tasks with those lists
Word recognition/generation
Recognize surface forms
Spell checking
Speech recognition
Transform surface forms to more abstract repr
Parsing (morphological, syntactic) input to IR, MT or reasoning system
Generate surface forms from abstract repr
Summarization
Question answering
FSAs and FSTs: formal definitions
States
Alphabets
Transitions
Closure
Under union, inversion, composition
Under intersection, complementation?
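To make the formal definition concrete, here is a minimal sketch of a deterministic FSA recognizer in Python. The state numbers and the toy "sheep-talk" machine (accepting baa!, baaa!, ...) are illustrative, not taken from these slides.

```python
# A minimal DFA sketch: states, an alphabet, a transition table,
# a start state, and a set of accepting states (all names illustrative).

def accepts(transitions, start, accepting, symbols):
    """Return True if the symbol sequence drives the machine to an accepting state."""
    state = start
    for sym in symbols:
        if (state, sym) not in transitions:  # no transition -> reject
            return False
        state = transitions[(state, sym)]
    return state in accepting

# Toy machine for the sheep-talk language baa+!
transitions = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

print(accepts(transitions, 0, {4}, "baaa!"))  # True
print(accepts(transitions, 0, {4}, "ba!"))    # False
```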
English Morphology
Morphology is the study of the ways that words are built up from smaller units called morphemes
The minimal meaning-bearing units in a language
We can usefully divide morphemes into two classes
Stems: The core meaning-bearing units
Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions
English Morphology
We can further divide morphology into two broad classes
Inflectional
Derivational
Word Classes
By word class, we have in mind familiar notions like noun and verb
Also referred to as parts of speech and lexical categories
We’ll go into the gory details in Chapter 5
Right now we’re concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem
Inflectional Morphology
Inflectional morphology concerns the combination of stems and affixes where the resulting word....
Has the same word class as the original
And serves a grammatical/semantic purpose that is different from the original
But is nevertheless transparently related to the original
“walk” + “s” = “walks”
Inflection in English
Nouns are simple
Markers for plural and possessive
Verbs are only slightly more complex
Markers appropriate to the tense of the verb
That’s pretty much it
Other languages can be quite a bit more complex
An implication of this is that hacks (approaches) that work in English will not work for many other languages
Regulars and Irregulars
Things are complicated by the fact that some words misbehave (refuse to follow the rules)
Mouse/mice, goose/geese, ox/oxen
Go/went, fly/flew, catch/caught
The terms regular and irregular are used to refer to words that follow the rules and those that don’t
Regular and Irregular Verbs
Regulars…
Walk, walks, walking, walked, walked
Irregulars
Eat, eats, eating, ate, eaten
Catch, catches, catching, caught, caught
Cut, cuts, cutting, cut, cut
Inflectional Morphology
So inflectional morphology in English is fairly straightforward
But is somewhat complicated by the fact that there are irregularities
Derivational Morphology
Derivational morphology is the messy stuff that no one ever taught you
In English it is characterized by
Quasi-systematicity
Irregular meaning change
Changes of word class
Derivational Examples
Verbs and Adjectives to Nouns
-ation: computerize -> computerization
-ee: appoint -> appointee
-er: kill -> killer
-ness: fuzzy -> fuzziness
Derivational Examples
Nouns and Verbs to Adjectives
-al: computation -> computational
-able: embrace -> embraceable
-less: clue -> clueless
Example: Compute
Many paths are possible…
Start with compute
Computer -> computerize -> computerization
Computer -> computerize -> computerizable
But not all paths/operations are equally good (allowable?)
Clue
Clue -> clueless
Clue -> ?clueful
Clue -> *clueable
Morphology and FSAs
We would like to use the machinery provided by FSAs to capture these facts about morphology
Accept strings that are in the language
Reject strings that are not
And do so in a way that doesn’t require us to in effect list all the forms of all the words in the language
Even in English this is inefficient
And in other languages it is impossible
Start Simple
Regular singular nouns are ok as is
They are in the language
Regular plural nouns have an -s on the end
So they’re also in the language
Irregulars are ok as is
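As a rough sketch of how these simple rules might be encoded as a recognizer over word classes (reg-noun plus plural -s, irregular singulars and plurals accepted as is), the fragment below uses tiny stand-in word lists; it is illustrative, not the textbook machine, and it deliberately ignores spelling changes like fox -> foxes, which come later.

```python
# Sketch of a noun-inflection recognizer with tiny stand-in word lists.

REG_NOUN = {"cat", "fox", "dog"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

def accept_noun(word):
    # Irregular singulars and plurals are accepted as is.
    if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
        return True
    # Regular nouns are accepted bare or with a plural -s
    # (spelling changes like fox -> foxes are ignored here).
    if word in REG_NOUN:
        return True
    return word.endswith("s") and word[:-1] in REG_NOUN

print(accept_noun("cats"))    # True
print(accept_noun("geese"))   # True
print(accept_noun("gooses"))  # False
```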
Simple Rules
Now Plug in the Words Spelled Out
Replace the class names like “reg-noun” with FSAs that recognize all the words in that class.
An epsilon digression
Can always create an equivalent machine with no epsilon transitions
Intuitive and convenient notation
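The claim that epsilon transitions are only a convenience can be made concrete: the core step in eliminating them is computing the epsilon-closure of a set of states. A minimal sketch (the state numbers and the eps_moves table are made up for illustration):

```python
# Epsilon-closure sketch: from a set of states, follow epsilon arcs
# until no new states can be reached.

def epsilon_closure(states, eps_moves):
    """eps_moves maps a state to the states reachable on an epsilon transition."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps_moves.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

eps_moves = {0: [1], 1: [2]}
print(epsilon_closure({0}, eps_moves))  # {0, 1, 2}
```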
Another epsilon example: Union (Or)
Derivational Rules
If everything is an accept state, how do things ever get rejected?
Lexicons
So the big picture is to store a lexicon (a list of words you care about) as an FSA. The base lexicon is embedded in larger automata that capture the inflectional and derivational morphology of the language.
So what? Well, the simplest thing you can do with such an FSA is spell checking
If the machine rejects, the word isn’t in the language
Without listing every form of every word
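A rough illustration of the spell-checking idea: a word is flagged exactly when the recognizer rejects it. In this sketch the lexicon automaton is abbreviated to a set of accepted surface forms, which is only a stand-in for a real morphological FSA.

```python
# Spell checking by recognition: flag every token the recognizer rejects.
# ACCEPTED stands in for a full morphological FSA (illustrative only).

ACCEPTED = {"cat", "cats", "fox", "foxes", "goose", "geese"}

def spell_check(tokens, recognize=lambda w: w in ACCEPTED):
    """Return the tokens the recognizer rejects (possible misspellings)."""
    return [tok for tok in tokens if not recognize(tok)]

print(spell_check(["cats", "gooses", "geese"]))  # ['gooses']
```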
Parsing/Generation vs. Recognition
We can now run strings through these machines to recognize strings in the language
But recognition is usually not quite what we need
Often if we find some string in the language we might like to assign a structure to it (parsing)
Or we might start with some structure and want to produce a surface form for it (production/generation)
For that we’ll move to finite state transducers
Add a second tape that can be written to
Finite State Transducers
The simple story
Add another tape
Add extra symbols to the transitions
On one tape we read “cats”, on the other we write “cat +N +PL”
+N and +PL are elements in the alphabet for one tape that represent underlying linguistic features
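A very rough sketch of the two-tape idea: pair a surface form with its lexical-tape analysis. A real FST walks both tapes symbol by symbol; this sketch abbreviates that to whole-word suffix rules, and the SUFFIX_RULES and STEMS tables are toy stand-ins, not the textbook machine.

```python
# Toy "transducer" sketch: map surface forms to lexical-tape strings.

SUFFIX_RULES = [
    ("s", "+N +PL"),   # cats -> cat +N +PL
    ("",  "+N +SG"),   # cat  -> cat +N +SG
]
STEMS = {"cat", "dog", "fox"}

def analyses(surface):
    """Yield lexical-tape strings for a surface form (may be ambiguous)."""
    out = []
    for suffix, feats in SUFFIX_RULES:
        stem = surface[: len(surface) - len(suffix)] if suffix else surface
        if surface.endswith(suffix) and stem in STEMS:
            out.append(f"{stem} {feats}")
    return out

print(analyses("cats"))  # ['cat +N +PL']
print(analyses("cat"))   # ['cat +N +SG']
```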
FSTs
The Gory Details
Of course, it’s not as easy as “cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
Cats vs. Dogs (‘s’ sound vs. ‘z’ sound)
Fox and Foxes (that ‘e’ got inserted)
And doubling consonants (swim, swimming), adding k’s (picnic, picnicked), deleting e’s, ...
Multi-Tape Machines
To deal with these complications, we will add even more tapes and use the output of one tape machine as the input to the next
So, to handle irregular spelling changes we will add intermediate tapes with intermediate symbols
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level (M1), and another (M2) to handle the spelling changes to the surface tape
M1 knows about the particulars of the lexicon
M2 knows about weird English spelling rules
Lexical to Intermediate Level
Intermediate to Surface
The add-an-“e” English spelling rule, as in fox^s# <-> foxes#
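A rough regex sketch of this e-insertion rule at the intermediate-to-surface step, assuming (as on the slide) that ^ marks a morpheme boundary and # the end of the word; the restriction to x, z, s is a simplification for illustration.

```python
import re

def intermediate_to_surface(form):
    # e-insertion: fox^s# -> foxes#  (insert e after x, z, or s, before the plural s)
    form = re.sub(r"([xzs])\^s#", r"\1es#", form)
    # drop the remaining morpheme-boundary and end-of-word markers
    return form.replace("^", "").rstrip("#")

print(intermediate_to_surface("fox^s#"))   # foxes
print(intermediate_to_surface("bird^s#"))  # birds
print(intermediate_to_surface("cat#"))     # cat
```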
Foxes (worked example, stepped through in figures over three slides)
Note
A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn’t apply. So...
fox^s# -> foxes, but bird^s# -> birds, and cat# -> cat
Cascading FSTs
E-insertion rule
Possessive rule: add ‘s or s’
We want to send all our words through all the FSTs
cat +N +SG -> cat
cat +N +PL -> cat^s
cat +N +SG +Poss -> cat^’s
cat +N +PL +Poss -> cat^s^’s
FSTs closed under composition
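Composition means the output tape of one machine is the input tape of the next. A sketch of that cascade with ordinary functions standing in for the two transducers M1 and M2 (the possessive rule is omitted for brevity, and the rules themselves are toy stand-ins):

```python
import re

def lexical_to_intermediate(lexical):
    # M1 stand-in: "fox +N +PL" -> "fox^s#", "cat +N +SG" -> "cat#"
    stem, *feats = lexical.split()
    suffix = "^s" if "+PL" in feats else ""
    return f"{stem}{suffix}#"

def intermediate_to_surface(inter):
    # M2 stand-in: the e-insertion spelling rule, then drop the markers.
    inter = re.sub(r"([xzs])\^s#", r"\1es#", inter)
    return inter.replace("^", "").rstrip("#")

def cascade(lexical):
    # Composition: feed M1's output tape into M2.
    return intermediate_to_surface(lexical_to_intermediate(lexical))

print(cascade("fox +N +PL"))  # foxes
print(cascade("cat +N +SG"))  # cat
```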
FST determinism
FSTs are not necessarily deterministic
The search algorithms for NFTs are inefficient
Sequential FSTs: deterministic, more efficient, no ambiguity
P-subsequential: allows some ambiguity
HW 2 questions?
(Homework 1 feedback on Thursday)
New Topic
Statistical language modeling
Chapter 4
Word Prediction
Guess the next word...
So I notice three guys standing on the ???
What are some of the knowledge sources you used to come up with those predictions?
Word Prediction
We can formalize this task using what are called N-gram models
N-grams are token sequences of length N
“-gram” means “written”
Our earlier example contains the following 2-grams (aka bigrams)
(So I), (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)
Given knowledge of counts of N-grams such as these, we can guess likely next words in a sequence.
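A quick sketch of pulling the 2-grams out of a token sequence, matching the list above:

```python
# Extract the n-grams from a token sequence (n=2 gives the bigrams above).

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "So I notice three guys standing on the".split()
print(ngrams(tokens))
# [('So', 'I'), ('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
#  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]
```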
N-Gram Models
More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.
Or, we can use them to assess the probability of an entire sequence of words.
Pretty much the same thing, as we’ll see...
Applications
It turns out that being able to predict the next word (or any linguistic unit) in a sequence is an extremely useful thing to be able to do.
As we’ll see, it lies at the core of the following applications
Automatic speech recognition
Handwriting and character recognition
Spelling correction
Machine translation
And many more
Counting
Simple counting lies at the core of any probabilistic approach. So let’s first take a look at what we’re counting.
He stepped out into the hall, was delighted to encounter a water brother.
13 tokens, 15 if we include “,” and “.” as separate tokens.
Assuming we include the comma and period as tokens, how many bigrams are there?
Counting
Not always that simple
I do uh main- mainly business data processing
Spoken language poses various challenges.
Should we count “uh” and other fillers as tokens?
What about the repetition of “mainly”? Should such do-overs count twice or just once?
The answers depend on the application.
If we’re focusing on something like ASR to support indexing for search, then “uh” isn’t helpful (it’s not likely to occur as a query).
But filled pauses are very useful in dialog management, so we might want them there
Tokenization of text raises the same kinds of issues
Counting: Types and Tokens
How about
They picnicked by the pool, then lay back on the grass and looked at the stars.
18 tokens (again counting punctuation)
But we might also note that “the” is used 3 times, so there are only 16 unique types (as opposed to tokens).
Going forward, we’ll have occasion to focus on counting both types and tokens of both words and N-grams.
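A sketch of the type/token distinction on the picnic sentence, with punctuation split off and counted as tokens as on the slide (the regex tokenizer here is just one simple choice):

```python
import re

# Types vs. tokens on the picnic example.
text = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"\w+|[^\w\s]", text)   # words, plus punctuation as separate tokens

print(len(tokens))       # 18 tokens
print(len(set(tokens)))  # 16 types ("the" occurs 3 times)
```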
Counting: Corpora
What happens when we look at large bodies of text instead of single utterances?
Google Web Crawl
Crawl of 1,024,908,267,229 English tokens in Web text
13,588,391 wordform types
That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?
Numbers
Misspellings
Names
Acronyms
etc.
Language Modeling
Now that we know how to count, back to word prediction
We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence
P(w_n | w_1, w_2, …, w_{n-1})
We’ll call a statistical model that can assess this a Language Model
Language Modeling
How might we go about calculating such a conditional probability?
One way is to use the definition of conditional probabilities and look for counts. So to get
P(the | its water is so transparent that)
By definition that’s
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large corpus.
Very Easy Estimate
How to estimate?
P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
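A sketch of that counting approach: the conditional probability is just a ratio of phrase counts over a corpus. The corpus string and the phrase_count helper below are toy stand-ins for illustration, not Google's index.

```python
# Estimate P(the | its water is so transparent that) as a ratio of phrase counts.

corpus = ("its water is so transparent that the fish seem to float . "
          "its water is so transparent that you can see the bottom .")

def phrase_count(text, phrase):
    """Count non-overlapping occurrences of a phrase in the text."""
    return text.count(phrase)

history = "its water is so transparent that"
p = phrase_count(corpus, history + " the") / phrase_count(corpus, history)
print(p)  # 0.5 for this toy corpus
```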
Very Easy Estimate
According to Google those counts are 12,000 and 19,000, so the conditional probability of interest is...
P(the | its water is so transparent that) = 0.63
Language Modeling
Unfortunately, for most sequences and for most text collections we won’t get good estimates from this method.
What we’re likely to get is 0. Or worse 0/0.
Clearly, we’ll have to be a little more clever.
Let’s first use the chain rule of probability
And then apply a particularly useful independence assumption
The Chain Rule
Recall the definition of conditional probabilities: P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)
For sequences...
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
In general
P(x_1, x_2, x_3, …, x_n) = P(x_1) P(x_2|x_1) P(x_3|x_1,x_2) … P(x_n|x_1,…,x_{n-1})
The Chain Rule
P(its water was so transparent)=
P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)
Unfortunately
There are still a lot of possible sequences in there
In general, we’ll never be able to get enough data to compute the statistics for those longer prefixes
Same problem we had for the strings themselves
Independence Assumption
Make the simplifying assumption
P(lizard|the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|a)
Or maybe
P(lizard|the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|saw,a)
That is, the probability in question is to some degree independent of its earlier history.
Independence Assumption
This particular kind of independence assumption is called a Markov assumption after the Russian mathematician Andrei Markov.
Markov Assumption
So for each component in the product, replace it with the approximation (assuming a prefix of N-1):
P(w_n | w_1, …, w_{n-1}) ≈ P(w_n | w_{n-N+1}, …, w_{n-1})
Bigram version:
P(w_n | w_1, …, w_{n-1}) ≈ P(w_n | w_{n-1})
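Under the bigram assumption the chain-rule product collapses to adjacent-pair terms. A small sketch of scoring a sentence given a table of bigram probabilities; the probability values in the table are made up for illustration.

```python
# Score a sentence under the bigram approximation:
# P(w_1 ... w_n) ~= product over i of P(w_i | w_{i-1}), with <s> as the start symbol.
# The probability table is invented for illustration.

bigram_prob = {
    ("<s>", "its"): 0.2, ("its", "water"): 0.5,
    ("water", "was"): 0.3, ("was", "so"): 0.4,
    ("so", "transparent"): 0.1,
}

def sentence_prob(tokens, probs):
    p = 1.0
    for prev, cur in zip(["<s>"] + tokens, tokens):
        p *= probs.get((prev, cur), 0.0)  # unseen bigram -> probability 0
    return p

print(sentence_prob("its water was so transparent".split(), bigram_prob))
# 0.2 * 0.5 * 0.3 * 0.4 * 0.1 ~= 0.0012
```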
Estimating Bigram Probabilities
The Maximum Likelihood Estimate (MLE):
P(w_n | w_{n-1}) = Count(w_{n-1} w_n) / Count(w_{n-1})
An Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>