
Natural Language Processing

Lecture 5—1/27/2015

Susan W. Brown

(Based on Speech and Language Processing, Jurafsky and Martin)

Today

Big picture

What do you need to know?

What are finite state methods good for?

Review morphology

Review finite state methods

How this fits with morphology

Epsilon transitions

Begin N-grams


Words

Finite-state methods are particularly useful in dealing with large lexicons

That is, big bunches of words

Often infinite-sized bunches

Many devices, some with limited memory resources, need access to large lists of words

And they need to perform fairly sophisticated tasks with those lists

Word recognition/generation

Recognize surface forms

Spell checking

Speech recognition

Transform surface forms to more abstract representations

Parsing (morphological, syntactic) → input to IR, MT, or reasoning systems

Generate surface forms from abstract representations

Summarization

Question answering

FSAs and FSTs: formal definitions

States

Alphabets

Transitions

Closure

Under union, inversion, composition

Under intersection, complementation?
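To make the formal definition concrete, here is a minimal sketch of a deterministic FSA in Python. The class, the toy alphabet, and the example machine (which accepts one or more a's followed by a b) are illustrative assumptions, not part of the lecture.

```python
# Minimal deterministic FSA sketch: states, alphabet, transitions,
# a start state, and a set of accepting states. (Illustrative only.)

class FSA:
    def __init__(self, states, alphabet, transitions, start, accept_states):
        self.states = states                # set of state names
        self.alphabet = alphabet            # set of allowed symbols
        self.transitions = transitions      # dict: (state, symbol) -> next state
        self.start = start
        self.accept_states = accept_states

    def accepts(self, string):
        """Return True if the machine ends in an accepting state."""
        state = self.start
        for symbol in string:
            if (state, symbol) not in self.transitions:
                return False                # no legal move: reject
            state = self.transitions[(state, symbol)]
        return state in self.accept_states


# Toy machine: accepts one or more a's followed by a single b.
machine = FSA(
    states={"q0", "q1", "q2"},
    alphabet={"a", "b"},
    transitions={("q0", "a"): "q1", ("q1", "a"): "q1", ("q1", "b"): "q2"},
    start="q0",
    accept_states={"q2"},
)

print(machine.accepts("aaab"))  # True
print(machine.accepts("ab"))    # True
print(machine.accepts("ba"))    # False
```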


English Morphology

Morphology is the study of the ways that words are built up from smaller units called morphemes

The minimal meaning-bearing units in a language

We can usefully divide morphemes into two classes

Stems: the core meaning-bearing units

Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions


English Morphology

We can further divide morphology up into two broad classes

Inflectional

Derivational


Word Classes

By word class, we have in mind familiar notions like noun and verb

Also referred to as parts of speech and lexical categories

We’ll go into the gory details in Chapter 5

Right now we’re concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem


Inflectional Morphology

Inflectional morphology concerns the combination of stems and affixes where the resulting word...

Has the same word class as the original

And serves a grammatical/semantic purpose that is different from the original

But is nevertheless transparently related to the original

“walk” + “s” = “walks”


Inflection in English

Nouns are simple

Markers for plural and possessive

Verbs are only slightly more complex

Markers appropriate to the tense of the verb

That’s pretty much it

Other languages can be quite a bit more complex

An implication of this is that hacks (approaches) that work in English will not work for many other languages


Regulars and Irregulars

Things are complicated by the fact that some words misbehave (refuse to follow the rules)

Mouse/mice, goose/geese, ox/oxen

Go/went, fly/flew, catch/caught

The terms regular and irregular are used to refer to words that follow the rules and those that don’t


Regular and Irregular Verbs

Regulars…

Walk, walks, walking, walked, walked

Irregulars

Eat, eats, eating, ate, eaten

Catch, catches, catching, caught, caught

Cut, cuts, cutting, cut, cut


Inflectional Morphology

So inflectional morphology in English is fairly straightforward

But is somewhat complicated by the fact that there are irregularities


Derivational Morphology

Derivational morphology is the messy stuff that no one ever taught you

In English it is characterized by

Quasi-systematicity

Irregular meaning change

Changes of word class


Derivational Examples

Verbs and Adjectives to Nouns

-ation: computerize → computerization

-ee: appoint → appointee

-er: kill → killer

-ness: fuzzy → fuzziness


Derivational Examples

Nouns and Verbs to Adjectives

-al: computation → computational

-able: embrace → embraceable

-less: clue → clueless


Example: Compute

Many paths are possible…

Start with compute

Computer -> computerize -> computerization

Computer -> computerize -> computerizable

But not all paths/operations are equally good (allowable?)

Clue

Clue → clueless

Clue → ?clueful

Clue → *clueable


Morphology and FSAs

We would like to use the machinery provided by FSAs to capture these facts about morphology

Accept strings that are in the language

Reject strings that are not

And do so in a way that doesn’t require us to in effect list all the forms of all the words in the language

Even in English this is inefficient

And in other languages it is impossible


Start Simple

Regular singular nouns are ok as is

They are in the language

Regular plural nouns have an -s on the end

So they’re also in the language

Irregulars are ok as is


Simple Rules


Now Plug in the Words Spelled Out

Replace the class names like reg-noun with FSAs that recognize all the words in that class.
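As a rough sketch of that plug-in idea, the word classes below stand in for the sub-FSAs: a regular noun may take an optional -s, and irregular singular and plural forms are accepted as is. The word lists and function name are assumptions for illustration only.

```python
# Simplified noun-inflection recognizer in the spirit of the FSA above.
# The word sets play the role of the sub-machines plugged into the FSA.
# (Toy word lists; a real lexicon would be much larger.)

reg_nouns = {"cat", "fox", "dog", "bird"}
irreg_sg_nouns = {"goose", "mouse", "ox"}
irreg_pl_nouns = {"geese", "mice", "oxen"}

def accepts_noun(word):
    """Accept regular nouns with an optional plural -s, and irregulars as is."""
    if word in irreg_sg_nouns or word in irreg_pl_nouns:
        return True
    if word in reg_nouns:
        return True                       # regular singular
    if word.endswith("s") and word[:-1] in reg_nouns:
        return True                       # regular plural: stem + s
    return False

for w in ["cat", "cats", "geese", "gooses", "foxs"]:
    print(w, accepts_noun(w))
# cat True, cats True, geese True, gooses False, foxs True (!)
# "foxs" is wrongly accepted here; the spelling-change rules come later.
```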

An epsilon digression

Can always create an equivalent machine with no epsilon transitions

Intuitive and convenient notation
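A quick sketch of why epsilon transitions are only a notational convenience: the epsilon-closure of a state (everything reachable by epsilon moves alone) can be computed and folded into ordinary transitions. The dictionary encoding below is an assumption for illustration.

```python
# Epsilon-closure sketch: the set of states reachable from `state`
# using only epsilon moves. Folding these closures into the ordinary
# transitions is the standard way to eliminate epsilon transitions.

def epsilon_closure(state, eps_moves):
    """eps_moves maps a state to the set of states reachable by one
    epsilon transition."""
    closure = {state}
    frontier = [state]
    while frontier:
        s = frontier.pop()
        for nxt in eps_moves.get(s, set()):
            if nxt not in closure:
                closure.add(nxt)
                frontier.append(nxt)
    return closure

# Toy example: q0 --eps--> q1 --eps--> q2
eps_moves = {"q0": {"q1"}, "q1": {"q2"}}
print(epsilon_closure("q0", eps_moves))   # {'q0', 'q1', 'q2'}
```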


Another epsilon example: Union (Or)


Derivational Rules

If everything is an accept state, how do things ever get rejected?


Lexicons

So the big picture is to store a lexicon (a list of words you care about) as an FSA. The base lexicon is embedded in larger automata that capture the inflectional and derivational morphology of the language.

So what? Well, the simplest thing you can do with such an FSA is spell checking

If the machine rejects, the word isn’t in the language

Without listing every form of every word
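A minimal sketch of spell checking as recognition; a plain Python set stands in here for the lexicon FSA (an illustrative shortcut, not the machinery described above).

```python
# Spell checking as recognition: a word is flagged only if the
# recognizer rejects it. A set stands in for the lexicon FSA here.

LEXICON = {"cat", "cats", "fox", "foxes", "mouse", "mice"}

def spell_check(tokens, recognizer):
    """Return the tokens the recognizer does not accept."""
    return [t for t in tokens if not recognizer(t)]

print(spell_check(["cats", "mice", "gooses"], lambda w: w in LEXICON))
# ['gooses']  -- not recognized, so flagged as a possible misspelling
```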


Parsing/Generation vs. Recognition

We can now run strings through these machines to recognize strings in the language

But recognition is usually not quite what we need

Often if we find some string in the language we might like to assign a structure to it (parsing)

Or we might start with some structure and want to produce a surface form for it (production/generation)

For that we’ll move to finite state transducers

Add a second tape that can be written to


Finite State Transducers

The simple story

Add another tape

Add extra symbols to the transitions

On one tape we read cats, on the other we write cat +N +PL

+N and +PL are elements in the alphabet for one tape that represent underlying linguistic features
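A very rough sketch of the two-tape idea, with arcs labelled (input, output, next state); the tiny machine below, which maps cats → cat +N +PL and cat → cat +N +SG, is an illustrative assumption, not the textbook's full lexicon transducer.

```python
# Sketch of a transducer as states with arcs labelled (input, output, next).
# "" on the input side is an epsilon (read nothing). This toy machine maps
# "cats" -> "cat +N +PL" and "cat" -> "cat +N +SG".

ARCS = {
    0: [("c", "c", 1)],
    1: [("a", "a", 2)],
    2: [("t", "t", 3)],
    3: [("s", " +N +PL", 4),   # read plural -s, write the plural features
        ("", " +N +SG", 4)],   # read nothing, write the singular features
}
FINAL = {4}

def transduce(word, state=0, output=""):
    """Return every output string reachable by consuming `word`."""
    results = []
    if word == "" and state in FINAL:
        results.append(output)
    for inp, out, nxt in ARCS.get(state, []):
        if inp == "":                                   # epsilon-input arc
            results += transduce(word, nxt, output + out)
        elif word.startswith(inp):
            results += transduce(word[len(inp):], nxt, output + out)
    return results

print(transduce("cats"))  # ['cat +N +PL']
print(transduce("cat"))   # ['cat +N +SG']
```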


FSTs


The Gory Details

Of course, it’s not as easy as cat +N +PL <-> cats

As we saw earlier there are geese, mice and oxen

But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes

Cats vs. Dogs (‘s’ sound vs. ‘z’ sound)

Fox and Foxes (that ‘e’ got inserted)

And doubling consonants (swim, swimming)

adding k’s (picnic, picnicked)

deleting e’s, ...


Multi-Tape Machines

To deal with these complications, we will add even more tapes and use the output of one tape machine as the input to the next

So, to handle irregular spelling changes we will add intermediate tapes with intermediate symbols


Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level (M1), and another (M2) to handle the spelling changes to the surface tape

M1 knows about the particulars of the lexicon

M2 knows about weird English spelling rules


Lexical to Intermediate Level


Intermediate to Surface

The “add an e” English spelling rule, as in fox^s# <-> foxes#
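A rough sketch of that e-insertion rule as a string rewrite, with a regular expression standing in for the spelling-rule transducer (the ^ morpheme boundary and # word boundary symbols are as on the slide; the function names are assumptions for illustration).

```python
import re

# E-insertion sketch: insert an "e" when a morpheme boundary (^) separates
# a stem ending in x, s, or z from a word-final "s" (#).
# A regex rewrite stands in for the spelling-rule transducer M2.

def e_insertion(intermediate):
    """fox^s# -> foxes#, but leave e.g. bird^s# and cat# alone."""
    return re.sub(r"([xsz])\^(s#)", r"\1e\2", intermediate)

def clean_up(form):
    """Remove the remaining boundary symbols to get the surface form."""
    return form.replace("^", "").replace("#", "")

for word in ["fox^s#", "bird^s#", "cat#"]:
    print(word, "->", clean_up(e_insertion(word)))
# fox^s# -> foxes
# bird^s# -> birds
# cat# -> cat
```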


Foxes


Note

A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn’t apply. So...

fox^s# → foxes

but

bird^s# → birds

and

cat# → cat

Cascading FSTs

E-insertion rule

Possessive rule: add ’s or s’

We want to send all our words through all the FSTs

cat +N +SG → cat

cat +N +PL → cat^s

cat +N +SG +Poss → cat^’s

cat +N +PL +Poss → cat^s^’s

FSTs are closed under composition
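A sketch of the cascading idea, composing simple rewrite functions in place of composing the actual FSTs; the helper names and the particular intermediate forms below are illustrative assumptions.

```python
import re

# Cascading sketch: compose simple rewrite steps the way the FSTs are
# composed, applying each machine's job in turn to the intermediate form.

def e_insertion(form):
    """Insert e between x/s/z and a word-final plural s (fox^s# -> foxes#)."""
    return re.sub(r"([xsz])\^(s#)", r"\1e\2", form)

def surface(form):
    """Drop the boundary symbols ^ and # to yield the surface string."""
    return form.replace("^", "").replace("#", "")

def cascade(form, steps):
    for step in steps:                     # function composition, in order
        form = step(form)
    return form

for intermediate in ["cat^s#", "cat^'s#", "fox^s#", "fox^'s#"]:
    print(intermediate, "->", cascade(intermediate, [e_insertion, surface]))
# cat^s# -> cats, cat^'s# -> cat's, fox^s# -> foxes, fox^'s# -> fox's
```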

FST determinism

FSTs are not necessarily deterministic

The search algorithms for non-deterministic FSTs are inefficient

Sequential FSTs

Deterministic

More efficient

No ambiguity

P-subsequential

Allows some ambiguity

HW 2 questions?

(Homework 1 feedback on Thursday)

New Topic

Statistical language modeling

Chapter 4



Word Prediction

Guess the next word...

So I notice three guys standing on the ???

What are some of the knowledge sources you used to come up with those predictions?


Word Prediction

We can formalize this task using what are called N-gram models

N-grams are token sequences of length N

(-gram means “written”)

Our earlier example contains the following 2-grams (aka bigrams)

(So I), (I notice), (notice three), (three guys), (guys standing), (standing on), (on the)

Given knowledge of counts of N-grams such as these, we can guess likely next words in a sequence.
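A small sketch of extracting those bigrams from the example (whitespace tokenization is an assumption here; punctuation and other tokenization issues come up on the next slides).

```python
# Extract the 2-grams (bigrams) from the running example.
# Simple whitespace tokenization; punctuation handling comes later.

def ngrams(tokens, n):
    """Return the list of n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "So I notice three guys standing on the".split()
print(ngrams(tokens, 2))
# [('So', 'I'), ('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
#  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]
```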


N-Gram Models

More formally, we can use knowledge of the counts of N-grams to assess the conditional probability of candidate words as the next word in a sequence.

Or, we can use them to assess the probability of an entire sequence of words.

Pretty much the same thing, as we’ll see...


Applications

It turns out that being able to predict the next word (or any linguistic unit) in a sequence is an extremely useful thing to be able to do.

As we’ll see, it lies at the core of the following applications

Automatic speech recognition

Handwriting and character recognition

Spelling correction

Machine translation

And many more


Counting

Simple counting lies at the core of any probabilistic approach. So let’s first take a look at what we’re counting.

He stepped out into the hall, was delighted to encounter a water brother.

13 tokens, 15 if we include “,” and “.” as separate tokens.

Assuming we include the comma and period as tokens, how many bigrams are there?
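A quick way to check those counts (the tokenizer below, which splits off the comma and period, is an assumption; an N-token string has N-1 bigrams).

```python
import re

# Count tokens and bigrams for the example sentence, splitting off the
# comma and period as their own tokens (as the slide assumes).

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
tokens = re.findall(r"\w+|[,.]", sentence)

print(len(tokens))                 # 15 tokens (punctuation included)
bigrams = list(zip(tokens, tokens[1:]))
print(len(bigrams))                # 14 bigrams: an N-token string has N-1
```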


Counting

Not always that simple

I do uh main- mainly business data processing

Spoken language poses various challenges.

Should we count uh and other fillers as tokens?

What about the repetition of mainly? Should such do-overs count twice or just once?

The answers depend on the application.

If we’re focusing on something like ASR to support indexing for search, then uh isn’t helpful (it’s not likely to occur as a query).

But filled pauses are very useful in dialog management, so we might want them there

Tokenization of text raises the same kinds of issues


Counting: Types and Tokens

How about

They picnicked by the pool, then lay back on the grass and looked at the stars.

18 tokens (again counting punctuation)

But we might also note that the is used 3 times, so there are only 16 unique types (as opposed to tokens).

Going forward, we’ll have occasion to focus on counting both types and tokens of both words and N-grams.
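The same kind of check for types vs. tokens (again with an assumed tokenizer that splits off punctuation).

```python
import re

# Token vs. type counts for the example sentence, with punctuation
# split off as its own token.

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"\w+|[,.]", sentence)

print(len(tokens))        # 18 tokens
print(len(set(tokens)))   # 16 types ("the" appears 3 times)
```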


Counting: Corpora

What happens when we look at large bodies of text instead of single utterances?

Google Web Crawl

Crawl of 1,024,908,267,229 English tokens in Web text

13,588,391 wordform types

That seems like a lot of types... After all, even large dictionaries of English have only around 500k types. Why so many here?

Numbers

Misspellings

Names

Acronyms

etc.


Language Modeling

Now that we know how to count, back to word prediction

We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence

P(wn | w1, w2, …, wn-1)

We’ll call a statistical model that can assess this a Language Model


Language Modeling

How might we go about calculating such a conditional probability?

One way is to use the definition of conditional probabilities and look for counts. So to get

P(the | its water is so transparent that)

By definition that’s

P(its water is so transparent that the) / P(its water is so transparent that)

We can get each of those from counts in a large corpus.


Very Easy Estimate

How to estimate P(the | its water is so transparent that)?

P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)


Very Easy Estimate

According to Google those counts are 12,000 and 19,000, so the conditional probability of interest is...

P(the | its water is so transparent that) = 0.63


Language Modeling

Unfortunately, for most sequences and for most text collections we won’t get good estimates from this method.

What we’re likely to get is 0. Or worse 0/0.

Clearly, we’ll have to be a little more clever.

Let’s first use the chain rule of probability

And then apply a particularly useful independence assumption


The Chain Rule

Recall the definition of conditional probabilities: P(A|B) = P(A,B) / P(B)

Rewriting: P(A,B) = P(A|B) P(B)

For sequences...

P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

In general

P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1 … xn-1)


The Chain Rule

P(its water was so transparent)=

P(its)*

P(water|its)*

P(was|its water)*

P(so|its water was)*

P(transparent|its water was so)


Unfortunately

There are still a lot of possible sequences in there

In general, we’ll never be able to get enough data to compute the statistics for those longer prefixes

Same problem we had for the strings themselves


Independence Assumption

Make the simplifying assumption

P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)

Or maybe

P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)

That is, the probability in question is to some degree independent of its earlier history.


Independence Assumption

This particular kind of independence assumption is called a Markov assumption after the Russian mathematician Andrei Markov.


Markov Assumption

So for each component in the product, replace it with the approximation (assuming a prefix of N - 1):

P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)

Bigram version:

P(wn | w1 … wn-1) ≈ P(wn | wn-1)


Estimating Bigram Probabilities

The Maximum Likelihood Estimate (MLE):

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)


An Example

<s> I am Sam </s>

<s> Sam I am </s>

<s> I do not like green eggs and ham </s>
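A small sketch that computes MLE bigram estimates from this mini-corpus (the helper names are assumptions; smoothing and related refinements come later in the chapter).

```python
from collections import Counter

# MLE bigram estimates from the three-sentence mini-corpus:
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3 ~= 0.667
print(p_mle("Sam", "am"))    # 1/2 = 0.5
print(p_mle("do", "I"))      # 1/3 ~= 0.333
```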