Language modelling using N-Grams
Presentation Transcript

Slide 1

Language modelling using N-Grams

Corpora and Statistical Methods
Lecture 7

Slide 2

In this lecture

We consider one of the basic tasks in Statistical NLP: language models are probabilistic representations of allowable sequences.

This part:
- methodological overview
- fundamental statistical estimation models

Next part:
- smoothing techniques

Slide 3

Assumptions, definitions, methodology, algorithms

Part 1

Slide 4

Example task

The word-prediction task (Shannon game).

Given:
- a sequence of words (the history)
- a choice of next word

Predict:
- the most likely next word

Generalises easily to other problems, such as predicting the POS of unknowns based on history.

Slide 5

Applications of the Shannon game

Automatic speech recognition (cf. tutorial 1):
- given a sequence of possible words, estimate its probability

Context-sensitive spelling correction:
- many spelling errors are real words, e.g. "He walked for miles in the dessert." (resp. desert)
- identifying such errors requires a global estimate of the probability of a sentence

Slide 6

Applications of N-gram models generally

POS Tagging (cf. lecture 3):
- predict the POS of an unknown word by looking at its history

Statistical parsing:
- e.g. predict the group of words that together form a phrase

Statistical NL Generation:
- given a semantic form to be realised as text, and several possible realisations, select the most probable one.

Slide 7

A real-world example: Google's "did you mean"

Google uses an n-gram model (based on sequences of characters, not words).

In this case, the sequence "apple desserts" is much more probable than "apple deserts".

Slide 8

How it works

Documents provided by the search engine are added to:
- an index (for fast retrieval)
- a language model (based on the probability of a sequence of characters)

A submitted query ("apple deserts") can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better ("apple desserts"); see the sketch below.

Outcome is a context-sensitive spelling correction:
- "apple deserts" → "apple desserts"
- "frod baggins" → "frodo baggins"
- "frod" → "ford"
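
As a rough illustration of the mechanism (not Google's actual implementation): the sketch below generates all one-edit variants of a query and keeps whichever candidate the character language model scores highest. The scoring function lm_score is a hypothetical assumption here, standing in for the model's probability of a character sequence.

import string

def edits1(word):
    """All strings one character edit away from `word`:
    deletions, transpositions, substitutions and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def correct(query, lm_score):
    """Return whichever candidate (the query itself or a one-edit variant)
    the language model scores highest."""
    candidates = {query} | edits1(query)
    return max(candidates, key=lm_score)

For example, correct("apple deserts", lm_score) would return "apple desserts" whenever the model gives the latter a higher score.

Slide 9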

The noisy channel model

After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed.), Prentice Hall, p. 198.

Slide 10

The assumptions behind n-gram models

Slide 11

The Markov Assumption

Markov models: probabilistic models which predict the likelihood of a future unit based on limited history.

In language modelling, this pans out as the local history assumption: the probability of w_n depends on a limited number of prior words.

Utility of the assumption: we can rely on a small n for our n-gram models (bigram, trigram), since long n-grams become exceedingly sparse and probabilities become very small with long sequences.

Slide 12

The structure of an n-gram model

The task can be re-stated in conditional probabilistic terms:
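
A standard way of writing this (the exact notation may differ from the original slide): the chain rule combined with the Markov assumption gives

$$P(w_1 \dots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1 \dots w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \dots w_{i-1})$$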

Limiting n under the Markov Assumption means:
- greater chance of finding more than one occurrence of the sequence w_1 ... w_(n-1)
- more robust statistical estimations

N-grams are essentially equivalence classes or bins: every unique n-gram is a type or bin.

Slide 13

Structure of n-gram models (II)

If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model.

Note terminology: n-gram model = (n-1)th order Markov Model.

Slide 14

Methodological considerations

We are often concerned with:
- building an n-gram model
- evaluating it

We therefore make a distinction between training and test data.
- You never test on your training data: if you do, you're bound to get good results.
- N-gram models tend to be overtrained, i.e. if you train on a corpus C, your model will be biased towards expecting the kinds of events in C. Another term for this: overfitting.

Slide 15

Dividing the data

Given: a corpus of n units (words, sentences, ... depending on the task).
- A large proportion of the corpus is reserved for training.
- A smaller proportion for testing/evaluation (normally 5-10%).

Slide 16

Held-out (validation) data

Held-out estimation:
- during training, we sometimes estimate parameters for our model empirically
- commonly used in smoothing (how much probability space do we want to set aside for unseen data?)
- therefore, the training set is often split further into training data and validation data
- normally, held-out data is 10% of the size of the training data

Slide 17

Development data

A common approach:
1. train an algorithm on training data
   a. (estimate further parameters on held-out data if required)
2. evaluate it
3. re-tune it
4. go back to Step 1 until no further fine-tuning is necessary
5. carry out the final evaluation

For this purpose, it's useful to have:
- training data for step 1
- a development set for steps 2-4
- a final test set for step 5 (see the sketch below)
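
A minimal sketch of such a three-way split. The 80/10/10 proportions are illustrative assumptions; the lecture only says that the test portion is normally 5-10% and that held-out data is about 10% of the training set.

import random

def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a corpus and split it into training, development and test portions.
    Proportions are illustrative, not prescribed by the lecture."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n_train = int(len(sentences) * train_frac)
    n_dev = int(len(sentences) * dev_frac)
    return (sentences[:n_train],                 # training data (step 1)
            sentences[n_train:n_train + n_dev],  # development set (steps 2-4)
            sentences[n_train + n_dev:])         # final test set (step 5)

Slide 18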

Significance testing

Often, we compare the performance of our algorithm against some baseline.

A single, raw performance score won't tell us much. We need to test for significance (e.g. using a t-test).

Typical method (see the sketch below):
- split the test set into several small test sets, e.g. 20 samples
- evaluation is carried out separately on each
- mean and variance are estimated based on the 20 different samples
- test for a significant difference between the algorithm and a predefined baseline
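
A minimal sketch, assuming we already have one score per small test set for both the algorithm and the baseline. A paired t-test via scipy is one reasonable choice; the lecture does not fix a particular implementation.

from statistics import mean, variance
from scipy import stats

def significantly_better(algo_scores, baseline_scores, alpha=0.05):
    """Paired t-test over per-sample scores (e.g. one score per small test set)."""
    print("algorithm: mean=%.3f, variance=%.3f" % (mean(algo_scores), variance(algo_scores)))
    print("baseline : mean=%.3f, variance=%.3f" % (mean(baseline_scores), variance(baseline_scores)))
    result = stats.ttest_rel(algo_scores, baseline_scores)
    return result.pvalue < alpha

Slide 19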

Size of n-gram models

In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
- For a bigram model: N^2 possible n-grams in principle.
- For a trigram model: N^3 possible n-grams.
- ...

Slide 20

Size (continued)

Each n-gram in our model is a parameter used to estimate the probability of the next possible word.
- too many parameters make the model unwieldy
- too many parameters lead to data sparseness: most of them will have f = 0 or 1

Most models stick to unigrams, bigrams or trigrams.
- estimation can also combine different order models

Slide 21

Further considerations

When building a model, we tend to take into account the start-of-sentence symbol:
- e.g. for "the girl swallowed a large green caterpillar", the first bigrams become <s> the, the girl, ...

Also typical to map all tokens w such that count(w) < k to <UNK> (see the sketch below):
- usually, tokens with frequency 1 or 2 are just considered "unknown" or "unseen"
- this reduces the parameter space
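
A minimal preprocessing sketch along these lines, assuming each sentence is a list of tokens; the cut-off k = 3 (so tokens seen once or twice become <UNK>) is one possible choice, not a value fixed by the lecture.

from collections import Counter

def preprocess(sentences, k=3, start="<s>", unk="<UNK>"):
    """Prepend the start-of-sentence symbol and map rare tokens to <UNK>."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= k}  # keep tokens with count(w) >= k
    return [[start] + [tok if tok in vocab else unk for tok in sent]
            for sent in sentences]

Slide 22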

Building models using Maximum Likelihood Estimation

Slide 23

Maximum Likelihood Estimation Approach

Basic equation:
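
In the usual notation (the exact form may differ from the original slide), the MLE estimate for an n-gram is the relative frequency of the full n-gram against its history:

$$P_{MLE}(w_n \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_{n-1} w_n)}{C(w_1 \dots w_{n-1})}$$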

In a unigram model, this reduces to simple probability.

MLE models estimate probability using relative frequency.
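
A minimal sketch of relative-frequency estimation for a bigram model, assuming sentences are lists of tokens already padded with <s> as above:

from collections import Counter

def mle_bigram(sentences):
    """Return P(word | prev) estimated by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    def prob(prev, word):
        if unigrams[prev] == 0:
            return 0.0  # unseen history
        return bigrams[(prev, word)] / unigrams[prev]
    return prob

Any bigram that never occurred in training receives probability zero here, which is exactly the limitation discussed on the next slide.

Slide 24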

Limitations of MLE

MLE builds the model that maximises the probability of the training data.

Events unseen in the training data are assigned zero probability. Since n-gram models tend to be sparse, this is a real problem.

Consequences:
- seen events are given more probability mass than they truly have
- unseen events are given zero mass

Slide 25

Seen/unseen

[Diagram: A = probability mass of events in the training data; A' = probability mass of events not in the training data.]

The problem with MLE is that it distributes A' among the members of A.

Slide 26

The solution

Solution is to correct MLE estimation using a smoothing technique.
- More on this in the next part.
- But cf. Tutorial 1, which introduced the simplest method of smoothing known.

Slide 27

Adequacy of different order models

Manning/Schutze '99 report results for n-gram models of a corpus of the novels of Austen.

Task: use the n-gram model to predict the probability of a sentence in the test data.

Models:
- unigram: essentially a zero-context Markov model, uses only the probability of individual words
- bigram
- trigram
- 4-gram

Slide 28

Example test case

Training Corpus: five Jane Austen novels
- Corpus size = 617,091 words
- Vocabulary size = 14,585 unique types

Task: predict the next word of the trigram "inferior to ________", from the test data (Persuasion):

"[In person, she was] inferior to both [sisters.]"

Slide 29

Selecting an n

Vocabulary (V) = 20,000 words

n              Number of bins (i.e. no. of possible unique n-grams)
2 (bigrams)    400,000,000
3 (trigrams)   8,000,000,000,000
4 (4-grams)    1.6 x 10^17

Slide 30

Adequacy of unigrams

Problems with unigram models:
- not entirely hopeless, because most sentences contain a majority of highly common words
- but they ignore syntax completely: P(In person she was inferior) = P(inferior was she person in)

Slide 31

Adequacy of bigrams

Bigrams:
- improve the situation dramatically
- some unexpected results: P(she|person) decreases compared to the unigram model; though "she" is very common, it is uncommon after "person"

Slide 32

Adequacy of trigrams

Trigram models will do brilliantly when they're useful.
- They capture a surprising amount of contextual variation in text.

Biggest limitation:
- most new trigrams in test data will not have been seen in the training data
- the problem carries over to 4-grams, and is much worse!

Slide 33

Reliability vs. Discrimination

- larger n: more information about the context of the specific instance (greater discrimination)
- smaller n: more instances in training data, better statistical estimates (more reliability)

Slide 34

Backing off

Possible way of striking a balance between reliability and discrimination: a backoff model (sketched below):
- where possible, use a trigram
- if the trigram is unseen, try and "back off" to a bigram model
- if bigrams are unseen, try and "back off" to a unigram
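
A deliberately simplified sketch of the idea. A real backoff model (e.g. Katz backoff) discounts the higher-order estimates and redistributes the saved mass so that probabilities still sum to one; that step is omitted here. trigram_p, bigram_p and unigram_p are assumed estimator functions like the MLE sketch above.

def backoff_prob(w1, w2, w3, trigram_p, bigram_p, unigram_p):
    """Use the trigram estimate if the trigram was seen in training,
    otherwise back off to the bigram, then to the unigram."""
    p = trigram_p(w1, w2, w3)
    if p > 0:
        return p
    p = bigram_p(w2, w3)
    if p > 0:
        return p
    return unigram_p(w3)

Slide 35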

Evaluating language models

Slide 36

Perplexity

Recall: entropy is a measure of uncertainty: high entropy = high uncertainty.

Perplexity: if I've trained on a sample, how surprised am I when exposed to a new sample?
- a measure of the uncertainty of a model on new data

Slide 37

Entropy as “expected value”

One way to think of the summation part is as a weighted average of the information content.

We can view this average value as an "expectation": the expected surprise/uncertainty of our model.
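
Written out (the standard definition, not a reproduction of the slide's own equation):

$$H(X) = -\sum_{x} p(x) \log_2 p(x) = E_p\big[-\log_2 p(x)\big]$$

The term -log2 p(x) is the information content ("surprisal") of x; the sum weights it by how likely x is.

Slide 38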

Comparing distributions

We have a language model built from a sample. The sample is a probability distribution q over n-grams:
- q(x) = the probability of some n-gram x in our model.

The sample is generated from a true population ("the language") with probability distribution p:
- p(x) = the probability of x in the true distribution.

Slide 39

Evaluating a language model

We’d like an estimate of how good our model is as a model of the language

i.e. we’d like to compare

q

to

p

We don’t have access to p. (Hence, can’t use KL-Divergence)

Instead, we use our test data as an estimate of p.Slide40

Cross-entropy: basic intuition

Measure the number of bits needed to identify an event coming from p, if we code it according to q:
- we draw sequences according to p, but we sum the log of their probability according to q
- this estimate is called cross-entropy, H(p,q)
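
In standard notation:

$$H(p, q) = -\sum_{x} p(x) \log_2 q(x)$$

Slide 41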

Cross-entropy: p vs. q

Cross-entropy is an upper bound on the entropy of the true distribution p:
- H(p) ≤ H(p,q)
- if our model distribution q is good, H(p,q) ≈ H(p)

We estimate cross-entropy based on our test data. This gives an estimate of the distance of our language model from the distribution in the test sample.

Slide 42

Estimating cross-entropy

[Equation: the probability is taken according to p (the test set), the log/entropy term according to q (the language model).]
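
A standard form of this estimate, assuming the test set provides sequences x_1 ... x_N drawn from p:

$$H(p, q) \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)$$

Slide 43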

Perplexity

The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p, is:
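
In terms of the cross-entropy estimated above (a standard definition; the slide's exact formula may differ):

$$\mathrm{Perplexity}(q) = 2^{H(p,q)}$$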

A perplexity value of k (obtained on a test set) tells us: our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.

The lower the perplexity, the better the language model (the lower the surprise on our test data).
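
A minimal sketch of the computation for a bigram model, assuming prob(prev, word) returns a smoothed (non-zero) probability for every test bigram:

import math

def perplexity(test_sentences, prob):
    """Per-word perplexity: 2 to the power of the per-word cross-entropy."""
    log_sum, n_words = 0.0, 0
    for sent in test_sentences:
        for prev, word in zip(sent, sent[1:]):
            log_sum += math.log2(prob(prev, word))
            n_words += 1
    cross_entropy = -log_sum / n_words
    return 2 ** cross_entropy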

Slide 44

Perplexity example (Jurafsky & Martin, 2000, p. 228)

Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal):
- applied smoothing
- 38 million words
- vocabulary of 19,979 (low-frequency words mapped to UNK)

Computed perplexity on a test set of 1.5 million words.

Slide 45

J&M’s results

Trigrams do best of all. The value suggests the extent to which the model can fit the data in the test set. Note: with unigrams, the model has to make lots of guesses!

N-gram model    Perplexity
Unigram         962
Bigram          170
Trigram         109

Slide 46

Summary

Main point about Markov-based language models:
- data sparseness is always a problem
- smoothing techniques are required to estimate the probability of unseen events

The next part discusses more refined smoothing techniques than those seen so far.