Language modelling using N-Grams
Corpora and Statistical Methods
Lecture 7
In this lecture
We consider one of the basic tasks in Statistical NLP:
language models are probabilistic representations of allowable sequences
This part:
methodological overview
fundamental statistical estimation models
Next part:
smoothing techniques
Assumptions, definitions, methodology, algorithms
Part 1
Example task
The word-prediction task (Shannon game)
Given:
a sequence of words (the history)
a choice of next word
Predict:
the most likely next word
Generalises easily to other problems, such as predicting the POS of unknown words based on their history.
Applications of the Shannon game
Automatic speech recognition (cf. tutorial 1):
given a sequence of possible words, estimate its probability
Context-sensitive spelling correction:
many spelling errors are real words
He walked for miles in the dessert. (resp. desert)
Identifying such errors requires a global estimate of the probability of a sentence.
Applications of N-gram models generally
POS Tagging (cf. lecture 3):
predict the POS of an unknown word by looking at its history
Statistical parsing:
e.g. predict the group of words that together form a phrase
Statistical NL Generation:
given a semantic form to be realised as text, and several possible realisations, select the most probable one.
A real-world example: Google’s “did you mean”
Google uses an n-gram model (based on sequences of characters, not words).
In this case, the sequence apple desserts is much more probable than apple deserts
How it works
Documents provided by the search engine are added to:
An index (for fast retrieval)
A language model (based on probability of a sequence of characters)
A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”).
Outcome is a context-sensitive spelling correction:
“apple deserts” → “apple desserts”
“frod baggins” → “frodo baggins”
“frod” → “ford”
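A minimal sketch of this pipeline (not Google’s actual system; the count table stands in for a real character-level language model, and the data are invented): generate candidate queries one edit away, then keep the candidate the model scores highest.

```python
# Minimal "did you mean" sketch: candidates are one edit (insertion, deletion,
# substitution, transposition) away from the query; the candidate with the
# highest language-model score wins. The "model" below is a toy count table.
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def edits1(query):
    """All strings one edit away from the query."""
    splits = [(query[:i], query[i:]) for i in range(len(query) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    insertions = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + insertions)

def correct(query, model):
    """Return the candidate (the query itself included) the model likes best."""
    candidates = {query} | edits1(query)
    return max(candidates, key=lambda c: model[c])

# Hypothetical counts standing in for the indexed documents / query logs.
model = Counter({"apple desserts": 50, "apple deserts": 2, "ford": 30})
print(correct("apple deserts", model))   # -> 'apple desserts'
```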
The noisy channel model
After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed.), Prentice Hall, p. 198
The assumptions behind n-gram models
The Markov Assumption
Markov models:
probabilistic models which predict the likelihood of a future unit based on limited history
in language modelling, this pans out as the local history assumption:
the probability of w_n depends on a limited number of prior words
utility of the assumption:
we can rely on a small n for our n-gram models (bigram, trigram)
long n-grams become exceedingly sparse
probabilities become very small with long sequences
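Written out explicitly (standard notation; the formula itself is not preserved on the slide), the assumption approximates the full-history conditional probability by one that looks only at the last N−1 words:

```latex
P(w_n \mid w_1, \ldots, w_{n-1}) \;\approx\; P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})
```

For a bigram model (N = 2) this is P(w_n | w_{n−1}); for a trigram model (N = 3), P(w_n | w_{n−2}, w_{n−1}).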
The structure of an n-gram model
The task can be re-stated in conditional probabilistic terms: estimate P(w_n | w_1 … w_{n-1})
Limiting n under the Markov Assumption means:
greater chance of finding more than one occurrence of the sequence w_1 … w_{n-1}
more robust statistical estimations
N-grams are essentially equivalence classes or bins:
every unique n-gram is a type or bin
Structure of n-gram models (II)
If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model
Note terminology:
n-gram model = (n-1)th order Markov Model
Methodological considerations
We are often concerned with:
building an n-gram model
evaluating it
We therefore make a distinction between training and test data
You never test on your training data:
if you do, you’re bound to get good results.
N-gram models tend to be overtrained, i.e. if you train on a corpus C, your model will be biased towards expecting the kinds of events in C.
Another term for this: overfitting
Dividing the data
Given: a corpus of n units (words, sentences, …, depending on the task)
A large proportion of the corpus is reserved for training.
A smaller proportion for testing/evaluation (normally 5-10%)
Held-out (validation) data
Held-out estimation:
during training, we sometimes estimate parameters for our model empirically
commonly used in smoothing (how much probability space do we want to set aside for unseen data?)
therefore, the training set is often split further into training data and validation data
normally, held-out data is 10% of the size of the training data
Development data
A common approach:
1. train an algorithm on the training data
a. (estimate further parameters on held-out data if required)
2. evaluate it
3. re-tune it
4. go back to Step 1 until no further fine-tuning is necessary
5. carry out the final evaluation
For this purpose, it’s useful to have (see the split sketch below):
training data for step 1
a development set for steps 2-4
a final test set for step 5
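A minimal sketch of such a split, with illustrative (not prescriptive) proportions:

```python
# Carve a corpus into training, held-out, development and final test portions.
# Fractions are illustrative; the slides suggest ~5-10% for testing and a
# held-out portion of ~10% of the training data.
def split_corpus(units, dev_frac=0.1, test_frac=0.1, heldout_frac=0.1):
    n = len(units)
    n_test, n_dev = int(n * test_frac), int(n * dev_frac)
    train = units[: n - n_dev - n_test]             # step 1
    dev = units[n - n_dev - n_test : n - n_test]    # steps 2-4
    test = units[n - n_test :]                      # step 5 (final evaluation)
    n_held = int(len(train) * heldout_frac)         # held-out (validation) data
    heldout, train = train[:n_held], train[n_held:]
    return train, heldout, dev, test

train, heldout, dev, test = split_corpus(list(range(1000)))
print(len(train), len(heldout), len(dev), len(test))   # 720 80 100 100
```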
Significance testing
Often, we compare the performance of our algorithm against some baseline.
A single, raw performance score won’t tell us much. We need to test for significance (e.g. using a t-test).
Typical method (sketched below):
split the test set into several small test sets, e.g. 20 samples
evaluation carried out separately on each
mean and variance estimated based on the 20 different samples
test for a significant difference between the algorithm and a predefined baseline
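A hedged sketch of the mechanics using SciPy’s paired t-test; the per-split scores below are invented purely for illustration:

```python
# Paired t-test between our algorithm and a baseline, evaluated on the same
# small test splits. Scores are made-up numbers.
from scipy.stats import ttest_rel

system_scores   = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78]   # accuracy per split
baseline_scores = [0.76, 0.77, 0.78, 0.75, 0.79, 0.74]

t_stat, p_value = ttest_rel(system_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("difference from the baseline is significant at the 5% level")
```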
Size of n-gram models
In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
For a bigram model: N² possible n-grams in principle
For a trigram model: N³ possible n-grams
…
Size (continued)
Each n-gram in our model is a parameter used to estimate the probability of the next possible word.
too many parameters make the model unwieldy
too many parameters lead to data sparseness: most of them will have f = 0 or 1
Most models stick to unigrams, bigrams or trigrams.
estimation can also combine models of different order
Further considerations
When building a model, we tend to take into account the start-of-sentence symbol; for the girl swallowed a large green caterpillar, the bigrams include:
<s> the
the girl
…
Also typical to map all tokens w such that count(w) < k to <UNK>:
usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”
this reduces the parameter space (see the sketch below)
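A small preprocessing sketch along these lines; the end-of-sentence symbol </s> is a common companion to <s> (the slide mentions only the latter), and the corpus is invented:

```python
# Add sentence-boundary symbols and map rare tokens (count < k) to <UNK>.
from collections import Counter

def preprocess(sentences, k=2):
    counts = Counter(w for sent in sentences for w in sent)
    out = []
    for sent in sentences:
        tokens = [w if counts[w] >= k else "<UNK>" for w in sent]
        out.append(["<s>"] + tokens + ["</s>"])   # </s> is an assumed extra symbol
    return out

corpus = [["the", "girl", "swallowed", "a", "large", "green", "caterpillar"],
          ["the", "girl", "saw", "a", "caterpillar"]]
print(preprocess(corpus))
# tokens with count 1 ('swallowed', 'large', 'green', 'saw') become <UNK>
```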
Building models using Maximum Likelihood Estimation
Maximum Likelihood Estimation Approach
Basic equation: P_MLE(w_n | w_1 … w_{n-1}) = count(w_1 … w_n) / count(w_1 … w_{n-1})
In a unigram model, this reduces to the simple probability (relative frequency) of the word.
MLE models estimate probability using relative frequency (a small sketch follows below).
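A minimal sketch of MLE bigram estimation as relative frequency, on an invented toy corpus:

```python
# MLE bigram model: P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1}).
from collections import Counter

def train_bigram_mle(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda prev, word: bigrams[(prev, word)] / unigrams[prev]

p = train_bigram_mle([["the", "girl", "swallowed", "a", "caterpillar"],
                      ["the", "caterpillar", "was", "green"]])
print(p("<s>", "the"))       # 1.0  (both training sentences start with "the")
print(p("the", "girl"))      # 0.5
print(p("the", "elephant"))  # 0.0  -- an unseen event: the problem discussed next
```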
Limitations of MLE
MLE builds the model that maximises the probability of the training data.
Events unseen in the training data are assigned zero probability.
Since n-gram models tend to be sparse, this is a real problem.
Consequences:
seen events are given more probability mass than they really have
unseen events are given zero mass
Seen/unseen
A = probability mass of events in the training data; A’ = probability mass of events not in the training data.
The problem with MLE is that it distributes A’ among the members of A.
The solution
Solution is to correct the MLE estimate using a smoothing technique.
More on this in the next part.
But cf. Tutorial 1, which introduced the simplest known method of smoothing.
Adequacy of different order models
Manning/Schütze (1999) report results for n-gram models of a corpus of the novels of Austen.
Task: use the n-gram model to predict the probability of a sentence in the test data.
Models:
unigram: essentially a zero-context Markov model, uses only the probability of individual words
bigram
trigram
4-gram
Example test case
Training corpus: five Jane Austen novels
Corpus size = 617,091 words
Vocabulary size = 14,585 unique types
Task: predict the next word of the trigram “inferior to ________”, from test data (Persuasion): “[In person, she was] inferior to both [sisters.]”
Selecting an n
Vocabulary (V) = 20,000 words
Number of bins (i.e. no. of possible unique n-grams):
n = 2 (bigrams): 400,000,000
n = 3 (trigrams): 8,000,000,000,000
n = 4 (4-grams): 1.6 × 10¹⁷
Adequacy of unigrams
Problems with unigram models:
not entirely hopeless, because most sentences contain a majority of highly common words
they ignore syntax completely:
P(In person she was inferior) = P(inferior was she person in)
Adequacy of bigrams
Bigrams:
improve the situation dramatically
some unexpected results:
p(she|person) decreases compared to the unigram model: though she is very common, it is uncommon after person
Adequacy of trigrams
Trigram models will do brilliantly when they’re useful.
They capture a surprising amount of contextual variation in text.
Biggest limitation:
most new trigrams in test data will not have been seen in training data.
The problem carries over to 4-grams, and is much worse!
Reliability vs. Discrimination
larger n: more information about the context of the specific instance (greater discrimination)
smaller n: more instances in training data, better statistical estimates (more reliability)
Backing off
Possible way of striking a balance between reliability and discrimination:
backoff model:
where possible, use a trigram
if the trigram is unseen, try and “back off” to a bigram model
if bigrams are unseen, try and “back off” to a unigram (a simplified sketch follows below)
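A simplified sketch in this spirit; without proper discounting (as in Katz backoff) the scores are not true probabilities, so this is closer to a “stupid backoff” scheme than to a fully-fledged backoff model, and the counts are invented:

```python
# Back off from trigram to bigram to unigram counts. The alpha factor is an
# arbitrary penalty, not a properly estimated discount.
from collections import Counter

def backoff_score(w1, w2, w3, trigrams, bigrams, unigrams, total, alpha=0.4):
    if trigrams[(w1, w2, w3)] > 0:                        # trigram seen: use it
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:                             # back off to the bigram
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / total           # back off to the unigram

unigrams = Counter({"inferior": 5, "to": 50, "both": 3, "the": 60})
bigrams  = Counter({("inferior", "to"): 5, ("to", "the"): 20, ("to", "both"): 1})
trigrams = Counter({("was", "inferior", "to"): 2})
total = sum(unigrams.values())

# The trigram ("inferior", "to", "both") is unseen, so we back off to ("to", "both").
print(backoff_score("inferior", "to", "both", trigrams, bigrams, unigrams, total))  # 0.008
```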
Evaluating language models
Perplexity
Recall: entropy is a measure of uncertainty:
high entropy = high uncertainty
Perplexity:
if I’ve trained on a sample, how surprised am I when exposed to a new sample?
a measure of the uncertainty of a model on new data
Entropy as “expected value”
One way to think of the summation part is as a weighted average of the information content.
We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.
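In symbols (the standard definition, reconstructed here rather than copied from the slide), the entropy of a distribution p is the expected information content:

```latex
H(p) \;=\; -\sum_{x} p(x)\,\log_2 p(x) \;=\; \mathbb{E}_{p}\!\left[-\log_2 p(x)\right]
```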
Comparing distributions
We have a language model built from a sample. The sample is a probability distribution q over n-grams:
q(x) = the probability of some n-gram x in our model.
The sample is generated from a true population (“the language”) with probability distribution p:
p(x) = the probability of x in the true distribution
Evaluating a language model
We’d like an estimate of how good our model is as a model of the language,
i.e. we’d like to compare q to p
We don’t have access to p (hence, we can’t use KL-divergence).
Instead, we use our test data as an estimate of p.
Cross-entropy: basic intuition
Measure the number of bits needed to identify an event coming from p, if we code it according to q:
H(p,q) = − Σ_x p(x) log2 q(x)
We draw sequences according to p, but we sum the log of their probability according to q.
This estimate is called cross-entropy, H(p,q).
Cross-entropy: p vs. q
Cross-entropy is an upper bound on the entropy of the true distribution p:
H(p) ≤ H(p,q)
if our model distribution (q) is good, H(p,q) ≈ H(p)
We estimate cross-entropy based on our test data.
This gives an estimate of the distance of our language model from the distribution in the test sample.
Estimating cross-entropy
On a test set of N words w_1 … w_N, the cross-entropy is estimated as H(p,q) ≈ −(1/N) Σ_i log2 q(w_i | history):
the words are drawn according to p (the test set), while the log probabilities come from q (the language model).
Perplexity
The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p, is: perplexity = 2^H(p,q)
A perplexity value of k (obtained on a test set) tells us:
our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.
The lower the perplexity, the better the language model (the lower the surprise on our test data). A small worked sketch follows below.
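A small worked sketch of cross-entropy and perplexity for a bigram model; the conditional probabilities q are assumed (already smoothed), not derived from real training data:

```python
# Per-word cross-entropy and perplexity of a bigram model q on a test sequence.
import math

def cross_entropy(test_tokens, q):
    """H(p, q) estimated on the test data: -(1/N) * sum of log2 q(w_i | w_{i-1})."""
    log_prob = sum(math.log2(q[(prev, word)])
                   for prev, word in zip(test_tokens, test_tokens[1:]))
    return -log_prob / (len(test_tokens) - 1)

def perplexity(test_tokens, q):
    return 2 ** cross_entropy(test_tokens, q)

# Hypothetical smoothed conditional probabilities q(word | prev).
q = {("<s>", "the"): 0.5, ("the", "girl"): 0.25, ("girl", "ran"): 0.1}
test = ["<s>", "the", "girl", "ran"]
print(perplexity(test, q))   # ~4.31: about 4 "guesses" per word on average
```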
Perplexity example (Jurafsky & Martin, 2000, p. 228)
Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal)
applied smoothing
38 million words
vocabulary of 19,979 types (low-frequency words mapped to UNK)
Computed perplexity on a test set of 1.5 million words.
J&M’s results
Trigrams do best of all.
The value suggests the extent to which the model can fit the data in the test set.
Note: with unigrams, the model has to make lots of guesses!
N-gram model: Perplexity
Unigram: 962
Bigram: 170
Trigram: 109
Summary
Main point about Markov-based language models:
data sparseness is always a problem
smoothing techniques are required to estimate probability of unseen events
Next part discusses more refined smoothing techniques than those seen so far.