
N-Gram Language Models
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

Outline
- General questions to ask about a language model
- N-gram language models
- Special case: Unigram language models
- Smoothing methods

Central Questions to Ask about a LM: "ADMI"
- Application: Why do you need a LM? For what purpose?
  (Determines the evaluation metric for the LM; e.g., speech recognition, evaluated by perplexity)
- Data: What kind of data do you want to model?
  (Determines the data set for estimation & evaluation; e.g., speech text data)
- Model: How do you define the model? What assumptions are to be made?
  (e.g., limited memory: an N-gram LM)
- Inference: How do you infer/estimate the parameters?
  (Determines the inference/estimation algorithm; e.g., smoothing methods)

Central Question in LM: p(w1 w2 … wm | C) = ?
- What is C? We usually ignore C (= "context") since it depends on the application, but it is important to consider it when applying a LM.
- Refinement 1: p(w1 w2 … wm | C) ≈ p(w1 w2 … wm)
  - What random variables are involved? What is the event space?
  - What event does "w1 w2 … wm" represent? What is the sample space?
  - p(w1 w2 … wm) = p(X = w1 w2 … wm) vs. p(X1 = w1, X2 = w2, …, Xm = wm)?
    What is X? What are X1, X2, …, Xm?

Central Question in LM: p(w1 w2 … wm | C) = ?
- Refinement 2: p(w1 w2 … wm) = p(X1 = w1, X2 = w2, …, Xm = wm). What assumption have we made here?
- Chain rule:
  p(w1 w2 … wm) = p(X1 = w1, X2 = w2, …, Xm = wm) = p(X1 = w1) p(X2 = w2 | X1 = w1) … p(Xm = wm | X1 = w1, …, Xm-1 = wm-1)
  i.e., p(w1 w2 … wm) = p(w1) p(w2 | w1) … p(wm | w1 w2 … wm-1)
  What about p(w1 w2 … wm) = p(wm) p(wm-1 | wm) … p(w1 | w2 … wm)?
- Refinement 3: Assume limited dependence (each word depends only on the n-1 previous words) → N-gram LM:
  p(w1 w2 … wm) ≈ p(w1) p(w2 | w1) … p(wn | w1, …, wn-1) … p(wm | wm-n+1, …, wm-1)
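As a concrete illustration of Refinement 3 with n = 2 (a bigram model), the probability of a short word sequence factors into bigram terms. This is only a sketch: the probability table and the start symbol "<s>" (standing in for the empty history of the first word) are assumptions for the example, not from the slides.

```python
# Minimal sketch: sequence probability under a bigram factorization.
import math

p_bigram = {("<s>", "the"): 0.3, ("the", "cat"): 0.1, ("cat", "sat"): 0.2}  # illustrative values

def sentence_logprob(words, p_bigram):
    words = ["<s>"] + words  # assumed start symbol for the first word's history
    return sum(math.log(p_bigram[(words[i - 1], words[i])]) for i in range(1, len(words)))

print(math.exp(sentence_logprob(["the", "cat", "sat"], p_bigram)))  # 0.3 * 0.1 * 0.2 = 0.006
```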

Key Assumption in N-gram LM:
p(wm | w1, …, wm-n, wm-n+1, …, wm-1) = p(wm | wm-n+1, …, wm-1)
(the earlier history w1, …, wm-n is ignored)
Does this assumption hold?

Estimation of N-Gram LMs
- Text data: D
- Question: p(wm | wm-n+1, …, wm-1) = ?
- Using p(X|Y) = p(X,Y)/p(Y):
  p(wm | wm-n+1, …, wm-1) = p(wm-n+1, …, wm-1, wm) / p(wm-n+1, …, wm-1)
- So it boils down to estimating p(w1, w2, …, wm); the ML estimate is
  p(w1, w2, …, wm) = c(w1 w2 … wm, D) / (total count of all word sequences of length m in D)
  where c(w1 w2 … wm, D) is the count of the word sequence "w1 w2 … wm" in D.
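Below is a minimal Python sketch of this ML estimate, assuming the data D is a plain list of tokens; the function names and the toy corpus are illustrative, not from the slides.

```python
# Minimal sketch: ML estimation of an n-gram LM from a tokenized corpus D.
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ml_ngram_prob(tokens, n):
    """Return p_ML(w_m | w_{m-n+1}, ..., w_{m-1}) = c(history + w) / c(history)."""
    full = ngram_counts(tokens, n)
    hist = ngram_counts(tokens, n - 1)
    def prob(word, history):            # history: tuple of the n-1 preceding words
        h = tuple(history)
        return full[h + (word,)] / hist[h] if hist[h] > 0 else 0.0
    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = ml_ngram_prob(tokens, 2)            # bigram model
print(p("cat", ("the",)))               # c("the cat") / c("the") = 2/3
```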

ML Estimate of N-Gram LM
p(wm | wm-n+1, …, wm-1) = c(wm-n+1 … wm-1 wm, D) / c(wm-n+1 … wm-1, D)
- The count of a long word sequence may be zero!
  - Not accurate
  - Causes problems when computing the conditional probability p(wm | wm-n+1, …, wm-1)
- Solution: smoothing
  - Key idea: back off to shorter N-grams, eventually to unigrams
  - Treat shorter N-gram models as priors in Bayesian estimation

Special Case of N-Gram LM: Unigram LM
- Generate text by generating each word INDEPENDENTLY:
  p(wm | w1, …, wm-n+1, …, wm-1) = p(wm): history doesn't matter!
- How to estimate a unigram LM? Text data: d
- Maximum likelihood estimator:
  pML(w | d) = c(w, d) / |d|
  where c(w, d) is the count of word w in d and |d| is the total count of all words in d.
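A one-function sketch of the unigram ML estimator above; the toy document is illustrative.

```python
# Minimal sketch: ML estimate of a unigram LM, p_ML(w|d) = c(w,d) / |d|.
from collections import Counter

def unigram_ml(doc_tokens):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

d = "text mining text data mining text".split()
p = unigram_ml(d)
print(p["text"])   # 3/6 = 0.5; unseen words get probability 0 under the ML estimate
```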

Unigram Language Model Smoothing (Illustration)
[Figure: p(w|d) plotted over words w, comparing the maximum likelihood estimate with the smoothed LM.]

How to Smooth?
- All smoothing methods try to:
  - discount the probability of words seen in the text data set
  - re-allocate the extra counts so that unseen words will have a non-zero count
- Method 1: Additive smoothing: add a constant to the count of each word, e.g., "add 1" ("add one", Laplace):
  p(w|d) = (c(w,d) + 1) / (|d| + |V|)
  where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size.
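A sketch of additive smoothing as defined above, generalized to an added constant delta (delta = 1 gives add-one/Laplace); the vocabulary and document are illustrative.

```python
# Minimal sketch of additive ("add-delta") smoothing for a unigram LM:
# p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|)
from collections import Counter

def additive_smoothing(doc_tokens, vocab, delta=1.0):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    denom = total + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

vocab = {"text", "mining", "data", "network"}
p = additive_smoothing("text mining text data mining text".split(), vocab)
print(p["network"])   # unseen word now gets (0 + 1) / (6 + 4) = 0.1
```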

Improve Additive Smoothing
- Should all unseen words get equal probabilities?
- We can use a reference model to discriminate unseen words:
  p(w|d) = p_seen(w|d)      if w is seen in d   (discounted ML estimate)
         = α_d · p(w|REF)   otherwise            (reference language model)
  where α_d is a normalizer: the probability mass reserved for unseen words.

p(w|REF): Reference Language Model
- However, how do we define p(w|REF)?
- What do we know about those unseen words? Why are there unseen words?
  - Zipf's law: most words occur infrequently in text (e.g., just once)
  - Unseen words may be non-relevant to a topic
  - Unseen words may be relevant, but the text data sample isn't large enough to include them
- The context variable C in p(w1 w2 … wm | C) can provide a basis for defining p(w|REF)
  - E.g., in retrieval, p(w|Collection) can serve as p(w|REF) when estimating a language model p(w|d) for an individual document

Interpolation vs. Backoff
- Interpolation: view p(w|REF) as a prior and the actual counts as observed evidence; every estimate mixes the ML estimate with p(w|REF).
- Backoff (Katz backoff): if the count is sufficiently high (sufficient evidence), trust the ML estimate; otherwise, ignore the ML estimate and go with p(w|REF):
  p(w|d) = pML(w|d)         if c(w,d) is sufficiently high
         = α_d · p(w|REF)   otherwise
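A sketch contrasting the two strategies, assuming a reference model p_ref is already available; lam, the count threshold k, and alpha are illustrative parameters, and the backoff normalizer is only noted in a comment rather than computed.

```python
from collections import Counter

def interpolate(counts, total, p_ref, lam=0.5):
    """Interpolation: every word mixes the ML estimate with p(w|REF)."""
    return lambda w: (1 - lam) * counts[w] / total + lam * p_ref.get(w, 0.0)

def backoff(counts, total, p_ref, k=1, alpha=0.1):
    """Backoff: trust the ML estimate when the count is high enough, otherwise
    fall back to p(w|REF). (In Katz backoff, alpha is chosen so the probabilities
    sum to one; here it is just a fixed constant for illustration.)"""
    return lambda w: counts[w] / total if counts[w] >= k else alpha * p_ref.get(w, 0.0)

counts = Counter("text mining text data mining text".split())
p_ref = {"text": 0.001, "mining": 0.0009, "network": 0.001}
print(interpolate(counts, 6, p_ref)("network"), backoff(counts, 6, p_ref)("network"))
```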

Smoothing Methods Based on Interpolation
- Method 2: Absolute discounting (Kneser-Ney smoothing): subtract a constant δ from the count of each seen word:
  p(w|d) = ( max(c(w,d) - δ, 0) + δ·|d|u·p(w|REF) ) / |d|
  where |d|u is the number of unique words in d and δ is a parameter.
- Method 3: Linear interpolation (Jelinek-Mercer smoothing): "shrink" uniformly toward p(w|REF):
  p(w|d) = (1 - λ)·c(w,d)/|d| + λ·p(w|REF)
  where c(w,d)/|d| is the ML estimate and λ ∈ [0,1] is a parameter.
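A sketch of Methods 2 and 3 as defined above; p_ref stands in for p(w|REF), and the parameter values delta and lam are illustrative.

```python
from collections import Counter

def absolute_discount(doc_tokens, p_ref, delta=0.7):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    uniq = len(counts)                       # |d|_u: number of unique words in d
    def p(w):
        return (max(counts[w] - delta, 0) + delta * uniq * p_ref.get(w, 0.0)) / total
    return p

def jelinek_mercer(doc_tokens, p_ref, lam=0.5):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    def p(w):
        return (1 - lam) * counts[w] / total + lam * p_ref.get(w, 0.0)
    return p

p_ref = {"text": 0.001, "mining": 0.0009, "network": 0.001}
doc = "text mining text data mining text".split()
print(absolute_discount(doc, p_ref)("network"), jelinek_mercer(doc, p_ref)("network"))
```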

Smoothing Methods Based on Interpolation (cont.)
- Method 4: Dirichlet prior / Bayesian smoothing (MacKay): assume μ·p(w|REF) pseudo counts:
  p(w|d) = ( c(w,d) + μ·p(w|REF) ) / ( |d| + μ )
  where μ is a parameter.
- What would happen if we increase/decrease μ? What if |d| → +∞?
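A sketch of Method 4; the default mu = 2000 and the toy inputs are illustrative choices, not values from the slides.

```python
# Minimal sketch of Dirichlet prior smoothing:
# p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu), with mu pseudo counts in total.
from collections import Counter

def dirichlet_prior(doc_tokens, p_ref, mu=2000):
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    def p(w):
        return (counts[w] + mu * p_ref.get(w, 0.0)) / (total + mu)
    return p

p = dirichlet_prior("text mining text data mining text".split(), {"network": 0.001}, mu=10)
print(p("network"))   # (0 + 10*0.001) / (6 + 10) = 0.000625
# As mu -> 0 this approaches the ML estimate; as mu grows (or |d| is small
# relative to mu) it approaches p(w|REF); as |d| -> infinity it approaches the ML estimate.
```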

Linear Interpolation (Jelinek-Mercer) Smoothing: Example
- Document d (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
- Collection LM p(w|C): the 0.1, a 0.08, …, computer 0.02, database 0.01, …, text 0.001, network 0.001, mining 0.0009, …
- ML estimates: p(text|d) = 10/100, p(mining|d) = 5/100, p(association|d) = 3/100, p(database|d) = 3/100, p(query|d) = 1/100, p(network|d) = 0/100
- Smoothed unigram LM p(w|d) = (1 - λ)·c(w,d)/|d| + λ·p(w|C): text ?, mining ?, association ?, database ?, …, query ?, network ? (worked numbers follow below)
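Filling in the question marks requires a value of λ, which the slide leaves open; the numbers below assume λ = 0.5 purely for illustration.

```python
# Worked numbers for the example above with an assumed lambda = 0.5.
lam = 0.5
p_C = {"text": 0.001, "mining": 0.0009, "database": 0.01, "network": 0.001}
c_d = {"text": 10, "mining": 5, "database": 3, "network": 0}
d_len = 100
for w in ["text", "mining", "database", "network"]:
    p = (1 - lam) * c_d[w] / d_len + lam * p_C[w]
    print(w, p)   # e.g., text: 0.5*0.10 + 0.5*0.001 = 0.0505; network: 0 + 0.0005
```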

Dirichlet Prior (Bayesian) Smoothing: Example
- Document d (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
- Collection LM p(w|C): the 0.1, a 0.08, …, computer 0.02, database 0.01, …, text 0.001, network 0.001, mining 0.0009, …
- ML estimates: p(text|d) = 10/100, p(mining|d) = 5/100, p(association|d) = 3/100, p(database|d) = 3/100, p(query|d) = 1/100, p(network|d) = 0/100
- Smoothed unigram LM p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ): text ?, mining ?, association ?, database ?, …, query ?, network ? (worked numbers follow below)
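Again, the slide does not fix μ; the numbers below assume μ = 100 purely for illustration.

```python
# Worked numbers for the Dirichlet example with an assumed mu = 100.
mu, d_len = 100, 100
p_C = {"text": 0.001, "mining": 0.0009, "database": 0.01, "network": 0.001}
c_d = {"text": 10, "mining": 5, "database": 3, "network": 0}
for w in ["text", "mining", "database", "network"]:
    p = (c_d[w] + mu * p_C[w]) / (d_len + mu)
    print(w, p)   # e.g., text: (10 + 0.1) / 200 = 0.0505; network: 0.1 / 200 = 0.0005
```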

Dirichlet Prior Smoothing (Bayesian Smoothing)
- Bayesian estimator of a multinomial distribution (unigram LM):
  - First consider the posterior of the parameters: p(θ|d) ∝ p(d|θ)p(θ)
  - Then consider the mean or mode of the posterior distribution
- Sampling distribution (of the data): p(d|θ)
- Prior (on model parameters): p(θ) = p(θ1, …, θN), where θi is the probability of the i-th word in the vocabulary
- Conjugate prior: intuitive & mathematically convenient
  - "Encodes" the prior as "extra pseudo counts," which can be conveniently combined with the observed actual counts
  - p(d|θ) and p(θ) have the same functional form

Dirichlet Prior Smoothing
- The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:
  Dir(θ | α1, …, αN) ∝ θ1^(α1-1) · … · θN^(αN-1)
- "Pseudo" word counts: αi = μ·p(wi|REF)

Dirichlet Prior Smoothing (cont.)
- Posterior distribution of parameters:
  p(θ|d) = Dir(θ | c(w1,d) + α1, …, c(wN,d) + αN)
- The predictive distribution is the same as the posterior mean:
  p(wi|d) = (c(wi,d) + αi) / (|d| + Σj αj) = (c(wi,d) + μ·p(wi|REF)) / (|d| + μ)
  which is exactly Dirichlet prior smoothing.
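A short sketch of the derivation behind these two lines, under the standard multinomial likelihood / Dirichlet prior setup assumed on the previous slides: multiplying the likelihood by the prior simply adds the observed counts to the pseudo counts, and the posterior mean gives the smoothed estimate.

```latex
\begin{align*}
p(\theta \mid d)
  &\propto p(d \mid \theta)\, p(\theta)
   \propto \prod_{i=1}^{N} \theta_i^{c(w_i,d)} \cdot \prod_{i=1}^{N} \theta_i^{\alpha_i - 1}
   = \prod_{i=1}^{N} \theta_i^{c(w_i,d) + \alpha_i - 1} \\
p(w_i \mid d)
  &= \mathbb{E}[\theta_i \mid d]
   = \frac{c(w_i,d) + \alpha_i}{|d| + \sum_{j} \alpha_j}
   = \frac{c(w_i,d) + \mu\, p(w_i \mid \mathrm{REF})}{|d| + \mu},
   \quad \text{since } \alpha_i = \mu\, p(w_i \mid \mathrm{REF}) \text{ and } \textstyle\sum_j \alpha_j = \mu.
\end{align*}
```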

Good-Turing Smoothing
- Key idea: assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of all seen events in the same way.
- Adjusted count: c* = (c+1)·n_{c+1} / n_c, where c = c(w,d) and n_r is the number of words that occurred r times:
  the numerator is the sum of counts of all terms that occurred c(w,d)+1 times, shared among all the words that occurred c(w,d) times.
- Heuristics are needed (e.g., when n_{c+1} = 0).
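A sketch of the adjusted-count computation; the fallback for the case n_{c+1} = 0 is one possible heuristic, not the slide's prescription.

```python
# Minimal sketch of Good-Turing adjusted counts, c* = (c+1) * n_{c+1} / n_c.
from collections import Counter

def good_turing_counts(doc_tokens):
    counts = Counter(doc_tokens)
    n = Counter(counts.values())             # n[r] = number of words with count r
    adjusted = {}
    for w, c in counts.items():
        if n.get(c + 1, 0) > 0:
            adjusted[w] = (c + 1) * n[c + 1] / n[c]
        else:
            adjusted[w] = c                   # fall back to the raw count (a heuristic)
    return adjusted

adj = good_turing_counts("a a a b b c c d e f g".split())
print(adj)   # e.g., words seen twice get adjusted count 3 * n_3 / n_2 = 1.5
# Under Good-Turing, the probability mass reserved for unseen words is n[1] / (total tokens).
```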

Smoothing of p(wm | w1, …, wm-n+1, …, wm-1)
p(wm | wm-n+1, …, wm-1) = c(wm-n+1, …, wm-1, wm; D) / c(wm-n+1, …, wm-1; D)   (what if this count is zero?)
- How should we define p(w|REF)?
- In general, p(w|REF) can be defined based on any "clues" from the history h = (wm-n+1, …, wm-1)
- Most natural: p(w|REF) = p(wm | wm-n+2, …, wm-1), i.e., ignore wm-n+1; this can be done recursively to rely on shorter and shorter histories
- In general, relax the condition to make it less specific so as to increase the counts we can collect (e.g., shorten the history, cluster the history)
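A sketch of the recursive idea: use the smoothed (n-1)-gram model as p(w|REF) for the n-gram model, bottoming out at a smoothed unigram model. Dirichlet-prior-style interpolation and the value mu = 1.0 are illustrative choices here, not the slide's.

```python
from collections import Counter

def build_counts(tokens, max_n):
    """Counters of k-grams for k = 1..max_n."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

def smoothed_prob(word, history, counts, vocab_size, mu=1.0):
    """p(word | history), recursively smoothing with the shorter-history model."""
    if not history:                                   # base case: smoothed unigram LM
        total = sum(counts[1].values())
        return (counts[1][(word,)] + mu / vocab_size) / (total + mu)
    h = tuple(history)
    n = len(h) + 1
    p_ref = smoothed_prob(word, history[1:], counts, vocab_size, mu)  # drop oldest word
    return (counts[n][h + (word,)] + mu * p_ref) / (counts[n - 1][h] + mu)

tokens = "the cat sat on the mat the cat ran".split()
counts = build_counts(tokens, 3)
print(smoothed_prob("sat", ("the", "cat"), counts, vocab_size=6))
```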

What You Should Know
- What is an N-gram language model? What assumptions are made in an N-gram language model? What are the events involved?
- How to compute the ML estimate of an N-gram language model
- Why we need to do smoothing in general
- The major smoothing methods and how they work: additive smoothing, absolute discounting, linear interpolation (fixed coefficient), Dirichlet prior, Good-Turing
- The basic idea of deriving Dirichlet prior smoothing