Lecture 2: N-gram
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS 6501: Natural Language Processing
This lecture
Language Models: What are N-gram models?
How to use probabilities: What does P(Y|X) mean? How can I manipulate it? How can I estimate its value in practice?
What is a language model?
Probability distributions over sentences (i.e., word sequences): P(W) = P(w_1, w_2, ..., w_n)
We can use a language model to generate strings, and to rank possible sentences:
P("Today is Tuesday") > P("Tuesday Today is")
P("Today is Tuesday") > P("Today is Virginia")
Language model applications: context-sensitive spelling correction
Language model applications: autocomplete
Language model applications: Smart Reply
Language model applications: language generation (https://pdos.csail.mit.edu/archive/scigen/)
Bag-of-Words with N-grams
N-gram: a contiguous sequence of n tokens from a given piece of text.
http://recognize-speech.com/language-model/n-gram-model/comparison
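As a concrete illustration of the definition above, a minimal Python sketch that extracts the n-grams of a token sequence (the function name is our own):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "today is a sunny day".split()
print(ngrams(tokens, 2))
# [('today', 'is'), ('is', 'a'), ('a', 'sunny'), ('sunny', 'day')]
```

With n = 1 this degenerates to the bag-of-words view of the same text.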
N-Gram Models
Unigram model: P(w_1, ..., w_n) = ∏_i P(w_i)
Bigram model: P(w_1, ..., w_n) = ∏_i P(w_i | w_{i-1})
Trigram model: P(w_1, ..., w_n) = ∏_i P(w_i | w_{i-2}, w_{i-1})
N-gram model: P(w_1, ..., w_n) = ∏_i P(w_i | w_{i-n+1}, ..., w_{i-1})
Random language via n-gram
http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Behind the scenes: probability theory
Sampling with replacement
(The original slide poses nine probability questions over draws of colored balls with replacement, e.g., the probability of a single color, joint probabilities of several draws, and conditional probabilities such as P(red | ...) and P(... | red).)
Sampling words with replacement
Example from Julia Hockenmaier, Intro to NLP
Implementation: how to sample?
Sample from a discrete distribution p(X):
Assume k outcomes x_1, ..., x_k in the event space.
Divide the interval [0, 1] into k intervals according to the probabilities of the outcomes.
Generate a random number r between 0 and 1.
Return the outcome x_i whose interval r falls into.
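The steps above can be sketched in a few lines of Python; the cumulative sums define the intervals, and a binary search finds which interval the random number falls into (function names are our own):

```python
import bisect
import itertools
import random

def make_sampler(outcomes, probs):
    """Precompute cumulative probabilities; return a function that maps
    a uniform random number r in [0, 1) to an outcome."""
    cumulative = list(itertools.accumulate(probs))  # e.g. [0.5, 0.8, 1.0]
    def sample(r=None):
        if r is None:
            r = random.random()
        # bisect finds the first interval whose upper bound exceeds r
        return outcomes[bisect.bisect(cumulative, r)]
    return sample

sample = make_sampler(["red", "blue", "green"], [0.5, 0.3, 0.2])
# r = 0.4 falls in [0, 0.5)   -> "red"
# r = 0.7 falls in [0.5, 0.8) -> "blue"
```

Precomputing the cumulative array makes each draw O(log k) instead of O(k).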
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP
Recap: probability theory
Conditional probability: P(X | Y) = P(X, Y) / P(Y)
Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y)
Verify with the ball-drawing examples: P(red | ...), P(... | red), P(red).
Independence: X and Y are independent iff P(X, Y) = P(X) P(Y); prove that this implies P(X | Y) = P(X).
The Chain Rule
The joint probability can be expressed in terms of the conditional probability:
P(X, Y) = P(X | Y) P(Y)
More variables:
P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
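The chain rule can be checked numerically on a toy joint distribution (the table values here are made up for illustration):

```python
# Hypothetical joint distribution over (X, Y); values are illustrative only.
joint = {("rain", "wet"): 0.3, ("rain", "dry"): 0.1,
         ("sun", "wet"): 0.1, ("sun", "dry"): 0.5}

def marginal_y(y):
    """P(Y = y), by summing the joint over X."""
    return sum(p for (_, y2), p in joint.items() if y2 == y)

def cond_x_given_y(x, y):
    """P(X = x | Y = y) = P(X, Y) / P(Y)."""
    return joint[(x, y)] / marginal_y(y)

# Chain rule: P(X, Y) = P(X | Y) P(Y), for every cell in the table.
for (x, y), p in joint.items():
    assert abs(cond_x_given_y(x, y) * marginal_y(y) - p) < 1e-12
```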
Language model for text
Probability distribution over sentences: P(w_1, w_2, ..., w_m)
Chain rule: from conditional probability to joint probability.
Complexity: on the order of V^m parameters, where V is the vocabulary size and m is the maximum sentence length.
A rough estimate of how large this is: the average English sentence length is 14.3 words, and Webster's Third New International Dictionary has 475,000 main headwords.
We need independence assumptions!
Probability models
Building a probability model:
defining the model (making independence assumptions)
estimating the model's parameters
using the model (making inference)
Trigram model: defined in terms of parameters like P("is" | "today"); the parameter values are the definition of P.
Independence assumption
Even though X and Y are not actually independent, we treat them as independent.
This makes the model compact (it greatly reduces the number of parameters we must estimate).
Language model with N-gram
The chain rule:
P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
An N-gram language model assumes each word depends only on the last n-1 words (Markov assumption):
P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
Language model with N-gram
Example: trigram (3-gram)
P("Today is a sunny day")
= P("Today") P("is" | "Today") P("a" | "is", "Today") ... P("day" | "sunny", "a")
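The trigram factorization can be sketched directly in code. The probability values below are made up purely for illustration (a real model would estimate them from data), and the `<S>` padding is one common convention for handling the first two words:

```python
# Hypothetical trigram parameters; the numeric values are illustrative only.
trigram_p = {
    ("<S>", "<S>", "Today"): 0.1,
    ("<S>", "Today", "is"):  0.4,
    ("Today", "is", "a"):    0.3,
    ("is", "a", "sunny"):    0.05,
    ("a", "sunny", "day"):   0.6,
}

def trigram_sentence_prob(words):
    """P(w_1..w_n) under a trigram model, padding with <S> on the left."""
    padded = ["<S>", "<S>"] + words
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_p[tuple(padded[i - 2:i + 1])]
    return prob

p = trigram_sentence_prob("Today is a sunny day".split())
# p = 0.1 * 0.4 * 0.3 * 0.05 * 0.6 = 0.00036
```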
Unigram model
P(w_1, ..., w_n) = ∏_i P(w_i)
Bigram model
Condition on the previous word: P(w_1, ..., w_n) = ∏_i P(w_i | w_{i-1})
N-gram model
P(w_1, ..., w_n) = ∏_i P(w_i | w_{i-n+1}, ..., w_{i-1})
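A minimal sketch of generating random text from a count-based bigram model (the toy corpus reuses the "I am Sam" sentences from later in this lecture; `random.choice` over successor lists, which keep repeats, reproduces the MLE bigram distribution):

```python
import random
from collections import defaultdict

corpus = ["<S> I am Sam </S>", "<S> Sam I am </S>", "<S> I am legend </S>"]

# Collect, for each word, every word that follows it in the corpus.
# Repeats are kept, so sampling uniformly from the list matches the counts.
successors = defaultdict(list)
for sent in corpus:
    toks = sent.split()
    for w1, w2 in zip(toks, toks[1:]):
        successors[w1].append(w2)

def generate():
    """Sample a sentence from the (unsmoothed, count-based) bigram model."""
    out = ["<S>"]
    while out[-1] != "</S>":
        out.append(random.choice(successors[out[-1]]))
    return out

sent = generate()
print(" ".join(sent))
```

Every generated sentence is a walk through observed bigrams, which is why n-gram text looks locally fluent but can wander globally.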
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off. BIRON: Hide thy head.
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
~~/* * linux /kernel/ time.c * Please report this on hardware. */void irq_mark_irq(unsigned long old_entries, eval); /* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow: seq_puts(m, "\ttramp: %pS", (void *)class->contending_point]++; if (likely(t->flags & WQ_UNBOUND)) { /* * Update inode information. If the * slowpath and sleep time (abs or rel ) * @ rmtp : remaining (either due * to consume the state of ring buffer size. */ header_size - size, in bytes, of the chain. */ BUG_ON(!error); } while ( cgrp ) { if (old) { if ( kdb_continue_catastrophic ; # endif
Questions?
Maximum likelihood estimation
"Best" means "data likelihood reaches maximum": θ̂ = argmax_θ P(X | θ)
Unigram language model: p(w | θ) = ?
Estimation from a document (a paper, total #words = 100) with counts:
text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
MLE estimates:
p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, ..., p(query) = 1/100, ...
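The estimates in the table above are just counts divided by the document length, which is easy to verify in code:

```python
# Word counts from the slide's example document (total 100 words).
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
total = 100  # total number of words in the paper, per the slide

# Maximum likelihood estimate: p(w) = count(w) / total
p = {w: c / total for w, c in counts.items()}
# p["text"] == 0.1, p["mining"] == 0.05, p["query"] == 0.01
```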
Which bag of words more likely generated: "aaaDaaaKoaaa"?
(The slide shows two bags of letters: one dominated by a's with a few other letters, and one with a more uniform mix of letters such as a, b, D, E, F, K, n, o, P.)
Parameter estimation
General setting:
Given a (hypothesized & probabilistic) model that governs the random experiment.
The model gives a probability of any data that depends on the parameter θ.
Now, given actual sample data X = {x_1, ..., x_n}, what can we say about the value of θ?
Intuitively, take our best guess of θ; "best" means "best explaining/fitting the data".
Generally an optimization problem.
Maximum likelihood estimation
Data: a collection of words, w_1, w_2, ..., w_N
Model: multinomial distribution with parameters θ_i = p(w_i)
Maximum likelihood estimator: θ̂ = argmax_θ p(X | θ)
Maximum likelihood estimation
Maximize the log-likelihood ∑_i c(w_i) log θ_i subject to the requirement from probability that ∑_i θ_i = 1.
Lagrange multiplier: L = ∑_i c(w_i) log θ_i + λ(∑_i θ_i - 1)
Set partial derivatives to zero: ∂L/∂θ_i = c(w_i)/θ_i + λ = 0, so θ_i = -c(w_i)/λ.
Since ∑_i θ_i = 1, we have λ = -∑_i c(w_i) = -N.
ML estimate: θ_i = c(w_i)/N
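The result θ_i = c(w_i)/N can be sanity-checked numerically: relative frequencies score at least as high in log-likelihood as any other distribution over the same outcomes (the alternative distributions below are arbitrary choices for comparison):

```python
import math

# Counts of three outcomes; the MLE sets theta_i = c_i / N.
counts = [5, 3, 2]
N = sum(counts)
mle = [c / N for c in counts]  # [0.5, 0.3, 0.2]

def log_likelihood(theta):
    """Multinomial log-likelihood of the counts under parameters theta."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))

# Any other distribution over the same outcomes scores no higher.
for other in ([0.4, 0.35, 0.25], [0.6, 0.2, 0.2], [1/3, 1/3, 1/3]):
    assert log_likelihood(mle) >= log_likelihood(other)
```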
Maximum likelihood estimation
For N-gram language models:
P(w_i | w_{i-n+1}, ..., w_{i-1}) = count(w_{i-n+1}, ..., w_i) / count(w_{i-n+1}, ..., w_{i-1})
For the unigram case, the denominator is the length of the document (or the total number of words in a corpus).
A bigram example
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
P(I | <S>) = ?  P(am | I) = ?
P(Sam | am) = ?  P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
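The questions above can be answered by counting, as a short sketch of the MLE formula P(w2 | w1) = c(w1, w2) / c(w1):

```python
from collections import Counter

sents = [["<S>", "I", "am", "Sam", "</S>"],
         ["<S>", "I", "am", "legend", "</S>"],
         ["<S>", "Sam", "I", "am", "</S>"]]

unigram = Counter()
bigram = Counter()
for s in sents:
    unigram.update(s)
    bigram.update(zip(s, s[1:]))

def p(w2, w1):
    """MLE bigram probability P(w2 | w1) = c(w1, w2) / c(w1)."""
    return bigram[(w1, w2)] / unigram[w1]

p("I", "<S>")     # 2/3: "I" follows <S> in two of the three sentences
p("am", "I")      # 1.0: every "I" in this tiny corpus is followed by "am"
p("Sam", "am")    # 1/3
p("</S>", "Sam")  # 1/2
prob = p("I", "<S>") * p("am", "I") * p("Sam", "am") * p("</S>", "Sam")  # 1/9
```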
Practical issues
We do everything in log space:
it avoids underflow, and adding is faster than multiplying.
Toolkits:
KenLM: https://kheafield.com/code/kenlm/
SRILM: http://www.speech.sri.com/projects/srilm
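The underflow point is easy to demonstrate: multiplying a few hundred small probabilities drops below the smallest representable float, while the sum of their logs stays comfortably in range.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point...
probs = [0.01] * 200
product = 1.0
for q in probs:
    product *= q
print(product)  # 0.0: underflow (0.01**200 = 1e-400 is below float range)

# ...but summing log-probabilities stays well within range.
log_prob = sum(math.log(q) for q in probs)
print(log_prob)  # about -921.03
```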
More resources
Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
More resources
Google n-gram viewer: https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Sample rows from the dataset:
circumvallate 1978 335 91
circumvallate 1979 261 91
How about unseen words/phrases?
Example: a Shakespeare corpus consists of N = 884,647 word tokens with a vocabulary of V = 29,066 word types.
Only those ~30,000 word types occurred, so words not in the training data get 0 probability.
Likewise, only 0.04% of all possible bigrams occurred in the corpus.
Next lecture
Dealing with unseen n-grams.
Key idea: reserve some probability mass for events that don't occur in the training data.
How much probability mass should we reserve?
Recap
N-gram language models
How to generate text from a language model
How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams