Lecture 2: N-gram
Kai-Wei Chang, CS @ University of Virginia


Presentation Transcript

Lecture 2: N-gram
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS 6501: Natural Language Processing 1

This lecture
Language models: What are N-gram models?
How to use probabilities: What does P(Y|X) mean? How can I manipulate it? How can I estimate its value in practice?
CS 6501: Natural Language Processing 2

What is a language model?
Probability distributions over sentences (i.e., word sequences): P(W) = P(w_1, w_2, ..., w_k)
Can use them to generate strings: sample w_k from P(w_k | w_1, ..., w_{k-1})
Rank possible sentences: P("Today is Tuesday") > P("Tuesday Today is"); P("Today is Tuesday") > P("Today is Virginia")
CS 6501: Natural Language Processing 3

Language model applications: Context-sensitive spelling correction
CS 6501: Natural Language Processing 4

Language model applications: Autocomplete
CS 6501: Natural Language Processing 5

Language model applications Smart Reply CS 6501: Natural Language Processing 6

Language model applications: Language generation, e.g. https://pdos.csail.mit.edu/archive/scigen/
CS 6501: Natural Language Processing 7

Bag-of-Words with N-grams
N-grams: a contiguous sequence of n tokens from a given piece of text
CS 6501: Natural Language Processing 8
http://recognize-speech.com/language-model/n-gram-model/comparison
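A minimal Python sketch of the n-gram definition above; the helper name `ngrams` and the sample sentence are just for illustration and are not from the slides:

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "today is a sunny day".split()
print(ngrams(tokens, 1))  # unigrams: [('today',), ('is',), ('a',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('today', 'is'), ('is', 'a'), ...]
print(ngrams(tokens, 3))  # trigrams: [('today', 'is', 'a'), ...]
```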

N-Gram Models
Unigram model: P(w_1, ..., w_m) = Π_i P(w_i)
Bigram model: P(w_1, ..., w_m) = Π_i P(w_i | w_{i-1})
Trigram model: P(w_1, ..., w_m) = Π_i P(w_i | w_{i-2}, w_{i-1})
N-gram model: P(w_1, ..., w_m) = Π_i P(w_i | w_{i-n+1}, ..., w_{i-1})
CS 6501: Natural Language Processing 9

Random language via n-gram: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
Behind the scenes – probability theory
CS 6501: Natural Language Processing 10

Sampling with replacement
(The slide poses a series of probability questions about drawing colored balls, shown as images: e.g., P(red), P(blue), the joint P(red, blue), the conditionals P(red | blue) and P(blue | red), and the probability of a particular multiset of draws.)
CS 6501: Natural Language Processing 11

Sampling words with replacement
CS 6501: Natural Language Processing 12
Example from Julia Hockenmaier, Intro to NLP

Implementation: how to sample?
Sample from a discrete distribution: assume N outcomes in the event space.
Divide the interval [0,1] into N intervals according to the probabilities of the outcomes.
Generate a random number r between 0 and 1.
Return the outcome whose interval r falls into.
CS 6501: Natural Language Processing 13
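A minimal Python sketch of this sampling procedure; the example outcomes and probabilities are made up for illustration:

```python
import random

def sample(outcomes, probs):
    """Sample one outcome: partition [0, 1] by the probabilities and
    return the outcome whose interval a uniform random draw falls into."""
    r = random.random()          # uniform in [0, 1)
    cumulative = 0.0
    for outcome, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:
            return outcome
    return outcomes[-1]          # guard against floating-point round-off

print(sample(["red", "blue", "green"], [0.5, 0.3, 0.2]))
```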

Conditional on the previous word
CS 6501: Natural Language Processing 14
Example from Julia Hockenmaier, Intro to NLP

Conditional on the previous word
CS 6501: Natural Language Processing 15
Example from Julia Hockenmaier, Intro to NLP

Recap: probability theory
Conditional probability: P(blue | red) = P(blue, red) / P(red)
Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
Verify with the ball-drawing example: P(red | blue), P(blue | red), P(blue), P(red)
Independence: X and Y are independent if P(X, Y) = P(X) P(Y); prove that this implies P(X | Y) = P(X)
CS 6501: Natural Language Processing 16

The Chain Rule
The joint probability can be expressed in terms of the conditional probability: P(X, Y) = P(X | Y) P(Y)
More variables: P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
CS 6501: Natural Language Processing 17

Language model for text
Probability distribution over sentences: P(w_1, w_2, ..., w_m)
Chain rule: from conditional probability to joint probability
Complexity: O(|V|^m), where |V| is the vocabulary size and m is the maximum sentence length
A rough estimate: the average English sentence length is 14.3 words, and there are 475,000 main headwords in Webster's Third New International Dictionary. How large is this? We need independence assumptions!
CS 6501: Natural Language Processing 18
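As a back-of-the-envelope check (using only the numbers on the slide) of why independence assumptions are needed, the naive joint model over all sentences would require on the order of

$$475{,}000^{14.3} \;=\; 10^{\,14.3\,\log_{10} 475{,}000} \;\approx\; 10^{\,14.3 \times 5.68} \;\approx\; 10^{81}$$

parameters, which is hopelessly large to estimate or store.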

Probability models
Building a probability model involves: defining the model (making independence assumptions), estimating the model's parameters, and using the model (making inferences).
Example: a trigram model is defined in terms of parameters like P("is" | "today"); the parameter values give the definition of P.
CS 6501: Natural Language Processing 19

Independence assumption
Even though X and Y are not actually independent, we treat them as independent.
This makes the model compact (far fewer parameters to estimate).
CS 6501: Natural Language Processing 20

Language model with N-gram
The chain rule: P(w_1, ..., w_m) = Π_i P(w_i | w_1, ..., w_{i-1})
An N-gram language model assumes each word depends only on the last n-1 words (Markov assumption): P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
CS 6501: Natural Language Processing 21

Language model with N-gram
Example: trigram (3-gram)
P(w_1, ..., w_m) = Π_i P(w_i | w_{i-2}, w_{i-1}) = P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a")
CS 6501: Natural Language Processing 22

Unigram model
CS 6501: Natural Language Processing 23

Bigram model: Condition on the previous word
CS 6501: Natural Language Processing 24

N-gram model
CS 6501: Natural Language Processing 25
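To make the preceding unigram/bigram/n-gram slides concrete, here is a hedged Python sketch of generating random text from a bigram model by repeatedly sampling the next word conditioned on the previous one. The probability table below is invented purely for illustration and is not taken from the slides:

```python
import random

# Toy bigram table: P(next word | previous word). Values are made up.
bigram = {
    "<S>":   {"today": 0.6, "it": 0.4},
    "today": {"is": 1.0},
    "it":    {"is": 1.0},
    "is":    {"a": 0.5, "sunny": 0.3, "</S>": 0.2},
    "a":     {"sunny": 0.7, "nice": 0.3},
    "sunny": {"day": 0.6, "</S>": 0.4},
    "nice":  {"day": 1.0},
    "day":   {"</S>": 1.0},
}

def generate(max_len=20):
    """Sample words one at a time, conditioned on the previous word,
    until the end-of-sentence marker </S> is drawn."""
    word, sentence = "<S>", []
    for _ in range(max_len):
        choices = bigram[word]
        word = random.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</S>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(generate())   # e.g. "today is a sunny day"
```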

More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
CS 6501: Natural Language Processing 26
First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off. BIRON: Hide thy head.

More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
10-gram character-level LM:
CS 6501: Natural Language Processing 27
~~/* * linux /kernel/ time.c * Please report this on hardware. */void irq_mark_irq(unsigned long old_entries, eval); /* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow: seq_puts(m, "\ttramp: %pS", (void *)class->contending_point]++; if (likely(t->flags & WQ_UNBOUND)) { /* * Update inode information. If the * slowpath and sleep time (abs or rel ) * @ rmtp : remaining (either due * to consume the state of ring buffer size. */ header_size - size, in bytes, of the chain. */ BUG_ON(!error); } while ( cgrp ) { if (old) { if ( kdb_continue_catastrophic ; # endif

Questions?
CS 6501: Natural Language Processing 28

Maximum likelihood estimation
"Best" means "data likelihood reaches maximum": θ̂ = argmax_θ P(Document | θ)
Unigram language model: p(w | θ) = ?
Document (a paper, total #words = 100) with counts: text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
ML estimates: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, ..., p(query) = 1/100, ...
CS 6501: Natural Language Processing 29

Which bag of words more likely generated: aaaDaaaKoaaaa?
(Two bags of letters are shown as figures on the slide.)
CS 6501: Natural Language Processing 30

Parameter estimation
General setting: Given a (hypothesized & probabilistic) model that governs the random experiment. The model gives a probability of any data that depends on the parameter θ. Now, given actual sample data X = {x_1, ..., x_n}, what can we say about the value of θ?
Intuitively, take our best guess of θ -- "best" means "best explaining/fitting the data". Generally an optimization problem.
CS 6501: Natural Language Processing 31

Maximum likelihood estimation
Data: a collection of words w_1, w_2, ..., w_N
Model: multinomial distribution p(W) with parameters θ_i = p(w_i)
Maximum likelihood estimator: θ̂ = argmax_θ p(W | θ) = argmax_θ Π_i θ_i^{c(w_i)}, where c(w_i) is the count of word w_i in the data
CS 6501: Natural Language Processing 32

Maximum likelihood estimation
Maximize the log-likelihood Σ_i c(w_i) log θ_i subject to Σ_i θ_i = 1 (requirement from probability).
Lagrange multiplier: L = Σ_i c(w_i) log θ_i + λ (Σ_i θ_i − 1)
Set partial derivatives to zero: ∂L/∂θ_i = c(w_i)/θ_i + λ = 0, so θ_i = −c(w_i)/λ
Since Σ_i θ_i = 1, we have λ = −Σ_i c(w_i) = −N
ML estimate: θ̂_i = c(w_i) / N
CS 6501: Natural Language Processing 33

Maximum likelihood estimation
For N-gram language models: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}); for unigrams, P(w_i) = c(w_i) / N, where N is the length of the document or the total number of words in the corpus.
CS 6501: Natural Language Processing 34

A bi-gram example
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
P(I | <S>) = ?  P(am | I) = ?  P(Sam | am) = ?  P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
CS 6501: Natural Language Processing 35
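A minimal Python sketch (not part of the original slides) that answers the questions above by maximum likelihood counting on the three-sentence corpus:

```python
from collections import Counter

corpus = [
    "<S> I am Sam </S>",
    "<S> I am legend </S>",
    "<S> Sam I am </S>",
]

# Count unigrams and bigrams over the whole corpus.
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE bigram probability: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<S>"))     # 2/3
print(p("am", "I"))      # 3/3 = 1.0
print(p("Sam", "am"))    # 1/3
print(p("</S>", "Sam"))  # 1/2

# Probability of "<S> I am Sam </S>" under the bigram model.
tokens = "<S> I am Sam </S>".split()
prob = 1.0
for prev, word in zip(tokens, tokens[1:]):
    prob *= p(word, prev)
print(prob)              # 2/3 * 1 * 1/3 * 1/2 = 1/9 ≈ 0.111
```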

Practical Issues
We do everything in log space: it avoids underflow, and adding is faster than multiplying.
Toolkits: KenLM: https://kheafield.com/code/kenlm/  SRILM: http://www.speech.sri.com/projects/srilm
CS 6501: Natural Language Processing 36
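A small Python sketch of why log space matters: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays well-behaved (the probability values are arbitrary):

```python
import math

probs = [1e-5] * 100          # 100 word probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 -- underflows (true value is 1e-500)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)               # about -1151.3, i.e. log(1e-500)
```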

More resources
Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
CS 6501: Natural Language Processing 37

More resources
Google n-gram viewer: https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Example rows from the dataset: "circumvallate 1978 335 91", "circumvallate 1979 261 91"
CS 6501: Natural Language Processing 38

(Slides 39–42: figures only.)

How about unseen words/phrases?
Example: the Shakespeare corpus consists of N = 884,647 word tokens and a vocabulary of V = 29,066 word types.
Only 30,000 word types occurred, so words not in the training data get 0 probability.
Only 0.04% of all possible bigrams occurred.
CS 6501: Natural Language Processing 43
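Assuming "all possible bigrams" means ordered pairs over the V = 29,066 word types, a quick check of the 0.04% figure:

$$V^2 = 29{,}066^2 \approx 8.4 \times 10^8 \text{ possible bigrams}, \qquad 0.0004 \times 8.4 \times 10^8 \approx 3.4 \times 10^5 \text{ bigram types actually observed.}$$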

Next Lecture
Dealing with unseen n-grams
Key idea: reserve some probability mass for events that don't occur in the training data
How much probability mass should we reserve?
CS 6501: Natural Language Processing 44

Recap
N-gram language models
How to generate text from a language model
How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams
CS 6501: Natural Language Processing 45