Hidden Markov Models: Teaching Demo
The University of Arizona
Tatjana Scheffler
tatjana.scheffler@uni-potsdam.de
Warm-Up: Parts of Speech
Part-of-speech tagging = grouping words into morphosyntactic types like noun, verb, etc.:
She went for a walk
Warm-Up: Parts of Speech
Grouping words into morphosyntactic types like noun, verb, etc.:
PRON VERB ADP DET NOUN
She  went for a   walk
POS tags (Universal Dependencies)
Open class words: ADJ, ADV, INTJ, NOUN, PROPN, VERB
Closed class words: ADP, AUX, CCONJ, DET, NUM, PART, PRON, SCONJ
Other: PUNCT, SYM, X
https://universaldependencies.org/u/pos/
Let’s try!
Look at your word(s) and find an appropriate (or likely) part-of-speech tag. Write it on the paper.
Now find the other words from your sentence (sentences are numbered and color-coded in the upper left corner). Re-tag the sentence.
Did anyone in your group have to change POS tags? How did you know?
Hidden Markov Models
A generative language model. Compare, e.g., an n-gram model: P(wn | w1, …, wn-1).
Now, a two-step process:
Generate a sequence of hidden states (POS tags) t1, …, tT from a bigram model P(ti | ti-1).
Independently, generate an observable word wi from each state ti, from an emission model P(wi | ti).
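The two-step generative story can be sketched in a few lines of Python. The tiny tag and word distributions below are made-up illustrations for the warm-up sentence, not distributions from the slides:

```python
import random

# Hypothetical bigram tag model P(t_i | t_{i-1}); "<s>"/"</s>" mark start/end.
TRANS = {
    "<s>":  {"PRON": 0.6, "DET": 0.4},
    "PRON": {"VERB": 1.0},
    "VERB": {"ADP": 0.5, "</s>": 0.5},
    "ADP":  {"DET": 1.0},
    "DET":  {"NOUN": 1.0},
    "NOUN": {"</s>": 1.0},
}
# Hypothetical emission model P(w_i | t_i).
EMIT = {
    "PRON": {"she": 0.7, "he": 0.3},
    "VERB": {"went": 0.6, "walked": 0.4},
    "ADP":  {"for": 1.0},
    "DET":  {"a": 0.8, "the": 0.2},
    "NOUN": {"walk": 1.0},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate():
    """Step 1: sample a tag sequence; step 2: emit one word per tag."""
    tags, t = [], "<s>"
    while True:
        t = sample(TRANS[t])
        if t == "</s>":
            break
        tags.append(t)
    words = [sample(EMIT[t]) for t in tags]
    return tags, words
```

Each call to `generate()` first walks the tag bigram model, then emits one word per tag, mirroring the two independent steps above.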
Question 1: Language Modelling
Given an HMM and a string w1, …, wT, what is its likelihood P(w1, …, wT)?
Compute it efficiently with the Forward algorithm.
Question 2: POS Tagging
Given an HMM and a string w1, …, wT, what is the most likely sequence of hidden tags t1, …, tT that generated it?
Compute it efficiently with the Viterbi algorithm.
Question 3: Training (not today)
Train HMM parameters from a set of POS tags and training data:
From annotated training data: maximum likelihood training with smoothing
From unannotated training data: the Forward-Backward algorithm (an instance of EM)
Hidden Markov Model (formally)
A Hidden Markov Model is a 5-tuple consisting of:
a finite set of states Q = {q1, …, qN} (= POS tags)
a finite set of possible observations O (= words)
initial probabilities a0i = P(X1 = qi)
transition probabilities aij = P(Xt+1 = qj | Xt = qi)
emission probabilities bi(o) = P(Yt = o | Xt = qi)
The HMM describes two coupled random processes:
Xt = qi: at time t, the HMM is in state qi
Yt = o: at time t, the HMM emits observation o
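The 5-tuple can be written down directly as data. The numbers below are illustrative values in the style of the ice-cream example that follows; they are assumptions for the sketch, not necessarily the values on the slides:

```python
# A 5-tuple HMM: states Q, observations O, initial a0, transitions a, emissions b.
# Probabilities are illustrative, loosely following the ice-cream example.
Q = ["H", "C"]                       # hidden states: Hot / Cold day
O = [1, 2, 3]                        # observations: ice creams eaten
a0 = {"H": 0.8, "C": 0.2}            # a_0i = P(X1 = qi)
a = {"H": {"H": 0.6, "C": 0.4},      # a_ij = P(X_{t+1} = qj | Xt = qi)
     "C": {"H": 0.5, "C": 0.5}}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4},  # b_i(o) = P(Yt = o | Xt = qi)
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

# Sanity check: every distribution sums to 1.
assert abs(sum(a0.values()) - 1) < 1e-9
for q in Q:
    assert abs(sum(a[q].values()) - 1) < 1e-9
    assert abs(sum(b[q].values()) - 1) < 1e-9
```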
Example: Eisner’s ice cream diary
States: the weather in Baltimore on a given day
Observations: how many ice creams Eisner ate that day
(Diagram labels: initial probability a0H, transition probability aCH, emission probability bC(3))
HMM models x, y jointly
The coupled random processes of the HMM give us the joint probability P(x, y), where x = x1, …, xT is the sequence of hidden states and y = y1, …, yT is the sequence of observations.
Question 1: Likelihood P(y)
How likely is it that Eisner ate 3 ice creams on day 1, 1 ice cream on day 2, and 3 ice creams on day 3?
We want to compute P(3,1,3).
The definitions let us compute joint probabilities like P(H,3,C,1,H,3) easily.
But different state sequences could lead to (3,1,3), so we must sum over all of them.
Naïve approach
Sum over all possible state sequences x to compute P(3,1,3):
P(3,1,3) = Σx P(x, 3,1,3)
Technical term: marginalization
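The marginalization above can be done literally by enumerating every state sequence. The ice-cream parameters here are illustrative assumptions (repeated so the snippet is self-contained), not necessarily the slides’ own values:

```python
from itertools import product

# Illustrative ice-cream HMM parameters (assumed, not from the slides).
a0 = {"H": 0.8, "C": 0.2}
a = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def naive_likelihood(y):
    """P(y) by marginalizing over all |Q|^T state sequences (exponential!)."""
    total = 0.0
    for x in product("HC", repeat=len(y)):
        p = a0[x[0]] * b[x[0]][y[0]]              # P(x1) * P(y1 | x1)
        for t in range(1, len(y)):
            p *= a[x[t-1]][x[t]] * b[x[t]][y[t]]  # transition * emission
        total += p                                 # sum joint P(x, y) over x
    return total
```

With these assumed parameters, `naive_likelihood((3, 1, 3))` sums the eight weather paths H/C × H/C × H/C.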
Naïve approach is too expensive
The naïve approach sums over an exponential number of terms. This is too slow for practical use.
Visualize the paths through the hidden states in a trellis (an unfolding of the HMM):
one column for each time point t, representing Xt
each column contains a copy of all the states of the HMM
edges run from states at time t to states at time t+1 (the transitions of the HMM)
Each path through the trellis is one state sequence, so P(y) is the sum over all paths that emit y.
Ice cream trellis (figure)
Sentence likelihood (figure)
The Forward algorithm
In the naïve solution, we compute many intermediate results several times.
Central idea: define the forward probability αt(j) that the HMM outputs y1, …, yt and ends in state qj:
αt(j) = P(y1, …, yt, Xt = qj)
So: P(y) = Σj αT(j)
The Forward algorithm
Base case, t = 1: α1(j) = a0j · bj(y1)
Inductive step, for t = 2 … T: αt(j) = Σi αt-1(i) · aij · bj(yt)
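The base case and inductive step translate directly into code. As before, the ice-cream parameters are assumed illustrative values, repeated so the snippet is self-contained:

```python
# Illustrative ice-cream HMM parameters (assumed, not from the slides).
a0 = {"H": 0.8, "C": 0.2}
a = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(y):
    """P(y) via the forward probabilities alpha_t(j)."""
    # Base case, t = 1: alpha_1(j) = a_0j * b_j(y_1)
    alpha = {j: a0[j] * b[j][y[0]] for j in a0}
    # Inductive step: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(y_t)
    for t in range(1, len(y)):
        alpha = {j: sum(alpha[i] * a[i][j] for i in alpha) * b[j][y[t]]
                 for j in a0}
    return sum(alpha.values())  # P(y) = sum_j alpha_T(j)
```

Unlike the naïve enumeration, each αt(j) is computed exactly once and reused for all continuations of the path.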
P(3,1,3) with Forward (figure)
Question 1: Likelihood P(y)
How likely is it that Eisner ate 3 ice creams on day 1, 1 ice cream on day 2, and 3 ice creams on day 3?
Use the Forward algorithm to sum over the different paths (= weather patterns) efficiently.
Question 2: Tagging
Given observations y1, …, yT, what is the most likely sequence of hidden states x1, …, xT?
We are only interested in the most likely sequence of states (not actually in its probability).
Naïve solution
Enumerate all possible state sequences x and pick the one that maximizes the joint probability P(x, y). As with likelihood, this requires an exponential number of terms.
Parallelism
Likelihood (Question 1): compute P(y) with the Forward algorithm
Tagging (Question 2): compute argmax_x P(x, y) with the Viterbi algorithm
The Viterbi algorithm
Base case, t = 1: V1(j) = a0j · bj(y1)
Inductive step, for t = 2 … T: Vt(j) = maxi Vt-1(i) · aij · bj(yt)
For each state and time step (j, t), remember the i for which the maximum was achieved as a backpointer bpt(j).
Retrieve the optimal tag sequence by following the backpointers from T back to 1.
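Viterbi is the Forward algorithm with the sum replaced by a max, plus backpointer bookkeeping. The ice-cream parameters are again assumed illustrative values, repeated so the snippet is self-contained:

```python
# Illustrative ice-cream HMM parameters (assumed, not from the slides).
a0 = {"H": 0.8, "C": 0.2}
a = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(y):
    """Most likely state sequence for y, plus its joint probability."""
    # Base case, t = 1: V_1(j) = a_0j * b_j(y_1)
    V = {j: a0[j] * b[j][y[0]] for j in a0}
    backpointers = []
    # Inductive step: V_t(j) = max_i V_{t-1}(i) * a_ij * b_j(y_t)
    for t in range(1, len(y)):
        V_new, bp = {}, {}
        for j in a0:
            i_best = max(V, key=lambda i: V[i] * a[i][j])
            bp[j] = i_best                      # remember argmax predecessor
            V_new[j] = V[i_best] * a[i_best][j] * b[j][y[t]]
        V = V_new
        backpointers.append(bp)
    # Follow the backpointers from T back to 1.
    best = max(V, key=V.get)
    prob = V[best]
    path = [best]
    for bp in reversed(backpointers):
        best = bp[best]
        path.append(best)
    path.reverse()
    return path, prob
```

For the observation sequence (3, 1, 3) under these assumed parameters, the decoder returns the single highest-probability weather path rather than a sum over all paths.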
P(x,3,1,3) with Viterbi (figure)
Runtime
Forward and Viterbi have the same runtime, dominated by the inductive step:
Compute αt(j) (resp. Vt(j)) N · T times.
Each computation iterates over N predecessor states i.
Total runtime is O(N²T): linear in the sentence length, quadratic in the number of states (tags).
Summary
Hidden Markov Models are a popular model for POS tagging and other tasks (e.g., dialog act tagging; see the research talk tomorrow!).
An HMM consists of two coupled random processes:
a bigram model over the hidden states
a model generating observable outputs from the states
Efficient algorithms exist for two common problems:
the Forward algorithm for likelihood computation
the Viterbi algorithm for tagging (= finding the best state sequence)
Eisner’s ice cream HMM (figure)