Slide 1

Syllables and Concepts in Large Vocabulary Continuous Speech Recognition

Paul De Palma
Ph.D. Candidate, Department of Linguistics, University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma

Slide 2

An Engineered Artifact
Syllables
- Principled word segmentation scheme
- No claim about human syllabification

Concepts
- Words and phrases with similar meanings
- No claim about cognition
Slide 3

Reducing the Search Space

ASR answers the question: what is the most likely sequence of words given an acoustic signal? The decoder considers many candidate word sequences.

To reduce the search space, reduce the number of candidates:
- Using syllables in the language model
- Using concepts in a concept component
Slide 4

Syllables in the LM: Why?

[Figure: cumulative frequency as a function of frequency rank, Switchboard corpus (Greenberg, 1999, p. 167)]
Slide 5

Most Frequent Words are Monosyllabic

Syllables    % of Corpus    % of Corpus
per Word      by Token       by Type
    1           81.04          22.39
    2           14.30          39.76
    3            3.50          24.26
    4            0.96           9.91
    5            0.18           3.21
    6            0.02           0.40

(Greenberg, 1999, p. 167)

- Polysyllabic words are easier to recognize (Hamalainen et al., 2007)
- And (of course) there are fewer syllables than words
Slide 6

Reduce the Search Space 2: The Concept Component

Word Map: GO              Syllable Map: GO
A flight                  ax f_l_ay_td
A ticket                  ax t_ih k_ax_td
book airline travel       b_uh_kd eh_r l_ay_n t_r_ae v_ax_l
book reservations         b_uh_kd r_eh s_axr v_ey sh_ax_n_z
Create a reservation      k_r_iy ey_td ax r_eh z_axr v_ey sh_ax_n
Departing                 d_ax p_aa_r dx_ix_ng
Fly                       f_l_ay
Flying                    f_l_ay ix_ng
get                       g_eh_td
I am leaving              ay ae_m l_iy v_ix_ng
Slide 7

The (Simplified) Architecture of an LVCSR System

Feature Extractor
- Transforms an acoustic signal into a collection of 39-dimensional feature vectors
- The province of digital signal processing

Acoustic Model
- Collection of probabilities of acoustic observations given word sequences

Language Model
- Collection of probabilities of word sequences

Decoder
- Guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models
Slide 8

Simplified Schematic

signal → Feature Extractor → Decoder [Acoustic Model, Language Model] → Words
Slide 9

Enhanced Recognizer

signal → Feature Extractor → Decoder [Acoustic Model P(O|S), Syllable Language Model P(S)] → Syllables → Concept Component → Syllables, Concepts

Assumed: feature extractor, acoustic model, decoder
My work: syllable language model, concept component
Slide 10

How ASR Works

Input is a sequence of acoustic observations:
  O = o1, o2, ..., ot

Output is a string of words:
  W = w1, w2, ..., wn

Then: "The hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations."

  W^ = argmax_W p(W|O)    (1)
Slide 11

Operationalizing Equation 1

  W^ = argmax_W p(W|O)                 (2)

Using Bayes' Rule:

  W^ = argmax_W p(O|W) p(W) / p(O)     (3)

Since the acoustic signal is the same for each candidate, (3) can be rewritten:

  W^ = argmax_W p(O|W) p(W)            (4)

Here p(O|W) is the acoustic model (the likelihood of O given W), p(W) is the language model (the prior probability), and the decoder searches for the maximizing W.
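The argmax in (4) can be sketched with toy numbers. Everything below is invented for illustration (a real decoder searches an enormous hypothesis space, not a two-entry table): p_ac stands in for the acoustic likelihood p(O|W), and p_lm for the language-model prior p(W).

```python
import math

# Hypothetical candidates with made-up scores: (p_ac ~ p(O|W), p_lm ~ p(W)).
candidates = {
    "a flight to seattle": (1e-9, 1e-4),
    "a fright to seattle": (2e-9, 1e-7),  # acoustically close, unlikely as English
}

def decode(candidates):
    """Return the word string maximizing p(O|W) * p(W), as in (4).
    Summing log probabilities avoids floating-point underflow."""
    return max(candidates,
               key=lambda w: sum(math.log(p) for p in candidates[w]))

print(decode(candidates))  # -> a flight to seattle
```

The second candidate has the better acoustic score but is penalized by the language model: exactly the division of labor the architecture slide describes.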
Slide 12

The LM: A Closer Look

A collection of probabilities of word sequences:

  p(W) = p(w1 ... wn)    (5)

which can be decomposed by the probability chain rule:

  p(w1 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-1)    (6)
Slide 13

The Markov Assumption

Approximate the full decomposition of (6) by looking only a fixed number of words into the past:
- Bigram: 1 word into the past
- Trigram: 2 words into the past
- ...
- N-gram: n-1 words into the past
Slide 14

Bigram Language Model

Def. bigram probability:

  p(wn | wn-1) = count(wn-1 wn) / count(wn-1)    (7)

Mini-corpus:
  <s> paul wrote his thesis </s>
  <s> james wrote a different thesis </s>
  <s> paul wrote a thesis suggested by george </s>
  <s> the thesis </s>
  <s> jane wrote the poem </s>

For example, p(paul|<s>) = count(<s> paul)/count(<s>) = 2/5

P(paul wrote a thesis) = p(paul|<s>) * p(wrote|paul) * p(a|wrote) * p(thesis|a) * p(</s>|thesis) = .075
P(paul wrote the thesis) = p(paul|<s>) * p(wrote|paul) * p(the|wrote) * p(thesis|the) * p(</s>|thesis) = .0375
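The hand calculation above can be checked mechanically. The sketch below is a minimal bigram model over the mini-corpus, not the toolkit used in the experiments; it recomputes the counts of equation (7) and multiplies out the sentence probabilities.

```python
from collections import Counter

# The mini-corpus from the slide, with sentence-boundary markers.
corpus = [
    "<s> paul wrote his thesis </s>",
    "<s> james wrote a different thesis </s>",
    "<s> paul wrote a thesis suggested by george </s>",
    "<s> the thesis </s>",
    "<s> jane wrote the poem </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p(w, prev):
    """Bigram probability per equation (7): count(prev w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(words):
    """Product of bigram probabilities over the padded word string."""
    toks = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, w in zip(toks, toks[1:]):
        prob *= p(w, prev)
    return prob

print(round(p_sentence("paul wrote a thesis".split()), 4))   # 0.075
print(round(p_sentence("paul wrote the thesis".split()), 4)) # 0.0375
```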
Slide 15

Experiment 1: Perplexity

Perplexity, PP(X), is functionally related to entropy, H(X). Entropy is a measure of information.

Hypothesis: PP(syllable LM) < PP(word LM), i.e., the syllable LM contains more information.
Slide 16

Definitions

Let X be a random variable and p(x) its probability function.

Defs:
  H(X) = -Σ_{x∈X} p(x) lg p(x)    (1)
  PP(X) = 2^H(X)                  (2)

Given certain assumptions(1) and the definition of H(X), PP(X) can be transformed to:

  PP(X) = p(w1 ... wn)^(-1/n)

Perplexity is the inverse nth root of the probability of a word sequence.

(1) X is an ergodic and stationary process, and n is arbitrarily large.
Slide 17

Entropy as Information

Suppose the letters of a Polynesian alphabet are distributed as follows(1):

  p     t     k     a     i     u
  1/8   1/4   1/8   1/4   1/8   1/8

Calculate the per-letter entropy:

  H(P) = -Σ_{i∈{p,t,k,a,i,u}} p(i) lg p(i) = 2.5 bits

On average, 2.5 bits are required to encode a letter (p: 100, t: 00, etc.).

(1) Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
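The arithmetic can be verified in a few lines of Python, using the distribution from the slide:

```python
from math import log2

# Letter probabilities from the slide (Manning & Schutze's Simplified Polynesian).
dist = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

# H(P) = -sum p(i) * lg p(i)
entropy = -sum(p * log2(p) for p in dist.values())
print(entropy)       # 2.5 bits per letter
print(2 ** entropy)  # perplexity: 2^2.5, about 5.66
```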
Slide 18

Reducing the Entropy

Suppose:
- This language consists entirely of CV syllables
- We know their distribution

Then we can compute the entropy of syllables in the language, H(C,V) = H(C) + H(V|C), where V ∈ {a,i,u} and C ∈ {p,t,k}:

  H(C,V) = 2.44 bits

Entropy for two letters under the letter model: 5 bits.

Conclusion: the syllable model contains more information than the letter model.
Slide 19

Perplexity as Weighted Average Branching Factor

Suppose the letters of the alphabet occur with equal frequency. Then at every fork we have 26 choices.
Slide 20

Reducing the Branching Factor

Suppose 'E' occurs 75 times more frequently than any other letter, and let p(any other letter) = x. Then:

  75x + 25x = 1, since there are 25 such letters
  x = .01, so p(E) = .75 and each other letter has probability .01

There are still 26 choices at each fork, but 'E' is 75 times more likely than any other choice. Perplexity is reduced; the model contains more information.
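The reduction can be made concrete. This sketch computes the entropy and perplexity of the skewed alphabet assumed above (p(E) = .75, each other letter .01) and compares it with the uniform model's branching factor of 26:

```python
from math import log2

p_e, p_other = 0.75, 0.01  # 75x + 25x = 1  =>  x = .01

# Entropy of one letter: E contributes one term, the other 25 letters are identical.
entropy = -(p_e * log2(p_e) + 25 * p_other * log2(p_other))
perplexity = 2 ** entropy
print(perplexity)  # about 3.9 -- far fewer effective choices than 26
```

The weighted average branching factor drops from 26 to roughly 4 even though there are still 26 letters at every fork.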
Slide 21

Perplexity Experiment

Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful.

Technique (for both syllable and word corpora):
1. Randomly choose 10% of the utterances from a corpus as a test set
2. Generate a language model from the remaining 90%
3. Compute the perplexity of the test set given the language model
4. Compute the mean over twenty runs of steps 1-3
Slide 22

The Corpora

Air Travel Information System (Hemphill et al., 2009)
- Word types: 1,604      Word tokens: 219,009
- Syllable types: 1,314  Syllable tokens: 317,578

Transcript of simulated human-computer speech (NextIt, 2008)
- Word types: 482        Word tokens: 5,782
- Syllable types: 537 (this will have repercussions in Exp. 2)
- Syllable tokens: 8,587
Slide 23

Results

                         Bigrams   Trigrams
Mean Words, NextIt        39.44     31.44
Mean Syllables, NextIt    35.96     22.26
Mean Words, ATIS          41.35     31.43
Mean Syllables, ATIS      21.15     14.74

Notice the drop in perplexity from words to syllables. A perplexity of 14.74 for the trigram syllable ATIS model means fewer than half as many choices at every turn as for the trigram word ATIS model.
Slide 24

Experiment 2: Syllables in the Language Model

Hypothesis: a syllable language model will perform better than a word-based language model.

By what measure?
Slide 25

Symbol Error Rate

  SER = 100 * (I + S + D) / T

where:
- I is the number of insertions
- S is the number of substitutions
- D is the number of deletions
- T is the total number of symbols in the reference

Example: SER = 100 * (2 + 1 + 1) / 5 = 80%

(1) Alignment performed by a dynamic programming algorithm: minimum edit distance.
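A minimal version of the alignment and SER computation is sketched below, assuming symbols arrive as lists of strings. For simplicity it returns the total edit count I + S + D without separating the three types, which is all the SER formula needs; it is not the scoring tool used in the experiments.

```python
def edit_ops(ref, hyp):
    """Total substitutions + insertions + deletions aligning hyp to ref,
    by dynamic programming (minimum edit distance)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i            # delete all of ref[:i]
    for j in range(m + 1):
        d[0][j] = j            # insert all of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[n][m]

def ser(ref, hyp):
    """Symbol error rate: 100 * (I + S + D) / T, with T = symbols in the reference."""
    return 100 * edit_ops(ref, hyp) / len(ref)

print(ser(list("abcde"), list("xyaqde")))  # 80.0: 4 edits against 5 reference symbols
```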
Slide 26

Technique

1. Phonetically transcribe corpus and reference files
2. Syllabify corpus and reference files
3. Build language models
4. Run a recognizer on 18 short human-computer telephone monologues
5. Compute mean, median, and standard deviation of SER for 1-gram, 2-gram, 3-gram, and 4-gram models over all monologues
Slide 27

Results

Syllables Compared to Words (Mean SER):

       Words   Syllables
N=2    46.4      41.0
N=3    46.8      39.4
N=4    46.7      39.0

Syllables Normed by Words (means over N=2,3,4):

  Substitution   Insertion   Deletion   SER
     73.14%      103.00%      81.74%    85.3%
Slide 28

Experiment 3: A Concept Component

Hypothesis: a recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not so equipped.
Slide 29

Technique

1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2
Slide 30

Results

Concepts Compared to Syllables (Mean SER):

       Syllables   Concepts
N=2      41.0        41.3
N=3      39.4        40.6
N=4      39.0        40.3

Concepts Normed by Syllables (means over N=2,3,4):

  Substitution   Insertion   Deletion   SER
     1.06          0.95        1.09     1.02

A 2% decline overall. Why?
Slide 31

Mapping Was Intended to Produce an Upper Bound on SER

For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match. If there is a match, substitute the concept for the syllable string:

  ay w_uh_dd l_ay_kd → WANT

Misrecognition of even a single syllable means no match, so no concept is inserted.
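The substitution step can be sketched as below. Only the "I would like" entry comes from the slide; the other mapping entries are hypothetical stand-ins, since the real equivalence classes were built from the training transcript. Longest strings are tried first so that a long phrase wins over a shorter overlapping one.

```python
# Hypothetical equivalence classes mapping syllable strings to concept tokens.
CONCEPTS = {
    "ay w_uh_dd l_ay_kd": "WANT",   # "I would like" (from the slide)
    "ay w_aa_n_td": "WANT",         # "I want" (assumed transcription)
    "d_ax p_aa_r dx_ix_ng": "GO",   # "departing" (from the Slide 6 map)
}

def apply_concepts(hyp: str) -> str:
    """Replace any matching syllable string with its concept, longest first."""
    for syllables in sorted(CONCEPTS, key=len, reverse=True):
        hyp = hyp.replace(syllables, CONCEPTS[syllables])
    return hyp

print(apply_concepts("ay w_uh_dd l_ay_kd ax t_ih k_ax_td"))
# -> WANT ax t_ih k_ax_td
```

A production version should match on whole-token boundaries rather than raw substrings, so that a syllable string embedded inside a longer one is never rewritten.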
Slide 32

Misalignment Between Training and Reference Files

Equivalence classes were constructed using only the LM training transcript.

More frequent in reference files:
- 1st person singular (I want)
- Imperatives (List all flights)

Less frequent in reference files:
- 1st person plural (My husband and me want)
- Polite forms (I would like)

BE does not appear (There should be, There's going to be, etc.)
Slide 33

Summary

1. Perplexity: a syllable language model contains more information than a word language model (and probably will perform better)
2. A syllable language model results in a 14.7% mean improvement in SER
3. The very slight increase in mean SER for a concept language model justifies further research
Slide 34

Further Research

- Test the given system over a large production corpus
- Develop a probabilistic concept language model
- Develop the software necessary to pass the output of the concept language model on to an expert system
Slide 35

The (Almost, Almost) Last Word

"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one under any known interpretation of the term."

Chomsky, cited in Jurafsky and Martin (2009) from a 1969 essay on Quine.
Slide 36

The (Almost) Last Word

He just never thought to count.
Slide 37

The Last Word

Thanks to my generous committee:
- Bill Croft, Department of Linguistics
- George Luger, Department of Computer Science
- Caroline Smith, Department of Linguistics
- Chuck Wooters, U.S. Department of Defense
Slide 38

References

Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.

Greenberg, S. (1999). Speaking in shorthand—A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.

Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.

Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09 from: http://www.ldc.upenn.edu/Catalog/readme_files/atis/sspcrd/corpus.html

Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Kahn, D. (1976). Syllable-Based Generalizations in English Phonology. Doctoral dissertation, MIT.

Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.

NextIt. (2008). Retrieved 4/5/08 from: http://www.nextit.com.

NIST. (2007). Syllabification software. National Institute of Standards: NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07 from: http://www.nist.gov/speech/tools/.
Slide 39

Additional Slides
Slide 40

Transcription of a Recording

REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: (107.381,113.593) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN
Slide 41

Transcribed and Segmented(1)

REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: (107.381,113.593) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

(1) Produced by University of Colorado transcription software (to a version of ARPAbet), the National Institute of Standards syllabifier (NIST, 2007), and my own Python classes that coordinate the two.
Slide 42

With Inserted Equivalence Classes(1)

REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD
REF: (107.381,113.593) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

(1) An equivalence class is a subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members "I need," "I would like," and so on.
Slide 43

Including Flat Language Model: Word Perplexity

                      Flat LM   N=1      N=2     N=3     N=4
Mean, Appendix C        482     112.56   39.44   31.44   30.42
Median, Appendix C      482     110.82   38.74   28.96   29.31
Std Dev, Appendix C     0.0      13.61    7.21    5.82    4.05
Mean, ATIS             1604     191.84   41.35   31.51   31.43
Median, ATIS           1604     191.74   38.93   28.2    30.19
Std Dev, ATIS           0.0        .57    8.64    7.6     4.32
Slide 44

Including Flat LM: Syllable Perplexity

                      Flat LM   N=1      N=2     N=3     N=4
Mean, Appendix C        537     177.99   35.96   22.26   21.04
Median, Appendix C      537     177.44   35.81   22.71   20.75
Std Dev, Appendix C     0.0       1.77    8.49    3.73    3.53
Mean, ATIS             1314     231.13   22.42   14.91   14.11
Median, ATIS           1314     231.26   21.15   14.74   13.2
Std Dev, ATIS           0.0       .097    4.36    1.83    1.79
Slide 45

Words and Syllables Normed by Flat LM

Syllable data normed by flat LM:

                    N=1      N=2     N=3     N=4
Mean, Appendix C   33.14%   6.69%   4.15%   3.92%
Mean, ATIS         17.58%   1.71%   1.13%   1.07%

Word data normed by flat LM:

                    N=1      N=2     N=3     N=4
Mean, Appendix C   23.35%   8.18%   6.52%   6.31%
Mean, ATIS         11.96%   2.58%   1.96%   1.95%
Slide 46

Syllabifiers

- Syllabifier from the National Institute of Standards and Technology (NIST, 2007)
- Based on Daniel Kahn's 1976 dissertation from MIT (Kahn, 1976)
- Generative in nature and English-biased
Slide 47

Syllables

- Estimates of the number of English syllables range from 1,000 to 30,000
- This suggests some difficulty in pinning down what a syllable is
- Usual hierarchical approach:

  syllable
  ├── onset (C)
  └── rhyme
      ├── nucleus (V)
      └── coda (C)
Slide 48

Sonority

- Sonority rises to the nucleus and falls to the coda
- Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents
- Useful but not absolute: e.g., both "depth" and "spit" seem to violate the sonority hierarchy