Slide 1
Automatic Speech Recognition: An Overview
Julia Hirschberg
CS 4706
(Thanks to Roberto Pieraccini and Francis Ganong for some slides)
Slide 2
Recreating the Speech Chain
[Diagram: the production chain DIALOG → SEMANTICS → SYNTAX → LEXICON → MORPHOLOGY → PHONETICS → VOCAL-TRACT ARTICULATORS, mirrored by the perception chain INNER EAR → ACOUSTIC NERVE → SPEECH RECOGNITION → SPOKEN LANGUAGE UNDERSTANDING → DIALOG MANAGEMENT, with SPEECH SYNTHESIS closing the loop]
Slide 3
The Problem of Segmentation... or...
Why Speech Recognition is so Difficult
[Figure: the unsegmented phone stream /m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r/ is segmented into the words MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR, parsed into phrases (NP, NP, VP), and interpreted as (user:Roberto (attribute:telephone-num value:7360474))]
Slide 4
The Illusion of Segmentation... or...
Why Speech Recognition is so Difficult
[Same figure as the previous slide, annotated with the sources of difficulty below]
Intra-speaker variability
Noise/reverberation
Coarticulation
Context-dependency
Word confusability
Word variations
Speaker Dependency
Multiple Interpretations
Limited vocabulary
Ellipses and Anaphors
Slide 5
Speech Recognition: the Early Years
1952 – Automatic Digit Recognition (AUDREY)
Davis, Biddulph, Balashek (Bell Laboratories)
Slide 6
1960’s – Speech Processing and Digital Computers
AD/DA converters and digital computers start appearing in the labs
James Flanagan
Bell Laboratories
Slide 7
1969 – Whither Speech Recognition?
General purpose speech recognition seems far away. Special-purpose speech recognition is severely limited.
It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish…
It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition.
We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon.
One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour…Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers.
The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).
The Journal of the Acoustical Society of America, June 1969
J. R. Pierce
Executive Director,
Bell Laboratories
Slide 8
1971-1976: The ARPA SUR project
Despite the anti-speech recognition campaign led by the Pierce Commission, ARPA launches a 5-year Spoken Understanding Research program
Goal: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
4 systems built by the end of the program:
SDC (24%)
BBN’s HWIM (44%)
CMU’s Hearsay II (74%)
CMU’s HARPY (95% -- but 80 times real time!)
Rule-based systems except for Harpy
Engineering approach: search network of all the possible utterances
Raj Reddy -- CMU
LESSON LEARNED:
Hand-built knowledge does not scale up
Need for a global “optimization” criterion
Slide 9
Lack of clear evaluation criteria
ARPA felt systems had failed
Project not extended
Speech Understanding: too early for its time
Need a standard evaluation method so that progress within and across systems could be measured
Slide 10
1970’s – Dynamic Time Warping
The Brute Force of the Engineering Approach
[Figure: alignment grid warping an UNKNOWN WORD against a stored TEMPLATE (WORD 7)]
T. K. Vintsyuk (1968)
H. Sakoe, S. Chiba (1970)
Isolated Words
Speaker Dependent
Connected Words
Speaker Independent
Sub-Word Units
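The template-matching idea above can be sketched in a few lines. This is a minimal illustration, not a historical implementation: the 1-D "features" and the template/unknown sequences are invented, where a real system would compare spectral feature vectors frame by frame.

```python
# Minimal dynamic time warping (DTW) sketch: align an "unknown word"
# feature sequence to a stored template, as in 1970s isolated-word ASR.
# The toy 1-D features and the sequences below are invented for illustration.

def dtw_distance(template, unknown):
    """Return the cumulative alignment cost between two feature sequences."""
    n, m = len(template), len(unknown)
    INF = float("inf")
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - unknown[j - 1])  # local distance
            cost[i][j] = d + min(cost[i - 1][j],       # stretch template
                                 cost[i][j - 1],       # stretch unknown
                                 cost[i - 1][j - 1])   # match step
    return cost[n][m]

# Recognize by picking the template with the lowest warped distance.
templates = {"seven": [1, 3, 5, 3, 1], "four": [2, 2, 6, 2]}
unknown = [1, 1, 3, 5, 5, 3, 1]          # "seven" spoken more slowly
best = min(templates, key=lambda w: dtw_distance(templates[w], unknown))
```

The warping lets the same word match even when spoken at a different speed, which is exactly what made DTW workable for isolated-word, speaker-dependent recognition.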
Slide 11
1980s -- The Statistical Approach
Fred Jelinek
[Diagram: 3-state left-to-right HMM with states S1, S2, S3; self-loop probabilities a11, a22, a33 and forward transitions a12, a23]
Acoustic HMMs
Word Tri-grams
No Data Like More Data
“Whenever I fire a linguist, our system performance improves” (1988)
“Some of my best friends are linguists” (2004)
Jim Baker
Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T. J. Watson Research
Slide 12
1980-1990 – Statistical approach becomes ubiquitous
Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
Slide 13
1980s-1990s – The Power of Evaluation
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
[Diagram: timeline 1995-2004 of the emerging spoken dialog industry: technology vendors (SpeechWorks, Nuance; roots in MIT and SRI), platform integrators, application developers, hosting, tools, and standards]
Slide 14
NUANCE Today
Slide 15
Large Vocabulary Continuous Speech Recognition (LVCSR) Today
~20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)
Conversational speech, dialogue (vs. dictation)
Many languages (vs. English, Japanese, Swedish, French)
Speech and Language Processing, Jurafsky and Martin (3/21/2012)
Slide 16
Current error rates
Task                      Vocabulary   Error (%)
Digits                    11           0.5
WSJ read speech           5K           3
WSJ read speech           20K          3
Broadcast news            64,000+      10
Conversational Telephone  64,000+      20
Slide 17
Human vs. Machine Error Rates
Task               Vocab   ASR   Human SR
Continuous digits  11      .5    .009
WSJ 1995 clean     5K      3     0.9
WSJ 1995 w/noise   5K      9     1.1
SWBD 2004          65K     20    4
Conclusions:
Machines about 5 times worse than humans
Gap increases with noisy speech
These numbers are rough…
Slide 18
How to Build an ASR System
Build a statistical model of the speech-to-text process
Collect lots of speech and transcribe all the words
Train the acoustic model on the labeled speech
Train a language model on the transcription + lots more text
Paradigms: Supervised Machine Learning + Search
The Noisy Channel Model
Slide 19
ASR Paradigm
Given an acoustic observation: What is the most likely sequence of words to explain the input?
Using an Acoustic Model and a Language Model
Two problems:
How to score hypotheses (Modeling)
How to pick hypotheses to score (Search)
Slide 20
The Noisy Channel Model
Search through the space of all possible sentences.
Pick the one that is most probable given the waveform (Decoding)
Slide 21
The Noisy Channel Model: Assumptions
What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
Treat acoustic input O as a sequence of individual acoustic observations or frames: O = o1,o2,o3,…,ot
Define a sentence W as a sequence of words: W = w1,w2,w3,…,wn
Slide 22
Noisy Channel Model
Probabilistic implication: pick the most probable sequence:
Ŵ = argmax_{W ∈ L} P(W|O)
We can use Bayes rule to rewrite this:
Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O)
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
Ŵ = argmax_{W ∈ L} P(O|W) P(W)
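The argmax over P(O|W) P(W) can be sketched with made-up numbers. The two candidate sentences and every probability below are invented for illustration; a real recognizer scores millions of hypotheses with trained models.

```python
# Toy noisy-channel scoring: pick the sentence W maximizing P(O|W) * P(W).
# Candidates and probabilities are invented for illustration.
import math

candidates = {
    # W: (acoustic likelihood P(O|W), language-model prior P(W))
    "recognize speech":   (1e-4, 1e-5),
    "wreck a nice beach": (2e-4, 1e-8),
}

def score(w):
    p_o_given_w, p_w = candidates[w]
    return math.log(p_o_given_w) + math.log(p_w)  # log avoids underflow

best = max(candidates, key=score)
```

Note how the language-model prior overrules a slightly better acoustic match: the acoustically likelier "wreck a nice beach" loses to the far more probable word sequence.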
Slide 23
Speech Recognition Meets Noisy Channel: Acoustic Likelihoods and LM Priors
Slide 24
Components of an ASR System
Corpora for training and testing of components
Representation for input and method of extracting features
Pronunciation Model (a lexicon)
Acoustic Model (acoustic characteristics of subword units)
Language Model (word sequence probabilities)
Algorithms to search hypothesis space efficiently
Slide 25
Training and Test Corpora
Collect corpora appropriate for the recognition task at hand
Small speech corpus + phonetic transcription to associate sounds with symbols (Acoustic Model)
Large (>= 60 hrs) speech corpus + orthographic transcription to associate words with sounds (Acoustic Model+)
Very large text corpus to identify ngram probabilities or build a grammar (Language Model)
Slide 26
Building the Acoustic Model
Goal: Model likelihood of sounds given spectral features, pronunciation models, and prior context
Usually represented as a Hidden Markov Model
States represent phones or other subword units for each word in the lexicon
Transition probabilities (a) on states: how likely to transition from one unit to itself? To the next?
Observation likelihoods (b): how likely is a spectral feature vector (the acoustic information) to be observed in state i?
Slide 27
Training a Word HMM
Slide 28
Initial estimates of acoustic models from a phonetically transcribed corpus or flat start
Transition probabilities between phone states
Observation probabilities associating phone states with acoustic features of windows of the waveform
Embedded training: Re-estimate probabilities using initial phone HMMs + orthographically transcribed corpus + pronunciation lexicon to create whole-sentence HMMs for each sentence in the training corpus
Iteratively retrain transition and observation probabilities by running the training data through the model until convergence
Slide 29
Training the Acoustic Model
Iteratively sum over all possible segmentations of words and phones (given the transcript), re-estimating HMM parameters accordingly until convergence
Slide 30
Building the Pronunciation Model
Models likelihood of word given network of candidate phone hypotheses
Multiple pronunciations for each word
May be a weighted automaton or a simple dictionary
Words come from all corpora (including text)
Pronunciations come from a pronouncing dictionary or TTS system
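A weighted dictionary form of the pronunciation model can be sketched as below. The entries, ARPAbet-style phone strings, and weights are invented for illustration; a real lexicon would have tens of thousands of entries, often compiled into a weighted automaton.

```python
# Sketch of a weighted pronunciation lexicon: each word maps to one or
# more phone strings with probabilities summing to 1.
# Entries, phones, and weights are invented for illustration.

lexicon = {
    "tomato": [(("t", "ah", "m", "ey", "t", "ow"), 0.8),
               (("t", "ah", "m", "aa", "t", "ow"), 0.2)],
    "the":    [(("dh", "ah"), 0.9),
               (("dh", "iy"), 0.1)],
}

def pronunciation_prob(word, phones):
    """P(phone string | word) under the weighted lexicon (0 if absent)."""
    for variant, weight in lexicon.get(word, []):
        if variant == tuple(phones):
            return weight
    return 0.0
```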
Slide 31
ASR Lexicon: Markov Models for Pronunciation
Slide 32
Building the Language Model
Models likelihood of word given previous word(s)
Ngram models:
Build the LM by calculating bigram or trigram probabilities from a text training corpus: how likely is one word to follow another? To follow the two previous words?
Smoothing issues: sparse data
Grammars: Finite state grammar or Context Free Grammar (CFG) or semantic grammar
Out of Vocabulary (OOV) problem
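Bigram estimation plus one simple answer to the sparse-data problem (add-one smoothing, one of several smoothing schemes) can be sketched as follows. The two-sentence corpus is invented for illustration.

```python
# Minimal bigram language model with add-one (Laplace) smoothing,
# estimated from a toy corpus. Sentence boundaries use <s> and </s>.
# The corpus is invented for illustration.
from collections import Counter

corpus = [["my", "number", "is", "seven"],
          ["my", "number", "is", "four"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                  # context counts
    bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs

vocab = {w for s in corpus for w in s} | {"<s>", "</s>"}

def p_bigram(w_prev, w):
    """Smoothed P(w | w_prev): (count + 1) / (context count + |V|)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))
```

Smoothing is what keeps an unseen pair like ("seven", "my") from getting probability zero, which would otherwise veto any hypothesis containing it.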
Slide 33
Search/Decoding
Find the best hypothesis Ŵ = argmax P(O|W) P(W), given:
A sequence of acoustic feature vectors (O)
A trained HMM (Acoustic Model)
Lexicon (Pronunciation Model)
Probabilities of word sequences (Language Model)
For O:
Calculate the most likely state sequences in the HMM given transition and observation probabilities → lattice w/ AM probabilities
Trace backward through the lattice to assign words to states using AM and LM probabilities (decoding)
N-best vs. 1-best vs. lattice output
Limiting search:
Lattice minimization and determinization
Pruning: beam search
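The "most likely state sequence, then trace backward" step is the Viterbi algorithm; a minimal sketch over a two-state HMM follows. State names, transition/observation probabilities, and the observation symbols are all invented for illustration, and no pruning or lattice machinery is shown.

```python
# Viterbi sketch: find the single most likely state sequence through a
# small HMM, then trace back through stored predecessors.
# All probabilities and symbols are invented for illustration.

a = {("s1", "s1"): 0.6, ("s1", "s2"): 0.4, ("s2", "s2"): 1.0}
b = {"s1": {"x": 0.7, "y": 0.3}, "s2": {"x": 0.2, "y": 0.8}}
start = {"s1": 1.0, "s2": 0.0}
states = ["s1", "s2"]

def viterbi(obs):
    # best[t][s] = (probability of best path ending in s at time t, predecessor)
    best = [{s: (start[s] * b[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        col = {}
        for s in states:
            p, prev = max(
                ((best[-1][r][0] * a.get((r, s), 0.0) * b[s][o], r)
                 for r in states),
                key=lambda t: t[0])
            col[s] = (p, prev)
        best.append(col)
    # trace backward from the most probable final state (decoding)
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))
```

Unlike the forward algorithm, which sums over paths, Viterbi keeps only the single best path into each state, which is what makes the backward trace possible.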
Slide 34
Evaluating Success
Transcription
Goal: Low Word Error Rate = (Subst + Ins + Del)/N * 100
Ref: This is a test
Hyp: Thesis test. (1 subst + 2 del)/4 * 100 = 75% WER
Hyp: That was the dentist calling. (4 subst + 1 ins)/4 words * 100 = 125% WER
Understanding
Goal: High Concept Accuracy
How many domain concepts were correctly recognized?
I want to go from Boston to Washington on September 29
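The WER formula can be computed with a standard word-level edit distance; a minimal sketch, checked against the slide's two examples:

```python
# Word error rate via edit distance: WER = (S + I + D) / N * 100,
# where N is the number of reference words.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j]: minimum edits turning the first i ref words into the
    # first j hyp words (standard Levenshtein recurrence)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(r)][len(h)] / len(r) * 100

wer("this is a test", "thesis test")                   # 75.0
wer("this is a test", "that was the dentist calling")  # 125.0
```

Because insertions count against a fixed reference length N, WER can exceed 100%, as the second example shows.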
Slide 35
Domain concepts and values:
source city: Boston
target city: Washington
travel month: September
travel day: 29
Ref: Go from Boston to Washington on September 29
vs. ASR: Go to Boston from Washington on December 29
3 wrong concepts/4 concepts * 100 = 75% Concept Error Rate, or 25% Concept Accuracy
3 subst/8 words * 100 = 37.5% WER, or 62.5% Word Accuracy
Which is better?
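Concept accuracy can be sketched as a simple slot comparison. The slot names and values follow the slide's example; the filled frames themselves are assumed outputs of a language-understanding step not shown here.

```python
# Concept accuracy sketch: compare attribute-value slots extracted from
# the reference vs. the ASR hypothesis (slots from the slide's example).

ref = {"source city": "Boston", "target city": "Washington",
       "travel month": "September", "travel day": "29"}
hyp = {"source city": "Washington", "target city": "Boston",
       "travel month": "December", "travel day": "29"}

correct = sum(1 for k in ref if hyp.get(k) == ref[k])
concept_accuracy = correct / len(ref) * 100   # only the travel day survives
```

This is why word accuracy and concept accuracy can disagree so sharply: a few word errors in the wrong places can flip most of the meaning.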
Slide 36
ASR Today
Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words
Relies upon many approximate techniques to ‘translate’ a signal
Finite State Transducers now a standard representation for all components
Slide 37
ASR Future
3 -> 5: Triphones -> Quinphones; Trigrams -> Pentagrams
Bigger acoustic models
More parameters
More mixtures
Bigger lexicons: 65k -> 256k
Slide 38
Larger language models
More data, more parameters
Better back-offs
More kinds of adaptation
Feature space adaptation
Discriminative training instead of MLE
ROVER: combinations of recognizers
FSM flattening of knowledge into uniform structure
Slide 39
But What is Human about State-of-the-Art ASR?
[Architecture diagram: Input Wave → Front End → Acoustic Features → Search, with Acoustic Models, Language Models, and Lexicon feeding the Search]
Slide 40
Front End: MFCC
[Pipeline diagram: Input Wave → Sampling, Windowing → Fast Fourier Transform → Mel Filter Bank → cosine transform, first 8-12 coefficients → Postprocessing (stacking, computation of deltas; normalizations: filtering, etc.; linear transformations: dimensionality reduction) → Acoustic Features]
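The window → FFT → mel filter bank → cosine transform chain for a single frame can be sketched with numpy alone. The parameter values (16 kHz sampling, a 25 ms frame, 26 filters, 12 coefficients) are common choices assumed for illustration, not prescribed by the slide, and the deltas/normalization postprocessing is omitted.

```python
# Compact single-frame MFCC sketch: window -> FFT -> mel filter bank ->
# log -> DCT -> first coefficients. Parameter values (16 kHz, 25 ms,
# 26 filters, 12 coefficients) are assumed common defaults.
import numpy as np

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=12):
    n = len(frame)
    windowed = frame * np.hamming(n)
    power = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum
    # triangular mel filter bank between 0 Hz and the Nyquist frequency
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                            # rising slope
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                            # falling slope
            fbank[i, k] = (r - k) / max(r - c, 1)
    energies = np.log(fbank @ power + 1e-10)
    # type-II DCT decorrelates the log filter-bank energies
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), k + 0.5)
                 / n_filters)
    return dct @ energies                                # cepstral coefficients

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 25 ms of a 440 Hz tone
coeffs = mfcc_frame(frame)
```

A full front end would run this on overlapping frames every ~10 ms and then stack deltas and apply the normalizations the slide lists.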
Slide 41
Basic Lexicon
A list of spellings and pronunciations
Canonical pronunciations, and a few others
Limited to 64k entries
Supports simple stems and suffixes
Linguistically naïve:
No phonological rewrites
Doesn’t support all languages
Slide 42
Language Model
Frequency sensitive, like humans
Context sensitive, like humans
Slide 43
Next Class
Building an ASR system: the Pocket Sphinx Toolkit