Slide 1
7.0 Speech Signals and Front-end Processing
References: 3.3, 3.4 of Becchetti; 9.3 of Huang
Slide 2
Waveform plots of typical vowel sounds - Voiced
[Figure: vowel waveforms for tone 1, tone 2, and tone 4 plotted against time t, illustrating pitch]
Slide 3
Speech Production and Source Model
[Figure: the human vocal mechanism, and the corresponding speech source model - an excitation u(t) driving the vocal tract to produce x(t)]
Slide 4
Voiced and Unvoiced Speech
[Figure: excitation u(t) and output x(t) for voiced speech (periodic, with a pitch period) and unvoiced speech (noise-like, no pitch)]
Slide 5
Waveform plots of typical consonant sounds - Unvoiced and Voiced
[Figure: waveforms of unvoiced and voiced consonants]
Slide 6
Waveform plot of a sentence
Slide 7
Time and Frequency Domains (P.12 of 2.0)
time domain x[n] <-> frequency domain X[k]: a 1-1 mapping given by the Fourier Transform, computed with the Fast Fourier Transform (FFT)
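A minimal numpy sketch of this 1-1 mapping: the FFT takes x[n] to X[k], and the inverse FFT recovers x[n] exactly (the 5-cycle sinusoid and the length 256 are arbitrary example choices, not values from the slides).

```python
import numpy as np

n = np.arange(256)
x = np.sin(2 * np.pi * 5 * n / 256)     # a 5-cycle sinusoid, 256 samples

X = np.fft.fft(x)                       # time domain -> frequency domain
x_back = np.fft.ifft(X).real            # frequency domain -> time domain

print(np.allclose(x, x_back))           # -> True: the mapping is 1-1
print(int(np.argmax(np.abs(X[:128])))) # -> 5: the spectrum peaks at bin 5
```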
Slide 8
Frequency domain spectra of speech signals
[Figure: spectra of voiced and unvoiced speech]
Slide 9
Frequency Domain
[Figure: voiced and unvoiced spectra, each showing the formant structure (spectral envelope with formant frequencies) and the excitation (fine structure)]
Slide 10
Input/Output Relationship for Time/Frequency Domains
time domain: convolution, x[n] = u[n] * g[n]
frequency domain: product, X(ω) = U(ω)G(ω)
Formant structure: carries the differences between phonemes
Excitation: determines the fine structure of the spectrum
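This time/frequency relationship can be checked numerically; the random sequences below simply stand in for the excitation u[n] and the vocal tract response g[n].

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(64)    # stand-in for the excitation u[n]
g = rng.standard_normal(64)    # stand-in for the vocal tract response g[n]

x_time = np.convolve(u, g)     # time domain: convolution, length 127

# frequency domain: product (zero-pad to >= 127 so circular == linear)
N = 128
x_freq = np.fft.ifft(np.fft.fft(u, N) * np.fft.fft(g, N)).real[:127]

print(np.allclose(x_time, x_freq))   # -> True
```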
Slide 11
Spectrogram
Slide 12
Spectrogram
Slide 13
Formant Frequencies
Slide 14
Formant frequency contours
[Figure: formant contours for the sentence "He will allow a rare lie."]
Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang
Slide 15
Speech Signals
- Voiced/unvoiced
- Pitch/tone
- Vocal tract
- Frequency domain/formant frequency
- Spectrogram representation
Speech Source Model
- digitization and transmission of the model parameters is adequate
- at the receiver, the parameters can reproduce x[n] with the model
- far fewer parameters, varying much more slowly in time, lead to far fewer bits required
- this is the key to low-bit-rate speech coding
[Diagram: Excitation Generator → u[n] → Vocal Tract Model (G(ω), G(z), g[n]) → x[n], each block driven by its parameters]
x[n] = u[n] * g[n]
X(ω) = U(ω)G(ω), X(z) = U(z)G(z)
Slide 16
Speech Source Model
[Figure: the signal x(t) over time t, and the corresponding parameter sequence a[n] over index n]
Slide 17
Speech Source Model
Sophisticated model for speech production:
- Voiced: Periodic Impulse Train Generator (pitch period N); Unvoiced: Uncorrelated Noise Generator; each scaled by a gain G
- followed by G(z) (glottal filter), H(z) (vocal tract filter), and R(z) (lip radiation filter), producing the speech signal x(n)
Simplified model for speech production:
- Periodic Impulse Train Generator (voiced, pitch period N) or Random Sequence Generator (unvoiced), selected by a voiced/unvoiced switch and scaled by a gain G
- followed by a single combined filter, producing the speech signal x(n)
Slide 18
Simplified Speech Source Model
Excitation parameters:
- v/u: voiced/unvoiced
- N: pitch period for voiced speech
- G: signal gain
excitation signal u[n]: produced by a periodic pulse train generator (voiced) or a random sequence generator (unvoiced)
Vocal Tract Model: G(z) = 1 / (1 − Σ_{k=1}^{P} a_k z^{−k})
Vocal tract parameters {a_k}: LPC coefficients, describing the formant structure of speech signals
A good approximation, though not precise enough
Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang
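A minimal sketch of this simplified source model: a periodic impulse train (voiced excitation with pitch period N) driving the all-pole filter G(z) = 1 / (1 − Σ a_k z^{−k}). The pitch period and the LPC coefficients below are illustrative values, not taken from the slides.

```python
import numpy as np

N = 80                               # pitch period in samples (example value)
G = 1.0                              # signal gain
u = np.zeros(400)
u[::N] = 1.0                         # periodic impulse train u[n] (voiced case)

a = [1.3, -0.9]                      # illustrative LPC coefficients {a_k}, P = 2
x = np.zeros_like(u)
for t in range(len(u)):              # x[n] = G*u[n] + sum_k a_k * x[n-k]
    x[t] = G * u[t]
    for k, ak in enumerate(a, start=1):
        if t - k >= 0:
            x[t] += ak * x[t - k]

print(x[:3])                         # x[0], x[1], x[2] = 1.0, 1.3, 0.79
```

The poles of this example filter lie inside the unit circle, so each pitch pulse excites a decaying resonance, a crude stand-in for the formant structure.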
Slide 19
Speech Source Model
[Figure: excitation u[n] and output x[n] of the source model]
Slide 20
Feature Extraction - MFCC
Mel-Frequency Cepstral Coefficients (MFCC)
- the most widely used features in speech recognition
- generally give good accuracy at relatively low computational complexity
The process of MFCC extraction:
x(n) → Pre-emphasis → x'(n) → Window → x_t(n) → DFT → X_t(k) → Mel filter-bank → Y_t(m) → Log(| |²) → Y'_t(m) → IDFT → y_t(j) (MFCC), plus energy e_t and derivatives
Slide 21
Pre-emphasis
The process of pre-emphasis: a high-pass filter
H(z) = 1 − a·z^{−1}, 0 < a ≤ 1
x'(n) = x(n) − a·x(n−1)
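The filter above is one line of numpy; a = 0.97 is a commonly used value, not one prescribed by the slides.

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Apply H(z) = 1 - a*z^-1, i.e. x'(n) = x(n) - a*x(n-1), taking x(-1) = 0."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))

dc = np.ones(4)                      # a constant (0 Hz) signal
print(pre_emphasis(dc))              # -> [1.   0.03 0.03 0.03]: low frequencies
                                     #    are strongly attenuated
```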
Slide 22
Why Pre-emphasis?
Reason:
- Voiced sections of the speech signal naturally have a negative spectral slope (attenuation) of approximately 20 dB per decade, due to the physiological characteristics of the speech production system
- High-frequency formants have small amplitudes relative to low-frequency formants. Pre-emphasizing the high frequencies therefore helps to obtain similar amplitudes for all formants
Slide 23
Why Windowing?
Why divide the speech signal into successive, overlapping frames?
- Voice signals change their characteristics from time to time; the characteristics remain unchanged only over short time intervals (short-time stationary, short-time Fourier transform)
Frames:
- Frame Length: the length of time over which a set of parameters can be obtained and is valid; frame lengths are typically on the order of 10 ~ 20 ms
- Frame Shift: the length of time between successive parameter calculations
- Frame Rate: the number of frames per second
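Framing can be sketched with a single indexing trick; the 25 ms frame length, 10 ms frame shift, and 16 kHz sampling rate below are example values for illustration, not figures from the slides.

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Split x into overlapping frames (frames that would run past
    the end of the signal are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return np.asarray(x)[idx]

fs = 16000                        # sampling rate (assumed)
x = np.arange(fs)                 # one second of dummy samples
frames = frame_signal(x, frame_len=int(0.025 * fs), frame_shift=int(0.010 * fs))
print(frames.shape)               # (98, 400): 25 ms frames taken every 10 ms,
                                  # i.e. a frame rate close to 100 frames/s
```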
Slide 24
Waveform plot of a sentence
Slide 25
[Figure: a signal segment x[n], the windowed segment x[n]w[n], and the Fourier transforms of both, comparing Hamming and rectangular windows]
Slide 26
Effect of Windowing (1)
Windowing:
x_t(n) = w(n)·x'(n), where w(n) is the shape of the window (product in time domain)
X_t(ω) = W(ω) * X'(ω), where * denotes convolution (convolution in frequency domain)
Rectangular window (w(n) = 1 for 0 ≤ n ≤ L−1): simply extracts a segment of the signal, but its frequency response has high side lobes
- Main lobe: spreads the narrow-band power of the signal (that around the formant frequency) over a wider frequency range, and thus reduces the local frequency resolution in formant allocation
- Side lobes: swap energy between different and distant frequencies
[Figure: frequency responses (dB) of the rectangular and Hamming windows]
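The side-lobe difference between the two windows can be measured directly from their FFTs; the window length L = 64 and the side-lobe-finding heuristic below are illustrative choices.

```python
import numpy as np

L = 64
rect = np.ones(L)
hamm = np.hamming(L)

def highest_side_lobe_db(w, nfft=4096):
    """Peak side-lobe level (dB relative to the main-lobe peak)."""
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20 * np.log10(W / W.max() + 1e-12)
    # walk down the main lobe to the first null, then take the peak beyond it
    null = np.argmax(np.diff(W_db) > 0)   # first index where W_db turns upward
    return W_db[null:].max()

r = highest_side_lobe_db(rect)
h = highest_side_lobe_db(hamm)
print(round(r, 1))   # rectangular window: high side lobes, ≈ -13 dB
print(round(h, 1))   # Hamming window: much lower side lobes (≈ -40 dB or below),
                     # at the cost of a wider main lobe
```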
Slide 27
Input/Output Relationship for Time/Frequency Domains (P.10 of 7.0)
time domain: convolution; frequency domain: product
Formant structure: differences between phonemes; excitation
Slide 28
Windowing
[Figure: window frequency response, showing the main lobe and side lobes]
Main lobe: spreads the narrow-band power of the signal (that around the formant frequency) over a wider frequency range, and thus reduces the local frequency resolution in formant allocation
Side lobes: swap energy between different and distant frequencies
Slide 29
Effect of Windowing (2)
Windowing (cont.): for a designed window, we wish that
- the main lobe is as narrow as possible
- the side lobes are as low as possible
However, it is impossible to achieve both simultaneously; some trade-off is needed.
The most widely used window shape is the Hamming window.
Slide 30
DFT and Mel-filter-bank Processing
For each frame of signal (L points, e.g., L = 512), the Discrete Fourier Transform (DFT) is first performed to obtain its spectrum (L points, e.g., L = 512)
A bank of filters based on the Mel scale is then applied, and each filter output is the sum of its filtered spectral components (M filters, and thus M outputs, e.g., M = 24)
[Diagram: time-domain signal x_t(n), n = 0, 1, ..., L−1 → DFT → spectrum X_t(k), k = 0, 1, ..., L/2 − 1 → filter-bank sums → Y_t(0), Y_t(1), ..., Y_t(M−1)]
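A minimal triangular Mel filter-bank sketch. The mel() mapping is the standard 2595·log10(1 + f/700) formula; M = 24, L = 512 match the example figures above, and fs = 16 kHz is an assumed sampling rate.

```python
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_inv(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M=24, L=512, fs=16000):
    """M triangular filters over the L//2 + 1 non-negative DFT bins."""
    # M + 2 equally spaced points on the mel scale, mapped back to Hz,
    # then to DFT bin indices
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), M + 2))
    bins = np.floor((L + 1) * pts / fs).astype(int)
    fb = np.zeros((M, L // 2 + 1))
    for m in range(1, M + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    return fb

fb = mel_filterbank()
print(fb.shape)     # (24, 257): each row sums one region of the spectrum,
                    # with bandwidths growing toward high frequencies
```

Each mel output Y_t(m) is then the dot product of row m with the frame's power spectrum |X_t(k)|².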
Slide 31
Peripheral Processing for Human Perception
Slide 32
Mel-scale Filter Bank
[Figure: triangular Mel-scale filters across the frequency axis]
Slide 33
Why Filter-bank Processing?
The filter-bank processing simulates human ear perception:
- Frequencies of a complex sound within a certain frequency band cannot be individually identified
- When one of the components of this sound falls outside this frequency band, it can be individually distinguished
- This frequency band is referred to as the critical band; the critical bands somewhat overlap with each other
- The critical bands are roughly distributed linearly on the logarithmic frequency scale (both the center frequencies and the bandwidths), especially at higher frequencies
- Human perception of the pitch of signals is proportional to the logarithm of the frequencies (relative ratios between the frequencies)
Slide 34
Feature Extraction - MFCC (recap)
[The MFCC extraction block diagram again: x(n) → Pre-emphasis → Window → DFT → Mel filter-bank → Log(| |²) → IDFT → MFCC, plus energy and derivatives]
Slide 35
Logarithmic Operation and IDFT
The final steps of MFCC evaluation: the logarithmic operation and the IDFT
[Diagram: Mel-filter output Y_t(m), m = 0, ..., M−1 → Log(| |²) → Y'_t(m), m = 0, ..., M−1 → IDFT → MFCC vector y_t(j), j = 0, ..., J−1, plotted against quefrency j]
y_t = C Y'_t
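These last two steps can be sketched in a few lines; since the log-power spectrum is real and symmetric, the IDFT takes the form of a DCT, written here as the matrix C in y_t = C Y'_t. M = 24 and J = 13 are example sizes, and the small offset inside the log is a numerical guard.

```python
import numpy as np

def mfcc_from_filterbank(Y_t, J=13):
    """Y_t: M mel filter-bank outputs (energies) for one frame."""
    M = len(Y_t)
    Y_log = np.log(np.asarray(Y_t) + 1e-10)      # Log(| |^2) stage
    # DCT-II matrix C: rows j = 0..J-1, columns m = 0..M-1
    j = np.arange(J)[:, None]
    m = np.arange(M)[None, :]
    C = np.cos(np.pi * j * (m + 0.5) / M)
    return C @ Y_log                              # MFCC vector y_t(j)

Y_t = np.ones(24)            # a perfectly flat filter-bank output
y_t = mfcc_from_filterbank(Y_t)
print(y_t.shape)             # -> (13,); all higher coefficients are ~0,
                             # since a flat log spectrum has no cepstral detail
```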
Slide 36
Why Log Energy Computation?
Using the magnitude (or energy) only:
- Phase information is not very helpful in speech recognition: replacing the phase of the original speech signal with continuous random phase is usually not perceived by human ears
Using the logarithmic operation:
- Human perceptual sensitivity is proportional to signal energy on a logarithmic scale (relative ratios between signal energy values)
- The logarithm compresses larger values while expanding smaller values, which is a characteristic of the human hearing system
- The dynamic compression also makes feature extraction less sensitive to variations in signal dynamics
- It makes a convolved noisy process additive: for speech signal x(n), excitation u(n), and vocal tract impulse response g(n),
x(n) = u(n) * g(n) → X(ω) = U(ω)G(ω)
|X(ω)| = |U(ω)||G(ω)| → log|X(ω)| = log|U(ω)| + log|G(ω)|
Slide 37
Why Inverse DFT?
The final procedure for MFCC: performing the inverse DFT on the log-spectral power
Advantages:
- Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT). The DCT has the property of producing highly uncorrelated features y_t, so diagonal rather than full covariance matrices can be used in the Gaussian distributions in many cases
- It is easier to remove the interference of the excitation on the formant structure: the phoneme for a segment of speech signal is determined primarily by the formant structure (or vocal tract shape), and on the frequency scale the formant structure changes slowly over frequency, while the excitation changes much faster
Slide 38
Speech Production and Source Model (P.3 of 7.0)
[Recap of Slide 3: human vocal mechanism and the speech source model, excitation u(t) → vocal tract → x(t)]
Slide 39
Voiced and Unvoiced Speech (P.4 of 7.0)
[Recap of Slide 4: u(t) and x(t) for voiced (with pitch) and unvoiced speech]
Slide 40
Frequency domain spectra of speech signals (P.8 of 7.0)
[Recap of Slide 8: voiced and unvoiced spectra]
Slide 41
Frequency Domain (P.9 of 7.0)
[Recap of Slide 9: formant structure (formant frequencies) and excitation for voiced and unvoiced speech]
Slide 42
Input/Output Relationship for Time/Frequency Domains (P.10 of 7.0)
[Recap of Slide 10: time domain convolution, frequency domain product; formant structure (differences between phonemes) and excitation]
Slide 43
Logarithmic Operation
[Figure: excitation u[n], vocal tract impulse response g[n], and their convolution x[n] = u[n] * g[n]]
Slide 44
Derivatives
Derivative operation: to obtain the change of the feature vectors with time
[Diagram: the MFCC stream y_t(j) over frame index t (frames t−1, t, t+1, t+2) and quefrency j, from which the Δ MFCC stream Δy_t(j) and the Δ² MFCC stream Δ²y_t(j) are computed]
Slide 45
Linear Regression
Given points (x_i, y_i), fit y = ax + b: find a, b
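The least-squares fit of y = ax + b has a closed form; a small sketch (the four sample points are arbitrary):

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) minimizing sum_i (y_i - a*x_i - b)^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - a * x.mean()
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # points lying exactly on y = 2x + 1
print(a, b)                                    # -> 2.0 1.0
```

The same slope estimate, applied to a feature value over neighboring frames, is what the delta coefficients on the next slide compute.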
Slide 46
Why Delta Coefficients?
- To capture the dynamic characteristics of the speech signal, which carry relevant information for speech recognition
- The value of p should be properly chosen: the dynamic characteristics may not be properly extracted if p is too small, while too large a p may involve frames too far away
- To cancel the DC part (channel distortion or convolutional noise) of the MFCC features. Assume, for clean speech, the MFCC parameter stream for an utterance is {y(t−N), y(t−N+1), ..., y(t), y(t+1), y(t+2), ...}, where y(t) is an MFCC parameter at time t. After channel distortion, the MFCC stream becomes {y(t−N)+h, y(t−N+1)+h, ..., y(t)+h, y(t+1)+h, y(t+2)+h, ...}; the channel effect h is eliminated in the delta (difference) coefficients
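The cancellation of the channel term h can be verified directly with the common regression-style delta formula; the window p = 2 and the toy one-dimensional stream are example choices.

```python
import numpy as np

def delta(y, p=2):
    """d(t) = sum_{k=1..p} k*(y(t+k) - y(t-k)) / (2 * sum_k k^2),
    with edge values repeated at the boundaries."""
    y = np.asarray(y, dtype=float)
    ypad = np.pad(y, p, mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    return np.array([sum(k * (ypad[t + p + k] - ypad[t + p - k])
                         for k in range(1, p + 1)) / denom
                     for t in range(len(y))])

y_clean = np.arange(6.0)             # a toy one-dimensional MFCC stream
h = 7.5                              # constant channel distortion
print(np.allclose(delta(y_clean), delta(y_clean + h)))   # -> True: h cancels,
                                     # since it appears in both y(t+k) and y(t-k)
```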
Slide 47
Convolutional Noise
y[n] = x[n] * h[n]: the clean signal x[n] convolved with a channel h[n]
After the MFCC (log-spectral) operations, the convolution becomes additive: y = x + h
Slide 48
End-point Detection
- Push (and hold) to talk / continuously listening
- Adaptive energy threshold
- Low rejection rate: false acceptances may be rescued later
- Vocabulary words preceded and followed by a silence/noise model
- Two-class pattern classifier (Speech vs. Silence/Noise): Gaussian density functions used to model the two classes; log-energy and delta log-energy as the feature parameters; dynamically adapted parameters
Slide 49
End-point Detection
[Figure: a waveform with speech and silence (noise) regions, showing the detected speech segments and examples of false rejection and false acceptance]
Slide 50
Websites on phonetics, signal waveforms, and spectral characteristics
17. Three Tutorials on Voicing and Plosives
http://homepage.ntu.edu.tw/~karchung/intro%20page%2017.htm
8. Fundamental frequency and harmonics
http://homepage.ntu.edu.tw/~karchung/phonetics%20II%20page%20eight.htm
9. Vowels and Formants I: Resonance (with soda bottle demonstration)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20nine.htm
10. Vowels and Formants II (with duck call demonstration)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20ten.htm
12. Understanding Decibels (A PowerPoint slide show)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twelve.htm
13. The Case of the Missing Fundamental
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20thirteen.htm
14. Forry, wrong number! I The frequency ranges of speech and hearing
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20fourteen.htm
19. Vowels and Formants III: Formants for fun and profit (with samples of exotic music)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20nineteen.htm
20. Getting into spectrograms: Some useful links
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twenty.htm
21. Two other ways to visualize sound signals
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twentyone.htm
23. Advanced speech analysis tools II: Praat and more
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twentythree.htm
25. Synthesizing vowels online
http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html