Presentation Transcript


7.0 Speech Signals and Front-end Processing

References: 1. Sections 3.3, 3.4 of Becchetti; 2. Section 3.9.3 of Huang

Waveform plots of typical vowel sounds - Voiced

[Figure: vowel waveforms over time t for tone 1, tone 2, and tone 4; the tones differ in pitch contour]

Speech Production and Source Model

[Figure: the human vocal mechanism and the corresponding source model, in which an excitation u(t) drives a vocal tract filter to produce the speech signal x(t)]

Voiced and Unvoiced Speech

[Figure: excitation u(t) and output x(t) for voiced speech (periodic excitation with a pitch period) and unvoiced speech (noise-like excitation, no pitch)]

Waveform plots of typical consonant sounds

[Figure: waveforms of unvoiced and voiced consonants]

Waveform plot of a sentence

Time and Frequency Domains (P.12 of 2.0)

The time-domain signal x[n] and its frequency-domain spectrum X[k] form a 1-1 mapping under the Fourier Transform, computed efficiently with the Fast Fourier Transform (FFT).

Frequency domain spectra of speech signals

[Figure: spectra of voiced and unvoiced speech]

Frequency Domain

[Figure: voiced and unvoiced spectra, each showing the formant structure (the spectral envelope with its formant frequencies) and the excitation (the spectral fine structure)]

Input/Output Relationship for Time/Frequency Domains

Time domain: convolution, x[n] = u[n] * g[n]
Frequency domain: product, X(ω) = U(ω)G(ω)

Formant structure (from the vocal tract g[n]): the differences between phonemes
Excitation (u[n]): the source fine structure
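The convolution/product duality on this slide can be checked numerically; a minimal sketch with made-up sequences for u[n] and g[n] (not taken from the slides):

```python
import numpy as np

# Hypothetical excitation u[n] and vocal-tract impulse response g[n]
u = np.array([1.0, 0.0, 0.0, 0.5, 0.0, 0.0])
g = np.array([1.0, -0.6, 0.2])

# Time domain: convolution
x_time = np.convolve(u, g)                 # length 6 + 3 - 1 = 8

# Frequency domain: product (zero-pad both signals to the full output length)
N = len(u) + len(g) - 1
x_freq = np.fft.ifft(np.fft.fft(u, N) * np.fft.fft(g, N)).real

print(np.allclose(x_time, x_freq))         # True: the two routes agree
```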

Spectrogram

Spectrogram (continued)

Formant Frequencies

Formant frequency contours

[Figure: formant frequency contours for the sentence "He will allow a rare lie."]

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

Speech Signals
- Voiced/unvoiced
- Pitch/tone
- Vocal tract
- Frequency domain/formant frequency
- Spectrogram representation

Speech Source Model
- An excitation generator produces u[n], which drives a vocal tract model G(ω), G(z), g[n] to produce the speech signal x[n]:
  x[n] = u[n] * g[n],  X(ω) = U(ω)G(ω),  X(z) = U(z)G(z)
- Digitization and transmission of the model parameters will be adequate; at the receiver the parameters can reproduce x[n] with the model
- Far fewer parameters with much slower variation in time lead to far fewer bits required
- This is the key to low-bit-rate speech coding

Speech Source Model

[Figure: a continuous waveform x(t) over t, and the corresponding discrete parameter sequence a[n] over n]

Speech Source Model

Sophisticated model for speech production:
- Voiced: a periodic impulse train generator, controlled by the pitch period N
- Unvoiced: an uncorrelated noise generator
- The selected excitation, scaled by a gain G, passes through the glottal filter G(z), the vocal tract filter H(z), and the lip radiation filter R(z) to produce the speech signal x(n)

Simplified model for speech production:
- A voiced/unvoiced switch selects between a periodic impulse train generator (pitch period N) and a random sequence generator
- The excitation, scaled by a gain G, passes through a single combined filter to produce the speech signal x(n)

Simplified Speech Source Model

Excitation parameters:
- v/u: voiced/unvoiced decision
- N: pitch period for voiced speech
- G: signal gain

The excitation signal u[n] comes from a periodic pulse train generator (voiced) or a random sequence generator (unvoiced), and drives the vocal tract model

  G(z) = 1 / (1 - Σ_{k=1}^{P} a_k z^{-k})

to produce x[n].

Vocal tract parameters:
- {a_k}: LPC coefficients, representing the formant structure of the speech signal

A good approximation, though not precise enough.

Reference: 3.3.1-3.3.6 of Rabiner and Juang, or 6.3 of Huang
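A sketch of how this all-pole filter turns an excitation into a speech-like signal; the pitch period N = 8 and the LPC coefficients {a_k} below are illustrative values, not taken from the slides:

```python
import numpy as np

def lpc_synthesis(u, a):
    """All-pole filtering: x[n] = u[n] + sum_k a_k x[n-k], i.e. G(z) = 1/(1 - sum a_k z^-k)."""
    p = len(a)
    x = np.zeros(len(u))
    for n in range(len(u)):
        x[n] = u[n] + sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
    return x

# Voiced excitation: periodic impulse train with an assumed pitch period N = 8
N = 8
u = np.zeros(32)
u[::N] = 1.0
a_k = [0.5, -0.3]            # hypothetical LPC coefficients {a_k}
x = lpc_synthesis(u, a_k)    # x[0..2] = 1.0, 0.5, -0.05
```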

Speech Source Model

[Figure: excitation u[n] filtered by the vocal tract model to produce x[n]]

Feature Extraction - MFCC

Mel-Frequency Cepstral Coefficients (MFCC):
- The most widely used features in speech recognition
- Generally obtain better accuracy at relatively low computational complexity

The process of MFCC extraction:

  x(n) → Pre-emphasis → x'(n) → Window → x_t(n) → DFT → X_t(k) → Mel filter-bank → Y_t(m) → Log(|·|²) → Y'_t(m) → IDFT → y_t(j) (MFCC)

together with the frame energy e_t and the derivatives of all parameters.
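The whole chain can be sketched end to end in a few lines; every parameter value below (frame length, a = 0.97, triangular mel filters, the DCT form of the IDFT) is an assumed, common choice rather than the exact configuration of these slides:

```python
import numpy as np

def mfcc_frame(x, fs=16000, n_mels=24, n_ceps=13, a=0.97):
    """One frame of the MFCC chain (illustrative parameter values)."""
    # 1) Pre-emphasis: x'(n) = x(n) - a x(n-1)
    xp = np.append(x[0], x[1:] - a * x[:-1])
    # 2) Hamming window
    xw = xp * np.hamming(len(xp))
    # 3) DFT power spectrum
    L = len(xw)
    P = np.abs(np.fft.rfft(xw)) ** 2
    # 4) Triangular mel filter-bank: each Y_t(m) sums its filtered spectral components
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2), n_mels + 2))
    bins = np.floor((L + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mels, len(P)))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5) Log(| |^2), then 6) IDFT, which reduces to a DCT for this real input
    Ylog = np.log(fb @ P + 1e-10)
    j, mm = np.meshgrid(np.arange(n_ceps), np.arange(n_mels), indexing="ij")
    C = np.cos(np.pi * j * (mm + 0.5) / n_mels)     # DCT-II basis
    return C @ Ylog

y_t = mfcc_frame(np.random.randn(512))              # 13 cepstral coefficients
```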

Pre-emphasis

The process of pre-emphasis: a high-pass filter

  H(z) = 1 - a z^{-1},  0 < a ≤ 1

applied to the speech signal x(n):

  x'(n) = x(n) - a x(n-1)
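In code, the pre-emphasis difference equation is one line; a = 0.97 is an assumed typical value within the stated range 0 < a ≤ 1:

```python
import numpy as np

a = 0.97                                      # assumed typical value, 0 < a <= 1
x = np.array([1.0, 2.0, 3.0, 4.0])
xp = np.append(x[0], x[1:] - a * x[:-1])      # x'(n) = x(n) - a x(n-1)
# xp = [1.0, 1.03, 1.06, 1.09]
```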

Why pre-emphasis?

Reason:
- Voiced sections of the speech signal naturally have a negative spectral slope (attenuation) of approximately 20 dB per decade, due to physiological characteristics of the speech production system
- High-frequency formants have small amplitudes with respect to low-frequency formants; a pre-emphasis of the high frequencies therefore helps to obtain similar amplitudes for all formants

Why Windowing?

Why divide the speech signal into successive, overlapping frames?
- Voice signals change their characteristics from time to time; the characteristics remain unchanged only over short time intervals (short-time stationarity, short-time Fourier transform)

Frames:
- Frame length: the length of time over which a set of parameters can be obtained and is valid, typically 10 ~ 20 ms
- Frame shift: the length of time between successive parameter calculations
- Frame rate: the number of frames per second
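A small sketch of the framing step, using assumed values (16 kHz sampling, 20 ms frames, 10 ms shift, i.e. a frame rate of 100 frames per second):

```python
import numpy as np

def frames(x, frame_len, frame_shift):
    """Split x into successive overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] for i in range(n)])

fs = 16000
x = np.arange(fs, dtype=float)               # one second of samples
F = frames(x,
           frame_len=fs * 20 // 1000,        # 20 ms frames -> 320 samples
           frame_shift=fs * 10 // 1000)      # 10 ms shift -> frame rate 100/s
```

With these numbers the one-second signal yields 99 full frames of 320 samples each.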

Waveform plot of a sentence

[Figure: a segment x[n] and the windowed segment x[n]w[n] for rectangular and Hamming windows, together with their Fourier transforms]

Effect of Windowing (1)

Windowing:
- x_t(n) = w(n)·x'(n), where w(n) is the shape of the window (a product in the time domain)
- X_t(ω) = W(ω) * X'(ω), where * denotes convolution (a convolution in the frequency domain)
- A rectangular window (w(n) = 1 for 0 ≤ n ≤ L-1) simply extracts a segment of the signal; its frequency response has high side lobes
- Main lobe: spreads the narrow-band power of the signal (that around the formant frequency) over a wider frequency range, and thus reduces the local frequency resolution in formant allocation
- Side lobes: mix in energy from different and distant frequencies

[Figure: frequency responses (dB) of the rectangular and Hamming windows]
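The main-lobe/side-lobe trade-off can be measured directly from the window transforms; the peak side-lobe levels printed below (about -13 dB for rectangular, about -43 dB for Hamming) are standard results, reproduced here numerically:

```python
import numpy as np

def peak_sidelobe_db(w, nfft=4096):
    """Peak side-lobe level of a window, in dB relative to the main-lobe peak."""
    W = np.abs(np.fft.rfft(w, nfft))
    W_db = 20.0 * np.log10(W / W.max() + 1e-12)
    i = 1
    while W_db[i] > W_db[i + 1]:    # walk down the main lobe to its first null
        i += 1
    return W_db[i:].max()           # highest side lobe beyond the null

L = 64
print(peak_sidelobe_db(np.ones(L)))      # rectangular: about -13 dB
print(peak_sidelobe_db(np.hamming(L)))   # Hamming: about -43 dB
```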

Input/Output Relationship for Time/Frequency Domains (P.10 of 7.0)

Time domain: convolution, x[n] = u[n] * g[n]
Frequency domain: product, X(ω) = U(ω)G(ω)

Formant structure (from the vocal tract g[n]): the differences between phonemes
Excitation (u[n]): the source fine structure

Windowing

[Figure: window frequency response, showing the main lobe and the side lobes]

Main lobe: spreads the narrow-band power of the signal (that around the formant frequency) over a wider frequency range, and thus reduces the local frequency resolution in formant allocation
Side lobes: mix in energy from different and distant frequencies

Effect of Windowing (2)

Windowing (cont.): for a designed window, we wish that
- the main lobe is as narrow as possible
- the side lobes are as low as possible
However, it is impossible to achieve both simultaneously, so some trade-off is needed. The most widely used window shape is the Hamming window.

DFT and Mel-filter-bank Processing

For each frame of signal x_t(n), n = 0, 1, ..., L-1 (e.g., L = 512):
- The Discrete Fourier Transform (DFT) is first performed to obtain its spectrum X_t(k), k = 0, 1, ..., L/2 - 1
- A bank of filters based on the Mel scale is then applied; each filter output is the sum of its filtered spectral components, giving Y_t(0), Y_t(1), ..., Y_t(M-1) (M filters and thus M outputs, e.g., M = 24)
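The shapes involved in this step, with a placeholder uniform filter bank standing in for the real mel-spaced triangles, just to show the reduction from an L-point spectrum to M outputs:

```python
import numpy as np

L, M = 512, 24                       # frame length and number of filters (example values)
x_t = np.random.randn(L)             # one windowed frame
X_t = np.fft.rfft(x_t)               # spectrum: L/2 + 1 = 257 unique bins
P_t = np.abs(X_t) ** 2

# Placeholder filter bank (uniform, non-overlapping bands) in place of the
# mel-spaced triangles; each output is the sum of its filtered components
fb = np.zeros((M, len(P_t)))
band = len(P_t) // M
for m in range(M):
    fb[m, m * band : (m + 1) * band] = 1.0
Y_t = fb @ P_t                       # M = 24 filter outputs Y_t(0) .. Y_t(M-1)
```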

Peripheral Processing for Human Perception

Mel-scale Filter Bank

[Figure: triangular filters spaced according to the Mel frequency scale]
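The slide's formulas are not preserved in this transcript; a commonly used mel-scale mapping (one of several conventions, assumed here) is:

```python
import math

def hz_to_mel(f):
    """A commonly used mel-scale mapping (one of several conventions)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Roughly linear below about 1 kHz, logarithmic above; 1000 Hz maps to about 1000 mel
print(round(hz_to_mel(1000.0)))
```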

Why Filter-bank Processing?

The filter-bank processing simulates human ear perception:
- Frequencies of a complex sound within a certain frequency band cannot be individually identified; when one of the components of this sound falls outside this frequency band, it can be individually distinguished. This frequency band is referred to as the critical band
- These critical bands somewhat overlap with each other
- The critical bands are roughly distributed linearly on the logarithmic frequency scale (including the center frequencies and the bandwidths), especially at higher frequencies
- Human perception of the pitch of a signal is proportional to the logarithm of its frequency (relative ratios between frequencies)

Feature Extraction - MFCC

The MFCC extraction chain, repeated for reference:

  x(n) → Pre-emphasis → x'(n) → Window → x_t(n) → DFT → X_t(k) → Mel filter-bank → Y_t(m) → Log(|·|²) → Y'_t(m) → IDFT → y_t(j) (MFCC)

together with the frame energy e_t and the derivatives of all parameters.

Logarithmic Operation and IDFT

The final steps of MFCC evaluation: a logarithm operation followed by an IDFT.

- The Mel-filter outputs Y_t(m), m = 0, 1, ..., M-1, pass through Log(|·|²) to give Y'_t(m), m = 0, 1, ..., M-1
- The IDFT then gives the MFCC vector y_t(j), indexed by quefrency j = 0, 1, ..., J-1; in matrix form, y_t = C Y'_t
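The two final steps in code, with stand-in filter outputs; the DCT-II matrix plays the role of C in y_t = C Y'_t (matrix form and sizes assumed typical):

```python
import numpy as np

M, J = 24, 13                          # number of filters and of cepstral coefficients
Y_t = np.abs(np.random.randn(M)) + 0.1 # stand-in mel filter outputs Y_t(m)
Yp_t = np.log(Y_t ** 2)                # Y'_t(m) = Log(|Y_t(m)|^2)

# The IDFT of the real, symmetric log spectrum reduces to a DCT: y_t = C Y'_t
j, m = np.meshgrid(np.arange(J), np.arange(M), indexing="ij")
C = np.cos(np.pi * j * (m + 0.5) / M)  # J x M DCT-II basis
y_t = C @ Yp_t                         # MFCC vector y_t(j), j = 0 .. J-1
```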

Why Log Energy Computation?

Using the magnitude (or energy) only:
- Phase information is not very helpful in speech recognition; replacing the phase of the original speech signal with continuous random phase is usually not perceived by human ears

Using the logarithmic operation:
- Human perceptual sensitivity is proportional to signal energy on a logarithmic scale (relative ratios between signal energy values)
- The logarithm compresses larger values while expanding smaller values, a characteristic of the human hearing system
- The dynamic compression also makes feature extraction less sensitive to variations in signal dynamics
- It makes a convolved noise process additive. For speech signal x(n), excitation u(n), and vocal tract impulse response g(n):
  x(n) = u(n) * g(n) → X(ω) = U(ω)G(ω)
  → |X(ω)| = |U(ω)||G(ω)| → log|X(ω)| = log|U(ω)| + log|G(ω)|
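The additivity claim can be verified numerically with toy signals:

```python
import numpy as np

np.random.seed(0)
u = np.random.randn(64)                     # excitation (toy)
g = np.array([1.0, -0.5, 0.25, -0.125])     # vocal-tract impulse response (toy)
x = np.convolve(u, g)                       # x(n) = u(n) * g(n)

N = len(x)                                  # zero-pad u and g to the full length
X, U, G = (np.fft.fft(s, N) for s in (x, u, g))
# |X(w)| = |U(w)||G(w)|, so the log turns the product into a sum:
lx, lu, lg = (np.log(np.abs(S)) for S in (X, U, G))
print(np.allclose(lx, lu + lg))             # True
```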

Why Inverse DFT?

Final procedure for MFCC: performing the inverse DFT on the log spectral power.

Advantages:
- Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT). The DCT produces highly uncorrelated features y_t, so diagonal rather than full covariance matrices can be used in the Gaussian distributions in many cases
- It is easier to remove the interference of the excitation on the formant structure: the phoneme identity of a segment of speech is based primarily on the formant structure (or vocal tract shape), and on the frequency scale the formant structure changes slowly over frequency while the excitation changes much faster

Speech Production and Source Model (P.3 of 7.0)

[Figure repeated: the human vocal mechanism and the source model, with excitation u(t) driving the vocal tract to produce x(t)]

Voiced and Unvoiced Speech (P.4 of 7.0)

[Figure repeated: excitation u(t) and output x(t) for voiced (periodic, with a pitch period) and unvoiced speech]

Frequency domain spectra of speech signals (P.8 of 7.0)

[Figure repeated: spectra of voiced and unvoiced speech]

Frequency Domain (P.9 of 7.0)

[Figure repeated: formant structure and excitation in the voiced and unvoiced spectra]

Input/Output Relationship for Time/Frequency Domains (P.10 of 7.0)

Time domain: convolution, x[n] = u[n] * g[n]
Frequency domain: product, X(ω) = U(ω)G(ω)

Formant structure (from the vocal tract g[n]): the differences between phonemes
Excitation (u[n]): the source fine structure

Logarithmic Operation

x[n] = u[n] * g[n] → log|X(ω)| = log|U(ω)| + log|G(ω)|

[Figure: the spectra of u[n], g[n], and x[n], and the corresponding log spectra]

Derivatives

Derivative operation: obtain the change of the feature vectors with time.

[Figure: the MFCC stream y_t(j) over frame index t and quefrency j; a regression over frames t-1, t, t+1, t+2 gives the ΔMFCC stream Δy_t(j), and applying the operation again gives the Δ²MFCC stream Δ²y_t(j)]
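A sketch of delta (and delta-delta) computation using the linear-regression-slope formulation over a ±p frame window; the exact weights and edge handling vary between systems:

```python
import numpy as np

def delta(feats, p=2):
    """Delta coefficients as the regression slope over frames t-p .. t+p."""
    T = len(feats)
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    padded = np.concatenate([[feats[0]] * p, feats, [feats[-1]] * p])
    return np.array([
        sum(k * (padded[t + p + k] - padded[t + p - k]) for k in range(1, p + 1)) / denom
        for t in range(T)
    ])

y = np.arange(10.0)        # toy 1-D MFCC stream with constant slope 1
d = delta(y)               # delta: equals 1 away from the edges
dd = delta(d)              # delta-delta: equals 0 away from the edges
```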

Linear Regression

Given points (x_i, y_i), fit y = ax + b: find a, b.
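A minimal least-squares fit for a and b, here via np.polyfit on exactly linear toy data:

```python
import numpy as np

# Points (x_i, y_i) lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

a, b = np.polyfit(x, y, 1)    # least-squares fit of y = a*x + b
```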

Why Delta Coefficients?

- To capture the dynamic characteristics of the speech signal; such information is relevant for speech recognition
- The value of p should be properly chosen: the dynamic characteristics may not be properly extracted if p is too small, while too large a p may involve frames too far away
- To cancel the DC part (channel distortion or convolutional noise) of the MFCC features. Assume that, for clean speech, the MFCC parameter stream for an utterance is {y(t-N), y(t-N+1), ..., y(t), y(t+1), y(t+2), ...}, where y(t) is an MFCC parameter at time t. After channel distortion, the stream becomes {y(t-N)+h, y(t-N+1)+h, ..., y(t)+h, y(t+1)+h, y(t+2)+h, ...}, and the channel effect h is eliminated in the delta (difference) coefficients
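A toy check that a constant cepstral offset h vanishes in the difference coefficients (simple first differences used here for brevity):

```python
import numpy as np

y = np.sin(np.linspace(0.0, 3.0, 12))   # toy clean MFCC stream y(t)
h = 0.7                                 # constant channel offset in the cepstral domain
y_dist = y + h                          # stream after channel distortion

d_clean = np.diff(y)                    # first differences as a simple delta feature
d_dist = np.diff(y_dist)

print(np.allclose(d_clean, d_dist))     # True: the constant h cancels
```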

Convolutional Noise

Time domain: y[n] = x[n] * h[n], where x[n] is the clean speech and h[n] is the channel impulse response

In the MFCC (cepstral) domain the convolution becomes additive: y = x + h

End-point Detection

- Push (and hold) to talk / continuously listening
- Adaptive energy threshold
- Low rejection rate; false acceptances may be rescued later
- Vocabulary words preceded and followed by a silence/noise model
- Two-class pattern classifier (speech vs. silence/noise): Gaussian density functions model the two classes, with log-energy and delta log-energy as the feature parameters, and dynamically adapted parameters

End-point Detection

[Figure: a waveform with silence (noise) regions and speech; the detected speech segments are marked, with examples of false rejection and false acceptance]

Websites related to phonetics, signal waveforms, and spectral characteristics

17. Three Tutorials on Voicing and Plosives
http://homepage.ntu.edu.tw/~karchung/intro%20page%2017.htm
8. Fundamental frequency and harmonics
http://homepage.ntu.edu.tw/~karchung/phonetics%20II%20page%20eight.htm
9. Vowels and Formants I: Resonance (with soda bottle demonstration)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20nine.htm
10. Vowels and Formants II (with duck call demonstration)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20ten.htm
12. Understanding Decibels (a PowerPoint slide show)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twelve.htm
13. The Case of the Missing Fundamental
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20thirteen.htm
14. Forry, wrong number! The frequency ranges of speech and hearing
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20fourteen.htm
19. Vowels and Formants III: Formants for fun and profit (with samples of exotic music)
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20nineteen.htm
20. Getting into spectrograms: Some useful links
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twenty.htm
21. Two other ways to visualize sound signals
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twentyone.htm
23. Advanced speech analysis tools II: Praat and more
http://homepage.ntu.edu.tw/~karchung/Phonetics%20II%20page%20twentythree.htm
25. Synthesizing vowels online
http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html