Estimating Tract Variables from Acoustics via Machine Learning - PowerPoint Presentation

Uploaded by tatyana-admore on 2016-04-08

Presentation Transcript

Slide1

Estimating Tract Variables from Acoustics via Machine Learning

Christiana Sabett

Applied Math, Applied Statistics, and Scientific Computing (AMSC)

October 7, 2014

Advisor: Dr. Carol Espy-Wilson

Electrical and Computer Engineering

Slide2

Introduction

Automatic speech recognition (ASR) systems are inadequate and incomplete in their current forms.

Coarticulation: the overlap of articulatory actions in the vocal tract.

Slide3

Tract Variables [a]

Articulatory information: information from the organs along the vocal tract

Tract variables (TVs): vocal tract constriction variables relaying information of a physical trajectory in time

Lip Aperture (LA)
Lip Protrusion (LP)
Tongue tip constriction degree (TTCD)
Tongue tip constriction location (TTCL)
Tongue body constriction degree (TBCD)
Tongue body constriction location (TBCL)
Velum (VEL)
Glottis (GLO)

a. Mitra et al., 2010.

Slide4

Tract Variables

TVs are consistent in the presence of coarticulation

TVs can improve the robustness of automatic speech recognition

[Figure: spectrogram (frequency vs. time) and waveform for the clearly articulated utterance "perfect memory," with the LA, TT, and TB trajectories plotted against time and aligned with the phone labels.]

Slide5

Project Goal

Effectively estimate TV trajectories using artificial neural networks, implementing Kalman smoothing when necessary.

Slide6

Approach

Artificial neural networks (ANNs) [b]
Feed-forward ANN (FF-ANN)
Recurrent ANN (RANN)

Motivation: the articulatory-to-acoustic map is many-to-one, [c] so speech inversion is ill-posed
ANNs can map m inputs to n outputs

Retroflex /r/
Bunched /r/

b. Papcun, 1992.
c. Atal et al., 1978.

Slide7

Structure [d]

3 hidden layers
Each node has sigmoidal activation function f(x) = tanh(x)
Weights w
Biases b
Input: acoustic feature vector (9×20 or 9×13)
Output: g_k, an estimate of the TV trajectories at time k (dimension 8×1)
g_k is a nonlinear composition of the activation functions

d. Mitra, 2010.

Slide8
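As a sketch, the forward pass just described can be written in a few lines of numpy. The hidden-layer sizes and weight initialization below are illustrative assumptions (the slides do not give them), and the output layer is left linear, a common choice for trajectory regression:

```python
import numpy as np

def init_ffann(rng, n_in=117, hidden=(150, 100, 150), n_out=8):
    """Weights w and biases b for a 3-hidden-layer FF-ANN.
    n_in = 117 = 9 frames x 13 acoustic features; hidden sizes are
    illustrative, not taken from the slides."""
    sizes = (n_in, *hidden, n_out)
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """g_k: a nonlinear composition of tanh activations over the layers."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:   # tanh on hidden layers; output kept linear
            x = np.tanh(x)
    return x                       # 8 TV estimates at time k
```

With a 9×13 feature window flattened to a 117-vector, `forward` returns the 8×1 TV estimate g_k.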

Cost Function

Networks trained by minimizing the sum-of-squares error E_SE
Training data [x, t] (N = 315 words) [e]
Output of the network, g_k, is the predicted TV trajectory estimated by position at each time step k
Weights and biases updated using the scaled conjugate gradient algorithm and dynamic backpropagation to reduce E_SE

e. Mitra, 2010.

Slide9
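The sum-of-squares error being minimized can be computed directly; the array names here are illustrative, with `g` holding network outputs and `t` the target TV trajectories:

```python
import numpy as np

def sum_of_squares_error(g, t):
    """E_SE = 1/2 * sum over samples and TV dimensions of (g - t)^2."""
    g, t = np.asarray(g, float), np.asarray(t, float)
    return 0.5 * np.sum((g - t) ** 2)
```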

Dynamic Backpropagation [f]

[Equation and term definitions shown as images in the original slide.]

f. Jin, Liang, and M. M. Gupta, 1999.

Slide10

Scaled Conjugate Gradient (SCG) [g]

Choose weight vector w_1 and scalars. Let p_1 = r_1 = -E'_SE(w_1).
While the steepest descent direction r_k ≠ 0:
  If success = true, calculate second-order information s_k.
  Scale s_k (the finite-difference approximation to the second derivative).
  If δ_k ≤ 0, make the Hessian positive definite.
  Calculate step size α_k = (p_k^T r_k) / δ_k.
  Calculate comparison parameter Δ_k.
  If Δ_k ≥ 0: w_{k+1} = w_k + α_k p_k, r_{k+1} = -E'_SE(w_{k+1}).
    If k mod M = 0 (M is the number of weights), restart the algorithm: let p_{k+1} = r_{k+1}.
    Else create a new conjugate direction p_{k+1} = r_{k+1} + β_k p_k.
  If Δ_k < 0.25, increase the scale parameter: λ_k = 4λ_k.

g. Moller, 1993.

Slide11
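The loop above can be sketched as follows. This follows Moller (1993) but uses the slide's coarser scale update (λ → 4λ) and drops the λ̄ bookkeeping of the full algorithm, so it is a simplified illustration rather than a faithful implementation:

```python
import numpy as np

def scg(E, dE, w, sigma=1e-4, lam=1e-6, max_iter=200, tol=1e-8):
    """Simplified scaled conjugate gradient (after Moller, 1993).
    E: scalar error function, dE: its gradient, w: initial weight vector."""
    M = w.size                   # restart period = number of weights
    r = -dE(w)                   # steepest-descent direction r_1
    p = r.copy()                 # initial search direction p_1 = r_1
    success = True
    delta = 0.0
    for k in range(1, max_iter + 1):
        p2 = p @ p
        if p2 < tol:             # search direction has vanished
            break
        if success:              # second-order info by finite differences
            sk = sigma / np.sqrt(p2)
            s = (dE(w + sk * p) - dE(w)) / sk
            delta = p @ s
        d = delta + lam * p2     # scale delta_k
        if d <= 0:               # make the Hessian positive definite
            lam = 2 * (lam - d / p2)
            d = -d + lam * p2
        mu = p @ r
        alpha = mu / d           # step size alpha_k
        Delta = 2 * d * (E(w) - E(w + alpha * p)) / mu**2  # comparison param
        if Delta >= 0:           # successful step: update weights
            w = w + alpha * p
            r_new = -dE(w)
            success = True
            if k % M == 0:       # periodic restart in steepest descent
                p = r_new.copy()
            else:                # new conjugate direction
                p = r_new + ((r_new @ r_new - r_new @ r) / mu) * p
            r = r_new
            if Delta >= 0.75:
                lam /= 4         # quadratic model is good: shrink scale
        else:
            success = False      # reject the step
        if Delta < 0.25:
            lam *= 4             # model is poor: grow scale (slide's update)
    return w
```

On a simple quadratic error this converges in a couple of iterations, since the finite-difference second-order information is exact there.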

Kalman Smoothing

Kalman filtering is used to smooth the noisy trajectory estimates from the ANNs
TV trajectories are modeled as the output of a dynamic system

State space representation: [equations shown as images in the original slide]

Parameters:
Γ: time difference (ms) between two consecutive measurements
ω_k: process noise
ν_k: measurement noise

Slide12

Kalman Smoothing [h]

Recursive estimator

Predict phase:
  Predicted state estimate
  Predicted estimate covariance

Update phase:
  S_k = residual covariance
  K_k = optimal Kalman gain
  Update the state estimate
  Update estimate covariance

h. Kalman, 1960.

Slide13
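The predict/update recursion above can be sketched for a position-velocity state observed through noisy 1-D measurements. The state-space matrices and noise variances below are illustrative assumptions, not the project's actual model:

```python
import numpy as np

def kalman_filter(z, gamma=0.01, q=1e-3, r=0.1):
    """Minimal linear Kalman filter illustrating the predict/update phases.
    z: noisy 1-D measurements; gamma: time step between measurements (s);
    q, r: assumed process / measurement noise variances."""
    F = np.array([[1.0, gamma], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])                # observe position only
    Q = q * np.eye(2)                         # process noise covariance
    R = np.array([[r]])                       # measurement noise covariance
    x = np.array([z[0], 0.0])                 # initial state estimate
    P = np.eye(2)                             # initial estimate covariance
    out = []
    for zk in z:
        # Predict phase
        x = F @ x                             # predicted state estimate
        P = F @ P @ F.T + Q                   # predicted estimate covariance
        # Update phase
        S = H @ P @ H.T + R                   # residual covariance S_k
        K = P @ H.T @ np.linalg.inv(S)        # optimal Kalman gain K_k
        x = x + (K @ (zk - H @ x)).ravel()    # update the state estimate
        P = (np.eye(2) - K @ H) @ P           # update estimate covariance
        out.append(x[0])
    return np.array(out)
```

A Kalman smoother would additionally run a backward pass over these filtered estimates; the forward recursion shown here is the core shared by both.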

Implementation

Python
Scientific libraries:
  FANN (Fast Artificial Neural Network)
  Neurolab
  PyBrain
Deepthought/Deepthought2 high-performance computing clusters

Slide14

Test Problem

Synthetic data set (420 words) as model input [x, t]
Data sampled over nine 10-ms windows
Generated from a speech production model at Haskins Laboratories (Yale Univ.)
TV trajectories generated by the TAsk Dynamics Application (TADA) model
Reproduce estimates of root mean square error (RMSE) and Pearson product-moment correlation coefficient (PPMC)

Slide15
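The two error measures to be reproduced can be computed directly from an estimated and a reference trajectory; the function and variable names are illustrative:

```python
import numpy as np

def rmse(est, ref):
    """Root mean square error between estimated and reference trajectories."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    return float(np.sqrt(np.mean((est - ref) ** 2)))

def ppmc(est, ref):
    """Pearson product-moment correlation coefficient of two trajectories."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    e, r = est - est.mean(), ref - ref.mean()
    return float(np.sum(e * r) / np.sqrt(np.sum(e**2) * np.sum(r**2)))
```

RMSE measures absolute deviation while PPMC measures shape agreement, which is why both are reported for TV trajectory estimation.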

Validation Methods

New real data set:
  47 American-English speakers
  56 tasks per speaker
  Obtained from the Univ. of Wisconsin's X-ray Microbeam Speech Production database
Feed data through the model
Compare error estimates
Obtain visual trajectories

Slide16

Milestones

Build an FF-ANN
Implement Kalman smoothing
Use synthetic data to test the FF-ANN
Build a recurrent ANN
Implement smoothing (if necessary)
Test the RANN using real data

Slide17

Timeline

This semester: Build and test an FF-ANN
  October: Research and start implementation.
  November: Finish implementation and incorporate Kalman smoothing.
  December: Test and compile results using synthetic data.
Next semester: Build and test a recurrent ANN
  January-February: Research and begin implementation (modifying the FF-ANN).
  March: Finish implementation. Begin testing.
  April: Modifications (as necessary) and further testing.
  May: Finalize and collect results.

Slide18

Deliverables

Proposal presentation and report
Mid-year presentation and report
Final presentation and report
FF-ANN code
Recurrent ANN code
Synthetic data set
Real acoustic data set

Slide19

Bibliography

1. Atal, B. S., J. J. Chang, M. V. Mathews, and J. W. Tukey. "Inversion of Articulatory-to-Acoustic Transformation in the Vocal Tract by a Computer-Sorting Technique." The Journal of the Acoustical Society of America 63.5 (1978): 1535-1553.
2. Bengio, Yoshua. "Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)." Notes de cours IFT6266, Hiver 2010. 2 Apr. 2010. Web. 4 Oct. 2014.
3. Jin, Liang, and M. M. Gupta. "Stable Dynamic Backpropagation Learning in Recurrent Neural Networks." IEEE Transactions on Neural Networks 10.6 (1999): 1321-1334. Web. 4 Oct. 2014. <http://www.maths.tcd.ie/~mnl/store/JinGupta1999a.pdf>.
4. Jordan, Michael I., and David E. Rumelhart. "Forward Models: Supervised Learning with a Distal Teacher." Cognitive Science 16 (1992): 307-354. Web. 4 Oct. 2014.
5. Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1960): 35-45. Web. 4 Oct. 2014.

Slide20

Bibliography (continued)

6. Mitra, Vikramjit. Improving Robustness of Speech Recognition Systems. Dissertation, University of Maryland, College Park. 2010.
7. Mitra, V., I. Y. Ozbek, Hosung Nam, Xinhui Zhou, and C. Y. Espy-Wilson. "From Acoustics to Vocal Tract Time Functions." Acoustics, Speech, and Signal Processing, 2009. ICASSP 2009. (2009): 4497-4500. Print.
8. Moller, M. "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning." Neural Networks 6 (1993): 525-533. Web. 4 Oct. 2014.
9. Nielsen, Michael. "Neural Networks and Deep Learning." Determination Press, 1 Sept. 2014. Web. 4 Oct. 2014.
10. Papcun, George. "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-ray Microbeam Data." The Journal of the Acoustical Society of America (1992): 688. Web. 4 Oct. 2014.

Slide21

Bibliography (continued)

All images taken from Mitra, Vikramjit, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. "Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies." IEEE Journal of Selected Topics in Signal Processing 4.6 (2010): 1027-1045. Print.

Espy-Wilson, Carol. Presentation at Interspeech 2013.
Espy-Wilson, Carol. Unpublished results.

Sound clips courtesy of I Know That Voice. 2013. Film.

Slide22

Thanks!

QUESTIONS?