Slide1
Estimating Tract Variables from Acoustics via Machine Learning
Christiana Sabett
Applied Math, Applied Statistics, and Scientific Computing (AMSC)
October 7, 2014
Advisor: Dr. Carol Espy-Wilson
Electrical and Computer Engineering
Slide2
Introduction
Automatic speech recognition (ASR) systems remain inadequate in their current forms.
Coarticulation: the overlap of articulatory actions in the vocal tract
Slide3
Tract Variables (a)
Articulatory information: information from the organs along the vocal tract
Tract variables (TVs): vocal tract constriction variables relaying information of a physical trajectory in time
Lip Aperture (LA)
Lip Protrusion (LP)
Tongue Tip Constriction Degree (TTCD)
Tongue Tip Constriction Location (TTCL)
Tongue Body Constriction Degree (TBCD)
Tongue Body Constriction Location (TBCL)
Velum (VEL)
Glottis (GLO)
a. Mitra et al., 2010.
Slide4
Tract Variables
TVs are consistent in the presence of coarticulation.
TVs can improve the robustness of automatic speech recognition.
[Figure: spectrogram (frequency vs. time, 0–8000 Hz) and waveform with TB, TT, and LA trajectories over 0–0.8 s for the clearly articulated phrase "perfect memory"; phoneme labels: K M P ER EH T EH ER IY M F]
Slide5
Project Goal
Effectively estimate TV trajectories using artificial neural networks, implementing Kalman smoothing when necessary.
Slide6
Approach
Artificial neural networks (ANNs) (b)
Feed-forward ANN (FF-ANN)
Recurrent ANN (RANN)
Motivation: Speech inversion is a many-to-one mapping (c)
ANNs can map m inputs to n outputs
Retroflex /r/
Bunched /r/
b. Papcun, 1992.
c. Atal et al., 1978.
Slide7
FF-ANN Structure (d)
3 hidden layers
Each node has a sigmoidal activation function f(x) = tanh(x)
Weights w, biases b
Input: acoustic feature vector (9x20 or 9x13)
Output: g_k, an estimate of the TV trajectories at time k (dimension 8x1)
g_k is a nonlinear composition of the activation functions
d. Mitra, 2010.
Slide8
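The forward pass of the network described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the hidden-layer width (100) and the weight initialization scale are assumptions; the input assumes the 9x20 acoustic feature window flattened to a 180-dimensional vector.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Small random weights, zero biases (initialization scheme is an assumption)
    return rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out)

def forward(x, layers):
    """Forward pass: tanh on the hidden layers, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(W @ h + b)       # f(x) = tanh(x) sigmoidal activation
    W, b = layers[-1]
    return W @ h + b                 # g_k: 8-dimensional TV estimate at time k

rng = np.random.default_rng(0)
sizes = [180, 100, 100, 100, 8]      # 9x20 input window -> 3 hidden layers -> 8 TVs
layers = [init_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]

x = rng.standard_normal(180)         # one flattened acoustic feature vector
g_k = forward(x, layers)
print(g_k.shape)                     # (8,)
```

The output g_k is a nonlinear composition of the per-layer activations, as the slide states: each hidden layer applies tanh to an affine map of the previous layer's output.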
Cost Function
Networks trained by minimizing the sum-of-squares error E_SE
Training data [x, t] (N = 315 words) (e)
Output of the network, g_k, is the predicted TV trajectory estimated by position at each time step k
Weights and biases updated using the scaled conjugate gradient algorithm and dynamic backpropagation to reduce E_SE
e. Mitra, 2010.
Slide9
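The sum-of-squares error over predicted and target trajectories can be computed as below; the 1/2 factor is a common convention (an assumption here, not taken from the slides).

```python
import numpy as np

def sum_of_squares_error(predictions, targets):
    """E_SE = 1/2 * sum over time steps k of ||g_k - t_k||^2.

    predictions, targets: arrays of shape (K, d) -- K time steps, d TVs.
    The 1/2 factor is a convention that simplifies the gradient.
    """
    diff = np.asarray(predictions) - np.asarray(targets)
    return 0.5 * np.sum(diff ** 2)

g = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy predicted trajectory
t = np.array([[1.0, 1.0], [3.0, 2.0]])  # toy target trajectory
print(sum_of_squares_error(g, t))       # 0.5 * (1 + 4) = 2.5
```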
Dynamic Backpropagation (f)
[Update equations shown as images in the original slides; not reproduced here]
f. Jin, Liang, and M. M. Gupta, 1999.
Slide10
Scaled Conjugate Gradient (SCG) (g)
Choose a weight vector w_1 and scalars. Let p_1 = r_1 = -E'_SE(w_1).
While the steepest-descent direction r_k ≠ 0:
  If success = true, calculate second-order information.
  Scale s_k, the finite-difference approximation to the second derivative. If δ_k ≤ 0, make the Hessian positive definite.
  Calculate the step size α_k = (p_k^T r_k) / δ_k.
  Calculate the comparison parameter Δ_k.
  If Δ_k ≥ 0: w_{k+1} = w_k + α_k p_k, r_{k+1} = -E'_SE(w_{k+1}).
    If k mod M = 0 (M is the number of weights), restart the algorithm: let p_{k+1} = r_{k+1}.
    Else create a new conjugate direction p_{k+1} = r_{k+1} + β_k p_k.
  If Δ_k < 0.25, increase the scale parameter: λ_k = 4λ_k.
g. Moller, 1993.
Slide11
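The SCG steps above can be sketched in code. This is a simplified version of Moller's algorithm applied to a toy quadratic rather than the network error; the initial λ, the finite-difference step σ, and the λ-shrinking rule for Δ ≥ 0.75 are assumptions, and the λ̄ bookkeeping from the original paper is folded into a single λ.

```python
import numpy as np

def scg(f, grad, w, max_iter=200, sigma=1e-4, lam=1e-6):
    """Minimal scaled conjugate gradient sketch (after Moller, 1993)."""
    n = w.size
    r = -grad(w)                 # steepest-descent direction r_1 = -E'(w_1)
    p = r.copy()
    success = True
    for k in range(1, max_iter + 1):
        p_norm2 = p @ p
        if np.sqrt(p_norm2) < 1e-12:
            break                # r_k (hence p_k) has vanished: converged
        if success:
            # Second-order information via a finite difference along p
            sk = sigma / np.sqrt(p_norm2)
            s = (grad(w + sk * p) - grad(w)) / sk
            delta = p @ s
        delta += lam * p_norm2   # scale delta
        if delta <= 0:           # make the effective Hessian positive definite
            lam = 2 * (lam - delta / p_norm2)
            delta = -delta + lam * p_norm2
        mu = p @ r
        alpha = mu / delta       # step size alpha_k = (p_k^T r_k) / delta_k
        # Comparison parameter: predicted vs. actual error reduction
        Delta = 2 * delta * (f(w) - f(w + alpha * p)) / mu**2
        if Delta >= 0:           # successful step
            w = w + alpha * p
            r_new = -grad(w)
            if k % n == 0:       # restart every n steps (n = number of weights)
                p = r_new.copy()
            else:                # new conjugate direction
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p
            r = r_new
            success = True
            if Delta >= 0.75:
                lam *= 0.25      # quadratic model is good: reduce damping
        else:
            success = False      # reject step, retry with larger lambda
        if Delta < 0.25:
            lam *= 4             # increase scale parameter, per the slide
    return w

# Toy problem: minimize f(w) = 0.5 w^T A w - b^T w (solution solves A w = b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_opt = scg(lambda w: 0.5 * w @ A @ w - b @ w,
            lambda w: A @ w - b, np.zeros(2))
print(np.allclose(w_opt, np.linalg.solve(A, b), atol=1e-5))  # True
```

On a quadratic the finite-difference product s ≈ A p is exact, so the method reduces to conjugate gradients and converges in at most n steps; on the network's non-quadratic E_SE, the λ damping and the Δ test keep the steps trustworthy.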
Kalman Smoothing
Kalman filtering is used to smooth the noisy trajectory estimates from the ANNs.
TV trajectories modeled as the output of a dynamic system
State-space representation: [shown as an image in the original slides]
Parameters:
Γ: time difference (ms) between two consecutive measurements
ω_k: process noise
ν_k: measurement noise
Slide12
Kalman Smoothing (h)
Recursive estimator
Predict phase:
  Predicted state estimate
  Predicted estimate covariance
Update phase:
  S_k: residual covariance
  K_k: optimal Kalman gain
  Update the state estimate
  Update the estimate covariance
h. Kalman, 1960.
Slide13
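The predict/update recursion labeled above has a standard form, sketched below on a toy 1-D trajectory. The random-walk state model and the noise covariances Q and R are assumptions for the demo, not the state-space model from the slides.

```python
import numpy as np

def kalman_filter(z, F, H, Q, R, x0, P0):
    """Standard Kalman filter predict/update recursion.

    z: measurements, shape (K, m); F, H: state/measurement matrices;
    Q, R: process/measurement noise covariances; x0, P0: initial state
    estimate and covariance.
    """
    x, P = x0, P0
    n = x0.size
    out = []
    for zk in z:
        # Predict phase
        x = F @ x                       # predicted state estimate
        P = F @ P @ F.T + Q             # predicted estimate covariance
        # Update phase
        S = H @ P @ H.T + R             # S_k: residual covariance
        K = P @ H.T @ np.linalg.inv(S)  # K_k: optimal Kalman gain
        x = x + K @ (zk - H @ x)        # update the state estimate
        P = (np.eye(n) - K @ H) @ P     # update the estimate covariance
        out.append(x.copy())
    return np.array(out)

# Demo: smooth a noisy 1-D trajectory with a random-walk state model
rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 2 * np.pi, 100))
noisy = truth + 0.3 * rng.standard_normal(100)
est = kalman_filter(noisy.reshape(-1, 1),
                    F=np.eye(1), H=np.eye(1),
                    Q=np.eye(1) * 0.01, R=np.eye(1) * 0.09,
                    x0=np.zeros(1), P0=np.eye(1))
print(np.mean((est[:, 0] - truth) ** 2) < np.mean((noisy - truth) ** 2))  # True
```

The ratio Q/R controls the smoothing strength: a smaller Q trusts the dynamic model more and smooths harder, at the cost of lagging fast-moving trajectories.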
Implementation
Python
Scientific libraries:
FANN (Fast Artificial Neural Network)
Neurolab
PyBrain
Deepthought/Deepthought2 high-performance computing clusters
Slide14
Test Problem
Synthetic data set (420 words) as model input [x, t]
Data sampled over nine 10-ms windows
Generated from a speech production model at Haskins Laboratories (Yale Univ.)
TV trajectories generated by the TAsk Dynamics Application (TADA) model
Reproduce estimates of root mean square error (RMSE) and Pearson product-moment correlation coefficient (PPMC)
Slide15
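Both evaluation metrics named above are standard and can be computed directly; the toy trajectories below are illustrative only.

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error between predicted and reference trajectories."""
    pred, target = np.asarray(pred), np.asarray(target)
    return np.sqrt(np.mean((pred - target) ** 2))

def ppmc(pred, target):
    """Pearson product-moment correlation coefficient."""
    pred, target = np.asarray(pred), np.asarray(target)
    return np.corrcoef(pred, target)[0, 1]

t = np.linspace(0, 1, 50)
target = np.sin(2 * np.pi * t)     # toy reference TV trajectory
pred = target + 0.1                # constant offset: correlation stays perfect
print(round(rmse(pred, target), 3))   # 0.1
print(round(ppmc(pred, target), 3))   # 1.0
```

The pair is informative precisely because it decouples two failure modes: RMSE penalizes absolute error (including a constant bias, as here), while PPMC measures only whether the shape of the trajectory is tracked.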
Validation Methods
New real data set:
47 American-English speakers
56 tasks per speaker
Obtained from the Univ. of Wisconsin's X-Ray Microbeam Speech Production Database
Feed data through the model
Compare error estimates
Obtain visual trajectories
Slide16
Milestones
Build a FF-ANN
Implement Kalman smoothing
Use synthetic data to test the FF-ANN
Build a recurrent ANN
Implement smoothing (if necessary)
Test the recurrent ANN using real data
Slide17
Timeline
This semester: Build and test an FF-ANN
October: Research and start implementation.
November: Finish implementation and incorporate Kalman smoothing.
December: Test and compile results using synthetic data.
Next semester: Build and test a recurrent ANN
January-February: Research and begin implementation (modifying the FF-ANN).
March: Finish implementation. Begin testing.
April: Modifications (as necessary) and further testing.
May: Finalize and collect results.
Slide18
Deliverables
Proposal presentation and report
Mid-year presentation/report
Final presentation/report
FF-ANN code
Recurrent ANN code
Synthetic data set
Real acoustic data set
Slide19
Bibliography
Atal, B. S., J. J. Chang, M. V. Mathews, and J. W. Tukey. "Inversion of Articulatory-to-Acoustic Transformation in the Vocal Tract by a Computer-Sorting Technique." The Journal of the Acoustical Society of America 63.5 (1978): 1535-1553.
Bengio, Yoshua. "Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)." Notes de cours IFT6266, Hiver 2010. 2 Apr. 2010. Web. 4 Oct. 2014.
Jin, Liang, and M. M. Gupta. "Stable Dynamic Backpropagation Learning in Recurrent Neural Networks." IEEE Transactions on Neural Networks 10.6 (1999): 1321-1334. Web. 4 Oct. 2014. <http://www.maths.tcd.ie/~mnl/store/JinGupta1999a.pdf>.
Jordan, Michael I., and David E. Rumelhart. "Forward Models: Supervised Learning with a Distal Teacher." Cognitive Science 16 (1992): 307-354. Web. 4 Oct. 2014.
Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1960): 35-45. Web. 4 Oct. 2014.
Slide20
Bibliography (cont.)
Mitra, Vikramjit. Improving Robustness of Speech Recognition Systems. Dissertation, University of Maryland, College Park. 2010.
Mitra, V., I. Y. Ozbek, Hosung Nam, Xinhui Zhou, and C. Y. Espy-Wilson. "From Acoustics to Vocal Tract Time Functions." Acoustics, Speech, and Signal Processing, 2009. ICASSP 2009. (2009): 4497-4500. Print.
Moller, M. "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning." Neural Networks 6 (1993): 525-533. Web. 4 Oct. 2014.
Nielsen, Michael. "Neural Networks and Deep Learning." Determination Press, 1 Sept. 2014. Web. 4 Oct. 2014.
Papcun, George. "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-ray Microbeam Data." The Journal of the Acoustical Society of America (1992): 688. Web. 4 Oct. 2014.
Slide21
Bibliography (cont.)
All images taken from Mitra, Vikramjit, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. "Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies." IEEE Journal of Selected Topics in Signal Processing 4.6 (2010): 1027-1045. Print.
Espy-Wilson, Carol. Presentation at Interspeech 2013.
Espy-Wilson, Carol. Unpublished results.
Sound clips courtesy of I Know That Voice. 2013. Film.
Slide22
Thanks!
QUESTIONS?