Slide1
Dual-domain Hierarchical Classification of Phonetic Time Series
Hossein Hamooni, Abdullah Mueen
University of New Mexico
Department of Computer Science
Slide2
What is a Phoneme?
Phonemes are very small units of intelligible sound (usually less than 200 ms).
Phonetic spelling is the sequence of phonemes that a word comprises.
Examples:
coat ([kōt] /K OW T/)
from ([frəm] /F R AH M/)
impressive ([imˈpresiv] /IH M P R EH S IH V/)
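As a small illustration of phonetic spelling, the examples above can be stored as a word-to-phoneme lookup. This is only a sketch: the dictionary name, the `spell` helper, and the three-word lexicon are hypothetical and simply copy the entries shown on this slide.

```python
# Hypothetical mini-lexicon; the entries are copied from the slide's examples.
PHONETIC_SPELLING = {
    "coat": ["K", "OW", "T"],
    "from": ["F", "R", "AH", "M"],
    "impressive": ["IH", "M", "P", "R", "EH", "S", "IH", "V"],
}


def spell(word):
    """Return the phoneme sequence for a word (case-insensitive lookup)."""
    return PHONETIC_SPELLING[word.lower()]


if __name__ == "__main__":
    for w in ("Coat", "from", "impressive"):
        print(w, "->", " ".join(spell(w)))
```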
Slide3
Phoneme Classification
What is phoneme classification?
Input: A short segment of audio signal.
Output: What phoneme it is.
Phoneme classification is a complex task:
More than 100 classes (based on the International Phonetic Alphabet)
Variation in speakers, dialects, accents, noise in the environment, etc.
Phoneme classification can be used in:
Robust speech recognition
Accent/dialect detection
Speech quality scoring
Slide4
Related Work
Different methods for phoneme classification have been used in the literature:
Hidden Markov model [Lee, 1989]
Neural network [Schwarz, 2009]
Deep belief network [Mohamed, 2012]
Support vector machine [Salomon, 2001]
Hierarchical methods [Dekel, 2005]
Boltzmann machine [Mohamed, 2010]
Although the data mining community has shown that k-NN classifiers can work well on time series data, they have not yet been tried on phonemes.
[C. Lopes, F. Perdigao, 2011]
Slide5
Our Dual-domain Approach
Time Domain:
Using k-NN
Dynamic Time Warping (DTW)
Expensive
Speed up by lower bounding techniques
Frequency Domain:
Using k-NN
Euclidean distance between Mel-frequency cepstral coefficients (MFCC)
Fast
Slide6
Real Example
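As a toy counterpart to the frequency-domain side of the approach (k-NN with Euclidean distance), here is a minimal sketch. It assumes each phoneme has already been reduced to a fixed-length feature vector such as MFCCs. The function names, the vectors, and the labels are illustrative only and are not the authors' data or code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_classify(train, query, k=3):
    """Majority vote among the k training vectors closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up 2-dimensional "MFCC-like" vectors for two phoneme labels.
    train = [
        ([0.0, 0.1], "AH"),
        ([0.1, 0.0], "AH"),
        ([1.0, 1.1], "IY"),
        ([0.9, 1.0], "IY"),
        ([1.1, 0.9], "IY"),
    ]
    print(knn_classify(train, [0.05, 0.05], k=3))
    print(knn_classify(train, [1.0, 1.0], k=3))
```

Because every vector has the same length, each comparison costs time linear in the vector size, which is consistent with the "Fast" label given to the frequency domain.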
Slide7
Challenge
DTW is expensive (quadratic in time and space complexity)
We need to apply a speed-up technique
Solution: Lower bounding techniques
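To make the cost concrete, the following is a minimal sketch of DTW between two 1-D signals, restricted to a warping window w so that sample i may only align with sample j when |i - j| <= w. The squared-difference cost, the square root at the end, and the name `dtw_distance` are assumptions made for this illustration, not details taken from the slides. The sketch keeps only two rows of the cost table, so it avoids the quadratic memory of the full table, but the number of cell updates still grows with the product of the signal length and the window.

```python
import math


def dtw_distance(x, y, w):
    """DTW distance between sequences x and y with warping window w.

    Cells (i, j) with |i - j| > w are never visited. Returns math.inf
    when the window is too narrow to connect the two end points.
    """
    n, m = len(x), len(y)
    if abs(n - m) > w:
        return math.inf
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of: insertion, match, deletion.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return math.sqrt(prev[m])


if __name__ == "__main__":
    # The repeated 1 in the first signal is absorbed by warping.
    print(dtw_distance([0, 1, 1, 2], [0, 1, 2], w=1))  # 0.0
```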
Slide8
DTW Lower bounding
Resampling to equal length doesn’t always work!
Slide9
DTW Lower bounding
We use the prefix of the longer signal (Prefixed LB_Keogh)
We show that Prefixed LB_Keogh is a lower bound if:
w > difference between the lengths of the two signals
We set w = c * length of the longer signal
We ignore all pairs of signals that don’t satisfy the above condition.
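The rules above can be sketched as follows. This is one plausible reading, not the authors' implementation: the longer signal is truncated to the length of the shorter one, an LB_Keogh envelope of half-width w is built around the shorter signal, and the squared amounts by which the prefix falls outside that envelope are accumulated. Which signal receives the envelope, the squared-difference cost, and the names `window_size` and `prefixed_lb_keogh` are all assumptions. The bound is meant to be compared against a DTW that uses the same window w and the same cost.

```python
import math


def window_size(x, y, c_percent=30):
    """w = c * length of the longer signal, with c given in percent."""
    return (c_percent * max(len(x), len(y))) // 100


def prefixed_lb_keogh(x, y, w):
    """Lower-bound sketch using the prefix of the longer signal.

    Returns None when w is not greater than the length difference,
    i.e. for pairs that are ignored.
    """
    shorter, longer = (x, y) if len(x) <= len(y) else (y, x)
    n = len(shorter)
    if w <= len(longer) - n:
        return None
    prefix = longer[:n]
    total = 0.0
    for j, value in enumerate(prefix):
        # Envelope of the shorter signal over the window around j.
        lo, hi = max(0, j - w), min(n, j + w + 1)
        upper = max(shorter[lo:hi])
        lower = min(shorter[lo:hi])
        if value > upper:
            total += (value - upper) ** 2
        elif value < lower:
            total += (lower - value) ** 2
    return math.sqrt(total)


if __name__ == "__main__":
    a, b = [0, 1, 2], [0, 5, 2, 9]
    print(prefixed_lb_keogh(a, b, w=2))  # 3.0
```

In a nearest-neighbor search, a bound like this is typically computed first; the full DTW is only run when the bound is smaller than the best distance found so far.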
[Figure: Speedup vs. Training Set Size (x 10^4)]
[Figure: Accuracy (%) vs. Window Size (c%); c = 30%]
Slide10
Data Collection
370,000 phonemes are segmented from:
Data is publicly available.
Slide11
Phoneme Segmentation
The Penn Phonetics Lab Forced Aligner (p2fa)
Takes a signal and a transcript
Produces timing segmentations (word level and phoneme level)
Slide12
Accuracy (All layers)
10-fold cross validation
100 random phonemes in each fold
Slide13
Accented Phoneme Classification
[Figure: Accuracy vs. Training Set Size (x 10^4), for MFCC and DTW]
British vs. American accent
Using Oxford test set
2-class classification problem
No hierarchy
Slide14
Conclusion
We present a dual-domain hierarchical method for phoneme classification.
We generate a novel dataset of 370,000 phonemes.
We achieve an accuracy of up to 73% for 39 classes.
Our lower bounding technique gives us up to a 3X speedup.
Slide15
Thank You
Data and code available at:
http://cs.unm.edu/~hamooni/papers/Dual_2014