Voiced-Unvoiced-Silence Classifications Using Hybrid Features and a Network Classifier
Yingyong Qi and Bobby Hunt
IEEE Transactions on Speech and Audio Processing

In principle, multilayer feedforward networks (MFN's) function more similarly to the nonparametric method than to the parametric method because they partition the feature space using hyperplanes and perform pattern classification accordingly. Their unique advantage is that the decision rule is much more easily synthesized than in both the parametric and nonparametric methods. A network implementation of the classifier also promotes the perspective of building classification hardware with adaptive training. Note that a recent discovery has illuminated the relationship between the network decision rule and the parametric distribution assumptions of an optimum Bayesian classifier: a recent article [10] has shown that the outputs of an MFN approximate the probability density functions of the classes being trained, and that this behavior is independent of any particular network architecture, i.e., of the number of layers, processing nodes, and connection geometry. Thus, we are justified in using the training algorithm of an MFN, which is convenient and routine in application, without sacrificing the desirable properties associated with a Bayesian decision process.

A block diagram of the network training and classification process is given in Fig. 1. Speech signals were low-pass filtered, sampled, and quantized with 16-bit accuracy. The digitized signals were further high-pass filtered with a fourth-order Butterworth digital filter to eliminate low-frequency noise. A feature vector was obtained for each short segment of speech. Each feature vector was a combination of cepstral coefficients and two waveform parameters: the zero-crossing rate and a nonlinear function of the rms energy. The cepstral coefficients were derived from LPC coefficients; the autocorrelation method and a Hamming window were used in the LPC calculation. The inverse square-root function was applied to the rms energy to compress its numerical range. An example set of training samples and the average feature vectors of each sound category are shown for illustration.

After training, a classification was made for each input feature vector. The classification output was further decoded and passed through a three-point median filter to eliminate isolated "impulse" errors. The network was trained using the generalized delta rule (back propagation of error); a momentum term was added in updating the weights [12]. The training would not terminate until the error was less than a preset threshold and the error difference between consecutive training iterations was also less than a threshold, or until a maximum total number of training iterations was reached.

The input and output layers of the network had a fixed number of processing elements (PE's). There were 15 PE's in the input layer, matching the dimension of the feature vector (13 cepstral coefficients and 2 waveform parameters), and 3 PE's in the output layer. The output vector was coded [100] for voiced sound, [010] for unvoiced sound, and [001] for silence; this coding was selected to maximize the differences between categories. Because the minimum and maximum of the activation function could only be approached asymptotically, the extreme target values were replaced by values slightly inside the range in practical calculations. The overall architecture, i.e., the number of hidden layers and the number of nodes per hidden layer, was a parameter to be determined in the experimental evaluation of the network. Network performance as a function of the size of the training set and of the signal-to-noise ratio was also evaluated and compared with that of a Bayesian, maximum-likelihood classifier.
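The following is a minimal sketch, not the authors' code, of the training and decoding steps just described: a 15-20-3 feedforward network trained with the generalized delta rule plus a momentum term, the [1,0,0]/[0,1,0]/[0,0,1] output coding, and a three-point median filter over the frame-by-frame decisions. The learning rate, momentum value, initialization, and stopping logic are placeholder assumptions, and the 15-dimensional feature vectors and labels are assumed to be supplied by the caller.

```python
# Sketch of a 15-20-3 MFN for voiced/unvoiced/silence classification.
# Hyperparameters and initialization are assumptions, not values from the paper.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class VUSNetwork:
    def __init__(self, n_in=15, n_hidden=20, n_out=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        # previous weight updates, kept for the momentum term
        self.prev = [np.zeros_like(p) for p in (self.W1, self.b1, self.W2, self.b2)]

    def forward(self, x):
        h = sigmoid(x @ self.W1 + self.b1)   # hidden activations
        y = sigmoid(h @ self.W2 + self.b2)   # output activations
        return h, y

    def train_step(self, x, target, lr=0.1, momentum=0.9):
        h, y = self.forward(x)
        # deltas for squared error with sigmoid units (generalized delta rule)
        d_out = (y - target) * y * (1.0 - y)
        d_hid = (d_out @ self.W2.T) * h * (1.0 - h)
        grads = [np.outer(x, d_hid), d_hid, np.outer(h, d_out), d_out]
        params = [self.W1, self.b1, self.W2, self.b2]
        for i, (p, g) in enumerate(zip(params, grads)):
            update = -lr * g + momentum * self.prev[i]   # momentum term
            p += update                                   # in-place weight update
            self.prev[i] = update
        return 0.5 * np.sum((y - target) ** 2)            # frame error

# Output coding described in the text: voiced, unvoiced, silence.
TARGETS = {"voiced":   np.array([1.0, 0.0, 0.0]),
           "unvoiced": np.array([0.0, 1.0, 0.0]),
           "silence":  np.array([0.0, 0.0, 1.0])}

def classify(net, frames):
    """Per-frame argmax decision followed by three-point median smoothing."""
    labels = np.array([int(np.argmax(net.forward(f)[1])) for f in frames])
    smoothed = labels.copy()
    for i in range(1, len(labels) - 1):
        smoothed[i] = int(np.median(labels[i - 1:i + 2]))
    return smoothed
```

Training would loop `train_step` over the labeled feature vectors until the error and its change between iterations fall below thresholds, mirroring the stopping rule described above.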
A group of speakers (men and women) provided speech samples for evaluating the performance of the classifier. The speech samples were three-digit numbers and the rainbow paragraph. Recordings were made in an office environment. The speech recordings were pre-processed (see Fig. 1) and were interactively labeled into the three sound categories using waveform and spectrographic displays and audio output. Category membership was assigned largely on the basis of the acoustic features of a sound; the phonemic content was taken only as a reference. For example, when a voiced fricative was devoiced according to its acoustic features (reduced periodicity and increased high-frequency noise), it was labeled as unvoiced. The network classification rate was obtained using a three-step procedure: (1) a set of training samples of a given size was randomly selected from the labeled data; (2) all data samples (excluding the training samples) were classified; and (3) an error was counted whenever the network classification differed from the manual classification.

Network Architecture

Because a method for the optimal selection of network architecture has not been well established, the objective here was to empirically select a network that had a simple architecture and reasonably high classification performance. An extensive search for an optimal network was not undertaken. Based on previous works, the number of hidden nodes was initially set to 15 and was increased from 15 to 40 in increments. Classification rates were obtained for these single-hidden-layer networks as well as for double-hidden-layer networks. Each network was trained with a set of randomly selected training samples from each sound category. The classification rate as a function of network architecture is shown in Fig. 3(a). As seen in the figure, the network with a single hidden layer (20 nodes) was a preferable choice in terms of network simplicity and classification rate. In fact, the rate was not significantly altered when the number of hidden nodes or the number of hidden layers was increased. Because classification rates were relatively high for all the networks, a substantial increase of classification rate due to network architecture was not expected. The 15-20-3 network was therefore used in comparing the performance of the network classifier and the Bayesian classifier.

One of the primary objectives of this comparison was to determine how the performance of each classifier would be affected by the size of the training set and by noise corruption. As stated earlier, a large training set is typically required for a reliable Bayesian classifier. Such a requirement, however, is not a mandate for training a network. Thus, it was hypothesized that the performance of the network would not critically depend on the size of the training set, whereas that of the Bayesian classifier would. The Bayesian classifier here was a strict software implementation of the maximum-likelihood algorithm; no additional decision logic was added. Training of the classifier involved the computation of the mean and inverse covariance matrix of the training vectors for each category, and classifications were made based on likelihood ratios. Network training and classification were the same as described above except that the network architecture was fixed and the size of the training set was manipulated. The same sets of a given size were used to train both the network and the Bayesian classifier. The classification rate as a function of training-set size is shown in Fig. 3(b). The results indicated that the performances of both classifiers as a function of training size were similar when the number of training samples was relatively large. When the size of the training set for each category was smaller than the dimension of the feature vector, however, the inverse covariance matrix became ill-conditioned and subsequent classifications could not be computed for the Bayesian classifier. In contrast, a reasonably high classification rate was achieved by the network even when the training set was smaller than the dimension of the feature vector. The insensitivity to the size of the training set is apparently a significant advantage of the network classifier [13], [14].
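A minimal sketch of the maximum-likelihood baseline just described, with illustrative names that are not from the paper: a Gaussian model (mean and covariance) is estimated per sound category from the training vectors, and each test vector is assigned to the category with the largest Gaussian log-likelihood, which is equivalent to the likelihood-ratio test under equal priors. It also shows where the ill-conditioning arises when a category has fewer training samples than the feature dimension.

```python
# Sketch of the Gaussian maximum-likelihood classifier used as the baseline.
import numpy as np

def fit_gaussian_classes(train_vectors, train_labels):
    """Return {label: (mean, inverse covariance, log-determinant)} per category."""
    models = {}
    for c in np.unique(train_labels):
        X = train_vectors[train_labels == c]
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        # With fewer samples per category than the feature dimension, cov is
        # singular and this inversion fails: the ill-conditioning noted above.
        models[c] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
    return models

def classify_ml(models, x):
    """Assign x to the category maximizing the Gaussian log-likelihood."""
    best_label, best_ll = None, -np.inf
    for c, (mu, inv_cov, log_det) in models.items():
        d = x - mu
        ll = -0.5 * (d @ inv_cov @ d) - 0.5 * log_det
        if ll > best_ll:
            best_label, best_ll = c, ll
    return best_label
```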
As is well known, voiced-unvoiced-silence classification is susceptible to noise corruption because unvoiced speech is itself noise-like and the corruptive noise will significantly obscure the distinction between silence and unvoiced speech. It is interesting, however, to compare how the network and the Bayesian classifier would withstand noise corruption. The noise added was Gaussian random noise whose variance was adjusted to control the signal-to-noise ratio. Training samples (30 from each sound category) were randomly selected after noise of an appropriate level (which depended on the signal level of each speaker) was added. The network training and classification processes remained the same as described above. The classification results as a function of signal-to-noise ratio are shown in Fig. 4(a). As can be seen, both classifiers were degraded at a comparable rate as the signal-to-noise ratio was reduced, and both practically failed when the signal-to-noise ratio became very low. To demonstrate the advantages of using hybrid features, the classification rate as a function of signal-to-noise ratio was also computed when only the cepstral coefficients were used as the feature vector. Those results are presented in Fig. 4(b).

Speaker-Dependent and Speaker-Independent Classification

Finally, the performance of the network was evaluated for both speaker-dependent and speaker-independent classifications. For speaker-dependent classification, the network was trained for one speaker and subsequent classification was made for the same speaker. Training samples were segments of speech from each sound category and were excluded from the classification. It was noted that the duration of network training was a function of both the training samples and the speaker: the more typical (far away from class boundaries) the samples were, the shorter the training time needed. The number of iterations also differed significantly from one speaker to another. But the final network weights between the input layer and the hidden layer were surprisingly similar among speakers, although the same similarity was not found between the hidden layer and the output layer. The correlations of the weights in each layer and the number of training iterations needed for each speaker are tabulated, as are the classification matrix and rate for the speaker-dependent classification. A high overall classification rate was achieved for the speaker-dependent classification.

For speaker-independent classification, the network was trained with samples from two speakers and subsequent classification was made for all speakers. One male and one female speaker were randomly selected to provide the training samples. Training samples were again segments of speech from each sound category. The classification was made for all speech recordings except the training samples. The overall classification rate for the speaker-independent classification was likewise reasonably high.
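For reference, a minimal sketch of the noise-corruption step used in the evaluation above: zero-mean Gaussian noise is scaled so that the resulting signal-to-noise ratio matches a requested value relative to the measured signal power. The dB convention and the function name are assumptions, not taken from the paper.

```python
# Sketch: add Gaussian noise at a requested signal-to-noise ratio.
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    rng = rng or np.random.default_rng()
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))   # assumed dB convention
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: corrupt a recording at 10 dB SNR before feature extraction.
# noisy = add_noise_at_snr(clean_recording, snr_db=10.0)
```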
Fig. Speech waveform and its classification (middle window), spoken by a male speaker.

The results of the study clearly demonstrate that voiced, unvoiced, and silence classification can be effectively accomplished using a multilayer feedforward network and hybrid features. Unlike the methods previously reported, the network can be effectively trained for the task, and reasonably high classification rates have been achieved. The most significant advantage of the network classifier is that it can be trained with a small set of training samples and yet achieve a reasonably high classification rate. In contrast, a large training set is a prerequisite for building a workable Bayesian classifier. When the size of the training set is limited, as it is in many applications, the network classifier is apparently a preferable alternative. Network classification is also computationally much simpler than Bayesian classification: only a one-pass computation is needed for network classification, whereas a Bayesian classification cannot be made until cross-comparisons with all templates have been completed. Network training, however, can take much longer than the calculation of the means and covariance matrices for the Bayesian classifier. Such a tradeoff should be recognized.

The Bayesian classifier in this study was a straightforward implementation of the maximum-likelihood algorithm. Its classification rate could be higher than demonstrated here if additional decision logic were introduced. Such work was not intended because the Bayesian classifier was primarily used as a comparative baseline, and more elaborate Bayesian classifiers can be found in the literature. It is also worth mentioning that the network classifier could easily be adapted to make the voiced-unvoiced classification only. Informal results indicate that this two-category classification withstands much more noise corruption than the three-category classification.

The observations also indicate that the network training time depends on the selection of training samples. A much longer training time was noted when the training set included samples that were near the boundary between categories than when the training set consisted only of obvious samples from each sound category. The performance of the network under the two circumstances, however, was found to be comparable. Thus, using "typical" samples for training and letting the network make the decision for ambiguous cases is more efficient than trying to make the network accept ambiguous cases as prototypes for classification. The use of "typical" samples for training, however, is not a common approach to building a statistical discrimination function. The effect of including ambiguous samples in network training is currently under investigation.

In conclusion, a procedure was developed for making voiced, unvoiced, and silence classifications of speech using hybrid features and a network classifier. It should provide a useful tool for speech analysis and may have applications in speech-data mixed communication systems.

REFERENCES

[1] B. Atal and S. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., 1971.
[2] B. Atal and L. Rabiner, "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, 1976.
[3] L. Siegel, "A procedure for using pattern classification techniques to obtain a voiced/unvoiced classifier," IEEE Trans. Acoust., Speech, Signal Processing.
[4] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[5] L. Siegel and A. Bessey, "Voiced/unvoiced/mixed excitation classification of speech," IEEE Trans. Acoust., Speech, Signal Processing.
[6] L. Rabiner and M. Sambur, "Application of an LPC distance measure to the voiced-unvoiced-silence detection problem," IEEE Trans. Acoust., Speech, Signal Processing, pp. 338-343, Aug. 1977.
[7] D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. Cambridge, MA: MIT Press, 1986.
[8] G. Fant, "The source filter concept in voice production."
[9] R. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., Apr. 1987.
[10] D. Ruck, S. Rogers, M. Kabrisky, M. Oxley, and B. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Trans. Neural Networks, Dec. 1990.
[11] A. Gray and J. Markel, "Distance measures for speech processing," IEEE Trans. Acoust., Speech, Signal Processing, pp. 381-391, Oct. 1976.
[12] L. Chan and F. Fallside, "An adaptive training algorithm for back propagation networks," Computer Speech and Language, pp. 205-218, 1987.
[13] L. Niles, H. Silverman, G. Tajchman, and M. Bush, "How limited training data can allow a neural network to outperform an 'optimal' statistical classifier," in Proc. ICASSP.
[14] L. Niles and H. Silverman, "... size on relative ... neural network and other pattern classifiers," Tech. Rep., Brown Univ., Providence, RI.
[15] B. Hunt and DeKruger, "Fuzzy back propagation ...," pp. 62-74, 1992.

In a separate correspondence, Bernard Merialdo presents a theorem which shows that the solution found by the Forward-Backward algorithm is guaranteed to lie in the same connected component (of a level set of the polynomial being maximized) as the initial point. This theoretical result implies that, in practice, the choice of the initial point is important. Hidden Markov models are increasingly being used in various domains and, in particular, in speech recognition [7]-[9].
Their popularity comes from the existence of an efficient training procedure which, given an observed output string, allows the values of the parameters (transition and emission probabilities) to be estimated. Known as the Baum-Welch or Forward-Backward algorithm, it is an iterative procedure which starts from an initial point (a set of parameter values) and builds a sequence of reestimates that improve the likelihood of the training data. This sequence converges to a local maximum of the likelihood function. A detailed presentation of the theory and practice of hidden Markov models can be found in the literature. Nadas [10] discusses the use of the Baum-Welch algorithm and makes some remarks on the choice of the initial point.

(Manuscript received 1991; revised July 1992. The associate editor coordinating the review of this correspondence and approving it for publication was Dr. Brian A. Hanson. The author is with IBM. IEEE Log Number 9206396.)

THE BAUM-WELCH ALGORITHM

In the discrete case (when the output symbols belong to a finite alphabet), the convergence of this algorithm follows from the following theorem.

Theorem A: Let P(x) = P({x_ij}) be a polynomial with positive coefficients, homogeneous of degree d, and let x be any point of the domain defined by x_ij >= 0 and sum_j x_ij = 1. Then the point obtained from x by the reestimation (growth) transform does not decrease the value of P.

From Theorem A we can see that when we choose an initial point and build the sequence of iterates, the value of P never decreases along the sequence.
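The statement of Theorem A above is incomplete in this transcript; it appears to be the standard Baum-Eagon growth-transform result. The following reconstruction of that standard statement, including the explicit form of the transform and the resulting iteration, is an assumption added for readability rather than text recovered from the original.

```latex
% Reconstruction of the growth-transform result that Theorem A appears to quote
% (standard Baum--Eagon statement; assumed, not recovered verbatim from the text).
Let $P(x) = P(\{x_{ij}\})$ be a polynomial with nonnegative coefficients,
homogeneous of degree $d$, and let $x$ be any point of the domain
\[
  \mathcal{D} = \Bigl\{ \{x_{ij}\} : x_{ij} \ge 0,\ \textstyle\sum_{j} x_{ij} = 1
  \ \text{for every } i \Bigr\}.
\]
Define the growth transform $\mathcal{T} : \mathcal{D} \to \mathcal{D}$ by
\[
  \mathcal{T}(x)_{ij}
  = \frac{x_{ij}\,\dfrac{\partial P}{\partial x_{ij}}\Bigr|_{x}}
         {\sum_{j'} x_{ij'}\,\dfrac{\partial P}{\partial x_{ij'}}\Bigr|_{x}} .
\]
Then $P(\mathcal{T}(x)) \ge P(x)$, with equality only if $\mathcal{T}(x) = x$.
Starting from an initial point $x^{(0)}$, the iterates
$x^{(n+1)} = \mathcal{T}(x^{(n)})$ therefore form a sequence of reestimates whose
likelihood $P(x^{(n)})$ is nondecreasing; in the hidden Markov model case this
iteration is the Baum--Welch (Forward--Backward) reestimation.
```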