Abstract

A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The "elitist" approach focuses on frames for which neural network (MLP) classifiers are highly confident, and discards the rest. Using this method, it is possible to achieve a frame-level accuracy of 93% for manner information on a corpus of American English sentences passed through a telephone network (NTIMIT). Place information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner-independent system. The elitist framework provides a [...]

The ARTIFEX system is run in two basic modes – (1) feature classification based on the MLP output for all frames ("manner-independent") and (2) manner-specific classification, in which place and other features are extracted separately for each manner class (cf. Sections 4 and 6).

4. Manner-Independent Feature Classification

Table 2 illustrates the efficacy of the ARTIFEX system for the AF dimension of voicing (associated with the distinction between specific classes of stop and fricative segments). The level of classification accuracy is high – 92% for voiced segments and 79% for unvoiced consonants (the lower accuracy reflecting the smaller proportion of unvoiced frames in the training data). Classification accuracy for the non-voicing, place-of-articulation features (Table 3) is considerably lower than for voicing. Accuracy ranges from 11% correct for the "dental" feature (associated with the [th] and [dh] segments) to 79% correct for the feature "alveolar" (the [t], [d], [ch], [jh], [s], [f], [n], [nx], [dx] segments). Classification accuracy ranges between 48% and 82% correct among vocalic segments ("front," "mid" and "back"). Variability in performance reflects to a certain degree the amount of training material available for each feature.

Figure 1. Overview of the MLP-based, articulatory-acoustic feature extraction (ARTIFEX) system.

Table 1. Phonetic segments in the NTIMIT corpus used for training and testing of the ARTIFEX system, tabulated by manner (or height), place, voicing (or tenseness) and static/dynamic feature values for consonants, approximants and vowels. The phonetic orthography is a variant of Arpabet. Segments marked with an asterisk (*) are [+round]; the consonantal segments are marked as "nil" for the feature "tense."

Table 2. Classification performance (percent correct) for the AF dimension of voicing on the NTIMIT corpus (ARTIFEX classification vs. reference labels).
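To make the manner-independent mode concrete, the following is a minimal sketch of frame-level AF classification with one MLP per feature dimension, each frame being labelled on every dimension. The class inventories loosely mirror those in Tables 3 and 4, but the dimensions, network sizes, random initialisation and numpy implementation are illustrative assumptions, not the ARTIFEX code; in the real system each network is trained on labelled NTIMIT frames.

```python
import numpy as np

# Illustrative AF dimensions and class inventories (placeholders, not the exact ARTIFEX label set).
AF_CLASSES = {
    "manner":  ["vocalic", "nasal", "stop", "fricative", "flap", "silence"],
    "voicing": ["voiced", "unvoiced", "silence"],
    "place":   ["labial", "alveolar", "velar", "dental", "glottal",
                "rhotic", "front", "central", "back", "silence"],
}

rng = np.random.default_rng(0)
N_FRAMES, N_DIM = 200, 39                      # e.g. 39-dimensional acoustic features per frame
frames = rng.normal(size=(N_FRAMES, N_DIM))    # stand-in acoustic feature vectors

def make_mlp(n_in, n_hidden, n_out):
    """Randomly initialised single-hidden-layer MLP (stand-in for a trained network)."""
    return {"W1": rng.normal(scale=0.1, size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(scale=0.1, size=(n_hidden, n_out)), "b2": np.zeros(n_out)}

def mlp_posteriors(mlp, x):
    """Forward pass: sigmoid hidden layer, softmax output (per-frame class posteriors)."""
    h = 1.0 / (1.0 + np.exp(-(x @ mlp["W1"] + mlp["b1"])))
    z = h @ mlp["W2"] + mlp["b2"]
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# One network per AF dimension; "manner-independent" mode labels all frames, no selection.
nets = {dim: make_mlp(N_DIM, 50, len(classes)) for dim, classes in AF_CLASSES.items()}
for dim, classes in AF_CLASSES.items():
    post = mlp_posteriors(nets[dim], frames)            # shape (N_FRAMES, n_classes)
    labels = [classes[i] for i in post.argmax(axis=1)]  # winning class per frame
    print(dim, labels[:5])
```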
5. An Elitist Approach to Frame Selection

Place of articulation spans a comparatively large number of classes (plus "silence") in the ARTIFEX system, making it difficult to effectively train networks expert in the classification of each place feature. There are other problems as well. The loci of maximum articulatory constriction for stops differ from those associated with fricatives. And articulatory constriction has a different manifestation for consonants compared to vowels. The number of distinct places of articulation for any given manner class, however, is usually just three or four. Thus, if it were possible to identify manner features with a high degree of assurance, the place-classification system could be organized in a manner-specific fashion that could potentially enhance place-feature extraction performance.

Towards this end, a frame-selection procedure was developed. Frames situated in the center of a phonetic segment tend to be classified with greater accuracy than those close to the segmental borders [2]. This "centrist" bias in feature classification is mirrored in the confidence with which MLPs classify AFs, particularly those associated with manner of articulation. For this reason the output level of a network can be used as an objective metric with which to select the frames most "worthy" of manner designation.

By establishing a network-output threshold of 0.7 (relative to the maximum) for frame selection, it is possible to improve the accuracy of manner-of-articulation classification by between 2% and 14%, thus achieving an accuracy level of 77% to 98% correct for all manner classes except the flaps (53%), as illustrated in Table 4. The overall accuracy of manner classification thereby becomes high enough that it is possible, in principle, to use a manner-specific classification procedure for extracting place-of-articulation features.

The primary disadvantage of this elitist approach concerns the approximately 25% of frames that fall below threshold and are discarded from further consideration. The distribution of discarded frames is uneven across segments; for a portion of segments (16%), all (or nearly all) frames fall below threshold, and it would therefore be difficult to reliably classify AFs associated with such phones. By lowering the threshold it is possible to increase the number of segments containing supra-threshold frames, but at the cost of classification fidelity over all frames. A threshold of 0.7 represents a compromise between a high degree of frame selectivity and the ability to classify AFs for the overwhelming majority of segments.

Table 3. A confusion matrix illustrating classification performance for place-of-articulation features (percent correct) using all frames (i.e., manner-independent mode) in the corpus test set. The data are partitioned into consonantal and vocalic classes (labial, alveolar, velar, dental, glottal, rhotic, front, central, back and silence).
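As a sketch of the frame-selection step described above, the fragment below retains only frames whose winning manner output reaches the 0.7 threshold and discards the rest. It assumes the manner network's outputs are normalised to the 0-1 range (one reading of "relative to the maximum"); the toy posteriors and the function name are illustrative, not part of ARTIFEX.

```python
import numpy as np

def select_elite_frames(manner_posteriors, threshold=0.7):
    """
    Elitist frame selection: keep only frames whose winning manner output
    is at least `threshold` (outputs assumed normalised to the 0-1 range).
    Returns the indices of retained frames and their winning manner class.
    """
    best = manner_posteriors.max(axis=1)         # confidence of the winning class per frame
    winners = manner_posteriors.argmax(axis=1)   # winning manner class per frame
    keep = np.flatnonzero(best >= threshold)     # sub-threshold frames are discarded
    return keep, winners[keep]

# Toy example: 6 frames x 4 manner classes of (already normalised) MLP outputs.
post = np.array([[0.90, 0.05, 0.03, 0.02],   # confident  -> kept
                 [0.40, 0.35, 0.15, 0.10],   # ambiguous  -> discarded
                 [0.75, 0.10, 0.10, 0.05],   # confident  -> kept
                 [0.55, 0.25, 0.15, 0.05],   # ambiguous  -> discarded
                 [0.05, 0.85, 0.05, 0.05],   # confident  -> kept
                 [0.30, 0.30, 0.25, 0.15]])  # ambiguous  -> discarded
kept, manner = select_elite_frames(post, threshold=0.7)
print("kept frames:", kept)                              # [0 2 4]
print("fraction discarded:", 1 - len(kept) / len(post))  # 0.5 in this toy case
```

Raising or lowering the threshold trades frame selectivity against segment coverage, which is exactly the compromise discussed above for the 0.7 setting.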
Table 4. Classification performance (percent correct) for manner of articulation (vocalic, nasal, stop, fricative, flap, silence) under the frame-selection approach. "All" refers to the manner-independent system using all frames; "Best" refers to frames exceeding the 0.7 threshold.

Table 5. Manner-specific (M-S) classification performance (percent correct) for place-of-articulation extraction (anterior, central, posterior and, where applicable, glottal) for each of the four major manner classes. Place-classification performance for the manner-independent (M-I) system is shown for comparison.

Table 6. Classification performance (percent correct) for the articulatory features of vowel height (low, mid, high), intrinsic vowel duration (tense/lax) and spectral dynamics (static/dynamic), comparing the manner-independent (M-I) and manner-specific (M-S) systems.

6. Manner-Specific Articulatory Place Classification

In the classification experiments illustrated in Table 3, place information was correctly classified for 71% of the frames. The accuracy for individual place-feature classes ranged between 11% and 82%. Articulatory-place information is likely to be classified more accurately when place is extracted for each manner class separately (cf. [10]). Table 5 illustrates the results of such manner-specific place classification. In order to characterize the potential efficacy of the method, manner information for the test materials was derived from the reference labels for each segment rather than from automatic classification of manner.

Separate MLPs were trained to classify place-of-articulation features for each of the five manner classes – stops, nasals, fricatives, flaps and vowels (the latter including the approximants). The place dimension for each manner class was partitioned into a small number of basic features. For consonantal segments the partition was based on the relative locus of constriction – anterior, central and posterior (as well as the glottal feature for the stops and fricatives). For example, "bilabial" [...] fricatives. In this fashion it is possible to construct a relational place scheme within each manner class. For vocalic segments, front vowels were classified as anterior and back vowels as posterior. The liquids (i.e., [l] and [r]) were assigned a "central" place given the contextual variability of their articulation.

The gain in place-of-articulation classification associated with manner-specific feature extraction is considerable for most manner classes, as illustrated in Table 5. In many instances the gain in place classification is between 10% and 30%. In no instance does the manner-specific regime significantly degrade performance.
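The routing logic of manner-specific place classification can be sketched as follows: each frame, together with a manner label (from the reference transcription in the experiment above, or from the elitist stage in a fully automatic setup), is passed to a place network trained only on that manner class, whose outputs range over the relational anterior/central/posterior (and, for stops and fricatives, glottal) inventory. The stand-in "networks", feature dimensions and function names below are assumptions for illustration, not the trained ARTIFEX classifiers.

```python
import numpy as np

# Relational place inventory per manner class, following the partition described above.
PLACE_INVENTORY = {
    "stop":      ["anterior", "central", "posterior", "glottal"],
    "fricative": ["anterior", "central", "posterior", "glottal"],
    "nasal":     ["anterior", "central", "posterior"],
    "flap":      ["anterior", "central", "posterior"],
    "vocalic":   ["anterior", "central", "posterior"],
}

def classify_place(frame, manner, place_nets):
    """
    Route a frame to the place classifier trained for its manner class and
    return the winning relational place label. `place_nets[manner]` is any
    callable mapping a feature vector to posteriors over PLACE_INVENTORY[manner].
    """
    posteriors = place_nets[manner](frame)
    return PLACE_INVENTORY[manner][int(np.argmax(posteriors))]

# Stand-in "networks": random posteriors of the right size (placeholders for trained MLPs).
rng = np.random.default_rng(1)
place_nets = {m: (lambda x, k=len(v): rng.dirichlet(np.ones(k)))
              for m, v in PLACE_INVENTORY.items()}

frame = rng.normal(size=39)                      # one acoustic feature vector
for manner in ("stop", "fricative", "vocalic"):  # manner label supplied per frame
    print(manner, "->", classify_place(frame, manner, place_nets))
```

Because each per-manner network only has to separate three or four place categories, the individual classification problems are far better matched to the available training data than a single ten-way place network.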
7. Manner-Specific Non-Place Feature Classification

Manner-specific classification was also performed on the spectral dimension of static/dynamic, as well as on the dimensions of height (high, mid, low) and intrinsic duration (tense/lax) for vocalic segments only. The dynamic/static features are useful for distinguishing affricates (such as [ch] and [jh]) from "pure" fricatives, as well as for separating diphthongs from monophthongs among vowels. The height feature is necessary for distinguishing many vocalic segments from each other. The tense/lax feature provides important information pertaining to vocalic duration and stress-accent (cf. [4]).

Although there are gains in performance (relative to manner-independent classification) for many of the features (Table 6), the magnitude of improvement is not quite as impressive as that observed for articulatory-place features.

8. Discussion and Conclusions

Current methods for annotating spoken-language material focus on the phonetic segment and the word. Manual annotation is both costly and time-consuming. Moreover, few individuals possess the complex constellation of skills and expertise required to perform large amounts of such annotation in highly accurate fashion. Therefore, the future of spoken-language annotation is likely to reside in automatic procedures. The most advanced of the current automatic phonetic annotation systems [1][6][7] require a word transcript to perform, and even under such circumstances the output is in the form of phonetic segments only.

The output of such "super-aligners" is subject to error unless considerable pronunciation modeling is built into these systems to accommodate idiolectal and dialectal variation. The ability to capture fine nuances of pronunciation at the level of the phonetic segment is limited by virtue of the extraordinary amount of variation observed at this level in spoken language. At present, the ability to convert AFs into phonetic segments is also limited. For the NTIMIT corpus the use of the ARTIFEX system improves phone classification at the frame level by only a small amount (from 55.7% for a conventional phone-recognition system to 61.5% accuracy when phonetic identity is derived from manner-independent, articulatory-feature inputs). The elitist framework results in only a small additional gain in performance at the phonetic-segment level, despite the dramatic improvement in AF classification, suggesting that the phone segment may not be the optimal unit for characterizing the phonetic properties of spoken language.

For such reasons, future-generation speech recognition and synthesis systems are likely to require much finer detail in modeling pronunciation than is currently afforded by segmental systems. The ARTIFEX system, in tandem with the elitist approach, provides one potential means with which to achieve high-fidelity phonetic characterization for speech technology development and the scientific study of spoken language.

9. Acknowledgements

The research described in this study was supported by the U.S. Department of Defense and the National Science Foundation. Mirjam Wester is affiliated with A2RT, Department of Language and Speech, Nijmegen University.

10. References

[1] Beringer, N. and Schiel, F. "The quality of multilingual automatic segmentation using German MAUS," Proc. Inter. Conf. Spoken Lang. Proc., Vol. IV, pp. 728-731.
[2] Chang, S., Shastri, L. and Greenberg, S. "Automatic phonetic transcription of spontaneous speech (American English)," Proc. Inter. Conf. Spoken Lang. Proc., Vol. IV.
[3] Greenberg, S. "Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation," Speech Communication.
[4] Hitchcock, L. and Greenberg, S. "Vowel height is intimately associated with stress accent in spontaneous American English discourse," submitted to Eurospeech (available from http://www.icsi.berkeley.edu/~steveng/prosody), 2001.
[5] Jankowski, C., Kalyanswamy, A., Basson, S. and Spitz, J. "NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database," Proc. ICASSP.
[6] Kessens, J.M., Wester, M. and Strik, H. "Improving the performance of a Dutch CSR by modeling within-word and cross-word pronunciation variation," Speech Communication.
[7] Kirchhoff, K. Robust Speech Recognition Using Articulatory Information, Ph.D. Thesis, University of Bielefeld.
[8] Schiel, F. "Automatic phonetic transcription of non-prompted speech," Proc. Int. Cong. Phon. Sci.
[9] Strik, H., Russell, A., van den Heuvel, H., Cucchiarini, C. and Boves, L. "A spoken dialogue system for the Dutch public transport information service," Journal of Speech Technology.
[10] Wester, M., Greenberg, S. and Chang, S. "A Dutch treatment of an elitist approach to articulatory-acoustic feature classification," Proc. Eurospeech, 2001.