2. FEATURE EXTRACTION AND MODELS

2.1. MFCC & Energy features

The most commonly used features for human speech analysis and recognition are Mel-Frequency Cepstral Coefficients (MFCCs) [8], often supplemented by an energy measure. Although there are several possible methods for computation, here the filterbank approach is used, where the spectrum of each Hamming-windowed signal is divided into Mel-spaced triangular frequency bins, then a Discrete Cosine Transform (DCT) is applied to calculate the desired number of cepstral coefficients. Log energy is computed directly from the time-domain signal. Additionally, delta and delta-delta MFCCs, representing the velocity and acceleration profiles of the cepstral and energy features, are calculated using linear regression over a neighborhood of five windows.

2.2. GFCC

Greenwood Function Cepstral Coefficients (GFCCs) are a generalization of the MFCC based on using the Greenwood function [9] for frequency warping. These features are appropriate for spectral analysis of vocalizations for a wide variety of species, given basic information about the underlying frequency range [10]. GFCCs are used here as base features for analysis of animal vocalizations, with energy, delta, and delta-delta features computed identically to those for MFCCs described above.

2.3. Fundamental frequency

Fundamental frequency contours are extracted from the vocalizations using the COLEA toolbox [11] cepstrum implementation. Results are post-processed by median filtering. Unvoiced frames are considered to have a frequency (and corresponding jitter and shimmer) of zero.

2.4. Jitter & Shimmer

Jitter is a measure of period-to-period fluctuations in fundamental frequency. Jitter is calculated between consecutive voiced periods via the formula:

\[
\text{Jitter} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left| T_i - T_{i+1} \right|}{\frac{1}{N}\sum_{i=1}^{N} T_i}
\]

where T_i is the pitch period of the i-th window and N is the total number of voiced frames in the utterance. Shimmer is a measure of the period-to-period variability of the amplitude value, expressed as:

\[
\text{Shimmer} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left| A_i - A_{i+1} \right|}{\frac{1}{N}\sum_{i=1}^{N} A_i}
\]

where A_i is the peak amplitude value of the i-th window and N is the number of voiced frames.
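Section 2.1 describes delta and delta-delta features computed by linear regression over a neighborhood of five windows. As one possible reading of that description, the sketch below implements the standard regression-based delta formula over a five-frame window; the edge padding and function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def delta_features(feats: np.ndarray, half_window: int = 2) -> np.ndarray:
    """Estimate per-frame time derivatives of a (num_frames x num_coeffs)
    feature matrix by linear regression over a 2*half_window+1 frame
    neighborhood (five frames for half_window=2)."""
    num_frames = feats.shape[0]
    # Repeat the edge frames so every frame has a full neighborhood
    # (an assumption about edge handling, not specified in the paper).
    padded = np.concatenate([feats[:1].repeat(half_window, axis=0),
                             feats,
                             feats[-1:].repeat(half_window, axis=0)])
    denom = 2 * sum(k * k for k in range(1, half_window + 1))
    deltas = np.zeros_like(feats)
    for k in range(1, half_window + 1):
        deltas += k * (padded[half_window + k:half_window + k + num_frames]
                       - padded[half_window - k:half_window - k + num_frames])
    return deltas / denom

# Delta-delta (acceleration) features are the same regression applied to the
# delta features, e.g. accel = delta_features(delta_features(base_features)).
```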
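For the GFCC warping of Section 2.2, the excerpt does not spell out the parameterization of the Greenwood function. The sketch below uses the commonly cited form f(x) = A(10^(a·x) − k) from [9], with k = 0.88 and with A and a solved from the species frequency range, as in the generalized perceptual feature literature [10]; the constant choice and function names are assumptions made for illustration.

```python
import numpy as np

def greenwood_warp(freq_hz, f_min, f_max, k=0.88):
    """Map frequency (Hz) to a Greenwood perceptual position in [0, 1] for a
    species whose relevant frequency range is [f_min, f_max], by inverting
    f(x) = A * (10**(a*x) - k) with f(0) = f_min and f(1) = f_max."""
    A = f_min / (1.0 - k)
    a = np.log10(f_max / A + k)
    return np.log10(np.asarray(freq_hz) / A + k) / a

def greenwood_unwarp(x, f_min, f_max, k=0.88):
    """Inverse mapping: Greenwood position in [0, 1] back to frequency in Hz."""
    A = f_min / (1.0 - k)
    a = np.log10(f_max / A + k)
    return A * (10.0 ** (a * np.asarray(x)) - k)

# Triangular filter-bank center frequencies equally spaced on the Greenwood
# scale (hypothetical 10-500 Hz range, e.g. for low-frequency rumbles):
# greenwood_unwarp(np.linspace(0.0, 1.0, 26), f_min=10.0, f_max=500.0)
```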
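To make the jitter and shimmer definitions of Section 2.4 concrete, the following sketch computes both measures from the pitch periods and peak amplitudes of the voiced frames, following the mean-normalized equations above; the array-based interface is an illustrative assumption.

```python
import numpy as np

def jitter(periods):
    """Relative jitter: mean absolute difference between consecutive pitch
    periods T_i, normalized by the mean pitch period (voiced frames only)."""
    T = np.asarray(periods, dtype=float)
    if T.size < 2:
        return 0.0
    return np.mean(np.abs(np.diff(T))) / np.mean(T)

def shimmer(amplitudes):
    """Relative shimmer: mean absolute difference between consecutive peak
    amplitudes A_i, normalized by the mean peak amplitude (voiced frames only)."""
    A = np.asarray(amplitudes, dtype=float)
    if A.size < 2:
        return 0.0
    return np.mean(np.abs(np.diff(A))) / np.mean(A)

# Unvoiced frames are assigned jitter and shimmer of zero (Section 2.3), so
# only the voiced-frame periods and amplitudes are passed to these functions.
```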
3. DATABASE

3.1. SUSAS Database

The Speech Under Simulated and Actual Stress (SUSAS) dataset was created by the Robust Speech Processing Laboratory at the University of Colorado-Boulder [12]. The database encompasses a wide variety of stresses and emotions. Utterances are divided into two portions, "actual" and "simulated". In this paper, we use the utterances in the simulated conditions, consisting of utterances from nine male speakers in each of eleven speaking-style classes. The eleven styles include Angry, Clear, Cond50, Cond70, Fast, Lombard, Loud, Neutral, Question, Slow, and Soft. The Cond50 style is recorded with the speaker in a medium workload condition, while in the Cond70 style the speaker is in a high workload condition. The Lombard speaking style contains utterances from subjects listening to pink noise presented binaurally through headphones at a level of 86 dB. The vocabulary includes 35 highly confusable aircraft communication words. Each of the nine speakers (3 speakers from each of 3 dialect regions) in the dataset has two repetitions of each word in each style. All speech tokens were sampled with a 16-bit A/D converter at a sampling frequency of 8 kHz.

3.2. African Elephant Emotional Arousal Dataset

Elephant vocalizations were collected from six adult non-pregnant, nulliparous female African elephants (Loxodonta africana) by Kirsten M. Leong and Joseph Soltis at Disney's Animal Kingdom (DAK), Lake Buena Vista, Florida, U.S.A. The data collection occurred from July 2005 to December 2005. Each elephant wore a custom-designed collar containing a microphone and an RF radio that transmitted audio to the elephant barn, where the data was recorded on DAT tapes. The audio was passed through an anti-aliasing filter and stored on computers at a sampling rate of 7518 Hz [13]. There are 131 vocalizations used for these experiments, all low-frequency rumble calls. Each vocalization is labeled by individual ID, social rank, age, and arousal level. Of the six females, three are of high social rank and the remaining three are of low rank. Similarly, three females are of old age and three are of young age. Emotional arousal level was determined from observation of time-synchronized video based on specific social context criteria. The emotional arousal levels are categorized as low (L), medium (M), and high (H), with 51, 46, and 34 calls in each category, respectively.

As with the African elephant experiments, multiple experimental setups are implemented, including caller-independent (CI), rank-dependent (RD), age-dependent (AD), gender-dependent (GD), and caller-dependent (CD). Evaluation is done using leave-one-out cross-validation across the dataset. Results in Table 3 above show a similar pattern to the elephant arousal experiments. In all cases, adding jitter or shimmer individually increases the accuracy, with shimmer having slightly better performance, while using the two together gives substantially better results than using them individually. Individual variation again seems to be the strongest confounding factor in accurate classification of arousal, as indicated by the 96% peak accuracy for the caller-dependent experiments.
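The paragraph above names leave-one-out cross-validation as the evaluation protocol. The sketch below shows that protocol only; the paper's actual classifier is an HMM built with HTK [14], so the classifier factory here is a hypothetical stand-in with fit/predict methods.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def leave_one_out_accuracy(features, labels, make_classifier):
    """Leave-one-out evaluation: train on all calls but one, test on the
    held-out call, and report overall accuracy.

    `features` is a list of per-call feature matrices, `labels` the class of
    each call, and `make_classifier` a factory returning an object with
    fit(features, labels) and predict(features) methods (a placeholder for
    the HTK-based HMM classifier used in the paper)."""
    labels = np.asarray(labels)
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(labels):
        clf = make_classifier()
        clf.fit([features[i] for i in train_idx], labels[train_idx])
        pred = clf.predict([features[i] for i in test_idx])
        correct += int(pred[0] == labels[test_idx][0])
    return correct / len(labels)
```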
5. CONCLUSIONS

Jitter and shimmer features have been evaluated as important features for analysis and classification of speaking style and arousal level in both human speech and animal vocalizations. Adding jitter and shimmer to baseline spectral and energy features in an HMM-based classification model resulted in increased classification accuracy across all experimental conditions. In evaluation of animal arousal levels, the largest obstacle to accurate classification is shown to be individual variability, rather than rank, gender, or age factors.

6. ACKNOWLEDGEMENTS

The authors would like to acknowledge Marek B. Trawicki for provision of initial experiment code. This contribution is based on work supported by the National Science Foundation under Grant No. IIS-0326395.

7. REFERENCES

[1] J. H. L. Hansen, "Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition," Speech Communications, Special Issue on Speech Under Stress, vol. 20(2), pp. 151-170, November 1996.
[2] S. Bou-Ghazale and J. H. L. Hansen, "A novel training approach for improving speech recognition under adverse stressful environments," EUROSPEECH-97, vol. 5, pp. 2387-2390, Sept. 1997.
[3] S. Bou-Ghazale and J. H. L. Hansen, "A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress," IEEE Transactions on Speech and Audio Processing, vol. 8(4), pp. 429-442, July 2000.
[4] J. Soltis, K. M. Leong, and A. Savage, "African elephant vocal communication II: rumble variation reflects the individual identity and emotional state of callers," Animal Behaviour, vol. 70, pp. 589-599, 2005.
[5] A. Nogueiras, A. Moreno, A. Bonafonte, and J. Mariño, "Speech Emotion Recognition Using Hidden Markov Models," Eurospeech 2001, Poster Proceedings, pp. 2679-2682, 2001.
[6] K. Oh-Wook, C. Kwokleung, H. Jiucang, and L. Te-Won, "Emotion Recognition by Speech Signals," presented at Eurospeech, Geneva, 2003.
[7] B. F. Fuller, Y. Horii, and D. A. Conner, "Validity and reliability of nonverbal voice measures as indicators of stressor-provoked anxiety," Research in Nursing & Health, vol. 15(5), pp. 379-389, Oct. 1992.
[8] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Upper Saddle River, New Jersey: Prentice Hall, 2001.
[9] D. D. Greenwood, "Critical bandwidth and the frequency coordinates of the basilar membrane," The Journal of the Acoustical Society of America, vol. 33, pp. 1344-1356, 1961.
[10] P. J. Clemins, M. B. Trawicki, K. Adi, J. Tao, and M. T. Johnson, "Generalized perceptual features for vocalization analysis across multiple species," Proceedings of the IEEE ICASSP, vol. 1, pp. 1253-1256, May 2006.
[11] P. Loizou, "COLEA: A MATLAB software tool for speech analysis," Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX, 1999.
[12] J. H. L. Hansen, S. E. Bou-Ghazale, R. Sarikaya, and B. Pellom, "Getting Started with the SUSAS: Speech Under Simulated and Actual Stress Database," Robust Speech Processing Laboratory, April 15, 1998.
[13] K. M. Leong, A. Ortolani, K. D. Burks, J. D. Mellen, and A. Savage, "Quantifying acoustic and temporal characteristics of vocalizations of a group of captive African elephants (Loxodonta africana)," Bioacoustics, vol. 13, pp. 213-231, 2003.
[14] S. Young, et al., The HTK Book (for HTK Version 3.2.1), 2002.