IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING


VOL. 16, NO. 8, NOVEMBER 2008

Epoch Extraction From Speech Signals
K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE

Abstract: Epoch is the instant of significant excitation of the vocal-tract system during production of speech. For most voiced speec...


MURTY AND YEGNANARAYANA: EPOCH EXTRACTION FROM SPEECH SIGNALS

as the instant of significant excitation. Some methods also exploit the periodicity property of the signal in the adjacent cycles for epoch extraction. The method proposed in this paper assumes and exploits the impulse-like characteristic of the excitation. The intervals between the adjacent impulses are not necessarily equal, i.e., the glottal cycles need not be periodic even in short intervals of a few (2-4) glottal cycles.

The first contribution to the detection of epochs was due to Sobakin [10]. A slightly modified version was proposed by Strube [11]. In Strube's work, some predictor methods based on LP analysis for the determination of the epochs were reviewed. These methods do not always yield reliable results. Sobakin's method using the determinant of the autocovariance matrix was examined critically, and it was shown that the determinant was maximum if the beginning of the interval, on which the autocovariance matrix was computed, coincided with the glottal closure.

In [12], a method based on the composite signal decomposition was proposed for epoch extraction of voiced speech. A superposition of nearly identical waveforms was referred to as a composite signal. The epoch filter proposed in that work computes the Hilbert envelope of the highpass-filtered composite signal to locate the epoch instants. It was shown that the instants of excitation of the vocal tract could be identified precisely even for continuous speech. However, this method is suitable for analyzing only clean speech.

The error signal obtained in the LP analysis, referred to as the LP residual, is known to contain information pertaining to epochs. A large value of the LP residual within a pitch period is supposed to indicate the epoch location [13]. However, epoch identification directly from the LP residual is not recommended [11], because the LP residual contains peaks of random polarity around the epochs. This makes unambiguous identification of the epochs from the LP residual difficult. A detailed study was made on the determination of the epochs from the LP residual [14], and a method for unambiguous identification of epochs from the LP residual was proposed. A least-squares approach for glottal inverse filtering from the acoustic speech waveform was proposed in [15]. In that paper, covariance analysis was discussed for accurately performing the glottal inverse filtering from the acoustic speech waveform.

A method based on maximum-likelihood theory for epoch determination was proposed in [16]. In this method, the speech signal was processed to get the maximum-likelihood epoch detection (MLED) signal. The strongest positive pulse in the MLED signal indicates the epoch location within a pitch period. However, the MLED signal creates not only a strong and sharp epoch pulse, but also a set of weaker pulses which represent the suboptimal epoch candidates within a pitch period. Hence, a selection function was derived using the speech signal and its Hilbert transform, which emphasized the contrast between the epoch and the suboptimal pulses. Using the MLED signal and the selection signal with an appropriate threshold, the epochs were detected. The limitation of this method is the choice of window for deriving the selection function, and also the use of a threshold for deciding the epochs.

A Frobenius norm approach for detecting the epochs was proposed in [17]. In that paper, a new approach based on singular value decomposition (SVD) was proposed. The SVD method amounts to calculating the Frobenius norms of signal matrices, and is therefore computationally efficient. The method was shown to work only for vowel segments. No attempt was made to detect epochs in difficult cases like nasals, voiced consonants, and semivowels.

A method for detecting the epochs in a speech signal using the properties of minimum phase signals and the group-delay function was proposed in [18]. The method is based on the fact that the average value of the group-delay function of a signal within an analysis frame corresponds to the location of the significant excitation. An improved method based on the computation of the group-delay function directly from the speech signal was proposed in [19]. Robustness of the group-delay based method against additive noise and channel distortions was studied in [20]. Four measures of group-delay (average group-delay, zero-frequency group-delay, energy-weighted group-delay, and energy-weighted phase) and their use for epoch detection were investigated in [21]. In that paper, the effect of the length of the analysis window, the tradeoff between the detection rate and the timing error, and the computational cost of evaluating the measures were examined in detail. It was shown that the energy-weighted measures performed better than the other two measures.

A dynamic programming projected phase-slope algorithm (DYPSA) for automatic estimation of glottal closure instants (GCIs) in voiced speech was presented in [22] and [23]. In this method, the candidates for GCI were obtained from the zero crossings of the phase-slope function derived from the energy-weighted group-delay, and were refined by employing a dynamic programming algorithm. It was shown that DYPSA performed better than the existing methods.

Epoch is an instant property, but in most of the methods discussed above (except the group-delay based methods), the epochs are detected by employing block processing approaches, which result in ambiguity about the precise location of the epochs. Most of the existing methods rely on the LP residual signal derived by inverse filtering the speech signal. Though these methods work well in most cases, they need to deal with the following issues:
1) selection of parameters (order of LP analysis, length of the window) for deriving the error signal;
2) dependence of these methods on the energy of the error signal, which in turn depends on the energy of the signal;
3) the accuracy with which the epochs can be resolved decreases as a result of block processing;
4) setting a threshold value to take an unambiguous decision on the presence of an epoch;
5) though some of these methods exploit periodicity for accurate estimation of epoch locations, the excitation impulses need not be periodic.
In general, it is difficult to detect the epochs in the case of low voiced consonants, nasals and semivowels, breathy voices, and female speakers.

In this paper, we propose a new method for epoch extraction which is based on the assumption that the major source of excitation of the vocal-tract system is due to a sequence of impulse-like events in the glottal vibration. The impulse excitation to the system results in a discontinuity in frequency in the output signal. We propose a novel approach to detect the location of the discontinuity in frequency in the output signal by confining the analysis around a single frequency. In Section II, we discuss the basic principle of the proposed method and illustrate the principle for several cases of synthetic excitation signals. In Section III, we discuss the issues involved in applying the method directly on speech data. In Section IV, we present our proposed method to extract epochs from the speech signal. In Section V, the performance of the proposed method in terms of identification accuracy is given, and the results are compared with three existing methods for epoch extraction. In Section VI, the performance of the proposed method is evaluated for different types of degradations, and the results are compared with the existing methods. Finally, in Section VII we summarize the contributions of this paper, and discuss some limitations of the proposed method which prompt further investigation for extracting epochs from speech signals recorded in practical environments.

II. BASIS FOR THE PROPOSED METHOD OF EPOCH EXTRACTION

Speech is produced by exciting the time-varying vocal-tract system by one or more of the following three types of excitation: 1) glottal vibration; 2) frication; 3) burst. The primary mode of excitation is due to glottal vibration. While excitation is present throughout the production process, it is considered significant (especially during glottal vibration) only when there is large energy in a short time interval, i.e., when it is impulse-like. These impulse-like characteristics are usually exhibited around the instants of glottal closure during each glottal cycle. The presence of these impulse-like characteristics suggests that the excitation can be approximated as a sequence of impulses. This assumption on the excitation of the vocal-tract system suggests a new approach for processing the speech signal, as discussed in this paper.

All physical systems are inertial in nature. Inertial systems respond when excited by an external source. The excitation to an inertial system can be any of the following four types.
1) Excitation impulse is not in the observed interval of the signal (sinusoidal generator): The output signal is the response of a passive inertial system to an impulse, and the impulses themselves are not present in the observed intervals of the signal.
2) Sinusoidal excitation: Sinusoidal excitation can be viewed as impulse excitation in the frequency domain. Hence, a sinusoidal excitation to an inertial system selects the corresponding frequency component from the transfer function of the system. Though sinusoidal excitation is widely used to analyze synthetic systems, it is not commonly found in physical systems.
3) Random excitation: Random excitation can be interpreted as impulse excitation of arbitrary amplitude at every instant of time. Since impulse excitations are present over all the instants of time, it is difficult to observe them from the output of the system. Random excitation does not possess an impulse-like nature either in the time domain or in the frequency domain, and hence, the impulses cannot be perceived.
4) Sequence of impulses as excitation: In this case, the signals are generated by a passive inertial system with a fixed sequence of (periodic and/or aperiodic) impulses as excitation. The time instants of impulses may not be observed from the output of the system, but they can be perceived. If the sequence of impulses is periodic in the time domain, then it corresponds to a periodic sequence of impulses in the frequency domain also, and can also be perceived.

Fig. 1. Inertial system excited with a sequence of impulses.

Consider a physical system excited by a sequence of impulses of varying strengths, as shown in Fig. 1. One of the challenges in the field of signal processing is to detect the time instants of the impulses and their corresponding strengths
from the output signal. In a natural scenario like speech production, the characteristics of the system vary with time and are unknown. Hence, the signal processing problem can be viewed as a blind deconvolution, where neither the system response nor the excitation source is known. In this paper, we attempt to detect the time instants of excitation (epochs) of the vocal-tract system.

Consider a unit impulse in the time domain. It has all the frequencies equally well represented in the frequency domain. When an inertial system is excited by an impulse-like excitation, the effect of the excitation spreads uniformly in the frequency domain and is modulated by the time-varying transfer function of the system. The information about the time instants of occurrence of the excitation impulses is reflected as discontinuities in the time domain. It may be difficult to observe these discontinuities directly from the signal because of the time-varying response of the system. The effect of the discontinuities can be highlighted by filtering the output signal through a narrowband filter centered around a frequency. The output of the narrowband filter predominantly contains a single frequency component, and as a result, the discontinuities due to the excitation impulses are manifested as a deviation from the center frequency. The time instants of the discontinuities can be derived by computing the instantaneous frequency of the filtered output [24]. A tutorial review on the instantaneous frequency and its interpretation is given in [25]. It has been previously observed that isolated narrow spikes in the instantaneous frequency of the bandpass-filtered output [26, ch. 11] are attributed to either the valleys in the amplitude envelope or the onset of a new pitch pulse, but no previous work explored the feasibility of this type of observation for epoch extraction.

A. Computation of Instantaneous Frequency

The instantaneous frequency of a real signal $s(t)$ is defined as the time derivative of the unwrapped phase of the complex analytic signal derived from $s(t)$ [24]. The complex analytic signal $s_a(t)$ corresponding to a real signal $s(t)$ is given by

$$s_a(t) = s(t) + j\,s_h(t) \tag{1}$$

where $s_h(t)$ is the Hilbert transform of the real signal $s(t)$, and is given by

$$s_h(t) = \mathrm{IFT}\left[S_h(\omega)\right] \tag{2}$$

where IFT denotes the inverse Fourier transform, and $S_h(\omega)$ is given by

$$S_h(\omega) = -j\,\mathrm{sgn}(\omega)\,S(\omega) \tag{3}$$

where $S(\omega)$ is the Fourier transform of $s(t)$. The analytic signal thus derived contains only positive frequency components. The analytic signal $s_a(t)$ can be rewritten as

$$s_a(t) = a(t)\,e^{j\phi(t)} \tag{4}$$

where

$$a(t) = \sqrt{s^2(t) + s_h^2(t)} \tag{5}$$

is called the amplitude envelope, and

$$\phi(t) = \arctan\left(\frac{s_h(t)}{s(t)}\right) \tag{6}$$

is called the instantaneous phase. Direct computation of $\phi(t)$ from (6) suffers from the problem of phase wrapping, i.e., $\phi(t)$ is constrained to an interval $[-\pi, \pi]$ or $[0, 2\pi]$. Hence, the instantaneous frequency cannot be computed by explicit differentiation of the phase $\phi(t)$ without first performing the complex task of unwrapping the phase in time. The instantaneous frequency can be computed directly from the signal, without going through the process of phase unwrapping, by exploiting the Fourier transform relations. Taking the logarithm on both sides of (4), and differentiating with respect to time $t$, we have

$$\frac{s_a'(t)}{s_a(t)} = \frac{a'(t)}{a(t)} + j\,\phi'(t) \tag{7}$$

where the superscript $'$ denotes the derivative operator, and $\phi'(t)$ is the instantaneous frequency. That is

$$\phi'(t) = \mathrm{Im}\left[\frac{s_a'(t)}{s_a(t)}\right] \tag{8}$$

where $\mathrm{Im}[\cdot]$ denotes the imaginary part. $s_a'(t)$ can be computed by using the Fourier transform relations. The analytic signal $s_a(t)$ can be synthesized from its frequency domain representation through the inverse Fourier transform

$$s_a(t) = \frac{1}{2\pi}\int_{0}^{\infty} S_a(\omega)\,e^{j\omega t}\,d\omega \tag{9}$$

where $S_a(\omega)$ is the Fourier transform of the analytic signal $s_a(t)$, and is zero for negative frequencies. Differentiating both sides of (9) with respect to time $t$, we have

$$s_a'(t) = \frac{1}{2\pi}\int_{0}^{\infty} j\omega\,S_a(\omega)\,e^{j\omega t}\,d\omega = j\,\mathrm{IFT}\left[\omega S_a(\omega)\right] \tag{10}$$

The instantaneous frequency $\phi'(t)$ can be obtained from (7) and (10) as

$$\phi'(t) = \mathrm{Re}\left[\frac{\mathrm{IFT}\left[\omega S_a(\omega)\right]}{s_a(t)}\right] \tag{11}$$

where $\mathrm{Re}[\cdot]$ denotes the real part. Computation of the instantaneous frequency given in (11) is implemented in the discrete domain as follows:

$$\phi'[n] = \frac{2\pi}{N}\,\mathrm{Re}\left[\frac{\mathrm{IDFT}\left(k\,S_a[k]\right)}{s_a[n]}\right] \tag{12}$$

Here, IDFT denotes the inverse discrete Fourier transform, and $N$ is the total number of samples in the signal.

The instantaneous frequency may be interpreted as the frequency of a sinusoid which locally fits the signal under analysis. However, it has a physical interpretation only for monocomponent signals, where there is only one frequency or a narrow range of frequencies varying as a function of time. In this case, the instantaneous frequency can be interpreted as the deviation of the frequency of the signal from the monotone at every instant of time. The notion of a single-valued instantaneous frequency becomes meaningless for multicomponent (multiple frequency sinusoids) signals. The multicomponent signal has to be dispersed into its components for further analysis.

In this paper, we propose to use a resonator to filter out from a signal a monocomponent centered around a single frequency for further analysis. A resonator is a second-order infinite-impulse response (IIR) filter with a complex conjugate pair of poles in the $z$-plane [27]. A resonator with narrow bandwidth (corresponding to a pole radius $r$ close to unity) was chosen to realize the narrowband filter. An ideal resonator ($r = 1$) was not used in order to avoid saturation of the filter output. When a multicomponent signal is filtered through a resonator centered around a frequency $\omega_0$, the output signal predominantly contains the $\omega_0$ frequency component. Any deviation in frequency of the filtered output can be attributed to the impulse-like characteristics present in the multicomponent signal. In general, the analytic signal corresponding to the filtered output can be expressed as

$$s_a(t) = a(t)\,e^{j(\omega_0 t + \theta(t))}$$

Hence, the instantaneous frequency of the filtered output (predominantly monocomponent) is given by

$$\phi'(t) = \omega_0 + \theta'(t)$$

Fig. 2(a) shows a multicomponent signal in the form of an aperiodic sequence of impulses with arbitrary strengths. The signal filtered by a 500-Hz resonator, and the instantaneous frequency of the filtered output are shown in Fig. 2(b) and (c), respectively. At the instants of impulse locations, the instantaneous frequency deviates significantly from the normalized center frequency $\omega_0 = 2\pi f_0 / f_s$, where $f_0$ is the frequency of the resonator, and $f_s$ is the sampling frequency. For a resonator frequency $f_0 = 500$ Hz and the sampling frequency $f_s$ used, the instantaneous frequency (around $\omega_0$
) shows sharp peaks at the locations of the impulses. The illustration in Fig. 2 shows that the discontinuity information can be derived from the filtered output even if the sequence of impulses is not regularly spaced,

Authorized licensed use limited to: INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on January 30, 2009 at 00:56 from IEEE Xplore. Restrictions apply.

Fig. 5. A 100-ms segment of (a) speech waveform, (b) output of the resonator at 500 Hz, (c) instantaneous frequency of the filtered output, and (d) differenced EGG signal.

by the differenced EGG signal, illustrating the potential of the proposed method.

In the case of speech, the instantaneous frequency of the filtered output also contains the time-varying frequency changes associated with the vocal-tract transfer function, which is undesirable. As a result, though the peaks in the instantaneous frequency of the filtered output indicate the epoch locations accurately for the segment shown in Fig. 5, it may not be possible to extract the epoch locations unambiguously for any chosen center frequency $\omega_0$. Thus, the method of epoch extraction using the instantaneous frequency of the filtered output depends critically on the choice of the center frequency of the filter. A single center frequency may not be suitable for extracting the epoch locations of an arbitrary segment of speech. The center frequency has to be chosen based on the characteristics of the speech segment under analysis. The choice of the center frequency also depends on the distribution of energy of the speech segment in the frequency domain.

To illustrate the significance of the choice of the center frequency of the filter, the instantaneous frequencies computed around four different center frequencies are shown in Fig. 6. The spectrogram, the speech signal, and the differenced EGG signal are also given for reference. The spectrogram in Fig. 6(a) shows a band of energy around 500 Hz. The instantaneous frequency computed around 500 Hz [Fig. 6(d)] indicates unambiguous peaks/valleys that are in close agreement with the actual epochs shown by the differenced EGG signal [Fig. 6(c)]. In the instantaneous frequencies computed around 1000 and 2000 Hz, shown in Fig. 6(e) and (f), respectively, the epoch locations cannot be identified easily. This is because the energy of the signal in those frequency bands is very low. Since the spectrogram shows large energy in the band around 2500 Hz, the instantaneous frequency computed around 2500 Hz shows sharp peaks/valleys around the epoch locations. However, the instantaneous frequency plot in Fig. 6(g) shows less ambiguous peaks/valleys in the time interval 570-620 ms than those in the time interval 520-570 ms. This is because the intensity of the 2500-Hz frequency band in the time interval 570-620 ms is greater than the intensity of the band in the time interval 520-570 ms.

Fig. 6. Instantaneous frequency of a speech segment computed around four different center frequencies. (a) Spectrogram of the speech segment. (b) Speech waveform. (c) Differenced EGG signal. Instantaneous frequency plots computed around (d) 500 Hz, (e) 1000 Hz, (f) 2000 Hz, and (g) 2500 Hz.

Notice that the instantaneous frequencies computed around 1000 and 2000 Hz also contain all the peaks/valleys corresponding to the epoch locations, but they cannot be located easily due to fluctuations in the neighborhood. This is because the instantaneous frequency captures not only the discontinuities due to the excitation impulses, but also the fluctuations due to the time-varying vocal-tract system. Hence, it is difficult to extract the instants of excitation from the instantaneous frequency computed around an arbitrary center frequency. The center frequency has to be chosen in such a way that the discontinuities due to the excitation impulses dominate over the fluctuations due to the time-varying vocal-tract system.

IV. EPOCH EXTRACTION FROM SPEECH USING A 0-Hz RESONATOR

The discontinuity due to impulse excitation is reflected across all the frequencies including the zero frequency. That is, even the output of the resonator at zero frequency should have the information of the discontinuities due to impulse-like excitation. The advantage of choosing the zero-frequency resonator filter is that the characteristics of the time-varying vocal-tract system will not affect the characteristics of the discontinuities in the resonator filter output. This is because the vocal-tract system has resonances at much higher frequencies than at zero frequency.
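The analysis illustrated in Figs. 5 and 6 (a narrowband resonator followed by the instantaneous frequency of its output) can be sketched in Python with NumPy as below. This is a minimal sketch rather than the authors' implementation: the pole radius `r = 0.99`, the sampling frequency of 10 kHz, the impulse positions, and all function names are assumptions made for illustration. The instantaneous frequency is computed from DFT relations without phase unwrapping, with a small `eps` added (also an assumption) to avoid dividing by values near zero.

```python
import numpy as np

def resonator(x, f0, fs, r=0.99):
    """Second-order IIR resonator with poles at r*exp(+/-j*2*pi*f0/fs)."""
    w0 = 2.0 * np.pi * f0 / fs
    a1, a2 = -2.0 * r * np.cos(w0), r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def analytic_signal(x):
    """Analytic signal via the DFT: keep DC, double positive bins, zero negative bins."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[1:n // 2] = 2.0
        gain[n // 2] = 1.0  # Nyquist bin is kept once
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def instantaneous_frequency(x, eps=1e-12):
    """Instantaneous frequency in radians/sample, without phase unwrapping."""
    sa = analytic_signal(np.asarray(x, dtype=float))
    n = len(sa)
    k = np.arange(n)
    num = np.fft.ifft(k * np.fft.fft(sa))
    # Re{num / sa}, written as Re{num * conj(sa)} / |sa|^2; eps guards
    # against division by (near) zero where the filter output is ~0.
    return (2.0 * np.pi / n) * np.real(num * np.conj(sa)) / (np.abs(sa) ** 2 + eps)

# Hypothetical test signal: aperiodic impulses of arbitrary strength.
fs = 10000.0                      # assumed sampling frequency (Hz)
impulses = np.zeros(1000)
impulses[[100, 330, 480, 790]] = [1.0, 0.6, 0.9, 0.4]
filtered = resonator(impulses, f0=500.0, fs=fs)
inst_freq = instantaneous_frequency(filtered)
```

Plotting `inst_freq` against the impulse positions is a quick way to check whether the deviations from the normalized center frequency line up with the excitation instants for a given choice of `f0` and `r`.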
Fig. 12. Illustration of group-delay based method for epoch extraction [18]. (a) Speech signal, (b) LP residual, (c) phase-slope function, and (d) differenced EGG signal. The pulses in (d) indicate the detected epoch locations.

from the speech signal alone. There are three components in the algorithm. The first component generates candidate epochs using zero crossings of the phase-slope function. The energy-weighted group-delay was used as a measure to derive the phase-slope function. The second component employs a novel phase-slope projection technique to recover candidates for which the phase-slope function does not include a zero crossing. These two components detect almost all the true epochs, but they also generate a large number of false alarms. The third component of the algorithm uses dynamic programming to identify the true epochs from the set of hypothesized candidates by minimizing a cost function. For evaluating this technique, the MATLAB implementation of DYPSA available in [29] was used.

The CMU-Arctic database [30], [31] was employed to evaluate the proposed method of epoch detection and to compare the results with the existing methods. The Arctic database consists of 1132 phonetically balanced English sentences spoken by two male talkers and one female talker. The duration of each utterance is approximately 3 s, which makes the duration of the entire database around 2 h 40 min. The database was collected in a soundproof booth, and digitized at a sampling frequency of 32 kHz. In addition to the speech signals, the Arctic database contains the simultaneous recordings of EGG signals collected using a laryngograph. The speech and EGG signals were time-aligned to compensate for the larynx-to-microphone delay, determined to be approximately 0.7 ms. Reference locations of the epochs were extracted from the voiced segments of the EGG signals by finding peaks in the differenced EGG signal. The performance of the algorithms was evaluated only in the voiced segments (detected from the EGG signal) between the reference epoch locations and the estimated epoch locations. The database contains a total of 792 249 epochs in the voiced regions.

The performance of the epoch detection methods was evaluated using the measures defined in [23]. Fig. 13 shows the characterization of epoch estimates showing each of the possible decisions from the epoch detection algorithms. The following measures were defined to evaluate the performance of epoch detection algorithms.

Larynx cycle: The range of samples $\tfrac{1}{2}(l_{r-1} + l_r) < n < \tfrac{1}{2}(l_r + l_{r+1})$, given an epoch reference at sample $l_r$ with preceding and succeeding epoch references at samples $l_{r-1}$ and $l_{r+1}$, respectively.
Identification rate (IDR): The percentage of larynx cycles for which exactly one epoch is detected.
Miss rate (MR): The percentage of larynx cycles for which no epoch is detected.
False alarm rate (FAR): The percentage of larynx cycles for which more than one epoch is detected.
Identification error $\zeta$: The timing error between the reference epoch location and the detected epoch location in larynx cycles for which exactly one epoch was detected.
Identification accuracy $\sigma$ (IDA): The standard deviation of the identification error $\zeta$. Small values of $\sigma$ indicate high accuracy of identification.

Fig. 13. Characterization of epoch estimates showing three larynx cycles with examples of each possible outcome from epoch extraction [23]. Identification accuracy is measured as a variance of $\zeta$.

TABLE I. COMPARISON OF METHODS ON THE CMU-ARCTIC DATABASE. IDR: IDENTIFICATION RATE, MR: MISS RATE, FAR: FALSE ALARM RATE, IDA: IDENTIFICATION ACCURACY

Table I shows the performance results on the Arctic database for identification rate, miss rate, false alarm rate, and identification accuracy for the three existing methods (HE-based, GD-based, and the DYPSA algorithm), as well as for the proposed method. Fig. 14 shows the histograms of the timing errors $\zeta$ in detecting the epoch locations, averaged over the entire Arctic database. From Table I, it can be concluded that the DYPSA algorithm performed best among the three existing techniques, with an identification rate of 96.66%. The proposed method of epoch detection gives a better identification rate as well as identification accuracy, compared to the results from the DYPSA algorithm.
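The evaluation measures defined above can be computed along the following lines. This is a minimal sketch, not the evaluation code used for Table I: the function name `epoch_metrics` and its arguments are hypothetical, larynx-cycle boundaries are taken as the midpoints between neighbouring reference epochs (an assumption of this sketch), and the first and last references are skipped because they lack a neighbour on one side.

```python
import numpy as np

def epoch_metrics(reference, detected, fs):
    """IDR, MR, FAR (percent) and IDA (ms) from epoch locations in samples."""
    ref = np.sort(np.asarray(reference, dtype=float))
    det = np.sort(np.asarray(detected, dtype=float))
    if len(ref) < 3:
        raise ValueError("need at least three reference epochs")
    hits = misses = false_alarms = 0
    errors = []
    # Interior references only, so that both neighbours exist.
    for r in range(1, len(ref) - 1):
        lo = 0.5 * (ref[r - 1] + ref[r])   # assumed cycle start (midpoint)
        hi = 0.5 * (ref[r] + ref[r + 1])   # assumed cycle end (midpoint)
        in_cycle = det[(det > lo) & (det <= hi)]
        if len(in_cycle) == 1:
            hits += 1
            errors.append((in_cycle[0] - ref[r]) / fs)  # timing error (s)
        elif len(in_cycle) == 0:
            misses += 1
        else:
            false_alarms += 1
    cycles = hits + misses + false_alarms
    return {
        "IDR": 100.0 * hits / cycles,
        "MR": 100.0 * misses / cycles,
        "FAR": 100.0 * false_alarms / cycles,
        # Standard deviation of the timing error, reported in milliseconds.
        "IDA_ms": 1000.0 * float(np.std(errors)) if errors else float("nan"),
    }

# Toy example: four interior cycles with one hit, one false alarm,
# one miss, and another hit.
scores = epoch_metrics([100, 200, 300, 400, 500, 600],
                       [202, 298, 340, 501], fs=1000.0)
```

In the toy example the two hits have timing errors of 2 and 1 samples, so the three rates sum to 100% and the accuracy figure reflects only cycles with exactly one detection, as in the definitions above.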
Fig. 14. Histogram of the epoch timing errors for (a) HE-based method, (b) GD-based method, (c) DYPSA algorithm, and (d) proposed method.

Fig. 15. Histogram of the epoch timing errors for degradation by white noise at an SNR of 10 dB. (a) HE-based method, (b) GD-based method, (c) DYPSA algorithm, and (d) proposed method.

VI. EFFECT OF NOISE ON PERFORMANCE OF PROPOSED METHOD OF EPOCH EXTRACTION

In this section, we study the effect of (moderate levels of) noise on the accuracy of the epoch detection methods. The existing methods and the proposed method are evaluated on an artificially generated noisy speech database. Several noise environments at varying signal-to-noise ratio (SNR) were simulated to evaluate the epoch detection methods. The noise used was taken from the NOISEX-92 database [32]. The database consists of white, babble, high-frequency (HF) channel, and vehicle noise. The noise from the NOISEX-92 database was added to the Arctic database to form noisy speech data at different levels of degradation. The utterances are appended with silence such that the total amount of silence in each utterance is constrained to be about 60% of the data, including the pauses in the utterances. Including different noise environments and SNRs, the database consists of 33 h of noisy speech data. Table II shows the comparative results of epoch detection methods for different types of degradations at varying SNRs. Fig. 15 shows the distribution of the timing errors $\zeta$
in detecting the epoch locations, for the white noise environment at an SNR of 10 dB. The proposed method consistently performs better than the existing techniques even under degradation.

The improved performance of the proposed method may be attributed to the following reasons.
1) There is no block processing involved in this method. Hence, there are no effects of the size and the shape of the window. The entire speech signal is processed at once to obtain the filtered signal.
2) The proposed method is not dependent on the energy of the signal. This method detects the epoch locations even in weakly voiced regions like voice bars.
3) There is only one parameter involved in the proposed method, i.e., the length of the window for removing the trend from the output of the 0-Hz resonator.
4) There are no critical thresholds or costs involved in identifying the epoch locations.

VII. SUMMARY AND CONCLUSION

In this paper, we proposed a method for epoch extraction that does not depend on the characteristics of the vocal-tract system. The method exploits the impulse-like excitation of the vocal-tract system. The method uses the output of a zero-frequency resonator driven by the speech signal. The positive zero crossings of the filtered signal correspond to epochs. The identification rate and identification accuracy are evaluated using the CMU-Arctic database, where the speech signal and the corresponding EGG signals are available. The epoch information derived from the EGG signals is used as a reference. The performance of the proposed method is compared with the results from DYPSA and two other methods. The proposed method gives significantly better results in terms of identification rate and identification accuracy. It is also interesting to note that the proposed method is robust against degradations such as white noise, babble, high-frequency channel, and vehicle noise.

There are many novel features in the proposed method of epoch extraction. The method does not use any block processing as most signal processing methods do. The performance of the method does not depend on the energy of the segment of speech signal, and hence, the method works equally well for all types of speech sound units. In addition, there are no parameters to control, and no arbitrary thresholding in the identification of epochs.
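The steps summarized above (zero-frequency resonator, trend removal over a window, positive zero crossings) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: differencing the input first, cascading two ideal 0-Hz resonators, repeating the trend removal three times, the 15-ms window, and the function name `zero_frequency_epochs` are all choices made here for illustration. Samples near the signal edges, where the local mean is computed with zero padding, are discarded.

```python
import numpy as np

def zero_frequency_epochs(speech, fs, window_ms=15.0, passes=3):
    """Return (epoch sample indices, filtered signal) for a 1-D signal."""
    x = np.asarray(speech, dtype=float)
    # Assumption: difference the signal first to remove any DC offset.
    d = np.concatenate(([0.0], np.diff(x)))
    # Assumption: two ideal 0-Hz resonators in cascade. One resonator,
    # y[n] = x[n] + 2*y[n-1] - y[n-2], equals two running sums.
    y = d
    for _ in range(4):
        y = np.cumsum(y)
    # Trend removal: subtract the local mean over 2*M + 1 samples.
    # The signal is assumed to be longer than this window.
    M = int(round(0.5 * window_ms * 1e-3 * fs))
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    # Positive-going zero crossings: y[n-1] < 0 and y[n] >= 0.
    idx = np.flatnonzero((y[:-1] < 0) & (y[1:] >= 0)) + 1
    # Drop crossings in the edge regions affected by zero padding.
    edge = passes * M
    keep = (idx > edge) & (idx < len(y) - edge)
    return idx[keep], y

# Hypothetical input: a 100-Hz impulse train sampled at 8 kHz.
fs = 8000
train = np.zeros(4000)
train[40::80] = 1.0
epochs, zff = zero_frequency_epochs(train, fs)
```

For this periodic input the positive zero crossings recur once per period. Which crossing (positive or negative) sits next to the excitation instant depends on the polarity of the input, since inverting the signal inverts the filtered output, so the polarity convention should be checked against a reference such as the differenced EGG signal before relying on the result.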
TABLE II. COMPARISON OF METHODS FOR VARIOUS SNRS AND NOISE ENVIRONMENTS. IDR: IDENTIFICATION RATE, MR: MISS RATE, FAR: FALSE ALARM RATE, IDA: IDENTIFICATION ACCURACY

The method performs well for speech collected with a close-speaking microphone, even with the addition of degradations. However, the method is not likely to work well when the degradations produce additional impulse-like sequences in the collected speech data, as in the case of reverberation. The method is also not likely to work well when there is interference of speech from other speakers. Our future efforts will be in the direction of developing methods for extracting epochs from speech with degradations involving superposed impulse-like characteristics.

Since the proposed method provides accurate locations of epochs, the results are useful to develop methods for pitch extraction, and also for voice activity detection. Also, since the filtered signal gives an indication of glottal activity, the method may be used for analysis of phonation characteristics [33] in normal and pathological voices. The method may also be a useful first step in accurate analysis of vocal-tract characteristics by focusing the attention on the region around the epochs. Accurate analysis of the excitation source and time-varying vocal-tract systems may lead to a better acoustic-phonetic analysis of speech sounds in many languages, and it may also provide a useful supplement to the existing spectral-based methods of speech analysis.