(including velum, tongue, and lips) which modify VT resonances to form the formants of phonated speech. By contrast, unphonated speech lacks a well-defined glottal source of pitch: instead, there is a broadband excitation caused by turbulent airflow exhaled from the lungs [8]. No phonation takes place during whispering, even during production of phonemes that would normally be strongly phonated: whispers require no significant vocal cord vibration, with the vocal cords remaining open. Similarly, vocal cords that are damaged by disease, or which have been surgically removed, present little obstacle to lung exhalation. A consequence of the open vocal cords in whispers is that their spectral peaks¹ have lower energy and are frequency-shifted in relatively predictable ways with respect to their phonated counterparts.

2.2. Processing of Whispers

One disadvantage shared by all whisper-input systems is that whispers, with much lower acoustic power than speech [7] and a relatively flat spectrum, are inherently noise-like. They are thus highly susceptible to acoustic interference, and any system which analyses whispers to determine both time-domain and frequency-domain information needs to be robust to error. In the MELP/CELP based systems mentioned, robustness is typically required for voice onset detection (including voice activity detection and voiced/unvoiced switching) and formant frequency determination. This has been well studied, and in particular, the probability mass function (PMF) formant detection method has been shown to perform well [9].

2.3. Linearity and time invariance

Many of the common speech representation models, including CELP and MELP, assume that both the pitch (glottal) component and the vocal tract component of the final speech can be represented as linear time invariant (LTI) systems [7], and that the two are mutually independent. In fact, the assumption is demonstrably untrue [10] but is considered a reasonable approximation for normal phonated speech. For whispered speech, where the VT excitation tends to consist of turbulent aperiodic airflow from lung exhalation through the open glottis, the 'pitch' filter can no longer be treated as independent of the excitation source [11]. However it is still viable to model the VT shape as an LTI system, given that the nonlinearity
is subsumed by the excitation source. The property of VT linearity is exploited by the proposed system. As will be discussed below, whispers are assumed to be a linear combination of excitation and VT response. They are analysed to determine spectral resonance peaks in the VT response, which are then used in the re-synthesis of voiced speech, excited by artificial glottal excitation.

3. Proposed system

3.1. Formant extraction

Given a vocal tract represented by order-$L$ LPC, the instantaneous VT prediction filter is represented by the current parameter set $a_L$, as is well known:

$$F(z) = \sum_{k=1}^{L} a_k z^{-k} \qquad (1)$$

¹ For convenience we will refer to the spectral peaks in whispers as 'formants', although they do differ in several respects from their voiced counterparts.

Figure 1: Block diagram of reconstruction mechanism.

It is trivial to numerically evaluate a set of roots from $a_L$, and in strongly voiced frames, these roots would tend to correspond to formant frequencies. However in whispered speech, the roots correspond to regions with a much greater positional uncertainty (and lower energy): estimates of whisper formant positions derived from LPC roots are known to be inaccurate [9]. Thus, either simple averaging or a technique such as PMF needs to be employed. In the system proposed here, formant candidates are determined for highly overlapped analysis windows, with time-domain Blackman filtering used to 'smooth' formant transitions between frames.

3.2. Refinement mechanism

The smoothed pole frequencies $F$ and magnitudes $M$ are assigned to formants confined to relatively wide predefined ranges defined by $F_{X\mathrm{low}}$ to $F_{X\mathrm{high}}$ such that $F_X \in [F_{X\mathrm{low}}, F_{X\mathrm{high}}]$ (for formants $X = 1, 2, \ldots, N_s$). The outcome of the formant assignment process is typically $N_s$ formant positions and associated magnitudes. Evidently, not every analysis frame of recorded speech contains meaningful formant information (for example, gaps between words), and thus a judgement is made as to whether a formant candidate is genuine. This is based upon comparing the instantaneous average to the long-term average magnitude $\bar{M}_X$ for each formant candidate $X = 1, 2, \ldots, N_s$:

$$F'_X(n) = \begin{cases} F_X(n) & \text{if } M_X(n) \ge \eta_X \bar{M}_X \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Thus, weak formants are culled. In practice, setting $\eta_X = 2^{(X-5)}$ works reasonably well to quadratically increase the culling for higher formants, to account for the much reduced energy of those higher formants (and hence lower SNR in AWGN); however, it is recognised that a more intelligent criterion may be preferable.

The second refinement is to apply a frequency translation to the extracted formant arrays to counter the well-known frequency difference between whisper resonances and equivalent speech resonances (i.e. formants). The differences are primarily due to the fact that LPC speech analysis measures the average resonance position between open and closed glottis situations (i.e. in normal LPC analysis when the glottis opens and closes rapidly during voiced speech [7]). For whispered analysis, the glottis is always open, and thus the influence of the closed-glottis resonances (a shorter VT) is removed. This phenomenon has been discussed in [3], and the average degree of shift has been determined in [12].

3.3. Reconstruction mechanism

The reconstruction mechanism, shown in Fig. 1, contains two components. As mentioned in Section 2.3, the technique exploits the LTI nature of the VT response by first synthesising standalone formants before modulating them with an artificial glottal signal.

Figure 2: Diagram illustrating double-ended evaluation methods (above) and single-ended evaluation (below).

This approach is unusual, since the human speech production mechanism operates in the opposite sequence, as do prostheses such as electrolarynx and tracheoesophageal puncture (TEP), and previously published MELP/CELP reconstruction methods. The particular reconstruction used here takes the following form, using the previously derived formant locations and magnitudes $F_X$ and $M_X$, as well as the refined formants $F'_X$:

$$S'(z) = \left( \sum_{X=1}^{N_s} M_X \cos(F'_X) + \gamma_U W(z) \right) \cdot P(z) \qquad (3)$$

given scalar gain $\gamma_U$, which allows wide-band high-frequency excitation from the original whisper to be maintained in the reconstructed speech; this is especially important for sibilants. The glottal modulation $P(z)$ is then defined as:

$$P(z) = \max\{ M_1 \zeta, 1 \} \cdot \max\{ \epsilon - |\cos(F_1/\beta)|, 0 \}^2 \qquad (4)$$

where $\zeta$ is a gain setting that relates the depth of glottal modulation to F1 energy, so that less voicing results in reduced modulation depth. Scalar $\beta$
defines a relationship between F1 and f0 (usually in the range 8 to 12). During the reconstruction mechanism, formants begin as pure cosines at the extracted formant frequencies (up to $N_s$ per frame), and of the detected energy levels. These are augmented by the addition of the scaled whisper signal to impart high frequency wideband resonances that are difficult to model with cosines. It is important to note that there is no decision process between V/UV frames: $\gamma_U$ does not vary because, in practice, hard decisions derived from whispers do not work well. The resultant combination is modulated by a clipped, raised cosine glottal signal which is harmonically related to F1, and with the depth of modulation reduced during low energy analysis frames. The degree of clipping affects pitch energy. This artificial glottal modulation is similar in shape to the pitch excitation of legacy vocoders [13].

4. Evaluation

The reconstruction system is evaluated using well-known objective criteria as well as informal listening tests. The chosen objective criteria are firstly two spectrally-relevant distance measures: log-likelihood ratio (LLR) and Itakura-Saito (IS) [14], plus segmental signal-to-noise ratio (SSNR). Apart from these, the system is also evaluated using the most recent ITU-T perceptual evaluation of speech quality (PESQ) standard P.862.2 [15] for double-ended comparison, as well as the single-ended ITU-T speech quality assessment standard P.563 [16]. Both are designed to mimic human subjective responses.

Since the aim of the system is to reconstruct fully phonated speech from whispers, a useful performance criterion would be the quality of the reconstructed speech. However four of the five evaluation measures are double-ended, meaning that a reference signal is required with which to compare the reconstructed output. Normally, the reference would be taken as the input signal. But it makes no sense to compute a distance between whisper input and reconstructed speech output when the system is trying to achieve a 'large' distance from the whisper input.

Therefore, for the double-ended measures, we begin with clean speech $S$ which is then 'whisperised' to generate an artificial whisper signal $W$. The precise method of whisperisation is not described for space reasons, but it closely follows the method of STRAIGHT [17] by removing long-term pitch prediction and re-synthesising with equivalent energy but pseudo-random pitch excitation. Double-ended measures compare artificial speech ($S'$) reconstructed from artificial whispers ($W$) against original speech ($S$). For reference, $W$ and $S$ are also compared. These evaluations, shown in Fig. 2, will be presented in Section 4.2. Single-ended measures directly evaluate scores for real whispers and reconstructed speech.

4.1. Methodology

Given original speech $S$, artificial whisper $W$ and reconstructed speech $S'$, we first use autoregressive modelling to determine corresponding LPC vectors for time-aligned segments of each signal, $a_S$, $a_W$ and $a_{S'}$ respectively, then compute the LLR, which is defined as [14]:

$$d_{LLR} = \log \frac{a_{S'} R_S a_{S'}^T}{a_S R_S a_S^T} \qquad (5)$$

where $R_S$ is the speech autocorrelation matrix. Similarly, the well-known IS distance measure is computed from the same raw input data as follows:

$$d_{IS} = \frac{\sigma_S^2}{\sigma_{S'}^2} \cdot \frac{a_{S'} R_S a_{S'}^T}{a_S R_S a_S^T} + \log \frac{\sigma_S^2}{\sigma_{S'}^2} - 1 \qquad (6)$$

where $\sigma_S^2$ and $\sigma_{S'}^2$ denote LPC gains of the original and reconstructed speech, respectively, obtained from $1/F(e^{j\omega T})$ where $\omega T = 2\pi k / N_r$ for $k = 0, 1, \ldots, N_r - 1$, for a frequency resolution of $F_s/2N_r$ Hz at sample frequency $F_s$. Since it is unclear which signal should be considered the reference and which is considered degraded, and the IS measure is not symmetrical, we report average scores found from both directions.

4.2. Scores

The system is operated to reconstruct up to four formants ($N_s = 4$), with a constant harmonic relationship between pitch fundamental and F1 set by $\beta = 10$. Highly overlapped 128-sample analysis windows shift 16 samples between analysis iterations.
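As a rough illustration of the analysis settings just listed (128-sample windows shifted by 16 samples, with formant candidates taken from LPC roots as in Section 3.1), the following minimal NumPy sketch may help. It is not the authors' code: the function names, the Hamming window, the small regularisation term and the default order and sample rate passed as arguments are all assumptions made for illustration, and the Blackman smoothing and PMF steps are omitted.

```python
import numpy as np

def frames(x, win=128, hop=16):
    """Split x into highly overlapped analysis windows (win samples, hop shift)."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def lpc(frame, order):
    """Autocorrelation-method LPC; returns predictor coefficients a_1..a_L."""
    w = frame * np.hamming(len(frame))          # window choice is an assumption
    r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags] + 1e-9 * r[0] * np.eye(order)   # light regularisation (assumption)
    return np.linalg.solve(R, r[1 : order + 1])

def formant_candidates(frame, order=8, fs=8000):
    """Candidate frequencies (Hz) and pole radii from the roots of 1 - F(z)."""
    a = lpc(frame, order)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[roots.imag > 0]               # keep one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    idx = np.argsort(freqs)
    return freqs[idx], np.abs(roots)[idx]
```

Here the pole radius stands in for the candidate magnitude; the paper's own magnitude measure may differ.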
The remaining parameters are $\zeta = 1/8$, $\gamma_U = 0.8$ and $\epsilon = 0.5$. The analysis order is $L = 8$ and a sample rate of 8 kHz is used throughout.

A randomly selected set of 16 balanced male and female TIMIT sentences from DR1 to DR8 were used for the double-ended evaluations. For single-ended evaluation (i.e. P.563), whispered and spoken TIMIT sentences were recorded in an anechoic chamber and reconstruction is from the whispers alone. As discussed above, double-ended scores are between the TIMIT input and speech reconstructed from whisperised versions of the input speech. Results are shown in Table 1.

The results for SSNR, LLR, IS and P.563 indicate that the reconstructed speech $S'$ is either more 'speech-like' or has higher estimated MOS than $W$. This shows that the reconstructed speech, while far from perfect, is better than the whispers. The P.862.2 results are mixed: on an absolute scale, none of the scores are good. Furthermore, raw MOS for S→W is better than that for S→S'. However the S→S' listener quality objective (LQO) score does outperform that of S→W. For comparison, note that the scores for EL speech tend to be worse, with the mean W→EL being 0.178 / 1.048. Despite this, it is interesting to note that P.563 scores for EL speech are, in general, extremely high (often exceeding 4 and can be better than the clean speech).

Figure 3: Spectrograms of a TIMIT utterance showing original speech $S$, artificial whisper $W$ and reconstructed speech $S'$ from top to bottom, respectively.

Table 1: Mean objective evaluation scores for original speech $S$, whispers $W$ and reconstructed speech $S'$. P.862 scores are shown as Raw MOS / LQO. LLR, SSNR, IS and P.862 are double-ended measures; P.563 is single-ended.

| Test | LLR | SSNR | IS | P.862 | P.563 |
|------|-------|-------|-------|---------------|------------|
| S→W | 0.827 | 26.92 | 12.70 | 1.234 / 0.559 | S = 3.620 |
| W→S' | 0.696 | 22.74 | 3.09 | 0.958 / 0.589 | W = 2.864 |
| S→S' | 0.789 | 25.55 | 10.44 | 0.680 / 0.648 | S' = 3.394 |
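To make the double-ended distance measures of Section 4.1 concrete, the sketch below computes an LLR and an IS distance for a single pair of time-aligned segments, and averages the IS distance over both directions. It is only an illustration under stated assumptions: the function names are hypothetical, the LPC gain is taken here as the prediction-error energy of each segment (the paper obtains gains from the LPC spectrum instead), and no framing or averaging across frames is shown.

```python
import numpy as np

def _lpc_and_corr(x, order=8):
    """Return inverse-filter vector a = [1, -a_1, ..., -a_L], the autocorrelation
    matrix R (Toeplitz, lags 0..L) and the prediction-error energy a R a^T."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    lags = np.abs(np.subtract.outer(np.arange(order + 1), np.arange(order + 1)))
    R = r[lags]
    pred = np.linalg.solve(R[:-1, :-1], r[1:])
    a = np.concatenate(([1.0], -pred))
    return a, R, float(a @ R @ a)

def llr(ref, test, order=8):
    """Log-likelihood ratio with ref as the reference segment."""
    a_ref, R_ref, _ = _lpc_and_corr(ref, order)
    a_test, _, _ = _lpc_and_corr(test, order)
    return float(np.log((a_test @ R_ref @ a_test) / (a_ref @ R_ref @ a_ref)))

def itakura_saito(ref, test, order=8):
    """Itakura-Saito distance; gains are prediction-error energies (assumption)."""
    a_ref, R_ref, g_ref = _lpc_and_corr(ref, order)
    a_test, _, g_test = _lpc_and_corr(test, order)
    ratio = (a_test @ R_ref @ a_test) / (a_ref @ R_ref @ a_ref)
    return float((g_ref / g_test) * ratio + np.log(g_ref / g_test) - 1.0)

def symmetric_is(x, y, order=8):
    """Average of the IS distance taken in both directions."""
    return 0.5 * (itakura_saito(x, y, order) + itakura_saito(y, x, order))
```

Because the reference LPC vector minimises the quadratic form for its own autocorrelation matrix, the LLR in this sketch is never negative and is zero when both segments are identical.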
Spectrograms are plotted for $S$, $W$ and $S'$ in Fig. 3 for TIMIT sentence SI1154. The original speech contains a large amount of energy in low frequency regions of voiced phonemes. Several formants are visible. By contrast, $W$ is mainly diffuse high-frequency energy. Many formants are still visible, but f0 is virtually absent. $S'$ obviously accentuates the energy of all formants, and has increased f0 energy over the whispers through artificial glottal modulation. Since the modulation relates to F1 both harmonically and in terms of energy, it tends to match the low frequency energy concentrations seen in $S$. From these spectrograms, it could be argued that the relative contribution of f0 and $\{F_1, \ldots, F_4\}$ should be better balanced by decreasing $\gamma_U$ and increasing $\epsilon$. In fact, there are many tuneable parameters within this reconstruction model; the results space is relatively unexplored at the present time. For reference, example recordings of $S$, $W$ and $S'$ accompany this paper.

5. Conclusion

This paper has presented a new method of recreating speech from whispers by reversing the LTI source-filter speech production model: artificially generated cosine formants are an excitation source for an artificially generated glottal modulation. Frequencies and amplitudes for the artificial formants are derived from translated versions of highly oversampled, time-domain smoothed and culled LPC root tracks. The artificial glottal modulation is a harmonic derivative of F1, with depth of modulation reduced for analysis frames exhibiting only low energy F1 LPC roots. The high frequency wideband distribution particularly found in sibilant fricatives is contributed through the constant addition of the original whisper; no frame-by-frame decision process takes place. The system does not require a priori or speaker-dependent information and is a low-complexity frame-by-frame processing approach. Results show that in most cases, the reconstructed speech exhibits improved quality over the equivalent whispered or artificially whisperised speech.

6. Acknowledgements

Support for this work is gratefully acknowledged from the Fundamental Research Funds for the Central Universities, China, grant no. WK2100000002. Thanks are also given to Dr Hamid Reza Sharifzadeh for material used in this research.
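For readers who want to experiment with the reconstruction summarised in the conclusion, a minimal time-domain sketch of the culling rule (2) and the synthesis steps (3) and (4) follows. Several points are assumptions rather than details taken from the paper: the symbol names `eta`, `gamma_u`, `zeta`, `eps` and `beta`, the reading of the culling threshold as 2^(X-5), the interpretation of cos(F) as a cosine whose instantaneous frequency is F in Hz, and the choice to silence culled formants rather than let them contribute a constant. The whisper-to-speech frequency translation and the smoothing stages are omitted.

```python
import numpy as np

def cull(F, M):
    """Eq. (2) sketch: F and M have shape (Ns, N), one row per formant track.
    Frequencies whose magnitude falls below eta_X times the long-term mean
    magnitude are set to zero."""
    X = np.arange(1, F.shape[0] + 1)[:, None]
    eta = 2.0 ** (X - 5)                         # assumed reading of the threshold
    M_bar = M.mean(axis=1, keepdims=True)
    return np.where(M >= eta * M_bar, F, 0.0)

def reconstruct(F, M, W, fs=8000, beta=10.0, zeta=1 / 8, gamma_u=0.8, eps=0.5):
    """Eqs. (3) and (4) sketch: cosine formants plus scaled whisper W,
    modulated by a clipped, raised-cosine glottal signal derived from F1."""
    Fc = cull(F, M)
    phase = 2 * np.pi * np.cumsum(Fc, axis=1) / fs
    # Culled formants are silenced here (assumption).
    formants = np.sum(np.where(Fc > 0, M * np.cos(phase), 0.0), axis=0)
    f1_phase = 2 * np.pi * np.cumsum(F[0] / beta) / fs
    depth = np.maximum(M[0] * zeta, 1.0)
    P = depth * np.maximum(eps - np.abs(np.cos(f1_phase)), 0.0) ** 2
    return (formants + gamma_u * W) * P
```

With `eps = 0.5` the second factor of the modulation never exceeds 0.25, so it mainly gates the formant sum on and off at a rate tied to F1 through `beta`.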