The Centre for Speech Technology Research, University of Edinburgh, UK
joao.cabral@ucd.ie, s.renals@ed.ac.uk, jyamagis@inf.ed.ac.uk, korin@cstr.ed.ac.uk

Abstract

Control over voice quality, e.g. breathy and tense voice, is important for speech synthesis applications. For example, transformations can be used to modify aspects of the voice related to the speaker's identity and to improve expressiveness. However, it is hard to modify the voice characteristics of synthetic speech without degrading speech quality. State-of-the-art statistical speech synthesisers, in particular, do not typically allow control over parameters of the glottal source, which are strongly correlated with voice quality. Consequently, the control of voice characteristics in these systems is limited. In contrast, the HMM-based speech synthesiser proposed in this paper uses an acoustic glottal source model. The system passes the glottal signal through a whitening filter to obtain the excitation of voiced sounds. This technique, called glottal post-filtering, makes it possible to transform the voice characteristics of the synthetic speech by modifying the source model parameters.

We evaluated the proposed synthesiser in a perceptual experiment, in terms of speech naturalness, intelligibility, and similarity to the original speaker's voice. The results show that it performed as well as an HMM-based synthesiser which generates the speech signal with a commonly used high-quality speech vocoder.

Index Terms: HMM-based speech synthesis, voice quality, glottal post-filter

1. Introduction

Concatenation-based speech synthesis provides very little parametric flexibility to transform voice quality, because speech is synthesised by joining recorded units. In contrast, HMM-based speech synthesisers use a parametric model of speech. The typical model in these systems passes a spectrally flat excitation through a synthesis filter which represents the spectral envelope. In this method, the excitation of unvoiced speech is typically modelled by white noise, while the voiced excitation is modelled by a periodic impulse train. The significant advantage of this speech representation is that the spectral envelope can be calculated efficiently, e.g. by linear prediction or cepstral analysis. The drawback is the poor representation of the glottal source, which limits voice quality modelling and may produce unnatural speech quality. Excitation models better than the impulse train have been proposed to improve speech naturalness in HMM-based speech synthesis, e.g. [1, 2]. Such models capture more detail of the source, such as aperiodicity aspects, but they do not represent the glottal pulse characteristics.

According to the theory of speech production, voiced speech can be obtained by passing a glottal source model through a synthesis filter which represents the vocal tract system. This speech model differs from the one in which an impulse response represents the spectral envelope. The main problem with the source-tract model is that the methods to estimate the glottal source and the vocal tract filter are typically less robust than those used to estimate the spectral envelope. Nevertheless, this type of speech model has been used successfully in HMM-based synthesis. For example, the system in [3] models the glottal source and the vocal tract filter using LPC parameters, obtained by iterative adaptive inverse filtering [4]. In the synthesis part, the excitation is obtained by transforming a real glottal pulse to have the desired duration and spectral characteristics, using F0 and the glottal parameters, respectively. In this system, voice transformations could be performed using a library of glottal pulse shapes for different voice qualities.
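As background, the conventional excitation scheme described at the start of this section (a spectrally flat excitation, impulse train for voiced frames and white noise for unvoiced frames, passed through a spectral-envelope synthesis filter) can be sketched as follows. This is a minimal illustration only, not the system proposed in the paper; the LPC coefficients, sampling rate, F0 and frame length are hypothetical placeholder values.

```python
import numpy as np
from scipy.signal import lfilter

def conventional_excitation(n_samples, fs, f0, voiced):
    """Spectrally flat excitation: periodic impulse train if voiced, white noise otherwise."""
    if voiced:
        e = np.zeros(n_samples)
        period = int(round(fs / f0))          # pitch period in samples
        e[::period] = 1.0                     # unit impulse at every pitch mark
    else:
        e = np.random.randn(n_samples)        # white noise for unvoiced frames
    return e

# Hypothetical all-pole spectral envelope (a[0]=1; in practice a would come from LPC analysis)
a = np.array([1.0, -1.8, 0.9])                # placeholder 2nd-order envelope, stable poles
fs = 16000
excitation = conventional_excitation(n_samples=400, fs=fs, f0=120.0, voiced=True)
speech_frame = lfilter([1.0], a, excitation)  # pass the excitation through the synthesis filter
```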
In previous work [5], we proposed to represent the excitation of the HMM-based speech synthesiser with a spectrally flat signal, obtained by passing an acoustic glottal source model, the Liljencrants-Fant (LF) model [6], through a post-filter. We call this operation glottal post-filtering. Results of a perceptual test showed that speech synthesised with the post-filtered LF-model sounded more natural than speech synthesised using the impulse train.

In this paper, we propose another HMM-based speech synthesiser that uses a synthesis method with glottal post-filtering. The results showed that this system performs similarly to a version of the synthesiser that uses the high-quality speech vocoder STRAIGHT [7]. The proposed system has the advantage of allowing the parameters of the source model to be modified in order to transform the voice quality of the synthetic speech. The technique used to control the pitch of the output speech within the glottal post-filtering framework is also improved in this work.

2. Liljencrants-Fant model

2.1. Waveform

The Liljencrants-Fant (LF) model [6] is an acoustic model of the glottal source signal (the derivative of the glottal flow):

  e_{LF}(t) = E_0 \, e^{\alpha t} \sin(\omega_g t),                                   0 \le t \le t_e,
  e_{LF}(t) = -\frac{E_e}{\epsilon T_a} \left[ e^{-\epsilon (t - t_e)} - e^{-\epsilon (t_c - t_e)} \right],   t_e < t \le t_c \le T_0,        (1)

where \omega_g = \pi / t_p. The LF-model is defined by six shape parameters: t_c, t_p, t_e, T_a, T_0, and E_e. The remaining parameters (E_0, \alpha and \epsilon) can be calculated by using the energy and continuity constraints, which are given by \int_0^{T_0} e_{LF}(t) \, dt = 0 and e_{LF}(t_e^-) = e_{LF}(t_e^+) = -E_e, respectively. The first branch of equation (1) starts at the instant of glottal opening, t_o = 0. […]

[…] with the spectral envelope, H(\omega). Glottal post-filtering is used to generate the excitation of voiced speech by passing the LF-model signal through the glottal post-filter (GPF), F(\omega). This filter transforms the input LF-model signal into the spectrally flat excitation. Speech synthesised with this excitation model can be represented by

  Y(\omega) = E_{LF}(\omega) F(\omega) H(\omega),        (7)

where E_{LF}(\omega) is the Fourier Transform (FT) of the LF-model.

3.2. Glottal Post-Filter Calculation

The stylised spectrum of the GPF is described by three linear segments, whose slopes are symmetric to the slopes of the LF-model spectrum. The stylised spectrum that corresponds to the filter transfer function is shown in Figure 2b). The parameters of the GPF, the frequencies […] close to the minimum pitch period of the speaker. This avoids problems in synthesising speech with high F0 values, as explained in the next section.

The parameter F_g of the GPF can be calculated using (5). For this, the integral I of the LF-model has to be calculated. First, the LF-model waveform is obtained from (1) by setting T_a = 0. The return phase is not considered, because the authors in [9] assume that F_g does not depend on T_a in (5). Next, the resulting LF-model waveform is integrated to obtain the glottal flow pulse, u_{LF}(n), associated with the LF-model. I is calculated as the integral of u_{LF}(n). Finally, the frequency F_g is calculated as F_g = \frac{1}{2\pi} \sqrt{E_e F_s / I}, where F_s is the sampling frequency. The other parameter used to obtain the GPF is the frequency F_c, which is calculated from the LF-parameter […]

Figure 3: Waveforms of the reference LF-model and the LF-model obtained by increasing the SQ of the reference LF-model by 40%.

[…] model, the spectral tilt decreases (lower attenuation at the higher frequencies). The variations in the spectrum of the LF-model signal produce similar changes in the spectrum of the synthetic […]

[…] are the weighting functions of the periodic and noise components, respectively. For statistical modelling, the system uses a 5-state context-dependent HMM. Each feature vector contains the static and dynamic features (\Delta and \Delta^2) of the spectrum and excitation: 39th-order mel-cepstral coefficients (obtained from the FFT coefficients), five aperiodicity parameters (mean values of the aperiodicity measurement in five frequency bands), and log F0. The state output probability distribution used to model each speech […]
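The LF-model waveform of equation (1) and the F_g computation described in Section 3.2 above can be sketched numerically as follows. This is a minimal sketch, not the paper's implementation: \alpha is taken as a given input (in practice it follows from the energy constraint \int_0^{T_0} e_{LF}(t) dt = 0, not solved here), \epsilon and E_0 are obtained from the continuity constraint e_{LF}(t_e^-) = e_{LF}(t_e^+) = -E_e, and the discrete integration and scaling conventions for I are assumptions. It also assumes T_a < t_c - t_e, the usual case.

```python
import numpy as np
from scipy.optimize import brentq

def lf_waveform(fs, T0, tp, te, ta, tc, Ee, alpha):
    """Sample one period of the LF-model, eq. (1).

    alpha is an input here; in practice it is determined by the energy
    constraint  integral_0^T0 e_LF(t) dt = 0  (not solved in this sketch).
    """
    wg = np.pi / tp                                   # w_g = pi / t_p
    # epsilon from continuity of the return phase, e_LF(t_e+) = -Ee:
    #   epsilon * Ta = 1 - exp(-epsilon * (tc - te)),  root in (0, 1/Ta)
    f = lambda eps: eps * ta - 1.0 + np.exp(-eps * (tc - te))
    eps = brentq(f, 1e-9, 1.0 / ta)
    # E0 from continuity of the open phase, e_LF(t_e-) = -Ee
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
    t = np.arange(int(round(T0 * fs))) / fs
    e = np.zeros_like(t)
    open_phase = t <= te
    ret_phase = (t > te) & (t <= tc)
    e[open_phase] = E0 * np.exp(alpha * t[open_phase]) * np.sin(wg * t[open_phase])
    e[ret_phase] = -(Ee / (eps * ta)) * (np.exp(-eps * (t[ret_phase] - te))
                                         - np.exp(-eps * (tc - te)))
    return e

def glottal_formant_fg(fs, tp, te, Ee, alpha):
    """F_g as described in Sec. 3.2: open phase only (Ta = 0), integrate to get the
    glottal flow pulse, then F_g = (1/2pi) * sqrt(Ee * Fs / I)."""
    wg = np.pi / tp
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
    t = np.arange(int(round(te * fs))) / fs
    e = E0 * np.exp(alpha * t) * np.sin(wg * t)       # LF waveform without return phase
    u = np.cumsum(e) / fs                             # glottal flow pulse u_LF(n)
    I = np.sum(u)                                     # sum of u_LF(n); so Ee*Fs/I ~ Ee / integral(u) dt
    return np.sqrt(Ee * fs / I) / (2.0 * np.pi)
```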
[…], and RQ. An estimate of the maximum F0 of the speaker was also calculated. The parameter t_e of the reference LF-model was set approximately equal to the minimum T_0 of the speaker. Then, the other time parameters of the reference LF-model were calculated by using the mean values of the dimensionless parameters and equations (2) to (4). In this way, the reference LF-model was short enough to avoid the problem of synthesising high-pitched speech, and the dimensionless parameters were equal to the mean values obtained from the measurements. The GPF was implemented as a linear-phase FIR filter, to preserve the phase information of the LF-model.

4.2.2. Synthesis

The STRAIGHT synthesis method used by the HMM-based speech synthesiser was replaced by the synthesis method with glottal post-filtering. Speech is synthesised as shown in the block diagram of Figure 5. This method also uses a multi-band mixed excitation model, which is represented by

  X(\omega) = K_e E_{LF}(\omega) F(\omega) W_p(\omega) + N(\omega) W_a(\omega),        (9)

where E_{LF}(\omega) is the FT of a periodic LF-model signal, F(\omega) represents the transfer function of the GPF, N(\omega) is the FT of the noise signal, and K_e is a scale factor to match the energy of the periodic excitation to the energy of the noise. W_p(\omega) and W_a(\omega) represent the weighting functions of STRAIGHT, which are obtained from the aperiodicity parameters.

The periodic component of the excitation is the concatenation of two LF-model signals, which start at the instant of maximum excitation t_e. These signals are obtained by adjusting the length of the reference LF-model (by truncating or padding with zeros) to the target T_0 = 1/F0. That is, for synthesising speech frame i, the first LF-model has duration T_0^{i-1} (equal to the period of the previous frame) and the second has duration T_0^i. The resulting LF-model waveform is approximately centered at the instant of maximum excitation. […]

[…] median for the first two tasks and the mean for the WER task. In the MOS and SIM plots, the median is represented by a solid bar across a box showing the quartiles.
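The frame-level excitation construction described in Section 4.2.2 can be sketched as follows, under stated assumptions: a precomputed reference LF-model signal starting at t_e, a linear-phase FIR impulse response for the GPF, and per-bin weighting spectra standing in for the STRAIGHT weighting functions. The function and variable names are hypothetical, and the energy matching and noise handling are simplified relative to the paper.

```python
import numpy as np

def adjust_length(ref_lf, period_samples):
    """Truncate or zero-pad the reference LF-model signal (starting at t_e)
    to one target pitch period, as described in Sec. 4.2.2."""
    if len(ref_lf) >= period_samples:
        return ref_lf[:period_samples]
    return np.pad(ref_lf, (0, period_samples - len(ref_lf)))

def mixed_excitation_frame(ref_lf, gpf_ir, noise, T0_prev, T0_curr, Wp, Wa):
    """One frame of the mixed excitation, eq. (9):
       X(w) = Ke * E_LF(w) F(w) Wp(w) + N(w) Wa(w).
    ref_lf          : reference LF-model signal starting at the instant of maximum excitation t_e
    gpf_ir          : impulse response of the glottal post-filter (linear-phase FIR)
    noise           : white-noise segment defining the frame length
    T0_prev, T0_curr: pitch periods of the previous and current frame, in samples
    Wp, Wa          : weighting spectra (e.g. from aperiodicity parameters), length n_fft//2 + 1
    """
    # Periodic component: two length-adjusted LF pulses, previous then current period
    periodic = np.concatenate([adjust_length(ref_lf, T0_prev),
                               adjust_length(ref_lf, T0_curr)])
    periodic = np.convolve(periodic, gpf_ir)[:len(noise)]   # E_LF(w) F(w) realised in the time domain
    # Ke: match the energy of the periodic excitation to the energy of the noise
    Ke = np.sqrt(np.sum(noise ** 2) / (np.sum(periodic ** 2) + 1e-12))
    # Combine the weighted components in the frequency domain
    n_fft = len(noise)
    X = Ke * np.fft.rfft(periodic, n_fft) * Wp + np.fft.rfft(noise, n_fft) * Wa
    return np.fft.irfft(X, n_fft)
```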