IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 6, NOVEMBER 2000

A Computationally Efficient Multipitch Analysis Model

Tero Tolonen, Student Member, IEEE, and Matti Karjalainen, Member, IEEE

Abstract: A computationally efficient model for multipitch and periodicity analysis of complex audio signals is presented. The model essentially divides the signal into two channels, below and above 1000 Hz, computes a “generalized” autocorrelation of the low-channel signal and of the envelope of the high-channel signal, and sums the autocorrelation functions. The summary autocorrelation function (SACF) is further processed to obtain an ...

Fig. 1. Block diagram of the Meddis–O’Mard model [17].

The paper is organized as follows. In Section II, the proposed pitch analysis model is introduced and compared to pitch perception models reported in the literature. Section III describes how the periodicity representation can be enhanced so that periodicities may be more easily investigated. Section IV discusses the model parameters and shows with examples how they affect the behavior of the model, and Section V demonstrates the model performance in multipitch determination. Finally, Section VI concludes the paper with a summary and discussion.

II. PITCH ANALYSIS MODEL

A. Multichannel Pitch Analysis

In many recent models of human perception, the key component is a filterbank that simulates the behavior of the cochlea. The filterbank separates a sound signal into subband channels that have bandwidths corresponding to the frequency resolution of the cochlea. A common choice is to use a gammatone filterbank [22] with channels corresponding to the equivalent rectangular bandwidth (ERB) channels of human audition [23].

Fig. 1 depicts the pitch perception model of Meddis and O’Mard [17] that uses the filterbank approach. The input signal is first divided into 40–128 channels depending on the implementation [16], [17], [21]. The signal in each channel is half-wave rectified and lowpass filtered. Essentially, this step corresponds to the detection of the envelope of the signal in each channel. From the envelope signals, a periodicity measure, such as the autocorrelation function (ACF), is computed within each channel. Finally, the ACFs are summed across the channels to yield a summary autocorrelation function (SACF) that is used in pitch analysis.

In studies that have applied the pitch analysis paradigm of Fig. 1, several implementations are reported. In some systems, pre-processing of the signal is performed before the signal enters the filterbank. For instance, in [17] a bandpass filter is used for simulating the middle ear transfer function. Also in [17], the half-wave rectification and lowpass filtering block is replaced with a block that estimates the probability of neural activation in each channel. In [24], an automatic gain control block is added after the half-wave rectification and the lowpass filtering is removed.

There are several approaches for computation of the autocorrelation or a similar periodicity measure within each of the channels. The time-domain approach is a common choice [16], [17], [21]. In these systems, an exponential window is applied with a window time constant that varies from 2.5 ms [17] to 25 ms [21]. Our experiments have determined that the effective length of the window should be approximately 10–30 ms so that the window spans more than one period of the pitched tone for all fundamental periods that are in the range of interest. Ellis [21] applies a logarithmic scale of the autocorrelation lag with approximately 48 samples for each octave. He motivates the use of such a scale by its better resemblance to the pitch detection resolution of the human auditory system. This requires interpolation of the signals in the channels. Ellis notes that half-wave rectification is preferred over full-wave rectification in order to suppress octave errors.

Some of the pitch analysis systems prefer to use a discrete Fourier transform (DFT) based autocorrelation computation for computational efficiency [24]. This approach also allows for processing of the signal in the frequency domain. As discussed below, nonlinear compression of the DFT magnitude may be used to enhance the performance of the pitch analysis. Such a compression is not readily implementable in a time-domain system.

Although the unitary pitch perception model of Meddis and O’Mard has been widely adopted, some studies question the general validity of the unitary pitch perception paradigm. Particularly, it has been suggested that two mechanisms for pitch perception are required: one for resolved harmonics and one for unresolved harmonics [25], [26]. The term resolved harmonics refers to the case when only one or no components fall within the 10-dB-down bandwidth of an auditory filter [27]. In the other case, the components are said to be unresolved. The present study does not attempt to answer the question on the pitch detection mechanism of the human auditory system. In fact, the proposed model has only two channels and does not attempt directly to follow human resolvability. Interestingly enough, as shown in the following subsection, the model still produces results that are qualitatively similar and comparable to those of the more elaborate multichannel pitch analysis systems.

The computational demands of multichannel pitch analysis systems have prohibited their use in practical applications where real-time performance is typically required. The computational requirements are mostly determined by the number of channels used in the filterbank. This motivates the development of the simplified model of pitch perception presented below, which is more suitable for practical applications and still qualitatively retains the performance of multichannel systems.

B. Two-Channel Pitch Analysis

A block diagram of the proposed two-channel pitch analysis model is illustrated in Fig. 2. The first block is a pre-whitening filter that is used to remove short-time correlation of the signal. The whitening filter is implemented using warped linear prediction (WLP) as described in [28]. The WLP technique works as ordinary linear prediction except that it implements critical-band auditory resolution of spectral modeling instead of uniform frequency resolution, and can be used to reduce the filter order considerably. A WLP filter of 12th order is used here with a sampling rate of 22 kHz, Hamming windowing, a frame size of 23.2 ms, and a hop size of 10.0 ms. Inverse filtering with the WLP model yields the pre-whitened signal. To a certain extent, the whitening filter may be interpreted as functionally similar to the normalization of the hair cell activity level toward spectral flattening due to the adaptation and saturation effects [29], [30].

Fig. 2. Block diagram of the proposed pitch analysis model.

The functionality of the middle part of Fig. 2 corresponds to that of the unitary multichannel pitch analysis model of Fig. 1. The signal is separated into two channels, below and above 1 kHz. The channel separation is done with filters that have 12 dB/octave attenuation in the stop-band. The lowpass block also includes a highpass rolloff with 12 dB/octave below 70 Hz. The high-channel signal is half-wave rectified and lowpass filtered with a filter similar to that used for separating the low channel (including the highpass characteristic at 70 Hz).

The periodicity detection is based on “generalized autocorrelation,” i.e., the computation consists of a discrete Fourier transform (DFT), magnitude compression of the spectral representation, and an inverse transform (IDFT). The signal $x_2$ in Fig. 2 corresponds to the SACF of Fig. 1 and is obtained as

$$x_2 = \mathrm{IDFT}\left(|\mathrm{DFT}(x_{\mathrm{low}})|^k\right) + \mathrm{IDFT}\left(|\mathrm{DFT}(x_{\mathrm{high}})|^k\right) = \mathrm{IDFT}\left(|\mathrm{DFT}(x_{\mathrm{low}})|^k + |\mathrm{DFT}(x_{\mathrm{high}})|^k\right) \tag{1}$$

where $x_{\mathrm{low}}$ and $x_{\mathrm{high}}$ are the low and high channel signals before the periodicity detection blocks in Fig. 2. The parameter $k$ determines the frequency-domain compression [31]. For normal autocorrelation $k = 2$ but, as detailed in Section IV, it is advantageous to use a value smaller than 2. Note that periodicity computation using the DFT allows the control of the parameter $k$ or the use of some other nonlinear processing of the frequency transform, e.g., application of the natural logarithm, resulting in the cepstrum. This is not directly possible with time-domain periodicity detection algorithms. The fast Fourier transform (FFT) and its inverse (IFFT) are typically used to speed up the computation of the transforms. The last block of Fig. 2 presents the processing of the SACF (denoted $x_2$). This part of the algorithm is detailed in Section III.

Before comparing the performance of the models of Figs. 1 and 2, it is instructive to study the sensitivity to the phase properties of the signal in pitch analysis when using a multichannel model or a two-channel model. In the two-channel case, the low channel is phase-insensitive (except for the windowing effects) due to the autocorrelation [notice the modulus in (1)]. However, the high channel is phase-sensitive since it follows the amplitude envelope of the signal in the frequency band above 1000 Hz. Thus, all phase sensitivity in our model is inherently caused by the high channel. This is different from the Meddis–Hewitt model, where all channels are phase-sensitive since they follow the envelope of the signal in the corresponding frequency band. However, when lower channels resolve the harmonics, the difference is relatively small since in that case the autocorrelation computation removes the phase sensitivity.

C. Comparison of Multichannel and Two-Channel Models

The performance and validity of the proposed two-channel SACF model (without pre-filtering and pre-whitening, using running autocorrelation similar to [17]) in pitch periodicity analysis is evaluated here by a comparison with the multichannel SACF model of Meddis and Hewitt. The AIM software [32] was used to compute the Meddis–Hewitt SACFs. The test signals were chosen according to [17].

Fig. 3. Comparison of the SACF functions of the two models using the “musical chord” test signal. The two-channel SACF is plotted on the top and the Meddis–Hewitt SACF on the bottom.

The results of the “musical chord” experiment [17] with the two-channel and the multichannel models are illustrated in the top and the bottom plots of Fig. 3, respectively. In this case, the test signal consisted of three harmonic signals with fundamental frequencies 392.0, 523.2, and 659.2 Hz, corresponding to tones G4, C5, and E5, respectively. The G4 tone consisted of the first four harmonics, and the C5 and E5 tones contained the first three harmonics each. All the harmonic components were of equal amplitude. Both models exhibit an SACF peak at a lag of 7.7 ms. This corresponds to a frequency of 130 Hz (tone C3), which is the root tone of the chord. The waveforms of the two summary autocorrelation functions are similar although the scales differ. While it is only possible to report this experiment here, the models behave similarly with a broader range of test signals. More examples of SACF analysis are available at http://www.acoustics.hut.fi/~ttolonen/pitchAnalysis/.

III. ENHANCING THE SUMMARY AUTOCORRELATION

The peaks in the SACF curve produced as output $x_2$ of the model in Fig. 2 are relatively good indicators of potential pitch periods in the signal being analyzed, as shown in Fig. 3. However, such a summary periodicity function contains much redundant and spurious information that makes it difficult to estimate which peaks are true pitch peaks. For instance, the autocorrelation function generates peaks at all integer multiples of the fundamental period. Furthermore, in the case of musical chords the root tone often appears very strong, though in most cases it should not be considered as the fundamental period of any source sound. To be more selective, a peak pruning technique similar to [21], but computationally more straightforward, is used in our model.

Fig. 4. An example of multipitch analysis. A test signal with three clarinet tones with fundamental frequencies 147, 185, and 220 Hz, and relative rms values of 0.4236, 0.7844, and 1, respectively, was analyzed. Top: two-channel SACF, bottom: two-channel ESACF.

The technique is the following. The original SACF curve, as demonstrated above, is first clipped to positive values and then time-scaled (expanded in time) by a factor of two and subtracted from the original clipped SACF function, and again the result is clipped to have positive values only. This removes repetitive peaks with double the time lag where the basic peak is higher than the duplicate. It also removes the near-zero time lag part of the SACF curve. This operation can be repeated for time lag scaling with factors of three, four, five, etc., as far as desired, in order to remove higher multiples of each peak. The resulting function is called here the enhanced summary autocorrelation (ESACF).

An illustrative example of the enhanced SACF analysis is shown in Fig. 4 for a signal consisting of three clarinet tones. The fundamental frequencies of the tones are 147, 185, and 220 Hz. The SACF is depicted on the top and the enhanced SACF curve on the bottom, showing clear indication of the three fundamental periodicities and no other peaks. We have experimented with different musical chords and source instrument sounds. In most cases, sound combinations of two to three sources are resolved quite easily if the amplitude levels of the sources are not too different. For chords with four or more sources, the subsignals easily mask each other so that some of the sources are not resolved reliably. One further idea to improve the pitch resolution with complex mixtures, especially with relatively different amplitudes, is to use an iterative algorithm, whereby the most prominent sounds are first detected and filtered out (see Section VI) or attenuated properly, and then the pitch analysis is repeated for the residual.

IV. MODEL PARAMETERS

The model of Fig. 2 has several parameters that affect the behavior of pitch analysis. In the following, we show with examples the effect of each parameter on the SACF and ESACF representations. In most cases, it is difficult to obtain the correct value for a parameter from the theory of human perception. Rather, we attempt to obtain model performance that is similar to that of human perception or approximately optimal based on visual inspection of analysis results. The suggested parameter values are collected in Table I.

TABLE I. Parameter values for the proposed model.

In the following section, there are two types of illustrations. In Figs. 5 and 7, the SACFs of one frame of the signal are plotted on the top, and one or two ESACFs that correspond to the parameter values that we found best suited are depicted on the bottom. In Figs. 6 and 8, consecutive ESACFs are illustrated as a function of time. This representation is somewhat similar to the spectrogram: time is shown on the horizontal axis, ESACF lag on the vertical axis, and the gray level of a point corresponds to the value of the ESACF.

Test signals are mixtures of synthetic harmonic tones, noise, and speech signals. Each synthetic tone has the amplitude of the first harmonic equal to 1.0, with the amplitude of the $n$th harmonic set relative to it. The initial phases of the harmonics are 0. The noise that is added in some examples is white Gaussian noise. In this work, we have used speech samples that have been recorded in an anechoic chamber and in normal office conditions. The sampling rate of all the examples is 22 050 Hz.

A. Compression of Magnitude in Transform Domain

In Section II we motivated the use of transform-domain computation of the “generalized autocorrelation” by two considerations: 1) it allows nonlinear compression of the spectral representation and 2) it is computationally more efficient. The following examples concentrate on magnitude compression and suggest that the normal autocorrelation function ($k = 2$) is suboptimal for our periodicity detection model. Exponential compression is easily available by adjusting parameter $k$ in (1). While we only consider exponential compression in this context, other nonlinear functions may be applied as well. A common choice in the speech processing community is to use the natural logarithm, which results in the cepstrum.

Nonlinear compression in the transform domain has been studied in the context of pitch analysis of speech signals [31]. In that study, the periodicity measure was computed using the wideband signal directly without dividing it into channels. It was reported that a specific setting of the compression parameter $k$ gave the best results. It was also shown that pre-whitening of the signal spectrum showed a tendency to improve the pitch analysis accuracy.

Fig. 5 illustrates the effect of magnitude compression on the SACF. The upper four plots depict the SACFs that are obtained with four different values of $k$, whereas the bottom plot shows the ESACF that is obtained from the SACF of the second plot. A Hamming window of 1024 samples (46.4 ms with a sampling frequency of 22 050 Hz) is used. The test signal consists of two synthetic harmonic tones with fundamental frequencies 140.0 and 148.3 Hz and white Gaussian noise with a signal-to-noise ratio (SNR) of 2.2 dB. The fundamental frequencies of the harmonic tones are separated by one semitone. The two tones are identifiable by listening to the signal, although the task is much more involved than with the signal without the additive noise (see the examples on the WWW).

Fig. 5. An example of magnitude compression. Test signal consists of two harmonic tones with added Gaussian white noise (SNR is 2.2 dB). The first four plots from the top illustrate the SACF computed with four different values of $k$. The bottom plot depicts the ESACF corresponding to the second plot.

The example shows that the SACF peaks get broader as the value of $k$ increases. However, the performance with low values of $k$ is compromised by sensitivity to noise, as seen by the number and level of the spurious peaks in the top plot. According to this example, $k = 0.67$ is a good compromise between lag-domain resolution and sensitivity to noise. While we prefer the use of $k = 0.67$, in some computationally critical applications it may be useful to use $k = 0.5$ if optimized routines are available for the square-root operation but not for the cubic-root operation.

It is interesting to compare exponential compression to the cepstrum, where logarithmic compression is used. Cepstral analysis typically results in higher lag-domain resolution than exponential compression with $k = 0.67$, but it may be problematic with signals with low amplitude levels, since the natural logarithm approaches $-\infty$ as the signal amplitude approaches 0 [31].

B. Time-Domain Windowing

The choice of the time-domain window in correlation computation affects the temporal resolution of the method and also sets a lower bound to the lag range. The shape of the window function determines the leakage of energy in the spectral domain. It also determines the effective length of the window, which corresponds to the width of the main lobe of the window in the spectral domain. We have applied a Hamming window in our pitch analysis model, although other tapered windows may be used as well.

Fig. 6. An example of the effect of the window length. Test signal consists of the Finnish vowel /a/ spoken by a male. The pitch of the vowel is falling. The signal is added onto itself after a delay of 91 ms. Plots from top to bottom illustrate the ESACF computed with window length 23.2, 46.4, 92.9, and 185.8 ms, respectively.

For stationary signals, increasing the window length reduces the variance of the spectral representation. However, the sound signals typically encountered in pitch analysis are far from stationary, and a long window is bound to degrade the performance when the pitch is changing. Fig. 6 illustrates this trade-off. Consecutive frames of ESACFs computed on the test signal are plotted in a spectrogram-like fashion in Fig. 6. The darker a point is in the figure, the higher is the value of the ESACF. The test signal consists of the Finnish vowel /a/ spoken by a male. The pitch of the vowel is falling. The vowel signal is added onto itself after a delay of 91 ms. Listening to the signal reveals that the human auditory system is easily able to distinguish two vowels with falling pitches.

The value of the hop size parameter is 10 ms, i.e., consecutive frames are separated in time by 10 ms. The Hamming window lengths of the analysis are, from the top to the bottom, 23.2, 46.4, 92.9, and 185.8 ms. The two plots on the top of Fig. 6 exhibit two distinct pitch trajectories, as desired. However, as the window length is increased, the two trajectories are merged into one broader trajectory, as shown in the two bottom plots. Clearly, the two cases on the top are preferred over the two on the bottom.

When the two top plots of Fig. 6 are compared, it is noticed that the one using a shorter window exhibits more undesired peaks in the ESACF. As expected, the use of a longer window reduces the artifacts that correspond to noise and other un-pitched components. Fig. 6 suggests that a Hamming window with a length of 46.4 ms is a good compromise.

When a tapered time-domain window, such as the Hamming window, is used, the hop size is typically chosen to be less than half of the window length. This ensures that the signal is evenly weighted in the computation. The hop size may be further reduced to obtain finer sampling of the ESACF frames in time, if desired. We have chosen a hop size equal to 10 ms which, from our experiments, seems a good compromise between the displacement of consecutive ESACF frames and computational demands. A similar hop size value is very often used in speech and audio processing applications.

C. Pre-Whitening

The pre-whitening filter that is used before the filterbank removes short-time correlation from the signal, e.g., due to formant resonance ringing in speech signals. In the spectral domain, this corresponds to flattening the spectral envelope. We thus expect the whitening to give better resolution of the peaks in the autocorrelation function. Since whitening flattens the overall spectral envelope of a signal, it may degrade the signal-to-noise ratio of a narrowband signal since the noise outside the signal band is strengthened. However, as the following example illustrates, the whitening does not typically degrade the two-channel pitch estimator performance.

Fig. 7 demonstrates the effect of pre-whitening with the test signal of Fig. 5, i.e., two harmonic tones with fundamental frequencies separated by one semitone and additive white Gaussian noise. The first two plots illustrate the SACF without (top) and with (second) pre-whitening. The peaks of the whitened SACF are better separated than those without whitening. The spurious peaks that are caused by the noise still appear at a relatively low level. This is confirmed by investigation of the corresponding ESACFs in the third and the fourth plot of Fig. 7.

Fig. 7. An example of the effect of pre-whitening. The test signal is the same as in Fig. 5. The first two plots illustrate the SACF without (top) and with (second) pre-whitening. The third and the fourth plot depict the ESACF corresponding to the first and the second plot, respectively.

D. Two-Channel Filterbank

The choice of the filterbank parameters affects the performance of the periodicity detector quite significantly, as demonstrated by the following example. The most important parameters are the filter orders and cut-off frequencies. As discussed in Section II, the two filters are bandpass filters with passbands of 70–1000 Hz for the lower channel and 1000–10 000 Hz for the upper channel. The lowest cut-off frequency at 70 Hz is chosen so that DC and very-low-frequency disturbances are suppressed while the periodicity detection of low-pitched tones is not degraded.

The crossover frequency at 1000 Hz is not a critical parameter; it may vary between 800–2000 Hz (see, e.g., [30], [33]). It is related to the upper limit of fundamental frequencies that may be estimated properly using the method, and naturally it also affects the lag-domain temporal resolution of periodicity analysis. When a tone with a fundamental frequency higher than the crossover frequency is analyzed, the SACF is dominated by the contribution from the high channel. The high-channel compressed autocorrelation is computed after the lowpass filtering at the crossover frequency; thus, the high-channel contribution for fundamental frequencies above the crossover frequency is weak. The method is really a periodicity estimator: it is not capable of properly simulating the spectral pitch, i.e., a pitch that is based on resolved harmonics with fundamental frequency higher than 1000 Hz. Note that the other aforementioned filterbank methods are also periodicity detectors and have their fundamental frequency detection upper limit related to the lowpass filter after the half-wave rectification.

Both filters are of the Butterworth type for maximally flat passbands. The filter order that is used for every transition band is a parameter that has a significant effect on the temporal resolution of the periodicity detector. Fig. 8 shows an example of the effect of channel-separation filtering. The test signal is the same as that in the example of Fig. 6, i.e., a Finnish vowel /a/ with a falling pitch is added to itself after a delay of 91 ms. The plots from top to bottom illustrate the ESACFs when filter orders 1, 2, and 4 (corresponding to a rolloff of 6, 12, and 24 dB/octave) are used for each transition band, respectively. It is observed that the spurious peaks in the ESACF representations are reduced as the filter order is increased. By examination of Fig. 8 we conclude that filter order 2 for each transition band is the best compromise between resolution of the ESACF and the number of spurious peaks in the ESACF.

Fig. 8. An example of the effect of channel-separation filtering. Test signal is the same as in Fig. 6. The plots from top to bottom illustrate the ESACFs when the filter orders 1, 2, and 4 (corresponding to a rolloff of 6, 12, and 24 dB/octave) are used for each transition band, respectively.

V. MODEL PERFORMANCE

The performance of the pitch analysis model is demonstrated with three examples:
1) resolution of harmonic tones with different amplitudes;
2) musical chords that are played with real instruments;
3) ESACF representation of a mixture of two vowels with varying pitches.
The test signals can be downloaded at http://www.acoustics.hut.fi/~ttolonen/pitchAnalysis/.

Fig. 9 shows the first example, where the test signal consists of two synthetic harmonic tones with different amplitude ratios. The fundamental frequencies of the tones are 140.0 and 148.3 Hz. The amplitude ratios in the plots of Fig. 9 are, from the top to the bottom, 0.0, 3.0, 6.0, and 10.0 dB. The tones are clearly separated in the top plot and the separation degrades with increasing amplitude ratio, as expected. This is in agreement with perception of the test signals; the test signal with 0.0 dB amplitude ratio is clearly perceived as two tones, while at 10.0 dB the weaker signal is only barely audible.

Fig. 10 shows an ESACF representation of a mixture of two vowels with varying pitches. The two Finnish /ae/ vowels have been spoken by a male in an anechoic chamber and mixed. The figure shows two distinct trajectories with few spurious peaks.

The example of Fig. 11 illustrates pitch analysis on musical chords. The test signals consist of 2–4 clarinet tones that are mixed to get typical musical chords. The tones that are used for mixing are D3 (146.8 Hz), F♯3 (185.0 Hz), A3 (220.0 Hz), and C4 (261.6 Hz). The plots from the top to the bottom show the ESACF of one analysis frame. The test signals are, respectively, tones D3 and A3; D3 and F♯3; D3, F♯3, and A3; and D3, F♯3, A3, and C4.
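As a rough illustration of how an SACF-like curve and its enhanced version could be computed for a chord frame such as D3 plus A3, the following Python sketch applies the generalized autocorrelation of (1) and the peak pruning of Section III to a single wideband frame. It is a simplified stand-in and not the model of Fig. 2: the WLP pre-whitening and the two-channel split are omitted, the frame is zero-padded to twice its length, the pruning stretches the running result at each factor, and the synthetic tones use five sine harmonics with 1/h amplitudes. All of these choices, as well as the function names `fft`, `generalized_acf`, `enhance`, `hamming`, and `harmonic_tone`, are assumptions made for this sketch.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for i in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * i / n) * odd[i]
        out[i] = even[i] + t
        out[i + n // 2] = even[i] - t
    return out


def generalized_acf(frame, k=0.67):
    """IDFT(|DFT(frame)|^k) for lags 0..len(frame)-1; k = 2 is the ordinary ACF."""
    n = len(frame)
    padded = list(frame) + [0.0] * n  # zero-padding is an assumption of this sketch
    compressed = [abs(c) ** k for c in fft(padded)]
    return [c.real / (2 * n) for c in fft(compressed, inverse=True)[:n]]


def enhance(sacf, max_factor=5):
    """Prune peaks at integer multiples of a lag (ESACF-style enhancement)."""
    result = [max(v, 0.0) for v in sacf]  # clip to positive values
    last = len(result) - 1
    for factor in range(2, max_factor + 1):
        stretched = []
        for lag in range(len(result)):
            pos = lag / factor  # expand the curve in time by `factor`
            lo = int(pos)
            frac = pos - lo
            hi = min(lo + 1, last)
            stretched.append((1.0 - frac) * result[lo] + frac * result[hi])
        # Subtract the stretched curve and clip to positive values again.
        result = [max(a - b, 0.0) for a, b in zip(result, stretched)]
    return result


def hamming(n):
    """Hamming window of n samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def harmonic_tone(f0, fs, n_samples, n_harmonics=5):
    """Sum of sine harmonics with assumed 1/h amplitudes and zero initial phase."""
    return [sum(math.sin(2.0 * math.pi * h * f0 * i / fs) / h
                for h in range(1, n_harmonics + 1))
            for i in range(n_samples)]


if __name__ == "__main__":
    fs, n = 22050.0, 1024
    d3 = harmonic_tone(146.8, fs, n)
    a3 = harmonic_tone(220.0, fs, n)
    frame = [w * (x + y) for w, x, y in zip(hamming(n), d3, a3)]
    esacf = enhance(generalized_acf(frame))
    # Report the two strongest local maxima as candidate pitch periods.
    peaks = [i for i in range(1, n - 1) if esacf[i - 1] < esacf[i] >= esacf[i + 1]]
    for lag in sorted(peaks, key=lambda i: esacf[i], reverse=True)[:2]:
        print(f"lag {1000.0 * lag / fs:.2f} ms  ~ {fs / lag:.1f} Hz")
```

Running the script prints the two strongest candidate lags for the synthetic D3 plus A3 mixture. Setting `k=2.0` in `generalized_acf` gives the ordinary autocorrelation, which makes it easy to compare peak widths for different compression values on the same frame.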
Thisexampleallowsustoinvestigatetheperformanceofthemodelwhentonesareaddedtothemixture.ThelittlearrowsabovetheplotsindicatethetonepeaksintheESACFrepresen-tations.Thefirsttwoplotsshowtheperformancewithonlytwotonespresent.Inbothcases,thetwotonesareshownwithpeaksofequalheight.ThemaximumvalueofthepeakcorrespondingtothetoneD atlagof 7msisalmost30inthetopplotandonlyalittlemorethan20inthesecondplot.Inthethirdplot,threetonesarepresent.Thetonesarethesamethatwereusedforthefirsttwoplots,butnowtheD peakismorepronounced Fig.9.Exampleoftheresolutionofharmonictoneswithvaryingamplitudes.Theamplituderatiosofthetwotonesare,fromthetoptothebottom,0.0,3.0,6.0,and10.0dB. Fig.10.Anexampleofthepitchanalysisoftwovowelswithcrossingpitchtrajectories.ThetestsignalconsistsoftwoFinnishvowels/ae/,onewitharaisingpitchandtheotheronewithafallingpitch.thanthepeakscorrespondingtotonesF ,andA .Finally,inclu-sionoftoneC againaltersthepeakheightsoftheotherpeaks.Thisdependenceofthepeakheightiscausedpartlybythecom-putationoftheESACFrepresentationfromtheSACFrepresen-tation,andpartlysincethetoneshaveharmonicswithcollidingfrequencies.Inallthecases,however,thetonesareclearlydis-tinguishablefromtheESACFrepresentation.VI.SUMMARYANDThemultipitchanalysismodeldescribedabovehasbeende-velopedasacompromisebetweencomputationalefficiencyandauditoryrelevance.Thefirstpropertyisneededtofacilitatead-vancedapplicationsofaudioandspeechsignalanalysis.Com-putationalauditorysceneanalysis(CASA)[18]–[21],structuredandobject-basedcodingofaudiosignals,audiocontentanal-ysis,soundsourceseparation,andseparationofspeechfromseverebackgroundnoiseareamongsuchapplications.Auditoryrelevanceisadvantageousinordertoenablecomparisonofthesystemperformancewithhumanauditoryprocessing.ComputationalefficiencywasshownbytestingthealgorithmofFig.2ona300MHzPowerPCprocessor(AppleMacin-toshG3).ComputationoftheSACFusingWLPpre-whitening,samplerateof22kHz,andframesizeof1024samples(46ms)tooklessthan7.0msperframe.Witha10mshopsize,onlya 
TOLONENANDKARJALAINEN:COMPUTATIONALLYEFFICIENTMULTIPITCHANALYSISMODEL715 Fig.11.Exampleofpitchanalysisofmusicalchordswithclarinettones.TheplotsshowESACF’sofsignalsconsistingoftonesD andA (topplot);D andF (secondplot);D ,andA (thirdplot);andD F ,andC partofprocessor’scapacityisusedinreal-timepitchanalysis.Amultichannelmodelwithcorrelationineachchannelcouldnotbeimplementedasreal-timeanalysisusingcurrentgeneralpurposeprocessors.Althoughtheauditoryanalogyofthemodelisnotverystrong,itshowssomefeaturesthatmakeiteasiertointerpretanalysisresultsfromthepointofviewofhumanpitchperception.Thepre-whiteningoftheinputsignal,oftenconsideredusefulinpitchanalysis,maybeseentohaveminorresemblancetospec-tralcompressionintheauditorynerve.Thedivisionoftheaudiofrequencyrangeintotwosubranges,belowandaboveabout1kHz,isakindofminimumdivisiontorealizedirecttimesynchronyofneuralfiringsatlowfrequen-ciesandenvelope-basedsynchronyathighfrequencies.Notethatthemodelisaperiodicityanalyzerthatdoesnotimplementspectralpitchanalysis,whichisneededespeciallyifthefunda-mentalfrequencyoftheinputsignalexceedsthesynchronylimitfrequencyofabout1kHz.Inthisstudywefocusedonthepitchanalysisoflow-to-midfundamentalfrequenciesanddidnottrytoincludespectralpitchanalysis.Thecomputationoftime-lagcorrelation(generalizedauto-correlation)isdifficulttointerpretfromanauditorypointofviewsinceitiscarriedoutinthefrequencydomain.Theonlyin-terestingauditoryanalogyisthattheexponentforspectralcom-pressionisclosetotheoneusedincomputationoftheloudnessdensityfunction[34].Itwouldbeinterestingtocomparethere-sultwithneuralinterspikeintervalhistogramswhich,however,arelessefficienttocompute.Theenhancedsummaryautocorrelation(ESACF)isaninterestingandcomputationallysimplemeanstoprunetheperiodicityofautocorrelationfunction.Inatypicalcasethisrepresentationhelpsinfindingthefundamentalperiodicitiesofharmoniccomplextonesinamixtureofsuchtones.Itremovesthecommonperiodicitiessuchastheroottoneofmusicalchords.Thisisusefulinsoundsourceseparationofharmonicton
es.Inmusicsignalanalysis,thecomplementofsuchpruning,i.e.,detectionofchordperiodicitiesandrejectionofsingletonepitches,mightaswellbeausefulfeatureforchordanalysis.Aninterestingquestionofpitchanalysisisthetemporalinte-grationthatthehumanauditorysystemshows.Aswithalargesetofotherpsychoacousticfeatures,theformationofpitchper-cepttakes100–200mstoreachitsfullaccuracy.Usingasingleframelengthof46ms,correspondingtoaneffectiveHammingwindowlengthofabout25ms,isacompromiseofpitchtrackingandsensitivitytonoise.AveragingofconsecutiveframescanbeusedtoimprovethestabilityofSACFwithsteady-pitchsignals.Bettertemporalintegrationstrategiesmaybeneededwhenthereisfasterpitchvariation.ThispaperhasdealtwithmultipitchanalysisofanaudiosignalusingtheSACFandESACFrepresentations.Thenextstepinatypicalapplicationwouldbetodetectanddescribepitchobjectsandtheirtrajectoriesintime.Relatedtosuchpitchobjectdetectionistheresolutionoftheanalysiswhenharmonictonesofdifferentamplitudesarefoundinamixturesignal.AsshowninFig.9,separationofpitchobjectsiseasyonlywhenthelevelsoftonesarerelativelysimilar.Aneffectiveandnat-uralchoicetoimprovethepitchanalysiswithvaryingampli-tudesistouseaniterativetechnique,wherebythemostpromi-nentpitchesarefirstdetectedandthecorrespondingharmoniccomplexesarefilteredoutbyFIRcombfilteringtunedtorejectagivenpitch.Toneswithlowamplitudelevelcanthenbeana-lyzedfromtheresidualsignal.Thesameapproachisusefulinmoregeneralsound(source)separationofharmonictones[35].ThepitchlagscanbeusedtogeneratesparseFIR’sforrejecting(orenhancing)specificharmoniccomplexesinagivenmixturesignal.Anexampleofvowelseparationandspectralestimationoftheconstituentvowelsisgivenin[35].Thiscanbeconsideredaskindofmul-tipitchpredictionofharmoniccomplexmixtures.[1]W.M.Hartmann,“Pitch,periodicity,andauditoryorganization,”Acoust.Soc.Amer.,vol.100,pp.3491–3502,Dec.1996.[2]W.Hess,PitchDeterminationofSpeechSignals.Berlin,Germany:Springer-Verlag,1983.[3]W.J.Hess,“Pitchandvoicingdetermination,”inAdvancesinSpeechSignalProcessing,S.Furuia
ndM.M.Sondhi,Eds.NewYork:MarcelDekker,1992,ch.6,pp.3–48.[4]J.A.Moorer,“Onthesegmentationandanalysisofcontinuousmusicalsoundbydigitalcomputer,”Ph.D.dissertation,Dept.Music,StanfordUniv.,Stanford,CA,May1975.[5]R.C.Maher,“Evaluationofamethodforseparatingdigitizedduetsig-J.AudioEng.Soc.,vol.38,pp.956–979,June1990.[6]R.C.MaherandJ.W.Beauchamp,“Fundamentalfrequencyestimationofmusicalsignalsusingatwo-waymismatchprocedure,”J.Acoust.Soc.Amer.,vol.95,pp.2254–2263,Apr.1994.[7]C.Chafe,D.Jaffe,K.Kashima,B.Mont-Reynard,andJ.Smith,“Tech-niquesfornoteidentificationinpolyphonicmusic,”Dept.Music,Stan-fordUniv.,Stanford,CA,Tech.Rep.STAN-M-29,CCRMA,Oct.1985.[8]C.ChafeandD.Jaffe,“Sourceseparationandnoteidentificationinpoly-phonicmusic,”Dept.Music,StanfordUniversity,Stanford,CA,Tech.Rep.STAN-M-36,CCRMA,Apr.1986.[9]M.Weintraub,“Acomputationalmodelforseparatingtwosimultanioustalkers,”Proc.IEEEInt.Conf.Acoustics,Speech,SignalProcessingvol.1,pp.81–84,1986. 716IEEETRANSACTIONSONSPEECHANDAUDIOPROCESSING,VOL.8,NO.6,NOVEMBER2000[10]K.KashinoandH.Tanaka,“Asoundsourceseparationsystemwiththeabilityofautomatictonemodeling,”Int.ComputerMusicConf.,pp.248–255,1993.[11]A.Klapuri,“Numbertheoreticalmeansofresolvingamixtureofsev-eralharmonicsounds,”inProc.SignalProcessingIX:TheoriesAppli-,vol.4,1998,p.2365.[12]K.D.Martin,“Automatictranscriptionofsimplepolyphonicmusic:Ro-bustfrontendprocessing,”Mass.Inst.Technol.,MediaLabPerceptualComputing,Cambridge,Tech.Rep.399,1996. 
[13] K. D. Martin, “A blackboard system for automatic transcription of simple polyphonic music,” Mass. Inst. Technol., Media Lab Perceptual Computing, Cambridge, Tech. Rep. 385, 1996.
[14] D. P. W. Ellis and B. L. Vercoe, “A wavelet-based sinusoid model of sound for auditory signal separation,” in Int. Computer Music Conf., 1991, pp. 86–89.
[15] USA Standard Acoustical Terminology, Amer. Nat. Stand. Inst., S1.1-1960, 1960.
[16] R. Meddis and L. O’Mard, “A unitary model for pitch perception,” J. Acoust. Soc. Amer., vol. 102, pp. 1811–1820, Sept. 1997.
[17] R. Meddis and M. Hewitt, “Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification,” J. Acoust. Soc. Amer., vol. 89, pp. 2866–2882, June 1991.
[18] A. S. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press,
[19] M. P. Cooke, “Modeling auditory processing and organization,” Ph.D. dissertation, Univ. Sheffield, Sheffield, U.K., 1991.
[20] G. J. Brown, “Computational auditory scene analysis: A representational approach,” Ph.D. dissertation, Univ. Sheffield, Sheffield, U.K., 1992.
[21] D. P. W. Ellis, “Prediction-driven computational auditory scene analysis,” Ph.D. dissertation, Mass. Inst. Technol., Cambridge, June 1996.
[22] R. D. Patterson, “The sound of the sinusoid: Spectral models,” J. Acoust. Soc. Amer., vol. 96, pp. 1409–1418, Sept. 1994.
[23] B. C. J. Moore, R. W. Peters, and B. R. Glasberg, “Auditory filter shapes at low center frequencies,” J. Acoust. Soc. Amer., vol. 88, pp. 132–140, July 1990.
[24] M. Slaney and R. F. Lyon, “A perceptual pitch detector,” Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 1, pp. 357–360, 1990.
[25] R. Carlyon and T. M. Shackleton, “Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch J. Acoust. Soc. Amer., vol. 95, pp. 3541–3554, June 1994.
[26] R. P. Carlyon, “Comments on ‘a unitary model of pitch perception’,” J. Acoust. Soc. Amer., vol. 102, pp. 1118–1121, Aug. 1997.
[27] B. R. Glasberg and B. C. J. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hear. Res., vol. 47, pp. 103–138, 1990.
[28] U. K. Laine, M. Karjalainen, and T. Altosaar, “Warped linear prediction (WLP) in speech and audio processing,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. III.349–III.352, 1994.
[29] E. D. Young and M. B. Sachs, “Representation of steady-state vowels in the temporal aspects of the discharge patterns of populations of auditory-nerve fibers,” J. Acoust. Soc. Amer., vol. 66, pp. 1381–1403, Nov. 1979.
[30] S. Seneff, “A joint synchrony/mean-rate model of auditory speech processing,” J. Phonetics, vol. 16, pp. 55–76, 1988.
[31] H. Indefrey, W. Hess, and G. Seeser, “Design and evaluation of double-transform pitch determination algorithms with nonlinear distortion in the frequency domain: Preliminary results,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1985, pp. 11.11.1–11.11.4.
[32] R. D. Patterson, M. H. Allerhand, and C. Giguère, “Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform,” J. Acoust. Soc. Amer., vol. 98, pp. 1890–1894, Oct.
[33] T. Dau, B. Kollmeier, and A. Kohlrausch, “Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers,” J. Acoust. Soc. Amer., vol. 102, pp. 2898–2905, Nov. 1997.
[34] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Berlin, Germany: Springer-Verlag, 1990.
[35] M. Karjalainen and T. Tolonen, “Multi-pitch and periodicity analysis model for sound separation and auditory scene analysis,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, 1999, pp.

Tero Tolonen (S’98) was born in Oulu, Finland, in 1972. He majored in acoustics and audio signal processing and received the M.Sc.(Tech.) and Lic.Sc.(Tech.) degrees in electrical engineering from the Helsinki University of Technology (HUT), Espoo, Finland, in January 1998 and December 1999, respectively. He is currently pursuing a postgraduate degree. He has been with the HUT Laboratory of Acoustics and Audio Signal Processing since 1996. His research interests include model-based audio representation and coding, physical modeling of musical instruments, and digital audio signal processing. Mr. Tolonen is a student member of the IEEE Signal Processing Society and the Audio Engineering Society.

Matti Karjalainen (M’84) was born in Hankasalmi, Finland, in 1946. He received the M.Sc. and Dr.Tech. degrees in electrical engineering from the Tampere University of Technology, Tampere, Finland, in 1970 and 1978, respectively. His doctoral dissertation dealt with speech synthesis by rule in Finnish. From 1980 to 1986, he was Associate Professor and, since 1986, Full Professor of acoustics with the Faculty of Electrical Engineering, Helsinki University of Technology, Espoo, Finland. His research activities cover speech synthesis, analysis, and recognition, auditory modeling and spatial hearing, DSP hardware, software, and programming environments, as well as various branches of acoustics, including musical acoustics and modeling of musical instruments. Dr. Karjalainen is a fellow of the AES and a member of ASA, EAA, ICMA, ESCA, and several Finnish scientific and engineering societies. He was the General Chair of the 1999 IEEE Workshop on Applications of Audio and Acoustics, New Paltz, NY.