IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING VOL

Uploaded On 2015-02-21


17 NO 7 SEPTEMBER 2009. Babble Noise: Modeling, Analysis, and Applications. Nitish Krishnamurthy, Student Member, IEEE, and John H. L. Hansen, Fellow, IEEE. Abstract: Speech babble is one of the most challenging noise interferences for all speech systems. Here, a

Presentation Transcript

Authorized licensed use limited to: Univ of Texas at Dallas. Downloaded on April 15, 2010 at 04:23:14 UTC from IEEE Xplore. Restrictions apply.
KRISHNAMURTHY AND HANSEN: BABBLE NOISE: MODELING, ANALYSIS, AND APPLICATIONS

[Table I. Perceptual classification of babble (BAB) versus large-crowd (LCR) noise.]

be identified. The regions between four to six speakers in babble are the most confusable babble types; this is the transition region from babble to large crowd noise. As the number of speakers increases, the probability of observing individual words reduces. In the large crowd scenario (LCR), individual speech or speaker information cannot be identified. In this case, LCR-Babble consists of speaker rumble where no specific information can be obtained (e.g., speaker count, conversation, individual words, etc.). As the number of subjects in babble increases, the time-varying nature of the babble reduces. The change in properties of babble with an increase in the number of speakers is studied in the following sections.

III. BABBLE AND ACOUSTIC VOLUME

In the previous section, babble is modeled as audio streams from individual speakers. This section studies the impact of the overlap of phone sequences on the resulting acoustic space. As the number of overlapping phones increases within a given babble utterance, the differences between individual frames are averaged and their identity becomes blurred. This removes the ability to distinguish individual phones in a babble utterance. Fig. 3 demonstrates this aspect using the reduction in Itakura–Saito (IS) [13] distance. This distance reduces between waveforms when the number of distinct phones superimposed increases. The symmetric IS measure is defined as

[Equation (4)]

where the two arguments are the all-pole model parameters from the gain-normalized spectra of the two waveforms to be compared, and the one-sided term is the IS distance given by

[Equation (5)]

The experiments are conducted using synthetic phones from the same speakers generated by the Festival Speech Synthesizer system. The phones generated are @, A, Y, U, i, where phones are represented using the Single-Symbol ARPAbet version [14, pg. 117]. These phones are generated with a 16-kHz sample rate for 12 ms. These waveforms are modeled using 12th-order linear prediction coefficients (LPCs) [15]. Fig. 3 illustrates the frequency response of the LP models as the number of overlapping phones is increased. Two observations can be made from this experiment. First, as the number of overlapping phones increases, the ability to distinguish between the phoneme spectra decreases, implying that the resulting sounds are similar. This observation is also reflected in the IS measures between waveforms. Second, the resonating poles in the LP spectra are less clearly resolved as the number of speakers in babble increases. As the number of speakers increases, the spectrum of babble converges to an aggregate model of the vocal tract configuration across different phones.

Fig. 4 shows the mean IS distances and the variance in those distances as a function of the number of overlapping phonemes. As the number of phonemes k increases, various combinations of five phones are chosen and superimposed. This process can be extrapolated to $k \to \infty$: with an infinite number of phonemes overlapping, the resulting spectrum approximates speech-shaped noise. There is a monotonic decrease in the mean and variance of the distances between the averaged phones as the number of phones in the babble utterance increases. This suggests that with an increase in the number of speakers in babble, the noise becomes localized in the acoustic space. Here, the acoustic space is characterized by the LP coefficients.

This observation can also be extended to general acoustic spaces. Let the acoustic space of the given data be described by a set of vectors [...]. It is assumed that the centroids of the vector-quantized acoustic features sufficiently describe the acoustic space. It is noted that most speech systems are based on some form of classification, for which a prerequisite step is quantization of the available acoustic space. For any acoustic space, the farther apart the entities to be classified, the better the classification accuracy. A multidimensional cube is used to model the acoustic space enclosed by these centroids. Fig. 5 describes the construction of this space in two dimensions. The vertices of this figure are given by the per-dimension minima and maxima of the centroids [...]. In the general multidimensional space, the hyper-cuboid has one vertex for every combination of per-dimension minimum and maximum, and the cuboid space is totally characterized by this set of extreme points [...]. Here, the maxima and minima are evaluated for each dimension separately across all centroids. The entire acoustic space of the data is enclosed within a volume bounded by these extreme points. Since the space is modeled using a cuboid, all the centroids are either on the edges or within the volume enclosed by the cuboid. The volume of this cuboid is measured, and this volume will be an indicator of the acoustic variation of the data. The volume of this enclosed multidimensional cuboid is given by the product of the lengths of its adjacent edges,

[Equation (8)]

where each edge length is the difference between the maximum and the minimum in the corresponding dimension. Here, it is noted that a large acoustic volume implies an expansive acoustic variation in the data. Conversely, a small

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 7, SEPTEMBER 2009

Babble Noise: Modeling, Analysis, and Applications

Nitish Krishnamurthy, Student Member, IEEE, and John H. L. Hansen, Fellow, IEEE

Abstract: Speech babble is one of the most challenging noise interferences for all speech systems. Here, a systematic approach to model its underlying structure is proposed to further the existing knowledge of speech processing in noisy environments. This paper establishes a working foundation for the analysis and modeling of babble speech. We first address the underlying model for multiple speaker babble speech, considering the number of con-

[Table IV. Performance mismatch under matched and mismatched conditions.]

[Table V. Performance of the matched and mismatched conditions; noise is 10-dB babble.]
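As an illustration of the hyper-cuboid volume of Section III, the following is a minimal sketch, assuming the centroids are available as rows of a NumPy array. The function name `acoustic_volume` and the example values are hypothetical and are not taken from the paper; the computation is simply the product of the per-dimension extents (maximum minus minimum across all centroids).

```python
import numpy as np


def acoustic_volume(centroids):
    """Volume of the axis-aligned hyper-cuboid enclosing a set of centroids.

    `centroids` is an (n_centroids, n_dims) array-like. Each edge length is
    the per-dimension maximum minus the per-dimension minimum, evaluated
    separately for every dimension across all centroids.
    """
    c = np.asarray(centroids, dtype=float)
    if c.ndim != 2 or c.shape[0] == 0:
        raise ValueError("expected a non-empty (n_centroids, n_dims) array")
    edges = c.max(axis=0) - c.min(axis=0)  # one edge length per dimension
    return float(np.prod(edges))


# Hypothetical two-dimensional example with four centroids.
spread = [[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [0.5, 0.5]]
compact = 0.5 * np.asarray(spread)  # same layout, centroids pulled together

print(acoustic_volume(spread))   # edges 2 and 3
print(acoustic_volume(compact))  # edges 1 and 1.5
```

For high-dimensional features the product of many edge lengths can underflow or overflow, so summing the logarithms of the edge lengths may be a more practical variant; that choice is a suggestion here, not something stated in the source.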
X. FUTURE WORK AND IMPROVEMENTS

This study has considered the problem of analysis, modeling, and detection of characteristics of babble speech, known to be the most challenging noise interference in speech systems today. There are significant differences between babble collected from real speaker scenarios and babble constructed by adding individual speaker streams of data together. The differences arise from the different data acquisition channels: data collected from individual speakers, or from conversations, is recorded with close-talking microphones. In contrast, when babble is collected in natural settings (for example, in a meeting room scenario), a far-field microphone is used. This leads to significant differences in channel conditions. The impact of the language of babble on different speech systems, and the ability to detect the particular languages of the babble, are currently under study. Finally, the impact of group stress/emotion on babble and its impact on speech systems is an interesting field for further investigation.

XI. CONCLUSION

In this paper, a framework to characterize babble noise is proposed. Babble is known to be the most challenging distortion in speech systems, due to its speaker/speech-like characteristics. There are differences in the number-of-speakers-per-frame pdfs when babble noise is modeled as a sum of conversations as opposed to adding individual streams of speakers. One of the main factors impacting the nature of babble is the number of speakers in babble noise. An algorithm was proposed to detect the number of speakers in a given instance of babble. The algorithm was evaluated on simulated conversations from the Switchboard corpus. Detection performance of over 80% accuracy is obtained in detecting speaker count to within [...] the number of conversations, given that each conversation is assumed to consist of two speakers. The performance is encouraging, given the significant challenge in characterizing babble speech. It is believed that this represents one of the first studies to specifically address the underlying structure of babble noise.

This finding from characterization of babble opens up possibilities for future work and also impacts existing applications. Babble can itself be used as a source of information (language ID, gender ratio, group emotion characteristics, etc.). In our data collection, we have found different babble characteristics when the previous parameters have changed. This information can be of value in and of itself for the purposes of environment forensics. Alternatively, this information can be used to supplement speech systems in order to maintain performance in the most challenging of noise types. Here, the impact of babble noise on speaker verification has been studied, where babble speaker count detection was shown to help overall performance. It has been shown that proper selection of in-set speaker plus babble noise models can improve the performance of in-set/out-of-set speaker verification by 24% compared to choosing a generic babble model. One drawback of the current setup is that it requires sufficient data for characterizing the number of speakers. Second, the work has been primarily focused on modeling babble in terms of the number of speakers; such modeling suffices for speaker identification systems, but for speech recognition additional information such as language information of the background is required. This is important because English speech recognition in English babble would be more challenging than English speech recognition in background babble consisting of a foreign language. It is suggested that these initial findings will open a scope of innovation and applications in the study of babble for speech and language technology.

REFERENCES

[1] Y. Gong, "Speech recognition in noisy environments: A survey," Speech Commun., vol. 16, pp. 261–291, Apr. 1995.
[2] R. C. Rose, E. M. Hofstetter, and D. A. Reynolds, "Integrated models of signal and background with application to speaker identification in noise," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 245–257, Apr. 1994.
[3] J. H. L. Hansen and D. Cairns, "ICARUS: A source generator based real time system for speech recognition in noise, stress, and Lombard effect," Speech Commun., vol. 16, no. 4, pp. 391–422, Jul. 1995.
[4] J. H. L. Hansen and V. Varadarajan, "Analysis and normalization of Lombard speech under different types and levels of noise with application to in-set speaker id systems," IEEE Trans. Audio, Speech, Lang. Process., to be published.
[5] M. Akbacak and J. H. L. Hansen, "Environmental Sniffing: Noise knowledge estimation for robust speech systems," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 465–477, Feb. 2007.
[6] A. Varga and R. Moore, "Hidden Markov model decomposition of speech and noise," in Proc. ICASSP, 1990, pp. 845–848.
[7] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Amer., vol. 119, no. 3, pp. 1562–1573, Mar. 2006.
[8] G. Li and M. E. Lutman, "Sparseness and speech perception in noise," in Proc. Interspeech, Pittsburgh, PA, 2006, pp. 1466–1469.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC, 2007.
[10] N. Morales, L. Gu, and Y. Gao, "Adding noise to improve noise robustness in speech recognition," in Proc. Interspeech, Antwerp, Belgium, 2007, pp. 930–933.
[11] P. D. O'Grady, A. B. Pearlmutter, and S. T. Rickard, "Survey of sparse and non-sparse methods in source separation," Int. J. Imag. Syst. Technol., vol. 15, pp. 18–33, Aug. 2005.
[12] J. H. L. Hansen, "Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition," Speech Commun., vol. 20, no. 1, pp. 151–173, Nov. 1996.
[13] R. Gray, A. Buzo, A. Gray, and Y. Matsuyama, "Distortion measures for speech processing," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 367–376, Aug. 1980.
[14] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. New York: Wiley, 1999.
[15] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[16] V. Prakash and J. H. L. Hansen, "In-set/out-of-set speaker recognition under sparse enrollment," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7, pp. 2044–2052, Sep. 2007.
[17] H. Gish and K. Ng, "Parametric trajectory models for speech recognition," in Proc. Interspeech, Philadelphia, PA, 1996, vol. 1, pp. 466–469.
[18] Y. Gong, "Stochastic trajectory modeling and sentence searching for continuous speech recognition," IEEE Trans. Speech Audio Process., vol. 5, no. 1, pp. 33–34, Jan. 1997.
[19] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, Jan. 2000.

Nitish Krishnamurthy received the B.E. degree in instrumentation and control engineering from the University of Pune, Pune, India, in 2004 and the M.S. degree in electrical engineering from the University of Texas at Dallas, Richardson, in 2007. He is currently pursuing the Ph.D. degree at the Center for Robust Speech Systems, University of Texas at Dallas. He has been a Research Intern at Texas Instruments in the area of speech and language systems, during the summers of 2007 and 2008. His research focuses on acoustic noise characterization for speech systems. His research interests also include embedded speech-to-speech translation and speech recognition systems.
John H. L. Hansen (S'81–M'82–SM'93–F'07) received the B.S.E.E. degree from the College of Engineering, Rutgers University, New Brunswick, NJ, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1983 and 1988, respectively. He joined the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas (UTD), Richardson, in the fall of 2005, where he is a Professor and Department Head of Electrical Engineering, and holds the Distinguished University Chair in Telecommunications Engineering. He also holds a joint appointment as a Professor in the School of Brain and Behavioral Sciences (Speech and Hearing). At UTD, he established the Center for Robust Speech Systems (CRSS), which is part of the Human Language Technology Research Institute. Previously, he served as Department Chairman and Professor in the Department of Speech, Language, and Hearing Sciences (SLHS), and Professor in the Department of Electrical and Computer Engineering, at the University of Colorado Boulder (1998–2005), where he cofounded the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory (RSPL) and continues to direct research activities in CRSS at UTD. His research interests span the areas of digital speech processing, analysis and modeling of speech and speaker traits, speech enhancement, feature estimation in noise, robust speech recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human–computer interaction. He has supervised 43 (18 Ph.D., 25 M.S.) thesis candidates, is author/coauthor of 294 journal and conference papers in the field of speech processing and communications, coauthor of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), coeditor of DSP for In-Vehicle and Mobile Systems (Springer, 2004), Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2006), In-Vehicle Corpus and Signal Processing for Driver Behavior Modeling (Springer, 2008), and lead author of the report "The impact of speech under 'stress' on military speech technology" (NATO RTO-TR-10, 2000).

Prof. Hansen was the recipient of the 2005 University of Colorado Teacher Recognition Award as voted by the student body. He also organized and served as General Chair for ICSLP/Interspeech-2002: International Conference on Spoken Language Processing, September 16–20, 2002, and is serving as Technical Program Chair for IEEE ICASSP-2010, Dallas, TX. In 2007, he was named IEEE Fellow for contributions in "Robust Speech Recognition in Stress and Noise," and is currently serving as Member of the IEEE Signal Processing Society Speech Technical Committee and Educational Technical Committee. Previously, he served as Technical Advisor to U.S. Delegate for NATO (IST/TG-01), IEEE Signal Processing Society Distinguished Lecturer (2005–2006), Associate Editor for IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992–1999), Associate Editor for the IEEE SIGNAL PROCESSING LETTERS (1998–2000), and Editorial Board Member for the IEEE Signal Processing Magazine (2001–2003). He has also served as a Guest Editor of the October 1994 special issue on Robust Speech Recognition for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He has served on the Speech Communications Technical Committee for the Acoustical Society of America (2000–2003), and is serving as a member of the ISCA (International Speech Communications Association) Advisory Council.
Fig. 11. As the number of speakers in babble increases, nhops increases due to a decrease in frame contiguity.

in the UBM. The average number of hops per frame is defined as

nhops = (total number of hops for the utterance) / (number of frames in the utterance).

The value of mean hops is between 0 and 1. If the value of nhops is 1, it implies that the average residence time for a frame in a Gaussian is 1 frame, which corresponds to every consecutive frame being associated with a different Gaussian. When nhops is 0.5, a single hop between Gaussians occurs every two frames. Fig. 11 shows the relation between the number of hops (mean nhops) and an increase in the number of speakers in a babble instance of duration 1 min. The average residency monotonically decreases (i.e., hops increase) with an increasing number of speakers in the babble. The relative change in mean hops is larger for a smaller number of speakers (one to two speakers) than when more subjects are in babble, where the babble is less time varying and nhops becomes constant. When the number of speakers approaches infinity, it is expected that the number of hops will reduce, as babble will tend to be stationary.

In the previous section, two aspects of babble were analyzed: the first is the shrinkage of the acoustic space as the number of speakers increases, and the second is the increased chaotic structure in babble with an increase in the number of participating subjects. It is important to note the different time-domain analysis frame lengths chosen for the two experiments. These two aspects of babble are complementary: a decrease in acoustic volume indicates that with an increasing number of speakers in babble, the babble is less time varying in the long term, but for shorter segments the chaotic nature of babble increases. It should be noted that for analysis of the acoustic space, large frames of duration 125 ms are chosen, versus the 20-ms frames chosen to assess durational continuity. Another observation from the second experiment is that UBMs constructed from speech utterances of individual speakers do not necessarily model the exact time-varying nature of babble.

Footnote: Here, we assume that each Gaussian in the GMM corresponds to a unique phoneme. As the number of speakers in the UBM increases, it is possible that more than one pdf will be used to represent the shoulder of a phoneme distribution.

Next, a system to detect the number of speakers is proposed based on the observation that the acoustic volume becomes concentrated as the number of speakers in babble increases. As observed in Section II, the number of speakers at any given time is approximately the number of conversations. Fig. 1 describes the construction of a two-conversation babble audio stream. As shown in the figure, each stream consists of data from a single conversation. The babble stream from two conversations is constructed by overlapping individual conversations from Switchboard. In a babble data stream, the identity of the individual speakers is lost. Fig. 12 shows the histograms of frame count for a fixed number of speakers for two, four, six, and nine conversations. These histograms are oracle histograms constructed from the transcripts of the Switchboard corpus. Switchboard is a corpus of over 240 h of spontaneous telephone speech. It contains both A and B sides of each telephone conversation, making it suitable for simulating babble conversations. Under the assumption that each conversation has only one speaker speaking at any point in time, the average number of speakers detected is equal to the number of conversations. The pdfs for the number of speakers speaking per frame in babble are shown in Fig. 12.

From the model for babble described in Section II, the number of conversations reflects the number of instantaneous speakers in babble; under the assumption that there are two subjects participating in any single conversation, the total number of speakers would be twice the number of conversations. If the number of speakers speaking at a given instance is close to the number of conversations, detecting the number of speakers in babble requires a known relationship between the number of conversations in the acoustic environment and the number of speakers. The number of speakers speaking at a time is a function of the following variables:

• the topic of conversation;
• how the speakers provide input in the conversations (e.g., some speakers are active and contribute, while others are passive and spend more time listening).

Depending on the individual nature of each conversation, the resulting babble will take on many forms. As illustrated in Fig. 13, a two-stage detector for detecting the number of speakers at a given time is proposed. The first-stage detector estimates the number of speakers for each frame. A speaker-number histogram is then generated from the per-frame estimates over the data stream. This histogram is expected to have considerable fluctuation, since the number of active speakers can vary from zero to the total number of participants in all conversations. In the second stage, the histogram is considered as a feature, with its dimension [...] being a function of the maximum number of conversations to be detected (the maximum value of [...] is restricted by acoustic variability). Next, the histogram is normalized using the total number of frames in the data stream. This feature is seen to be highly correlated for a babble sequence. Finally, a discrete cosine transform (DCT) is applied and the first ten dimensions are employed for classification in order to reduce the dimensional correlation as well as the data dimensionality.

Fig. 12. PDFs for the number of speakers per frame for babble constructed using two, four, six, and nine conversations; with an increase in the number of conversations, the distribution has a larger variance.

Fig. 13. Flow diagram for detecting the number of speakers speaking at a time.

V. DETECTION OF NUMBER OF SPEAKERS

A system is proposed for a closed set, where the maximum number of speakers speaking at a time is fixed to a number [...]. The detection scheme is a two-stage scheme, where the preliminary detector decides the number of speakers on a per-frame basis, and the second stage decides the number of conversations in an utterance. The second-stage detector uses the per-frame decisions from the preliminary detector.
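The two frame-level quantities described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mean_hops`, `dct2`, `speaker_count_feature`) are hypothetical, the per-frame inputs are assumed to be already available as integer sequences, and an unnormalized DCT-II is assumed because the source does not say which DCT variant was used.

```python
import numpy as np


def mean_hops(gaussian_ids):
    """Hops per frame: label changes between consecutive frames / number of frames.

    `gaussian_ids` holds, for each frame, the index of the UBM Gaussian the
    frame is associated with (assumed to be given).
    """
    ids = np.asarray(gaussian_ids)
    if ids.size == 0:
        raise ValueError("at least one frame is required")
    hops = np.count_nonzero(ids[1:] != ids[:-1])
    return hops / ids.size


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * (m + 0.5) * k / n) @ x


def speaker_count_feature(frame_counts, max_speakers, n_coeffs=10):
    """Second-stage feature from first-stage per-frame speaker counts.

    Builds the histogram over 0..max_speakers, normalizes it by the total
    number of frames, applies a DCT, and keeps the first `n_coeffs` values
    (fewer if the histogram itself is shorter than `n_coeffs`).
    """
    counts = np.asarray(frame_counts, dtype=int)
    if counts.size == 0:
        raise ValueError("at least one frame is required")
    if counts.min() < 0 or counts.max() > max_speakers:
        raise ValueError("frame counts must lie in [0, max_speakers]")
    hist = np.bincount(counts, minlength=max_speakers + 1).astype(float)
    hist /= counts.size  # normalize by the number of frames
    return dct2(hist)[:n_coeffs]


# Hypothetical per-frame decisions for a short stream.
feature = speaker_count_feature([1, 2, 2, 3, 2, 1, 2, 0], max_speakers=18)
print(feature.shape)
print(mean_hops([0, 0, 1, 1, 2, 2, 3, 3]))
```

With the unnormalized transform, the zeroth coefficient equals the sum of the normalized histogram and is therefore always 1; a classifier built on this feature could drop it, although the source keeps the first ten dimensions without further comment.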
Fig. 15. Schematic for using the number of speakers for maintaining performance for in-set/out-of-set speaker verification.

[Table III: baseline speaker verification performance.]

function of the number of speakers speaking at a time. The next section evaluates the performance of the speaker verification system where detected babble noise information is incorporated within the system.

IX. RESULTS

The performance mismatch was evaluated for babble noise where the speaker count varied from 1–10 speakers. The UBM was trained using 60 male speakers from the TIMIT corpus, which are separate from the in-set/out-of-set speakers. Here, results are presented for speech degraded at 10 dB, though results are similar for different SNRs (e.g., 5–15 dB). Training data for each in-set speaker is about 5 s. Table III shows the baseline performance of the in-set speaker verification system without the introduction of babble distortion. The average performance of the speaker verification system under different clean conditions is 9.25% EER. For speech corrupted by babble noise, with the number of speakers in babble varying from one to nine at 10-dB SNR under matched test/train conditions, the performance drops to 27.94% EER. Test conditions are considered to be matched when the speaker count in babble is within a babble speaker count window of [...] of the actual speaker count. Mismatch is present when models are chosen outside of this [...] babble speaker window. Performance mismatch for each speaker number condition is evaluated using the relation

[equation]

This mismatch is the average performance mismatch between the exact EER and the EER when a different model is chosen as the target model. Table IV shows the average performance mismatch under matched and mismatched conditions for the task. As observed, matched cases always outperform the mismatched condition. Also, performance with a reduced number of subjects in the degrading babble is better than performance when a model with a larger number of speakers in babble is used. The average performance mismatch across all conditions when matched models are chosen is shown in Table V. The EER performance loss under matched conditions ([...] speaker difference in selected babble noise model) is [...]% as compared to an average [...]% EER loss when models are chosen outside this [...] window size. This corresponds to an average 23% relative improvement in the EERs across all conditions by choosing the appropriate set of in-set/out-of-set speakers plus babble noise models. Therefore, employing babble noise model detection helps maintain overall speaker ID performance.

Another observation is that it is better to choose speaker models with a reduced number of speakers in the babble. This can be attributed to the difference in background speakers aiding the separation in the speaker ID system. With an increase in the number of speakers, the test and training instances of babble are not as distinguishable, and this reduces the contribution of the background to the speaker separation. The babble model detector influence is more important as the number of speakers in babble

Fig. 5. Illustration of the acoustic area/volume occupied by a GMM of four [...]

Fig. 6. Inter-centroidal distance between centroids of a four-mixture GMM.

Fig. 7. As the number of participating speakers in babble increases, the volume enclosed by their GMM centroids reduces.

Fig. 8. Skewness in the pdfs of the inter-centroidal distance increases as the number of speakers in babble increases, showing the nonuniform spread of data in acoustic space.

viewed as a layering of phonemes and silence periods from individual subjects. The acoustic trajectory of speech from a single subject is expected to be smooth for a majority of the portions, since the inertial nature of the physiology of speech production would not allow for frequent abrupt movement in the acoustic
space. Trajectory models of speech capitalize on this phenomenon. Gish [17] considered this acoustic trajectory as movement in the feature space (the trajectory is modeled as a polynomial to fit features in a window: parametric trajectory modeling) and Gong [18] considered this as movement within the states of an HMM (this is done by assigning Viterbi paths within HMMs: stochastic trajectory modeling). If we consider the acoustic trajectory of babble, abrupt and uneven trajectories are expected, in contrast with natural speech from a single speaker. It is suggested that this is due to the layering of individual speech trajectories, resulting in conflicting articulatory responses from simultaneous speakers. A direct consequence of the trajectory being

Fig. 9. Top: centroids cluster closer as the number of speakers in babble increases. Bottom: figure illustrating the resulting compactness of the acoustic space with an increase in the number of speakers in babble.

number of subjects speaking simultaneously will result in an increase in the acoustic variability. The number of conversations in the given environment influences the number of possible speakers speaking at any instant of time. In addition to the number of speakers, the emotion/stress levels in the individual conversations [12] play a role in the spectral structure of the individual data streams. The language of the individual conversations will also contribute to the structure of the individual conversations. The acoustics of the environment play a role in deciding if the individual sources contribute additively, or if there is a convolution/reverberation effect in the babble noise. The placement of the microphone relative to the individual conversations establishes the dominance of individual speakers in the recorded data. Another factor influencing babble noise in an environment is the timing/turn-taking nature of speakers within each conversation group. This will depend on the conversation topics and the number of individual speakers who contribute to each conversation. Within each conversation, the dominance of individual speakers will affect the nature of babble from a given environment. Given these factors, it can be deduced that how well real babble data is approximated by adding individual sentences depends upon the speech application and the specific kind of babble environment. Here, we focus on babble as a sum of conversations.

B. Analysis of Turn-Taking Within a Conversation

In this section, a model of babble as a sum of conversations is proposed. Here, the pdf of the speech from a person A is [...], and it is assumed that speech streams are statistically independent and identically distributed. With this, the joint pdf of [...] streams is given by

[Equation (1)]

where [...] is the characteristic equation of the individual pdfs. Alternatively, if babble is modeled as a sum of [...] conversations, then the pdf of the speech stream output of a conversation is given by [...]. This can be written as [...], since the speech from the speakers will be correlated. If babble is modeled as a sum of [...] conversations, assuming conversations to be independent, then the joint characteristic equation of [...] conversations is given by [...].

If each conversation is restricted to be between two people, the conversation output can be modeled as a sequence of 0, 1, and 2. Here, 0 denotes silence, 1 denotes one subject talking, and 2 denotes both subjects talking. This decision is made on a frame-by-frame basis. Such a scheme lends itself to Markov modeling, where each state is a conversation mode. In a conversation involving two subjects, it is expected that a single person talks most of the time, with silence between turn-taking and occasional short instances where both speakers speak simultaneously. The situation where both produce speech simultaneously occurs when there is a short pause between turn-taking and the frame overlaps the end of one speaker and the start of another. A separate case occurs when both are laughing, or agreeing, or if there is back-channel feedback, etc. If [...] are the probabilities of observing 0, 1, and 2, then intuitively, [...]. If we model babble as a sum of [...] conversations, then [...] states are possible (0 speakers to [...] speaking per frame), with the probabilities of each state individually given by

[equations]

where [...]. For a two-speaker case, it can be seen that unless [...], [...] will be the most probable event when two streams are combined. This situation can be extended to [...] conversations, where [...] is the most probable event. This observation is used to detect the number of speakers in babble conditions.

C. Analysis of Babble as a Function of Number of Speakers

For analysis of speech babble, babble is studied as three separate acoustic cases. Here, babble is categorized based on the number of speakers speaking instantaneously, with the following three classes.

• Competing speaker ([...]): having only two subjects talking simultaneously.
• Babble (BAB): in this condition individual speakers can be heard and, at times, individual words can also be heard.
• Large crowd (LCR): sounds like a diffused background rumble, where individual conversations or speakers are not [...]

The boundaries between BAB and LCR are fluid, and the category of babble noise is decided depending on various factors such as the relative distance of the conversations from the microphone. To obtain an estimate of the boundaries between BAB and LCR, a perceptual experiment was carried out. Here, each subject was given the definitions of BAB and LCR and asked to classify 18 samples of babble as BAB or LCR. In the sound samples, the number of speakers in babble was varied from two to ten. Three instances of sounds for each speaker count were generated. A total of 12 subjects were a part of the experiment. The results are shown in Table I. Each of the above-mentioned babble scenarios has its unique features. In the babble scenario (BAB), individual speakers are generally not discernible, but occasionally individual words along with the speaker can

speakers. Finally, the problem of in-set/out-of-set speaker recognition is considered in the context of interfering babble speech noise. Results are shown for test durations from 2–8 s, with babble speaker groups ranging from two to nine subjects. It is shown that by choosing the correct number
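The conversation-state model of Section II-B can be illustrated numerically. The sketch below assumes, as in the text, that each two-person conversation contributes 0, 1, or 2 active speakers per frame and that conversations are independent, so the distribution of the total number of simultaneous speakers is the repeated convolution of the per-conversation distribution. The function name `speakers_per_frame_pmf` and the probability values are hypothetical, chosen only to reflect the stated intuition that one talker per conversation is the most common state.

```python
import numpy as np


def speakers_per_frame_pmf(p_conv, n_conversations):
    """PMF of the number of simultaneous speakers in a sum of conversations.

    `p_conv` = (p0, p1, p2): probabilities that a single two-person
    conversation has 0, 1, or 2 active speakers in a frame. Conversations
    are assumed independent, so the PMF of the sum is the
    n_conversations-fold convolution of `p_conv` with itself.
    The result has 2 * n_conversations + 1 entries (0 to 2N speakers).
    """
    p = np.asarray(p_conv, dtype=float)
    if p.shape != (3,) or np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p_conv must be three non-negative values summing to 1")
    if n_conversations < 1:
        raise ValueError("n_conversations must be at least 1")
    pmf = np.array([1.0])  # zero conversations: zero speakers with certainty
    for _ in range(n_conversations):
        pmf = np.convolve(pmf, p)
    return pmf


# Hypothetical per-conversation probabilities: silence, one talker, overlap.
p = (0.2, 0.7, 0.1)
pmf = speakers_per_frame_pmf(p, n_conversations=2)
print(pmf)                  # probabilities for 0..4 simultaneous speakers
print(int(np.argmax(pmf)))  # most probable number of simultaneous speakers
```

For these illustrative values the most probable state with two conversations is two simultaneous speakers, one per conversation, which matches the qualitative claim in the text; other choices of the three probabilities can shift the mode.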
Fig. 1. Babble noise: the left block shows five streams (five speakers) overlapped, and the right block shows the case where two conversations (two speakers and three speakers) are overlapped.

babble noise is shown to improve speech system performance under babble noise conditions.

II. ANALYSIS AND MODELING OF […]

The common notion of babble in speech systems is the noise encountered when a crowd or a group of people is talking together. An approximation to real babble is to add streams of speakers speaking individually, rather than adding conversations. There are some significant differences between such an approximation and real babble which, to our knowledge, have not yet been considered. Consider a scenario with five speakers as shown in Fig. 1. Individual streams of five speech utterances are shown in Case 1 on the left, where s represents speech activity and # silence. The speech frames are labeled 1 and the silence frames are labeled 0. If five speech streams are added, this implies all five subjects will speak simultaneously, but no two will be engaged in a conversation. In Case 2, it is assumed that the five subjects in the room are divided into two groups, one consisting of two subjects and the other of three subjects, who are involved in two separate conversations. Over time, there will be babble from two conversation groups, where most of the time there would be simultaneous speech from two speakers, one from each conversation. Speakers involved in a conversation would change over time since they take turns to speak within each conversation. In Case 1, there would be five subjects talking simultaneously, whereas in Case 2 there would be two subjects talking simultaneously most of the time, and these two subjects would change with time, depending on the dynamics of each conversation.

Fig. 2 shows the difference in the distributions (pdfs) of the number of speakers speaking per frame when two speakers are added versus when two conversations are added. As observed from the pdfs, when speech from individual speakers is added there is no possibility of more than two speakers speaking at the same time, whereas when two conversations are added, two speakers are speaking most of the time but at times it is possible that all four speakers speak simultaneously. So, to model babble noise, it is more accurate to employ a model consisting of a sum of conversations rather than individual speech streams of conversations overlapped with each other. When individual speech streams are overlapped under the assumption of independence, it is an inaccurate model for actual babble noise since speech from each speaker in a conversation is correlated to the other (i.e., turn-taking within each conversation). The next section identifies the variables that influence babble in a given environment.

Fig. 2. Difference in pdfs of number of speakers talking simultaneously when (a) two conversations are added, and (b) two speakers speaking individually are added.

A. Factors Influencing Babble

Babble noise is a function of the number of speakers in an acoustic environment. The number of conversations and the grouping of the speakers impact the acoustic variability of the babble. In a conversation, there can be more than two subjects participating, but usually there is only one subject speaking at any given point in time. The speaker might change with time, but in general there will be only one speaker speaking. The number of conversations dictates the number of subjects speaking simultaneously in babble. Reducing the

1400 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 7, SEPTEMBER 2009

Fig. 10. Illustration of the reduction in the contiguity of adjacent frames as the number of subjects in babble increases.

smooth is that individual features would be localized in the feature space. The abrupt or random nature of babble would lead to a relatively smaller localized acoustic space. The acoustic trajectory is a time function of the variation of acoustic features given by

[…] (12)

where […] is a function that maps the feature space to the trajectory space. In a quantized acoustic space, the features from the same acoustic region will share similar acoustic properties. The […] is defined as […] when […]
and is an indicator of movement between quantization regions. Here, a "1" indicates movement across quantized regions and a "0" means that the current frame is in the same quantization region as the previous frame. Acoustic features are thus mapped into a sequence of zeros and ones, where a large number of 0's would signify blocks of contiguous speech from a consistent speaker, while a series of 1's suggests more random movement between speakers and phoneme content.

Fig. 10 illustrates the acoustic movement for a single speaker and for multispeaker babble. Here, A-B-C-D are adjacent frames of babble. Each frame is associated with a mixture in the aggregate speech model, which is modeled by GMMs. Each mixture represents an acoustic region. For speech from a single subject, as shown on the left, adjacent acoustic features (e.g., A-B-C, D-E-F, G-H-I) will have movement in the same acoustic region (e.g., A-B-C to mixture 1). For multispeaker babble, adjacent frames will move randomly across acoustic regions (e.g., A to mixture 3, B to mixture 2, and C to mixture 2). Therefore, it is expected that a measure of speaker babble can be obtained by determining how long we stay within a pdf over time using a general GMM. If we hop frequently between mixtures for adjacent frames, there is greater spectral variation and we expect it to be babble. If consecutive frames appear to stay with the same GMM mixture longer, less spectral variability is present and it is more likely a single speaker.

A universal background model (UBM) is employed for analyzing the movement in the acoustic space. UBMs have been used for modeling background speakers for the speaker verification task [19]. A UBM is a GMM trained with speech streams from individual speakers. This represents an aggregate model for speech by a single speaker. If features from a single speaker are assigned to the largest scoring Gaussians in the UBM based on the maximum-likelihood criterion, contiguous blocks would reside in the same Gaussian. As the number of speakers increases, movement between acoustic regions should result for adjacent frames across babble data streams. The UBMs in our case are trained with speech from individual speakers, similar to the models used for speaker identification.

A. Analysis of Acoustic Movement

To demonstrate the impact of the number of speakers on the acoustic variability of babble, a 256-mixture UBM is
constructed using all the training data from the TIMIT corpus. From this data, 19-dimensional MFCCs are extracted using a 20-ms window with a 10-ms skip between adjacent frames. Individual Gaussians in the UBM can be viewed as models of acoustically similar phone blocks in the training features. If the test audio stream contains speech from a single speaker, contiguous frames are expected to be acoustically similar, resulting in contiguous frames associated with the same Gaussian. As the speaker count in the babble increases, there is increased hopping between Gaussians due to the acoustic variation in the data. To quantify the degree of abruptness in babble, a measure of the number of hops per audio segment frame of data is proposed. A hop is defined as a movement between Gaussians

Let a set of training feature vectors be denoted by […]. Here, […] denotes the number of frames in the training set. If […] represents the model for babble with […] speakers, then each frame […] is classified according to the most likely number of speakers […] as

[…]

Using the above decisions for all frames of an utterance, a probability mass function for the number of speakers in the given utterance is evaluated as follows:

[…] = (total number of frames detected as having […] speakers) / (total number of test frames)

A DCT of the observed pdf is evaluated. The DCT reduces the dimensionality of the feature vector and helps decorrelate its dimensions. The DCT of this feature vector for […] conversations is denoted by […]. Here, […] is the dimension of the feature vector. The test feature […] is classified according to the following criterion:

[…] (17)

Here, […] is the covariance of G. The test feature is assigned on the basis of the highest correlation. To implement the detection scheme, separate detectors for 1-to-[…] babble speakers are trained, and each test frame is assigned to one detector for every utterance. A hard speaker count decision is made on a per-frame basis. The first-stage detector is trained using TIMIT data, since this data is read speech with limited pause sections within an utterance. This leads to a speaker-count-specific model for […]
speaker babble since read speech contains limited pause sections. The second stage uses a correlation-based detector. This proposed second stage is required because in actual conversations, the number of speakers speaking at any given time can vary depending on the nature of the conversation. To train for a fixed number of speakers, babble samples with the required number of speakers are used as enrollment features. The training features are obtained from this enrollment feature data and averaged over the enrollment sequence to provide the train enrollment feature. After the test data feature extraction, the correlation of the test feature is measured across the closed set of enrollment features. The overall decision for the number of speakers for a given utterance is based on the maximum correlation with the test data.

VI. RESULTS: DETECTION OF NUMBER OF […]

As previously described, the speaker babble count detector consists of two stages, where each stage is presented separately below.

A. Stage 1: Preliminary Detector (Short Term)

The preliminary Stage 1 detector is made from babble data with an analysis frame length of 125 ms with no overlap between consecutive frames. For parameterization, 19-dimensional MFCCs are extracted as features. The resulting histograms of the babble speaker count detected from overlapped Switchboard corpus conversations are shown in Fig. 14. If we compare Fig. 12 with Fig. 14, it is observed that the detection performance is very poor for the correct number of speakers for a given frame. The detector output is skewed whereas the oracle pdfs are symmetric. It is also observed that the histograms vary with a change in the number of babble speakers. This feature is used to design the second-stage detector.

B. Stage 2: Number of Speakers Detector (Long Term)

This stage of the framework is evaluated on simulated data using the Switchboard corpus for a total of 110 h of data constructed by overlapping different numbers of babble speakers to form each test/training utterance. The test and train sets were separate instances of babble with no overlap of the same speakers (i.e., the actual speakers used to create the overlapped babble speech were different for test and train). The data was split into a total of 800 utterances across nine test cases (each test case having […]
(from one to nine) conversations). The training set consists of 60 instances of babble for nine test cases. Babble data was framed using window lengths of 0.125 s (1000 samples at 8 kHz). Results for babble speaker count classification are shown in Table II. As the number of conversations increases, the acoustic separation in the data decreases and hence the error in detecting the exact speaker count increases (i.e., it is easier to distinguish three to five babble speakers than 13 to 15 babble speakers, because the spectral diversity will decrease as the speaker count increases). On the other hand, the accuracy is very high when the detected speaker count is allowed to fall within a window of […] of the expected speaker count. This is expected from the probability distribution (Fig. 12) of the number of speakers when conversations overlap. From Table II, it is seen that the lowest babble speaker count performance with a detection window of […] is about 81.6%, when seven conversations are present. Given the nature of the task, it is difficult to accurately determine the actual number of people in a conversation at a given point of time, but by estimating the bounds on the number of conversations, it is possible to estimate the minimum number of people in babble.

VII. BABBLE NOISE AND ROBUST […]

To study the impact of characterizing the number of speakers in babble when babble is a primary source of additive noise, an in-set/out-of-set speaker verification system is employed (a full description of in-set recognition is found in [16]). For in-set speaker recognition, the test utterance is scored against all in-set speaker models relative to the background model. If the speaker is detected as any of the in-set models, the speaker is said to be an in-set speaker. The primary motivation for this phase of the study is to determine the impact of choosing the correct babble speaker count in the background when attempting to match the test and train background scenarios, and to study the impact of the […] in babble speaker count detection. To achieve this, the train and test speaker utterances are degraded with babble containing a different number of speakers. For a given signal-to-noise ratio (SNR), the closest corresponding matching (having a similar
speaker count babble) test models are chosen. Here, the speaker characterization is achieved on the basis of the number of speakers in babble as shown in Fig. 15. From the input data, the number of speakers in the background babble noise is estimated while keeping the SNR fixed, and the target models having the same number of speakers are chosen. The speaker verification system employs a binary detector that assigns a test token to the most likely in-set or out-of-set (UBM) model. The efficiency of this binary detector is measured in terms of equal error rate (EER). Here, the EER represents the classification error when the probability of false accept is equal to the probability of false reject. A lower EER indicates a better overall detection system, assuming equal cost for false reject and false accept. In general, when noise is introduced under matched test/train conditions, the EER increases. The next section describes experiments where the number of speakers in the babble noise is used to determine the in-set/out-of-set models to be used. Here, the attempt is not to improve performance for the in-set system, but to demonstrate that the selection of an adequately matched condition (in terms of the number of corrupting babble speakers) helps maintain overall performance. The next section describes the experimental setup.

Fig. 14. PDFs for number of speakers per frame in babble when babble is constructed using two, four, six, and nine conversations.

TABLE II. […] MATRIX IN […] OF THE NUMBER OF CONVERSATIONS DETECTED TO THE ACTUAL CONVERSATIONS. EACH […] CONTAINS THE CLASSIFICATION PERCENTAGES. […] CONTAINS ACCURACY WITH A WINDOW SIZE OF […]

VIII. E[…]

A corpus of babble data is generated by varying the number of speakers in the babble. For a fixed number of speakers in babble, a corpus of ten babble instances is divided into sections of 3, 3, and 4 instances for test, train, and development, respectively. Each of the babble instances is constructed using a different set of speakers (i.e., the exact speakers used for training, development, and testing are mutually exclusive). Each of the test, train, and development sets is degraded with its respective babble instances at a fixed SNR. The speaker ID system is evaluated over three in-set sizes (15, 30, and 45 in-set speakers) and four test utterance durations (2, 4, 6, and 8 s). For a fixed SNR, a total of 12 conditions are therefore evaluated. The in-set/out-of-set database consists of the male speakers from the TIMIT corpus at 16 kHz, and babble distortion is constructed using female speakers from TIMIT. The features used for classification are 19-dimensional MFCCs. Babble is modeled as a

Fig. 3. IS measure decreases as the number of superimposing phones increases.

acoustic volume would mean less acoustic variation. For a single speaker, a larger acoustic space is expected since distinct phonemes would be present, whereas for babble with multiple simultaneous speakers, the expected acoustic volume should be smaller. Furthermore, as the number of speakers in the babble increases, a shrinkage in the acoustic space is expected. Another measure of this spread of the acoustic space is the distance between the pdf centroids. These centroids are an estimate of the compactness of the acoustic data clusters. This scheme is illustrated for one centroid in Fig. 6. The Euclidean distance between two points $\mathbf{a}$ and $\mathbf{b}$ in the $n$-dimensional space is

$d(\mathbf{a},\mathbf{b}) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

These distances are calculated for all centroids describing the acoustic space. As the number of speakers within babble increases, the centroid clusters will move closer (e.g., the points A, B, C, D, E in Fig. 5 will move closer together). This metric therefore provides additional information on the distribution of the centroids (i.e., information pertaining to the relative closeness of the centroids in the acoustic space).

These volume and acoustic space characteristics are evaluated on a synthetic babble corpus constructed using the test corpus of TIMIT, consisting of both male and female speakers. The number of speakers is varied uniformly from one to nine subjects speaking at a time. Here, 19-dimensional Mel frequency cepstral coefficients (MFCCs) are extracted from 125-ms frames (1000 samples at 8-kHz sample rate). The large frame size has been chosen to analyze the aggregate spectral structure of babble. These MFCC vectors are assumed to characterize the acoustic space of babble by clustering and employing Gaussian mixture models (GMMs). The pdfs are given as

[…] (11)

where […] is the conditional 19-dimensional Gaussian. The GMM model parameters are estimated using the EM
algorithm, where the data is split into 32 mixtures and the mean of each mixture is used to characterize the acoustic space. The acoustic volume is evaluated using these centroids. Fig. 7 shows the resulting monotonic decrease in the acoustic volume as the number of speakers in babble increases. Here, there is an exponential reduction in volume as the number of speakers in babble increases. To process speech in noise, ideally, noise should be localized in this space and separated from the acoustic volume. However, noise and speech share the same acoustic space when described using MFCC spectral features; therefore, distinguishing speech from babble becomes difficult. Moreover, the acoustic space of babble is a subregion of the entire acoustic space occupied by speech from a single speaker.

Fig. 4. As the number of superimposing phones increases, the mean spectral distance reduces. The individual spectra of superimposed phones are less distinguishable, as seen from the drop in variance.

Fig. 8 shows the histograms of distances between the centroids for a speech signal with one, four, and nine speakers. The distance histogram with one speaker is broader and flatter, with the distributions approximating Gamma distributions as the number of speakers increases. The variation of the mean distances is shown in Fig. 9: as the number of speakers increases, the mean distance between the centroids decreases, which implies the acoustic features are clustered tightly. As is evident from the volume and distance plots, in cases where there is a reduced number of speakers in babble, the centroids enclose a larger volume, and they are uniformly distributed. With an increase in the number of speakers, the mean distance reduces and the volume also decreases. The acoustic volume describes the reduction in the acoustic space of babble with an increase in the number of contributing speakers. Another impact of the increase in the number of speakers is an increase in the abruptness of the spectral movement for babble, which is studied in the next section.

IV. ACOUSTIC MOVEMENT IN […]

As observed in the previous section, the amount of acoustic variability of babble depends on the number of subjects contributing to the babble. If speech from a subject is modeled as a sequence of phoneme utterances, multispeaker babble can be

It is noted that 19 dimensions were used in [16], and with 32 Gaussian mixtures the likelihoods were found to converge.
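The centroid-distance measure described above can be sketched as follows, assuming the GMM means are already available as plain coordinate tuples. The two toy centroid sets are hypothetical 2-D stand-ins chosen for easy arithmetic (the paper uses 32 mixture means in a 19-dimensional MFCC space), and the function names are illustrative rather than taken from the paper.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_centroid_distances(
    centroids: Sequence[Sequence[float]],
) -> List[float]:
    """Euclidean distance between every unordered pair of centroids."""
    return [math.dist(a, b) for a, b in combinations(centroids, 2)]


def mean_centroid_distance(centroids: Sequence[Sequence[float]]) -> float:
    """Mean pairwise distance; smaller values mean tighter clustering."""
    distances = pairwise_centroid_distances(centroids)
    if not distances:
        raise ValueError("need at least two centroids")
    return sum(distances) / len(distances)


# Hypothetical centroids: a widely spread set (standing in for a single
# speaker) and the same layout shrunk by a factor of ten (standing in
# for many-speaker babble).
spread = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
tight = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.4)]

print("mean distance, spread set:", mean_centroid_distance(spread))
print("mean distance, tight set:", mean_centroid_distance(tight))
```

A histogram of the values returned by `pairwise_centroid_distances` corresponds to the kind of plot described for Fig. 8, and the mean corresponds to the quantity tracked in Fig. 9.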