
1420 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 7, SEPTEMBER 2009

Convolutive Transfer Function Generalized Sidelobe Canceler

Ronen Talmon, Israel Cohen, Senior Member, IEEE, and Sharon Gannot, Senior Member, IEEE

Abstract: In this paper, we propose a convolutive transfer function generalized sidelobe canceler (CTF-GSC), which is an adaptive beamformer designed for multichannel speech enhancement in reverberant environments. Using a complete system representation in the short-time Fourier transform (STFT) domain, we formulate a constrained minimization problem of total output noise power [...]

Authorized licensed use limited to: Technion Israel School of Technology. Downloaded on October 28, 2009 at 11:20 from IEEE Xplore. Restrictions apply.

[...] similarly to [...]. Thus, we can write the estimation error as a sum of two terms

[...] (9)

where [...] represents the speech distortion, and [...] represents the residual noise. Now, from (11), the MSE of each subband in the STFT domain associated with the residual noise is

[...] (12)

where [...] is the matrix trace and [...]. Thus, we can state the noise reduction problem subject to zero speech distortion as follows:

[...]

By using the relative impulse responses, which represent the coupling between the speech components at each microphone, the zero speech distortion constraint can be written explicitly (see Appendix B), and we can rewrite the optimization problem

[...] (14)

where [...] is defined as [...], [...] is a matrix consisting of the [...] convolution matrices [...], and [...] is a constant matrix given by [...], where [...] is a vector of zeros and [...] is a unit matrix. Solving the above optimization problem requires estimates of the RTFs of all sensors in the STFT domain [...] and estimates of the noise signal PSDs [...]. A geometric illustration of the optimal beamformer is shown in Fig. 1. The constraint in (14), posed on the matrix [...], can be broken into [...] constraints on each of its columns [...], where [...] is a unit vector of length [...]. Thus, each constraint takes the following form

[...] (15)

where [...] is the [...]th column of the matrix [...], i.e., it is a vector of length [...] and equals [...]. Now, (14) can be solved using [...] Lagrange multipliers [...]

Fig. 1. Geometric interpretation of the optimal beamformer.

IV. G[...]

Based on the generalized sidelobe canceler (GSC) structure [6], we transform the constrained optimization problem into an unconstrained form. The minimization and the constraint are decoupled into two parallel processing branches, yielding an MVDR-equivalent beamformer carried out in the STFT domain.

Consider the null space of [...], defined by [...], and the constraint hyperplanes, defined as [...]. Thus, we get [...] hyperplanes, parallel to the null space [...]. It is worthwhile noting that such a null space exists due to [...] dimensions. Let [...] denote the range of [...], i.e., [...]. Using the fundamental theorem of linear algebra [24], we have [...]. Thus, each vector [...] in the linear space can be uniquely split into a sum of two vectors in mutually orthogonal subspaces, as follows:

[...]

In the following, we neglect the unit vector length notation for simplicity.

This derivation is in higher dimensions than the null space definition presented in [6], since our presentation involves dependency between time frames in each subband.

[...]

B. Blocking Matrix

Define the blocking matrix as

[...]
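The split into a constraint-satisfying branch and a null-space branch can be illustrated with a deliberately simplified sketch. It assumes a single frequency bin and scalar (single-tap) RTFs, so the constraint is one vector rather than the higher-dimensional matrix used in the paper. The names `h`, `fixed_beamformer`, `blocking_vectors` and `split_weights` are hypothetical and do not come from the paper; the fixed beamformer and blocking vectors shown are one common choice that satisfies the stated properties, not necessarily the construction of (31).

```python
def inner(a, b):
    """Hermitian inner product a^H b of two equal-length complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def fixed_beamformer(h):
    """Weights w0 in span{h} with w0^H h = 1 (assumed distortionless constraint)."""
    energy = inner(h, h).real
    return [hm / energy for hm in h]


def blocking_vectors(h):
    """Vectors b_m with b_m^H h = 0, assuming h[0] == 1 (primary microphone)."""
    n_mics = len(h)
    vectors = []
    for m in range(1, n_mics):
        b = [0j] * n_mics
        b[0] = -h[m].conjugate()  # so that b^H x = x[m] - h[m] * x[0]
        b[m] = 1 + 0j
        vectors.append(b)
    return vectors


def split_weights(w, h):
    """Split w into a part in span{h} and a part orthogonal to h."""
    scale = inner(h, w) / inner(h, h)
    w_par = [scale * hm for hm in h]
    w_perp = [wm - pm for wm, pm in zip(w, w_par)]
    return w_par, w_perp


# Hypothetical scalar RTFs for three microphones (first one is the reference).
h = [1 + 0j, 0.5 - 0.2j, -0.3 + 0.4j]
w = [0.2 + 0.1j, -0.4j, 0.7 + 0j]  # arbitrary beamformer weights

w0 = fixed_beamformer(h)
B = blocking_vectors(h)
w_par, w_perp = split_weights(w, h)

# A speech-only snapshot: every microphone sees the reference speech times its RTF.
s = 0.9 - 0.3j
x = [hm * s for hm in h]

print("fixed beamformer output:", inner(w0, x))          # close to s
print("blocking outputs:", [abs(inner(b, x)) for b in B])  # close to zero
print("h^H w_perp:", abs(inner(h, w_perp)))               # close to zero
```

In this toy setting the fixed branch passes the speech undistorted and the blocking branch outputs contain no speech, which is the property the adaptive noise canceler relies on.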
Using simple arithmetic, we obtain that [...] satisfies

[...]

From (25), using [...] and (29), the noise canceler output [...], and by using the proposed blocking matrix from (31), we obtain

[...]

Hence, the desired signal is blocked and the noise canceler output is comprised of reference noise-only signals.

C. Noise Canceler

Our goal is to minimize the power of the output signal, i.e., [...], which is obtained by adjusting the noise canceler filter [...]. From (33), we know that the noise canceler receives noise-only signals (the output of the blocking matrix), and that it is designed to cancel out the coherent noise at the output of the fixed beamformer (30). This adjustment problem is the classical multichannel noise cancellation problem, which can be solved by using the Wiener filter. In this paper, we implement the noise canceler adaptively using the NLMS algorithm. In doing so, we take advantage of the fact that speech signals are better represented in the STFT domain than in the time domain. The condition number of the autocorrelation matrix of the STFT samples of the speech signal is closer to 1, so adaptive algorithms converge faster in the STFT domain than in the time domain [18], [23], [25].

V. PROPOSED [...]

Implementing the GSC scheme as described above is impractical because of the large dimensionality of the system when perfectly represented in the STFT domain. Thus, we propose approximate representations that reduce the model complexity, while maintaining satisfactory performance.

We focus our work on convolution approximations in the STFT domain. These approximations can be applied on the [...], which represents the convolution with the RTF and describes the coupling between the speech components.

Applying such an approximation on the GSC framework influences both processing branches and has a major impact on the beamformer output. At the upper branch, the approximated fixed beamformer does not meet the constraint; thus, speech distortion is introduced into the system. At the lower branch, the approximated blocking matrix does not belong to the null space, and as a result does not block the speech signal completely. Clearly, the amount of leakage has a major influence on the quality of the beamformer output. In case of significant leakage from the blocking matrix, speech traces are left at the output of the noise canceler and subtracted from the fixed beamformer output, causing distortion at the beamformer output.

A. MTF Approximation

Now we apply the MTF approximation on the GSC scheme in the STFT domain. Under this approximation, a convolution in the time domain becomes a scalar multiplication in the STFT domain (as previously mentioned, with a proper zero padding of the STFT analysis window). Thus, the cross-band filters are neglected and the band-to-band filter is approximated as a single coefficient. Accordingly, the convolution matrices of the cross-band filters of the RTF between the speech component at the [...]th microphone and the speech component at the first microphone are neglected [...], and the convolution matrix of the band-to-band filter is approximated by [...]. Thus, the matrix [...] under the MTF model is reduced to a diagonal matrix

[...]

By substituting (36) into (26), we obtain the approximated fixed beamformer

[...] (37)

where [...] and

[...]

where [...]. Substituting (37) into (24) reduces the fixed beamformer output based on the MTF approximation to [...], which is the output of the fixed beamformer using the TF-GSC method [6].

Let [...] denote the [...]th output channel of the blocking matrix. Thus, it can be written as

[...] (39)
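The leakage argument above can be made concrete with a small, hedged simulation. It assumes a single subband in which the true coupling between the primary and a reference microphone is exactly a short band-to-band filter acting along the frame index (cross-band filters are ignored), and it uses oracle filter taps for the multi-tap case. All names (`ctf_filter`, `true_taps`, `g_mtf`) and all numeric values are hypothetical and serve illustration only.

```python
import random


def ctf_filter(taps, frames):
    """Convolve one subband's STFT frames with a short filter along the frame index."""
    out = []
    for p in range(len(frames)):
        acc = 0j
        for l, g in enumerate(taps):
            if p - l >= 0:
                acc += g * frames[p - l]
        out.append(acc)
    return out


def energy(seq):
    """Sum of squared magnitudes."""
    return sum(abs(v) ** 2 for v in seq)


rng = random.Random(0)

# Speech component at the primary microphone in one subband (synthetic stand-in).
s1 = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Assumed true coupling: a three-tap band-to-band filter (hypothetical values).
true_taps = [0.8 + 0.1j, 0.3 - 0.2j, -0.1 + 0.05j]
x2 = ctf_filter(true_taps, s1)  # speech component at the second microphone

# Single-coefficient (multiplicative) model: best scalar in the least-squares sense.
g_mtf = sum(a.conjugate() * b for a, b in zip(s1, x2)) / energy(s1)
leak_mtf = [b - g_mtf * a for a, b in zip(s1, x2)]

# Multi-tap (convolutive) model with oracle taps: subtract the filtered primary signal.
leak_ctf = [b - c for b, c in zip(x2, ctf_filter(true_taps, s1))]

print("relative speech leakage, single coefficient:", energy(leak_mtf) / energy(x2))
print("relative speech leakage, multi-tap filter:  ", energy(leak_ctf) / energy(x2))
```

With a single coefficient, the energy carried by the later filter taps cannot be removed and stays in the reference signal; in practice the multi-tap filter would have to be estimated, so its leakage would be small rather than exactly zero.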
[...] achieved by the competing methods is maintained. Thus, since the RTFs are known and are not influenced by changes of the input SNR, we may conclude that both competing algorithms depend on the input SNR only through the RTF identification. Another observation drawn from Fig. 5 is the performance of the competing methods under the two different locations of the noise source. It shows that both methods obtain better results when the noise source is located near the microphone array, since in this case the created noise field at the array is less diffuse and is easier to attenuate.

In the second experiment, the RTFs are unknown and should be estimated from the measurements. To make a fair comparison, we implemented an improved version of the TF-GSC technique. The original version of the TF-GSC, proposed in [6], is based on the RTF identification method [33], which assumes the presence of a nonstationary source and stationary uncorrelated noise. We improve this algorithm by replacing the RTF identification method with a method adapted to speech signals [15], which also takes advantage of silent periods. Thus, both the improved version of the TF-GSC and the proposed method require knowledge of speech presence probabilities, which can be obtained using a VAD.

As mentioned above, in these experiments we use a short period of noise-only signal in an a priori known location rather than using a VAD. The silent time frames are used for estimating the noise PSDs, and the time frames that contain speech are used for identifying the RTFs.

Fig. 6(a)–(f) shows the waveform and the spectrogram of the speech component received by the primary microphone, the primary microphone noisy measurement at an SNR level of 5 dB, the enhanced speech obtained by the TF-GSC and the proposed methods, and a reference noise signal obtained at the output of the blocking matrix in both methods (audio files are available online: http://www.ee.technion.ac.il/people/IsraelCohen/Publications/CTF-GSC-audio-files/waves.pdf). We clearly observe that the enhanced signal obtained by the proposed algorithm is less noisy than the enhanced signal obtained by the TF-GSC technique. In addition, we can observe a significant speech distortion at the output of the TF-GSC method (e.g., from the waveforms in the range of 1.5–2 s), whereas the output of the proposed method seems undistorted. Examining the reference noise signals in Fig. 6(e)–(f) may suggest an explanation. It shows that the reference noise signal obtained at the output of the blocking matrix in the proposed method has fewer speech components than the reference noise signal obtained at the output of the blocking matrix of the TF-GSC method. These speech components leak into the output of the noise canceler and are then subtracted from the fixed beamformer output, inducing a distortion.

Fig. 6. Signal waveforms and spectrograms under identified RTF scenario. (a) Reverberant speech source received at the first microphone. (b) Noisy signal received at the first microphone, SNR [...] dB. (c) Enhanced signal obtained at the TF-GSC output, SNR [...] dB. (d) Enhanced signal obtained at the CTF-GSC output, [...] dB. (e) Reference noise signal at the output of the TF-GSC blocking matrix. (f) Reference noise signal at the output of the CTF-GSC blocking matrix.

Fig. 7 summarizes the SNR improvement, the SegSNR improvement and the noise reduction obtained by the competing algorithms. It shows that the SNR improvement (and the SegSNR improvement) achieved by the proposed method is higher than the SNR improvement (and the SegSNR improvement) achieved by the TF-GSC. In addition, the proposed algorithm obtains better noise reduction. Furthermore, we notice that as the input SNR level increases, the difference between the SNR improvement achieved by the competing methods increases. In particular, at higher input SNR levels the proposed method based on the CTF approximation becomes more advantageous. Since the CTF model is associated with larger model complexity than the MTF model, as the input SNR level increases and the data becomes more reliable, a larger number of parameters can be accurately estimated [14], [16]. Similar trends can be observed in the SegSNR improvement and in the noise reduction findings, where the difference in these measures between the competing methods increases in favor of the proposed method as the input SNR increases.

We can also observe that in this experiment the competing methods obtain better results when the noise source is located relatively far away from the microphone array, unlike the results shown in Fig. 5. The input SNR is defined as the ratio between the energy of the speech component and the noise component at the primary microphone. Now, since the acoustic room impulse response between the noise source and the array conveys more energy as the noise source becomes further away, the noise source power is decreased in order to maintain a certain input SNR level. Consequently, the RTF identification improves when the noise source is moved further away from the array and its power is decreased.

Fig. 7. Results obtained under identified RTF scenario with room reverberation time set to 0.5 s. Curves obtained when the noise source is positioned relatively near the array are plotted in solid line. Curves obtained when the noise source is positioned relatively far away from the array are plotted in dashed line. (a) SNR improvement. (b) SegSNR improvement. (c) Noise reduction.

In the third experiment, we explore the performance of the proposed method in various room reverberation times. As in the previous experiment, the RTFs are unknown and should be estimated from the measurements, and the remote noise source is simulated (located further away from the microphone array). As previously mentioned, since the input SNR is defined as the ratio between the variance of the reverberant speech component and the noise component at the primary microphone, it is significantly influenced by changes of the room reverberation time. In order to circumvent these changes, we redefine the input SNR as the ratio between the variances of the clean sources (so that the input SNR and the reverberation time are independent) and maintain it at a fixed level during this experiment. In addition, we use the noise reduction measure to evaluate the performance, since both the SNR and the SegSNR measures might be influenced by the room reverberation time.

Fig. 8 shows the noise reduction obtained by the proposed method and the TF-GSC in various reverberation times. It shows that as the reverberation time increases, the noise reduction obtained by both competing methods decreases. In addition, we observe that in shorter reverberation times the TF-GSC achieves better noise reduction, whereas in longer reverberation times the proposed method performs better. Since the effective length of the RTF increases with the reverberation time, the CTF model becomes more appropriate for the RTF representation. As mentioned in previous sections, under the MTF model the RTF length is bounded by the length of the time frame (which should be relatively short to obtain a larger number of frames for reducing the estimation variance, and to validate the assumption that the speech is stationary in each time frame). However, under the CTF model, long RTFs can be represented using short time frames by using longer CTF filters. It is worthwhile noting that when using the proposed method, the CTF filter length increases [according to (61)] and more variables are estimated as the reverberation time increases, whereas the number of variables is unchanged when using the TF-GSC (i.e., a larger part of the RTF is truncated). We can also observe that both competing methods perform better in higher input SNR levels [Fig. 8(b)] than in lower input SNR levels [Fig. 8(a)]. In addition, the proposed method becomes advantageous over the TF-GSC at shorter reverberation times when the input SNR is higher. In Fig. 8(b), the intersection point between the curves is at a reverberation time of approximately 280 ms, and in Fig. 8(a) the intersection point between the curves is at a reverberation time of approximately 350 ms. As already stated, the CTF model is associated with a greater model complexity than the MTF model [16]; thus, the CTF-GSC becomes more advantageous when the input SNR is higher and the data is more reliable.

Fig. 8. Noise reduction obtained in various room reverberation times relying on identified RTF. The noise source is positioned relatively far away from the array. (a) Input SNR 0 dB. (b) Input SNR 5 dB.

VII. CONCLUSION

We have proposed an MVDR beamformer based on a new approach for signal and system representation in the STFT domain. The proposed algorithm is implemented using the GSC scheme, yielding an unconstrained minimization problem which can be solved efficiently. Unlike other classical methods, which rely on the multiplicative model for linear convolution representation (the so-called MTF approximation), our method is based on a convolutive model (the CTF approximation). The CTF approximation, which was shown to be more accurate and less restrictive, enables representations of long transfer functions with short time frames. This property may be especially useful in reverberant environments, where acoustic room impulse responses are long. We demonstrated the performance of the proposed method and compared it with the TF-GSC in reverberant environments. When the input SNR is sufficiently high, the CTF approximation and the proposed method enable improved SNR and better noise reduction. The improved experimental results imply that the CTF approximation may also be beneficially utilized in other beamforming methods.

An adaptive version of the proposed solution is a topic for future research. The proposed RTF identification is based on batch processing, and therefore an adaptive version of the RTF identification is required. In addition, online estimation of speech presence probabilities and noise PSDs should be incorporated into the system, in order to fully enjoy the advantages of the proposed GSC scheme.

DERIVATION OF [...]

Writing (6) in matrix form yields

[...] (62)

where [...] represents complex conjugate transpose and [...] is a convolution matrix of the cross-band filter [...] of size [...]. Let [...] be a concatenation of [...] from all subbands, i.e., [...], and let [...] be a concatenation of all the convolution matrices of [...] of size [...], i.e., [...]. Thus, we can rewrite (62) as

[...]

Now, by concatenating the filters and the STFT samples from all the microphones, the estimator (63) can be compactly expressed

[...] (64)

where [...] is a vector of length [...] defined as [...] and [...] is a matrix of size [...], defined as [...].

DERIVATION OF [...]

From (4), we can write the coupling between the speech components at each microphone using the relative impulse response [...] or in a compact form

[...] (66)

where [...] is a matrix of size [...] consisting of the [...] convolution matrices [...]. It is worthwhile noting that [...] is a unit matrix [...]. Substituting (66) into (10) yields

[...] (67)

where [...] is a constant matrix of size [...] given by [...].

We assumed that the number of time frames [...] is greater than the length of the cross-band filter [...].
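The derivations above rely on writing a convolution along the frame index as a matrix product. As a minimal sketch of that idea, the code below builds a Toeplitz convolution matrix for a short filter and checks it against direct convolution. The full-length output convention (number of frames plus filter length minus one) and the names `convolution_matrix`, `taps` and `frames` are assumptions made for illustration; the matrix sizes used in the paper are not recoverable from this transcript.

```python
def convolution_matrix(taps, n_frames):
    """Toeplitz matrix T of size (n_frames + len(taps) - 1) x n_frames
    such that T applied to a frame sequence equals its convolution with taps."""
    n_rows = n_frames + len(taps) - 1
    return [
        [taps[r - c] if 0 <= r - c < len(taps) else 0 for c in range(n_frames)]
        for r in range(n_rows)
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def convolve(taps, frames):
    """Direct full linear convolution, used as the reference result."""
    out = [0] * (len(frames) + len(taps) - 1)
    for i, g in enumerate(taps):
        for j, x in enumerate(frames):
            out[i + j] += g * x
    return out


# Hypothetical filter taps and STFT frames of one subband.
taps = [1 + 1j, 0.5, -0.25j]
frames = [1, 2j, -1, 0.5 + 0.5j, 3]

T = convolution_matrix(taps, len(frames))
via_matrix = matvec(T, frames)
direct = convolve(taps, frames)

print("matrix size:", len(T), "x", len(T[0]))
print("max difference:", max(abs(a - b) for a, b in zip(via_matrix, direct)))
```

Stacking such matrices for several filters and microphones gives the kind of block structure the concatenated expressions refer to, which is also why the dimensionality grows quickly.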