IJISET International Journal of Innovative Science, Engineering & Technology, Vol. 7 Issue 2, February 2020. ISSN (Online) 2348-7968. Impact Factor (2019) 6.248. www.ijiset.com

Performance Evaluation of Speech Denoising Using Three Different Deep Neural Networks

Gilu Abraham (1) and Preethi Bhaskaran (2)
1 Electronics and Communication Engineering, Rajagiri School of Engineering and Technology, Kochi, Kerala 682039, India

Babble noise is considered one of the most effective noises for masking speech. In [1], the authors proposed denoising of speech signals corrupted by babble noise using a DNN. The experimental results revealed that a DNN combined with a Wiener filter performs significantly better than baseline speech enhancement methods. A deep learning approach that processes the raw waveform to produce a denoised speech signal is developed in [2]. Using a deep feature loss, a fully convolutional context aggregation network (FCCAN) is built in [7]: the loss value is computed by comparing internal feature activations in different networks, and the FCCAN is able to outperform similar networks trained with conventional regression losses. Supervised learning-based speech enhancement techniques have achieved substantial success and show significant improvements over conventional approaches. Existing supervised learning techniques try to minimize the mean squared error between the enhanced output and predefined training targets [9, 10, 11]. A DNN-based speech enhancement method that incorporates a speech perception model into the loss function is developed in [3]; systematic evaluations showed that it improves speech intelligibility over a wide range of signal-to-noise ratios and noise types while maintaining speech quality. In this paper, speech enhancement using FCNN, CNN, and CED networks is presented, and their performance is evaluated based on RMSE, SDR, and PSNR values.
System Overview

The block diagram of the designed speech enhancement system using a deep neural network (DNN) is shown in Fig. 1. A DNN is adopted as the mapping function from noisy to clean speech features. The framework is developed in two phases.

Fig. 1. Block diagram of speech enhancement using a deep neural network.

In the training phase, a DNN-based regression model is prepared utilizing the log-power spectral features of paired noisy and clean speech. The log-power spectral features are used since they are thought to offer perceptually relevant parameters. A short-time Fourier transform (STFT) is first applied: the discrete Fourier transform (DFT) is computed for windowed frames with 75% overlap. Then the log-power spectrum is calculated, and the magnitude and phase of the input signal are separated.

In the speech enhancement phase, the noisy speech features are processed by the trained DNN model to predict the clean speech features, yielding the estimated log-power spectrum of the clean speech. Even though the phase of the speech signal is significant for speech recognition, training a neural network with phase may add complexity to the network. Here, therefore, the phase is taken directly from the noisy speech and multiplied with the estimated magnitude, on the assumption that our ears are insensitive to small phase distortions or global spectral shifts. A frame of the clean speech signal is then obtained from the inverse DFT (IDFT) of the current frame's spectrum, and the waveform of the whole utterance is synthesized using an overlap-add method.

Preprocessing

The first phase of the work comprises the Analysis and Synthesis modules, which extract the features needed for training and testing the DNN architectures. The Analysis module takes the .wav files from the database and extracts their spectrogram amplitude and phase.
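The analysis step just described, an STFT over windowed frames at 75% overlap followed by separation into log-power magnitude and phase, can be sketched as follows. This is an illustrative sketch rather than the authors' code; the 128-point window and 32-point hop match the experimental setup described later in the paper.

```python
import numpy as np

def analyze(signal, n_fft=128, hop=32):
    """Frame the signal, apply a Hann window, and return the
    log-power magnitude and phase of each frame (hop = n_fft / 4
    gives 75% overlap)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)             # one-sided DFT
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # DNN input feature
    phase = np.angle(spectrum)                         # kept for synthesis
    return log_power, phase

# toy example: 1 s of noise at 8 kHz
x = np.random.randn(8000)
lp, ph = analyze(x)
print(lp.shape)  # (247, 65): 65 = 128/2 + 1 one-sided bins
```

The phase array is retained unused during training and reattached to the network's magnitude estimate at synthesis time, as described above.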
The log-power spectrum is utilized as input to the neural network models, while the phase is utilized in the Synthesis module to reconstruct the enhanced speech signal. The STFT is used to obtain the spectrogram of the signal. Recorded audio, such as a 4.3 s long noisy sentence, is simply a vector of consecutive samples and could therefore be used directly as input to the DNN. However, even though this sentence was recorded at the relatively low sampling rate of 8000 Hz, the full input vector would be huge: 4.3 s x 8000 Hz = 34,400 samples. Such a long input vector would make the DNN very complex, which would make it problematic to train. Instead, only a part of the sentence is used as input, making the input vector arbitrarily short. In the extreme case, we could even use a single sample as input. However, a single sample is just one number, and one number would give the DNN no information on whether it represents noise, clean signal, or both; the network would then have no chance of actually improving the signal. Rather, the noisy signal must be presented to the DNN in a form that contains the greatest amount of relevant information to act on. Fig. 2 shows the standard Fourier approach taken here: the input signal is represented by a complex FFT array whose phase and magnitude values can be used as input to the DNN. However, to reduce complexity, only the magnitude spectrum of the speech signal is fed to the DNN for speech enhancement, and the resulting magnitude of the enhanced signal is then combined with the noisy phase. This introduces a limitation in the designed system. Fig. 3 shows the complete preprocessing stage.

Fig. 2. Standard Fourier approach: steps in speech-to-spectrogram conversion.

Training of DNN

The architectures adopted here are FCNN, CNN, and CED neural networks with

many levels of nonlinearities, allowing them to represent a highly nonlinear regression function that maps noisy speech features to clean speech features. To reduce the complexity of the neural network and to speed up the process, the features are normalized to zero mean and unit variance. The training of the DNN as a regression model consists of an unsupervised pre-training part and a supervised fine-tuning part. The hidden units are sigmoid, and the output unit is linear.

Fig. 3. Preprocessing steps applied to the speech signal.

The backpropagation algorithm with an RMSE-based objective function between the normalized log-power spectral features of the estimated and the reference clean signal is then adopted to train the DNN. Rather than pre-training to initialize the parameters of the first few hidden layers, the fine-tuning part performs supervised training of all the parameters in the network. The RMSE criterion in the log-power spectral domain has demonstrated consistency with the human auditory system. A mini-batch stochastic gradient descent algorithm is used to minimize the error function.

Enhancement Stage

Once the DNN is trained, it can be used to enhance speech. As the DNN's output is constrained to the log-power spectrum of the enhanced signal, the waveform has to be reproduced by reversing the preprocessing steps. Here the phase of the noisy audio is utilized to reconstruct the clean audio. A pictorial representation of the enhancement stage is shown in Fig. 4.

Fig. 4. Pictorial representation of the enhancement stage.

Deep Learning

In recent years, deep learning has become exceptionally popular in areas such as image processing, video processing, machine translation, and many others. One of the reasons that make deep learning the cutting edge of so many works is its capacity to model complex nonlinear mapping functions.
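The zero-mean, unit-variance feature normalization used in training can be sketched as follows. This is the standard recipe rather than code from the paper, and it assumes the statistics are computed on the training features and reused at test time.

```python
import numpy as np

def fit_normalizer(train_features):
    """Compute per-dimension mean and standard deviation over
    the training features (shape: frames x feature_dim)."""
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0) + 1e-8  # guard against zero variance
    return mu, sigma

def normalize(features, mu, sigma):
    """Map features to zero mean and unit variance per dimension."""
    return (features - mu) / sigma

# toy log-power-like features: 1000 frames of 65 bins
rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(1000, 65))
mu, sigma = fit_normalizer(train)
z = normalize(train, mu, sigma)
print(z.mean(), z.std())  # close to 0 and 1
```

Normalizing both the input and target features this way keeps the RMSE objective well-scaled and speeds up gradient descent, which is the motivation stated above.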

Along these lines, speech processing is another area where deep learning has become the state of the art for many tasks, one of them being speech enhancement. Deep learning is about learning multiple levels of representation and abstraction that help make sense of data such as images, audio, and text. It relies on deep neural networks, which are responsible for learning from the input training data. In practice, depending on the task, there are different types of deep neural networks, such as convolutional neural networks, fully connected neural networks, convolutional encoder-decoder networks, and so on.

Fully Connected Neural Network (FCNN)

The FCNN is a network with fully connected layers, where each neuron in a given layer is connected to every neuron in the subsequent layer [2]; this is why it is called fully connected. The number of neurons in neighboring layers can be arbitrary. FCNNs are the workhorses of deep learning, used for a huge number of applications. The major advantage of fully connected networks is that they are structure agnostic [2]; that is, no special assumptions need to be made about the input data. An FCNN consists of a series of fully connected layers. A fully connected layer is a function from R^m to R^n, where each output dimension depends on every input dimension. Fig. 5 shows the representation of the fully connected layer, and Fig. 6 shows the structure of a fully connected neural network. The fully connected layer multiplies the input by a weight matrix and then adds a bias vector. The number of neurons in the layer and the number of activations from the previous layer determine the dimensions of the weight matrix and the bias vector.

Fig. 5. Pictorial representation of the fully connected layer.

Fig. 6. Structure of a fully connected neural network.

Convolutional Neural Network (CNN)

A CNN (ConvNet) is a deep learning algorithm that can take in an input image, appoint significance to different angles/objects in the image, and differentiate one from the other. The preprocessing required in a ConvNet is minimal compared with other classification algorithms [12]. A ConvNet can effectively capture the spatial and temporal dependencies in the given input through the use of relevant filters. The architecture achieves a better fit to the dataset because of the reduction in the number of parameters involved and the reusability of weights. Fig. 7 shows the different layers present in the CNN. The convolutional layer convolves the input by moving the filters along the input vertically and horizontally; the dot product of the filter weights and the input is computed, and then a bias term is added. Compared with an FCNN, convolutional layers contain far fewer parameters.

Fig. 7. Different layers present in the CNN (one group of convolution).

Convolutional Encoder-Decoder (CED)

The convolutional encoder-decoder (CED) network comprises symmetric encoding and decoding layers [11]. An encoder consists of sets of convolution, batch normalization (BN), max-pooling, and ReLU layers. The decoder consists of sets of convolution, BN, and upsampling layers. A CED compresses features with an encoder, and afterward reconstructs features with a decoder. Fig. 8 shows the architecture of the CED. The encoder-decoder structure is preferred because, in a CNN, the max-pooling layer reduces the dimensions to make processing easier; hence in the decoder, an upsampling layer is used to reverse the effect of the max-pooling layer.

Fig. 8. Structure of the convolutional encoder-decoder.

Design of Network Architecture

The networks consist of multiple layers, such as an input layer, convolutional layers, Rectified Linear Unit (ReLU) layers, pooling layers, fully connected layers, a normalization layer, and a regression layer. The CNN shares weights in the convolutional layer, reducing the memory footprint and increasing the performance of the network.
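The fully connected layer described above (multiply the input by a weight matrix, add a bias vector) together with the Glorot initialization used in the experiments can be sketched as follows; the layer sizes here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng):
    """Glorot (Xavier) normal initialization: zero mean,
    variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def fully_connected(x, W, b):
    """A fully connected layer: every output dimension depends
    on every input dimension via y = W x + b."""
    return W @ x + b

rng = np.random.default_rng(0)
fan_in, fan_out = 65, 128       # e.g. 65 one-sided STFT bins in
W = glorot_init(fan_in, fan_out, rng)
b = np.zeros(fan_out)
y = fully_connected(np.ones(fan_in), W, b)
print(y.shape)  # (128,)
```

The weight matrix has fan_out x fan_in entries, which is why fully connected layers dominate the parameter count compared with weight-sharing convolutional layers, as noted above.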

The weights of each convolutional layer are initialized using the Glorot initializer, where the values vary between -0.01 and +0.01, with zero mean and variance 2 / (fan_in + fan_out). The number of neurons in the hidden layer of the FCNN is determined by the size of the input. A batch normalization layer is used to increase the stability of the neural network. Then a nonlinear activation function is applied through the ReLU layer to improve the convergence properties when low error is obtained. In the pooling layer, max pooling is used: the maximum value is selected from the feature map using a 2x2 grid, which reduces the sample size. In the case of the CNN, an additional convolutional layer is present, which helps in reducing the number of learnable parameters. The convolutional kernel is passed over the input horizontally and vertically, and can be of size 7x7, 5x5, or 3x3.

Experimental Setup

Dataset

The experiment was conducted on a generated dataset, for which clean signals were collected from the TIMIT database and babble noise clips from the NOIZEUS database [1, 2]. Clean signals and babble noise were mixed to produce the dataset for the evaluation of the experiment. Both the training set and the testing set (200 utterances) were mixed with babble noise clips at -10 dB, -5 dB, 0 dB, 5 dB, and 10 dB SNR.

Signal Transformation

The audio signals were downsampled to 8 kHz, and the silent frames were removed from the signal. The spectral vectors were analyzed using a 128-point short-time Fourier transform (STFT) with a window shift of 32 points. The 128-point STFT vectors were reduced to 65 points by removing the symmetric half. For the FC, CNN, and CED networks, the input features comprised 8 consecutive noisy STFT amplitude vectors (size: 65x8). Both input and target features were normalized to have zero mean and unit variance.

Phase-Aware Scaling

To avoid extreme differences (more than 45 degrees) between the noisy and clean phase, at reconstruction, the noisy spectral phase was used to perform the inverse STFT and recover the speech. Because the human ear is not sensitive to phase differences smaller than 45 degrees, the resulting distortion is negligible. Through phase-aware scaling, the phase mismatch is kept smaller than 45 degrees. Hence the phase of the noisy speech is directly multiplied with the output of the neural network and then used for estimating the enhanced speech.

Optimization

The convolutional kernel weights were initialized using the Glorot initializer, i.e., kernels with minimal initial values. Convolutional layers were trained from scratch, with a batch normalization layer added after each convolutional layer. All networks were trained using backpropagation with gradient descent optimization using Adam with a mini-batch size of 64.

Evaluation Metrics

The performance of the networks is compared using the following three parameters.

Mean Square Error (MSE): The mean squared error (MSE), or mean squared deviation (MSD), of an estimator quantifies the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value [1, 13]. The MSE is almost always strictly positive (and not zero) because of randomness or because the estimator does not account for information that could produce a more accurate estimate. The MSE is a measure of the quality of an estimator: it is always non-negative, and values closer to zero are better.

Signal to Distortion Ratio (SDR): The SDR is used to measure the amount of error present between clean and denoised speech [1, 13]. It is defined as the ratio of the clean speech power to the MSE value, and it is measured on a log scale.
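A minimal sketch of these metrics follows: MSE and SDR as defined above, plus PSNR, defined next, which follows the same pattern. Since the paper does not spell out its exact normalizations, these use the common textbook definitions.

```python
import numpy as np

def mse(clean, denoised):
    """Mean squared error between clean and denoised signals."""
    return np.mean((clean - denoised) ** 2)

def sdr_db(clean, denoised):
    """Signal-to-distortion ratio: clean power over MSE, in dB."""
    return 10 * np.log10(np.mean(clean ** 2) / mse(clean, denoised))

def psnr_db(clean, denoised):
    """Peak signal-to-noise ratio: peak clean power over MSE, in dB."""
    peak = np.max(np.abs(clean))
    return 10 * np.log10(peak ** 2 / mse(clean, denoised))

# toy signals: a sinusoid and a copy with a small constant error
clean = np.sin(2 * np.pi * np.arange(1000) / 100)
denoised = clean + 0.01 * np.ones(1000)
print(round(mse(clean, denoised), 6))     # 0.0001
print(round(sdr_db(clean, denoised), 1))  # 37.0
```

Higher SDR and PSNR (in dB) and lower MSE all indicate a denoised signal closer to the clean reference, matching the interpretation given in the text.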
The higher the value, the better the quality of the signal.

Peak Signal to Noise Ratio (PSNR): PSNR is defined as the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation [1, 13]. Since many signals have a wide dynamic range, PSNR is usually expressed on the logarithmic decibel scale. The higher the value, the better the quality of the enhanced speech signal.

Analysis and Simulation Results

Training and Testing

The backpropagation algorithm is used to train the deep networks for the mapping and speech enhancement. The procedure consists of two phases. In the first phase, the DNNs are trained using spectrograms of speech signals. In the second phase, the trained DNNs are tested using sample spectrograms of speech signals. Here the performance of the three basic DNNs (FC, CNN, and CED) is compared and evaluated using MSE, SDR, and PSNR.

Simulation Results

Fig. 9 shows the denoised speech using FC, CNN, and CED.

Fig. 9. Speech denoising using deep neural networks at 0 dB SNR.

Fig. 10 shows the spectrograms of the denoised speech using FC, CNN, and CED. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time; it allows visualization of both frequency and amplitude information of the audio in one display.

Fig. 10. Spectrograms of the denoised speech using the DNNs at 0 dB SNR.

Fig. 11 shows the mean square error comparison of the three designed neural networks, i.e., the difference between the estimated and actual values at different SNRs; the smaller the value, the better the estimated signal. Fig. 12 shows the signal to distortion ratio comparison of the three designed neural networks; it measures the quality of the signal, and the higher the value, the better the quality of the speech.

Fig. 11. Mean square error comparison plot.

Fig. 12. Signal to distortion ratio comparison for different SNRs.

Fig. 13 shows the peak signal to noise ratio comparison of the three designed neural networks. It also reflects the quality of the signal: the higher the value, the better the quality of the speech.

Fig. 13. Peak signal to noise ratio comparison for

different SNRs.

Conclusion and Future Work

Deep learning has become a game-changer in the field of speech enhancement, and DNNs represent one of the most promising approaches. Here, speech enhancement using FCNN, CNN, and CED neural networks has been implemented, and the performance of the three designed network architectures is compared using MSE, PSNR, and SDR scores. The CNN can effectively denoise speech with a smaller network size owing to its weight-sharing property, and it performs well at both positive and negative SNRs. Due to its large number of learnable parameters, the FC network does not perform well at negative SNRs, although it performs well at positive SNRs. The CED has also shown efficient performance at both positive and negative SNRs.

References

[1] Nasir Saleem, Muhammad Irfan, Xuhui Chen, Muhammad Ali, "Deep Neural Network-based Supervised Speech Enhancement in Speech Babble Noise," IEEE/ICIS International Conference on Computer and Information Science, June.
[2] Se Rim Park, Jin Won Lee, "A Fully Convolutional Neural Network for Speech Enhancement," arXiv:1609.07132v1 [cs.LG], 22 Sep 2016.
[3] Subhashree Satpathy, Ravinder Kumar, "Performance Evaluation of Active Noise Control Algorithm Using Matlab," IJISET International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 8, August 2015.
[4] Yan Zhao, Buye Xu, Ritwik Giri, Tao Zhang, "Perceptually Guided Speech Enhancement Using Deep Neural Networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[5] Ming Tu, Xianxian Zhang, "Speech Enhancement Based on Deep Neural Networks with Skip Connections," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] J. Johnson, A. Alahi, L. Fei-Fei, "Perceptual Losses for Real-Time Style Transfer and Super-Resolution," European Conference on Computer Vision (ECCV).
[7] Francois G. Germain, Qifeng Chen, Vladlen Koltun, "Speech Denoising with Deep Feature Losses," arXiv:1806.10522v2 [eess.AS], 14 Sep 2018.
[8] Yariv Ephraim, David Malah,

"Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.
[9] Pascal Scalart et al., "Speech Enhancement Based on A Priori Signal to Noise Estimation," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 629-632, 1996.
[10] Yong Xu, Jun Du, Li-Rong Dai, Chin-Hui Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
[11] Bingyin Xia, Changchun Bao, "Speech Enhancement with Weighted Denoising Auto-Encoder," INTERSPEECH, pp. 3448, 2013.
[12] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, Dong Yu, "Convolutional Neural Networks for Speech Recognition," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 10, pp. 1533-1545, 2014.
[13] Ambalika, Er. Sonia Saini, "Performance Comparison of Speech Enhancement Algorithms Using Different Parameters," International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), Vol. 5, no. 7, July.

First Author: Gilu K. Abraham received her B.Tech in ECE from SJCET, Palai, India in 2017. She is pursuing an M.Tech in ECE at Rajagiri School of Engineering and Technology, Kochi, India. She received the best paper award for the paper entitled "Lung Nodule Detection in CT Images: A Novel Approach" at NCMLAI 2019, held at CIT.

Second Author: Preethi Bhaskaran received her BE and ME in Applied Electronics from Anna University. She has more than 10 years of teaching experience. She is currently working as an Assistant Professor in the Department of ECE, Rajagiri School of Engineering and Technology, Kochi, India. She has published research papers in various journals and conferences.