
3rd International Conference on Electrical & Computer Engineering

ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh

ON THE ESTIMATION OF NOISE FROM PAUSE REGIONS FOR SPEECH ENHANCEMENT USING SPECTRAL SUBTRACTION
Md. Rashidul Islam, Hasibul Haque, M. Q. Apu and *Md. Kamrul Hasan
Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh
*Email: khasan@eee.buet.ac.bd, {shanto_012, hasib017, mqapu}@yahoo.com

ABSTRACT
In this paper, we propose an algorithm to identify pause regions in speech by tracking second-order minima of frame variances of the FFT spectrum that fall within a prescribed frequency range. To study the effect of the noise estimation algorithm on speech enhancement, the noise spectrum is estimated from the identified pause regions using different methods. The effectiveness of the proposed algorithm in detecting noise-only frames for noise spectrum estimation is illustrated by comparing speech enhancement results obtained with the true and the estimated noise spectra.

1. INTRODUCTION
In any practical speech enhancement problem there are two major components: (1) estimation of the noise power spectra and (2) estimation of the a priori SNR. The estimation of the noise power spectra plays a key role in an effective speech enhancement algorithm. For single-channel speech enhancement systems, the noise spectra are usually estimated from speech pauses, since no secondary channel is available. The accuracy of the noise spectra estimate therefore depends strongly on accurate identification of speech pauses given only the noisy speech. If the pause detection is not accurate enough, speech echoes and residual noise may be present in the enhanced speech [1]. Moreover, since most realistic noisy environments are nonstationary, the noise estimate must be updated as often as possible to attain optimum noise reduction. Many reported results on speech enhancement have avoided the difficulty of pause detection for noise estimation by using an ideal pause detection algorithm that works only on the clean speech, by using short test signals with an initial noise-only period, or by estimating the noise manually [2].

Various techniques for pause detection have been reported in the literature. Most of them, however, can detect pauses accurately only at high SNRs [3]-[6]. Pause detection at low SNRs is still a challenging problem. The methods described in [7]-[9], though they work at low SNRs, are either suitable only for stationary white noise or computationally expensive. In many algorithms [10]-[11] the noise estimate is continuously updated to cope better with nonstationary noise. The continuous update of the noise estimate (independently in the sub-bands), however, is susceptible to erroneous capture of speech energy. The work of Fischer and Stahl [12] has shown that the corruption of the noise estimate by speech is too large in the case of an adaptive scheme. Enhancement algorithms based on noise estimation from pause regions therefore still play a very important role.

The work in this paper is directed towards finding a computationally efficient algorithm for pause detection at low SNRs. The pause regions are identified using a modified frame variance method. Noise estimation over all the detected noise-only frames is then carried out using three different approaches. Finally, the effectiveness of the different noise estimation strategies is studied by using their estimates in speech enhancement algorithms.

2. PROPOSED ALGORITHM FOR PAUSE DETECTION


Let the clean speech, noise, and noisy signals in the time domain be denoted by $x(t)$, $d(t)$, and $y(t)$, respectively. If the noise is assumed to be additive, $y(t)$ can be expressed as

$$y(t) = x(t) + d(t) \qquad (1)$$

The FFT-domain representation of Eq. (1) is

$$Y_{n,k} = X_{n,k} + D_{n,k} \qquad (2)$$

ISBN 984-32-1804-4


where $X_{n,k}$, $D_{n,k}$, and $Y_{n,k}$ are the clean speech, noise, and noisy signal FFT spectral components, respectively, at the $n$-th frame and $k$-th frequency bin. The corrupted speech signal $y(t)$ is segmented into frames of length 32 ms (i.e., 256 samples at 8 kHz) with an overlap of 50%. Then an $N = 512$ point FFT of each frame is taken. The variance of the $n$-th frame of a noisy speech signal is defined as

$$\sigma_n^2 = \frac{1}{N} \sum_{k=1}^{N} \left( |Y_{n,k}| - \bar{Y}_n \right)^2 \qquad (3)$$

where $N$ is the FFT length and $\bar{Y}_n$ is the mean of the spectral magnitudes of the $n$-th frame. If we track the variances of the FFT spectrum across frames, then whenever the target speech is absent, i.e., the input signal consists of noise only, the variance is smaller than in frames where speech is present. This difference in variance is what allows the speech pause (noise-only) frames to be identified. The practical problem with this method is that at low SNRs the difference between the variances of speech and non-speech frames may not be large enough to distinguish them. To overcome this problem, we use a prefiltering approach that makes the identification of noise-only frames at low SNRs less error-prone. We evaluate the variance of only that part of the frequency spectrum that falls within the range 235 to 2350 Hz. The speech spectrum outside this frequency range may be considered insignificant compared to the noise spectrum in that range. Such frequency filtering therefore has no deleterious effect on the separation of speech and noise-only frames. We define the modified variance as

$$\hat{\sigma}_n^2 = \frac{1}{N_f} \sum_{k=15}^{150} \left( |Y_{n,k}| - \bar{Y}_n^f \right)^2 \qquad (4)$$

where $N_f$ is the number of spectral lines in the frequency range 235 to 2350 Hz and $\bar{Y}_n^f$ is the mean of the filtered spectral magnitudes of the $n$-th frame.

The next step in the proposed pause detection algorithm is tracking the second-order minima of the modified variances ($\hat{\sigma}_n^2$). If $\hat{\sigma}_n^2 < \hat{\sigma}_{n-1}^2$ and $\hat{\sigma}_n^2 < \hat{\sigma}_{n+1}^2$, then the first-order minimum is $\hat{\sigma}_{n,\min 1}^2 = \hat{\sigma}_n^2$. The second-order minima are then tracked from the sequence $\hat{\sigma}_{n,\min 1}^2$ in the same way. The reason for this two-step minima tracking is that at low SNRs not all the minima represent pause regions. To increase the probability that the minima are due to speech pauses, we track the second-order minima. The final step is to determine a variance threshold. The pause frames are identified as those whose variance falls below this threshold. We choose the threshold as the median of the second-order minima of the modified variances. The first frame is assumed to be a pause frame if the algorithm fails to detect it as one.

Fig. 1. Characteristic of modified frame variance.

Fig. 2. Threshold selection for pause detection.
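The band-limited variance of Eq. (4) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. It assumes that each 256-sample frame is zero-padded to the 512-point transform and that no analysis window is applied (the paper does not state either), and the function names `frame_signal`, `band_magnitudes`, and `modified_variance` are hypothetical. Only bins 15 to 150 are evaluated, since Eq. (4) needs no others.

```python
import cmath
import math


def frame_signal(y, frame_len=256, hop=128):
    """Split y into frames of 256 samples (32 ms at 8 kHz) with 50% overlap."""
    return [y[s:s + frame_len] for s in range(0, len(y) - frame_len + 1, hop)]


def band_magnitudes(frame, n_fft=512, k_lo=15, k_hi=150):
    """Return |Y_{n,k}| for k_lo <= k <= k_hi of an n_fft-point DFT.

    Assumption: summing over the frame samples only is equivalent to
    zero-padding the frame to n_fft points (rectangular window).
    """
    mags = []
    for k in range(k_lo, k_hi + 1):
        w = -2j * math.pi * k / n_fft
        acc = sum(x * cmath.exp(w * t) for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def modified_variance(frame, n_fft=512, k_lo=15, k_hi=150):
    """Eq. (4): variance of the spectral magnitudes inside the band."""
    mags = band_magnitudes(frame, n_fft, k_lo, k_hi)
    mean = sum(mags) / len(mags)          # band-limited mean magnitude
    return sum((m - mean) ** 2 for m in mags) / len(mags)
```

With a 512-point transform at 8 kHz the bin spacing is 15.625 Hz, so bins 15 and 150 correspond to about 234 Hz and 2344 Hz, which matches the 235 to 2350 Hz range quoted in the text. A direct DFT is used here only to keep the sketch dependency-free; an FFT routine would normally be preferred.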

The pause detection algorithm is now illustrated using a target sentence (collected from the TIMIT database) of approximately 4 s length mixed with air-cockpit noise (digitally added from a standard noise file) at an SNR of -10 dB. The sentence used is "Pretty soon a woman came along carrying a folded umbrella as a walking stick." Fig. 1 shows the variances of frames of the original speech calculated according to Eq. (3), and of the noise-corrupted speech calculated according to Eqs. (3) and (4). For the original speech, a variance value close to zero indicates a pause frame. Note that for noisy speech no variance value will be zero; rather, the minimum value will be close to the average noise power. Although at high SNRs the variance calculated using Eq. (3) may be used to isolate pause regions, the variance calculated according to Eq. (4) shows superior performance even at a very low SNR. Fig. 2 shows the first- and second-order minima of the modified variance (Eq. (4)). The threshold is set as the median of the second-order minima of the modified variances. The frames with variances below this threshold are taken as noise-only frames. The pause regions selected by the proposed algorithm are shown in Fig. 3. Of the 10 clearly visible pause regions in the given speech, 7 are correctly detected by our algorithm at SNR = -10 dB.

Fig. 3. Identified pause regions by the proposed method at SNR = -10 dB.
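The two-step minima tracking and the median threshold described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function names are hypothetical, a strict comparison is assumed for "below the threshold", and the behaviour when no second-order minimum exists is not specified in the paper, so the sketch simply falls back to the first frame.

```python
from statistics import median


def local_minima(values):
    """Indices i whose value is strictly below both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


def detect_pause_frames(variances):
    """Return (set of pause frame indices, threshold).

    variances: modified frame variances, one value per frame (Eq. (4)).
    """
    # First-order minima of the variance track.
    first_vals = [variances[i] for i in local_minima(variances)]
    # Second-order minima: minima of the sequence of first-order minima.
    second_vals = [first_vals[p] for p in local_minima(first_vals)]
    if not second_vals:
        # Assumption: with no second-order minimum, keep only the first frame.
        return {0}, None
    threshold = median(second_vals)
    pauses = {n for n, v in enumerate(variances) if v < threshold}
    # Per the text, the first frame is taken as a pause frame even if undetected.
    pauses.add(0)
    return pauses, threshold
```

For example, a variance track whose first-order minima are 10, 30, 5, 40, 20, 2, 60 has second-order minima 5 and 2, so the threshold is 3.5 and only frames with variance below 3.5 (plus the first frame) are marked as pauses.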

3. NOISE ESTIMATION
The noise spectra ($\hat{\sigma}_d(n,k)$) can be estimated from the pause regions as [14]

$$[\hat{\sigma}_d(n,k)]^{\gamma} = \lambda_D \, [\hat{\sigma}_d(n-1,k)]^{\gamma} + (1 - \lambda_D) \, |Y_{n,k}|^{\gamma} \qquad (5)$$

where $0.5 \le \lambda_D \le 0.9$. As in [13], we have used $\lambda_D = 0.9$ and $\gamma = 2$ for all cases in this work. Eq. (5) was used to obtain the noise spectra from the actual noise. For noise estimation from the identified pause regions of the given noisy speech, we have used the same Eq. (5) for pause regions with more than one frame. For a pause region containing only one frame, the noise spectrum is obtained as the squared value of the Fourier magnitude spectrum. After estimating the noise spectrum of each pause region, we tried three approaches for the final estimate: (1) estimate the noise from the first pause only and use it for the whole speech, (2) estimate the noise from every pause and use it for enhancement until the next pause is identified, and (3) estimate the noise as an arithmetic frame average of the estimated noise spectra of all the selected pause frames and use it for the whole speech. Fig. 4 shows the actual noise spectra and the estimated noise spectra obtained using the third approach. It can be observed that the average value of the noise spectra estimated using Eq. (5) on the actual noise is approximately twice that obtained from the identified pause regions using our approach. The noise spectra obtained from the pause regions are thus an underestimate of the desired spectra that result from Eq. (5) with the actual noise. However, if a multiplication factor equal to the ratio of the two averages (which is 2 in this paper) is applied, a noise estimate that closely resembles the estimate from the actual noise may be obtained.

Fig. 4. Comparison of results of noise estimation.

4. RESULTS
Simulation results using the subtraction method proposed in [13] are shown for the female utterance "Pretty soon a woman came along carrying a folded umbrella as a walking stick" taken from the TIMIT database. Different standard noises, e.g., white, air-cockpit, babble, and highway, were added to the given speech at various SNRs (-10 to 20 dB). The signal was reconstructed using the standard overlap-add method with a 50% overlap. Noise was estimated as described in Section 3. Several speech enhancement results are presented in Figs. 5-6 to demonstrate the performance of the proposed scheme. To compare the effectiveness in speech enhancement of the different noise estimation schemes described in Section 3, we have used two measures, namely the IS measure and the overall output SNR. The results in Figs. 5-6 show that the average noise estimation scheme closely matches the results obtained with the actual noise for all noise types and both quality measures.
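As a rough illustration of the noise estimation step of Section 3, the sketch below applies the recursion of Eq. (5) with $\gamma = 2$ inside each pause region and then forms the averaged estimate of the third approach. Several details are assumptions, since the paper does not specify them: the recursion is initialised with the squared magnitude spectrum of the first frame of each region, the "arithmetic frame average" is read as the mean of the per-region estimates, and all function and parameter names (`group_regions`, `recursive_noise_estimate`, `average_noise_estimate`, `scale`) are hypothetical.

```python
def group_regions(pause_idx):
    """Group pause frame indices into runs of consecutive frames."""
    regions = []
    for n in sorted(pause_idx):
        if regions and n == regions[-1][-1] + 1:
            regions[-1].append(n)
        else:
            regions.append([n])
    return regions


def recursive_noise_estimate(region_power, lam=0.9):
    """Eq. (5) with gamma = 2 over the frames of one pause region.

    region_power: list of frames, each a list of |Y_{n,k}|^2 per bin.
    A single-frame region returns its squared magnitude spectrum,
    as stated in the text. Assumption: the first frame initialises
    the recursion for longer regions.
    """
    est = list(region_power[0])
    for frame in region_power[1:]:
        est = [lam * e + (1.0 - lam) * p for e, p in zip(est, frame)]
    return est


def average_noise_estimate(power, pause_idx, lam=0.9, scale=1.0):
    """Approach (3): average of the region estimates, optionally scaled.

    power: |Y_{n,k}|^2 for every frame; scale=2.0 mimics the
    multiplication factor reported in the paper.
    """
    ests = [recursive_noise_estimate([power[n] for n in region], lam)
            for region in group_regions(pause_idx)]
    n_bins = len(ests[0])
    return [scale * sum(e[k] for e in ests) / len(ests) for k in range(n_bins)]
```

Approach (1) corresponds to using only the estimate of the first region, and approach (2) to switching to each new region estimate as soon as that pause has been identified; both can be built from `group_regions` and `recursive_noise_estimate` in the same way.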


5. CONCLUSION
This paper has presented a speech pause detection algorithm that is effective at low SNRs. The effect on speech enhancement of the noise spectra estimated from the identified pause regions using different approaches has been studied. It has been observed that the average noise spectrum of all the identified pause frames performs almost as well as the true noise spectrum. This suggests that the proposed method can be used for the accurate noise estimation required by sophisticated speech enhancement algorithms.

Fig. 5. Improvement of Output SNR in comparison with Input SNR for (a) White, (b) Aircockpit, (c) Babble, (d) Highway noise. (Axes: Output SNR versus Input SNR. Curves: Actual, Average, First pause, Adaptive.)

Fig. 6. Variation of IS with Input SNR for (a) White, (b) Aircockpit, (c) Babble, (d) Highway noise. (Axes: IS versus Input SNR. Curves: Degraded, Actual, Average, First pause, Adaptive.)

6. REFERENCES
[1] Mukul Bhatnagar, "A Modified Spectral Subtraction Method Combined with Perceptual Weighting for Speech Enhancement," Thesis, presented to the Faculty of The University of Texas at Dallas, August 2002.
[2] Mark Marzinzik and Birger Kollmeier, "Speech pause detection for noise spectrum estimation by tracking power envelope dynamics," IEEE Trans. Speech Audio Processing, vol. 10, no. 2, February 2002.
[3] G. S. Kang and L. J. Fransen, "Quality improvement of LPC-processed noisy speech by using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 930-942, 1989.
[4] C. Elberling, C. Ludvigsen, and G. Keidser, "The design and testing of a noise reduction algorithm based on spectral subtraction," Scand. Audio., vol. Suppl. 38, pp. 39-49, 1993.
[5] H. Sheikhzadeh, R. L. Brennan, and H. Sameti, "Real-time implementation of HMM-based MMSE algorithm for speech enhancement in hearing aid applications," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1995, vol. 1, 1995, pp. 808-811.
[6] K. Itoh and M. Mizushima, "Environmental noise reduction based on speech/nonspeech identification for hearing aids," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Conference Proceedings. Los Alamitos, CA: IEEE Comput. Soc. Press, 1997, pp. 419-422.
[7] I. Abdallah, S. Montrésor, and M. Baudry, "Speech signal detection in noisy environment using a local entropic criterion," Proc. 5th Eur. Conf. Speech Communication Technology, EUROSPEECH'97, Rhodes, Greece, 1997.
[8] B. L. McKinley and G. H. Whipple, "Model based speech pause detection," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1997, Los Alamitos, CA, 1997, pp. 1179-1182.
[9] M. Dendrinos and S. Bakamidis, "Voice activity detection in colored noise environment through singular value decomposition," Proc. 5th Int. Conf. Signal Processing Applications and Technology, Waltham, MA: DSP Associates, vol. 1, pp. 137-141, 1994.
[10] H. G. Hirsch, "Estimation of Noise Spectrum and its Application to SNR Estimation and Speech Enhancement," Int. Comput. Sci. Inst., Berkeley, CA, Tech. Rep. TR-93-012, 1993.
[11] H. G. Hirsch and C. Ehrlicher, "Noise estimation techniques for robust speech recognition," Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing 1995, vol. 1, 1995, pp. 153-156.
[12] A. Fischer and V. Stahl, "On improvement measures for spectral subtraction applied to robust automatic speech recognition in car environments," Proc. Workshop Robust Methods Speech Recognition Adverse Conditions, Tampere, Finland, pp. 75-78, 1999.
[13] Md. Kamrul Hasan and Lutfa Akter, "Quality improvement of enhanced speech in DCT domain using modified a priori SNR," IEEE Signal Processing Letters, to be published.
[14] N. Virag, "Single channel speech enhancement system based on masking properties of the human auditory system," IEEE Trans. Speech Audio Processing, vol. 7, pp. 126-137, 1999.
