Académique Documents
Professionnel Documents
Culture Documents
ABSTRACT explicit knowledge of the original impulse responses, making its use
impractical in real-world reverberant settings, where information re-
In this work, we develop a two-stage blind dereverberation scheme garding the original RIRs may not be readily available. In addition,
for reverberant speech enhancement. The proposed algorithm oper- the MINT is only useful when background noise is weak or well-
ates by first splitting the reverberant inputs into different subbands. controlled because it is very sensitive to even small errors in the
In the first stage, the inverse filters are estimated using the blind mul- estimated channel impulse. To address these concerns, Furuya and
tiple input-output inverse-filtering theorem (MINT), while in the sec- Kataoka [4] have recently proposed a more robust speech derever-
ond stage our approach suppresses late reverberant energy by sub- beration method using a blind deconvolution MINT-based inverse-
tracting the power spectrum of the late impulse components from filtering approach for early reverberation cancellation and spectral
the power spectrum of the inverse filtered speech signal. Experi- subtraction for late reverberation compensation. To overcome inac-
mental results with reverberant speech indicate that the performance curacies arising from the inverse filtering-based blind deconvolution
of the proposed algorithm is substantially better when compared to stage, as well as tail fluctuations of real-world impulse responses, the
existing dereverberation algorithms. proposed speech dereverberation technique utilizes a second spectral
subtraction stage to minimize the influence of long-term reverbera-
Index Terms— Blind dereverberation, psychoacoustic model, tion [4]. Spectral subtraction methods can suppress late reverberant
spectral subtraction, perceptual evaluation of speech quality (PESQ). energy by estimating the power spectrum of clean speech after sub-
tracting the power spectrum of reverberation from that of the rever-
1. INTRODUCTION berant speech (e.g., see [7, 8]).
In this paper, we propose a subband MINT-based blind deconvo-
Reverberation or the collection of reflected sounds from surfaces lution method for speech dereverberation. Motivated by a perceptual
in an acoustic enclosure leads to temporal and spectral smearing, model that closely approximates auditory frequency selectivity, the
which distorts both the envelope and fine structure of speech [1]. reverberant speech is first passed through a gammatone filterbank
Tackling speech degradation due to reverberation has recently be- [9]. The MINT-based blind deconvolution method proposed in [4]
come an area of intense research activity and has given rise to sev- is then applied separately to each subband. The proposed subband
eral speech dereverberation algorithms (e.g., see [2]–[5]). The most inverse-filtering technique is based on the correlation of the input
challenging issue when dealing with speech dereverberation is es- (reverberant) signals in different bands.This leads to a considerably
timating the unknown room impulse responses (RIRs) between the more accurate inverse filter estimation, when compared against the
speaker and the microphone(s). This essentially makes dereverber- original MINT-based blind deconvolution method, which uses the
ation a “blind” problem. The goal of blind dereverberation is to whole spectrum for correlation estimation. We therefore show that
retrieve the (unknown) input speech signal and ideally the relevant a subband-based speech dereverberation approach can compensate
features of the unknown RIRs, while only assuming knowledge of for the inverse filtering errors arising from traditional speech dere-
the microphone outputs, as well as some relevant information on the verberation algorithms. Furthermore, a spectral subtraction method
statistical properties of the original source [6]. relying on an asymmetric smoothing function is used for late re-
The MINT method which is based on the multiple input-output verberation reduction [8]. The performance of the proposed blind
inverse-filtering theorem (e.g., see [3]), remains one of the most effi- single-input two-output algorithm for reverberant speech enhance-
cient dereverberation algorithms to-date. The MINT has been found ment is assessed through experiments in a two-microphone configu-
to perform near perfect towards finding accurate estimates of inverse ration using the perceptual evaluation of speech quality (PESQ) met-
impulse responses even in non-minimum phase reverberant environ- ric.
ments1 . However, the main drawback of the MINT is that it requires
2. ALGORITHM FORMULATION
This work was supported by Grants R03 DC 008882 (K. Kokkinakis) and
R01 DC 007527 (P. C. Loizou) awarded from the National Institute on Deaf- In this section, we analyze the blind deconvolution method based
ness and Other Communication Disorders (NIDCD) of the National Institutes
on the MINT inverse-filtering theorem for a single-input and two-
of Health (NIH).
1 Due to physical considerations, room impulse responses exhibit non- microphone system configuration. Next, the proposed perceptually-
minimum phase characteristics. This is due to the fact that the reverberant motivated subband blind dereverberation method is presented. This
components of an acoustic transfer function contain more energy than the is followed by a brief description of the spectral subtraction method
direct sound itself. used for late reverberation suppression.
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 17,2010 at 15:58:40 UTC from IEEE Xplore. Restrictions apply.
2.1. Blind deconvolution based on MINT 0
In the conventional MINT algorithm [3], the original RIRs are −10
used to compute the inverse filters. Since the RIRs are unknown, the
4715
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 17,2010 at 15:58:40 UTC from IEEE Xplore. Restrictions apply.
where Rm indicates the correlation matrix in the mth subband and convolving the clean speech signal with synthetically generated RIRs
m = 0, 1, . . . , M − 1 denotes the subband index. After solving (10) using the image method for a 6.60 × 4.60 × 3.10 m room (e.g., see
for each subband, the signals in each subband are inverse filtered and [12]). The distance between the single source signal and the two mi-
added together, yielding the clean signal crophones is 4.80 m and the two microphones are positioned around
10 cm apart.
2
m =
x hm,j ∗ xm,j (11)
j=1
3.2. Algorithm parameters
where ∗ defines convolution in the time-domain. In a similar manner In the whitening stage, the inverse of the autoregressive vocal
to [10], the resulting output of each subband is time-reversed, passed tract filter is estimated by linear prediction coding (LPC) analysis of
through the gammatone filter, time-reversed again, and then summed the reverberant speech signal. Prior studies have suggested that such
across all subbands to obtain the final (enhanced) speech signal. a whitening strategy can lead to a better speech correlation cancel-
lation when compared to just computing the whitening filter based
2.3. Spectral subtraction on the inverse of the long-term average of the speech spectrum (e.g.,
see [13]). The whitening is performed on non-overlapping frames
In order to compensate for the inaccuracies arising from the blind of 25 ms duration corresponding to 200 samples at 8 kHz and the
deconvolution stage and to further eliminate the tail fluctuations of whitening filter is obtained using 10th-order LPC analysis.
the impulse responses, a spectral subtraction method is used in the In addition, a gammatone filterbank of 128 4th-order filters is
third stage of our dereverberation approach. Here, to suppress the used to divide the frequency range of 50–4000 Hz to subbands sim-
late reverberant energy, the power spectrum of the late impulse com- ilar to the human auditory system, which uses filters with quasi-
ponents is subtracted from the power spectrum of the inverse filtered logarithmically spaced center frequencies. The correlation matrix is
speech signal. The signal power spectrum of the clean speech signal estimated for approximately 16,000 sample blocks and is averaged
can be estimated as in [8] with a moving average factor of 0.9 for each set. The reverbera-
tion time of the room is T60 = 500 ms in the 20–4000 Hz frequency
| Su (t, ) |2 = |Sx (t, )|2 · band, which is a typical value for a severely reverberant environ-
ment. The length of the RIRs and the estimated inverse filters are
| Sx (t, ) |2 − γ w(t − ρ) ∗ | Sx (t, ) |2 approximately 1,5002 and 700 taps, respectively. The parameters a,
max , ε (12) γ, ρ and ε used in the spectral subtraction stage are set to 5, 0.3, 7
| Sx (t, ) |2 and 0.03, respectively.
where t and are the frame and frequency bin indices, respectively, 3.3. Performance evaluation
and | Sx (t, ) |2 and | Su (t, ) |2 are the power spectra of the subband
inverse filtered signal and the estimated clean signal after the spectral
subtraction stage. In order to approximate the coefficients of the late The overall quality of the enhanced signals is assessed with the
impulse response components, we apply an assymetrical smoothing perceptual evaluation of speech quality (PESQ) score [14]. The
function resembling the Rayleigh distribution. This smoothing func- PESQ employs a sensory model to compare the original (unpro-
tion has the form cessed) with the enhanced (processed) signal, which is the output
of the dereverberation algorithm, by relying on a perceptual model
⎧ of the human auditory system. The PESQ score has been shown
⎪
⎪ t+a −(t + a)2 to exhibit a high correlation coefficient (Pearson’s correlation) of
⎨ 2 exp , if t > −a
w(t) = a 2a2 (13) r = 0.91 with subjective listening quality tests [15]. The PESQ mea-
⎪
⎪ sures the subjective assessment quality of the dereverberated speech
⎩
0, otherwise rated as a value between 1 and 5 according to the five grade mean
opinion score (MOS) scale. Here, we use a modified PESQ measure
referred to as mPESQ, with parameters optimized towards assess-
where γ is the scaling factor, ρ is the relative delay of the late im-
ing speech signal distortion, calculated as a linear combination of
pulse response components, a denotes the control parameter and ε
the average disturbance value Dind and the average asymmetrical
is the maximum attenuation floor.
disturbance values Aind [14, 15]
4716
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 17,2010 at 15:58:40 UTC from IEEE Xplore. Restrictions apply.
TEST SET INPUT BD SBD SBD+SS MINT
A 2.56 2.96 3.18 3.30 4.95
B 2.55 2.91 3.20 3.33 4.95
C 2.58 2.99 3.18 3.31 4.92
D 2.53 2.78 2.99 3.21 4.89
E 2.49 2.79 2.95 3.14 4.94
F 2.59 2.83 2.97 3.23 4.90
G 2.61 2.87 3.09 3.29 4.95
H 2.58 2.88 3.07 3.38 4.88
I 2.54 2.90 2.99 3.20 4.91
J 2.63 2.80 3.09 3.31 4.94
MEAN + STD 2.57± 0.04 2.87 ± 0.07 3.07 ± 0.09 3.27 ± 0.07 4.92 ± 0.03
Table 1. Speech dereverberation performance of the SBD+SS method in terms of mPESQ values averaged across both microphones.
Table 1 shows the performance of the proposed method in terms [1] P. Assmann and A. Summerfield, “The perception of speech un-
of the mPESQ scores calculated from the test set created from the der adverse conditions,” Speech Processing in the Auditory System,
IEEE database. Results for the proposed subband-based blind dere- S. Greenberg, W. Ainsworth, A. Popper and R. Fay (Eds.), New York:
Springer-Verlag, 231–308, 2006.
verberation inverse-filtering technique without spectral subtraction
[2] J. Benesty, M. Sondhi and Y. Huang, Eds., Springer Handbook of
(SBD) and with the spectral subtraction stage (SBD+SS) are pre-
Speech Processing. Germany: Springer, 2007.
sented separately to better demonstrate the impact of each stage in
[3] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE
the entire dereverberation process. The mPESQ scores obtained from Trans. Speech Audio Process., vol. 36, pp. 145–152, 1988.
the blind deconvolution-based MINT inverse-filtering approach (BD) [4] K. Furuya and A. Kataoka, “Robust speech dereverberation using
[4] and the conventional (non-blind) MINT method [3] are also re- multi-channel blind deconvolution with spectral subtraction,” IEEE
ported to provide a benchmark for assessing the performance of the Trans. Audio, Speech, Lang. Process., vol. 15, pp. 1579–1591, 2007.
proposed algorithm. [5] K. Kokkinakis and P. C. Loizou, “Selective-tap blind dereverberation
The average improvement in mPESQ is approximately 0.7 when for two-microphone enhancement of reverberant speech,” IEEE Signal
comparing SBD+SS to the baseline input mPESQ, which indicates a Process. Lett., vol. 16, pp. 961–964, 2009.
relatively high improvement in speech quality. Although the conven- [6] S. Haykin, Ed. Unsupervised Adaptive Filtering (Volume II: Blind De-
tional MINT performs substantially better, the proposed method still convolution). New York: Wiley, 2000.
provides an adequate improvement by blindly estimating the RIRs [7] K. Lebart, J. M. Boucher and P. N. Denbigh, “A new method based on
spectral subtraction for speech dereverberation,” Acta Acoust., vol. 83,
with only about half of the length of the original RIRs. By com- 359–366, 1997.
paring SBD to BD, it is evident that the proposed method improves [8] M. Wu and D. L. Wang, “A two-stage algorithm for one-microphone
the BD approach by making use of different subbands in a manner reverberant speech enhancement,” IEEE Trans. Audio, Speech, Lang.
similar to human auditory frequency selectivity. This leads to a bet- Process., vol. 14, 774–784, 2006.
ter channel correlation estimation and consequently more accurate [9] R. D. Patterson and B. C. J. Moore, “Auditory filters and excitation
inverse filters. In addition, a more precise estimation of the inverse patterns as representations of frequency resolution,” Frequency Selec-
filter coefficients yields fewer errors, thus enabling the spectral sub- tivity in Hearing, B. C. J. Moore (Ed.), London: Academic Press,
traction to suppress the late reverberation more effectively. The im- 123–177, 1986.
provement obtained solely due to the spectral subtraction stage is on [10] D. L. Wang and G. J. Brown, “Separation of speech from interfer-
average 0.2 (SBD to SBD+SS). ing sounds based on oscillatory correlation,” IEEE Trans. Neural Net-
works, vol. 10, pp. 684–697, 1999.
[11] IEEE Subcommittee, “IEEE recommended practice speech quality
4. CONCLUSIONS measurements,” IEEE Trans. Audio Electroacoust., vol. 17, pp. 225–
246, 1969.
We have developed and tested a two-stage single-input and two- [12] J. B. Allen and D. A. Berkley, “Image method for efficiently simulat-
ing small-room acoustics,” J. Acoust. Soc. Am., vol. 65, pp. 943–950,
output subband-based blind dereverberation algorithm (SBD+SS)
1979.
suitable for reverberant speech enhancement. The proposed algo-
[13] K. Kokkinakis and A. K. Nandi, “Multichannel blind deconvolution
rithm operates by first splitting the reverberant inputs into different for source separation in convolutive mixtures of speech,” IEEE Trans.
subbands. In the second stage, the inverse filters are estimated us- Audio, Speech, Lang. Process., vol. 14, pp. 200–212, 2006.
ing the blind deconvolution MINT-based inverse filtering approach, [14] ITU-T Recommendation, “Perceptual evaluation of speech quality
while the third-stage suppresses late reverberant energy by subtract- (PESQ), an objective method for end-to-end speech quality assessment
ing the power spectrum of the late impulse components from the of narrow-band telephone networks and speech coders,” ITU-T Rec-
power spectrum of the inverse filtered speech signal. Experiments ommendation P.862, 2001.
carried out with speech signals in a reverberant setup, indicate that [15] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures
the proposed blind speech dereverberation technique is capable of for speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process.,
equalizing long acoustic echo paths with nearly no degradation. vol. 16, pp. 229–238, 2008.
4717
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 17,2010 at 15:58:40 UTC from IEEE Xplore. Restrictions apply.