IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 1, JANUARY 1997
Abstract—Several noise compensation schemes for speech recognition in impulsive and nonimpulsive noise are considered. The noise compensation schemes are spectral subtraction, HMM-based Wiener filters, noise-adaptive HMM's, and a front-end impulsive noise removal. The use of the cepstral-time matrix as an improved speech feature set is explored, and the noise compensation methods are extended for use with cepstral-time features. Experimental evaluations, on a spoken digit database, in the presence of car noise, helicopter noise,

...subjected to noise. In noise, people speak louder, and there are increases in duration, pitch, and higher frequency energy contents of speech [21]. The noise-induced stress can be as harmful to recognition as the noise itself. However, in this paper, the focus is on the effects of the additive noise on speech recognition. The noisy speech signal is modeled as the sum of the clean speech signal and the noise.
The main parameters of an HMM are the state transition probabilities and the state observation pdf's. The state transition probabilities model the variations in durations and articulation rates of speech. The state observation pdf's model the variations in the spectral content of the speech segments associated with each state. A variant of the HMM's is the left–right model, so called because state transitions can only be made from a left state to a right state. The left–right HMM is useful for modeling random functions of time using the left to right progression through the model. In HMM theory, the likelihood that an unknown observation vector sequence X is an acoustic realization of the word model M is obtained by summing the probabilities of the observation over all state sequences s as

    f(X | M) = \sum_{s} f(X, s | M)    (3)

The probability that an observation vector x belongs to state i is modeled by a mixture of multivariate Gaussian pdf's as

    f(x | i) = \sum_{k=1}^{K} P_{ik} \mathcal{N}(x; \mu_{ik}, \Sigma_{ik})    (4)

where P_{ik} is the prior probability of the kth component of the Gaussian mixture pdf for state i, and \mathcal{N}(x; \mu_{ik}, \Sigma_{ik}) is a Gaussian pdf with a mean vector \mu_{ik} and a covariance matrix \Sigma_{ik}. The state Gaussian pdf is the main function through which the signal and the noise influence the likelihood score for a model.

III. SPECTRAL SUBTRACTION

This section considers spectral subtraction for speech recognition in noisy environments. In spectral subtraction, an estimate of the speech spectra is obtained by subtracting an estimate of the average noise spectra from the noisy speech [2], [4]. The equation describing spectral subtraction may be expressed as

    |\hat{X}(f)|^b = |Y(f)|^b - \alpha(SNR) |\bar{N}(f)|^b    (5)

where f is the frequency variable, |Y(f)|^b is the spectrum of the noisy speech, |\hat{X}(f)|^b is the estimate of the clean speech spectrum, and |\bar{N}(f)|^b is a time-averaged estimate of the noise spectra. For magnitude spectral subtraction, the exponent b = 1, and for power spectral subtraction, b = 2. The parameter \alpha(SNR) controls the amount of noise subtracted from the noisy signal and may be dependent on the SNR. The time-averaged noise spectrum is obtained from the periods when the signal is absent and only the noise is present as

    |\bar{N}(f)|^b = \frac{1}{K} \sum_{i=1}^{K} |N_i(f)|^b    (6)

It is assumed that the noise remains stationary in between update periods. In (6), |N_i(f)|^b is the spectrum of the ith noise block, and the integer K is the number of blocks in a noise-only period. Alternatively, the time-averaged noise spectrum can be obtained using a first-order digital lowpass filter as

    |\bar{N}_i(f)|^b = \rho |\bar{N}_{i-1}(f)|^b + (1 - \rho) |N_i(f)|^b    (7)

where, typically, the smoothing constant \rho is close to unity. The spectral subtraction equation (5) can be expressed in terms of a filter with frequency response

    H(f) = 1 - \frac{\alpha(SNR)}{SNR(f)}, \qquad SNR(f) = \frac{|Y(f)|^b}{|\bar{N}(f)|^b}    (8)

Due to the variations of the noise spectrum, spectral subtraction may produce negative estimates of the power, or the magnitude, spectrum. This outcome is more probable at frequencies with a low SNR. To avoid negative magnitude estimates, the spectral subtraction output is processed using a mapping of the form

    T[H(f)] = H(f) if H(f) \geq \beta, and T[H(f)] = \beta otherwise    (9)

Note that the mapping in (9) is equivalent to restricting the magnitude frequency response of the spectral subtraction filter to the range [\beta, 1]. The main problem in spectral subtraction is due to the processing distortions resulting from the variations of the noise spectrum. There are many variants of spectral subtraction that basically differ in the methods used for the estimation of the noise spectrum and the different degrees of time averaging that they impose on the estimated signals. A commonly used method for reducing the noise variance is to replace the spectral subtraction filter (8) with a smoothed version defined as

    \bar{H}(f) = 1 - \frac{\alpha(SNR)}{\overline{SNR}(f)}, \qquad \overline{SNR}(f) = \frac{|\bar{Y}(f)|^b}{|\bar{N}(f)|^b}    (10)

where |\bar{Y}(f)|^b is the time-averaged estimate of the noisy signal spectrum. Further smoothing can be achieved by lowpass filtering the variations of H(f) across successive speech blocks.

The so-called nonlinear spectral subtraction methods make use of the information on the local SNR and utilize the observation that at low SNR, oversubtraction can lead to improved performance. Lockwood and Boudy [9] suggested the following function as a nonlinear estimator of the noise spectrum:

    |\bar{N}(f)|_{NL} = NL(|\bar{N}(f)|, SNR(f))    (11)

where the subscript NL denotes a nonlinear estimate. The nonlinear noise estimate is a function of the maximum value of the noise spectrum over M frames, the SNR, and the linear
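As a concrete illustration, the basic power spectral subtraction of (5), the flooring of (9), and the recursive noise update of (7) can be sketched in numpy. This is a minimal sketch, not the paper's implementation: the function names and the default values of alpha, beta, and rho are illustrative assumptions, and the input frames are assumed to be already windowed.

```python
import numpy as np

def update_noise_estimate(prev_noise_psd, frame_psd, rho=0.9):
    """First-order recursive (lowpass) update of the noise power
    spectrum, as in (7): N_i = rho * N_{i-1} + (1 - rho) * frame."""
    return rho * prev_noise_psd + (1.0 - rho) * frame_psd

def spectral_subtraction(frames, noise_psd, alpha=1.0, beta=0.01):
    """Power spectral subtraction (b = 2) with flooring.

    frames:    (n_frames, frame_len) windowed time-domain frames
    noise_psd: (frame_len // 2 + 1,) estimated noise power spectrum
    alpha:     oversubtraction factor (may be made SNR-dependent)
    beta:      floor that keeps the estimate nonnegative, as in (9)
    """
    Y = np.abs(np.fft.rfft(frames, axis=-1)) ** 2   # noisy power spectra
    X = Y - alpha * noise_psd                       # subtraction, eq. (5)
    X = np.maximum(X, beta * Y)                     # avoid negative estimates
    return np.sqrt(X)                               # clean magnitude estimate
```

In a complete front end, the enhanced magnitudes would be recombined with the noisy phase for resynthesis, or passed directly to the filterbank stage of the recognizer.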
VASEGHI AND MILNER: NOISE COMPENSATION METHODS FOR HMM SPEECH RECOGNITION 13
noise estimate, and may be written as

    NL(\cdot) = \frac{\max_{M \text{ frames}} |N(f)|}{1 + \gamma \, SNR(f)}    (12)

where \gamma is a control parameter. In (12), as the SNR decreases, the output of the nonlinear estimator approaches the maximum of the noise spectrum over M frames, and as the SNR increases, it approaches zero. The noise estimate is, however, forced to be an overestimation by using a limiting function such as

    |\bar{N}(f)|_{NL} = \max(NL(\cdot), |\bar{N}(f)|)    (13)

Spectral subtraction utilizes only an estimate of the noise power spectrum. As a result, spectral subtraction performs poorly when compared with methods such as Wiener filters and model adaptation that utilize estimates of the power spectra of both the signal and the noise processes.

IV. HMM-BASED WIENER FILTERS

This section describes the use of Wiener filters with HMM's for improving speech recognition in noisy conditions. Two implementation methods are described, namely, state-dependent Wiener filters and state-integrated Wiener filters. For a stationary signal observed in additive noise, the Wiener filter equations in the time and the frequency domains are given as [2], [22]

    w = (R_{xx} + R_{nn})^{-1} r_{xx}    (14)

    W(f) = \frac{P_X(f)}{P_X(f) + P_N(f)}    (15)

where R_{xx}, r_{xx}, and R_{nn} denote the signal autocorrelation matrix, signal autocorrelation vector, and the noise autocorrelation matrix, respectively, P_X(f) = E[|X(f)|^2] is the signal power spectrum (the operator E denotes expectation), and P_N(f) is the noise power spectrum. From (15), for additive noise, the Wiener filter attenuates the noisy signal frequencies in proportion to the SNR. The application of a Wiener filter requires the signal and the noise power spectra. For quasistationary noise, the noise power spectra may be estimated and updated from speech inactive periods. The speech power spectra may be obtained from the HMM cepstral means using an inverse discrete cosine transform (IDCT) followed by an exponential operation to convert the cepstral variables to power spectral variables [6], [7], [15], [16].

A. State-Dependent Wiener Filter Sequence

A state-dependent Wiener filter sequence is derived from the cepstral vectors contained in the states of the HMM's together with an estimate of the noise power spectrum from speech inactive periods [24]. The implementation of state-dependent Wiener filters, which is illustrated in Fig. 1, involves the following steps:

Step 1: For the noisy speech and each HMM, obtain a maximum likelihood state sequence.
Step 2: From the state sequence, produce a series of state-dependent Wiener filters.
Step 3: For each model, use the Wiener filter sequence to filter the noisy speech.
Step 4: Obtain a probability score for the filtered speech and its respective model, and finally, select the model with the highest score.

Fig. 1. Illustration of HMM's with state-dependent Wiener filters for noisy speech recognition.

In state-dependent Wiener filtering, when the noisy speech is processed by the filter sequence derived from the correctly hypothesised HMM, the effect of noise is reduced. However, when the filter sequence is derived from an incorrectly hypothesised HMM, no significant noise reduction occurs [5], [24]. This process increases the likelihood of accurate speech recognition. A drawback of this method is that it relies on the accuracy of the state sequence from which the state-dependent Wiener filters are derived. The problem is that the accuracy of the state sequence estimators deteriorates rapidly with increasing noise.

B. HMM State-Integrated Wiener Filters

In the context of an HMM, Wiener filtering can be accomplished by adaptation to noise of the mean cepstral vectors in the states of the HMM [7]. Due to the logarithmic operation in forming cepstral features, filtering of noisy speech in the cepstral domain is an additive operation given by

    \hat{c}_x = c_y + c_w    (16)

where \hat{c}_x, c_y, and c_w denote the cepstra of the filtered speech, the noisy speech, and the Wiener filter, respectively. From (16), the cepstrum of a Wiener filter may be expressed as the transform of the logarithm of its frequency response.
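The cepstral-domain operations above can be sketched numerically. In this hedged sketch (the function names, the truncated DCT basis, and the use of a pseudo-inverse for the IDCT step are illustrative assumptions, and a multiplicative constant on the log spectrum is ignored), a state's cepstral mean is mapped to a smoothed power spectrum by an inverse DCT followed by exponentiation, a Wiener response W = S/(S+N) is formed as in (15), and the additivity of (16) is used to adjust the mean by the cepstrum of the filter:

```python
import numpy as np

def dct_basis(n_cep, n_chan):
    """Type-II DCT basis mapping log filterbank energies to cepstra."""
    k = np.arange(n_cep)[:, None]
    m = np.arange(n_chan)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_chan)

def adapt_state_mean(cep_mean, noise_psd, C, C_pinv):
    """Adjust an HMM state cepstral mean by the Wiener filter cepstrum.

    cep_mean:  (n_cep,) state mean cepstral vector
    noise_psd: (n_chan,) noise power spectrum in the filterbank domain
    C, C_pinv: DCT basis and its pseudo-inverse (IDCT approximation)
    """
    log_spec = C_pinv @ cep_mean        # cepstrum -> smoothed log power spectrum
    S = np.exp(log_spec)                # exponentiation to the power domain
    W = S / (S + noise_psd)             # Wiener filter response, eq. (15)
    return cep_mean + C @ np.log(W)     # additive cepstral filtering, eq. (16)
```

With zero noise the filter response is unity, its log cepstrum is zero, and the mean is unchanged, which is a quick sanity check on the additive formulation.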
Now, consider the filtered signal when applied to the Gaussian scoring function of the HMM state observation as

    (23)

Assuming that the amplitude of a noise pulse is a zero mean Gaussian process with variance \sigma^2, the pdf of impulsive noise can be defined as

    (37)
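The impulsive noise model above, in which pulse amplitudes are drawn from a zero-mean Gaussian, can be simulated with a simple sketch. This is an illustrative assumption, not the paper's simulator: pulses are taken to occur independently at each sample with a fixed probability, and the function name and parameters are hypothetical.

```python
import numpy as np

def impulsive_noise(n, pulse_prob=0.01, sigma=1.0, rng=None):
    """Simulate impulsive noise: a binary pulse-occurrence process
    whose pulse amplitudes are zero-mean Gaussian with std `sigma`."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(n) < pulse_prob          # pulse present / absent
    return mask * rng.normal(0.0, sigma, n)    # Gaussian amplitudes on pulses
```

Such a sequence can be added to clean digits to reproduce tests of the kind shown for random-duration impulsive noise.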
Fig. 9. One-dimensional cepstra, with delta and delta-delta features, including the zeroth coefficient; Lynx noise.
Fig. 10. One-dimensional cepstra, with delta and delta-delta features, including the zeroth coefficient; car noise.
Fig. 11. One-dimensional cepstra, with delta and delta-delta features, excluding the zeroth coefficient; Lynx noise.
Fig. 12. One-dimensional cepstra, with delta and delta-delta features, excluding the zeroth coefficient; car noise.
2) Wiener Filters for Spectral-Time Features: The Wiener filter for a noisy spectral-time matrix can be formed by mapping the time dimension to frequency and using a modified 2-D Wiener filter defined as

    W(f, t) = \frac{P_X(f, t)}{P_X(f, t) + P_N(f, t)}    (41)

where P_X(f, t) and P_N(f, t) are the power spectra of the signal and the noise, respectively. The cepstral-time matrix of the filtered spectral-time features is given by

    (42)

where the resulting matrix is the cepstral-time estimate of the clean speech. In a similar way to the 1-D Wiener filter (see Section IV-B), the 2-D Wiener filter can be integrated with the states of the HMM's.

3) Noise-Adaptive HMM's: The state observation statistics of an HMM, using cepstral-time speech features, consist of a mean cepstral-time matrix and a fourth-order covariance tensor. The covariance tensor can be simplified to a "diagonal" tensor if it is assumed that the elements are uncorrelated. The equations for the adaptation of cepstral-time-based HMM's to noise can be obtained using a similar procedure to that described in Section VI.

VIII. EXPERIMENTAL RESULTS

This section presents experimental evaluations of the noise compensation techniques described in the earlier sections using the NOISEX-92 spoken single digit speech database [30]. Of the various noises available on the NOISEX database, the Lynx helicopter noise and the Volvo car noise were selected as speech contaminants. In all experiments, the digits were modeled using an eight-state, single mode per mixture, continuous density HMM with a diagonal covariance matrix. The HMM structure was left-to-right with no skip states. To generate features, speech was Hamming windowed every 16 ms with a window width of 32 ms. For each speech frame, a 25-channel filterbank spectrum with mel-scaled frequency axis was obtained. Each speech spectral vector was then transformed to a cepstral vector. The cepstral vectors were truncated to either 15 or 14 coefficients. For each cepstral vector, the velocity and acceleration features were derived over a window of five cepstral vectors as described in [29]. To obtain 2-D cepstral-time features, each group of eight consecutive spectral vectors was transformed to a cepstral-time matrix. The cepstral-time matrices were truncated to one of two sizes, depending on whether the zeroth row and column were included or excluded.

Figs. 9–12 show the experimental results obtained for HMM's using cepstral vector features, and Figs. 13–16 show the results for HMM's using cepstral-time features. In each figure, the unmatched conditions with no noise compensation (NNC) form the worst case, due to the uncompensated mismatch, which exists between the models trained on clean
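The front end described above (16 ms frame advance, 32 ms Hamming window, DCT of the log filterbank energies truncated to 15 coefficients, and cepstral-time matrices over blocks of eight vectors) can be sketched as follows. The mel filterbank itself is omitted for brevity, and the function names and the 8 kHz sampling rate are assumptions for illustration:

```python
import numpy as np

def frame_signal(x, fs=8000, win_ms=32, hop_ms=16):
    """Slice a signal into overlapping Hamming-windowed frames."""
    win = fs * win_ms // 1000
    hop = fs * hop_ms // 1000
    n = 1 + (len(x) - win) // hop
    idx = np.arange(win) + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(win)

def cepstra_from_logfb(log_fb, n_cep=15):
    """Truncated DCT of log filterbank energies -> cepstral vectors."""
    n_chan = log_fb.shape[-1]
    m = np.arange(n_chan)
    C = np.cos(np.pi * np.arange(n_cep)[:, None] * (m + 0.5) / n_chan)
    return log_fb @ C.T

def cepstral_time_matrix(ceps, block=8):
    """DCT over time of `block` consecutive cepstral vectors
    (the resulting matrices may then be truncated)."""
    t = np.arange(block)
    D = np.cos(np.pi * np.arange(block)[:, None] * (t + 0.5) / block)
    usable = len(ceps) // block * block
    blocks = ceps[:usable].reshape(-1, block, ceps.shape[1])
    return np.einsum('kt,btc->bkc', D, blocks)
```

Velocity and acceleration features would be appended per cepstral vector over a five-frame window, following [29].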
speech and tested on noisy speech. With matched conditions, the models are trained and tested with speech contaminated under similar noise conditions, which should indicate the best performance the system can achieve. The graphs indicate that under matched conditions, the recognition accuracy does not suffer greatly and remains steady down to SNR's as low as 0 dB. The deterioration in performance is relatively rapid below 0 dB but remains well above the performance in unmatched conditions. Matched and unmatched conditions can be used to indicate the upper and lower bounds on recognition.

For spectral subtraction, the nonlinear method described in Section III was implemented. The noise estimate was obtained from the speech inactive periods. These periods are labeled in the NOISEX database. The maximum attenuation of the spectral filter was limited to 20 dB. To reduce the processing distortions due to the noise variance, the spectral filter sequence, for successive speech blocks, was obtained and lowpass filtered before application to the noisy speech.

The Wiener filtering was implemented by replacing the clean model cepstral means with the Wiener filter adapted means, as described in Section IV-B. The experiments on car noise and Lynx noise show that state-integrated Wiener filters work remarkably well; the Wiener filter performance is much better than the spectral subtraction method and produces results that are comparable with those obtained with model adaptation.

Fig. 18. Recognition in simulated random duration impulsive noise.
speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1988, pp. 481–484.
[15] M. J. F. Gales and S. J. Young, "An improved approach to the hidden Markov model decomposition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1992, pp. 729–734.
[16] ——, "HMM recognition in noise using parallel model combination," in Proc. Eurospeech, 1993, pp. 837–840.
[17] D. Mansour and B. H. Juang, "The short-time modified coherence representation and noisy speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 795–804, June 1989.
[18] Y. Ariki, S. Mizuta, M. Nagata, and T. Sakai, "Spoken-word recognition using dynamic features analyzed by two-dimensional cepstrum," Proc. Inst. Elec. Eng., vol. 136, pt. I, no. 2, pp. 133–140, Apr. 1989.
[19] S. V. Vaseghi, P. N. Connor, and B. P. Milner, "Speech modeling using cepstral-time feature matrices in hidden Markov models," Proc. Inst. Elec. Eng., vol. 140, no. 5, pp. 317–320, Oct. 1993.
[20] O. Ghitza, "Auditory nerve representation as a front-end for speech recognition in a noisy environment," Comput. Speech Language, vol. 1, pp. 109–130, 1986.
[21] C. Pisoni, "Some acoustic-phonetic correlates of speech produced in noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1985, pp. 1581–1584.
[22] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications. Cambridge, MA: MIT Press, 1949.
[23] V. L. Beattie and S. J. Young, "Noisy speech recognition using hidden Markov model state based filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1991, pp. 917–920.
[24] S. V. Vaseghi and B. P. Milner, "Noisy speech recognition based on HMM's, Wiener filters and re-evaluation of most likely candidates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. II, 1991, pp. 103–106.
[25] J. A. Nolazco Flores and S. J. Young, "Adapting a HMM-based recognizer for noisy speech enhanced by spectral subtraction," in Proc. Eurospeech, 1993, pp. 829–832.
[26] S. Vaseghi and P. Rayner, "Detection and suppression of impulsive noise in speech communications systems," Proc. Inst. Elec. Eng., Commun. Speech Vision, pp. 38–46, Feb. 1990.
[27] S. J. Godsill, "The restoration of degraded audio signals," Ph.D. thesis, Cambridge Univ., U.K., 1993.
[28] J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1971.
[29] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, no. 1, pp. 52–59, Feb. 1986.
[30] A. P. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Tech. Rep., DRA Speech Res. Unit, 1992.

Saeed Vaseghi received the B.Sc. degree in electrical and electronics engineering from the University of Newcastle, England, and the Ph.D. degree in signal processing from Cambridge University, England. His Ph.D. research in restoration of archived gramophone records led to the establishment of the CEDAR audio processing system for the restoration of degraded audio signals. His research interests include audio signal restoration, modeling of the time-varying and the nonlinear aspects of speech production and perception, and the development of features and probabilistic models for speech pattern recognition. He has authored a book entitled Advanced Signal Processing and Digital Noise Reduction and is a lecturer at the Department of Electrical and Electronic Engineering of Queens University, Belfast, Northern Ireland.
Ben Milner was born in Norfolk, England, on October 18, 1968. He received the
Ph.D. degree in signal processing from the University of East Anglia in 1995.
Prior to this, he received the B.Eng. degree in electronic engineering in 1991 and
spent some time in the oil industry, working on side scan sonar and seabed
profiling.
Since 1994, he has been working at BT Laboratories in the speech recognition
group, specializing in noise and channel robustness and front-end processing.
Dr. Milner is an associate member of the IEE and is a visiting signal processing lecturer at the University
of East Anglia.