
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 1, JANUARY 1997

Noise Compensation Methods for Hidden Markov Model Speech Recognition in Adverse Environments

Saeed V. Vaseghi and Ben P. Milner

Abstract— Several noise compensation schemes for speech recognition in impulsive and nonimpulsive noise are considered. The noise compensation schemes are spectral subtraction, HMM-based Wiener filters, noise-adaptive HMM's, and a front-end impulsive noise removal. The use of the cepstral-time matrix as an improved speech feature set is explored, and the noise compensation methods are extended for use with cepstral-time features. Experimental evaluations, on a spoken digit database, in the presence of car noise, helicopter noise, and impulsive noise, demonstrate that the noise compensation methods achieve substantial improvement in recognition across a wide range of signal-to-noise ratios. The results also show that the cepstral-time matrix is more robust than a vector of identical size composed of a combination of cepstral and differential cepstral features.

I. INTRODUCTION

SPEECH recognition systems operating in adverse environments, such as in a moving car or in a noisy office, have to deal with a variety of ambient noise and distortions. Currently, most speech recognition systems are based on hidden Markov models (HMM's) [1], and this paper explores and evaluates several noise compensation methods for HMM's operating in noisy conditions. The major signal processing stages in an HMM-based speech recognition system are acoustic feature extraction, acoustic segmentation, and model likelihood calculation. Noise affects each stage of the recognition process, and the result is a rapid deterioration in recognition accuracy with decreasing signal-to-noise ratio (SNR). Speech recognition systems achieve best performance when the models are trained and operated in matched environments. For most applications, this is impractical, as the operating environment varies with time and space, and therefore, some form of noise compensation must be employed. The research work in noisy speech recognition may be classified into three broad categories:

a) filtering of the noisy speech prior to classification [3]–[9];
b) adaptation of the speech models to include the effects of noise [10]–[16];
c) the use of features that are robust to noise [17]–[20].

Speech recognizers operating in noisy conditions must also deal with the changes in the speaking habits of people subjected to noise. In noise, people speak louder, and there are increases in duration, pitch, and higher frequency energy contents of speech [21]. The noise-induced stress can be as harmful to recognition as the noise itself. However, in this paper, the focus is on the effects of the additive noise on speech recognition. The noisy speech signal is modeled as

  y(m) = x(m) + n(m)   (1)

where m is the discrete time index, x(m) is the clean speech, and n(m) is the noise. In the frequency domain, (1) becomes

  Y(f) = X(f) + N(f)   (2)

where f is the frequency index, and Y(f), X(f), and N(f) are the spectra of the noisy speech, the clean speech, and the noise, respectively.

The organization of this paper is as follows. A brief introduction to HMM speech recognition is presented in Section II. Spectral subtraction, HMM-based Wiener filters, impulsive noise removal, and model adaptation are described in Sections III to VI. Section VII describes the use of cepstral-time feature matrices as an improved speech feature set and the expansion of the noise compensation schemes into 2-D for use with cepstral-time matrices. Experimental results, which are presented in Section VIII, compare the performance of the various noise compensation methods described in this paper. Finally, Section IX concludes the paper.

II. HIDDEN MARKOV MODELS

A statistical HMM [1] is a finite state structure that is particularly well suited for the statistical characterization of nonstationary signals such as speech and time-varying noise. In HMM theory, a nonstationary process is modeled by a chain of stationary states, with each state having a different set of statistical characteristics. An N-state HMM is defined by the parameter set consisting of the initial state probability for each state, the state transition probability from state to state, and the state observation probability density function (pdf), which is usually modeled by a mixture of Gaussian densities.

Manuscript received December 19, 1992; revised June 27, 1996. The associate editor coordinating the review of this paper and approving it for publication was Prof. John H. L. Hansen.
S. V. Vaseghi is with the Department of Electrical and Electronic Engineering, Queen's University of Belfast, Belfast, Northern Ireland.
B. P. Milner is with the Speech Technology Unit, BT Laboratories, Martlesham Heath, Suffolk, England.
Publisher Item Identifier S 1063-6676(97)00767-0.

1053–587X/97$10.00 © 1997 IEEE
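The additive model of (1) and (2) can be checked numerically: the DFT is linear, so an additive time-domain model implies an additive frequency-domain model. The following minimal sketch uses synthetic signals and an illustrative frame size (not data from this paper):

```python
import cmath
import random

def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(16)]   # synthetic "clean speech" frame
n = [random.gauss(0.0, 0.3) for _ in range(16)]   # additive noise
y = [xi + ni for xi, ni in zip(x, n)]             # (1): y(m) = x(m) + n(m)

Y, X, N_ = dft(y), dft(x), dft(n)
# (2): Y(f) = X(f) + N(f), by linearity of the DFT
assert all(abs(Y[k] - (X[k] + N_[k])) < 1e-9 for k in range(16))
```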



The main parameters of an HMM are the state transition probabilities and the state observation pdf's. The state transition probabilities model the variations in durations and articulation rates of speech. The state observation pdf's model the variations in the spectral content of the speech segments associated with each state. A variant of the HMM's is the left–right model, so called because state transitions can only be made from a left state to a right state, that is, a_ij = 0 for j < i. The left–right HMM is useful for modeling random functions of time using the left-to-right progression through the model. In HMM theory, the likelihood that an unknown observation vector sequence X is an acoustic realization of the word model M is obtained by summing the probabilities of the observation over all state sequences s as

  f(X|M) = Σ_s f(X|s, M) P(s|M)   (3)

The probability that an observation vector x belongs to state i is modeled by a mixture of multivariate Gaussian pdf's as

  f(x|i) = Σ_k P_ik N(x; μ_ik, Σ_ik)   (4)

where P_ik is the prior probability of the kth component of the Gaussian mixture pdf for state i, and N(x; μ_ik, Σ_ik) is a Gaussian pdf with a mean vector μ_ik and a covariance matrix Σ_ik. The state Gaussian pdf is the main function through which the signal and the noise influence the likelihood score for a model.

III. SPECTRAL SUBTRACTION

This section considers spectral subtraction for speech recognition in noisy environments. In spectral subtraction, an estimate of the speech spectra is obtained by subtracting an estimate of the average noise spectra from the noisy speech [2], [4]. The equation describing spectral subtraction may be expressed as

  |X̂(f)|^b = |Y(f)|^b − α(SNR) |N̄(f)|^b   (5)

where f is the frequency variable, |Y(f)| is the spectrum of the noisy speech, |X̂(f)| is the estimate of the clean speech spectrum, and |N̄(f)| is the time-averaged estimate of the noise spectra. For magnitude spectral subtraction, the exponent b = 1, and for power spectral subtraction, b = 2. The parameter α(SNR) controls the amount of noise subtracted from the noisy signal and may be dependent on the SNR. The time-averaged noise spectrum is obtained from the periods when the signal is absent and only the noise is present as

  |N̄(f)|^b = (1/K) Σ_{i=1}^{K} |N_i(f)|^b   (6)

It is assumed that the noise remains stationary in between the update periods. In (6), N_i(f) is the spectrum of the ith noise block, and the integer K is the number of blocks in a noise-only period. Alternatively, the time-averaged noise spectrum can be obtained using a first-order digital lowpass filter as

  |N̄_i(f)|^b = ρ |N̄_{i−1}(f)|^b + (1 − ρ) |N_i(f)|^b   (7)

where the smoothing constant ρ is typically close to unity. The spectral subtraction equation (5) can be expressed in terms of a filter with frequency response

  H(f) = 1 − α(SNR) |N̄(f)|^b / |Y(f)|^b   (8)

Due to the variations of the noise spectrum, spectral subtraction may produce negative estimates of the power, or the magnitude, spectrum. This outcome is more probable at frequencies with a low SNR. To avoid negative magnitude estimates, the spectral subtraction output is processed using a mapping of the form

  T[|X̂(f)|] = |X̂(f)| if |X̂(f)| > β|Y(f)|, and β|Y(f)| otherwise   (9)

Note that the mapping in (9) is equivalent to restricting the magnitude frequency response of the spectral subtraction filter to the range β ≤ H(f) ≤ 1. The main problem in spectral subtraction is due to the processing distortions resulting from the variations of the noise spectrum. There are many variants of spectral subtraction that basically differ in the methods used for the estimation of the noise spectrum and the different degrees of time averaging that they impose on the estimated signals. A commonly used method for reducing the noise variance is to replace the spectral subtraction filter (8) with a smoothed version defined as

  H(f) = 1 − α(SNR) |N̄(f)|^b / |Ȳ(f)|^b   (10)

where |Ȳ(f)|^b is a time-averaged estimate of the noisy speech spectrum. Further smoothing can be achieved by lowpass filtering the variations of H(f) across successive speech blocks.

The so-called nonlinear spectral subtraction methods make use of the information on the local SNR and utilize the observation that at low SNR, oversubtraction can lead to improved performance. Lockwood and Boudy [9] suggested the following function as a nonlinear estimator of the noise spectrum:

  |N̄_NL(f)|^b = Φ(max_i |N_i(f)|^b, SNR(f), |N̄(f)|^b)   (11)

where the subscript NL denotes a nonlinear estimate. The nonlinear noise estimate is a function of the maximum value of the noise spectrum over K frames, the SNR, and the linear noise estimate. One form for the function Φ(·) is

  Φ(·) = max_i |N_i(f)|^b / (1 + γ SNR(f))   (12)

where γ is a control parameter. In (12), as the SNR decreases, the output of the nonlinear estimator approaches the maximum of the noise spectrum over the K frames, and as the SNR increases, it approaches zero. The noise estimate is, however, forced to be an overestimation by using a limiting function such as

  |N̄_NL(f)|^b = max(|N̄_NL(f)|^b, |N̄(f)|^b)   (13)

Spectral subtraction utilizes only an estimate of the noise power spectrum. As a result, spectral subtraction performs poorly when compared with methods, such as Wiener filters and model adaptation, that utilize estimates of the power spectra of both the signal and the noise processes.

IV. HMM-BASED WIENER FILTERS

This section describes the use of Wiener filters with HMM's for improving speech recognition in noisy conditions. Two implementation methods are described, namely, state-dependent Wiener filters and state-integrated Wiener filters. For a stationary signal observed in additive noise, the Wiener filter equations in the time and the frequency domains are given as [2], [22]

  w = (R_xx + R_nn)^{-1} r_xx   (14)

  W(f) = P_X(f) / (P_X(f) + P_N(f))   (15)

where R_xx, r_xx, and R_nn denote the signal autocorrelation matrix, the signal autocorrelation vector, and the noise autocorrelation matrix, respectively; P_X(f) = E[|X(f)|²] is the signal power spectrum (the operator E[·] denotes expectation), and P_N(f) is the noise power spectrum. From (15), for additive noise, the Wiener filter attenuates the noisy signal frequencies in proportion to the SNR. The application of a Wiener filter requires the signal and the noise power spectra. For quasistationary noise, the noise power spectra may be estimated and updated from speech-inactive periods. The speech power spectra may be obtained from the HMM cepstral means using an inverse discrete cosine transform (IDCT) followed by an exponential operation to convert the cepstral variables to power spectral variables [6], [7], [15], [16].

A. State-Dependent Wiener Filter Sequence

A state-dependent Wiener filter sequence is derived from the cepstral vectors contained in the states of the HMM's together with an estimate of the noise power spectrum from speech-inactive periods [24]. The implementation of state-dependent Wiener filters, which is illustrated in Fig. 1, involves the following steps:

Fig. 1. Illustration of HMM's with state-dependent Wiener filters for noisy speech recognition.

Step 1: For the noisy speech and each HMM, obtain a maximum likelihood state sequence.
Step 2: From the state sequence, produce a series of state-dependent Wiener filters.
Step 3: For each model, use the Wiener filter sequence to filter the noisy speech.
Step 4: Obtain a probability score for the filtered speech and its respective model, and finally, select the model with the highest score.

In state-dependent Wiener filtering, when the noisy speech is processed by the filter sequence derived from the correctly hypothesised HMM, the effect of noise is reduced. However, when the filter sequence is derived from an incorrectly hypothesised HMM, no significant noise reduction occurs [5], [24]. This process increases the likelihood of accurate speech recognition. A drawback of this method is that it relies on the accuracy of the state sequence from which the state-dependent Wiener filters are derived, and the accuracy of the state sequence estimators deteriorates rapidly with increasing noise.

B. HMM State-Integrated Wiener Filters

In the context of an HMM, Wiener filtering can be accomplished by adaptation to noise of the mean cepstral vectors in the states of the HMM [7]. Due to the logarithmic operation in forming cepstral features, filtering of noisy speech in the cepstral domain is an additive operation given by

  ĉ_x = c_y + c_w   (16)

where ĉ_x, c_y, and c_w denote the cepstra of the filtered speech, the noisy speech, and the Wiener filter, respectively. From (15) and (16), the cepstrum of a Wiener filter may be expressed as

  c_w = c_x − c_{x+n}   (17)

where, in (17), the speech cepstrum c_x is derived from the speech power spectrum, and c_{x+n} is the cepstrum of the sum of the power spectra of the speech and the noise. The filtered signal in (16) may be rewritten as

  ĉ_x = c_y + c_x − c_{x+n}   (18)
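The cepstral-domain filtering of (15)–(18) can be sketched numerically. In the sketch below, the per-band power spectra are illustrative values standing in for the HMM-derived speech spectrum and the noise estimate; the log spectral domain is used directly, which carries over to the cepstral domain since the DCT is linear:

```python
import math

# Assumed per-band power spectra (illustrative values, 4 bands)
P_x = [16.0, 4.0, 1.0, 9.0]   # clean speech power (from an HMM state)
P_n = [1.0, 1.0, 4.0, 1.0]    # noise power (from speech-inactive periods)

# (15): Wiener filter frequency response
W = [px / (px + pn) for px, pn in zip(P_x, P_n)]

# (17): in the log spectral domain, the filter is the difference between the
# clean log spectrum and the log spectrum of the speech-plus-noise power
c_x = [math.log(px) for px in P_x]
c_xn = [math.log(px + pn) for px, pn in zip(P_x, P_n)]
c_w = [cx - cxn for cx, cxn in zip(c_x, c_xn)]
assert all(abs(cw - math.log(w)) < 1e-12 for cw, w in zip(c_w, W))

# (16)/(18): filtering in the log/cepstral domain is an addition c_y + c_w,
# so scoring the filtered observation against a state mean of c_x is
# equivalent to replacing that mean by c_xn
```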

Now, consider the filtered signal (18) when applied to the Gaussian scoring function of the HMM state observation as

  f(ĉ_x|i) = N(c_y + c_x − c_{x+n}; μ_i, Σ_i) = N(c_y; μ_i − c_x + c_{x+n}, Σ_i)   (19)

where μ_i is the mean of the speech cepstrum in state i. It is assumed that the mean cepstrum μ_i is approximately equal to c_x, which is the cepstrum obtained from the power spectrum. Thus, from (19), Wiener filtering is equivalent to replacing the mean cepstral vector of each mixture with c_{x+n}, the cepstrum of the sum of the speech and noise power spectra. An advantage of this implementation technique over the state-based Wiener filtering is that it does not rely on the accuracy of the maximum likelihood state sequence estimated from the noisy speech. Experimentally, this method resulted in substantial improvement to recognition performance.

V. SPEECH RECOGNITION IN IMPULSIVE NOISE

This section considers the modeling, detection, and removal of impulsive-type noise. An impulsive noise sequence n_i(m) can be modeled as an amplitude-modulated, binary-state, random sequence as

  n_i(m) = b(m) a(m)   (20)

where b(m) is a binary-valued sequence of ones and zeros that signals the presence or the absence of a noise pulse, and a(m) is a random noise process. Two statistical models for impulsive noise are the Bernoulli–Gaussian and the Poisson–Gaussian processes [2]. The autocorrelation and power spectrum functions of an uncorrelated impulsive noise process may be expressed as

  r(k, m) = σ_a² b(m) δ(k)   (21)

  P(f, m) = σ_a² b(m)   (22)

where σ_a² is the variance of the noise amplitude, k is the autocorrelation lag, m is the time index, and δ(k) is the Kronecker delta function. Assuming that the amplitude of a noise pulse is a zero-mean Gaussian process with variance σ_a², the pdf of impulsive noise can be defined as

  p(n) = α N(n; 0, σ_a²) + (1 − α) δ(n)   (23)

where α is the probability of occurrence of an impulse. In a communication system, an impulsive noise originates at some point in time and space and propagates through the channel to the receiver; see Fig. 2. The noise pulse is shaped by the channel and may be considered to be the channel impulse response h(m). Hence, the impulsive noise may be expressed as

  n_i(m) = Σ_k b(k) a(k) h(m − k)   (24)

Fig. 2. Impulsive noise modeled as the output of a channel filter excited by an amplitude-modulated binary sequence.

An alternative model for an impulsive noise sequence is the two-state HMM shown in Fig. 3. In this model, one state corresponds to the "off" condition, when impulsive noise is absent, and the other state corresponds to the "on" condition. In this state, the model emits short duration pulses of random amplitude and duration. The probability of a transition from state i to state j is denoted by a_ij. In one of its simplest forms, the "on" state emits samples generated from a zero-mean Gaussian random process. The impulsive noise model in the "on" state can be trained to model a variety of noise pulses of different shapes, durations, and pdf's.

Fig. 3. Binary-state HMM of impulsive noise. With the values of the transition probabilities as shown, the likelihood of occurrence of an impulsive noise is independent of the state.

For a signal contaminated by an impulsive noise sequence, a "local" time-varying signal to impulsive noise ratio can be defined as

  SINR(m) = P_x(m) / P_i(m)   (25)

where P_x(m) and P_i(m) are the signal power and the power of each impulse, respectively. The average signal to impulsive noise ratio depends on the average amplitude and the rate of occurrence of impulsive noise. Assuming that η is the fraction of samples contaminated by impulsive noise, an average signal to impulsive noise ratio can be defined as

  SINR_av = P_x / (η P_i)   (26)

Note that for a given signal power, different values of η and P_i can yield the same average SINR.
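A minimal simulation of the Bernoulli–Gaussian model of (20) and (23) is sketched below; the impulse probability and amplitude variance are illustrative values, not parameters from the paper's experiments:

```python
import random

random.seed(0)
alpha, sigma = 0.05, 4.0    # impulse probability and amplitude std (assumed)
N = 20000
b = [1 if random.random() < alpha else 0 for _ in range(N)]  # on/off sequence
a = [random.gauss(0.0, sigma) for _ in range(N)]             # amplitude process
n_i = [bi * ai for bi, ai in zip(b, a)]                      # (20): n_i(m) = b(m) a(m)

eta = sum(b) / N                      # fraction of contaminated samples
assert abs(eta - alpha) < 0.01

# the mean power of the contaminated samples approaches sigma^2, so the
# average signal-to-impulsive-noise ratio depends on both eta and sigma,
# as noted around (26)
imp_power = sum(v * v for v in n_i) / sum(b)
assert abs(imp_power - sigma ** 2) < 2.0
```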

A. Impulsive Noise Detection

In the impulsive noise model of (20), the binary-valued process b(m) signals the presence or the absence of an impulsive noise, and hence, impulsive noise detection is equivalent to estimation of the binary sequence b(m). In this section, a front-end method and an HMM-based method for the detection of an impulsive noise sequence are described.

1) Detection Using an Inverse Predictor: The detectability of a noise pulse observed in a relatively high level of speech can be improved through decorrelation, which has the effect of enhancing the amplitude of an impulsive-type event relative to the "background" speech. The correlation structure of speech may be modeled by a linear predictor, and the noisy speech expressed as

  y(m) = Σ_{k=1}^{P} a_k y(m − k) + e(m) + d(m)   (27)

where a_k are the predictor coefficients, and e(m) is the speech excitation. The process of decorrelation is performed by an inverse predictor and yields a noisy speech excitation as

  v(m) = y(m) − Σ_{k=1}^{P} a_k y(m − k) = e(m) + d(m)   (28)

where d(m) is the estimation error due to the noise. The noisy excitation is passed through a threshold to make an estimate of the noise state as

  b̂(m) = 1 if |v(m)| > θ, and 0 otherwise   (29)

where the threshold θ is derived from a robust estimate of the variance of the speech excitation [2]. A robust estimate is obtained by excluding the larger, outlying samples, which may be impulsive noise [2], [28]. An impulsive noise detector based on an inverse predictor is incorporated in the impulsive noise removal system of Fig. 4. Since each impulse, in passing through the inverse predictor, is shaped by the impulse response of the inverse predictor, it follows that a matched filter can enhance noise detection [2]. The details of the impulsive noise detection system are described in [2], [26], and [27].

2) Noise Detection Based on the ML State Sequence: The maximum likelihood (ML) state sequence of an HMM of impulsive noise (see Fig. 3) can be used as a detector of the presence or the absence of impulsive noise. For a given observation sequence X, the maximum likelihood state sequence s_ML of an HMM M is obtained as

  s_ML = arg max_s f(X|s, M) P(s|M)   (30)

where f(X|s, M) is the pdf of the observation sequence X along the state sequence s of the HMM M. A problem in the use of HMM's for the detection of impulsive noise is the sensitivity of the accuracy of the ML state sequence to the presence of a background signal. A solution is to decorrelate the speech using an inverse linear predictor and to train the HMM's on impulsive noise observed in a decorrelated background signal. An alternative solution is to use HMM's that combine the speech and the impulsive noise states, as described in [13] and [16].

B. Front-End Impulsive Noise Removal

A typical impulsive noise sequence leaves a large fraction of speech samples unaffected. Thus, it is advantageous to locate each noise pulse and correct only those samples that are distorted. The front-end impulsive noise removal system evaluated in this paper is composed of a detector and an interpolator, as shown in Fig. 4 [2], [26]. The detector is described in Section V-A. The output of the detector subsystem is used as a binary switch to control the interpolator. A detector output of '0' signals the absence of impulsive noise, and the interpolator is bypassed. A detector output of '1' signals the presence of impulsive noise, and the interpolator is activated to replace the samples obliterated by noise. The interpolator is based on a linear prediction model of speech and makes effective use of the undistorted samples on both sides of the discarded samples. The interpolator works well for the replacement of missing speech segments of up to 50 samples at a 10-kHz sampling rate.

Fig. 4. Impulsive noise removal system incorporating a detection and an interpolation subsystem.

VI. NOISE-ADAPTIVE SPEECH MODELS

A deficiency of conventional filtering methods, such as spectral subtraction, is that crucial speech information may be removed during the filtering process. For noisy speech recognition, an alternative to filtering is to adapt the parameters of the speech models to include the effects of noise, in an attempt to obtain models that would have been obtained under matched training and testing conditions. Adaptation of speech model parameters depends on the choice of the speech features. For linear speech features, such as power spectral or correlation features, the statistics of the noisy speech are given as the sum of the statistics of the speech and the additive noise. In [11], Roe described a method for the noise adaptation of LPC-based speech features. For cepstral speech features, the nonlinear, logarithmic transformation from the spectral domain to the cepstral domain affects the adaptation. Nadas et al. [12] introduced noise-adaptive speech models trained on log-power spectral features. In their models, it is assumed that at any given time, each speech spectral band is dominated either by the signal energy or by the noise energy.
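The adaptation described in Section VI relies on translating model statistics between the log (normal) and linear (log-normal) power spectral domains. A minimal sketch of these moment mappings and of the additive linear-domain combination is given below; all numerical values are illustrative:

```python
import math

def normal_to_lognormal(mu_l, var_l):
    # moments of the exponential of a Gaussian variable
    mu = math.exp(mu_l + var_l / 2.0)
    var = mu * mu * (math.exp(var_l) - 1.0)
    return mu, var

def lognormal_to_normal(mu, var):
    # inverse mapping back to the log domain
    var_l = math.log(var / (mu * mu) + 1.0)
    mu_l = math.log(mu) - var_l / 2.0
    return mu_l, var_l

# log-domain speech and noise statistics (illustrative values)
mu_x, var_x = normal_to_lognormal(1.0, 0.25)
mu_n, var_n = normal_to_lognormal(-0.5, 0.10)

# combine in the linear power spectral domain: means and variances add
mu_y, var_y = mu_x + mu_n, var_x + var_n

# map the noisy-model statistics back to the log domain
mu_y_l, var_y_l = lognormal_to_normal(mu_y, var_y)

# the two mappings are exact inverses of one another
m, v = lognormal_to_normal(*normal_to_lognormal(1.0, 0.25))
assert abs(m - 1.0) < 1e-12 and abs(v - 0.25) < 1e-12
```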

Varga [13], [14] and Gales and Young [15], [16] proposed a model combination method in which a speech HMM and a noise HMM are combined to produce a model of the noisy signal. Fig. 5 outlines the stages involved in the model adaptation process described in [16], where the HMM state observation parameters are converted from the cepstral domain into the log spectral domain using an inverse DCT. The log spectral features are then mapped to linear power spectral features. Noise adaptation takes place on the power spectral features, and the combined noise and speech model is mapped back to the cepstral domain. In the linear spectral domain, the adaptation of the means and the variances of the speech power spectrum to noise is given as

  μ_y,k = μ_x,k + μ_n,k   (31)

  σ_y,kj = σ_x,kj + σ_n,kj   (32)

where the subscripts x, n, and y denote the means and the (co)variances of the power spectra of the clean signal, the noise, and the noisy signal, respectively. The index k designates the kth frequency band, and σ_kj is the covariance of the kth and the jth frequency bands.

Fig. 5. Block diagram configuration of the adaptation system.

Assuming that the cepstral features are Gaussian distributed, the log spectral features, which are obtained from an inverse DCT, are also Gaussian. Hence, due to the exponential mapping from the cepstral to the spectral domains, it follows that the linear power spectral variables are log-normal distributed. The mapping functions for translating the mean and variance of a normal distribution to a log-normal distribution are given as

  μ_k = exp(μ_l,k + σ_l,kk / 2)   (33)

  σ_kj = μ_k μ_j [exp(σ_l,kj) − 1]   (34)

where μ_l,k is the mean of the log spectra of the kth frequency band, and σ_l,kj is the covariance of the kth and the jth frequency bands; the subscript l denotes a log domain variable. The mappings for the transform of the means and variances from log-normal to normal are

  μ_l,k = ln(μ_k) − (1/2) ln(σ_kk / μ_k² + 1)   (35)

  σ_l,kj = ln(σ_kj / (μ_k μ_j) + 1)   (36)

Following the transform to the log spectral domain, a DCT is used to transform the log domain parameters to the cepstral domain.

VII. IMPROVED SPEECH FEATURES FOR NOISY CONDITIONS

In contrast to computer speech recognition systems, human speech recognition exhibits a remarkable degree of resilience to noise and distortions and the ability to operate in rapidly changing environments. This resilience to noise may be attributed to a better set of speech features extracted by the auditory system, to more efficient use of the temporal correlation and dynamics of speech features, and to the use of a priori knowledge and the higher levels of language processing, context, grammar, etc. Currently, the most widely used features for computer speech recognition are the cepstral features, which are adopted through a process of elimination in that they are the best feature set among those proposed and investigated so far. An important area of research in speech recognition is the development of novel features that are robust to noise. The work on this subject may be divided into two categories.

In one category, the speech features or speech models are modified to make more efficient use of the correlation between successive speech segments. In HMM theory, it is assumed that, within each state, speech features are independent identically distributed (IID). The IID assumption contributes to a rapid deterioration in the performance of HMM's in noisy conditions. Speech feature vectors are correlated in time, and an effective method that utilizes the correlation structure can improve recognition. Examples of such work are the short-time modified coherence features [17] and cepstral-time speech features [18], [19].

In the second category, the effort is directed at the development of nonlinear speech features, which in some way model the human auditory processing system. For example, Ghitza [20] used an ensemble interval histogram to model the auditory nerve firings in the cochlea.

This section considers the use of a cepstral-time feature matrix for improved speech representation. The motivations for exploring cepstral-time matrices are the following:

1) Cepstral-time features provide a systematic method for encoding the temporal dynamics of the speech spectrum and are more robust than an identically sized feature set composed of cepstral and differential cepstral features.
2) The correlation of cepstral features along the time axis can be used for a more effective implementation of noise compensation methods.
3) Cepstral features can be made almost insensitive to a wide range of channel distortions.
4) A cepstral-time matrix can be more conveniently adapted to noise than differential cepstral features.

A. Cepstral-Time Features

Cepstral-time features are formed as follows. Speech is segmented into overlapping blocks of time-domain samples, and each block is transformed via a discrete Fourier transform to spectral samples. These spectral samples are grouped into overlapping, mel-scaled, triangular frequency bands, and the samples within each band are averaged to form power spectral features. Along the time axis, that is, across the successive blocks, a triangular window is run, and the spectral values within the span of the triangle are averaged to form spectral-time features. The overall effect is that each spectral-time feature is obtained by averaging the samples within the span of a frequency-time pyramid.
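The cepstral-time construction can be sketched with a minimal, unnormalized 2-D DCT. The matrix sizes and the channel log-gain vector below are illustrative; the sketch checks that a time-invariant channel offset lands only in the zeroth time-index column of the cepstral-time matrix, consistent with the channel-robustness motivation:

```python
import math

def dct(v):
    # unnormalized DCT-II; normalization is immaterial for this illustration
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def cepstral_time(log_S):
    """2-D DCT of a log spectral-time matrix log_S[f][t]."""
    rows = [dct(r) for r in log_S]              # DCT along the time axis
    cols = [dct(list(c)) for c in zip(*rows)]   # DCT along the frequency axis
    return [list(r) for r in zip(*cols)]        # C[q][k]

# toy log spectral-time matrix: 4 mel bands x 8 frames (assumed sizes)
log_S = [[0.1 * f + 0.01 * f * t for t in range(8)] for f in range(4)]
h = [1.0, -0.5, 0.3, 0.2]   # hypothetical time-invariant channel log gain
log_S_ch = [[log_S[f][t] + h[f] for t in range(8)] for f in range(4)]

C0, C1 = cepstral_time(log_S), cepstral_time(log_S_ch)
changed = [(q, k) for q in range(4) for k in range(8)
           if abs(C0[q][k] - C1[q][k]) > 1e-9]
# a time-invariant channel alters only the zeroth (k = 0) column
assert changed and all(k == 0 for q, k in changed)
```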

The spectral-time variables are converted into logarithmic variables and then grouped into a sequence of spectral-time matrices. Each spectral-time matrix is transformed via a 2-D DCT to a cepstral-time matrix. Following the DCT operation, the lower submatrix is selected as the speech features. This submatrix represents the spectral-time envelope of speech and contains the set of coefficients most useful for speech recognition. An alternative implementation of the cepstral-time matrix is to apply a 1-D DCT along the time axis of a matrix consisting of conventional cepstral vectors, as in Fig. 6.

Fig. 6. Interpretation of various regions of the cepstral-time matrix.

In the transformation from a spectral-time to a cepstral-time matrix, the frequency axis of the spectral-time matrix is converted to quefrency, and the time axis is converted to frequency. Fig. 6 illustrates the various regions of a cepstral-time matrix. In particular, the effects of the channel are concentrated in the zeroth column. This column, due to its high variance, does not contribute well to the recognition process, and discarding it removes the effects of the channel without compromising the recognition rate.

Fig. 7 is a comparative evaluation of cepstral vector features and cepstral-time matrix features for a spoken digit database in the presence of white noise. The figure shows that an HMM employing cepstral-time matrix features performs better than an HMM employing a feature vector of identical size composed of cepstral, velocity cepstral, and acceleration cepstral features. The differential features were derived over a window of five cepstral vectors, as described in [29]. In addition, note that the same number of frames was used in both methods.

Fig. 7. Comparison of the performance of an HMM with (a) a 14 x 3 cepstral-time matrix and (b) a 42-D feature vector composed of 14 cepstral, 14 delta cepstral, and 14 delta-delta cepstral features.

B. Noise Compensation Methods with Cepstral-Time Matrices

In this section, the spectral subtraction, Wiener filter, and noise adaptation techniques are extended for use with the cepstral-time matrix.

1) Spectral Subtraction: Let Y(f, t) denote the spectral-time matrix of noisy speech, modeled as

  Y(f, t) = X(f, t) + N(f, t)   (37)

where X(f, t) and N(f, t) are the spectral-time matrices of the speech and the noise, respectively. For subtraction of the averaged noise spectrum, the spectral-time matrix of noisy speech is transformed into a "spectral-spectral" matrix by converting the time dimension in (37) to a frequency dimension g to yield

  Y(f, g) = X(f, g) + N(f, g)   (38)

where X(f, g) and N(f, g) denote the clean speech and noise spectral-spectral matrices, respectively. An estimate of the clean speech can be obtained from extended spectral subtraction as

  |X̂(f, g)| = |Y(f, g)| − α |N̄(f, g)|   (39)

When a noise spectral-time matrix is transformed into a spectral-spectral matrix, it can be expressed as

  N(f, g) = N̄(f) δ(g) + Ñ(f, g)   (40)

where the term N̄(f) is the short-time average of the noise spectrum along the time span of the spectral-time matrix and, hence, appears only in the zeroth column of the spectral-spectral matrix. In addition, at g = 0, the nonnegative magnitude spectrum is expressed in terms of a time-averaged component and a random zero-mean component Ñ(f, g). Note from (39) that by transforming the time axis of the spectral-time matrix to the magnitude-frequency domain, a second subtractable noise mean is produced. Fig. 8 illustrates a typical noise spectral-spectral matrix obtained from the transformation of a spectral-time matrix. It can be seen that the zeroth column of the spectral-spectral matrix contains the time-averaged noise power at each frequency, and the higher order columns contain the temporal dynamics of the noise variations at each frequency across the duration of the spectral-time matrix.

Fig. 8. Spectral-spectral matrix of noise.
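The spectral-spectral decomposition used in Section VII-B can be illustrated with a small numerical sketch; an unnormalized DCT-II stands in for the time-to-frequency transform, and the band means and fluctuations are illustrative values:

```python
import math

def dct(v):
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

mean = [3.0, 1.5, 2.0]   # time-averaged noise power per frequency band
fluct = [0.2, -0.2, 0.1, -0.1, 0.3, -0.3, 0.05, -0.05]  # zero-mean variation
N_ft = [[mean[f] + fluct[t] for t in range(8)] for f in range(3)]

# transform the time dimension of the spectral-time matrix to frequency
N_fg = [dct(row) for row in N_ft]

# the zeroth column carries the (scaled) time-averaged noise power of each
# band; higher-order columns carry only the temporal dynamics
assert all(abs(N_fg[f][0] - sum(N_ft[f])) < 1e-9 for f in range(3))
assert max(abs(N_fg[0][g]) for g in range(1, 8)) > 0.01
```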

Fig. 9. One-dimensional cepstra (including c(0)) with delta and delta-delta features; Lynx noise.

Fig. 10. One-dimensional cepstra (including c(0)) with delta and delta-delta features; Car noise.

Fig. 11. One-dimensional cepstra (excluding c(0)) with delta and delta-delta features; Lynx noise.

Fig. 12. One-dimensional cepstra (excluding c(0)) with delta and delta-delta features; Car noise.

2) Wiener Filters for Spectral-Time Features: The Wiener filter for a noisy spectral-time matrix can be formed by mapping the time dimension to frequency and using a modified 2-D Wiener filter defined as

  W(f, g) = P_X(f, g) / (P_X(f, g) + P_N(f, g))   (41)

where P_X(f, g) and P_N(f, g) are the power spectra of the signal and the noise, respectively. The cepstral-time matrix of the filtered spectral-time features is given by

  Ĉ_x = C_y + C_w   (42)

where Ĉ_x is the cepstral-time estimate of the clean speech, and C_y and C_w are the cepstral-time matrices of the noisy speech and the 2-D Wiener filter, respectively. In a similar way to the 1-D Wiener filter (see Section IV-B), the 2-D Wiener filter can be integrated with the states of the HMM's.

3) Noise-Adaptive HMM's: The state observation statistics of an HMM using cepstral-time speech features consist of a mean cepstral-time matrix and a fourth-order covariance tensor. The covariance tensor can be simplified to a "diagonal" tensor if it is assumed that the elements are uncorrelated. The equations for the adaptation of cepstral-time-based HMM's to noise can be obtained using a procedure similar to that described in Section VI.

VIII. EXPERIMENTAL RESULTS

This section presents experimental evaluations of the noise compensation techniques described in the earlier sections, using the NOISEX-92 spoken single-digit speech database [30]. Of the various noises available on the NOISEX database, the Lynx helicopter noise and the Volvo car noise were selected as speech contaminants. In all experiments, the digits were modeled using an eight-state, continuous-density HMM with a single Gaussian mode per state and a diagonal covariance matrix. The HMM structure was left-to-right with no skip states. To generate features, speech was Hamming windowed every 16 ms with a window width of 32 ms. For each speech frame, a 25-channel filterbank spectrum with a mel-scaled frequency axis was obtained. Each speech spectral vector was then transformed to a cepstral vector. The cepstral vectors were truncated to either 15 or 14 coefficients. For each cepstral vector, the velocity and acceleration features were derived over a window of five cepstral vectors, as described in [29]. To obtain 2-D cepstral-time features, each group of eight consecutive spectral vectors was transformed to a cepstral-time matrix. The cepstral-time matrices were truncated to either 15 by 4 or 14 by 3, depending on whether the zeroth row and column were included or excluded.

Figs. 9-12 show the experimental results obtained for HMM's using cepstral vector features, and Figs. 13-16 show the results for HMM's using cepstral-time features. In each figure, the unmatched conditions with no noise compensation (NNC) form the worst case, due to the uncompensated mismatch that exists between the models trained on clean
Fig. 13. Two-dimensional Cepstra, 15 by 4; Lynx noise.
Fig. 14. Two-dimensional Cepstra, 15 by 4; Car noise.
Fig. 15. Two-dimensional Cepstra, 14 by 3; Lynx noise.
Fig. 16. Two-dimensional Cepstra, 14 by 3; Car noise.
Fig. 17. Recognition in machine gun noise.
Fig. 18. Recognition in simulated random duration impulsive noise.

speech and tested on noisy speech. With matched conditions, the models are trained and tested with speech contaminated under similar noise conditions, which should indicate the best performance the system can achieve. The graphs indicate that under matched conditions, the recognition accuracy does not suffer greatly and remains steady down to SNR's as low as 0 dB. The deterioration in performance is relatively rapid below 0 dB but remains well above the performance in unmatched conditions. Matched and unmatched conditions can therefore be used to indicate the upper and lower bounds on recognition.

For spectral subtraction, the nonlinear method described in Section III was implemented. The noise estimate was obtained from the speech-inactive periods, which are labeled in the NOISEX database. The maximum attenuation of the spectral filter was limited to 20 dB. To reduce the processing distortions due to the noise variance, the spectral filter sequence for successive speech blocks was obtained and lowpass filtered before application to the noisy speech.

The Wiener filtering was implemented by replacing the clean model cepstral means with the Wiener-filter-adapted means, as described in Section IV-B. The experiments on car noise and Lynx noise show that state-integrated Wiener filters work remarkably well; the Wiener filter performance is much better than that of the spectral subtraction method and produces results comparable with those obtained with model adaptation.
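A minimal sketch of the spectral subtraction gain computation described above: the attenuation is floored at 20 dB, and the gain sequence for successive frames is lowpass filtered before application. The first-order recursive smoother and its coefficient are our assumptions, standing in for the paper's nonlinear method of Section III:

```python
import numpy as np

def spectral_subtraction_gains(noisy_psd, noise_psd, floor_db=-20.0, smooth=0.7):
    """Simplified spectral subtraction filter sequence.
    noisy_psd: (n_frames, n_bins) power spectra of the noisy speech.
    noise_psd: (n_bins,) noise power estimated from speech-inactive frames.
    Returns amplitude-domain gains, floored at floor_db and lowpass
    filtered across frames to reduce processing distortion."""
    floor = 10.0 ** (floor_db / 20.0)                 # 20 dB maximum attenuation
    sub = 1.0 - noise_psd[None, :] / np.maximum(noisy_psd, 1e-12)
    g = np.sqrt(np.clip(sub, floor ** 2, 1.0))        # amplitude-domain gain
    out = np.empty_like(g)
    out[0] = g[0]
    for t in range(1, len(g)):                        # lowpass across frames
        out[t] = smooth * out[t - 1] + (1.0 - smooth) * g[t]
    return out
```

The resulting gains multiply the noisy magnitude spectra; as in the experiments above, the noise estimate would come from the labeled speech-inactive periods.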
TABLE I
RECOGNITION PERFORMANCE FOR SPEECH CONTAMINATED BY IMPULSIVE NOISE

For model adaptation/combination experiments, the clean digit models were adapted by converting the speech state observation statistics (means and variances) from the cepstral domain to the power spectral domain, adding the noise model statistics, and converting back to the cepstral domain. The noise model was a one-state, single mode per mixture HMM. The figures show that the improvement in recognition accuracy that results from adaptation is substantial, and the performance of noise-adaptive HMM's approaches the results obtained in matched conditions.

Note that the use of cepstral-time matrix features consistently improves the recognition accuracy relative to cepstral vector features. With the spectral subtraction methods, the improvement in accuracy depends on the subset of the cepstral-time matrix that is used for speech representation, and the best results are obtained when the zeroth row and column of the matrix are removed.

Effects of Impulsive Noise on Speech Recognition: Experimental results on the effects of varying the frequency of occurrence and the amplitude of impulsive noise on the performance of HMM-based speech recognition systems are tabulated in Table I, which shows the recognition performance for speech contaminated by simulated impulsive noise. In this experiment, both the percentage of speech samples contaminated by impulsive noise and the overall signal-to-impulsive-noise ratio (SINR) have been varied.

From Table I, as more speech samples are contaminated by impulsive noise, or as the impulsive noise power increases, the performance of the recognizer deteriorates. Note that at a given SINR, increasing the percentage of samples corrupted by impulsive noise implies that the average impulse amplitude is decreased. From the columns of Table I, at a given SINR, as the frequency of occurrence of impulses increases, the recognition performance deteriorates. As expected, a few large impulses have a lesser degrading effect than a large number of small-amplitude impulses.

Fig. 17 shows the performance of speech recognition in the presence of machine gun noise. Machine gun noise can be considered to be the impulse response of the machine gun and the acoustic environment; as such, the noise pulses have a relatively long and well-defined shape. The method of model combination [15] works well for such long-duration, high-amplitude impulses, and the performance approaches that of the matched conditions. Fig. 18 shows that for the shorter duration, simulated impulses, the front-end combination of noise detection and interpolation improves the performance considerably at SNR's as low as 0 dB.

IX. CONCLUSION

In this paper, several noise compensation schemes were considered and extended for use with cepstral-time matrices. Experimental results show that of the three noise compensation schemes considered in this paper, noise-adaptive HMM's produce the best result, followed closely by state-integrated Wiener filters and then spectral subtraction. The processing distortions in spectral subtraction are a fundamental limitation of the method and the main cause of its relatively poor performance. For speech recognition in impulsive noise, a front-end method and the model combination method were evaluated. The model combination method achieves good performance for longer duration noise pulses such as machine gun noise. Experimental results indicate that the front-end noise removal method is more effective in compensating for short duration impulses. This may be due to the front-end system's utilization of the distinct and localized character of impulsive noise in the time domain.

For robust speech recognition, the speech features and models should make optimal use of the temporal dynamics of the speech spectrum. The discrete cosine transform used in extracting the cepstral-time features provides a method for decomposing the temporal variations of speech cepstral features. The use of the cepstral-time matrix results in consistent improvement in speech recognition over the more conventional combination of cepstral and differential cepstral features.

REFERENCES

[1] J. Deller, J. G. Proakis, and H. L.
Hansen, Discrete-Time Processing of
Speech Signals. New York: Macmillan,
1993.
[2] S. V. Vaseghi, Advanced Signal Processing and Digital Noise Reduction. New York: Wiley, 1996.
[3] B. H. Juang, "Speech recognition in adverse environments," Comput. Speech Language, pp. 275–294, 1991.
[4] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 113–120, Apr. 1979.
[5] A. D. Berstein and I. D. Shallom, "An hypothesized Wiener filtering approach to noisy speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1991, pp. 913–916.
[6] Y. Ephraim, D. Malah, and B. H. Juang, "On the application of hidden Markov models for enhancing noisy speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1846–1856, Dec. 1989.
[7] S. V. Vaseghi and B. P. Milner, "Noise adaptive hidden Markov models based on Wiener filters," in Proc. Eurospeech, 1993, pp. 1023–1026.
[8] J. S. Lim and A. V. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, pp. 197–210, June 1978.
[9] P. Lockwood and J. Boudy, "Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars," in Proc. Eurospeech, 1991, pp. 79–82.
[10] J. S. Bridle, K. M. Ponting, M. D. Brown, and A. W. Borret, "A noise compensating spectrum distance measure applied to automatic speech recognition," in Proc. Inst. Acoust., vol. 6, pt. 4, pp. 307–314, 1984.
[11] D. B. Roe, "Speech recognition with a noise-adapting codebook," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1987, pp. 1139–1142.
[12] A. Nadas, D. Nahamoo, and M. A. Picheny, "Speech recognition using noise-adaptive prototypes," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1495–1503, Oct. 1989.
[13] A. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1990, pp. 845–848.
[14] A. Varga, R. Moore, J. Bridle, K. Ponting, and M. Russell, "Noise compensation algorithms for use with hidden Markov model based speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1988, pp. 481–484.
[15] M. J. F. Gales and S. J. Young, "An improved approach to the hidden Markov model decomposition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1992, pp. 729–734.
[16] M. J. F. Gales and S. J. Young, "HMM recognition in noise using parallel model combination," in Proc. Eurospeech, 1993, pp. 837–840.
[17] D. Mansour and B. H. Juang, "The short-time modified coherence representation and noisy speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 795–804, June 1989.
[18] Y. Ariki, S. Mizuta, M. Nagata, and T. Sakai, "Spoken-word recognition using dynamic features analyzed by two-dimensional cepstrum," Proc. Inst. Elec. Eng., vol. 136, pt. I, no. 2, pp. 133–140, Apr. 1989.
[19] S. V. Vaseghi, P. N. Connor, and B. P. Milner, "Speech modeling using cepstral-time feature matrices in hidden Markov models," Proc. Inst. Elec. Eng., vol. 140, no. 5, pp. 317–320, Oct. 1993.
[20] O. Ghitza, "Auditory nerve representation as a front-end for speech recognition in a noisy environment," Comput. Speech Language, vol. 1, pp. 109–130, 1986.
[21] C. Pisoni, "Some acoustic-phonetic correlates of speech produced in noise," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1985, pp. 1581–1584.
[22] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, with Engineering Applications. Cambridge, MA: MIT Press, 1949.
[23] V. L. Beattie and S. J. Young, "Noisy speech recognition using hidden Markov model state based filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1991, pp. 917–920.
[24] S. V. Vaseghi and B. P. Milner, "Noisy speech recognition based on HMM's, Wiener filters and re-evaluation of most likely candidates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. II, 1991, pp. 103–106.
[25] J. A. Nolazco Flores and S. J. Young, "Adapting a HMM-based recognizer for noisy speech enhanced by spectral subtraction," in Proc. Eurospeech, 1993, pp. 829–832.
[26] S. Vaseghi and P. Rayner, "Detection and suppression of impulsive noise in speech communications systems," Proc. Inst. Elec. Eng., Commun. Speech Vision, pp. 38–46, Feb. 1990.
[27] S. J. Godsill, "The restoration of degraded audio signals," Ph.D. thesis, Cambridge Univ., U.K., 1993.
[28] J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1971.
[29] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, no. 1, pp. 52–59, Feb. 1986.
[30] A. P. Varga, H. J. M. Steenken, M. Tomlinson, and D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Tech. Rep., DRA Speech Res. Unit, 1992.

Saeed Vaseghi received the B.Sc. degree in electrical and electronics engineering from the University of Newcastle, England, and the Ph.D. degree in signal processing from Cambridge University, England.
His Ph.D. research in the restoration of archived gramophone records led to the establishment of the CEDAR audio processing system for the restoration of degraded audio signals. His research interests include audio signal restoration, modeling of the time-varying and nonlinear aspects of speech production and perception, and the development of features and probabilistic models for speech pattern recognition. He has authored a book entitled Advanced Signal Processing and Digital Noise Reduction and is a lecturer at the Department of Electrical and Electronic Engineering of Queens University, Belfast, Northern Ireland.
Ben Milner was born in Norfolk, England, on October 18, 1968. He received the Ph.D. degree in signal processing from the University of East Anglia in 1995. Prior to this, he received the B.Eng. degree in electronic engineering in 1991 and spent some time in the oil industry, working on side-scan sonar and seabed profiling.
Since 1994, he has been working at BT Laboratories in the speech recognition group, specializing in noise and channel robustness and front-end processing.
Dr. Milner is an associate member of the IEE and is a visiting signal processing lecturer at the University of East Anglia.