Optimizing Regression for In-Car Speech Recognition Using Multiple Distributed Microphones

Weifeng Li, Kazuya Takeda, Fumitada Itakura
Center for Integrated Acoustic Information Research
Nagoya University
Abstract
In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple regression of the log-spectra (MRLS) for estimating the log-spectra of speech at a close-talking microphone. In this paper, the idea is extended to nonlinear regressions. Isolated word recognition experiments in real car environments show that, compared to the nearest distant microphone, recognition accuracies can be improved by about 40% for very noisy driving conditions by using the optimized regression method. The proposed approach outperforms the linear regression method and the adaptive beamformer by 8% and 3%, respectively, in terms of averaged recognition accuracy.

1. Introduction
Noise robustness is today one of the most challenging and important problems in automatic speech recognition (ASR). State-of-the-art speech recognition systems are known to perform reasonably well when using a close-talking microphone worn near the mouth of the speaker. In applications where the use of a close-talking microphone is neither desirable nor practical, a distant microphone is required. However, as the distance between the speaker and the microphone increases, the recorded signal becomes more susceptible to distortion from background noise, which severely degrades the performance of ASR systems. This problem can be greatly alleviated by using multiple microphones to capture the speech signals. Among the various multi-microphone approaches, microphone arrays are commonly used for data collection and speech enhancement. In real car environments, traditional beamforming is difficult to apply because neither the speaker position nor the noise location is fixed [1]. Recently, adaptive beamforming-based approaches (for example, the Generalized Sidelobe Canceller (GSC) [2]) have become attractive because of their dynamic parameter adjustment. However, a persistent problem of microphone arrays has been their poor low-frequency directivity for practical array dimensions [3]. In addition, problems such as signal leakage [4], inherent in the GSC, degrade its quality.


As a new multi-microphone approach to improving in-car speech recognition performance, multiple regression of the log spectra (MRLS) was proposed in [5], in which multiple spatially distributed microphones are used. The basic idea of MRLS is to approximate the log mel-filter-bank (log MFB) outputs of a close-talking microphone by a linear combination of the log spectra of the distant microphones; it can be regarded as an extension of spectral subtraction. The regression method was shown to be effective in real car environments because it statistically optimizes the parameters in terms of minimum log spectral distance using a large in-car speech corpus.
In this paper, we extend the linear regression to nonlinear regressions, such as the multi-layer perceptron (MLP) and support vector machines (SVM), to approximate the speech signals of the close-talking microphone, because we believe that a better approximation will result in improved recognition performance. We also investigate the regression approach in the cepstrum domain in addition to the log-spectrum domain. The effectiveness of the proposed nonlinear approaches is demonstrated by improvements in both the signal-to-deviation ratio (SDR) and word recognition accuracy.
The organization of this paper is as follows. In Section 2, the regression methods used for combining the data from the distant microphones are presented. In Section 3, we describe the results of experimental evaluations using in-car driver speech under real environments; a comparison with adaptive beamforming is also presented. Section 4 summarizes the paper.

2. Regression Methods
Let $\mathbf{x}^{(m)}$ denote the feature vector (e.g., log-spectrum or cepstrum) for the $m$-th distant microphone and $x_i^{(m)}(t)$ denote its $i$-th element for frame $t$. Let $\mathbf{y}$ denote the feature vector collected from the close-talking microphone and $y_i(t)$ denote its $i$-th element for frame $t$. Let $\hat{\mathbf{y}}$ denote the estimated feature vector obtained from the feature vectors of the five distant microphones. Each element of $\hat{\mathbf{y}}$ is approximated independently.
In the linear regression (LR), $\hat{y}_i(t)$ is given by

$$
\hat{y}_i(t) = \sum_{m=1}^{5} w_i^{(m)} x_i^{(m)}(t) + b_i, \qquad (1)
$$

where the $w_i^{(m)}$ are the real-valued elements of the 5-dimensional weight vector and $b_i$ is the bias; both are obtained by minimizing the mean squared error over the training examples.
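For illustration, Equation (1) can be realized with one least-squares regressor per feature dimension, as in the sketch below. The data is synthetic, scikit-learn is used as a stand-in for the actual implementation, and the variable names are illustrative only; in the setting of this paper the fit would be repeated for each of the 24 log MFB channels (or 12 cepstral coefficients).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One linear regressor per feature dimension i: predict the close-talking
# value y_i(t) from the five distant-microphone values x_i^(1..5)(t).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # synthetic stand-in for one log MFB channel, 1000 frames
y = X @ rng.normal(size=5) + 0.1   # synthetic close-talking target values

lr = LinearRegression()            # estimates the weights w_i^(m) and the bias b_i
lr.fit(X, y)                       # minimizes the mean squared error, as in Eq. (1)
y_hat = lr.predict(X)              # estimated close-talking channel
```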
In support vector machine regression (SR), we introduce the vector

$$
\mathbf{x}_i(t) = \left[\, x_i^{(1)}(t),\ x_i^{(2)}(t),\ \ldots,\ x_i^{(5)}(t) \,\right]^{\top},
$$

and the $i$-th element of the feature vector is estimated by

$$
\hat{y}_i(t) = \sum_{k} \left( \alpha_k - \alpha_k^{*} \right) K\!\left( \mathbf{x}_i(t), \mathbf{s}_k \right) + b_i, \qquad (2)
$$

where $\alpha_k$ and $\alpha_k^{*}$ are the Lagrange multipliers for the $k$-th support vector $\mathbf{s}_k$. The Lagrange multipliers $(\alpha_k, \alpha_k^{*})$ and the support vectors are found by solving the dual optimization problem in support vector learning [6]. $K(\cdot,\cdot)$ denotes the kernel function, for which we use the Gaussian kernel in the experiments.
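A corresponding sketch of the SR estimator of Equation (2) uses scikit-learn's SVR with a Gaussian (RBF) kernel. The hyperparameter values below are placeholders, since the mapping of the setting (10, 5, 0.5) quoted in Section 3.3 onto C, the kernel width, and epsilon is not spelled out in the text.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # one feature dimension from the 5 distant mics
y = np.tanh(X @ rng.normal(size=5))     # synthetic close-talking target values

# RBF-kernel support vector regression: solving the dual problem yields the
# support vectors s_k and the multipliers (alpha_k, alpha_k*) of Eq. (2).
svr = SVR(kernel='rbf', C=10.0, gamma=0.2, epsilon=0.1)   # placeholder hyperparameters
svr.fit(X, y)
y_hat = svr.predict(X)                  # evaluates the kernel expansion of Eq. (2)
```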
For multi-layer perceptron regression (MR), a network with one hidden layer composed of 8 neurons is used. The $i$-th element of the feature vector is estimated by

$$
\hat{y}_i(t) = \sum_{j=1}^{8} v_{ij} \, \tanh\!\left( \sum_{m=1}^{5} w_{jm} \, x_i^{(m)}(t) + b_j \right) + c_i, \qquad (3)
$$

where $\tanh(\cdot)$ is the hyperbolic tangent activation function. The parameters $\{ v_{ij}, w_{jm}, b_j, c_i \}$ are found by minimizing the mean squared error over the training examples using a gradient descent algorithm [7].
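Similarly, the MR estimator of Equation (3) can be sketched with scikit-learn's MLPRegressor as a stand-in for the actual implementation: one hidden layer of 8 tanh units trained by gradient descent on the mean squared error. The learning rate and iteration count below echo the values quoted in Section 3.3; the remaining settings are library defaults, not values from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # one feature dimension from the 5 distant mics
y = np.tanh(X @ rng.normal(size=5))     # synthetic close-talking target values

# One hidden layer of 8 tanh units, gradient-descent training on the MSE objective.
mlp = MLPRegressor(hidden_layer_sizes=(8,), activation='tanh',
                   solver='sgd', learning_rate_init=0.001, max_iter=1000)
mlp.fit(X, y)
y_hat = mlp.predict(X)                  # evaluates Eq. (3) for each frame
```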

3. In-car Recognition Experiments


3.1. Experimental setup
A data collection vehicle (DCV) has been specially designed for developing the in-car speech corpus at the Center for Integrated Acoustic Information Research (CIAIR), Nagoya University, Nagoya, Japan. Figure 1 shows the side view and the top view of the arrangement of the microphones in the data collection vehicle. The driver wears a headset with a close-talking microphone (1 in Figure 1). Five distant microphones (3 to 7) are placed around the driver. A four-element linear microphone array (9 to 12), with an inter-element spacing of 5 cm, is located at the visor position. The speech signals captured by the close-talking microphone are used as the reference in the regression analysis. Speech signals are digitized to 16 bits at a sampling frequency of 16 kHz.

Table 1: The 15 driving conditions (3 driving environments × 5 in-car states)

driving environment: idling, city, expressway
in-car state: normal, air-conditioner on low level, air-conditioner on high level, window (near the driver) open, CD player on

Figure 1: Side view and top view of the arrangement of the multiple spatially distributed microphones and the linear array in the data collection vehicle.
For spectral analysis, 24-channel mel-filter-bank (MFB) analysis is performed by applying triangular windows to the FFT spectrum of 25-millisecond-long windowed speech, with a frame shift of 10 milliseconds. Spectral components lower than 250 Hz are filtered out. Then the log MFB parameters are estimated. Finally, 12 cepstral-mean-normalized mel-frequency cepstral coefficients (CMN-MFCC) are obtained by applying the Discrete Cosine Transform (DCT) to the log MFB parameters and subtracting the cepstral means.
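This front end can be approximated with the sketch below (librosa's mel filter bank is triangular). The analysis window type, the power spectrum, and the use of coefficients c1 to c12 are assumptions where the text does not pin them down.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def cmn_mfcc(signal, sr=16000):
    """24-channel log MFB analysis followed by 12 CMN-MFCCs (sketch only)."""
    # 25 ms frames (400 samples at 16 kHz), 10 ms shift, 250 Hz low-frequency cutoff.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=24, fmin=250.0)
    log_mfb = np.log(mel + 1e-10)                              # 24 log MFB outputs per frame
    mfcc = dct(log_mfb, type=2, axis=0, norm='ortho')[1:13]    # 12 MFCCs via DCT (c1..c12)
    return mfcc - mfcc.mean(axis=1, keepdims=True)             # cepstral mean normalization
```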
We performed isolated word recognition experiments on the 50-word sets under 15 driving conditions (3 driving environments × 5 in-car states, as listed in Table 1). For each driving condition, 50 words are uttered by each of 18 speakers. Among them, the data uttered by 12 speakers (6 male and 6 female) is used for learning the regression models, and the words uttered by the remaining 6 speakers (3 male and 3 female) are used for recognition. 1,000-state triphone Hidden Markov Models (HMMs) with 32 Gaussian mixtures per state, trained on a total of 7,000 phonetically balanced sentences (uttered by 202 male and 91 female speakers), are used for acoustic modeling. (3,600 of the sentences were collected in the idling-normal condition and 3,400 while driving the DCV on the streets near Nagoya University (city-normal condition).) The diagram of the regression-based in-car speech recognition system is given in Figure 2; the part above the bold solid line is for HMM training and the part below is for testing.

Figure 2: Regression-based speech recognition. MDM and CTK denote the multiple distant microphones and the close-talking microphone, respectively; F. A. represents the feature analysis. (HMM training: 7,000 sentences; regression model training: 600 words; regression test: 300 words.)

Figure 3: Averaged word recognition performance using linear regression on the log MFB domain and the CMN-MFCC domain (CTK-CTK, CTK-LRB, CTK-LRC, LRB-LRB, LRC-LRC, DST-LRB, DST-LRC, DST-DST).
3.2. Linear regression for in-car speech recognition
We performed linear regression [5] (Equation (1)) on the 24 log MFB outputs and on the 12 CMN-MFCCs for both the training data and the test data. The regression is also performed on the log energy parameter. The estimated feature vectors used by the recognition system consist of 12 CMN-MFCCs + 12 Δ CMN-MFCCs + Δ log energy. In the case of regression on the log MFB outputs, the estimated log MFB vectors are first converted into CMN-MFCC vectors using the DCT, and the delta coefficients are then computed (a sketch of this transform is given below). In this way, we obtain two regression-based HMMs, which are used to recognize test data processed in the same way (denoted LRB-LRB and LRC-LRC, respectively). For comparison, we also cite the CTK acoustic model, trained on the close-talking microphone speech, and the DST model, trained on the speech from the nearest distant microphone (microphone 6 in Figure 1). The recognition performance averaged over the 15 driving conditions is given in Figure 3. We also recognize the estimated data using the CTK model and the DST model.
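The feature transform described above can be sketched as follows, turning an estimated log MFB sequence and log energy into the 25-dimensional recognition vector (12 CMN-MFCC + 12 Δ CMN-MFCC + Δ log energy). The delta window is an assumption, as the paper does not specify it.

```python
import numpy as np
from scipy.fftpack import dct

def deltas(feat, width=2):
    """Regression-based delta coefficients over a +/- width frame window."""
    pad = np.pad(feat, ((width, width), (0, 0)), mode='edge')
    num = sum(k * (pad[width + k:len(feat) + width + k]
                   - pad[width - k:len(feat) + width - k])
              for k in range(1, width + 1))
    return num / (2.0 * sum(k * k for k in range(1, width + 1)))

def recognition_features(log_mfb_hat, log_energy_hat):
    """log_mfb_hat: (T, 24) estimated log MFB; log_energy_hat: (T,) estimated log energy."""
    mfcc = dct(log_mfb_hat, type=2, axis=1, norm='ortho')[:, 1:13]   # DCT -> 12 MFCCs
    mfcc = mfcc - mfcc.mean(axis=0, keepdims=True)                   # cepstral mean normalization
    return np.hstack([mfcc, deltas(mfcc), deltas(log_energy_hat[:, None])])
```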
From Figure 3, it can be seen that good recognition performance is obtained with the close-talking microphone, which provides good-quality speech signals. The regression methods outperform the nearest distant microphone. This shows that approximating the speech signals of the close-talking microphone is a reasonable and practical way to improve the recognition accuracy. The regression on the log MFB domain shows an advantage over that on the CMN-MFCC domain. DST-LRB performs best and outperforms DST-DST by about 10% on average. Therefore, the DST model and the regression on the log MFB domain are used in the following.

3.3. Nonlinear regression for in-car speech recognition compared to adaptive beamforming
Next, we perform condition-dependent nonlinear regressions to obtain the estimated test data. The SVM regression method (Equation (2)) and the MLP regression method (Equation (3)) are considered. The three SVM parameters are specified as (10, 5, 0.5). The learning rate and the number of iterations are set to 0.001 and 1000, respectively. The regression performance is evaluated by the signal-to-deviation ratio (SDR), defined as

$$
\mathrm{SDR\ [dB]} = 10 \log_{10} \frac{\sum_{t=1}^{T} \left\| \mathbf{y}(t) \right\|^{2}}{\sum_{t=1}^{T} \left\| \mathbf{y}(t) - \hat{\mathbf{y}}(t) \right\|^{2}}, \qquad (4)
$$

where $\mathbf{y}(t)$ is the reference feature vector from the close-talking microphone for frame $t$ and $T$ denotes the number of frames in one utterance. The SDR is averaged over the number of utterances. Figure 4 shows the SDR values obtained using the three regression methods.
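A small utility matching the reconstructed Equation (4) is given below, assuming the ratio is taken over whole-utterance feature energies and then averaged across utterances.

```python
import numpy as np

def average_sdr(references, estimates):
    """SDR of Eq. (4) per utterance, averaged over the utterance list.

    references, estimates: lists of (T, D) arrays holding the close-talking
    reference y(t) and the estimated feature vectors y_hat(t)."""
    sdrs = []
    for y, y_hat in zip(references, estimates):
        num = np.sum(y ** 2)                      # reference feature energy
        den = np.sum((y - y_hat) ** 2)            # deviation energy
        sdrs.append(10.0 * np.log10(num / den))
    return float(np.mean(sdrs))
```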
The adaptive microphone array approach is attractive for speech enhancement and speech recognition (e.g., [8]). For comparison, we apply the Generalized Sidelobe Canceller (GSC) [2] to our in-car speech recognition task. Four linearly spaced microphones (9 to 12 in Figure 1) are used. The architecture of the GSC is shown in Figure 5. The three FIR filters are adapted sample-by-sample using the Normalized Least Mean Square (NLMS) method.
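For orientation, a minimal time-domain GSC sketch is given below: a delay-and-sum fixed beamformer, a pairwise-difference blocking matrix, and one FIR filter per noise reference adapted sample-by-sample with NLMS. It assumes the channels are already steered (time-aligned) toward the driver and uses the tap count and step size quoted later in this section (100 and 0.01); it is not the exact configuration used in the experiments.

```python
import numpy as np

def gsc_enhance(x, num_taps=100, mu=0.01, eps=1e-8):
    """x: (4, N) array of time-aligned microphone signals; returns the enhanced signal."""
    M, N = x.shape
    y_bf = x.mean(axis=0)                       # fixed (delay-and-sum) beamformer output ybf(n)
    u = x[:-1] - x[1:]                          # blocking matrix: 3 noise references u1..u3
    w = np.zeros((M - 1, num_taps))             # one adaptive FIR filter per reference
    y_out = np.zeros(N)
    for n in range(num_taps, N):
        frames = u[:, n - num_taps:n][:, ::-1]  # most recent reference samples
        y_a = np.sum(w * frames)                # adaptive noise estimate ya(n)
        e = y_bf[n] - y_a                       # enhanced output sample yo(n)
        w += mu * e * frames / (np.sum(frames ** 2) + eps)   # NLMS update
        y_out[n] = e
    return y_out
```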

Figure 4: Averaged SDR values [dB] for the three regression methods (LRB, SRB, MRB) and the nearest distant microphone (DST).

Figure 5: Generalized Sidelobe Canceller block diagram.

Figure 6: Averaged word recognition performance using the regression approaches (DST-LRB, DST-SRB, DST-MRB), the adaptive beamforming approach (ABF-ABF), and the nearest distant microphone (DST-DST).


The averaged recognition performance using the nonlinear regression approaches and the adaptive beamforming approach is shown in Figure 6. DST-SRB and DST-MRB denote the recognition of estimated test data obtained with the SVM and MLP regression methods, respectively. ABF-ABF denotes the adaptive beamforming approach (the number of taps and the adaptation step size are set to 100 and 0.01, respectively). We also cite DST-LRB and DST-DST from Figure 3 for comparison. Recognition accuracies are further improved by the nonlinear regression approaches compared to linear regression. We attribute this to the better approximation of the clean speech (the SDR is further improved by 0.6 dB in Figure 4). As shown in Figure 6, the proposed nonlinear approaches outperform adaptive beamforming by 3% on average.
3.4. Discussions
Our experiments show that adaptive beamforming is very effective when the window near the driver is open and when the CD player is on. This suggests that it can effectively extract the target speech while suppressing interference arriving from other directions. The proposed nonlinear regression shows advantages when the air-conditioner is on. In particular, when the air-conditioner is on at a high level, the proposed method outperforms the nearest distant microphone and adaptive beamforming by about 40% and by more than 10%, respectively. For the cases where the window near the driver is open or the CD player is on, if another microphone were available near the window or the CD player to capture more information about the wind or music interference, further improvement could be expected from the regression method. In terms of computational cost, the proposed MLP regression is superior to adaptive beamforming.

4. Summary
In this work, we have proposed nonlinear regression of the log-spectra for in-car speech recognition using multiple distant microphones. The results of our studies show that the proposed method obtains a good approximation to the speech of a close-talking microphone. Its effectiveness is also demonstrated by the improvement of word recognition accuracies under 15 driving conditions. Other speech enhancement methods may be combined with the proposed method to obtain further improvements in the recognition of speech in noisy environments.

5. References
[1] Y. Kaneda and J. Ohga, "Adaptive microphone-array system for noise reduction," IEEE Trans. on ASSP, vol. 34, no. 6, pp. 1391-1400, 1986.
[2] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. AP-30, no. 1, pp. 27-34, Jan. 1982.
[3] I. McCowan, D. Moore, and S. Sridharan, "Near-field adaptive beamformer with application to robust speech recognition," Digital Signal Processing: A Review, vol. 12, no. 1, pp. 87-106, Jan. 2002.
[4] W. Herbordt, H. Buchner, and W. Kellermann, "An acoustic human-machine front-end for multimedia applications," European Journal on Applied Signal Processing, vol. 2003, no. 1, pp. 1-11, Jan. 2003.
[5] T. Shinde, K. Takeda, and F. Itakura, "Multiple regression of log-spectra for in-car speech recognition," in Proc. of ICSLP, pp. 797-800, 2002.
[6] A. J. Smola, P. J. Bartlett, and B. Scholkopf, "A tutorial on support vector regression," NeuroCOLT2 Technical Report NC2-TR-1998-030, 1998.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[8] X. Zhang and J. H. L. Hansen, "A constrained switched adaptive beamforming for speech enhancement and recognition in real car environments," IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 733-745, Nov. 2003.
