The frequency-warped signal $\tilde{x}[n]$ of a finite input signal $x[n]$, $0 \le n \le N-1$, is defined by:

$$\tilde{X}(z) = \sum_{n=0}^{\infty} \tilde{x}[n]\, z^{-n} = \sum_{n=0}^{N-1} x[n]\, \tilde{z}^{-n} \qquad (1)$$

where $\tilde{z}^{-1}$ is the first-order all-pass filter:
$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \qquad (2)$$

where $0 < \alpha < 1$ is the frequency warping factor.
The phase response $\tilde{\omega}$ of $\tilde{z}^{-1}$ is given by:

$$\tilde{\omega} = \omega + 2\tan^{-1}\!\left(\frac{\alpha \sin\omega}{1 - \alpha \cos\omega}\right) \qquad (3)$$
This phase function determines a frequency mapping. As shown in Fig. 1, $\alpha = 0.35$ and $\alpha = 0.40$ approximate the mel scale and the Bark scale [22], [32], respectively, at a sampling frequency of 8 kHz.
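The mapping of Eq. 3 is easy to evaluate numerically. The following sketch (plain Python; function names are ours) computes the warped frequency and compares $\alpha = 0.35$ against a standard mel-scale formula at 8 kHz:

```python
import math

def warped_freq(omega, alpha):
    """Eq. 3: phase response of the first-order all-pass filter."""
    return omega + 2.0 * math.atan(
        alpha * math.sin(omega) / (1.0 - alpha * math.cos(omega)))

def mel(f_hz):
    """Common mel-scale formula, used here only as a reference curve."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Compare the normalized warped frequency with the normalized mel scale
# at a sampling frequency of 8 kHz (Nyquist = 4 kHz), alpha = 0.35.
fs = 8000.0
for f in (500.0, 1000.0, 2000.0, 3000.0):
    omega = 2.0 * math.pi * f / fs
    w_norm = warped_freq(omega, 0.35) / warped_freq(math.pi, 0.35)
    m_norm = mel(f) / mel(fs / 2.0)
    print(f"{f:6.0f} Hz  warped {w_norm:.3f}  mel {m_norm:.3f}")
```

The two normalized curves agree to within a few percent over the band, which is the sense in which $\alpha = 0.35$ "approximates" the mel scale.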
In Mel-LP analysis, the spectral envelope of $\tilde{X}(\tilde{z})\tilde{W}(\tilde{z})$ is approximated by the following all-pole model in the linear frequency domain:

$$\tilde{H}_a(\tilde{z}) = \frac{\tilde{e}}{1 + \sum_{k=1}^{p} \tilde{a}_k\, \tilde{z}^{-k}} \qquad (4)$$
where $\tilde{a}_k$ is the $k$-th mel-prediction coefficient and $\tilde{e}^2$ is the residual energy (Strube, 1980).
Figure 3: Mel-LP analysis on the linear frequency scale.
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 1, Issue 2, July-August 2012 ISSN 2278-6856
The model $\tilde{H}_a(\tilde{z})$ is estimated on the basis of minimization of the mean square error (MMSE), as shown in Fig. 3. Since $\tilde{x}[n]$ is an infinite sequence, the prediction error signal is also an infinite sequence. Thus, the total error energy $\tilde{E}$ over the infinite sequence is given by:

$$\tilde{E} = \sum_{n=0}^{\infty} \left( \sum_{k=0}^{p} \tilde{a}_k\, x_k[n] \right)^2 \qquad (5)$$
where $x_k[n]$ (with $\tilde{a}_0 = 1$) is the output signal of the $k$-th order all-pass filter $\tilde{z}^{-k}(z)$ excited by $x_0[n] = x[n]$. As a result of minimizing $\tilde{E}$, the mel-prediction coefficients $\{\tilde{a}_k\}$ are obtained by solving the following normal equations:
$$\sum_{k=1}^{p} \tilde{a}_k\, \tilde{\phi}(m,k) = -\tilde{\phi}(m,0), \qquad m = 1,\ldots,p \qquad (6)$$
where

$$\tilde{\phi}(m,k) = \sum_{n=0}^{\infty} x_m[n]\, x_k[n] \qquad (7)$$
In the warped frequency domain, Eq. 7 can be rewritten as:

$$\tilde{\phi}(m,k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \tilde{X}(e^{j\tilde{\omega}})\, \tilde{W}(e^{j\tilde{\omega}}) \right|^2 e^{j(m-k)\tilde{\omega}}\, d\tilde{\omega} \qquad (8)$$
where the frequency weighting function $\tilde{W}(e^{j\tilde{\omega}})$ is defined by:

$$\tilde{W}(\tilde{z}) = \frac{\sqrt{1 - \alpha^2}}{1 + \alpha \tilde{z}^{-1}} \qquad (9)$$
which is derived from

$$\left| \tilde{W}(e^{j\tilde{\omega}}) \right|^2 = \frac{d\omega}{d\tilde{\omega}} \qquad (10)$$
Eq. 8 indicates that $\tilde{\phi}(m,k)$ reduces to the autocorrelation function of the signal whose Fourier transform equals the frequency-warped and frequency-weighted spectrum $\tilde{X}(e^{j\tilde{\omega}})\tilde{W}(e^{j\tilde{\omega}})$. This autocorrelation function is called the generalized autocorrelation function; Fig. 2 illustrates its calculation procedure. It should also be noted that $\tilde{\phi}(m,k)$ is a function only of the difference $(k - m)$. Thus, $\tilde{\phi}(m,k)$ can be calculated as a finite sum without any approximation:
$$\tilde{\phi}(m,k) = \tilde{r}[k-m] = \sum_{n=0}^{N-1} x[n]\, x_{k-m}[n] \qquad (11)$$
Therefore, to solve for $\tilde{a}_k$ and $\tilde{e}$, the generalized autocorrelation coefficients of the input signal $x[n]$ are required instead of the ordinary autocorrelation coefficients of traditional LP analysis [25], [26]. Since the mel-prediction coefficients $\{\tilde{a}_k\}$ are obtained from the generalized autocorrelation function of the input signal $x[n]$, the proposed system enhances the speech signal in the generalized autocorrelation domain.
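A direct implementation of Eq. 11 is straightforward: the sequences $x_m[n]$ are obtained by repeatedly passing $x[n]$ through the all-pass filter of Eq. 2, and $\tilde{r}[m]$ is the inner product of $x[n]$ with $x_m[n]$ over the first $N$ samples. A minimal sketch (function names are ours; plain Python for clarity):

```python
def allpass(x, alpha):
    """One pass through (z^-1 - alpha)/(1 - alpha z^-1) (Eq. 2), i.e.
    y[n] = alpha*y[n-1] + x[n-1] - alpha*x[n]."""
    y, x_prev, y_prev = [], 0.0, 0.0
    for x_n in x:
        y_n = alpha * y_prev + x_prev - alpha * x_n
        y.append(y_n)
        x_prev, y_prev = x_n, y_n
    return y

def generalized_autocorr(x, alpha, p):
    """r~[m] = sum_{n=0}^{N-1} x[n]*x_m[n] for m = 0..p (Eq. 11)."""
    r, x_m = [], list(x)
    for m in range(p + 1):
        r.append(sum(a * b for a, b in zip(x, x_m)))
        x_m = allpass(x_m, alpha)   # x_{m+1}: one more all-pass stage
    return r

# With alpha = 0 the all-pass degenerates to a pure delay, so r~[m]
# reduces to the ordinary autocorrelation of x[n].
print(generalized_autocorr([1.0, 2.0, 3.0], 0.0, 2))   # [14.0, 8.0, 3.0]
```

Since $\tilde{\phi}(m,k)$ depends only on $k - m$, the normal equations (Eq. 6) are Toeplitz and can be solved with the usual Levinson-Durbin recursion, exactly as in ordinary LP analysis.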
Although the estimated model given by Eq. 4 includes the frequency weighting $\tilde{W}(e^{j\tilde{\omega}})$, this weighting can easily be removed by inverse filtering in the generalized autocorrelation domain using $\{\tilde{W}(\tilde{z})\tilde{W}(\tilde{z}^{-1})\}^{-1}$:

$$\tilde{r}'[m] = \beta_0\, \tilde{r}[m] + \beta_1 \{\tilde{r}[m-1] + \tilde{r}[m+1]\} \qquad (12)$$
where

$$\beta_0 = (1 + \alpha^2)(1 - \alpha^2)^{-1} \qquad (13)$$
and

$$\beta_1 = \alpha (1 - \alpha^2)^{-1} \qquad (14)$$
As feature parameters for recognition, the Mel-LP cepstral coefficients are defined by:

$$\log \tilde{H}_a(\tilde{z}) = \sum_{n=0}^{\infty} \tilde{c}_n\, \tilde{z}^{-n} \qquad (15)$$

where $\{\tilde{c}_n\}$ are the mel-cepstral coefficients.
The mel-cepstral coefficients can also be calculated directly from the mel-prediction coefficients $\{\tilde{a}_k\}$ (J. Markel and A. Gray, 1976) using the following recursion:

$$\tilde{c}_k = -\tilde{a}_k - \frac{1}{k} \sum_{j=1}^{k-1} (k-j)\, \tilde{a}_j\, \tilde{c}_{k-j} \qquad (16)$$
It should be noted that the number of cepstral coefficients
need not be the same as the number of prediction
coefficients.
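The recursion in Eq. 16 is the standard LP-to-cepstrum conversion applied to the mel-prediction coefficients. A minimal sketch (function name is ours), which also shows why the number of cepstral coefficients may exceed the prediction order:

```python
def melpc_to_cepstrum(a, n_cep):
    """Eq. 16: c_k = -a_k - (1/k) * sum_{j=1}^{k-1} (k-j)*a_j*c_{k-j}.
    `a` holds a_1..a_p; coefficients beyond order p are zero, so n_cep
    may exceed the prediction order."""
    p = len(a)
    c = [0.0] * (n_cep + 1)                 # c[1..n_cep]; c[0] unused
    for k in range(1, n_cep + 1):
        a_k = a[k - 1] if k <= p else 0.0
        acc = sum((k - j) * a[j - 1] * c[k - j]
                  for j in range(1, min(k, p + 1)))
        c[k] = -a_k - acc / k
    return c[1:]

# First-order check: for A(z) = 1 + 0.5 z^-1 the series expansion of
# -log(1 + 0.5 z^-1) gives c_1 = -0.5, c_2 = 0.125, c_3 = -0.5**3 / 3.
print(melpc_to_cepstrum([0.5], 3))
```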
3. MEL-LP PARAMETER COMPENSATION
3.1 Cepstral Mean Normalization
A robust speech recognition system must adapt to its acoustic environment or channel. To this end, a number of normalization methods have been developed in the cepstral domain. The simplest yet effective cepstral normalization method is the Cepstral Mean Normalization (CMN) technique. In CMN, the mean of the cepstral vectors over an utterance is subtracted from the cepstral vector of each frame:
$$c_m[n] = c[n] - \frac{1}{N} \sum_{i=0}^{N-1} c[i] \qquad (17)$$
where $c[n]$ and $c_m[n]$ are the time-varying cepstral vectors of the utterance before and after CMN, respectively, and $N$ is the total number of frames in the utterance. The average of the cepstral vectors over the speech interval represents the channel distortion, and it is estimated without any prior knowledge of the channel [8]. As the channel distortion is suppressed by CMN, CMN can be viewed as a parameter filtering operation; consequently, it has been analyzed as a high-pass or band-pass filter [9]. The effectiveness of CMN against the combined effect of additive noise and channel distortion is limited, however. Acero and Stern (1990) [28] developed more complex cepstral normalization techniques to compensate for the joint effect of additive noise and channel distortion.
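As a sketch, Eq. 17 is a per-dimension mean subtraction over the frames of one utterance (plain Python; the function name is ours):

```python
def cepstral_mean_normalization(frames):
    """Eq. 17: subtract the utterance-level mean cepstral vector from
    each frame. `frames` is a list of equal-length cepstral vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]

# A constant (channel-like) offset in any dimension is removed exactly.
print(cepstral_mean_normalization([[1.0, 2.0], [3.0, 4.0]]))
```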
3.2 Blind Equalization
Blind equalization is an effective technique for minimizing channel distortion caused by differences in the frequency characteristics of input devices. It uses adaptive filtering to reduce these effects and can be applied in both the spectral and the cepstral domain [29], [30], but it is easier to implement in the cepstral domain and requires fewer operations there than in the spectral domain. The technique is based on the least mean square (LMS) algorithm, which minimizes the mean square error computed as the difference between the current and a reference cepstrum. In this study, the same algorithm is used as implemented in Islam et al. (2007) [3], with the same parameter values.
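The algorithmic details and parameter values are those of Islam et al. (2007) [3] and are not restated here. Purely as an illustrative sketch (not the implementation of [3]), an LMS-style cepstral equalizer can track an additive bias per cepstral dimension and subtract it frame by frame; the step size `mu` and the reference cepstrum `ref` below are hypothetical placeholders:

```python
def lms_blind_equalization(frames, ref, mu=0.01):
    """Illustrative LMS-style cepstral bias tracking: a running bias
    estimate is nudged toward the error between the equalized frame and
    a reference cepstrum, so a constant channel offset decays away.
    `mu` and `ref` are placeholders, not the settings of [3]."""
    dim = len(ref)
    bias = [0.0] * dim
    out = []
    for frame in frames:
        eq = [frame[d] - bias[d] for d in range(dim)]
        for d in range(dim):
            bias[d] += mu * (eq[d] - ref[d])   # LMS update on the error
        out.append(eq)
    return out

# A constant channel offset of 5.0 decays toward the reference (0.0).
frames = [[5.0]] * 1000
out = lms_blind_equalization(frames, ref=[0.0], mu=0.01)
print(out[0][0], round(out[-1][0], 3))
```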
4. EVALUATION ON AURORA-2 DATABASE
4.1 Experimental Setup
The proposed system was evaluated on the Aurora-2 database [31], a subset of the TIDigits database contaminated by additive noises and channel effects. The database contains recordings of male and female American adults speaking isolated digits and digit sequences of up to seven digits. The original 20 kHz data have been downsampled to 8 kHz with an ideal low-pass filter extracting the spectrum between 0 and 4 kHz; these data are considered clean. Noises are artificially added at SNRs ranging from 20 dB down to -5 dB in steps of 5 dB. To reflect the realistic frequency characteristics of terminals and equipment in the telecommunications area, additional filtering is applied to the database using two standard frequency characteristics, G.712 and MIRS, defined by the ITU (1996) [33]. Their frequency responses are shown in Fig. 5.
It should be noted that the whole Aurora-2 database was not used in this experiment; rather, a subset was used, as shown in Table 1.
Figure 5: Frequency responses of the G.712 and MIRS filters.
Table 1: Definition of training data.

Training Model | Filter | Noise                           | SNR [dB]
Clean          | G.712  | --                              | --
Multi          | G.712  | Subway, car, babble, exhibition | 20, 15, 10, 5, 0, -5 and clean
The recognition experiments were conducted with 12th-order Mel-LP analysis. The speech signal, pre-emphasized with a factor of 0.95, was windowed using a Hamming window of length 20 ms with a 10 ms frame period. The frequency warping factor was set to 0.35. As the front end, 14 cepstral coefficients and their delta coefficients, including the 0th terms, were used, so each feature vector has 28 components. The reference recognizer was based on the HTK (Hidden Markov Model Toolkit, Version 3.4) software package. The HMMs were trained on the clean condition. The digits are modeled as whole-word, left-to-right HMMs with 16 states per word and a mixture of 3 Gaussians per state. In addition, two pause models, sil and sp, are defined. The sil model consists of 3 states, as illustrated in Fig. 6; it models the pauses before and after the utterance, with a mixture of 6 Gaussians per state. The second pause model, sp, models the pauses between words and consists of a single state, which is tied with the middle state of the sil model. The recognition accuracy (Acc) is evaluated as follows:

$$\mathrm{Acc} = \frac{N - D - S - I}{N} \times 100\% \qquad (18)$$

where $N$ is the total number of words, and $D$, $S$ and $I$ are the numbers of deletion, substitution and insertion errors, respectively.
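Eq. 18 in code form (the function name is ours):

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Eq. 18: Acc = (N - D - S - I) / N * 100 [%]. Insertion errors
    make negative accuracies possible, which is why accuracy is
    reported separately from percent-correct in HTK."""
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words

print(word_accuracy(1000, 20, 50, 10))   # 92.0
```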
Figure 6: Possible transitions in the 3-state pause model sil.
4.2 Recognition Results
The recognition accuracies of the ASR system without any parameter compensation for test sets A and C are tabulated in Tables 2 and 3, respectively; the average word accuracies are 59.05% and 65.74%. CMN improved the recognition accuracy on both test sets, to 68.02% for set A and 71.64% for set C, as listed in Tables 4 and 5. With blind equalization (BEQ), a slight improvement is observed for test set A, to 60.81% (Table 6), whereas for test set C no improvement is observed and the accuracy is 65.29% (Table 7). The effect of CMN and BEQ per noise category has also been examined for test sets A and C: for car noise, BEQ gives better accuracy than CMN, while for the other noise categories CMN outperforms BEQ on average. The effect of CMN and BEQ at low SNRs has been examined as well. For test set A at low SNRs (5 and 0 dB), BEQ gives better accuracy than CMN in most cases, except for babble noise; for test set C, CMN outperforms BEQ at every SNR condition.
Table 2: Word Accuracy of MLPC without parameter compensation for set A

Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 98.71 | 96.93 | 93.43 | 78.78 | 49.55 | 22.81 | 11.08 | 68.30
Babble     | 98.61 | 89.96 | 73.76 | 47.82 | 21.95 |  6.80 |  4.45 | 48.06
Car        | 98.54 | 95.26 | 83.03 | 54.25 | 24.04 | 12.23 |  8.77 | 53.77
Exhibition | 98.89 | 96.39 | 92.72 | 76.58 | 44.65 | 19.90 | 11.94 | 66.05
Average    | 98.69 | 94.64 | 85.74 | 64.36 | 35.05 | 15.44 |  9.06 | 59.05
Table 3: Word Accuracy of MLPC without parameter compensation for set C

Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 99.11 | 94.75 | 89.41 | 77.53 | 46.58 | 16.92 |  8.72 | 65.04
Street     | 98.73 | 94.53 | 89.09 | 74.79 | 49.97 | 23.76 | 12.12 | 66.43
Average    | 98.92 | 94.64 | 89.25 | 76.16 | 48.28 | 20.34 | 10.42 | 65.74
Table 4: Word Accuracy of MLPC with CMN for set A
Table 5: Word Accuracy of MLPC with CMN for set C

Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 97.02 | 86.77 | 79.92 | 73.72 | 64.63 | 44.43 | 21.46 | 69.89
Street     | 95.86 | 89.45 | 85.01 | 78.14 | 67.11 | 47.19 | 23.88 | 73.38
Average    | 96.44 | 88.11 | 82.46 | 75.93 | 65.87 | 45.81 | 22.67 | 71.64
Table 6: Word Accuracy of MLPC with blind equalization for set A

Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 96.99 | 77.83 | 71.66 | 64.32 | 56.89 | 45.35 | 27.79 | 63.21
Babble     | 96.43 | 79.75 | 68.65 | 59.16 | 44.20 | 22.28 |  2.96 | 54.81
Car        | 96.00 | 87.65 | 78.32 | 70.27 | 59.92 | 43.72 | 20.82 | 67.98
Exhibition | 95.96 | 77.63 | 70.90 | 63.13 | 46.62 | 27.95 | 12.19 | 57.25
Average    | 96.35 | 80.72 | 72.38 | 64.22 | 51.91 | 34.83 | 15.94 | 60.81
Table 7: Word Accuracy of MLPC with blind equalization for set C

Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 97.08 | 80.20 | 74.85 | 68.01 | 58.98 | 43.72 | 22.87 | 65.16
Street     | 95.86 | 83.89 | 77.30 | 67.99 | 58.04 | 39.81 | 19.68 | 65.41
Average    | 96.47 | 82.05 | 76.08 | 68.00 | 58.51 | 41.77 | 21.28 | 65.29
References
[1] D. C. Bateman, D. K. Bye and M. J. Hunt, "Spectral contrast normalization and other techniques for speech recognition in noise," Proc. ICASSP '92, vol. 1, pp. 241-244, 1992.
[2] S. V. Vaseghi and B. P. Milner, "Noise-adaptive hidden Markov models based on Wiener filters," Proc. Eurospeech '93, vol. 2, pp. 1023-1026, 1993.
[3] M. B. Islam, K. Yamamoto and H. Matsumoto, "Wiener filter for Mel-LPC based speech recognition," IEICE Trans. Inform. Syst., E90-D(6), 2007.
[4] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust. Speech Signal Process., 27(2): 113-120, 1979.
[5] P. Lockwood and J. Boudy, "Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars," Speech Commun., vol. 11, no. 2-3, pp. 215-228, 1992.
[6] Q. Zhu and A. Alwan, "The effect of additive noise on speech amplitude spectra: a quantitative analysis," IEEE Signal Processing Letters, 9(9): 275-277, 2002.
[7] B. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am., 55(6): 1304-1312, 1974.
Noise      | Clean | 20 dB | 15 dB | 10 dB | 5 dB  | 0 dB  | -5 dB | Average (20 to 0 dB)
Subway     | 99.02 | 96.41 | 92.05 | 78.66 | 50.23 | 24.78 | 16.09 | 68.43
Babble     | 98.82 | 97.37 | 93.80 | 82.22 | 55.32 | 25.76 | 13.30 | 70.89
Car        | 98.87 | 96.96 | 92.72 | 77.42 | 42.77 | 22.55 | 13.18 | 66.48
Exhibition | 99.07 | 96.08 | 91.67 | 76.70 | 45.97 | 20.98 | 11.60 | 66.28
Average    | 98.95 | 96.71 | 92.56 | 78.75 | 48.57 | 23.52 | 13.54 | 68.02
[8] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-29, pp. 254-272, 1981.
[9] C. Mokbel, D. Jouvet, J. Monne and R. De Mori, "Compensation of telephone line effects for robust speech recognition," Proc. ICSLP '94, pp. 987-990, 1994.
[10] M. J. F. Gales and S. J. Young, "HMM recognition in noise using parallel model combination," Proc. Eurospeech '93, vol. II, pp. 837-840, 1993a.
[11] M. J. F. Gales and S. J. Young, "Cepstral parameter compensation for HMM recognition in noise," Speech Commun., 12(3): 231-239, 1993b.
[12] A. P. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," Proc. ICASSP '90, vol. 2, pp. 845-848, 1990.
[13] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust. Speech Signal Process., ASSP-28(4): 357-366, 1980.
[14] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, 1990.
[15] N. Virag, "Speech enhancement based on masking properties of the auditory system," Proc. ICASSP '95, pp. 796-799, 1995.
[16] F. Itakura and S. Saito, "Analysis synthesis telephony based upon the maximum likelihood method," Proc. 6th International Congress on Acoustics, Tokyo, C-5-5, C17-20, 1968.
[17] B. Atal and M. Schroeder, "Predictive coding of speech signals," Proc. 6th International Congress on Acoustics, Tokyo, pp. 21-28, 1968.
[18] J. Makhoul and L. Cosell, "LPCW: An LPC vocoder with linear predictive warping," Proc. ICASSP '76, pp. 446-469, 1976.
[19] J. Itahashi and S. Yokoyama, "A formant extraction method utilizing mel scale and equal loudness contour," Speech Transmission Lab., Quarterly Progress and Status Report, Stockholm, no. 4, pp. 17-29, 1987.
[20] M. G. Rahim and B. H. Juang, "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Trans. Speech and Audio Processing, vol. 4, no. 1, pp. 19-30, 1996.
[21] A. V. Oppenheim and D. H. Johnson, "Discrete representation of signals," Proc. IEEE, vol. 60, no. 6, pp. 681-691, 1972.
[22] E. Zwicker and E. Terhardt, "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency," J. Acoust. Soc. Am., vol. 68, pp. 1523-1525, 1980.
[23] H. W. Strube, "Linear prediction on a warped frequency scale," J. Acoust. Soc. Am., vol. 68, no. 4, pp. 1071-1076, 1980.
[24] H. Matsumoto, T. Nakatoh and Y. Furuhata, "An efficient Mel-LPC analysis method for speech recognition," Proc. ICSLP '98, pp. 1051-1054, 1998.
[25] P. J. Moreno, B. Raj, E. Gouvea and R. M. Stern, "Multivariate-Gaussian-based cepstral normalization for robust speech recognition," Proc. ICASSP '95, 1995.
[26] L. Neumeyer and M. Weintraub, "Robust speech recognition in noise using adaptation and mapping techniques," Proc. ICASSP '95, 1995.
[27] J. Markel and A. Gray, Linear Prediction of Speech. Springer-Verlag, 1976.
[28] A. Acero and R. Stern, "Environmental robustness in automatic speech recognition," Proc. ICASSP '90, pp. 849-852, 1990.
[29] L. Mauuary, "Blind equalization for robust telephone based speech recognition," Proc. EUSIPCO '96, pp. 125-128, 1996.
[30] L. Mauuary, "Blind equalization in the cepstral domain for robust telephone speech recognition," Proc. EUSIPCO '98, vol. 1, pp. 359-363, 1998.
[31] H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," Proc. ISCA ITRW ASR2000, pp. 181-188, 2000.
[32] P. H. Lindsay and D. A. Norman, Human Information Processing: An Introduction to Psychology, 2nd ed., p. 163, Academic Press, 1977.
[33] ITU-T Recommendation G.712, "Transmission performance characteristics of pulse code modulation channels," 1996.
[34] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586-1604, 1979.