Naotoshi Seo
University of Maryland
ENEE632 Speech Processing
Final Project
p(n) = \frac{1}{M} \sum_{m=0}^{M-1} |S(n, m)|^2        (8)

Because we cannot know the true value of a(n), we consider an ideal case of a(n) = 0. Here, p(n) becomes
Fig. 1. The first test signal. (a) Speech waveform in the silence. (b) Noisy speech waveform created by adding white noise at an SNR of 0 dB.

Fig. 3. The second test signal in the presence of real-world street noise. (a) Speech waveform in the silence. (b) Noisy speech waveform created by adding street noise at an SNR of 0 dB.
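A rough numerical sketch of the frame-power estimate in Eq. (8) is given below: it averages the squared subband magnitudes |S(n, m)|^2 over the M subbands of each frame. The STFT-based subband decomposition, the frame and hop sizes, and all variable names are assumptions made purely for illustration; they are not taken from the report.

```python
import numpy as np

def frame_power_estimate(S):
    """Estimate p(n) as in Eq. (8): the mean of |S(n, m)|^2 over the M subbands.

    S : complex array of shape (num_frames, M), e.g. an STFT matrix or the
        output of a subband filter bank (assumed layout, not from the report).
    """
    M = S.shape[1]
    return np.sum(np.abs(S) ** 2, axis=1) / M  # (1/M) * sum_m |S(n, m)|^2

# Toy usage: a subband decomposition via a short-time Fourier transform.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                       # 1 s of noise at 16 kHz
frame_len, hop = 512, 256
frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
S = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
p_hat = frame_power_estimate(S)                      # one power value per frame
print(p_hat.shape)
```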
[Figures: panels (a) Power, (b) Power of Periodic Part, (c) Power of Aperiodic Part, (d) log likelihood ratio, (e) VAD Result]
HMM, that is, all possible transitions between states are allowed, to classify speech segments into one of the broad phonetic categories corresponding to the HMM states. After the classification, appropriate features are selected. In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after the phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores from each category [17].

The GMM based speaker identification is a most widely

where p(i|x_t, \lambda) is the a posteriori probability for acoustic class i, i = 1, \ldots, c, and satisfies

\sum_{i=1}^{c} p(i|x_t, \lambda) = 1        (22)

Setting the derivatives of the Q function with respect to \lambda to zero, the following reestimation formulas are found:

p(i|x_t, \lambda) = \frac{w_i N(x_t; \mu_i, \Sigma_i)}{\sum_{k=1}^{c} w_k N(x_t; \mu_k, \Sigma_k)}        (23)
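To make Eqs. (22) and (23) concrete, the following minimal sketch evaluates the a posteriori probability p(i|x_t, \lambda) of each Gaussian component for one feature vector and checks that the posteriors sum to one. The diagonal-covariance parameterization, the toy dimensions, and all variable names are assumptions made for illustration, not details from the report.

```python
import numpy as np

def gmm_posteriors(x_t, weights, means, variances):
    """Eq. (23): p(i|x_t, lambda) = w_i N(x_t; mu_i, Sigma_i) / sum_k w_k N(x_t; mu_k, Sigma_k).

    weights   : (c,)   mixture weights w_i
    means     : (c, d) component means mu_i
    variances : (c, d) diagonal covariances (diagonal form assumed for simplicity)
    """
    d = x_t.shape[0]
    diff = x_t - means                                   # (c, d)
    log_gauss = -0.5 * (d * np.log(2 * np.pi)
                        + np.sum(np.log(variances), axis=1)
                        + np.sum(diff ** 2 / variances, axis=1))
    log_joint = np.log(weights) + log_gauss              # log of w_i * N(x_t; mu_i, Sigma_i)
    log_joint -= log_joint.max()                         # subtract max for numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum()                           # normalized, so Eq. (22) holds

# Toy usage with 4 components and 13-dimensional features.
rng = np.random.default_rng(1)
c, d = 4, 13
post = gmm_posteriors(rng.standard_normal(d),
                      np.full(c, 1.0 / c),
                      rng.standard_normal((c, d)),
                      np.ones((c, d)))
print(post, post.sum())                                  # posteriors and their sum (~1)
```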
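As background for the cascade summarized in the conclusion below, a common decision rule for GMM based speaker identification is to score the frames selected by the VAD against each speaker's GMM and pick the speaker with the highest accumulated log-likelihood. The sketch below is my own illustration of that standard rule under a diagonal-covariance assumption; it is not code from the report, and all names are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Sum over frames of log p(x_t | lambda) for a diagonal-covariance GMM.

    X : (T, d) feature vectors, e.g. the frames a VAD has marked as speech.
    """
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                        # (T, c, d)
    log_gauss = -0.5 * (d * np.log(2 * np.pi)
                        + np.sum(np.log(variances), axis=1)         # (c,)
                        + np.sum(diff ** 2 / variances, axis=2))    # (T, c)
    log_joint = np.log(weights) + log_gauss                         # (T, c)
    m = log_joint.max(axis=1, keepdims=True)                        # log-sum-exp over components
    return float(np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))

def identify_speaker(X, speaker_models):
    """Pick the speaker whose GMM (weights, means, variances) maximizes the log-likelihood."""
    return max(speaker_models, key=lambda spk: gmm_log_likelihood(X, *speaker_models[spk]))

# Toy usage with two hypothetical speaker models.
rng = np.random.default_rng(2)
c, d = 4, 13
models = {spk: (np.full(c, 1.0 / c), rng.standard_normal((c, d)), np.ones((c, d)))
          for spk in ("speaker_A", "speaker_B")}
print(identify_speaker(rng.standard_normal((50, d)), models))
```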
constructed a semi-automatic individual voice activity detection (IVAD) system by cascading VAD based on the PARADE method and speaker identification based on GMMs. However, the experiment showed that the GMM based speaker identification method is not as robust to noise as the PARADE method. In the future I will investigate noise robust speaker identification methods and training-free speaker identification methods to make a completely automatic IVAD system.

6. REFERENCES

[1] B. S. Atal and L. R. Rabiner, A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition, IEEE Trans. Acoust., Speech, and Signal Process. 24 (1976), 201–212.

[2] S. Basu, A linked-HMM model for robust voicing and speech detection, Proc. ICASSP 1 (2003), 816–819.

[3] A. de Cheveigne, Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing, J. Acoust. Soc. Am. 93 (1993), 3271–3290.

[4] D. Enqing, L. Guizhong, Z. Yatong, and Z. Xiaodi, Applying support vector machines to voice activity detection, Proceedings of the Sixth International Conference on Signal Processing (2002), 1124–1127.

[5] F. Soong et al., A vector quantization approach to speaker recognition, Proc. IEEE ICASSP (1985), 387–390.

[6] X. Huang et al., Spoken language processing, Prentice Hall PTR, 2001.

[7] S. Furui, Recent advances in speaker recognition, Pattern Recognition Letters 18 (1997), 859–872.

[8] A. Higgins and D. Vermilyea, King speech corpus, Linguistic Data Consortium (1995).

[9] K. Ishizuka, T. Nakatani, Y. Minami, and N. Miyazaki, Speech feature extraction method using subband-based periodicity and nonperiodicity decomposition, J. Acoust. Soc. Am. 120 (2006), 443–452.

[10] K. Ishizuka, T. Nakatani, M. Fujimoto, and N. Miyazaki, Noise robust front-end processing with voice activity detection based on periodic to aperiodic component ratio, Proc. Interspeech (2007), 230–233.

[11] P. J. B. Jackson, D. M. Moreno, M. J. Russell, and J. Hernando, Covariation and weighting of harmonically decomposed streams for ASR, Proc. Interspeech (2003), 2321–2324.

[12] B. H. Juang, The past, present, and future of speech processing, IEEE Signal Processing Magazine 15 (1998), no. 3, 24–48.

[13] J.-C. Junqua, B. Mak, and B. Reaves, A robust algorithm for word boundary detection in the presence of noise, IEEE Trans. Speech Audio Process. 2 (1994), 406–412.

[14] T. Kristjansson, S. Deligne, and P. Olsen, Voicing features for robust speech detection, Proc. Interspeech (2005), 369–372.
[15] K. Li, N. S. Swamy, and M. O. Ahmad, An improved voice activity detection using higher order statistics, IEEE Trans. Speech Audio Process. 13 (2005), 965–974.

[16] M. Marzinzik and B. Kollmeier, Speech pause detection for noise spectrum estimation by tracking power envelope dynamics, IEEE Trans. Speech Audio Process. 10 (2002), 109–118.

[17] T. Matsui and S. Furui, Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs, Proc. ICASSP (1992), 157–160.

[23] J.-L. Shen, J.-W. Hung, and L.-S. Lee, Robust entropy-based endpoint detection for speech recognition in noisy environments, Proc. ICSLP (1998).

[28] S. N. Wrigley, G. J. Brown, V. Wan, and S. Renals, Speech and crosstalk detection in multichannel audio, IEEE Trans. Speech Audio Process. 13 (2005), no. 1, 84–91.

[29] R. E. Yantorno, K. L. Krishnamachari, and J. M. Lovekin, The spectral autocorrelation peak valley ratio (SAPVR) – a usable speech measure employed as a co-channel detection system, Proc. IEEE Int. Workshop Intell. Signal Process. (2001).