
ASR for Mixed Speech Using SNMF based Separation Algorithm
Yash Vardhan Varshney, Prashant Upadhyaya, Musiur Raza Abidi, Zia Ahmad Abbasi, Omar Farooq
Department of Electronics Engineering
Aligarh Muslim University, Aligarh, INDIA
Email: yashvarshneyy@gmail.com, upadhyaya.prashant@gmail.com, abidimr@rediffmail.com, zaabbasi@zhcet.ac.in,
omarfarooq70@gmail.com

Abstract—A combined experiment on the separation and recognition of the mixed speech signals of two speakers using different features is reported in this work. For separation of the speech signals, sparse NMF is used after wavelet decomposition of the mixed speech signal. The Kaldi toolkit with different features and a 4-gram language model is used for speech recognition. For the experiments, 1000 phonetically balanced Hindi sentences from the AMUAV corpus are used. Simulations are carried out at different target-to-mixed-speech mixing levels. Results show that recognition of speech signals after separation with the SNMF-based algorithm performs better than recognition of the mixed speech when the target-to-mixed signal ratio is lower than 20 dB. It is also observed that the ∆ + ∆ − ∆ feature-based speech recognition model performs better than the other models.

Index Terms—Kaldi ASR, Monophones, Speech separation, SNMF, Speech recognition, Triphones, ∆ + ∆ − ∆ features

I. INTRODUCTION

In the modern era, a number of automatic speech recognition (ASR) applications have been introduced to make human life easier. Several machines in everyday use work on the voice/speech commands of their user. Different speech recognition techniques are available for such machinery, but none is perfect under all conditions: performance drops when the target speech signal is corrupted by an unwanted background signal in a single-channel (monaural) recording [1], [2], [3].

Monaural speech recognition is therefore a problem that has generated a great deal of research interest over the past few decades and has reached a mature level. ASR has now moved out of the laboratory, and new applications are being explored that may face uncontrolled or mismatched background conditions. Any voice-operated device has to recognize the speech signal of its user to perform its task, but when a background signal is mixed with the target speech signal, the problem becomes difficult and the performance of an ASR system deteriorates.

Researchers have reported results for automatic speech recognition in noisy environments [3], [4]. They considered different generalized noises as the background of the target signal and successfully demonstrated significant performance of their algorithms. However, little work has been done on the mixed speech recognition problem, in which the target speaker's speech signal is contaminated/mixed with the speech signal of another speaker.

Many researchers have worked on mixed speech separation, but they report objective measures of the separated speech signal, such as the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility measure (STOI) [5]. These objective measures assess the quality of the speech signal, but they cannot validate a machine's understanding of that signal. So, subjective measures such as the word error rate (WER) or the word recognition rate may be used to characterize the capability of an algorithm, because they are calculated from the number of words the algorithm recognizes correctly. It is obvious that a human being can understand separated speech signals better than the mixed signal; this motivates applying a source separation technique before machine recognition takes place. A comparison of mixed speech recognition against separated speech recognition with a particular speech recognition technique can also be used to decide whether machines/algorithms follow human behaviour in speech recognition.

Some researchers have applied combined source separation and recognition techniques to English-language sentences [6], [7]. As Hindi is the fourth most spoken language [8], researchers have worked on Hindi speech recognition as well, using different speech recognition toolboxes such as HTK [9], Sphinx-4 [11] and Kaldi [12], [13], and comparing their results [14]. However, no combined work on monaural Hindi speech separation and recognition has been reported so far. This work reports results in terms of the word error rate obtained after recognition of mixed Hindi speech signals in both the mixed and the separated condition.

In the first step, recognition of the target speech signal from a mixed speech signal is carried out in the Kaldi toolbox using different features, and the WER is measured. Then, separation of the mixed speech signal is performed using wavelet-decomposition-based sparse non-negative matrix factorization (SNMF) [15]. After the separated signal is obtained, speech recognition again takes place in the Kaldi toolbox using the same features as in the previous case, and the recognition performances of the two cases are compared. Results show that speech recognition after separation of the speech signals performs better than direct recognition from the mixed speech signal.

Section 2 discusses the technique used for source separation and for speech recognition with the Kaldi toolbox and the selected features.

978-1-5090-6673-5/17 ©2017 IEEE 204


The models of the experiment and the simulation conditions are further discussed in Section 3. Section 4 presents the results and discussion, followed by the conclusion in Section 5.

II. SPEECH RECOGNITION SYSTEM

The problem can be understood simply as an automatic speech recognition task in which the input speech signal is mixed with some other, unwanted speech signal. The proposed framework for the separation as well as the recognition is shown in Figure 1.

Fig. 1. Block diagram of target speech recognition from the mixture of two different speakers' speech signals

For separation of the mixed speech signal, the sparse non-negative matrix factorization (SNMF) algorithm has been used. SNMF is a special type of NMF [16] that takes the sparse behaviour of the signal into account while factorizing the spectrogram magnitude matrix of the speech signal; related work was reported by Varshney et al. [15]. Any two-dimensional non-negative matrix can be approximated as a product of basis vectors and their weights, as shown in Equation 1:

V = [W_a][H_a]    (1)

where V is a two-dimensional non-negative matrix of size M × N, with M the number of frequency divisions and N the number of time divisions of the spectrogram matrix. W_a is the M × K matrix of basis vectors, and H_a holds the weights of the basis vectors with size K × N. Here K is the number of basis vectors, which should be no greater than the minimum of M and N.

Monaural speech signals are one-dimensional in nature. They can be converted into a two-dimensional space using the spectrogram magnitude matrix (V), which is also non-negative. The basis vectors and their weights for V can be approximated by minimizing the difference between the actual spectrogram magnitude matrix and the approximated matrix R_a, i.e.

V ≈ W_a × H_a

The basis vectors of a speaker's speech signals can be found by factorization using different types of NMF. The weights of these vectors are used for the formation of a particular speech signal. For the separation of two signals, the algorithm needs the basis vectors of the individual speakers, so it requires training speech signals spoken by the different speakers. Basis vectors are calculated for each speaker and are then used for the separation of the target speech signal from the mixed one.

The approximation of the basis vectors should be accurate and must capture the sparse characteristics of speech signals. So, sparse NMF, which takes the sparseness of the speech signal into account, is used here. The cost function to be minimized is given by Equation 2 [15], [17]:

min_{W,H} D(V_a ‖ R_a) = min_{W,H} ‖ V_a − WH ‖²_F + λ Σ_{i,j} H(i, j);    W, H ≥ 0    (2)

W_a and H_a can be calculated by the update rules given in Equations 3 and 4, which are derived from the cost function of Equation 2 using the gradient descent algorithm [14]:

H_{i,j} ← H_{i,j} · (V_i^T W̄_j) / ([WH]_i^T W̄_j + λ)    (3)

W_j ← W_j · ( Σ_i H_{i,j} [ V_i + ([WH]_i^T W̄_j) W̄_j ] ) / ( Σ_i H_{i,j} [ [WH]_i + (V_i^T W̄_j) W̄_j ] )    (4)

where W_j is the j-th basis vector (column) of W, W̄_j is its normalized version, and V_i is the i-th row of the spectrogram magnitude matrix V.
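As an illustration of the factorization step, the following sketch implements sparse NMF with an L1 penalty on H. For simplicity it uses classical Lee–Seung-style multiplicative updates [16] with an added sparsity term and explicit basis normalization, rather than the exact normalized updates of Equations 3 and 4; the function name, iteration count, and defaults are illustrative assumptions, not from the paper.

```python
import numpy as np

def sparse_nmf(V, K, n_iter=200, lam=0.5, seed=0):
    """Factorize a non-negative matrix V (M x N) as V ~= W @ H.

    W (M x K) holds the basis vectors and H (K x N) their time-varying
    weights.  An L1 penalty lam * sum(H) encourages sparse activations.
    This is a simplified Lee-Seung-style multiplicative update, not the
    exact normalized update of Eqs. (3)-(4).
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K)) + 1e-9
    H = rng.random((K, N)) + 1e-9
    eps = 1e-9
    for _ in range(n_iter):
        # multiplicative updates keep W and H non-negative throughout
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # normalize basis vectors so the sparsity penalty acts on H alone
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H
```

In use, `sparse_nmf` would be run once per training speaker to obtain that speaker's basis matrix W, and the two speakers' bases would then be concatenated to estimate H for the mixture.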
The high-frequency components of a speech signal contain much less information than the low-frequency components, but they affect the formation of the basis vectors equally [15]. These basis vectors are further used to find the weights of the individual speech signals in the mixture. As most of the information about the signal is contained in its low-frequency components, basis vectors based on the low-frequency components are enough to describe the signal properly. So, before SNMF is applied to the mixed speech signal, wavelet decomposition is applied to it and the high-frequency components of the signal are set to zero. The speech separation model is shown in Figure 2.

Fig. 2. Speech separation model based on wavelet decomposition followed by SNMF
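The wavelet preprocessing can be sketched as follows. A one-level Haar decomposition is used here purely to keep the example dependency-free; the wavelet family and decomposition level actually used are not specified in this section, so both are assumptions.

```python
import numpy as np

def haar_lowpass(x):
    """One level of a Haar wavelet decomposition: split an even-length
    signal x into approximation (low-frequency) and detail (high-frequency)
    coefficients, zero the details, and reconstruct.
    """
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    # detail coefficients d = (x[0::2] - x[1::2]) / sqrt(2) are set to zero
    y = np.empty_like(x)
    y[0::2] = a / np.sqrt(2)               # inverse transform with d = 0
    y[1::2] = a / np.sqrt(2)
    return y
```

With the details zeroed, the reconstruction reduces to pairwise averaging, i.e. the high-frequency half-band is discarded before the spectrogram and SNMF stages.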
The speech signal extracted using the basis vectors of the target speaker is then sent for recognition. Many toolboxes are available for speech recognition, and performance comparisons of Sphinx, HTK, and Kaldi have been reported by several researchers [12], [14]. These toolboxes extract features from training data and then classify the words spoken by the speaker. Kaldi is a relatively new speech recognition toolbox; it is openly licensed and provides flexible coding with superior performance compared to other ASR toolboxes [12], [14]. So, in this paper, Kaldi is chosen as the speech recognition toolbox.

The training of the speech signals can be carried out using a monophone or a triphone model. A monophone refers to a single phoneme. In this case, a word is divided into
individual units of phonemes, or phones, and features based on single phones are then extracted and used for training and testing. A triphone is a set of three phones in the form predecessor + middle phone + successor (P+M+S). For example, the word Research (rI's@:tS) is trained with triphone sets such as ( +/r/+/I'/), (/r/+/I'/+/s/), (/I'/+/s/+/@:/), (/s/+/@:/+/tS/), and (/@:/+/tS/+ ).
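The P+M+S expansion illustrated above can be written as a small helper. The function name and tuple representation are illustrative and not part of Kaldi's interface; word boundaries are padded with empty context, as in the Research example.

```python
def triphone_sets(phones):
    """Expand a word's phone sequence into context-dependent triphone
    units of the form (predecessor, middle phone, successor), padding
    the word boundaries with empty context."""
    padded = [""] + list(phones) + [""]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```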
Upadhyaya et al. have reported results using PLP and MFCC features with monophone and triphone training of Hindi speech signals [18]. In this paper, three models based on MFCC features are used for training and testing on the speech data. The first speech recognition model is based on monophone decoding using 13 static cepstral coefficients. The second model uses triphone modeling of the natural language with the same 13 static cepstral coefficients. The third model adds the ∆ and ∆ − ∆ coefficients of the MFCC features to the triphone modeling. These coefficients capture temporal dependencies between Hidden Markov Model (HMM) frames, which are otherwise assumed to be statistically independent of each other [19].
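The ∆ coefficients (and, applied twice, the ∆ − ∆ coefficients) are conventionally computed by a regression over neighbouring frames. The sketch below uses the standard formula with a half-window of N = 2 frames, a common default; Kaldi's own implementation differs in detail, so treat this as an assumption-laden illustration.

```python
import numpy as np

def delta(feats, N=2):
    """Compute delta (dynamic) features from a (frames x coeffs) matrix
    using the standard regression formula
        d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with edge frames replicated.  Applying delta() twice gives the
    delta-delta coefficients."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    T = feats.shape[0]
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom
```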
III. DATABASE AND SIMULATION

For the training of any system, a large database is required, and training on a large database may lead to better recognition of the speech signal. The Aligarh Muslim University Audio Visual (AMUAV) database contains 1000 utterances from 100 speakers (57 male and 43 female) [18]. Each speaker utters 10 Hindi sentences, of which 2 sentences are common to all speakers. These 1000 sentences are phonetically balanced, containing 2007 unique words and a total of 10664 words, so multiple training instances of each word from different speakers are possible.

The experiment was done in two parts: first speech separation, then recognition of the speech. For the first step, the two common sentences were chosen for training on the individual speakers' speech signals (by finding their basis vectors), and the remaining 8 sentences were used for testing. All signals are sampled at 16 kHz, which is most suitable for wideband speech signals. The spectrogram magnitude matrices of the training and testing signals were computed using a 512-point Fast Fourier Transform with 50% overlap on a window size of 10 milliseconds. 100 basis vectors were extracted for each individual speech signal, with the sparse parameter set to λ = 0.5.
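The spectrogram computation described above can be sketched as follows, using the stated 512-point FFT with 50% overlap; the Hann window is an assumption, as the window type is not stated in the text.

```python
import numpy as np

def spectrogram_magnitude(x, n_fft=512, overlap=0.5):
    """Magnitude spectrogram V (frequency bins x frames) of a 1-D signal,
    using an n_fft-point FFT of Hann-windowed frames with the given
    fractional overlap between consecutive frames."""
    hop = int(n_fft * (1 - overlap))
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    # rfft keeps the non-redundant half-spectrum: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The resulting non-negative matrix V is exactly the input expected by the SNMF factorization of Section 2.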
In the second step, recognition of the separated speech signals was carried out. There, a 4-gram language model was used after extracting MFCC features; the statistical language model was generated using the SRILM toolkit [20]. Utterances from 90 speakers were taken for training, and the rest were used for testing. Feature extraction for the training and testing data used 25 ms windows with a 10 ms shift between windows. Mixtures of two same-gender signals as well as of opposite-gender signals were used, making the problem more realistic.

During the speech recognition, the first two recognition models (monophone decoding and triphone decoding) used 13 MFCC features, while the third model used the 13 MFCC features together with their 13 first-order (∆) and 13 second-order (∆ − ∆) dynamic features.

Speech separation was done using MATLAB 14a on a Windows 7 operating system, and recognition was done using the Kaldi toolbox. The machine on which speech recognition was conducted runs Ubuntu 16.04 LTS (a 64-bit operating system) on an Intel Core 2 Duo processor with a 2.20 GHz clock frequency.

IV. RESULTS AND DISCUSSION

The experiment is done for mixed speech signals of opposite genders as well as of the same gender. Different features were used to recognize the sentences, and performance was measured in terms of the word error rate (WER), defined as:

WER(%) = (D + S + I) / N × 100    (5)

where N is the number of words used in the test, D is the number of deletions, S is the number of substitutions, and I is the number of insertion errors.

At a high target-to-mixed signal ratio (TMR), the signal is not much contaminated, so it may be recognized without any separation algorithm. In the separation phase of the proposed work, the information contained in the high-frequency components is lost; although this information is small, it may slightly decrease the quality of the speech signal. The results clearly show that at high TMR the mixed signal is better recognized directly, under all training models of speech recognition.

At lower TMR, however, recognition of the mixed signal becomes difficult, and speech recognition after separation improves. Discarding the high-frequency components does not much affect the quality of the speech signal, and it even improves the separation of the mixed signal, as discussed earlier. Figure 3 shows the word error rate for recognition of the mixed and the separated speech at different mixing levels in the case of opposite-gender mixtures; Figures 3(a), 3(b) and 3(c) show the effect of the different feature training models on recognition. For example, Figure 3(c) (the ∆ + ∆ − ∆ based model) shows that the WER for mixed-signal recognition at 0 dB TMR is 82.6%, which improves to 28.4% for the separated signal. In a similar way, Figure 4 shows the word error rates for mixed and separated speech at different mixing levels in the case of same-gender mixtures, with Figures 4(a), 4(b) and 4(c) again showing the effect of the different feature training models. For example, Figure 4(c) shows that the WER for a 0 dB mixture improves from 85.7% to 50.3% after separation of the mixed signals.

The figures show that the triphone-based models recognize better than the monophone-based one because triphones are context dependent. Figures 3 and 4 also show that the ∆ + ∆ − ∆ feature-based speech recognition system provides an improved WER over the simple triphone model.
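As a sketch, the WER of Equation 5 can be computed from a word-level Levenshtein alignment between the reference and the hypothesis transcript; the edit distance counts D, S and I jointly.

```python
def word_error_rate(ref, hyp):
    """WER as in Equation 5: (deletions + substitutions + insertions)
    divided by the number of reference words, times 100, obtained from
    the word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref and j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(r)][len(h)] / len(r)
```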

206
Fig. 3. WER for opposite gender mixed speech and separated speech recognition using (a) monophone, (b) triphone and (c) ∆ + ∆ − ∆ based model at different target to mixed speech ratio

Fig. 4. WER for same gender mixed speech and separated speech recognition using (a) monophone, (b) triphone and (c) ∆ + ∆ − ∆ based model at different target to mixed speech ratio

V. CONCLUSION

In this paper, recognition of mixed continuous Hindi speech signals with the Kaldi toolkit after separation of the individual signals is reported. Three feature sets were used for recognition, and it was shown that the ∆ + ∆ − ∆ feature-based model outperformed the other features. A larger triphone training set may yield further improved recognition results, as the database contains more triphones than monophones.

In future, new features and other, better separation techniques can be used to improve the results, and other speech recognition toolkits such as iATROS and RWTH ASR can be explored for new possibilities.

REFERENCES

[1] H. Runqiang, Z. Pei, G. Qin, Z. Zhiping, W. Hao and W. Xihong, "CASA based speech separation for robust speech recognition," in Proceedings of the Ninth International Conference on Spoken Language Processing (ICSLP), 2006, pp. 2-5.
[2] Y. V. Varshney, Z. A. Abbasi, M. R. Abidi, O. Farooq and P. Upadhyaya, "SNMF Based Speech Denoising with Wavelet Decomposed Signal Selection," in IEEE conference WiSPNET, 2017, pp. 2643-2646.
[3] Z. Q. Wang and D. Wang, "A Joint Training Framework for Robust Automatic Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, April 2016, pp. 796-806.
[4] P. Upadhyaya, O. Farooq, M. R. Abidi and P. Varshney, "Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition," Archives of Acoustics, vol. 40, no. 4, 2015, pp. 609-619.

[5] P. Mowlaee, R. Saeidi, M. G. Christensen, Z.-H. Tan, T. Kinnunen, P. Franti and S. H. Jensen, "A Joint Approach for Single-Channel Speaker Identification and Speech Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, 2012, pp. 2586-2601.
[6] M. Cooke, J. R. Hershey and S. J. Rennie, "Monaural speech separation and recognition challenge," Computer Speech and Language, vol. 24, no. 1, 2010, pp. 1-15.
[7] M. Khademian and M. M. Homayounpour, "Monaural Multi-Talker Speech Recognition using Factorial Speech Processing Models," CoRR, vol. abs/1610.01367, 2016.
[8] http://www.internationalphoneticalphabet.org.
[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "The HTK Book (for version 3.4)," Cambridge University Engineering Department, 2009.
[10] A. Lee, T. Kawahara and K. Shikano, "Julius - an open source realtime large vocabulary recognition engine," EUROSPEECH, 2001, pp. 1691-1694.
[11] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf and J. Woelfel, "Sphinx-4: A flexible open source framework for speech recognition," Sun Microsystems Inc., Technical Report SML1 TR2004-0811, 2004.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz and J. Silovsky, "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584, IEEE Signal Processing Society, 2011.
[13] Kaldi Home Page (kaldi-asr.org).
[14] C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy and D. Suendermann-Oeft, "Comparing open-source speech recognition toolkits," Tech. Rep., DHBW Stuttgart, 2014.
[15] Y. V. Varshney, Z. A. Abbasi, M. R. Abidi, and O. Farooq, "Frequency Selection Based Separation of Speech Signals with Reduced Computational Time Using Sparse NMF," Archives of Acoustics, vol. 42, no. 2, 2017, pp. 287-295.
[16] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, no. 6755, 1999, pp. 788-791.
[17] P. O. Hoyer, "Non-negative Matrix Factorization with Sparseness Constraints," Journal of Machine Learning Research, vol. 5, 2004, pp. 1457-1469.
[18] P. Upadhyaya, O. Farooq, M. R. Abidi and Y. V. Varshney, "Continuous Hindi Speech Recognition Model Based on Kaldi ASR Toolkit," in IEEE conference WiSPNET, 2017, pp. 812-815.
[19] K. Kumar, C. Kim, and R. M. Stern, "Delta-spectral cepstral coefficients for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4784-4787.
[20] A. Stolcke, "SRILM - an extensible language modeling toolkit," in J. H. L. Hansen and B. Pellom, editors, Proc. ICSLP, vol. 2, Denver, Sep. 2002, pp. 901-904.
