
Using Radial Basis Probabilistic Neural Network

for Speech Recognition


Nima Yousefian, Morteza Analoui
Computer Department, Iran University of Science & Technology
Tehran, Iran
ni_yousefiyan@comp.iust.ac.ir, analoui@iust.ac.ir

Abstract— Automatic speech recognition (ASR) has been a subject of active research over the last few decades. In this paper we study the applicability of a special model of radial basis probabilistic neural network (RBPNN) as a classifier for speech recognition. This type of network is a combination of the Radial Basis Function (RBF) network and the Probabilistic Neural Network (PNN); it applies characteristics of both networks and finally uses a competitive function to compute the final result. The proposed network has been tested on a Persian single-digit numbers dataset and produced a significantly lower recognition error rate in comparison with other common pattern classifiers. All classifiers use Mel-scale Frequency Cepstrum Coefficients (MFCC) and a special type of Perceptual Linear Predictive (PLP) coefficients as their features for classification. Results show that for our proposed network the MFCC features yield better performance than PLP.

I. INTRODUCTION

A major problem in a speech recognition system is the choice of a suitable feature set that can accurately describe, in an abstract way, the original highly redundant speech signal. In non-parametric spectral analysis, mel-frequency cepstral coefficients (MFCC) are among the most popular spectral features in ASR. In parametric spectral analysis, the LPC mel-cepstrum based on an all-pole model is widely used because of its simplicity of computation and high efficiency [1].

Another popular speech feature representation is known as RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction [2]. PLP was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency sub-band in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, for example from a telephone line [3]. RASTA-PLP outperforms PLP for recognition of channel-distorted speech.

Many pattern classifiers have been proposed for speech recognition. During the last several years, Gaussian Mixture Models (GMMs) have become very popular in speech recognition systems and have proven to perform very well for clean wideband speech [4]. However, in noisy environments or for noisy band-limited telephone speech the performance of GMMs degrades considerably.

Another well-known contemporary classification technique, Vector Quantization (VQ), was also tested on our dataset and showed a distinctly high recognition rate among the other classifiers. This led us to design a new probabilistic neural network (PNN) that, like VQ networks, uses a competitive function as its transfer function. Results show that this network outperforms all other pattern classifiers in recognition of the Persian single-digit numbers dataset.

PNNs are known to have good generalization properties and are trained faster than back-propagation ANNs. The faster training is achieved at the cost of increased complexity and higher computational and memory requirements [5].

II. SYSTEM CONCEPT

A. Dataset

A sequence of 10 isolated digits (0, 1, 2, …, 9) was recorded from 35 different speakers in the Computer Department of Iran University of Science and Technology. All voices were saved as mono wave files with a 16-bit sample size and a 16 kHz sampling rate, giving 350 wave files in total. We divided them into two separate parts: 20 speakers (200 wave files) for training and the remaining 15 speakers (150 wave files) for testing, so the ratio of training to test data is 4:3.
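For illustration only, the following minimal sketch shows one way such a speaker-based split could be arranged. The file-naming scheme (digit_speaker.wav) and directory layout are hypothetical, since the paper does not describe how the recordings are stored.

```python
from pathlib import Path

# Hypothetical layout: wavs/<digit>_<speaker_id>.wav, speakers numbered 1..35.
# Speakers 1-20 (200 files) form the training set, speakers 21-35 (150 files)
# the test set, matching the 4:3 train/test ratio reported above.
DATA_DIR = Path("wavs")

def speaker_id(wav_path: Path) -> int:
    # "3_17.wav" -> speaker 17 (naming scheme assumed for illustration)
    return int(wav_path.stem.split("_")[1])

all_files = sorted(DATA_DIR.glob("*.wav"))
train_files = [f for f in all_files if speaker_id(f) <= 20]
test_files = [f for f in all_files if speaker_id(f) > 20]

assert len(train_files) == 200 and len(test_files) == 150
```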
B. Features Extraction

The goal of feature extraction is to represent the speech signal by a finite number of measures of the signal. This is because the entirety of the information in the acoustic signal is too much to process, and not all of the information is relevant for specific tasks. In present ASR systems, the approach to feature extraction has generally been to find a representation that is relatively stable across different examples of the same speech sound, despite differences in the speaker or various environmental characteristics, while keeping the part that carries the message in the speech signal relatively intact.
In this paper two well-known methods are used. For calculating the MFCC features, each speech signal is divided into four equal-length segments and 12th-order coefficients are computed for each segment, using a 16 kHz sampling rate, a Hamming window, and 0.5 as the upper edge of the highest filter as a fraction of the sampling rate. So 48 feature values for each signal contribute to the training and test phases.

A more sophisticated technique is the RASTA-PLP approach. RASTA basically consists of additional steps applied after the critical-band integration of PLP. Based on the observation that human perception is more sensitive to relative changes, the static or slowly changing part of the signal (in each critical band) is effectively filtered out. Because most distortions due to unknown channel characteristics are convolutive, RASTA is best implemented in the logarithmic domain of the spectrum, which makes the different parts additive. As with MFCC, for calculating the RASTA-PLP coefficients the signal is separated into four segments, and again a 16 kHz sampling rate is used to compute 12th-order RASTA-PLP coefficients.

For some classifiers, adding the first- and second-order time derivatives of the coefficients gives better performance than pure PLP [6], but when calculating derivatives a shorter window than normal should be taken. In that case 144 (12*4*3) feature values for each signal contribute to the training and test phases. We refer to this method as RASTA-Δ2 PLP.
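As a rough illustration of the MFCC variant described above, the sketch below divides a signal into four equal segments and computes 12 cepstral coefficients per segment. It assumes the librosa library, and averaging the frame-level coefficients within each segment is our own assumption, since the paper does not state how the per-segment coefficients are pooled.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=16000, n_mfcc=12, n_segments=4):
    """48-dimensional MFCC feature vector: 12 coefficients for each of 4 segments."""
    y, _ = librosa.load(wav_path, sr=sr)           # resample to 16 kHz, mono
    seg_len = len(y) // n_segments
    feats = []
    for i in range(n_segments):
        seg = y[i * seg_len:(i + 1) * seg_len]
        # Hamming window and fmax = 0.5 * sr, as in the MFCC setup described above
        c = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc,
                                 window="hamming", fmax=sr / 2)
        feats.append(c.mean(axis=1))               # pool frames per segment (assumption)
    return np.concatenate(feats)                   # shape (48,)
```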
C. Recognition Methodology

In a multi-class setting such as ours, each classifier tries to identify whether the set of input feature vectors derived from the current signal belongs to a specific class of numbers and, if so, to which class exactly. For samples that cannot be assigned to a specific class, a random class is selected.

III. CLASSIFIERS

Several classifiers were tested on the dataset described above. The structures of the classifiers that were successful in recognition are described in the following subsections.

A. Gaussian Mixture Models

1) Structure: Gaussian mixture models (GMMs) are one of the semi-parametric techniques for estimating probability density functions (pdf) [7]. The output of a Gaussian mixture model is the weighted sum of R component densities, as shown in Fig. 1. The training of a GMM can be formulated as a maximum likelihood problem in which the mean vectors, covariance matrices and prior probabilities are estimated by the Expectation-Maximization (EM) algorithm. Given a set of N independent and identically distributed patterns X_i = {x(t); t = 1, 2, …, N} associated with class ω_i, it is assumed that the class likelihood function p(x(t) | ω_i) for class ω_i is a mixture of Gaussian distributions, denoted by:

p(x(t) | ω_i) = Σ_{r=1}^{R} P(θ_{r|i} | ω_i) p(x(t) | ω_i, θ_{r|i})    (1)

where θ_{r|i} represents the parameters of the rth mixture component and R is the total number of mixture components.

p(x(t) | ω_i, θ_{r|i}) = N(μ_{r|i}, Σ_{r|i})    (2)

Here N(μ_{r|i}, Σ_{r|i}) is a Gaussian distribution with mean μ_{r|i} and covariance Σ_{r|i}; the last equation denotes the probability density function of the rth component, and P(θ_{r|i} | ω_i) is the prior probability of the rth component.

Figure 1: Architecture of GMM

2) Effect of parameters in recognition: Here we assume the number of mixtures R = 3; increasing this parameter does not have a significant effect on performance. The method used to initialize the means and variances is important: after examining various methods, k-harmonic means initialization produced the best results in comparison with k-means and randomly selected data points. The number of EM iterations is set to 10, or training stops earlier when the increase in log-likelihood falls below 0.001.

In our work, the GMM usually recognizes all training samples with no mistakes, so its recognition rate in the training phase is nearly 1, surpassing all other classifiers in this respect. The GMM is also one of the classifiers in our work that obtains better results with RASTA-PLP than with MFCC.
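For illustration, a minimal sketch of such a per-class GMM classifier is given below. It uses scikit-learn's GaussianMixture, which offers k-means or random initialization rather than the k-harmonic means initialization used in our experiments, so it is an approximation of the setup described above rather than the exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One R-component GMM per digit class; prediction picks the class with the
    highest log-likelihood, following Eq. (1)."""
    def __init__(self, n_classes=10, n_mixtures=3):
        self.models = [GaussianMixture(n_components=n_mixtures,
                                       covariance_type="full",
                                       max_iter=10, tol=1e-3)
                       for _ in range(n_classes)]

    def fit(self, X, y):
        for c, gmm in enumerate(self.models):
            gmm.fit(X[y == c])                    # fit class-conditional density
        return self

    def predict(self, X):
        # log p(x | class c) for every sample and every class
        log_lik = np.column_stack([gmm.score_samples(X) for gmm in self.models])
        return log_lik.argmax(axis=1)
```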
Figure 2: Architecture of LVQ

B. Learning Vector Quantization

1) Structure: LVQ, or Learning Vector Quantization, is a prototype-based supervised classification algorithm. LVQ can be understood as a special case of an artificial neural network; more precisely, it applies a winner-take-all Hebbian learning-based approach. It is a precursor to self-organizing maps (SOM) and is related to the k-Nearest Neighbor algorithm (k-NN) [8].

An LVQ network has a first competitive layer and a second linear layer. The competitive layer learns to classify input vectors in much the same way as the competitive layers of self-organizing nets. The linear layer transforms the competitive layer's classes into the target classifications defined by the user. The network is given by prototypes W = (w(1), …, w(n)), and it changes the weights of the network in order to classify the data correctly. For each data point the prototype (neuron) that is closest to it is determined (called the winner neuron). The weights of the connections to this neuron are then adapted, i.e. moved closer to the data point if it is classified correctly, or moved away if it is classified incorrectly. For a test sample that LVQ cannot assign to any class, a random class is chosen. An advantage of LVQ is that it creates prototypes that are easy for experts in the field to interpret.
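The winner-take-all update just described can be written compactly as in the sketch below. It uses a single prototype per class for brevity, whereas the network described below uses as many competitive neurons as training speakers, and the prototype initialization from class means is our own assumption rather than the toolbox implementation used in our experiments.

```python
import numpy as np

def train_lvq1(X, y, n_classes=10, lr=0.01, epochs=150, seed=0):
    """LVQ1: one prototype per class, initialized at the class mean. The winner
    prototype moves toward a correctly classified sample, away from a wrong one."""
    rng = np.random.default_rng(seed)
    protos = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    proto_labels = np.arange(n_classes)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(protos - X[i], axis=1)
            w = d.argmin()                            # winner prototype
            sign = 1.0 if proto_labels[w] == y[i] else -1.0
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos, proto_labels

def predict_lvq(protos, proto_labels, X):
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return proto_labels[d.argmin(axis=1)]
```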
2) Effect of parameters in recognition: There are various types of LVQ; in this work LVQ1 is applied for classification. The number of neurons in the first layer is set to the number of training speakers. It is better to set the learning rate near zero; increasing it has an unpredictable influence on recognition accuracy. For initialization of the weights, LVQ needs the prior class percentages; assuming the same probability for each of the 10 classes, we set this parameter to 0.1 for all classes. The total number of training epochs is 150; increasing it yields no improvement in recognition performance.

C. Other Classifiers

Some other neural networks such as RBF, ARTMAP, and MLP have also been tested on our dataset. All of these networks achieved accuracy below 70% in the test phase. However, if in the RBF network we replace the transfer function of the second layer with a special linear layer and set the layer weights to the training data, a large improvement in performance occurs. This is one of the observations that led us to the probabilistic neural network proposed in this study.

K-means is a clustering method; here we define a special supervised variant of k-means that produces accurate results in the training and test phases. If one random sample from each of the 10 classes is selected and used as the initialization seed points for k-means, we can train on the data and obtain an accurate recognition rate on them. If this random selection of seed points is repeated more times, the recognition rate may improve. After the best training performance has been obtained, the corresponding seed points are taken as class centers, and the distance of each test sample to these centers is computed separately: for each test signal, the Euclidean distance between its coefficients and the coefficients of all ten centers is calculated, and the center with the minimum distance determines the class of the test sample.
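A minimal sketch of this supervised k-means variant (called Ek-means in Section V) might look as follows. It assumes scikit-learn's KMeans; the restart loop and the use of the final cluster centroids as class centers reflect our reading of the description above, and details such as the number of restarts are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_ek_means(X_train, y_train, n_classes=10, n_restarts=20, seed=0):
    """Supervised k-means: seed each of the 10 clusters with a random sample of the
    corresponding class, keep the restart whose clusters best match the labels."""
    rng = np.random.default_rng(seed)
    best_centers, best_acc = None, -1.0
    for _ in range(n_restarts):
        seeds = np.stack([X_train[rng.choice(np.flatnonzero(y_train == c))]
                          for c in range(n_classes)])
        km = KMeans(n_clusters=n_classes, init=seeds, n_init=1).fit(X_train)
        acc = (km.labels_ == y_train).mean()   # cluster r is tied to class r by its seed
        if acc > best_acc:
            best_acc, best_centers = acc, km.cluster_centers_
    return best_centers

def predict_ek_means(centers, X):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)                    # nearest center = predicted class
```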
IV. COMPETITIVE RADIAL BASIS PNN

As described in the previous section, LVQ and GMM achieve the best recognition rates of all the tested classifiers. This observation led to the design of a probabilistic neural network that uses a Gaussian distribution as the probability kernel function of the network and, like the first layer of LVQ, a competitive function as its transfer function. In the following subsections the network procedure and architecture are described in detail.
Figure 3: Architecture of the proposed radial basis PNN
R = number of elements in the input data; Q = number of input/target pairs = number of neurons in layer 1; L = number of classes of the input data = number of neurons in layer 2

A. Procedure

The radial basis probabilistic neural network (RBPNN) model is in essence developed from the radial basis function neural network (RBFNN) and the probabilistic neural network. The RBPNN therefore possesses the common characteristic of the two original networks, i.e. the signal is fed forward from the input layer to the output layer without any feedback connections. At the same time, the RBPNN to some extent reduces the demerits of the two original models.

When an input is presented, the first layer computes the distances from the input vector to the training input vectors and produces a vector whose elements indicate how close the input is to each training input. The second layer sums these contributions for each class of inputs to produce, as its net output, a vector of probabilities. Finally, a competitive transfer function on the output of the second layer picks the maximum of these probabilities and produces a 1 for that class and a 0 for the other classes.

B. Network Architecture

The network architecture is illustrated in Figure 3. It is assumed that there are Q input vector/target vector pairs. Each target vector has L elements, one of which is 1 while the others are 0; thus each input vector is associated with one of L classes. The first-layer input weights, IW^{1,1}, are set to the transpose of the matrix formed from the Q training pairs, P'. This is exactly the modification mentioned in the previous section for improving the performance of RBF. When an input is presented, the ||dist|| function produces a vector whose elements indicate how close the input is to the vectors of the training set. These elements are multiplied, element by element, by the bias and sent to the radbas transfer function, which is defined by:

radbas(n) = e^{-n^2}    (3)

An input vector close to a training vector is represented by a number close to 1 in the output vector a^1. If an input is close to several training vectors of a single class, it is represented by several elements of a^1 that are close to 1.

a^1_i = radbas(||IW^{1,1}_i - P|| b^1_i)    (4)

The second-layer weights, LW^{2,1}, are set to the matrix T of target vectors. Each target vector has a 1 only in the row associated with that particular class of input and 0's elsewhere. The multiplication T a^1 sums the elements of a^1 belonging to each of the L input classes. Finally, the second-layer transfer function, compet, produces a 1 corresponding to the largest element of n^2 and 0's elsewhere:

a^2 = compet(LW^{2,1} a^1)    (5)

where compet is a transfer function that takes one input argument, a matrix of net input vectors, and returns output vectors with a 1 where each net input vector has its maximum value and 0 elsewhere.

Thus the network classifies the input vector into one specific class out of the L classes, because that class has the maximum probability of being correct. So each input is always assigned exactly one class as output.
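To make Eqs. (3)-(5) concrete, the following NumPy sketch implements the forward pass of the described network. The bias value b = sqrt(ln 2)/spread follows the usual radial-basis convention, and the spread parameter itself is an assumption, since its value is not reported here.

```python
import numpy as np

class CompetitiveRBPNN:
    """Forward pass of the proposed network: a radial basis layer whose weights are
    the Q training vectors, followed by a competitive layer whose weights are the
    target (class indicator) matrix T."""
    def __init__(self, spread=1.0):
        self.spread = spread

    def fit(self, P, y, n_classes=10):
        self.IW = P.copy()                      # IW^{1,1}: one row per training vector (Q x R)
        self.b = np.sqrt(np.log(2.0)) / self.spread
        self.T = np.eye(n_classes)[y].T         # LW^{2,1} = T, shape (L x Q)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            dist = np.linalg.norm(self.IW - x, axis=1)   # ||IW_i - P|| for every neuron
            a1 = np.exp(-(dist * self.b) ** 2)           # Eqs. (3)-(4): radbas output
            n2 = self.T @ a1                             # class-wise sums of a1
            a2 = (n2 == n2.max()).astype(float)          # Eq. (5): compet picks the maximum
            preds.append(int(a2.argmax()))
        return np.array(preds)
```

In this form the second layer simply accumulates the Gaussian kernel responses of each class's training vectors before the competitive layer selects the class with the largest sum.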
V. EXPERIMENTS AND RESULTS

A. Classifiers Comparison

As mentioned in the previous sections, the 350 wave files containing the voices of 35 speakers were divided into two parts, training and test, with a ratio of 4:3. All of the speakers are male. Here the 4 classifiers with the best performance are selected. We call the modified version of k-means introduced in the third section edited k-means, or Ek-means; the results of this classifier are good and are generally better than those of the GMM.

In the training phase nearly all of the described classifiers recognized the training patterns with accuracy above 95%. The best performance belongs to the GMM, with accuracy near 1 in this phase.

In the test phase each classifier has been tested with each of the feature types described before. The performance of the different classifiers in the test phase for all three feature types can be seen in Table 1. Clearly, RBPNN with MFCC obtains the best result among all of the methods. It is also noticeable that for all classifiers, adding the time derivatives of PLP brings no improvement in recognition rates. Ek-means, like GMM and LVQ, produces better output with RASTA-PLP than with MFCC; the proposed method is the only one that obtains better accuracy with MFCC.

As described in the feature extraction subsection, the original signal is divided into four parts, because the average number of phonemes in the Persian numbers from 1 to 10 is about four. However, some classifiers produce better output when this factor is set to three, so the overall performance reported for each classifier in Table 1 is the one obtained in its best configuration.

Table 1: Performance Comparison (%)

Features     MFCC    RASTA-PLP    RASTA-Δ2 PLP
RBPNN        99.1    95.3         90.1
LVQ          96.7    97.3         86.3
GMM          90.8    91.7         87.2
Ek-means     94.3    96.8         78.3
B. Capability of Proposed Method

There is something remarkable about the accuracy of this version of RBPNN: if the three other classifiers compared in Table 1 are combined into a single classifier with a majority-vote approach, the combination reaches an accuracy of about 98.2%, which is still about 1% less than the proposed competitive RBPNN.

Some methods have been proposed for improving the performance of RBPNN. Recursive Orthogonal Least Squares algorithms combined with Micro-Genetic Algorithms (μ-GA) were tested in [9] for optimizing and improving the results on the spirals classification problem and the IRIS classification problem. The results show that, after optimization, the quality of the network grows and it is clearly better than traditional PNN and RBF networks. Such approaches can be considered in future work.

ACKNOWLEDGMENT

The authors would like to thank Dr. A. Akbari for his permission to use the wave files as a dataset. The authors would also like to thank the reviewers for their comments, which greatly improved the manuscript.

REFERENCES

[1] H. Matsumoto, M. Moroto, "Evaluation of Mel-LPC cepstrum in a large vocabulary continuous speech recognition", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 1, pp. 117-120, 2001.
[2] C. Gaudard, G. Aradilla, "Speech Recognition based on Template Matching and Phone Posterior Probabilities", IDIAP-Com 07-02, 2007.
[3] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", The Journal of the Acoustical Society of America, Volume 87, Issue 4, pp. 1738-1752, April 1990.
[4] T. Ganchev, A. Tsopanoglou, N. Fakotakis, G. Kokkinakis, "Probabilistic neural networks combined with GMMs for speaker recognition over telephone channels", 14th International Conference on Digital Signal Processing, Volume II, pp. 1081-1084, July 2002.
[5] T. Ganchev, D. K. Tasoulis, M. N. Vrahatis, "Locally Recurrent Probabilistic Neural Network for Text-Independent Speaker Verification", in Proc. of EuroSpeech 2003, vol. 3, pp. 762-766, 2003.
[6] Q. Li, F. K. Soong, O. Siohan, "An Auditory System-based Feature for Robust Speech Recognition", 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, pp. 619-621, September 2001.
[7] K. K. Yiu, M. W. Mak, C. K. Li, "Gaussian mixture models and probabilistic decision-based neural networks for pattern classification: A comparative study", Neural Computing and Applications 8(3), pp. 235-245, 1999.
[8] T. Kosaka, S. Omatu, "Classification of the Italian Lira using the LVQ method", IEEE International Conference on Systems, Man, and Cybernetics, Volume 4, pp. 2769-2774, 2000.
[9] W. Zhao, D.-S. Huang, L. Guo, "Optimizing Radial Basis Probabilistic Neural Networks Using Recursive Orthogonal Least Squares Algorithms Combined with Micro-Genetic Algorithms", Proceedings of the International Joint Conference on Neural Networks, Volume 3, pp. 2277-2282, July 2003.
[10] L. Rutkowski, "Adaptive Probabilistic Neural Networks for Pattern Classification in Time-Varying Environment", IEEE Transactions on Neural Networks, Vol. 15, No. 4, pp. 811-827, 2004.
[11] J. B. Gomm, D. L. Yu, "Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training", IEEE Transactions on Neural Networks, Vol. 11, No. 2, March 2000.
[12] F. Gorunescu, "Architecture of probabilistic neural networks: estimating the adjustable smoothing factor", Research Notes in Artificial Intelligence and Digital Communications, 104, pp. 56-62, 2004.
