Abstract— Automatic speech recognition (ASR) has been a subject of active research over the last few decades. In this paper we study the applicability of a special model of radial basis probabilistic neural networks (RBPNN) as a classifier for speech recognition. This type of network is a combination of a Radial Basis Function (RBF) network and a Probabilistic Neural Network (PNN); it applies characteristics of both networks and finally uses a competitive function to compute the final result. The proposed network has been tested on a Persian single-digit numbers dataset and produced a significantly lower recognition error rate in comparison with other common pattern classifiers. All classifiers use Mel-scale Frequency Cepstrum Coefficients (MFCC) and a special type of Perceptual Linear Predictive (PLP) coefficients as their features for classification. Results show that for our proposed network the MFCC features yield better performance than PLP.

I. INTRODUCTION

A major problem in a speech recognition system is the choice of a suitable feature set that can accurately describe, in an abstract way, the original highly redundant speech signal. In non-parametric spectral analysis, mel-frequency cepstral coefficients (MFCC) are one of the most popular spectral features in ASR. In parametric spectral analysis, the LPC mel-cepstrum based on an all-pole model is widely used because of its simplicity in computation and high efficiency [1].

Another popular speech feature representation is known as RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction [2]. PLP was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency sub-band in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, for example from a telephone line [3]. RASTA-PLP outperforms PLP for recognition of channel-distorted speech.

Many pattern classifiers have been proposed for speech recognition. During the last several years, Gaussian Mixture Models (GMMs) have become very popular in speech recognition systems and have proven to perform very well for clean wideband speech [4]. However, in noisy environments or for noisy band-limited telephone speech, the performance of GMMs degrades considerably.

Another well-known contemporary classification technique, Vector Quantization (VQ), was tested with our dataset and showed a notably high recognition rate among the other classifiers. We therefore decided to design a new probabilistic neural network (PNN) that, like VQ networks, uses a competitive function as its transfer function. Results show that this network outperforms all other pattern classifiers in recognizing the Persian single-digit numbers dataset.

PNNs are known to have good generalization properties and are trained faster than back-propagation ANNs. The faster training is achieved at the cost of increased complexity and higher computational and memory requirements [5].

II. SYSTEM CONCEPT

A. Dataset

A sequence of 10 isolated digits (0, 1, 2, …, 9) spoken by 35 different speakers was recorded in the Computer Department of Iran University of Science and Technology. All voices were saved as mono wave files with a 16-bit sample size and a 16 kHz sampling rate, giving 350 wave files in total. We divided them into two separate parts: 20 speakers (200 wave files) for training and the 15 remaining speakers (150 wave files) for testing, so the train-to-test ratio is 4:3.

B. Feature Extraction

The goal of feature extraction is to represent the speech signal by a finite number of measures of the signal. This is because the entirety of the information in the acoustic signal is too much to process, and not all of the information is relevant for specific tasks. In present ASR systems, the approach to feature extraction has generally been to find a representation that is relatively stable for different examples of the same speech sound, despite differences in the speaker or various environmental characteristics, while keeping the part that represents the message in the speech signal relatively intact.

In this paper two well-known methods are used. To calculate the MFCC features, each speech signal is divided into four equal-length segments, and 12th-order coefficients are computed for each segment using a 16 kHz sampling rate, a Hamming window, and 0.5 as the high end of the highest filter as a fraction of the sampling rate. So 48 features for each signal contribute to the training and test phases.

A more sophisticated technique is the RASTA-PLP approach. RASTA basically consists of additional steps applied after the critical-band integration of PLP. Based on the observation that human perception is more sensitive to relative changes, the static or slowly changing part of the signal (in each critical band) is effectively filtered out. Because most distortions due to unknown channel characteristics are convolutive, RASTA is best implemented in the logarithmic domain of the spectrum, which makes the different parts additive. As with MFCC, for calculating the RASTA-PLP coefficients the signal is separated into four segments, and a 16 kHz sampling rate is again applied to compute 12th-order RASTA-PLP coefficients.

For some classifiers, adding the first- and second-order time derivatives of the coefficients yields better performance than pure PLP [6], although for calculating the derivatives a shorter window than normal should be taken. In this case 144 (12*4*3) features for each signal contribute to the training and test phases. This method is RASTA-Δ2 PLP.

C. Recognition Methodology

In a multi-class setting such as ours, each classifier tries to identify whether the set of input feature vectors derived from the current signal belongs to a specific class of numbers or not, and to which class exactly. For samples that cannot be assigned to a specific class, a random class is selected.

III. CLASSIFIERS

Several classifiers were tested on the mentioned dataset. The structures of the classifiers that were successful in recognition are described in the following subsections.

A. Gaussian Mixture Models

1) Structure: Gaussian mixture models (GMMs) are one of the semi-parametric techniques for estimating probability density functions (pdf) [7]. The output of a Gaussian mixture model is the weighted sum of R component densities, as shown in Fig. 1. The training of a GMM can be formulated as a maximum likelihood problem in which the mean vectors, covariance matrices, and prior probabilities are estimated by the Expectation-Maximization (EM) algorithm. Given a set of N independent and identically distributed patterns Xi = { x(t) ; t = 1, 2, …, N } associated with class ωi, it is assumed that the class likelihood function p(x(t) | ωi) for class ωi is a mixture of Gaussian distributions, denoted by:

p(x(t) | ωi) = Σ_{r=1}^{R} P(θ_{r|i} | ωi) p(x(t) | ωi, θ_{r|i})   (1)

where θ_{r|i} represents the parameters of the rth mixture component and R is the total number of mixture components.

p(x(t) | ωi, θ_{r|i}) = N(μ_{r|i}, Σ_{r|i})   (2)

Typically N(μ_{r|i}, Σ_{r|i}) is a Gaussian distribution with mean μ_{r|i} and covariance Σ_{r|i}. The last equation denotes the probability density function of the rth component, and P(θ_{r|i} | ωi) is the prior probability of the rth component.

Figure 1: Architecture of GMM

2) Effect of parameters on recognition: Here we set the number of mixtures to R = 3; increasing this parameter does not have a significant effect on performance. The method of initializing the means and variances is important: after examining various methods, k-harmonic-means initialization produced the best results in comparison with k-means and randomly selected data points. The number of iterations is set to 10, or training stops earlier when the increase in log-likelihood falls below 0.001.

In our work, in the training phase GMM usually recognizes all samples without mistakes, so its recognition rate is nearly 1, which surpasses all other classifiers in this respect. GMM is one of the classifiers in our work that obtains better results with RASTA-PLP than with MFCC.
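The decision rule implied by equations (1) and (2) can be sketched as follows: sum the log-likelihoods of all feature frames under each class's trained GMM and pick the class with the highest total. This is an illustrative sketch only, not the paper's implementation; it assumes diagonal covariances and already-trained parameters (in the paper these are fitted by EM), and all names are our own.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """log N(x; mean, diag(var)) for one diagonal-covariance component, eq. (2)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def class_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log p(x(t) | class), the mixture of eq. (1)."""
    total = 0.0
    for x in frames:
        # log of the weighted sum of R component densities, done stably in log space
        comp = [np.log(w) + log_gaussian_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        total += np.logaddexp.reduce(comp)
    return total

def classify(frames, class_params):
    """Pick the class whose GMM assigns the highest total log-likelihood."""
    scores = {c: class_log_likelihood(frames, *p) for c, p in class_params.items()}
    return max(scores, key=scores.get)
```

With R = 3 components per class (as in the paper), `class_params` would map each digit class to its (weights, means, variances) triple estimated by EM on the training speakers.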
Figure 2: Architecture of LVQ
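The LVQ architecture in Figure 2 consists of a competitive distance layer (the ||dist|| block) that finds the prototype nearest to the input, followed by a linear layer (LW2,1) that maps the winning prototype to its class. A minimal sketch of this recall pass, together with an LVQ1-style training step, is given below; it is illustrative only and all names are our own, not the network configuration used in the paper.

```python
import numpy as np

def lvq_classify(x, prototypes, proto_class):
    """Competitive layer: the prototype nearest to x wins; the linear
    layer then maps the winner index to its assigned class label."""
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
    return proto_class[winner]

def lvq1_update(x, label, prototypes, proto_class, lr=0.05):
    """LVQ1 training step: move the winning prototype toward x if its
    class matches the true label, and away from x otherwise."""
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))
    sign = 1.0 if proto_class[winner] == label else -1.0
    prototypes[winner] += sign * lr * (x - prototypes[winner])
    return prototypes
```

In a digit-recognition setting, each class would own several prototype vectors in the feature space, trained by repeated LVQ1 updates over the training utterances.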