1. INTRODUCTION
Speech recognition systems commonly carry out some form of classification based on
speech features, which are usually obtained via Fourier Transforms (FTs), Short-Time
Fourier Transforms (STFTs), or Linear Predictive Coding (LPC) techniques. However, these
methods have some disadvantages: they assume signal stationarity within a given time frame
and may therefore lack the ability to analyze localized events correctly. The wavelet
transform copes with some of these problems; other factors favoring Wavelet Transforms
(WTs) over conventional methods include their ability to capture localized features. In this
project the Discrete Wavelet Transform (DWT) is used for speech processing.
MAIN PROJECT ‘10 SPEECH RECOGNITION USING WAVELET TRANSFORM
2. LITERATURE SURVEY
The desire for automation of simple tasks is not a modern phenomenon, but one
that goes back more than one hundred years in history. By way of example, in 1881
Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter invented a
recording device that used a rotating cylinder with a wax coating on which up-and-down
grooves could be cut by a stylus, which responded to incoming sound pressure (in much the
same way as a microphone that Bell invented earlier for use with the telephone). Based on
this invention, Bell and Tainter formed the Volta Graphophone Co. in 1888 in order to
manufacture machines for the recording and reproduction of sound in office environments.
The American Graphophone Co., which later became the Columbia Graphophone Co.,
acquired the patent in 1907 and trademarked the term “Dictaphone.” Just about the same
time, Thomas Edison invented the phonograph using a tinfoil based cylinder, which was
subsequently adapted to wax, and developed the “Ediphone” to compete directly with
Columbia. The purpose of these products was to record dictation of notes and letters for a
secretary (likely in a large pool that offered the service) who would later type them out
(offline), thereby circumventing the need for costly stenographers.
A similar kind of automation took place a century later in the 1990's in the area
of “call centers.” A call center is a concentration of agents or associates that handle telephone
calls from customers requesting assistance. Among the tasks of such call centers are routing
the in-coming calls to the proper department, where specific help is provided or where
transactions are carried out. One example of such a service was the AT&T Operator line
which helped a caller place calls, arrange payment methods, and conduct credit card
transactions. The number of agent positions (or stations) in a large call center could reach
several thousand Automatic speech recognition.
leather, the configuration of which could be altered or controlled with a hand to produce
different speech-like sounds.
During the first half of the 20th century, work by Fletcher [8] and others at Bell
Laboratories documented the relationship between a given speech spectrum (which is the
distribution of power of a speech sound across frequency), and its sound characteristics as
well as its intelligibility, as perceived by a human listener. In the 1930's Homer Dudley,
influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER
(Voice Operating Demonstrator), which was an electrical equivalent (with mechanical
control) of Wheatstone's mechanical speaking machine. Dudley's VODER consisted of a
wrist bar for selecting either a relaxation oscillator output or noise as the driving signal, and
a foot pedal to control the oscillator frequency (the pitch of the synthesized voice). The
driving signal was passed through ten band-pass filters whose output levels were controlled
by the operator's fingers. These ten band-pass filters were used to alter the power distribution
of the source signal across a frequency range, thereby determining the characteristics of the
speech-like sound at the loudspeaker. Thus to synthesize a sentence, the VODER operator
had to learn how to control and “play” the VODER so that the appropriate sounds of the
sentence were produced. The VODER was demonstrated at the World's Fair in New York City
in 1939 and was considered an important milestone in the evolution of speaking machines.
Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the
importance of the signal spectrum for reliable identification of the phonetic nature of a speech
sound. Following the convention established by these two outstanding scientists, most
modern systems and algorithms for speech recognition are based on the concept of
measurement of the (time-varying) speech power spectrum (or its variants such as the
cepstrum), in part due to the fact that measurement of the power spectrum from a signal is
relatively easy to accomplish with modern digital signal processing techniques.
Early attempts to design systems for automatic speech recognition were mostly
guided by the theory of acoustic-phonetics, which describes the phonetic elements of speech
(the basic sounds of the language) and tries to explain how they are acoustically realized in a
spoken utterance. These elements include the phonemes and the corresponding place and
manner of articulation used to produce the sound in various phonetic contexts. For example,
in order to produce a steady vowel sound, the vocal cords need to vibrate (to excite the vocal
tract), and the air that propagates through the vocal tract results in sound with natural modes
of resonance similar to what occurs in an acoustic tube. These natural modes of resonance,
called the formants or formant frequencies, are manifested as major regions of energy
concentration in the speech power spectrum. In 1952, Davis, Biddulph, and Balashek of Bell
Laboratories built a system for isolated digit recognition for a single speaker, using the
formant frequencies measured (or estimated) during vowel regions of each digit. These
trajectories served as the “reference pattern” for determining the identity of an unknown digit
utterance as the best matching digit.
In other early recognition systems of the 1950's, Olson and Belar of RCA
Laboratories built a system to recognize 10 syllables of a single talker and at MIT Lincoln
Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer. In the 1960's,
several Japanese laboratories demonstrated their capability of building special purpose
hardware to perform a speech recognition task. Most notable were the vowel recognizer of
Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and
Doshita at Kyoto University, and the digit recognizer of NEC Laboratories [14]. The work of
Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of
speech in different portions of the input utterance. In contrast, an isolated digit recognizer
implicitly assumed that the unknown utterance contained a complete digit (and no other
speech sounds or words) and thus did not need an explicit "segmenter." Kyoto University's
work could be considered a precursor to a continuous speech recognition system.
Advancement in technology
Finally, in the last few years, we have seen the introduction of very large vocabulary systems
with full semantic models, integrated with text-to-speech (TTS) synthesis systems, and
multimodal inputs (pointing, keyboards, mice, etc.). These systems enable spoken dialog systems
with a range of input and output modalities for ease-of-use and flexibility in handling adverse
environments where speech might not be as suitable as other input-output modalities. During
this period we have seen the emergence of highly natural concatenative speech synthesis
systems, the use of machine learning to improve both speech understanding and speech
dialogs, and the introduction of mixed-initiative dialog systems to enable user control when
necessary.
After nearly five decades of research, speech recognition technologies have finally
entered the marketplace, benefiting the users in a variety of ways. Throughout the course of
development of such systems, knowledge of speech production and perception was used in
establishing the technological foundation for the resulting speech recognizers. Major
advances, however, were brought about in the 1960's and 1970's via the introduction of
advanced speech representations based on LPC analysis and cepstral analysis methods, and in
the 1980's through the introduction of rigorous statistical methods based on hidden Markov
models. All of this came about because of significant research contributions from academia,
private industry and the government. As the technology continues to mature, it is clear that
many new applications will emerge and become part of our way of life – thereby taking full
advantage of machines that are partially able to mimic human speech capabilities.
1. Database collection
2. Speech signal decomposition
3. Feature extraction
4. Developing a classifier
The next step is speech signal decomposition. For this we can use different
techniques such as LPC, MFCC, STFT, or the wavelet transform. Over the past ten years the
wavelet transform has been used increasingly in speech recognition. Speech recognition systems generally carry
out some kind of classification/recognition based upon speech features which are usually
obtained via time-frequency representations such as Short Time Fourier Transforms (STFTs)
or Linear Predictive Coding (LPC) techniques. In some respects, these methods may not be
suitable for representing speech; they assume signal stationarity within a given time frame
and may therefore lack the ability to analyze localized events accurately. Furthermore, the
LPC approach assumes a particular linear (all-pole) model of speech production which
strictly speaking is not the case.
Wavelet Transform
The wavelet transform provides a time-frequency representation of a signal. (Other
transforms, such as the short-time Fourier transform and Wigner distributions, give this
information too.)
There are two methodologies for speech decomposition using wavelets: the Discrete
Wavelet Transform (DWT) and Wavelet Packet Decomposition (WPD). Of the two, the
DWT is used in our project.
The transform of a signal is just another form of representing the signal. It does
not change the information content present in the signal. For many signals, the low-frequency
part is the most important: it gives the signal its identity. Consider the human voice. If we
remove the high-frequency components, the voice sounds different, but we can still tell what
is being said. In wavelet analysis, we therefore speak of approximations and details. The
approximations are the high-scale, low-frequency components of the signal; the details are
the low-scale, high-frequency components. The DWT is defined by the following equation:

    W(j, k) = Σ_n x(n) 2^(−j/2) ψ(2^(−j) n − k)
Where ψ(t) is a time function with finite energy and fast decay called the mother wavelet.
The DWT analysis can be performed using a fast, pyramidal algorithm related to multi-rate
filter banks. As a multi-rate filter bank, the DWT can be viewed as a constant-Q filter bank
with octave spacing between the centers of the filters. Each sub-band contains half the
samples of the neighboring higher-frequency sub-band. In the pyramidal algorithm the signal
is analyzed in different frequency bands with different resolutions by decomposing the signal
into a coarse approximation and detail information. The coarse approximation is then further
decomposed using the same wavelet decomposition step. This is achieved by successive
high-pass and low-pass filtering of the time-domain signal and is defined by the following
equations:

    y_high[k] = Σ_n x[n] · g[2k − n]
    y_low[k]  = Σ_n x[n] · h[2k − n]

where g and h are the high-pass and low-pass analysis filters and the index 2k implements
the downsampling by 2.
Figure 1: Signal x[n] is passed through low-pass and high-pass filters and downsampled by 2
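The filtering-and-downsampling step of Figure 1 can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only: it uses the two-tap Haar filter pair because its coefficients fit on one line, whereas the project itself uses the Daubechies-4 filters.

```python
import numpy as np

def dwt_step(x, h, g):
    """One level of the pyramidal DWT: filter the signal with the
    low-pass (h) and high-pass (g) analysis filters, then keep every
    second sample (downsampling by 2)."""
    approx = np.convolve(x, h)[1::2]   # low-frequency approximation
    detail = np.convolve(x, g)[1::2]   # high-frequency detail
    return approx, detail

# Haar analysis filters (the simplest orthogonal pair)
h = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
g = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = dwt_step(x, h, g)
# Each sub-band holds half the samples of the input, and for an
# orthogonal filter pair the signal energy is preserved:
# sum(a**2) + sum(d**2) == sum(x**2)
```

Applying the same step again to `a` yields the next, coarser level of the pyramid.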
The mother wavelet used is the Daubechies-4 (db4) wavelet, which has a comparatively
large number of filter coefficients. Daubechies wavelets are among the most popular
wavelets; they represent the foundations of wavelet signal processing and are used in
numerous applications. They are also called maxflat wavelets because their frequency
responses have maximum flatness at frequencies 0 and π.
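The maxflat property is easy to check numerically. The sketch below is not part of the project code; it simply evaluates the frequency response of the four Daubechies-4 low-pass taps at ω = 0 and ω = π:

```python
import numpy as np

s3 = np.sqrt(3.0)
# The four low-pass filter taps of the Daubechies-4 wavelet
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))

def freq_response(h, w):
    """H(e^{jw}) of an FIR filter with taps h."""
    n = np.arange(len(h))
    return np.sum(h * np.exp(-1j * w * n))

# Maximum flatness: full DC gain at w = 0 and a (double) zero at w = pi
dc_gain = abs(freq_response(h, 0.0))         # equals sqrt(2)
nyquist_gain = abs(freq_response(h, np.pi))  # equals 0 (up to rounding)
```

The DC gain √2 and the zero at π are exactly the behavior expected of an orthogonal wavelet low-pass filter.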
The following features are extracted from each sub-band:

The mean of the absolute values of the coefficients in each sub-band. These features provide
information about the frequency distribution of the signal.

The standard deviation of the coefficients in each sub-band. These features provide
information about the amount of change in the frequency distribution.

The energy of each sub-band of the signal. These features provide information about the
energy content of each sub-band.

The kurtosis of each sub-band of the signal. These features measure whether the data are
peaked or flat relative to a normal distribution.

The skewness of each sub-band of the signal. These features measure the symmetry, or lack
of symmetry, of the distribution.
These features are then combined into a hybrid feature vector and fed to a classifier. The
features are arranged in a matrix in which all the features of one sample form one column.
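A minimal sketch of this feature computation, assuming nine sub-bands (one approximation plus eight details) and computing the five statistics with plain NumPy; the name `subband_features` is ours, not the project's:

```python
import numpy as np

def subband_features(subbands):
    """For every wavelet sub-band compute mean |c|, standard deviation,
    energy, excess kurtosis and skewness, and stack everything into a
    single column vector (one column per utterance)."""
    feats = []
    for c in subbands:
        c = np.asarray(c, dtype=float)
        mu, sd = c.mean(), c.std()
        z = (c - mu) / sd
        feats += [
            np.abs(c).mean(),     # mean absolute value
            sd,                   # standard deviation
            np.sum(c ** 2),       # energy
            np.mean(z ** 4) - 3,  # excess kurtosis (0 for a normal law)
            np.mean(z ** 3),      # skewness (0 for a symmetric law)
        ]
    return np.array(feats).reshape(-1, 1)

# One approximation band plus eight detail bands -> 9 * 5 = 45 features
rng = np.random.default_rng(0)
bands = [rng.standard_normal(64) for _ in range(9)]
col = subband_features(bands)   # shape (45, 1): one column of the matrix
```

Columns produced this way for every utterance are concatenated horizontally to form the feature matrix fed to the classifier.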
Generally, three methods are commonly used in speech recognition: Dynamic Time
Warping (DTW), Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs).
Dynamic time warping (DTW) is a technique that finds the optimal alignment
between two time series when one of them may be warped non-linearly by stretching or
shrinking it along its time axis. This warping can then be used to find corresponding regions
between the two time series or to determine their similarity.
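The DTW recurrence can be written down directly from this description. The implementation below is a generic textbook version, not the project's own code: the cumulative cost D[i][j] is the local distance plus the cheapest of the three predecessor cells.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between sequences a and b:
    D[i][j] = |a[i]-b[j]| + min(D[i-1][j], D[i][j-1], D[i-1][j-1])."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A non-linearly stretched copy aligns perfectly:
print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 3, 4]))   # -> 0.0
```

For word recognition, the unknown utterance is compared against each stored template and the word with the smallest DTW cost wins.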
Hidden Markov Models are finite automata with a given number of states; the
model passes from one state to another instantaneously at equally spaced time instants. At
every transition the system generates observations, so two processes take place: a transparent
one, represented by the observation string (the feature sequence), and a hidden one, which
cannot be observed, represented by the state string. The main point of this method is its
timing sequence and its comparison procedure.
Nowadays, ANNs are widely used for their parallel distributed processing,
distributed memory, error tolerance, and pattern-learning ability. The complexity of all these
systems increases as their generality rises. The biggest restriction of the first two methods is
their low speed when searching and comparing models, whereas ANNs are faster because the
output results from multiplying the adjusted weights by the present input. At present the
TDNN (Time-Delay Neural Network) is widely used in speech recognition.
Neural Networks
An Artificial Neuron
The number of neurons in each hidden layer has a direct impact on the
performance of the network during training as well as during operation. Having more
neurons than needed runs the network into overfitting, a situation in which the network
memorizes the training examples; networks that overfit perform well on training examples
and poorly on unseen examples. Conversely, having fewer neurons than needed causes the
network to underfit: the architecture cannot cope with the complexity of the problem at
hand, which results in inadequate modeling and therefore poor performance of the network.
After development, the classifier goes through two phases: training and testing. In
the training phase the features of the samples are fed as input to the ANN and the target is
set; the network is then trained, adjusting its weights so that the target is achieved for the
given input. In this project we have used the 'tansig' and 'logsig' transfer functions, so the
output is bounded between 0 and 1. The target output is set to 0.9 0.1 0.1 … 0.1 for the 1st
word, 0.1 0.9 0.1 … 0.1 for the 2nd word, and so on. The position of the maximum value
corresponds to the recognized word.
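The target-encoding scheme above can be sketched as follows. The helper names `make_targets` and `decode` are illustrative, not functions from the project:

```python
import numpy as np

def make_targets(n_words, hi=0.9, lo=0.1):
    """Target matrix for the ANN: the column for word k is `hi` at
    row k and `lo` everywhere else, matching the (0, 1) range of a
    logsig output layer."""
    T = np.full((n_words, n_words), lo)
    np.fill_diagonal(T, hi)
    return T

def decode(output):
    """The recognized word is the position of the maximum output."""
    return int(np.argmax(output))

T = make_targets(5)          # 5 words -> 5 target columns
print(decode(T[:, 2]))       # -> 2 (the third word)
```

Using 0.9/0.1 rather than 1/0 keeps the targets away from the saturated ends of the sigmoid, which generally speeds up training.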
The next phase is testing. The samples set aside for testing are given to the
classifier and the output is noted. If we do not get the desired output, we reach the required
output by adjusting the number of neurons.
4. OBSERVATION
20 samples of each word were recorded from different people, and these samples were
then normalized by dividing by their maximum values. They were then decomposed using
the wavelet transform up to eight levels, since the majority of the information about the
signal is present in the low-frequency region.
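The normalization and eight-level decomposition can be sketched as a simple loop. For brevity this illustration uses the Haar filter pair in place of the Daubechies-4 filters actually used in the project:

```python
import numpy as np

def normalize(x):
    """Divide the recording by its maximum absolute value so that
    every sample lies in [-1, 1]."""
    return x / np.max(np.abs(x))

def decompose(x, levels=8):
    """Split off the detail band and keep decomposing the approximation.
    Returns the final approximation plus `levels` detail bands."""
    h = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass (Haar)
    g = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass (Haar)
    details, a = [], x
    for _ in range(levels):
        details.append(np.convolve(a, g)[1::2])  # detail at this level
        a = np.convolve(a, h)[1::2]              # coarser approximation
    return a, details

x = normalize(np.random.default_rng(1).standard_normal(2048))
a, details = decompose(x, levels=8)
# len(details) == 8, and each level halves the sample count:
# 2048 -> 1024 -> 512 -> ... -> 8
```

The returned approximation plus eight detail bands are exactly the nine sub-bands from which the feature statistics are computed.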
In order to classify the signals, an ANN is developed and trained with the target
outputs fixed as described above. Out of the 20 samples recorded for each word, 16 samples
are used to train the ANN and the remaining 4 samples are used for testing.
Plots
DWT Tree
The 8-level decomposition tree for a signal using the DWT is shown in the
figure; it produces one set of approximation coefficients and eight sets of detail
coefficients.
Out of the 20 samples recorded for each word, 16 were used for training. We
tested our program's accuracy with the remaining 4 samples per word. A total of 20 samples
were tested (4 samples for each of the 5 words) and the program yielded the right result for
all 20 samples. Thus, we obtained 100% accuracy with pre-recorded samples.
Real-time testing:
For real-time testing, we recorded a sample with a microphone and directly executed the
program on this sample. A total of 30 samples were tested, out of which 20 gave the right
result. This gives an accuracy of about 66% with real-time samples.
Changes in efficiency obtained by varying the parameters of the ANN were observed and
are plotted below.

Plot 1: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 15
Plot 2: Accuracy with a 2-layer feed-forward network, number of neurons in the first layer = 20
Plot 3: Accuracy with a 3-layer feed-forward network, number of neurons in the first
layer N1 = 15 and number of neurons in the second layer N2 = 5
7. CONCLUSION
Speech recognition is an advanced and active research area, and much work is
under way on new and enhanced approaches. During the experiment we observed the
effectiveness of the Daubechies-4 mother wavelet in feature extraction. In this experiment we
used only a limited number of samples; increasing the number of samples may give better
features and a better recognition result for Malayalam word utterances. The performance of
the neural network together with wavelets is appreciable. The software we used has some
limitations; increasing the number of samples as well as the number of training iterations
can produce a better recognition result.
We also observed that a neural network is an effective tool that can be combined
successfully with wavelets. The effectiveness of wavelet-based feature extraction with other
classification methods, such as neuro-fuzzy and genetic-algorithm techniques, could be
explored for the same task.
From this study we could understand and experience the effectiveness of the discrete
wavelet transform in feature extraction. Our recognition results under different kinds of noise
and noisy conditions show that choosing dyadic bandwidths gives better performance than
choosing equal bandwidths in sub-band recombination. This result matches the way the
human ear recognizes speech and shows a useful benefit of the dyadic nature of the
multi-level wavelet transform for sub-band speech recognition.
8. REFERENCES
[1] Vimal Krishnan V.R., Athulya Jayakumar, Babu Anto P., "Speech Recognition of Isolated Malayalam
Words Using Wavelet Features and Artificial Neural Network", 4th IEEE International Symposium on
Electronic Design, Test & Applications.
[2] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of Speech Recognition", Englewood Cliffs, NJ:
Prentice Hall, 1993.
[3] Stéphane Mallat, "A Wavelet Tour of Signal Processing", San Diego: Academic Press, 1999, ISBN
012466606.
[4] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.
[5] K.P. Soman, K.I. Ramachandran, "Insight into Wavelets: From Theory to Practice", Second Edition, PHI,
2005.
[6] S. Kadambe, P. Srinivasan, "Application of Adaptive Wavelets for Speech", Optical Engineering 33(7),
pp. 2204-2211, July 1994.
[7] Stuart Russell, Peter Norvig, "Artificial Intelligence: A Modern Approach", New Delhi: Prentice Hall of
India, 2005.
[8] S.N. Sivanandam, S. Sumathi, S.N. Deepa, "Introduction to Neural Networks using Matlab 6.0", New
Delhi: Tata McGraw Hill, 2006.
[9] James A. Freeman, David M. Skapura, "Neural Networks: Algorithms, Applications, and Programming
Techniques", Pearson Education, 2006.