
Speech Signal Processing-

Voice Controlled Automation System


G. Aishwarya, ECE III year, Panimalar Institute of Technology, Chennai, India.
aishu.gopi91@yahoo.com, Phone no: 04425540018

K. Malathi, ECE III year, Panimalar Institute of Technology, Chennai, India.
Kmalathi234@gmail.com, Mobile no: 9500974695

Abstract— In this era of technology, rapid advancements are being made in the fields of automation and signal processing. The developments made in digital signal processing are being applied to automation, communication systems and biomedical engineering. This paper discusses speech recognition and its application in control mechanisms. Speech recognition can be used to automate many tasks that usually require hands-on human interaction, such as recognising simple spoken commands to turn on lights, shut a door or drive a motor. After speech recognition, a code corresponding to the spoken word is transferred through a wireless communication system to an 8051 microcontroller, which acts on it accordingly. Many techniques have been used to compare speech patterns, and recent technological advances have made recognition of more complex speech patterns possible. Despite these breakthroughs, however, current systems are still far from 100% recognition of natural human speech. A project involving the processing of a speech signal in any form is therefore a challenging and rewarding one.

I. INTRODUCTION

The term speech recognition can be defined as the technique of determining what is being spoken by a particular speaker. It is one of the commonplace applications of speech processing technology. The project under consideration involves speech recognition and its application in a control mechanism. It establishes a speech recognition system which controls a system in a remote area via a transceiver, for example the control of a wireless car using voice. Similarly, in the control and instrumentation industry, devices can be controlled from a control centre that is responsible for issuing the commands.

Recognizing natural speech is a challenging task. Human speech is parameterized over many variables, such as amplitude, pitch and phonetic emphasis, that vary from speaker to speaker. The problem becomes easier, however, when we look at certain subsets of human speech.

II. PROJECT OVERVIEW AND FEATURES

A. Project Plan

The system consists of a microphone through which the input, in the form of a speech signal, is applied. The data acquisition system of the speech processor acquires the output from the microphone and then detects the exact word spoken. A command signal is generated by the speech processor accordingly, which is then sent to the microcontroller via a wireless transceiver, and the microcontroller takes the necessary action according to the command signal. Suppose we want an electric motor to rotate anticlockwise: by speaking the word "anticlockwise" we can rotate the motor in that direction. Similarly, by speaking the word "clockwise", we can rotate the motor accordingly. If such a system is installed in a motor car, then by using several commands such as left, right, start, stop, forward and backward, we can drive the car without even using our hands.

B. Dependency on the Speaker

The system is speaker independent, i.e. it is able to recognize words spoken by anyone and is not bound to a particular speaker. This makes it flexible in use, as everyone is able to operate the system quite efficiently.

C. Software-Based Speech Processor

The speech processor is software based and the algorithms are constructed in MATLAB 7. The input applied via the microphone is recorded on the PC as a '.wav' file. Since the '.wav' format is readable by MATLAB, it can be processed very easily and the necessary signals can be generated.
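As an illustration of this step, the short MATLAB sketch below reads a recorded command word from a '.wav' file and normalises it. The file name 'command.wav' is a placeholder chosen for this sketch, not a detail fixed by the paper.

    % Read one recorded command word and prepare it for feature extraction.
    % 'command.wav' is a placeholder file name used only for this sketch.
    [x, fs] = wavread('command.wav');   % samples in [-1, 1] and sampling rate
    x = x(:, 1);                        % keep a single channel if stereo
    x = x - mean(x);                    % remove any DC offset
    x = x / (max(abs(x)) + eps);        % normalise the amplitude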

III. SPEECH PROCESSING

A. Basic Attributes of Human Hearing

In human hearing, pitch is one of the most important attributes of audio signals. According to the place theory of hearing, the inner ear acts like a frequency analyzer: the stimulus reaching the ear is decomposed into many sinusoidal components, each of which excites different places along the basilar membrane, where hair cells with distinct characteristic frequencies are linked with neurons. When a particular frequency of vibration reaches the ear, the hair cells that are tuned to that frequency fire nerve impulses.

The second most important attribute related to human hearing is the loudness, or intensity, of the sound. According to the psychoacoustical experiments made in the 1930s by Fletcher and Munson, the perceived loudness is not simply a function of intensity but also of the sound's frequency. This can be shown as a graph, as in Figure 2. The curves show that the ear is not equally sensitive to all frequencies, particularly in the low and high frequency ranges. The curves are lowest in the range from 1 to 5 kHz, with a dip at 4 kHz, indicating that the ear is most sensitive to frequencies in this range. The lowest curve represents the threshold of hearing, while the highest curve represents the threshold of pain.

B. Components of the Speech Recognition System

The speech capturing device consists of a microphone and an A/D converter which digitally encodes the raw speech waveform. The DSP module performs endpoint (word boundary) detection to separate speech from non-speech and converts the raw waveform into a frequency-domain representation. It also performs further windowing, scaling, filtering and data compression. The goal is to enhance and retain only those components of the spectral representation that are useful for recognition purposes, thereby reducing the amount of information that the pattern-matching algorithm must contend with. A set of these speech parameters for one interval of time is known as a speech frame.

The pre-processed speech is buffered for the recognition algorithm in the preprocessed signal storage. Stored reference patterns can be matched against the user's speech sample once it has been pre-processed by the DSP module.

Figure 3. Basic components of the Speech Recognition System

This information is stored as a set of speech templates or as generative speech models. The algorithm must compute a measure of goodness of fit between the preprocessed signal from the user's speech and all the stored templates or speech models. A selection process then chooses the template or model with the best match.

C. Pattern Matching of Speech Signals

The comparison of two speech signals is basically the matching of their patterns. A speech signal can be represented as a set of numbers describing certain features of the speech. For further processing it is useful to construct a vector out of these numbers by assigning each measured value to one component of the vector. These feature vectors can then be represented in a two-dimensional vector space.

D. Techniques of Pattern Comparison

There are several techniques for comparing the patterns of two speech signals. Dynamic time warping is an algorithm used for measuring the similarity between two sequences which may vary in time or speed. Another form of pattern matching, normally used in speaker verification systems, utilizes multiple templates to represent frames of speech and is referred to as vector quantization. The technique of frequency warping, on which this paper is based, is the most accurate of the three. The method of frequency warping for speech recognition is based on the fact that the voice signal, acting as a stimulus to human hearing, decomposes into many sinusoidal components, each with a distinct frequency. The components are divided into the critical bandwidths of hearing, each with a particular centre frequency.
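For comparison, the MATLAB sketch below illustrates the dynamic time warping idea mentioned above; it is only an illustration of that alternative technique, not the frequency-warping method this paper actually uses. The feature matrices A and B are assumed to hold one feature vector per column, and the function would be saved as dtw_distance.m.

    function d = dtw_distance(A, B)
    % DTW_DISTANCE  Cumulative dynamic-time-warping cost between two
    % feature sequences A and B (one feature vector per column).
    n = size(A, 2);
    m = size(B, 2);
    D = inf(n + 1, m + 1);            % cumulative cost matrix
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            cost = norm(A(:, i) - B(:, j));               % local distance
            D(i + 1, j + 1) = cost + min([D(i, j + 1), ...  % insertion
                                          D(i + 1, j), ...  % deletion
                                          D(i, j)]);        % match
        end
    end
    d = D(n + 1, m + 1);              % total alignment cost
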
IV. FREQUENCY WARPING AS A SPEECH COMPARISON TECHNIQUE

A. Feature Extraction

Before speech pattern comparison, the features of the spoken words are extracted. For this, a function is constructed using the Data Acquisition Toolbox. During voice recording, some noise or non-speech signal is also stored before and after the actual voice sample, so it is removed before the real analysis. As discussed above, the critical bandwidth of human hearing is not constant, so the computer cannot analyze the whole speech signal at once; instead the algorithm selects a particular number of samples from the entire set. The feature interval of the voice signal which the processor selects at a time is 32 ms. Each succeeding feature interval is formed from the second half of the previous interval together with new samples for the remainder of the interval. Commonly used measuring intervals are from 20 to 40 ms: in the frequency domain, shorter intervals give good time resolution but poorer frequency resolution, while longer intervals give poorer time resolution but better frequency resolution. 32 ms was selected as a compromise, with a feature length short enough to process the signals quickly. For each feature interval, two time-domain and 24 frequency-domain features are extracted. Feature extraction is performed on all speech templates to be stored for reference as well as on the real-time command signal. Note that at least five examples are stored for each command, and their mean is taken as the reference template for that command.
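The framing described above can be sketched in MATLAB as follows. The 32 kHz sampling rate is an assumption made for this sketch (it gives 1024 samples per 32 ms interval and leaves the full Bark range used later below the Nyquist frequency); x is a recorded word such as the one read in the earlier sketch.

    % Split the recorded word x into 32 ms feature intervals, each new
    % interval reusing the second half of the previous one (50% overlap).
    fs       = 32000;                    % assumed rate; in practice use the
                                         % rate returned by wavread
    frameLen = round(0.032 * fs);        % 32 ms -> 1024 samples
    hop      = frameLen / 2;             % half of the previous interval
    nFrames  = floor((length(x) - frameLen) / hop) + 1;
    frames   = zeros(frameLen, nFrames);
    for k = 1:nFrames
        idx = (k - 1) * hop + (1:frameLen);
        frames(:, k) = x(idx);
    end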

1) Time domain features: The two time-domain features extracted are the mean power and the mean zero crossing of the feature chunk. The mean power of the signal g(t) over an interval of N samples is simply the average of its squared values over that interval. The mean zero crossing is the average number of times the signal crosses the zero axis over the interval.
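A minimal MATLAB sketch of these two features for a single feature chunk, using the frames built above, is given below.

    % Time-domain features of one feature chunk (frame).
    g = frames(:, 1);                          % one 32 ms chunk
    N = length(g);
    meanPower = sum(g .^ 2) / N;               % mean power over the interval
    meanZeroCross = sum(abs(diff(sign(g))) > 0) / N;   % crossings per sample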

2) Frequency domain features: For the frequency-domain features, it is better to weight the samples with a Hamming window. The Hamming window is used for amplitude weighting of the time signal; applied to gated continuous signals, it gives them a slow onset and cut-off in order to reduce the generation of side lobes in their frequency spectrum. The absolute value of the Fast Fourier Transform of the weighted samples is then taken to form the frequency-scale vector.
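A sketch of this windowing and transform step in MATLAB (hamming is provided by the Signal Processing Toolbox):

    % Hamming-weight one frame and take the FFT magnitude to obtain the
    % frequency-scale vector; only the non-redundant half is kept.
    w = hamming(frameLen);
    G = abs(fft(frames(:, 1) .* w));
    G = G(1:frameLen / 2);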

The Bark scale goes up to a maximum index of 24, so a filter bank based on the Bark scale utilizes 24 band-pass filters which are centred on the Bark centre frequencies and whose bandwidths are equal to the Bark-scale critical bandwidths. Because these band-pass filters generally overlap, a triangular weighting scheme is applied to each filter in order to give the centre frequency the greatest weight. The plot of the filter bank used in this project is shown in Figure 5.

The frequency-scale vector is then divided according to the Bark band-pass filters. Each set of frequencies is triangularly weighted, and the base-10 log power of the spectrum is calculated over each filter interval. Finally, the individual power values are concatenated to form a single 24-element feature vector.

Figure 4. Block Diagram for the Frequency-Domain Feature Extractor

Figure 5. Bark Scale Filter Bank
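The sketch below illustrates this step with a simplified, non-overlapping version of the triangular Bark-band weighting (the filters used in the paper overlap); the band edges are the standard Bark critical-band edges in Hz, and G, fs and frameLen come from the earlier sketches.

    % 24-element frequency-domain feature vector: triangularly weighted
    % log power in each Bark critical band.
    edges = [0 100 200 300 400 510 630 770 920 1080 1270 1480 1720 2000 ...
             2320 2700 3150 3700 4400 5300 6400 7700 9500 12000 15500];
    f    = (0:length(G) - 1)' * (fs / frameLen);   % frequency of each FFT bin
    feat = zeros(24, 1);
    for b = 1:24
        lo = edges(b); hi = edges(b + 1); c = (lo + hi) / 2;
        in  = f >= lo & f < hi;                     % bins in this band
        tri = 1 - abs(f(in) - c) / ((hi - lo) / 2); % triangular weight
        feat(b) = log10(sum((G(in) .* tri) .^ 2) + eps);  % log band power
    end
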
B. Final Template Making

Before the recognition engine can recognize new voice inputs, it must first be able to recognize the original data provided for its training. This process is called "template creation" and needs to be performed only once for a particular training set. For comparing two speech signals spoken by the same person, an error value is computed for each pair of corresponding feature values; the error template is formed by taking the minimum of the error values resulting from the comparison of every two examples. For calculating inter-speaker differences, the procedure is the same except that the comparison uses the Euclidean distance formula (1). The values coming from both types of calculation are combined to form the final template for each word.

C. Real Time Pattern Comparison

The voice signal given as the command signal in real time is passed through the same feature extraction and template-making process and is compared with the stored final templates. The minimum difference with the stored template of a particular word results in the generation of the code indicating that that word has been spoken.
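The template averaging and the Euclidean-distance comparison can be sketched as below; featExamples (the recorded examples of one command), templates (the stored final templates) and liveFeat (the features of the live command) are hypothetical variables assumed to hold equally sized feature matrices.

    % Reference template for one command: element-wise mean of its examples.
    template = zeros(size(featExamples{1}));
    for k = 1:length(featExamples)
        template = template + featExamples{k};
    end
    template = template / length(featExamples);

    % Real-time comparison: Euclidean distance to every stored template;
    % the smallest distance identifies the spoken word.
    dists = zeros(1, length(templates));
    for k = 1:length(templates)
        d = liveFeat - templates{k};
        dists(k) = sqrt(sum(d(:) .^ 2));
    end
    [bestDist, bestWord] = min(dists);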

V. WIRELESS COMMUNICATION VIA SERIAL PORT

By definition, serial data is transmitted one bit at a time.


The order in which the bits are transmitted is given below:

• The start bit is transmitted with a value of 0 (logic 0).

• The data bits are transmitted. The first data bit corresponds to the least significant bit (LSB), while the last data bit corresponds to the most significant bit (MSB).

• The parity bit (if defined) is transmitted.

• One or two stop bits are transmitted, each with a value of 1 (logic 1).

The number of bits transferred per second is given by the baud rate. The transferred bits include the start bit, the data bits, the parity bit (if defined), and the stop bits. For example, at 9600 baud with one start bit, eight data bits, no parity and one stop bit, each character occupies ten bit times, so at most 960 characters can be transferred per second.

Communication via an RS-232 line is asynchronous. This means that each transmitted byte must be delimited by start and stop bits: the start bit indicates that a data byte is about to begin, and the stop bit(s) indicate that it has been transferred. The process of identifying bytes within the serial data stream follows these steps:

• When a serial port pin is idle, it is in an "on" state.

• When data is about to be transmitted, the serial port pin switches to an "off" state due to the start bit.

• The serial port pin switches back to an "on" state due to the stop bit(s). This indicates the end of the byte.

The data bits transferred through a serial port might represent device commands, sensor readings, error messages, and so on. The data can be transferred as either binary data or ASCII data; the ASCII equivalent is transferred if the data is in the form of alphabetic characters.

The data coming from the RS-232 serial port is at RS-232 levels, so it is converted to TTL levels by a MAX-232 IC, since the transceiver and the microcontroller can only recognize TTL levels. The data is then sent to the microcontroller via the wireless transceiver.
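On the PC side, sending the code of the recognised word over the serial port can be sketched with MATLAB's serial interface; the port name 'COM1' and the 9600-baud, 8-N-1 frame are assumptions made for this sketch and must match whatever the transceiver and microcontroller expect. bestWord is the index of the matched command from the comparison sketch in Section IV.

    % Open the PC serial port and transmit one command byte.
    s = serial('COM1', 'BaudRate', 9600, 'DataBits', 8, ...
               'StopBits', 1, 'Parity', 'none');
    fopen(s);
    fwrite(s, uint8(bestWord));   % code of the recognised word (one byte)
    fclose(s);
    delete(s);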

VI. CONTROLLING THROUGH THE MICROCONTROLLER

An ATMEL 89C51 microcontroller is used to control the system according to the command provided in the form of the speech signal. The serial output coming from the receiver is sent to the 89C51 (8051) microcontroller. The microcontroller accepts the serial data, processes it and provides the output on one of its ports accordingly. The baud rate of the microcontroller is set according to the baud rate of the serial data sent by the computer.

Figure 6. Serial Data Execution

One of the 89C51's many powerful features is its integrated UART, also known as a serial port. The fact that the 8051 has an integrated serial port means that values can be read from and written to the serial port very easily; it is only necessary to configure the serial port's operation mode and baud rate. For serial port communication, data is read from and written to the Serial Buffer (SBUF), a Special Function Register (SFR) dedicated to the serial port. The interrupt service routine of the 89C51 automatically lets the controller know about the reception of serial data so that it can control the system according to the command sent in the form of the speech signal. To configure a baud rate compatible with the serial port of the computer, the timer registers of the microcontroller are set according to that baud rate.
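As a worked example of this timer setting, with Timer 1 in 8-bit auto-reload mode and the SMOD bit cleared, the reload value is TH1 = 256 - f_osc / (384 x baud rate). Assuming the common 11.0592 MHz crystal (a value not stated in the paper) and 9600 baud, this gives TH1 = 256 - 11059200 / (384 x 9600) = 253, i.e. 0FDH.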

VII. CASE STUDY: VOICE-CONTROLLED INTELLIGENT WHEELCHAIR

In the case study presented here, voice commands are used to control a wheelchair. The voice commands consist of nine reaction commands and five verification commands. The reaction commands consist of five basic reaction commands and four short-moving reaction commands, which move the chair a short distance. The details of the voice commands and their respective reactions are shown in the table below.

Table 1. Voice command and reaction

The two commands "migi" and "hidari" produce 90-degree rotations in a fixed place, without any forward or backward movement. When these commands are input while the chair is running, the system turns 90 degrees and moves forward after the turn. Thus, the five basic reaction commands achieve seven reactions (modes). In our system, although the short-moving reactions, such as forward running of about 30 cm and right rotation of about 30 degrees, were decided from experience, these reactions can be changed according to the user's needs.

After a voice command is input, the recognition process is carried out, and the recognition result is shown on the display of the laptop. At this point, to avoid a wrong reaction due to misrecognition, the user must input a verification command to check the result. When an acceptance word ("ok" or "yes") is input, the system considers the recognition result as correct and performs the given movement. Conversely, when a rejection word ("torikeshi", "no" or "cancel") is input, or when there is no voice for a while, the system considers the recognition result as incorrect.

A. System Configuration

Our system is based on the commercial electric wheelchair NEO-P1 from Nissin Medical Industries Co. The system consists of a headset microphone and a laptop. The overview and the construction of the system are shown in Figures 7 and 8, respectively.

Figure 7. Voice controlled intelligent wheelchair

Figure 8. System overview

The laptop is the main computer of the system, and the recognition process is executed on it. The control signal is sent from the laptop to a PIC, and the PIC generates the motor control signals that drive the wheelchair.

We use a grammar-based recognition parser named "Julian". This is open-source software developed by Kyoto University, the Nara Institute of Science and Technology, and others. When a speech input is given, Julian searches for the most likely word sequence under the constraints of the given grammar. Furthermore, in our system the forward running speed is set at 1.8 km/h and the backward speed at 1.4 km/h.

B. Control Algorithm

To control the wheelchair by voice, a user inputs a reaction command followed by a verification command to prevent a wrong reaction caused by misrecognition. The control algorithm is given below, and the corresponding flowchart is shown in Figure 9.

1. The user inputs a reaction command.

2. The input reaction voice data is recognized and the resulting word is displayed.

3. If the resulting word is a stop command, go to step 7.

4. The user inputs a verification command. If the verification command is not input within three seconds, the system considers that the recognition has failed and goes to step 1.

5. The input verification voice data is recognized.

6. If the resulting word is a rejection command, go to step 1.

7. The control system sends the target control signal to the wheelchair, the wheelchair reacts, and the process returns to step 1.

Figure 9. Control Algorithm
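A compact sketch of this confirmation loop is given below, written in MATLAB only for consistency with the rest of the paper; recogniseWord() and sendToWheelchair() are hypothetical helper functions standing in for the Julian recogniser and the PIC interface, not part of any real API.

    % Confirmation loop of the control algorithm (steps 1-7 above).
    while true
        word = recogniseWord();              % steps 1-2: reaction command
        if strcmp(word, 'stop')
            sendToWheelchair(word);          % step 3 -> step 7: stop at once
            continue;
        end
        verdict = recogniseWord(3);          % steps 4-5: wait up to 3 s
        if any(strcmp(verdict, {'ok', 'yes'}))
            sendToWheelchair(word);          % step 7: accepted command
        end                                  % rejection/timeout: back to step 1
    end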

VIII. REFERENCES

[1] B. Plannerer, "An Introduction to Speech Recognition", March 2005.

[2] Richard D. Peacocke and Daryl H. Graf, "An Introduction to Speech Recognition", Computer, volume 23, August 1990, pp. 26-33.

[3] G. Pires and U. Nunes, "A wheelchair steered through voice commands and assisted by a reactive fuzzy-logic controller", Journal of Intelligent and Robotic Systems, 34(3).

[4] T. Kawahara and A. Lee, "Open-source speech recognition software Julius", Journal of Japanese Society for Artificial Intelligence, 20(1):41-49, 2005 (in Japanese).

[5] K. H. Kim, H. K. Kim, J. S. Kim, W. Son, and S. Y. Lee, "A biosignal-based human interface controlling a power-wheelchair for people with motor disabilities", ETRI Journal, 28(1):111-114, 2006.
