
SPEECH RECOGNITION USING DSP

K.S. SAGALE*
19.kaustubh@gmail.com

N.B.RAJPUT*
narendrasingrajput@gmail.com

G.J.PATIL*
gauravpatil159@yahoo.co.in

*S.S.V.P.S.B.S.D.C.O.E. DHULE, North Maharashtra University (M.S.) - 424005



ABSTRACT:

This paper deals with the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. Speaker recognition methods can be divided into text-independent and text-dependent recognition methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. This technique is used in application areas such as access control for services like voice dialing, banking by telephone, database access services, voice mail, security control for confidential information areas, and remote access to computers. The speaker-independent system presented here makes use of mel-frequency cepstrum coefficients to process the input signal and a vector quantization approach to identify the speaker. The above task is implemented using MATLAB.

Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.

(a) Speaker identification
(b) Speaker verification
Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first one is referred to as the enrollment sessions or training phase, while the second one is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing phase (Figure 1), the input speech is matched with the stored reference models and a recognition decision is made.

Speech Feature Extraction

Introduction:
The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Figure 2. An example of speech signal

Mel-frequency cepstrum coefficients processor:
MFCC's are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have typically been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds that are generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the speech waveforms themselves, MFCC's are shown to be less susceptible to the variations mentioned above.

Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec windowing and facilitates the fast radix-2 FFT) and M = 100.

Figure 3. Block diagram of the MFCC processor
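The frame-blocking rule above lends itself to a short sketch. The paper's implementation is in MATLAB; the following Python/NumPy fragment (the signal itself is hypothetical) only illustrates the N = 256, M = 100 blocking:

```python
import numpy as np

def block_into_frames(signal, N=256, M=100):
    """Block a 1-D speech signal into overlapping frames of N samples,
    adjacent frames separated by M samples (M < N). Frames that would
    run past the end of the signal are dropped in this simple sketch."""
    num_frames = 1 + (len(signal) - N) // M
    frames = np.empty((num_frames, N))
    for i in range(num_frames):
        frames[i] = signal[i * M : i * M + N]
    return frames

# Hypothetical 1-second signal at 8 kHz: 1 + (8000 - 256)//100 = 78 frames.
x = np.random.randn(8000)
frames = block_into_frames(x)
print(frames.shape)                          # (78, 256)
# The second frame starts M samples in and overlaps the first by N - M = 156.
assert np.array_equal(frames[1], x[100:356])
```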

Windowing:
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, then the result of windowing is the signal

    y(n) = x(n) w(n),  0 <= n <= N-1

Typically the Hamming window is used, which has the form:

    w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1
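As a sketch of this step (assuming the N = 256 frames from the previous step, with a hypothetical frame), the Hamming formula above can be written directly:

```python
import numpy as np

N = 256                                              # samples per frame
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))    # Hamming window w(n)

frame = np.random.randn(N)                           # one hypothetical frame x(n)
windowed = frame * w                                 # y(n) = x(n) w(n)

# The window tapers each frame toward zero at both ends (w is ~0.08 there
# and rises to ~1.0 in the middle), reducing edge discontinuities.
print(round(float(w[0]), 2), round(float(w[-1]), 2))   # 0.08 0.08
```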


Fast Fourier Transform (FFT):
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {xn} as follows:

    X_n = sum_{k=0}^{N-1} x_k e^(-2*pi*j*k*n/N),  n = 0, 1, 2, ..., N-1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the Xn's are complex numbers. The resulting sequence {Xn} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 0 < n < N/2, while negative frequencies -Fs/2 < f < 0 correspond to N/2 < n < N-1. Here, Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
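In NumPy this step is a single call; the sketch below (on a hypothetical frame) also checks the bin ordering described above:

```python
import numpy as np

N = 256
frame = np.random.randn(N)            # one windowed frame (hypothetical)

X = np.fft.fft(frame)                 # X_n = sum_k x_k e^(-2*pi*j*k*n/N)
periodogram = np.abs(X) ** 2          # the spectrum / periodogram

# n = 0 is the zero-frequency (DC) term: just the sum of the samples.
assert np.allclose(X[0], frame.sum())
# For a real input, the negative-frequency bins (n > N/2) mirror the
# positive ones: X_{N-n} is the complex conjugate of X_n.
assert np.allclose(X[N - 1], np.conj(X[1]))
```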

Mel-frequency Wrapping:
As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale. That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.
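The text does not give the mel formula explicitly, but a standard approximation is mel(f) = 2595 log10(1 + f/700). Below is a sketch of a K = 20 triangular filter bank spaced uniformly in mel; this HTK-style construction is an assumption for illustration, not taken from the paper:

```python
import numpy as np

def mel(f):
    # Approximate mel scale: roughly linear below 1000 Hz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, N=256, fs=10000):
    """K triangular bandpass filters spaced uniformly on the mel scale,
    defined over the N//2 + 1 positive-frequency FFT bins."""
    # K + 2 equally spaced mel points give the edges of K triangular filters.
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), K + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fb = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        lo, mid, hi = bins[k - 1], bins[k], bins[k + 1]
        for b in range(lo, mid):        # rising edge of triangle k
            fb[k - 1, b] = (b - lo) / max(mid - lo, 1)
        for b in range(mid, hi):        # falling edge of triangle k
            fb[k - 1, b] = (hi - b) / max(hi - mid, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)                  # (20, 129)
# Mel-weighted spectrum: output power of each filter for a power spectrum S.
S = np.random.rand(129)          # hypothetical power spectrum of one frame
mel_spectrum = fb @ S            # K mel spectrum coefficients
```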

Cepstrum:
In this final step, the log mel spectrum is converted back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote those mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCC's, c_n, as

    c_n = sum_{k=1}^{K} (log S_k) cos[n (k - 1/2) pi / K],  n = 1, 2, ..., K

Note that the first component, c_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.

By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how those acoustic vectors can be used to represent and recognize the voice characteristic of the speaker.
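The DCT formula above can be sketched directly (the mel spectrum values are hypothetical, and keeping 12 coefficients is a common but assumed choice):

```python
import numpy as np

def mfcc_from_mel_spectrum(S, num_coeffs=12):
    """Convert K mel power-spectrum coefficients S_k (k = 1..K) to
    mel-frequency cepstrum coefficients via the DCT:
        c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K)
    The n = 0 term (the mean of the log spectrum) is excluded."""
    K = len(S)
    logS = np.log(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(logS * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, num_coeffs + 1)])

S = np.random.rand(20) + 0.1     # hypothetical mel spectrum, K = 20
c = mfcc_from_mel_spectrum(S)    # one acoustic vector for this frame
print(c.shape)                   # (12,)
```

A quick sanity check on the excluded mean: a perfectly flat mel spectrum carries only a mean value, so all the retained coefficients come out (numerically) zero.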

Feature Matching

Introduction:
The problem of speaker recognition belongs to pattern recognition. The objective of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied on extracted features, it can also be referred to as feature matching.

The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this paper the VQ approach will be used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.

Figure 4 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 4 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

Figure 4. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
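The recognition rule, choose the codebook with the smallest total VQ distortion, can be sketched as follows (the codebooks and utterance are toy 2-D data, not real MFCC vectors):

```python
import numpy as np

def vq_distortion(vectors, codebook):
    """Total VQ distortion: sum over the input vectors of the Euclidean
    distance to the closest codeword in the codebook."""
    # d[i, j] = distance from vector i to codeword j
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify_speaker(vectors, codebooks):
    """Return the index of the codebook with the smallest total distortion."""
    distortions = [vq_distortion(vectors, cb) for cb in codebooks]
    return int(np.argmin(distortions))

# Toy example with two 2-D "speakers" (hypothetical data):
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])        # codebook of speaker 1
cb2 = np.array([[10.0, 10.0], [11.0, 11.0]])    # codebook of speaker 2
utterance = np.array([[0.1, 0.1], [0.9, 1.0]])  # near speaker 1's centroids
print(identify_speaker(utterance, [cb1, cb2]))  # 0 (speaker 1)
```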
Clustering the Training Vectors

After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword yn according to the rule yn+ = yn(1 + e), yn- = yn(1 - e), where n varies from 1 to the current size of the codebook, and e is a splitting parameter (we choose e = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.

Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

Figure 5. Flow diagram of the LBG algorithm
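Steps 1-6 above can be sketched compactly (toy 2-D training data, and M assumed to be a power of 2; the convergence test simply stops when the average distance stops improving):

```python
import numpy as np

def lbg(train, M, eps=0.01, threshold=1e-3):
    """LBG clustering of L training vectors into an M-vector codebook."""
    codebook = train.mean(axis=0, keepdims=True)    # step 1: 1-vector codebook
    while len(codebook) < M:
        # step 2: double the codebook by splitting y -> y(1+eps), y(1-eps)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # step 3: nearest-neighbor search
            d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # step 4: centroid update (empty cells keep their old codeword)
            for j in range(len(codebook)):
                members = train[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
            # step 5: iterate until the average distance stops improving
            avg = d.min(axis=1).mean()
            if prev - avg < threshold:
                break
            prev = avg
    return codebook

# Toy data: two well-separated clouds; a 2-vector codebook should find both.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
cb = lbg(train, M=2)
print(np.sort(cb[:, 0]))    # close to [0, 5]
```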
Conclusion:
Even though much care is taken, it is difficult to obtain an efficient speaker recognition system, since this task is challenged by the highly variant input speech signals. The principal source of this variance is the speaker himself. Speech signals in training and testing sessions can be greatly different due to many facts, such as people's voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Because of all these difficulties, this technology is still an active area of research.

References
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
