
SPEECH RECOGNITION USING DSP

K.S. SAGALE*
19.kaustubh@gmail.com

N.B.RAJPUT*
narendrasingrajput@gmail.com

G.J.PATIL*
gauravpatil159@yahoo.co.in

*S.S.V.P.S.B.S.D.C.O.E. DHULE, North Maharashtra University (M.S.) - 424005



ABSTRACT:

This paper deals with the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. Speaker recognition methods can be divided into text-independent and text-dependent recognition methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. This technique is used in application areas such as access control for services like voice dialing, banking by telephone, database access services, voice mail, security control for confidential information areas, and remote access to computers. The speaker-independent system presented here makes use of mel-frequency cepstrum coefficients to process the input signal and a vector quantization approach to identify the speaker. The above task is implemented using MATLAB.

Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.

(a) Speaker identification
(b) Speaker verification
Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first one is referred to as the enrollment sessions or training phase, while the second one is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing phase (Figure 1), the input speech is matched with the stored reference models and a recognition decision is made.

Speech Feature Extraction

Introduction:
The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Figure 2. An example of speech signal

Mel-frequency cepstrum coefficients processor:
MFCC's are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have typically been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds that are generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the speech waveforms themselves, MFCC's are shown to be less susceptible to the variations mentioned above.

Frame Blocking:
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec windowing and facilitates the fast radix-2 FFT) and M = 100.

Figure 3. Block diagram of the MFCC processor
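The frame-blocking rule above lends itself to a short sketch. The paper's implementation is in MATLAB; the following Python/NumPy fragment (the signal itself is hypothetical) only illustrates the N = 256, M = 100 blocking:

```python
import numpy as np

def block_into_frames(signal, N=256, M=100):
    """Block a 1-D speech signal into overlapping frames of N samples,
    adjacent frames separated by M samples (M < N). Frames that would
    run past the end of the signal are dropped in this simple sketch."""
    num_frames = 1 + (len(signal) - N) // M
    frames = np.empty((num_frames, N))
    for i in range(num_frames):
        frames[i] = signal[i * M : i * M + N]
    return frames

# Hypothetical 1-second signal at 8 kHz: 1 + (8000 - 256)//100 = 78 frames.
x = np.random.randn(8000)
frames = block_into_frames(x)
print(frames.shape)                          # (78, 256)
# The second frame starts M samples in and overlaps the first by N - M = 156.
assert np.array_equal(frames[1], x[100:356])
```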

Windowing:
The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, then the result of windowing is the signal

    y(n) = x(n) w(n),  0 <= n <= N-1

Typically the Hamming window is used, which has the form:

    w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1
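As a sketch of this step (assuming the N = 256 frames from the previous step, with a hypothetical frame), the Hamming formula above can be written directly:

```python
import numpy as np

N = 256                                              # samples per frame
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))    # Hamming window w(n)

frame = np.random.randn(N)                           # one hypothetical frame x(n)
windowed = frame * w                                 # y(n) = x(n) w(n)

# The window tapers each frame toward zero at both ends (w is ~0.08 there
# and rises to ~1.0 in the middle), reducing edge discontinuities.
print(round(float(w[0]), 2), round(float(w[-1]), 2))   # 0.08 0.08
```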


Fast Fourier Transform (FFT):
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {xn} as follows:

    X_n = sum_{k=0}^{N-1} x_k e^(-2*pi*j*k*n/N),  n = 0, 1, 2, ..., N-1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the Xn's are complex numbers. The resulting sequence {Xn} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 0 < n < N/2, while negative frequencies -Fs/2 < f < 0 correspond to N/2 < n < N-1. Here, Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
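In NumPy this step is a single call; the sketch below (on a hypothetical frame) also checks the bin ordering described above:

```python
import numpy as np

N = 256
frame = np.random.randn(N)            # one windowed frame (hypothetical)

X = np.fft.fft(frame)                 # X_n = sum_k x_k e^(-2*pi*j*k*n/N)
periodogram = np.abs(X) ** 2          # the spectrum / periodogram

# n = 0 is the zero-frequency (DC) term: just the sum of the samples.
assert np.allclose(X[0], frame.sum())
# For a real input, the negative-frequency bins (n > N/2) mirror the
# positive ones: X_{N-n} is the complex conjugate of X_n.
assert np.allclose(X[N - 1], np.conj(X[1]))
```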

Mel-frequency Wrapping:
As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale. That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.
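The text does not give the mel formula explicitly, but a standard approximation is mel(f) = 2595 log10(1 + f/700). Below is a sketch of a K = 20 triangular filter bank spaced uniformly in mel; this HTK-style construction is an assumption for illustration, not taken from the paper:

```python
import numpy as np

def mel(f):
    # Approximate mel scale: roughly linear below 1000 Hz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, N=256, fs=10000):
    """K triangular bandpass filters spaced uniformly on the mel scale,
    defined over the N//2 + 1 positive-frequency FFT bins."""
    # K + 2 equally spaced mel points give the edges of K triangular filters.
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), K + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fb = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        lo, mid, hi = bins[k - 1], bins[k], bins[k + 1]
        for b in range(lo, mid):        # rising edge of triangle k
            fb[k - 1, b] = (b - lo) / max(mid - lo, 1)
        for b in range(mid, hi):        # falling edge of triangle k
            fb[k - 1, b] = (hi - b) / max(hi - mid, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)                  # (20, 129)
# Mel-weighted spectrum: output power of each filter for a power spectrum S.
S = np.random.rand(129)          # hypothetical power spectrum of one frame
mel_spectrum = fb @ S            # K mel spectrum coefficients
```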

Cepstrum:
In this final step, the log mel spectrum is converted back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote those mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCC's, c_n, as

    c_n = sum_{k=1}^{K} (log S_k) cos[n (k - 1/2) pi / K],  n = 1, 2, ..., K

Note that the first component, c_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.

By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is transformed into a sequence of acoustic vectors. In the next section we will see how those acoustic vectors can be used to represent and recognize the voice characteristic of the speaker.
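The DCT formula above can be sketched directly (the mel spectrum values are hypothetical, and keeping 12 coefficients is a common but assumed choice):

```python
import numpy as np

def mfcc_from_mel_spectrum(S, num_coeffs=12):
    """Convert K mel power-spectrum coefficients S_k (k = 1..K) to
    mel-frequency cepstrum coefficients via the DCT:
        c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K)
    The n = 0 term (the mean of the log spectrum) is excluded."""
    K = len(S)
    logS = np.log(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(logS * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, num_coeffs + 1)])

S = np.random.rand(20) + 0.1     # hypothetical mel spectrum, K = 20
c = mfcc_from_mel_spectrum(S)    # one acoustic vector for this frame
print(c.shape)                   # (12,)
```

A quick sanity check on the excluded mean: a perfectly flat mel spectrum carries only a mean value, so all the retained coefficients come out (numerically) zero.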

Feature Matching

Introduction:
The problem of speaker recognition belongs to pattern recognition. The objective of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied on extracted features, it can also be referred to as feature matching.

The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this paper the VQ approach will be used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.

Figure 4 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 4 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

Figure 4. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
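The recognition rule, choose the codebook with the smallest total VQ distortion, can be sketched as follows (the codebooks and utterance are toy 2-D data, not real MFCC vectors):

```python
import numpy as np

def vq_distortion(vectors, codebook):
    """Total VQ distortion: sum over the input vectors of the Euclidean
    distance to the closest codeword in the codebook."""
    # d[i, j] = distance from vector i to codeword j
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify_speaker(vectors, codebooks):
    """Return the index of the codebook with the smallest total distortion."""
    distortions = [vq_distortion(vectors, cb) for cb in codebooks]
    return int(np.argmin(distortions))

# Toy example with two 2-D "speakers" (hypothetical data):
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])        # codebook of speaker 1
cb2 = np.array([[10.0, 10.0], [11.0, 11.0]])    # codebook of speaker 2
utterance = np.array([[0.1, 0.1], [0.9, 1.0]])  # near speaker 1's centroids
print(identify_speaker(utterance, [cb1, cb2]))  # 0 (speaker 1)
```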
Clustering the Training Vectors

After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword yn according to the rule yn+ = yn(1 + e), yn- = yn(1 - e), where n varies from 1 to the current size of the codebook, and e is a splitting parameter (we choose e = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of the similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.

Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

Figure 5. Flow diagram of the LBG algorithm
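Steps 1-6 above can be sketched compactly (toy 2-D training data, and M assumed to be a power of 2; the convergence test simply stops when the average distance stops improving):

```python
import numpy as np

def lbg(train, M, eps=0.01, threshold=1e-3):
    """LBG clustering of L training vectors into an M-vector codebook."""
    codebook = train.mean(axis=0, keepdims=True)    # step 1: 1-vector codebook
    while len(codebook) < M:
        # step 2: double the codebook by splitting y -> y(1+eps), y(1-eps)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # step 3: nearest-neighbor search
            d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # step 4: centroid update (empty cells keep their old codeword)
            for j in range(len(codebook)):
                members = train[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
            # step 5: iterate until the average distance stops improving
            avg = d.min(axis=1).mean()
            if prev - avg < threshold:
                break
            prev = avg
    return codebook

# Toy data: two well-separated clouds; a 2-vector codebook should find both.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
cb = lbg(train, M=2)
print(np.sort(cb[:, 0]))    # close to [0, 5]
```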
Conclusion:
Even though much care is taken, it is difficult to obtain an efficient speaker recognition system, since this task is challenged by the highly variant input speech signals. The principal source of this variance is the speaker himself. Speech signals in training and testing sessions can be greatly different due to many facts, such as people's voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Because of all these difficulties, this technology is still an active area of research.

References
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
