K.S. SAGALE* (19.kaustubh@gmail.com)
N.B. RAJPUT* (narendrasingrajput@gmail.com)
G.J. PATIL* (gauravpatil159@yahoo.co.in)
Speaker recognition based on the individual information included in speech waves is implemented using MATLAB. Mel frequency cepstrum coefficients extracted from the speech signal and a vector quantization approach are used to identify the speaker; both text-independent and text-dependent recognition are possible. This technique makes it possible to use somebody's speech, rather than card numbers or PIN codes, to control access to services like voice dialing, telephone banking, database access services, voice mail, security control for confidential information areas, and remote access to computers.
Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing the extracted features from his/her voice input with the ones from a set of known speakers.
Figure 1. Basic structures of speaker recognition: (a) speaker identification; (b) speaker verification
All speaker recognition systems have to serve two distinguished phases. The first one is referred to as the enrollment sessions or training phase, while the second one is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build a reference model for that speaker. In the case of speaker verification systems, in addition, a speaker-specific threshold is also computed from the training samples. In the testing phase, the input speech is matched with the stored reference model and a recognition decision is made.
Speech Feature Extraction

Introduction: The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. When examined over a sufficiently short period of time (between 5 and 100 msec), the characteristics of the speech signal are fairly stationary. Over longer periods of time (on the order of 1/5 seconds or more), however, the signal characteristics change to reflect the different speech sounds being spoken; the speech signal is a slowly timed varying signal, often called quasi-stationary. Short-time spectral analysis is therefore the most common way to characterize the speech signal. An example of a speech signal is shown in Figure 2.
Figure 2. An example of speech signal

Mel-frequency cepstrum coefficients processor:

MFCC's are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10 kHz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which cover most energy of sounds that are generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the speech waveforms themselves, MFCC's are shown to be less susceptible to the variations mentioned.
Frame Blocking: In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames being separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Typical values are N = 256 (which is equivalent to ~ 30 msec windowing) and M = 100.

Figure 3. Block diagram of the MFCC processor

Windowing: The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The idea is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, then the result of windowing is the signal y(n) = x(n)w(n). Typically the Hamming window is used, which has the form:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)),  0 <= n <= N-1

FFT: The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {xn} as follows:

X_n = sum_{k=0}^{N-1} x_k e^(-2*pi*j*k*n/N),  n = 0, 1, ..., N-1

Note that we use j here to denote the imaginary unit. The resulting sequence {Xn} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1. Here, Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.

Mel-frequency Wrapping: As mentioned above, psychophysical studies have shown that human perception of the frequency contents of sounds does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.
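The frame blocking, windowing, and FFT steps above can be sketched as follows. The paper's implementation is in MATLAB; this is an illustrative Python/NumPy sketch using the N = 256, M = 100 values from the text, and the function names are ours:

```python
import numpy as np

def frame_block(signal, N=256, M=100):
    """Block a signal into overlapping frames of N samples; adjacent
    frames start M samples apart, so they overlap by N - M samples."""
    num_frames = 1 + (len(signal) - N) // M
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

def windowed_spectra(frames):
    """Multiply each frame by the Hamming window
    w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)), then take |FFT|."""
    N = frames.shape[1]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return np.abs(np.fft.fft(frames * w, axis=1))

# Example: one second of a 440 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)
frames = frame_block(x)            # 78 frames of 256 samples each
spectra = windowed_spectra(frames)
```

The spectral peak of the tone lands near FFT bin 440/8000 * 256 ≈ 14, as expected from the DFT interpretation given above.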
One approach to simulating the subjective spectrum is to use a filter bank, with bandpass filters spaced uniformly on the mel scale; the spacing as well as the bandwidth is determined by a constant mel frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.

Cepstrum: In this final step, the log mel spectrum is converted back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCC's c_n as:

c_n = sum_{k=1}^{K} (log S_k) cos(n (k - 1/2) pi / K),  n = 1, 2, ..., K

Note that the first component, c_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
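The mel filter bank and DCT steps can be sketched similarly. Note that the closed-form mel formula below (2595 log10(1 + f/700)) and the triangular filter shape are common choices that the text does not spell out, and all function names are ours:

```python
import numpy as np

def mel(f):
    # Common closed-form approximation of the mel scale
    # (the text only states: linear below 1000 Hz, logarithmic above)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, N=256, fs=8000):
    """K triangular bandpass filters with edges spaced uniformly on the
    mel scale, applied to the first N//2 + 1 FFT magnitude bins."""
    edges_hz = mel_inv(np.linspace(0.0, mel(fs / 2.0), K + 2))
    bins = np.floor((N + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((K, N // 2 + 1))
    for k in range(1, K + 1):
        lo, c, hi = bins[k - 1], bins[k], bins[k + 1]
        for i in range(lo, c):
            fb[k - 1, i] = (i - lo) / max(c - lo, 1)
        for i in range(c, hi):
            fb[k - 1, i] = (hi - i) / max(hi - c, 1)
    return fb

def mfcc(power_spectrum, K=20, num_ceps=12, fs=8000):
    """Log mel spectrum followed by a DCT; c_0 is dropped since it
    carries the mean of the signal (little speaker information)."""
    N = 2 * (len(power_spectrum) - 1)               # original frame length
    S = np.log(mel_filterbank(K, N, fs) @ power_spectrum + 1e-10)
    n = np.arange(num_ceps + 1)
    # c_n = sum_k S_k * cos(n * (k - 1/2) * pi / K), with S_k the log mel energies
    C = np.cos(np.outer(n, np.arange(1, K + 1) - 0.5) * np.pi / K)
    return (C @ S)[1:]                              # drop c_0

spec = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
coeffs = mfcc(spec)
```

The K = 20 filters and the 12 retained coefficients follow the typical values mentioned in the text.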
By applying the procedure described above, for each speech frame of around 30 msec with overlap, a set of mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. This set of coefficients is called an acoustic vector, and each input utterance is therefore transformed into a sequence of acoustic vectors.

Feature Matching

Introduction: The problem of speaker recognition belongs to pattern recognition. The objective of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case are the sequences of acoustic vectors extracted from an input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied on extracted features, it can also be referred to as feature matching.

The state-of-the-art in feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project the VQ approach is used, due to ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 4 for speaker 1 and speaker 2, respectively.
Figure 4 shows a conceptual diagram to illustrate this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with smallest total distortion is identified.
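Assuming Euclidean distance between acoustic vectors (the text does not fix the distance measure), the recognition rule above can be sketched as:

```python
import numpy as np

def total_vq_distortion(vectors, codebook):
    """Sum over all acoustic vectors of the distance to the closest
    codeword in the codebook -- the total VQ distortion."""
    # d[i, j] = Euclidean distance from vector i to codeword j
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify(vectors, codebooks):
    """Pick the speaker whose codebook yields the smallest total
    VQ distortion for the unknown utterance."""
    return min(codebooks, key=lambda spk: total_vq_distortion(vectors, codebooks[spk]))

# Toy example with hypothetical 2-D acoustic vectors and 2-word codebooks
books = {
    "speaker1": np.array([[0.0, 0.0], [1.0, 0.0]]),
    "speaker2": np.array([[10.0, 10.0], [11.0, 10.0]]),
}
utterance = np.array([[0.1, 0.1], [0.9, -0.1]])
print(identify(utterance, books))   # prints "speaker1"
```

In a real system the codebooks would come from clustering each speaker's training acoustic vectors, as described next.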
Figure 4. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
Clustering the Training Vectors

After the enrollment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword yn according to the rule yn+ = yn(1 + ε), yn− = yn(1 − ε), where n varies from 1 to the current size of the codebook and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
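A minimal sketch of the six steps above, assuming Euclidean distance and a simple convergence test on the average distortion (the tolerance value is our choice, not from the text):

```python
import numpy as np

def lbg(vectors, M, eps=0.01, tol=1e-3):
    """LBG clustering of L training vectors into an M-vector codebook."""
    # Step 1: the 1-vector codebook is the centroid of all training vectors
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < M:
        # Step 2: double the codebook by splitting: y -> y(1+eps), y(1-eps)
        codebook = np.concatenate([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Step 3: nearest-neighbor search (Euclidean distance)
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Step 4: centroid update (empty cells keep their old codeword)
            for k in range(len(codebook)):
                if np.any(nearest == k):
                    codebook[k] = vectors[nearest == k].mean(axis=0)
            # Step 5: iterate until the average distortion stops improving
            avg = d.min(axis=1).mean()
            if prev - avg < tol:
                break
            prev = avg
    return codebook

# Toy data: two tight clusters should produce two well-separated codewords
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                       rng.normal(5.0, 0.1, (50, 2))])
cb = lbg(data, M=2)
```

For simplicity this sketch assumes M is a power of two, matching the repeated doubling in step 6.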
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.
Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

Figure 5. Flow diagram of the LBG algorithm
Conclusion: Even though much care is taken, it is difficult to obtain an efficient speaker recognition system, since this task is challenged by highly variant input speech signals. The principal source of this variance is the speaker himself: speech signals in the training and testing sessions can differ greatly because people's voices change with time and with health conditions (e.g. the speaker has a cold). There are also factors, beyond speaker variability, that present a challenge to speaker recognition technology. Because of all these difficulties, this technology is still an active area of research.

References
[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.