
Process of Registration of person for Speaker Identification and Feature Extraction

1. INTRODUCTION

Speech and hearing, man's most used means of communication, have been the objects of intense study for more than 150 years, from the time of von Kempelen's speaking machine to the present day. With the advent of the telephone and the explosive growth of its dissemination and use, the engineering of ever more bandwidth-efficient and higher-quality transmission systems has been the objective and the province of both engineers and scientists for more than 70 years. This work has been largely driven by real-world applications, which have now broadened to include not only speech synthesis but also automatic speech recognition systems, speaker verification systems, speech enhancement systems, efficient speech coding systems, and speech and voice modification systems. The objective of the engineers has been to design and build workable and economically affordable systems that can be used over the broad range of existing and newly installed communication channels.


2. LITERATURE SURVEY

2.1 PRODUCTION AND CLASSIFICATION OF SPEECH SOUNDS:

A simplified view of speech production is given in Figure 2.1, where the speech organs are divided into three main groups: the lungs, the larynx, and the vocal tract. The lungs act as a power supply and provide airflow to the larynx. The larynx modulates the airflow and provides either a periodic, puff-like or a noisy airflow source to the vocal tract. The vocal tract consists of the oral, nasal, and pharyngeal cavities, giving the modulated airflow its "color" by spectrally shaping the source. The variation of air pressure at the lips results in a travelling sound wave that the listener perceives as speech. In general there are three categories of speech sound sources: periodic, noisy, and impulsive, though combinations are often present. Combining these sources with different vocal tract configurations gives more refined speech sound classes, referred to as phonemes; the study of phonemes is called phonemics.

Figure 2.1: Simple view of the speech production system.


2.2 ANATOMY AND PHYSIOLOGY OF SPEECH ORGANS

During speaking, we take short spurts of air and release them steadily. We override our rhythmic breathing by making the duration of exhalation roughly equal to the length of a sentence or phrase, while the lung air pressure is maintained at an approximately constant level.

Figure 2.2 Speech Organs

2.3 ABOUT VOICE

2.3.1 Parts of the Voice

1. Larynx
The larynx is the voice box. The vocal folds are part of the larynx. The vocal
folds vibrate to create the sound of the voice.

2. Pharynx
The pharynx is the throat. It goes up from the larynx and divides into the
laryngopharynx (just above the larynx), oropharynx (going into the mouth)
and nasopharynx (going into the nose).

3. Trachea
The trachea is your windpipe. It's the tube that connects your lungs to your
throat. The larynx sits on the top of the trachea.


Figure 2.3 Parts of the voice

4. Oesophagus
The oesophagus is your food pipe. It's just behind the larynx and trachea. Your
pharynx carries both air and food/water. The air goes through the larynx and
trachea, and food and water go into your oesophagus.

5. Spinal column
The spinal column is behind the oesophagus.

6. Diaphragm
The diaphragm is underneath the lungs, inside the rib cage. It's shaped like a
dome. The diaphragm is your main muscle for controlling respiration
(breathing).


Figure 2.4 Internal structure of human body

Figure 2.5 Anatomy of the Larynx:

Vocal folds (vocal cords) are the remarkable structures that provide a valve for the airway and also vibrate to produce the voice. The vocal folds are multilayered structures, consisting of muscle covered by a mucosal layer.

2.4 VOCAL FOLD VIBRATION AND PITCH

This chart shows how fast vocal folds are vibrating (in Hertz) when you're singing certain pitches. Notice that every time you go up an octave, you double the frequency of vibration.

Figure 2.6 Vocal Fold Vibration and Pitch

The faster the vocal folds vibrate, the higher the pitch. Extremely slow vocal fold vibration is about 60 vibrations per second and produces a low pitch. Extremely fast vocal fold vibration approaches 2000 vibrations per second and produces a very high pitch; only the highest sopranos can attain such extremely high pitches. In general, men's vocal folds can vibrate from 90 to 500 Hz, and they average about 115 Hz in conversation. Women's vocal folds can vibrate from 150 to 1000 Hz, and they average about 200 Hz in conversation.

2.5 MODELS FOR SPEECH PRODUCTION

A schematic longitudinal cross-sectional drawing of the human vocal tract mechanism is given in Figure 2.8. This diagram highlights the essential physical features of human anatomy that enter into the final stages of the speech production process. It shows the vocal tract as a tube of non-uniform cross-sectional area that is bounded at one end by the vocal cords and at the other by the mouth opening. For creating nasal sounds such as /M/, /N/, or /NG/, a side-branch tube called the nasal tract is connected to the main acoustic branch by the trapdoor action of the velum. This branch path radiates sound at the nostrils. The shape of the vocal tract changes with time due to motion of the lips, jaw, tongue, and velum.


Figure 2.7: Hidden knowledge sources in the speech waveform

The sounds of speech are generated in several ways. Voiced sounds (vowels, liquids, glides, nasals) are produced when the vocal tract is excited by pulses of air pressure resulting from the quasi-periodic opening and closing of the glottal orifice (the opening between the vocal cords); examples are the vowels /UH/, /IY/, and /EY/, and the liquid consonant /W/. Unvoiced sounds are produced by creating a constriction somewhere in the vocal tract tube and forcing air through that constriction, thereby creating turbulent air flow, which acts as a random noise excitation of the vocal tract tube. Examples are the unvoiced fricative sounds such as /SH/ and /S/. A third sound production mechanism occurs when the vocal tract is partially closed off, causing turbulent flow due to the constriction while at the same time allowing quasi-periodic flow due to vocal cord vibration. Sounds produced in this manner include the voiced fricatives /V/, /DH/, /Z/, and /ZH/. Finally, plosive sounds such as /P/, /T/, and /K/ and affricates such as /CH/ are formed by momentarily closing off the air flow, allowing pressure to build up behind the closure, and then abruptly releasing the pressure. All these excitation sources create a wide-band excitation signal to the vocal tract tube, which acts as an acoustic transmission line with certain vocal-tract-shape-dependent resonances that tend to emphasize some frequencies of the excitation relative to others.

Figure 2.8 Schematic model of the vocal tract system

2.6 HEARING AND AUDITORY PERCEPTION

This section discusses the properties of human sound perception that can be employed to create digital representations of the speech signal that are perceptually robust.

2.6.1 The Human Ear

Figure 2.9 shows a schematic view of the human ear with its three distinct sound-processing sections: the outer ear, consisting of the pinna, which gathers sound and conducts it through the external canal to the middle ear; the middle ear, beginning at the tympanic membrane, or ear drum, and including three small bones, the malleus (also called the hammer), the incus (also called the anvil), and the stapes (also called the stirrup), which perform a transduction from acoustic waves to mechanical pressure waves; and finally the inner ear, which consists of the cochlea and the set of neural connections to the auditory nerve, which conducts the neural signals to the brain.


Figure 2.9 Schematic view of the human ear

Figure 2.10 depicts a block diagram abstraction of the auditory system. The acoustic wave is transmitted from the outer ear to the inner ear, where the ear drum and bone structures convert the sound wave to mechanical vibrations which are ultimately transferred to the basilar membrane inside the cochlea. The basilar membrane vibrates in a frequency-selective manner along its extent and thereby performs a rough spectral analysis of the sound. Distributed along the basilar membrane are a set of inner hair cells that serve to convert motion of the basilar membrane to neural activity.

This produces an auditory nerve representation in both time and frequency. The processing at higher levels in the brain, shown in Figure 2.10 as a sequence of central processing with multiple representations followed by some type of pattern recognition, is not well understood, and we can only postulate the mechanisms used by the human brain to perceive sound or speech. Even so, a wealth of knowledge about how sounds are perceived has been gained from careful experiments that use tones and noise signals to stimulate the auditory system of human observers in very specific and controlled ways. These experiments have yielded much valuable knowledge about the sensitivity of the human auditory system to acoustic properties such as intensity and frequency.


Figure 2.10: Schematic model of the auditory mechanism

2.6.2 Perception and Loudness

A key factor in the perception of speech and other sounds is loudness. Loudness is a perceptual quality that is related to the physical property of sound pressure level. Loudness is quantified by relating the actual sound pressure level of a pure tone (in dB relative to a standard reference level) to the perceived loudness of the same tone (in a unit called phons) over the range of human hearing (20 Hz-20 kHz). This relationship is shown in Figure 2.11. The loudness curves show that the perception of loudness is frequency-dependent. Specifically, the dotted curve at the bottom of the figure labelled "threshold of audibility" shows the sound pressure level that is required for a sound of a given frequency to be just audible (by a person with normal hearing). It can be seen that low frequencies must be significantly more intense than frequencies in the mid-range in order to be perceived at all.

The solid curves are equal-loudness-level contours measured by comparing sounds at various frequencies with a pure tone of frequency 1000 Hz at a known sound pressure level. For example, the point at 100 Hz on the curve labelled 50 (phons) is obtained by adjusting the power of the 100 Hz tone until it sounds as loud as a 1000 Hz tone having a sound pressure level of 50 dB. Careful measurements of this kind show that a 100 Hz tone must have a sound pressure level of about 60 dB in order to be perceived as equal in loudness to the 1000 Hz tone of sound pressure level 50 dB.

Figure 2.11: Graph representing Loudness level for human hearing

By convention, both the 50 dB 1000 Hz tone and the 60 dB 100 Hz tone are said to have a loudness level of 50 phons. The equal-loudness-level curves show that the auditory system is most sensitive for frequencies ranging from about 100 Hz up to 6 kHz, with the greatest sensitivity at around 3-4 kHz. This is almost precisely the range of frequencies occupied by most of the sounds of speech.

3. STATE OF THE ART

At present, there are many applications based on speech technology. Speech technology can be divided into three main parts: speech recognition, speaker recognition, and other recognition tasks. However, more emphasis is laid on speech recognition than on the other types. The most widely used techniques in speech technology are hidden Markov models and neural networks, the latter experiencing a renaissance.

3.1 SPEECH RECOGNITION



In speech recognition, a machine tries to answer the question "What was said by the speaker?" Thus, the content of the speech is recognised. The result of this process should be a transcription of the text said by the speaker, the execution of a speech command, or some other action. Speech recognition can be roughly divided into the following groups:

1. Continuous speech recognition, i.e. recognition of the phones, words, or sentences continuously spoken by the speaker.

2. Recognition of isolated speech units, i.e. the units are said separately and the task is to find them in a signal and to recognise them.

3. Recognition and understanding of the meaning, i.e. the meaning of the speech should be recognised.

Figure 3.1 Classification of speech processing

There are many uses of speech recognition technology; above all, phone banking, device control, or simple dictation can be mentioned.

3.2 SPEAKER RECOGNITION


Speaker recognition is a system that can recognise a person based on his/her voice. This is achieved by implementing complex signal processing algorithms that run on a digital computer. The application is analogous to fingerprint recognition, which is likewise based on certain characteristics of a person.

There are several occasions when we want to identify a person from a given group of people even if the person is not present for physical examination. According to the dependency on text, there are two kinds of input voice: text-dependent and text-independent.

3.2.1 Text-Dependent

The content of the speech is known to the system. It is based on the assumption that the speaker is cooperative. This is the easier case to implement, and it gives higher accuracy.

3.2.2 Text-Independent

In this condition, the content of the speech is unknown; in other words, the system has no prior knowledge of it. This is harder to implement, but offers higher flexibility.

3.2.3 Approaches to the Speaker Recognition

There are two different approaches to speaker recognition. The first is based on speaker verification and the second on speaker identification. When verifying a speaker, there are two possible answers: 'yes' or 'no'.

The main difference between the two approaches is that in the case of speaker verification the user has to tell the system who he/she claims to be. The system then compares the stored pattern of the claimed user with the new pattern of the unknown user and answers 'yes' or 'no'.


Hence, a one-to-one comparison is performed. In the case of speaker identification, the system itself determines the user; thus, a one-to-many comparison is performed. If the user's voice is not found in the database, the answer is 'no'; otherwise the answer is 'yes'.

All speaker recognition systems operate in two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase. Speaker recognition is a difficult task. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker.

Usually, the process of the speaker recognition consists of the following steps:

1. Speech Processing

2. Feature extraction

3. Pattern matching

4. Decision making

3.3 SPEECH PROCESSING

The first phases of the speech signal processing are the recording, the
digitizing, and the pre-processing.

3.3.1 Recording and Digitizing

Recording the signal is the first stage of the whole speaker recognition process. Recording and digitizing are very important and can strongly influence the speaker recognition. The analog speech signal is recorded using a microphone, and the quality of the microphone can influence the final quality of the recognition.


Speech signals are usually represented as functions of a continuous variable t, which denotes time. The analog signal can be defined as a function varying continuously in time. The processed signals are sampled with a sampling period Ts. A sample of the discrete-time signal is then defined as

s(n) = sa(nTs)

The signal s(n) is called the digital signal. Usually the sampling frequency of speech signals lies in the range 8000 Hz < Fs < 22050 Hz. Due to the limitations of the human speech production organs and the auditory system, typical speech communication is limited to a bandwidth of about 8 kHz, which corresponds to a sampling frequency of 16 kHz; this ordinarily satisfies our requirements.

3.3.2 Framing

Framing is the next important step in the signal processing chain. The recorded discrete signal s(n) always has a finite length, but it is usually not processed as a whole. The signal is framed, i.e. cut into pieces. The length of one frame is N << Ntotal samples.

Given the total length Ntotal of the signal, the total number of frames is

J = int(Ntotal / N)

where the function int(x) returns the integer part of x. The length N of the frames in a real application is usually based on a physiological characteristic of the vocal tract.

The vocal tract is not able to change its shape faster than about fifty times per second, which gives a period of 20 milliseconds. When the sampling frequency is Fs = 16000 Hz, we get a frame length of N = Fs·T = 320 samples.


Figure 3.2 Framing of speech signal

Usually, an overlapping of the individual frames is defined. The overlapping is used to increase the precision of the recognition process. The length of the frame is ordinarily increased to a power of two, i.e. in our case it would be 512 samples. A power of two is chosen because of the Fast Fourier Transform (FFT), which is performed fastest when the input length of the signal equals a power of two. The overlapping is then chosen to be O = 512 - 320 = 192 samples long.
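The implementation in this work is done in LabVIEW; purely as an illustration of the framing scheme described above, the following Python/NumPy sketch cuts a signal into 512-sample frames advanced by 320 samples, so that consecutive frames overlap by 192 samples. The dummy signal and the exact parameter values are placeholders.

import numpy as np

def frame_signal(s, frame_len=512, hop=320):
    # cut the 1-D signal s into overlapping frames of length frame_len,
    # advancing by hop samples between consecutive frames
    n_frames = 1 + (len(s) - frame_len) // hop      # only complete frames are kept
    frames = np.empty((n_frames, frame_len))
    for j in range(n_frames):
        frames[j] = s[j * hop : j * hop + frame_len]
    return frames

fs = 16000
s = np.random.randn(fs)          # one second of dummy signal
frames = frame_signal(s)
print(frames.shape)              # (49, 512) for a 16000-sample signal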

3.3.3 Windowing

Before further processing, the individual frames are windowed. The frame itself is implicitly windowed by a rectangular window; however, the spectral characteristic of the rectangular window is unsuitable. The windowed signal is defined as

Sw(n) = S(n)·W(n)

where Sw(n) is the windowed signal, S(n) is the original signal, N samples long, and W(n) is the window itself. The window W(n) should be of the same length as the original signal S(n). There are many types of windows used for this purpose.

The rectangular window is defined as

w(n) = 1 for 0 ≤ n ≤ N−1, and w(n) = 0 otherwise.

The most frequently used window type is the Hamming window, defined as

w(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1.

Figure 3.3 Example of windowing

Here N is the length of the window. In Figure 3.3 you can see the influence of a rectangular and a Hamming window on the original signal. In this case, the length of the original signal is different from the length of the window, i.e. the window is shifted relative to the beginning of the signal.
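A minimal Python sketch of this windowing step, assuming a 512-sample frame and the Hamming window defined above; the frame content is illustrative only.

import numpy as np

N = 512
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window, same as np.hamming(N)

frame = np.random.randn(N)          # one frame of the signal (placeholder data)
frame_windowed = frame * w          # element-wise product Sw(n) = S(n)·W(n)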

3.3.4 Cepstrum

The speech signal s(n) can be represented as a "quickly varying" source signal e(n) convolved with the "slowly varying" impulse response h(n) of the vocal tract, represented as a linear filter. Only the output (the speech signal) is accessible, and it is often desirable to eliminate the source part. The source and the filter parameters are convolved with each other in the time domain. The cepstrum is a representation of the signal in which these two components are resolved into two additive parts. It is computed by taking the inverse DFT of the logarithm of the magnitude spectrum of the frame. This is represented by the following equation:

Cepstrum(frame) = IDFT(log(|DFT(frame)|))

The conversion from the time domain to the frequency domain changes the convolution into a multiplication. The use of the logarithm then replaces the multiplication by an addition. Finally, the inverse DFT is applied, and it operates individually on the quickly varying and slowly varying parts of the speech signal.
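The following short Python sketch computes the real cepstrum of one frame exactly as in the equation above; the synthetic frame only serves as an example.

import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))

frame = np.hamming(512) * np.random.randn(512)
c = real_cepstrum(frame)   # low quefrencies ~ vocal tract; a peak appears at the pitch period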

3.3.5 Hidden Markov Model

In a Markov model, each state corresponds to an observable (physical) event. This model is too restrictive to be applicable to many problems of interest, so the concept of Markov models is extended to include the case where the observation is a probabilistic function of the state. The resulting model (called a Hidden Markov Model) is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations.

3.3.6 STFT

The short-time Fourier transform (STFT), or alternatively short-term Fourier transform, is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.

Short Time Fourier Transform STFT

The short-time Fourier transform (STFT) is a signal processing method used for analyzing non-stationary signals, whose statistical characteristics vary with time. In essence, the STFT extracts several frames of the signal to be analyzed with a window that moves with time. If the time window is sufficiently narrow, each frame extracted can be viewed as stationary, so that the Fourier transform can be used. With the window moving along the time axis, the relation between the variation of frequency and time can be identified.

Figure 3.4 Graph representing STFT

Continuous-time STFT

Simply, in the continuous-time case, the function to be transformed is multiplied by a window function which is nonzero for only a short period of time. The Fourier transform (a one-dimensional function) of the resulting signal is taken as the window is slid along the time axis, resulting in a two-dimensional representation of the signal. Mathematically, this is written as

X(τ, ω) = ∫ x(t)·w(t − τ)·exp(−jωt) dt

where w(t) is the window function, commonly a Hann window or a Gaussian "hill" centered around zero, and x(t) is the signal to be transformed. X(τ, ω) is essentially the Fourier transform of x(t)w(t − τ), a complex function representing the phase and magnitude of the signal over time and frequency. Often phase unwrapping is employed along either or both the time axis, τ, and the frequency axis, ω, to suppress any jump discontinuity of the phase result of the STFT.


The time index τ is normally considered to be "slow" time and usually not expressed
in as high resolution as time t.

Discrete-time STFT

In the discrete-time case, the data to be transformed is broken up into chunks or frames (which usually overlap each other, to reduce artifacts at the boundary). Each chunk is Fourier transformed, and the complex result is added to a matrix. This can be expressed as

X(m, ω) = Σn x[n]·w[n − m]·exp(−jωn)

likewise with signal x[n] and window w[n]. In this case, m is discrete and ω is continuous, but in most typical applications the STFT is performed on a computer using the Fast Fourier Transform, so both variables are discrete and quantized. Again, the discrete-time index m is normally considered to be "slow" time and usually not expressed in as high resolution as time n.

The magnitude squared of the STFT yields the spectrogram of the function:

spectrogram{x}(τ, ω) = |X(τ, ω)|²
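To make the discrete-time STFT and the spectrogram concrete, here is a small NumPy sketch that slides a Hamming window over the signal and stores the magnitude squared of the DFT of each frame. The window length, hop size, and the 440 Hz test tone are assumptions made only for this illustration.

import numpy as np

def stft(x, frame_len=512, hop=160):
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for m in range(n_frames):
        X[m] = np.fft.rfft(x[m * hop : m * hop + frame_len] * w)
    return X

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # a 440 Hz test tone
spectrogram = np.abs(stft(x)) ** 2       # magnitude squared of the STFT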

Consider a signal consisting of two frequencies, one frequency f1 existing over an interval T and a second frequency f2 existing over another interval T. The Fourier transform gives two sinc functions existing over all time. One approach which can give information on the time resolution of the spectrum is the short-time Fourier transform (STFT). Here a moving window is applied to the signal and the Fourier transform is applied to the signal within the window as the window is moved. In the STFT expression, x(t) is the signal itself, w(t) is the window function, and * denotes the complex conjugate. As the equation shows, the STFT of the signal is nothing but the FT of the signal multiplied by a window function.


The problem with the STFT has its roots in what is known as the Heisenberg uncertainty principle. This principle, originally applied to the momentum and location of moving particles, can be applied to the time-frequency information of a signal. The problem with the STFT has to do with the width of the window function that is used. To be technically correct, this width of the window function is known as the support of the window; if the window function is narrow, it is known as compactly supported. In the STFT the window is of finite length, so we no longer have perfect frequency resolution; if we made the length of the window infinite, just as it is in the FT, we would get perfect frequency resolution but lose all the time information.

The narrower we make the window, the better the time resolution, and better
the assumption of stationarity, but poorer the frequency resolution:

Narrow window ===> good time resolution, poor frequency resolution.


Wide window ===> good frequency resolution, poor time resolution.

The window function we use here is simply a Gaussian function in the form:

w(t)=exp(-a*(t^2)/2);

where a determines the length of the window, and t is the time.

The following figure shows four window functions of varying regions of


support, determined by the value of a.


Figure 3.5 Various Window Functions

3.3.7 Pre-emphasis

Pre-emphasis is processing of the input signal by a low-order digital FIR filter. This filter is usually a first-order FIR filter defined as

sp(n) = s(n) − λ·s(n − 1)

where λ is a pre-emphasis coefficient, usually lying in the interval (0.9, 1), s(n) is the original signal, and sp(n) is the pre-emphasized signal. Pre-emphasis is used to spectrally flatten the input signal in favour of the vocal tract parameters. It makes the signal less susceptible to later finite-precision effects.
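A short NumPy sketch of this first-order filter; the coefficient 0.97 is a commonly used value and only an assumption here.

import numpy as np

def pre_emphasis(s, lam=0.97):
    # sp(n) = s(n) - lam * s(n-1); the first sample is passed through unchanged
    return np.append(s[0], s[1:] - lam * s[:-1])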


Figure 3.6 Influence of the pre-emphasis


4. FEATURE EXTRACTION

Feature extraction is a crucial phase of the speaker verification process. A well-chosen feature set can result in good recognition, just as a wrongly chosen feature set can result in poor recognition. Many of the features described in the following are originally defined for an infinite continuous signal. This is not useful for real applications, since most real applications work with a finite discrete signal.

4.1 ENERGY

The energy of a signal expresses the strength of the signal. Its value is useful for voice activity detection, because the energy of the voice is higher than the energy of the noise. The energy of a finite discrete signal is defined as

E = Σ s(n)², summed over n = 1, …, Ntotal,

where Ntotal is the length of the whole finite discrete signal s(n). Energy calculated this way is the energy of the whole signal and is therefore called Long-Term Energy (LTE). Given the signal divided into J frames of N samples, we can define the energy of one frame as

E(j) = Σ s(j, n)², summed over n = 1, …, N,

where s(j, n) is the j-th frame of the original signal s(n) and j ∈ {1, 2, …, J}. The energy of one frame is called Short-Time Energy (STE).
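A brief NumPy sketch of both quantities, assuming the signal has already been cut into J frames of N samples as described in Section 3.3.2.

import numpy as np

def long_term_energy(s):
    # energy of the whole signal (LTE)
    return np.sum(s.astype(float) ** 2)

def short_time_energy(frames):
    # frames has shape (J, N); one energy value is returned per frame (STE)
    return np.sum(frames.astype(float) ** 2, axis=1)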


Figure 4.1 Shape of short time energy of frames of Czech word Emanuel

Figure 4.1 shows the shape of the short-time energy of all the frames as they follow one another in the Czech word 'Emanuel'. You can clearly distinguish the background noise and the consonants from the vowels. The energy of the background noise depends on the quality of the microphone used to record the signal: the higher the signal-to-noise ratio of the microphone, the better we can distinguish the background from the vowels and even from the consonants.

4.2 ZERO-CROSSING RATE(ZCR)

The Zero-Crossing Rate (ZCR) expresses how many times the signal crosses zero. The value of the zero-crossing rate is best suited to distinguishing noise from the vowels or the consonants. The zero-crossing rate of noise is usually much higher than the zero-crossing rate of vowels; however, the zero-crossing rate of some consonants is high as well, which makes the recognition of consonants more difficult.

The zero-crossing rate of the j-th frame is formulated as

ZCR(j) = (1/2) Σ |sign(s(j, n)) − sign(s(j, n − 1))|, summed over the samples of the frame,

where N is the length of the j-th frame of the signal s(n), j ∈ {1, 2, …, J}. The function sign(x) is defined as sign(x) = 1 for x ≥ 0 and sign(x) = −1 for x < 0.

Figure 4.2 shows an example of the shape of the short-time zero-crossing rate of all the frames as they follow one another in the German word 'Wiesbaden'. You can clearly see the high values of the zero-crossing rate for the noise and lower values for some phones.

Figure 4.2: Shape of the ZCR of the frames as in a German word ‘Wiesbaden’
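The frame-wise zero-crossing rate can be sketched in a few lines of NumPy; treating zero-valued samples as positive is an assumption of this illustration.

import numpy as np

def zero_crossing_rate(frames):
    # frames has shape (J, N); returns the number of zero crossings per frame
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)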

4.3 AUTOCORRELATION

Autocorrelation is a mathematical tool used frequently in signal processing for analysing functions or series of values (time-domain signals). It is the cross-correlation of a signal with itself, and is useful, for example, for detecting the presence of a periodic signal buried under noise, or for identifying the fundamental frequency of a signal that does not actually contain that frequency component but implies it with many harmonic frequencies.

The discrete autocorrelation R(k) of the discrete finite signal s(n) at lag k is formally defined as

R(k) = Σ (s(n) − µ)·(s(n + k) − µ), summed over n,

where µ is the mean value of the discrete finite signal s(n), and N is the length of the signal. In speech processing the first 12-32 autocorrelation coefficients are used. Usually, the order of the autocorrelation is approximately M = Fs + 4, where Fs is the sampling frequency given in kHz.
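An illustrative NumPy sketch of the mean-removed autocorrelation coefficients of one frame, using the rule of thumb M = Fs (in kHz) + 4 mentioned above; the frame content is a placeholder.

import numpy as np

def autocorrelation(s, order):
    s = s - np.mean(s)                 # remove the mean value
    N = len(s)
    return np.array([np.sum(s[:N - k] * s[k:]) for k in range(order + 1)])

fs_khz = 16
M = fs_khz + 4                         # order of the autocorrelation
frame = np.random.randn(512)
R = autocorrelation(frame, M)          # R[0] is the (mean-removed) frame energy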

4.4 LINEAR PREDICTIVE CODING (LPC)

Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good-quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute. This section describes the basic ideas behind linear prediction and discusses some of the issues involved in its use.

4.4.1 Basic Principles
LPC starts with the assumption that the speech signal is produced by a buzzer
at the end of a tube. The glottis (the space between the vocal cords) produces the
buzz, which is characterized by its intensity (loudness) and frequency (pitch). The
vocal tract (the throat and mouth) forms the tube, which is characterized by its
resonances, which are called formants. For more information about speech
production, see the Speech Production OLT.

LPC analyzes the speech signal by estimating the formants, removing their
effects from the speech signal, and estimating the intensity and frequency of the
remaining buzz. The process of removing the formants is called inverse filtering, and
the remaining signal is called the residue.

The numbers which describe the formants and the residue can be stored or
transmitted somewhere else. LPC synthesizes the speech signal by reversing the
process: use the residue to create a source signal, use the formants to create a filter
(which represents the tube), and run the source through the filter, resulting in speech.


Because speech signals vary with time, this process is done on short chunks of
the speech signal, which are called frames. Usually 30 to 50 frames per second give
intelligible speech with good compression.

The coefficients of Linear Predictive Coding (LPC) play an important role in speech signal processing. An approximation spred(n) of a speech signal s(n) can be calculated as a linear combination of the LP coefficients and the M previous samples of the original signal s(n) (an autoregressive model). This approximation is called the linear prediction of the original signal. The linear prediction is defined as

spred(n) = Σ a(m)·s(n − m), summed over m = 1, …, M,

where a(m), m = 1, 2, …, M, are the LPC coefficients, M is the number of LPC coefficients, called the order of the linear prediction, and s(n) is the original signal.
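As an illustration of how the LPC coefficients a(1), …, a(M) can be estimated from one frame, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion. This is a generic textbook procedure, not the LabVIEW implementation used in this work; note the sign convention explained in the comments.

import numpy as np

def lpc(frame, order):
    s = frame - np.mean(frame)
    N = len(s)
    R = np.array([np.sum(s[:N - k] * s[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = R[0]
    for m in range(1, order + 1):
        k = -(R[m] + np.dot(a[1:m], R[m - 1:0:-1])) / err   # reflection coefficient
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)
    # a holds the inverse filter A(z) = 1 + a(1)z^-1 + ... + a(M)z^-M;
    # the predictor coefficients in the text's convention are -a[1:]
    return a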

4.5 DISCRETE FOURIER TRANSFORM (DFT)

The Fourier transform is a very useful one, because it uses complex exponentials as its basis functions. A digital system T is a system that, given an input signal s(n), generates an output signal

y(n) = T{s(n)}

Linear digital systems, so-called linear shift-invariant (LSI) systems, are described as

y(n) = s(n) * h(n)

where * denotes the convolution operator. The convolution operator is commutative, associative, and distributive. LSI systems are characterized by the impulse response h(n) of the system.

Given a Fourier transform F and its inverse F⁻¹, we can define the cepstrum coefficients. The cepstrum coefficients are given as the inverse Fourier transform of the log-energy of the Fourier transform of a signal s(n):

c(v) = F⁻¹{ log |F{s(n)}|² }

The cepstrum coefficients can be used for some speaker recognition specific
purposes. It is useful for pitch estimation, since at the position of the pitch and at the
positions of its harmonics there are peaks. Actually, the coefficients of the cepstrum
can be used to separate the vocal tract impulse response and the generating signal.
The vocal tract response could be used for the speaker recognition, since they are
speaker dependent. However, there is a question asking how many of the cepstrum
coefficients should be used for this purpose. These coefficients were tested together
with the other ones.

4.6 DISCRETE COSINE TRANSFORM (DCT)

The Discrete Cosine Transform (DCT) is widely used in speech processing, largely because of its energy compaction property, which results in its coefficients being more concentrated at low indices than the coefficients of the DFT. This allows us to approximate a signal using fewer coefficients. There are several definitions of the DCT. The coefficients C(k) of one of them (the DCT-II) are defined as

C(k) = Σ s(n)·cos(πk(n + 1/2)/N), summed over n = 0, …, N − 1,

where s(n) is a real signal; its inverse is given by the corresponding inverse transform (the DCT-III).

The DCT-II can be derived from the DFT by assuming that s(n) is a real periodic signal with a period of 2N and with an even symmetry s(n) = s(2N − 1 − n). The DCT is used for computing the Mel-Frequency Cepstrum Coefficients or speaker-dependent cepstrum coefficients.
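A compact sketch of one common (unnormalized) DCT-II definition; normalization conventions differ between texts, so this is illustrative only.

import numpy as np

def dct2(s):
    N = len(s)
    n = np.arange(N)
    return np.array([np.sum(s * np.cos(np.pi * k * (n + 0.5) / N)) for k in range(N)])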

4.7 MEL-FREQUENCY CEPSTRUM COEFFICIENTS

Psychoacoustic experiments were undertaken to derive scales that attempt to model the natural response of the human perceptual system, since the cochlea of the inner ear acts as a spectrum analyser. Fletcher's work pointed to the existence of critical bands in the cochlear response. One class of critical band scales is called the Bark frequency scale; it ranges from 1 to 24 Barks, corresponding to 24 critical bands of hearing. Another scale like the Bark frequency scale is the Mel-frequency scale, which is linear below 1 kHz and logarithmic above. The Mel scale is based on experiments with simple sinusoidal tones. It can be approximated by

B(f) = 1125·ln(1 + f/700)

The Mel-Frequency Cepstrum Coefficients (MFCC) are defined as the real cepstrum of a windowed short-term signal derived from the FFT of that signal. They differ from the real cepstrum in the nonlinear frequency scale used to approximate the behaviour of the auditory system.

Given the DFT of an input signal, we define a filter bank with I filters. The i-th filter, i = 1, 2, …, I, is typically a triangular filter Hi(k) centred on the i-th Mel-scale frequency. These filters compute the average spectrum around each central frequency with increasing bandwidths.


C(i) = ln( Σ |S(k)|·Hi(k) ), summed over k, for i = 1, 2, …, I,

where S(k) is the magnitude of the FFT of the signal s(n), which is N samples long, and Hi(k) is the frequency response of the i-th filter. The Mel-Frequency Cepstrum is the discrete cosine transform of the I filter outputs:

c(j) = Σ C(i)·cos( πj(i − 1/2)/I ), summed over i = 1, …, I, for j = 0, 1, …, I − 1.

The number I of filters varies from 24 to 40 in different applications; however, for speech recognition only the first 13 coefficients are usually used.

A short-term frame of 25 ms multiplied by the Hamming window is typically used to calculate the MFCC and the delta coefficients. The frame shift is usually 10 ms.
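The MFCC chain described above can be sketched end to end as follows. The triangular filter shape, the choice of 24 filters, the 512-point FFT, and the retention of 13 coefficients are typical assumptions rather than values prescribed by this work.

import numpy as np

def mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)          # B(f) = 1125 ln(1 + f/700)

def mel_inv(b):
    return 700.0 * (np.exp(b / 1125.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    # triangular filters whose centres are equally spaced on the Mel scale
    edges = mel_inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    return fb

def mfcc(frame, fs=16000, n_filters=24, n_ceps=13, n_fft=512):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    energies = np.log(mel_filter_bank(n_filters, n_fft, fs) @ spectrum + 1e-12)
    i = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), i + 0.5) / n_filters)
    return dct @ energies

frame = np.random.randn(400)        # one 25 ms frame at 16 kHz (placeholder data)
coeffs = mfcc(frame)                # 13 MFCCs for this frame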

Figure 4.3 Example of a filter bank used for computation of MFCC

4.8 AVERAGE LONG-TERM LPC SPECTRUM

Long-term statistics of many different features are often used to recognize speech. However, speech recognition is not the only domain of long-term statistics; the average long-term LPC spectrum is applicable to speaker verification as well.

5. SOFTWARE PLATFORM

The software platform used by us was Lab VIEW (Laboratory Virtual


Instrument Engineering Workbench).

5.1 Lab VIEW

LabVIEW is a programming environment in which you create programs using a graphical notation (connecting functional nodes via wires through which data flows); in this regard, it differs from traditional programming languages like C, C++, or Java, in which you program with text. LabVIEW is much more than a programming language, however. It is an interactive program development and execution system designed for people, like scientists and engineers, who need to program as part of their jobs. The LabVIEW development environment works on computers running Windows, Mac OS X, or Linux. LabVIEW can create programs that run on those platforms, as well as Microsoft Pocket PC, Microsoft Windows CE, Palm OS, and a variety of embedded platforms, including Field Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), and microprocessors.

Using the very powerful programming language that many LabVIEW users affectionately call "G", LabVIEW can increase your productivity by orders of magnitude. Programs that take weeks or months to write using conventional programming languages can be completed in hours using LabVIEW, because it is specifically designed to take measurements, analyse data, and present results to the user.

LabVIEW offers more flexibility than standard laboratory instruments because it is software-based. You, not the instrument manufacturer, define the instrument functionality. Your computer, plug-in hardware, and LabVIEW comprise a completely configurable virtual instrument, giving you the instrument you need, when you need it, at a fraction of the cost of traditional instruments. When your needs change, you can modify your virtual instrument in moments.

5.2 Lab VIEW WORK

LabVIEW uses terminology, icons, and ideas familiar to scientists and engineers. It relies on graphical symbols rather than textual language to define a program's actions. Its execution is based on the principle of data flow, in which functions execute only after receiving the necessary data. Because of these features, you can learn LabVIEW even if you have little or no programming experience.

A LabVIEW program consists of one or more virtual instruments (VIs). Virtual instruments are called such because their appearance and operation often imitate actual physical instruments. However, behind the scenes, they are analogous to main programs, functions, and subroutines from popular programming languages like C or BASIC. Hereafter, a LabVIEW program is referred to as a 'VI', whether its appearance or function relates to an actual instrument or not.

A VI has two main parts: a front panel and a block diagram.

1. The front panel is the interactive user interface of a VI, so named because it simulates the front panel of a physical instrument. The front panel can contain knobs, push buttons, graphs, and many other controls and indicators. You can input data using a mouse and keyboard, and then view the results produced by your program on the screen.


2. The block diagram is the VI's source code, constructed in LabVIEW's graphical programming language, G. The block diagram is the actual executable program. The components of a block diagram are lower-level VIs, built-in functions, constants, and program execution control structures. You draw wires to connect the appropriate objects together to define the flow of data between them. Front panel objects have corresponding terminals on the block diagram, so data can pass from the user to the program and back to the user.

6. EXTRACTING THE FEATURES

6.1 MEL-CEPSTRUM OF SPEECH SIGNAL

Psychoacoustic experiments were undertaken to derive scales attempting to model the natural response of the human perceptual system, since the cochlea of the inner ear acts as a spectrum analyzer. Fletcher's work pointed to the existence of critical bands in the cochlear response. One class of critical band scales is the Bark frequency scale, corresponding to the critical bands of hearing. Another scale like the Bark frequency scale is the Mel-frequency scale, which is linear below 1 kHz and logarithmic above, with an equal number of samples taken below and above 1 kHz. The Mel scale is based on experiments with simple sinusoidal tones. The following figure shows the block diagram used to calculate the MFCC.


Figure 6.1: Block diagram to calculate the MFCC (speech signal, framing, windowing, Mel-scale filtering, discrete cosine transform, MFCC).

The speech waveform is first windowed with the analysis window w[n] and the discrete STFT X(n, ω) is computed as

X(n, ω) = Σm x(m)·w(n − m)·exp(−jωm)

where ω = 2πk/N with N as the DFT length. The magnitude |X(n, ω)| is then weighted by a series of filter frequency responses whose centre and edge frequencies and bandwidths match those of the auditory critical-band filters. This is called a Mel-scale filter bank.

Let us now continue the discussion of the Mel-cepstrum. The magnitude of the STFT is weighted by the filter bank, i.e. each frame is multiplied by the entire filter bank. The next step is to compute the energy in the STFT weighted by each Mel-scale filter response. The frequency response of the l-th filter is denoted by Vl(ω). The resulting energies are given as

Emel(n, l) = (1/Al) · Σ |Vl(ω)·X(n, ω)|², summed over ω from Ll to Ul,

where Ll and Ul denote the lower and upper cut-off frequencies of the l-th filter and where

Al = Σ |Vl(ω)|², summed over ω from Ll to Ul.

This normalizes the filters according to their varying bandwidths so as to give equal energies for a flat input spectrum.

The real cepstrum associated with Emel is known as the Mel-cepstrum and is computed as

Cmel(n, m) = (1/R) Σ log[Emel(n, r)]·cos(2πrm/R), summed over r = 1, …, R,

where m is the index of the cepstral coefficient and R is the number of filters.


Extraction of MFCC using Lab VIEW

Vector Quantization constructs a set of representative samples of a target speaker's enrolment utterances by clustering the feature vectors. Although a variety of clustering techniques exists, the most commonly used is k-means clustering (36). This approach partitions the N feature vectors into K disjoint subsets Sj so as to minimize an overall distance such as

D = Σj Σxi∈Sj ||xi − µj||²


where µj = (1/Nj) Σxi∈Sj xi is the centroid of the Nj samples in the j-th cluster.

The algorithm proceeds in two steps:

1. Compute the centroid of each cluster using an initial assignment of the feature
vectors to the clusters.

2. Reassign Xi to that cluster whose centroid is closest to it. These steps are
iterated until successive steps do not reassign samples.

Once the VQ model is established for a target speaker, scoring consists of evaluating D for the feature vectors in the test utterance.
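The two-step k-means procedure above, together with the scoring of a test utterance against the resulting codebook, can be sketched as follows; the codebook size K = 16 and the random data are illustrative placeholders.

import numpy as np

def kmeans_vq(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each vector to the cluster whose centroid is closest
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # step 1: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(K)])
        if np.allclose(new_centroids, centroids):    # stop when assignments no longer change
            break
        centroids = new_centroids
    return centroids

X = np.random.randn(500, 13)        # MFCC vectors of an enrolment utterance (placeholder)
codebook = kmeans_vq(X, K=16)       # speaker model: 16 centroids

Y = np.random.randn(200, 13)        # MFCC vectors of a test utterance (placeholder)
D = np.mean(np.min(((Y[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1))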

6.2 ESTIMATING THE FORMANTS

The basic problem of the LPC system is to determine the formants from the
speech signal. The basic solution is a difference equation, which expresses each
sample of the signal as a linear combination of previous samples. Such an equation is
called a linear predictor, which is why this is called Linear Predictive Coding.

The coefficients of the difference equation (the prediction coefficients)


characterize the formants, so the LPC system needs to estimate these coefficients. The
estimate is done by minimizing the mean-square error between the predicted signal
and the actual signal.

Formant information has been used as effective information in speech analysis, synthesis, and recognition systems. A well-known and highly accurate technique for extracting formant information is to solve a high-order equation having the LPC (Linear Predictive Coding) coefficients as constants, using a Newton-Raphson method.

You can detect formant tracks and pitch contour by using several methods.
The most popular method is the Linear Prediction Coding (LPC) method.


This method[5] applies an all-pole model to simulate the vocal tract. The
following Figure shows the flow chart of formant detection with the LPC method.

Figure 6.2: Formant Detection with the LPC Method

Applying the window w(n) breaks the source signal s(n) into signal blocks x(n). For each signal block x(n), the coefficients of an all-pole vocal tract model are estimated using the LPC method. After calculating the discrete Fourier transform (DFT) of the coefficients A(z), peak detection on 1/A(k) produces the formants.
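A hedged Python sketch of this chain: window the frame, estimate the LPC coefficients (re-using the lpc() routine sketched in Section 4.4), evaluate 1/|A(k)| on a DFT grid, and report the local maxima as formant candidates. The actual implementation in this work uses LabVIEW VIs; the parameter values here are assumptions.

import numpy as np

def formants_lpc(frame, fs=16000, order=20, n_fft=1024):
    a = lpc(frame * np.hamming(len(frame)), order)   # inverse filter A(z)
    A = np.fft.rfft(a, n_fft)                        # DFT of the coefficients
    envelope = 1.0 / (np.abs(A) + 1e-12)             # all-pole spectral envelope 1/|A(k)|
    peaks = [k for k in range(1, len(envelope) - 1)
             if envelope[k] > envelope[k - 1] and envelope[k] > envelope[k + 1]]
    return [k * fs / n_fft for k in peaks]           # candidate formant frequencies in Hz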


Formant Detection with the LPC Method by Using LabVIEW

6.3 ESTIMATING THE PITCH

Most musical sounds, as well as voiced speech sounds, have a periodic structure when viewed over short time intervals, and such sounds are perceived by the auditory system as having a quality known as pitch. Like loudness, pitch is a subjective attribute of sound that is related to the fundamental frequency of the sound, which is a physical attribute of the acoustic waveform.

The following figure shows the flow chart of pitch detection with the LPC
method. This method uses inverse filtering to separate the excitation signal from the
vocal tract and uses the real Cepstrum signal to detect the pitch.

Figure 6.4: Pitch Detection with the LPC Method

The source signal s(n) first goes through a low-pass filter (LPF) and is then broken into signal blocks x(n) by applying a window w(n). For each signal block x(n), the coefficients of an all-pole vocal tract model are estimated using the LPC method. These coefficients are used to inverse-filter x(n). The resulting residual signal e(n) passes through a system which calculates the real cepstrum. Finally, the peaks of the real cepstrum give the pitch.
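The corresponding cepstral pitch detector can be sketched as follows, again re-using the illustrative lpc() routine; the window, the LPC order, and the 60-500 Hz search range are assumptions, and the LabVIEW VIs described below are what this work actually uses.

import numpy as np

def pitch_cepstrum(frame, fs=16000, order=20):
    x = frame * np.hamming(len(frame))
    a = lpc(x, order)                                # LPC coefficients of the frame
    residual = np.convolve(x, a)[:len(x)]            # e(n): inverse-filtered (residual) signal
    spectrum = np.fft.rfft(residual)
    cep = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))   # real cepstrum of the residual
    lo, hi = int(fs / 500), int(fs / 60)             # pitch periods corresponding to 500-60 Hz
    period = lo + np.argmax(cep[lo:hi])
    return fs / period                               # pitch estimate in Hz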

The Advanced Signal Processing Toolkit includes the Modeling and Prediction VIs that you can use to obtain the LPC coefficients or AR model coefficients, as shown in Figure 6 and Figure 7. The LabVIEW Signal Processing VIs include the Scaled Time Domain Window VI and the FFT VI. We can use these VIs to apply a window to the source signal and to calculate the DFT of the signal block, respectively.


The Advanced Signal Processing Toolkit also includes the Correlation and
Spectral Analysis VIs that you can use to calculate the real Cepstrum of a signal, as
shown in the Figure.

Here we use the WA Multiscale Peak Detection VI to detect the resonance


peaks in the Cepstrum.

Pitch Detection with the LPC Method by Using Lab VIEW


7. RESULT

Graph: 7.1 Input speech signal

(Sample no. Vs. Amplitude)


Graph: 7.2 Single frame of the input speech

(Sample no. Vs. Amplitude)

Graph: 7.3 Windowing of single frame

(Sample no. Vs. Amplitude)


Graph: 7.4 FFT of Single frame

(Sample no. Vs. Magnitude)

Graph: 7.5 Mel- Spaced Filter Bank

(Frequency vs. Gain)


Graph: 7.6 after applying Mel-spaced filter bank

(Coefficient no. Vs. Amplitude)


Graph: 7.7 Mel Frequency Cepstral coefficients

(Coefficient no. Vs. Amplitude)

8. CONCLUSION

This work describes feature extraction as a part of speaker recognition systems. The work is mainly aimed at feature extraction and research in this field. Feature extraction using the LPC method is implemented on the LabVIEW 9.0 platform after removing the silence from the voice print.


9. FUTURE SCOPE


