1. INTRODUCTION
Speech and hearing, man's most used means of communication, have been the objects of intense study for more than 150 years, from the time of von Kempelen's speaking machine to the present day. With the advent of the telephone and the explosive growth of its dissemination and use, the engineering of ever more bandwidth-efficient and higher-quality transmission systems has been the objective and the province of both engineers and scientists for more than 70 years. This work has been largely driven by real-world applications, which have now broadened to include not only speech synthesis but also automatic speech recognition systems, speaker verification systems, speech enhancement systems, efficient speech coding systems, and speech and voice modification systems. The objective of the engineers has been to design and build workable, economically affordable systems that can be used over the broad range of existing and newly installed communication channels.
Process of Registration of person for Speaker Identification and Feature Extraction
2. LITERATURE SURVEY
During speaking, we take short spurts of air and release them steadily. We override our rhythmic breathing by making the duration of exhaling roughly equal to the length of a sentence or phrase, during which the lung air pressure is maintained at an approximately constant level.
1. Larynx
The larynx is the voice box. The vocal folds are part of the larynx. The vocal
folds vibrate to create the sound of the voice.
2. Pharynx
The pharynx is the throat. It goes up from the larynx and divides into the
laryngopharynx (just above the larynx), oropharynx (going into the mouth)
and nasopharynx (going into the nose).
3. Trachea
The trachea is your windpipe. It's the tube that connects your lungs to your
throat. The larynx sits on the top of the trachea.
4. Oesophagus
The oesophagus is your food pipe. It's just behind the larynx and trachea. Your
pharynx carries both air and food/water. The air goes through the larynx and
trachea, and food and water go into your oesophagus.
5. Spinal column
The spinal column is behind the oesophagus.
6. Diaphragm
The diaphragm is underneath the lungs, inside the rib cage. It's shaped like a
dome. The diaphragm is your main muscle for controlling respiration
(breathing).
Vocal Folds (Vocal Cords) are the remarkable structures that provide a valve for the airway and also vibrate to produce the voice. The vocal folds are multilayered structures, consisting of a muscle covered by a mucosal layer.
This chart shows how fast the vocal folds vibrate (in Hertz) when you are singing certain pitches. Notice that every time you go up an octave, you double the frequency of vibration.
The faster the vocal folds vibrate, the higher the pitch. Extremely slow vocal fold vibration is about 60 vibrations per second and produces a low pitch. Extremely fast vocal fold vibration approaches 2000 vibrations per second and produces a very high pitch; only the highest sopranos can attain those extremely high pitches. In general, men's vocal folds can vibrate from 90 to 500 Hz, and they average about 115 Hz in conversation. Women's vocal folds can vibrate from 150 to 1000 Hz, and they average about 200 Hz in conversation.
The sounds of speech are generated in several ways. Voiced sounds (vowels, liquids, glides, nasals) are produced when the vocal tract is excited by the pulses of air pressure resulting from the quasi-periodic opening and closing of the glottal orifice (the opening between the vocal cords); examples are the vowels /UH/, /IY/, and /EY/, and the liquid consonant /W/. Unvoiced sounds are produced by creating a constriction somewhere in the vocal tract tube and forcing air through that constriction, thereby creating turbulent air flow, which acts as a random noise excitation of the vocal tract tube; examples are the unvoiced fricative sounds such as /SH/ and /S/. A third sound production mechanism arises when the vocal tract is partially closed off, causing turbulent flow due to the constriction while at the same time allowing quasi-periodic flow due to vocal cord vibrations; sounds produced in this manner include the voiced fricatives /V/, /DH/, /Z/, and /ZH/. Finally, plosive sounds such as /P/, /T/, and /K/ and affricates such as /CH/ are formed by momentarily closing off the air flow, allowing pressure to build up behind the closure, and then abruptly releasing it. All these excitation sources create a wide-band excitation signal for the vocal tract tube, which acts as a linear filter shaping the spectrum of the resulting sound.
This section discusses the properties of human sound perception that can be employed to create digital representations of the speech signal that are perceptually robust.
Figure 3.7 shows a schematic view of the human ear with its three distinct sound-processing sections: the outer ear, consisting of the pinna, which gathers sound and conducts it through the external canal to the middle ear; the middle ear, beginning at the tympanic membrane, or ear drum, and including three small bones, the malleus (also called the hammer), the incus (also called the anvil), and the stapes (also called the stirrup), which perform a transduction from acoustic waves to mechanical pressure waves; and finally the inner ear, which consists of the cochlea and the set of neural connections to the auditory nerve, which conducts the neural signals to the brain.
Figure 3.9 depicts a block diagram abstraction of the auditory system. The acoustic wave is transmitted from the outer ear to the inner ear, where the ear drum and bone structures convert the sound wave to mechanical vibrations which are ultimately transferred to the basilar membrane inside the cochlea. The basilar membrane vibrates in a frequency-selective manner along its extent and thereby performs a rough spectral analysis of the sound. Distributed along the basilar membrane are a set of inner hair cells that serve to convert motion along the basilar membrane to neural activity.
This produces an auditory nerve representation in both time and frequency. The processing at higher levels in the brain, shown in the figure as a sequence of central processing stages with multiple representations followed by some type of pattern recognition, is not well understood, and we can only postulate the mechanisms used by the human brain to perceive sound or speech. Even so, a wealth of knowledge about how sounds are perceived has been gathered from careful experiments that use tones and noise signals to stimulate the auditory system of human observers in very specific and controlled ways. These experiments have yielded much valuable knowledge about the sensitivity of the human auditory system to acoustic properties such as intensity and frequency.
A 100 Hz tone with a sound pressure level of 50 dB is not perceived to be as loud as a 1000 Hz tone having a sound pressure level of 50 dB. Careful measurements of this kind show that a 100 Hz tone must have a sound pressure level of about 60 dB in order to be perceived as equal in loudness to the 1000 Hz tone of sound pressure level 50 dB.
By convention, both the 50 dB 1000 Hz tone and the 60 dB 100 Hz tone are said to have a loudness level of 50 phons. The equal-loudness-level curves show that the auditory system is most sensitive for frequencies ranging from about 100 Hz up to 6 kHz, with the greatest sensitivity at around 3-4 kHz. This is almost precisely the range of frequencies occupied by most of the sounds of speech.
3. STATE OF THE ART
2. Recognition of isolated speech units, i.e. the units are spoken separately and the task is to find them in a signal and to recognise them.
3. Recognition and understanding of meaning, i.e. the meaning of the speech should be recognised.
There are many uses of speech recognition technology. Above all, phone banking, device control, or simple dictation can be mentioned.
There are several occasions when we want to identify a person from a given group of people even if the person is not present for physical examination. According to text dependency, there are two kinds of input voice: text-dependent and text-independent.
3.2.1 Text-Dependent
3.2.2 Text-Independent
In this condition, the content of the speech is unknown; the system has no prior knowledge of it. This is the harder case to implement, but it offers higher flexibility.
There are two different approaches to speaker recognition: the first is based on verification and the second on speaker identification. When verifying a speaker, there are two possible answers, 'yes' or 'no'.
The main difference between the two approaches is that in the case of speaker verification the user has to tell the system who he or she claims to be. The system then compares the stored pattern of the claimed user with the new pattern of the unknown user and answers 'yes' or 'no'.
All speaker recognition systems operate in two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase. Speaker recognition is a difficult task. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker.
Usually, the process of the speaker recognition consists of the following steps:
1. Speech Processing
2. Feature extraction
3. Pattern matching
4. Decision making
The first stages of speech signal processing are recording, digitizing, and pre-processing.
The signal s(n) is called the digital signal. Usually the sampling frequency Fs of speech signals lies in the range 8000 < Fs < 22050 Hz. Due to the limitations of the human organs for speech production and of the auditory system, typical speech communication is limited to a bandwidth of 8 kHz, which results in a sampling frequency of 16 kHz; this ordinarily satisfies our requirements.
3.3.2 Framing
Framing is the next important step in signal processing. The recorded discrete signal s(n) always has a finite length Ntotal, but is usually not processed as a whole. Instead, the signal is framed, i.e. cut into pieces; the length of one frame is N << Ntotal samples.
Given the total length Ntotal of the signal, the total count of frames is
J = int(Ntotal / N)
where the function int(x) returns the integer part of x. The frame length N used in real applications is usually based on a physiological characteristic of the vocal tract.
The vocal tract is not able to change its shape faster than fifty times per second, which gives us a period of T = 20 milliseconds. When the sampling frequency is Fs = 16000 Hz, we get a frame length of N = Fs·T = 320 samples.
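Although the project itself is built in LabVIEW, the framing step can be sketched in a few lines of Python (the function name and the random test signal are illustrative, not part of the project):

```python
import numpy as np

def frame_signal(s, fs=16000, frame_ms=20):
    """Cut a 1-D signal into non-overlapping frames of frame_ms milliseconds."""
    N = int(fs * frame_ms / 1000)   # samples per frame: 16000 * 0.02 = 320
    J = len(s) // N                 # J = int(Ntotal / N); the incomplete tail is dropped
    return s[:J * N].reshape(J, N)

s = np.random.randn(50000)          # stand-in for a recorded signal, Ntotal = 50000
frames = frame_signal(s)
print(frames.shape)                 # (156, 320)
```

With Ntotal = 50000 and N = 320 this yields J = int(50000/320) = 156 frames, matching the formula above.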
3.3.3 Windowing
Before further processing, the individual frames are windowed. The frame itself is implicitly windowed by a rectangular window; however, the spectral characteristic of the rectangular window is unsuitable. The windowed signal is defined as
Sw(n) = s(n) · w(n), n = 0, 1, …, N − 1
where Sw(n) is the windowed signal, s(n) is the original signal of length N samples, and w(n) is the window itself. The window w(n) should be of the same length as the original signal s(n). There are many types of windows used for this purpose.
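As a minimal sketch, windowing a frame with a Hamming window (one common choice of w(n); the choice of window here is illustrative) multiplies the frame pointwise by the window:

```python
import numpy as np

N = 320                       # frame length, matching the 20 ms frames above
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window w(n)

s_frame = np.random.randn(N)  # one frame s(n)
s_w = s_frame * w             # Sw(n) = s(n) * w(n)
```

The window tapers the frame edges toward zero, which reduces the spectral leakage caused by the implicit rectangular window.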
3.3.4 Cepstrum
The speech signal s(n) can be represented as a "quickly varying" source signal e(n) convolved with the "slowly varying" impulse response h(n) of the vocal tract, represented as a linear filter. Only the output (the speech signal) is accessible, and it is often desirable to eliminate the source part. The source and the filter parameters are convolved with each other in the time domain. The cepstrum is a representation of the signal in which these two components are resolved into two additive parts. It is computed by taking the inverse DFT of the logarithm of the magnitude spectrum of the frame:
c(n) = IDFT{ log |DFT{s(n)}| }
3.3.6 STFT
The signal is divided into short segments, over which it can be viewed as stationary so that the Fourier transform can be used. With the window moving along the time axis, the relation between frequency content and time can be identified.
Continuous-time STFT
In the continuous-time case, the STFT is
X(τ, ω) = ∫ x(t) · w(t − τ) · e^(−jωt) dt
The time index τ is normally considered to be "slow" time and usually not expressed in as high resolution as time t.
Discrete-time STFT
In the discrete-time case, the data to be transformed are broken up into chunks or frames (which usually overlap each other, to reduce artifacts at the boundaries). Each chunk is Fourier transformed, and the complex result is added to a matrix. This can be expressed as
X(m, ω) = Σn x[n] · w[n − m] · e^(−jωn)
with signal x[n] and window w[n]. In this case m is discrete and ω is continuous, but in most typical applications the STFT is performed on a computer using the Fast Fourier Transform, so both variables are discrete and quantized. Again, the discrete-time index m is normally considered to be "slow" time and usually not expressed in as high resolution as time n.
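The discrete-time STFT above can be sketched directly with an FFT per frame; this toy example (names and the test tone are illustrative) recovers the frequency of a 1 kHz tone sampled at 16 kHz:

```python
import numpy as np

def stft(x, w, hop):
    """Discrete STFT: for each frame position m, the DFT of x[n] * w[n - m]."""
    frames = [x[i:i + len(w)] * w
              for i in range(0, len(x) - len(w) + 1, hop)]
    return np.array([np.fft.fft(f) for f in frames])   # rows: time, columns: frequency

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)      # one second of a 1 kHz tone
w = np.hamming(320)                                    # 20 ms window
X = stft(x, w, hop=160)                                # 50 % overlap between frames
peak_bin = np.abs(X[0]).argmax()
print(peak_bin * fs / 320)                             # 1000.0
```

With a 320-point DFT the bin spacing is 50 Hz, so the tone falls exactly on bin 20.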
The Fourier transform alone gives, for example, two sinc functions existing over all time, with no indication of when each component occurred. One approach which can give information on the time resolution of the spectrum is the short-time Fourier transform (STFT). Here a moving window is applied to the signal, and the Fourier transform is applied to the signal within the window as the window is moved. In the defining equation, x(t) is the signal itself, w(t) is the window function, and * denotes the complex conjugate. As the equation shows, the STFT of the signal is nothing but the Fourier transform of the signal multiplied by a window function.
The problem with the STFT has its roots in what is known as the Heisenberg Uncertainty Principle. This principle, originally applied to the momentum and location of moving particles, can be applied to the time-frequency information of a signal. The problem with the STFT lies in the width of the window function that is used. To be technically correct, this width of the window function is known as the support of the window; if the window function is narrow, it is known as compactly supported. In the STFT the window is of finite length, so we no longer have perfect frequency resolution; if we made the length of the window infinite, just as it is in the FT, we would get perfect frequency resolution but lose all the time information.
The narrower we make the window, the better the time resolution and the better the assumption of stationarity, but the poorer the frequency resolution.
The window function we use here is simply a Gaussian function of the form:
w(t) = exp(−a·t²/2)
3.3.7 Pre-emphasis
Pre-emphasis is processing of the input signal by a low order digital FIR filter.
Sp(n)=S(n)-λs(n-1)
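A sketch of this first-order difference in Python (λ written as lam; the example input is illustrative):

```python
import numpy as np

def pre_emphasis(s, lam=0.97):
    """First-order FIR pre-emphasis: sp(n) = s(n) - lam * s(n-1).
    The first sample is passed through unchanged."""
    return np.append(s[0], s[1:] - lam * s[:-1])

s = np.array([1.0, 1.0, 1.0, 1.0])
print(pre_emphasis(s))    # [1.0, 0.03, 0.03, 0.03] (to within rounding)
```

On a constant (low-frequency) input the filter output is nearly zero, which is exactly the high-pass boost of the spectrum that pre-emphasis is meant to provide.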
4. FEATURE EXTRACTION
4.1 ENERGY
The energy of a signal expresses the strength of the signal. Its value is useful for voice activity detection, because the energy of the voice is higher than the energy of the noise. The energy of a finite discrete signal is defined as
E = Σn s(n)², n = 0, 1, …, Ntotal − 1
where Ntotal is the length of the whole finite discrete signal s(n). Energy calculated this way is the energy of the whole signal, and is therefore called Long-Term Energy (LTE). Given the signal divided into J frames of N samples each, we can define the energy of one frame as
E(j) = Σn s(j, n)², n = 0, 1, …, N − 1
where s(j, n) is the j-th frame of the original signal s(n) and j ∈ {1, 2, …, J}. The energy of one frame is called Short-Time Energy (STE).
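Both definitions reduce to sums of squares, so a sketch is only a few lines (the dummy all-ones frames are illustrative):

```python
import numpy as np

def short_time_energy(frames):
    """STE(j) = sum over n of s(j, n)^2, one value per frame."""
    return np.sum(frames ** 2, axis=1)

frames = np.ones((5, 320))          # 5 dummy frames, N = 320 samples each
ste = short_time_energy(frames)
lte = np.sum(ste)                   # long-term energy of the whole signal
print(ste[0], lte)                  # 320.0 1600.0
```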
Figure 4.1: Shape of the short-time energy of the frames of the Czech word 'Emanuel'
Figure 4.1 shows the shape of the short-time energy of all the frames as they go one after another in the Czech word 'Emanuel'. One can clearly distinguish the background noise and the consonants from the vowels. The energy of the background noise depends on the quality of the microphone used to record the signal: the higher the signal-to-noise ratio of the microphone, the better we can distinguish the background from the vowels and even from the consonants.
Figure 4.2 shows an example of the shape of the short-time zero-crossing rate of all the frames as they go one after another in the German word 'Wiesbaden'. You can clearly see the high values of the zero-crossing rate of the noise and the lower values of some phones.
Figure 4.2: Shape of the ZCR of the frames in the German word 'Wiesbaden'
4.3 AUTOCORRELATION
The discrete autocorrelation R(k) of the discrete finite signal s(n) at lag k is formally defined as
R(k) = Σn (s(n) − µ)(s(n + k) − µ), n = 0, 1, …, N − 1 − k
where µ is the mean value of the discrete finite signal s(n), and N is the length of the signal. In speech processing, the first 12-32 autocorrelation coefficients are used. Usually the order of the autocorrelation is approximately M = Fs + 4, where Fs is the sampling frequency given in kHz.
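A direct sketch of this definition, using the M = Fs + 4 rule of thumb for an 8 kHz signal (the random test signal is illustrative):

```python
import numpy as np

def autocorr(s, M):
    """Mean-removed autocorrelation R(k) for lags k = 0 .. M."""
    s = s - s.mean()                 # subtract the mean value mu
    N = len(s)
    return np.array([np.dot(s[:N - k], s[k:]) for k in range(M + 1)])

fs_khz = 8                           # sampling frequency in kHz
M = fs_khz + 4                       # rule of thumb: order M = Fs + 4 = 12
s = np.random.randn(4000)
R = autocorr(s, M)
print(R.shape)                       # (13,)
```

Note that R(0) is the largest coefficient in magnitude, since it equals the (unnormalized) signal variance.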
Linear Predictive Coding (LPC) is one of the most powerful speech analysis
techniques, and one of the most useful methods for encoding good quality speech at a
low bit rate. It provides extremely accurate estimates of speech parameters, and is relatively efficient to compute. This section describes the basic ideas behind linear prediction, and discusses some of the issues involved in its use.
4.4.1 Basic Principles
LPC starts with the assumption that the speech signal is produced by a buzzer
at the end of a tube. The glottis (the space between the vocal cords) produces the
buzz, which is characterized by its intensity (loudness) and frequency (pitch). The
vocal tract (the throat and mouth) forms the tube, which is characterized by its
resonances, which are called formants. For more information about speech
production, see the Speech Production OLT.
LPC analyzes the speech signal by estimating the formants, removing their
effects from the speech signal, and estimating the intensity and frequency of the
remaining buzz. The process of removing the formants is called inverse filtering, and
the remaining signal is called the residue.
The numbers which describe the formants and the residue can be stored or
transmitted somewhere else. LPC synthesizes the speech signal by reversing the
process: use the residue to create a source signal, use the formants to create a filter
(which represents the tube), and run the source through the filter, resulting in speech.
Because speech signals vary with time, this process is done on short chunks of
the speech signal, which are called frames. Usually 30 to 50 frames per second give
intelligible speech with good compression.
The coefficients of Linear Predictive Coding (LPC) play an important role in speech signal processing. An approximation Spred(n) of a speech signal s(n) can be calculated as a linear combination of the LP coefficients and the M previous samples of the original signal s(n) (an autoregressive model). This approximation is called the linear prediction of the original signal, and is defined as
Spred(n) = Σm a(m) · s(n − m), m = 1, 2, …, M
where a(m), m = 1, 2, …, M, are the LPC coefficients, M is the number of LPC coefficients, called the order of the linear prediction, and s(n) is the original signal.
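One common way to obtain the a(m) is the autocorrelation method: solve the Toeplitz normal equations built from the autocorrelation coefficients. The sketch below (a plain linear solve rather than the faster Levinson-Durbin recursion; the AR(2) test signal is illustrative) recovers the known coefficients of a synthetic autoregressive process:

```python
import numpy as np

def lpc(s, M):
    """LP coefficients a(1..M) from the autocorrelation normal equations."""
    N = len(s)
    R = np.array([np.dot(s[:N - k], s[k:]) for k in range(M + 1)])
    A = np.array([[R[abs(i - j)] for j in range(M)] for i in range(M)])  # Toeplitz matrix
    return np.linalg.solve(A, R[1:M + 1])

# known AR(2) process: s(n) = 0.75 s(n-1) - 0.5 s(n-2) + e(n)
rng = np.random.default_rng(0)
e = rng.standard_normal(8000)
s = np.zeros(8000)
for n in range(2, 8000):
    s[n] = 0.75 * s[n - 1] - 0.5 * s[n - 2] + e[n]

a = lpc(s, 2)
print(a)          # close to [0.75, -0.5]
```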
C(v) = F⁻¹{ log |F{s(n)}|² }
The cepstrum coefficients can be used for some purposes specific to speaker recognition. They are useful for pitch estimation, since there are peaks at the position of the pitch period and at the positions of its harmonics. Moreover, the coefficients of the cepstrum can be used to separate the vocal tract impulse response from the generating signal. The vocal tract response can be used for speaker recognition, since it is speaker dependent. However, there remains the question of how many cepstrum coefficients should be used for this purpose. These coefficients were tested together with the other ones.
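The real cepstrum itself is a short FFT/IFFT chain; a minimal Python sketch (the Hamming-windowed random frame and the cutoff of 20 coefficients are illustrative choices):

```python
import numpy as np

def real_cepstrum(frame):
    """c(n) = inverse DFT of the log magnitude spectrum of the frame."""
    spectrum = np.abs(np.fft.fft(frame)) + 1e-12   # small epsilon avoids log(0)
    return np.real(np.fft.ifft(np.log(spectrum)))

frame = np.hamming(320) * np.random.randn(320)
c = real_cepstrum(frame)
# the low-quefrency coefficients describe the slowly varying vocal tract
# (spectral envelope); the high-quefrency part carries the excitation
envelope_coeffs = c[:20]
print(c.shape, envelope_coeffs.shape)   # (320,) (20,)
```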
The Discrete Cosine Transform (DCT) is widely used in speech processing. It is popular because of its energy compaction property, which results in its coefficients being more concentrated at low indices than the coefficients of the DFT. This allows us to approximate a signal using fewer coefficients. There are several definitions of the DCT. The coefficients C(k) of one of them (the DCT-II) are defined as
C(k) = Σn s(n) · cos( πk(2n + 1) / (2N) ), n = 0, 1, …, N − 1
The DCT-II can be derived from the DFT by assuming that s(n) is a real periodic signal with a period of 2N and with the even symmetry s(n) = s(2N − 1 − n).
B(f) = 1125 · ln(1 + f/700)
maps a frequency f in hertz onto the perceptual mel scale. Given the DFT of the input signal, we define a filter bank with I filters. The i-th filter, i = 1, 2, …, I, is centred on the i-th mel-scale frequency. These filters compute the average spectrum around each central frequency with increasing bandwidths.
C(i) = ln( Σk S(k) · Hi(k) ), i = 1, 2, …, I
where S(k) is the magnitude of the FFT of the signal s(n), which is N samples long, and Hi(k) is the frequency response of the i-th filter. The Mel-Frequency Cepstrum is the discrete cosine transform of the I filter outputs.
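The whole MFCC chain described above (|FFT| → mel filter bank → log → DCT) can be sketched as follows. This is an illustrative implementation only: the triangular filter shape, the bin mapping, and all parameter values (I = 20 filters, 12 coefficients) are common conventions assumed here, not details taken from the project.

```python
import numpy as np

def mel(f):                      # B(f) = 1125 ln(1 + f/700)
    return 1125 * np.log(1 + f / 700)

def mel_inv(b):
    return 700 * (np.exp(b / 1125) - 1)

def mfcc(frame, fs=16000, I=20, n_coeffs=12):
    """Sketch of the MFCC chain: |FFT| -> triangular mel filter bank -> log -> DCT."""
    N = len(frame)
    S = np.abs(np.fft.rfft(frame))                       # magnitude spectrum S(k)
    # I triangular filters with centres equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0), mel(fs / 2), I + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    H = np.zeros((I, len(S)))
    for i in range(I):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        H[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising edge
        H[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling edge
    C = np.log(H @ S + 1e-12)                            # C(i) = ln(sum_k S(k) H_i(k))
    # DCT-II of the log filter-bank outputs gives the cepstral coefficients
    n = np.arange(I)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * I))
    return dct @ C

coeffs = mfcc(np.random.randn(320))
print(coeffs.shape)              # (12,)
```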
5. SOFTWARE PLATFORM
Using the very powerful programming language that many LabVIEW users affectionately call "G", LabVIEW can increase your productivity by orders of magnitude. Programs that take weeks or months to write using conventional programming languages can often be completed in a fraction of that time.
LabVIEW uses terminology, icons, and ideas familiar to scientists and engineers. It relies on graphical symbols rather than textual language to define a program's actions. Its execution is based on the principle of dataflow, in which functions execute only after receiving the necessary data. Because of these features, you can learn LabVIEW even if you have little or no programming experience.
1. The front panel is the interactive user interface of a VI, so named because it simulates the front panel of a physical instrument. The front panel can contain knobs, push buttons, graphs, and many other controls and indicators. You can input data using a mouse and keyboard, and then view the results produced by your program on the screen.
2. The block diagram is the VI's source code, constructed in LabVIEW's graphical programming language, G. The block diagram is the actual executable program. The components of a block diagram are lower-level VIs, built-in functions, constants, and program execution control structures. You draw wires to connect the appropriate objects together to define the flow of data between them. Front panel objects have corresponding terminals on the block diagram so data can pass from the user to the program and back to the user.
X(n, k) = Σm x(m) · w(n − m) · e^(−jωm)
where ω = 2πk/N with N as the DFT length. The magnitude spectrum is then weighted by a series of filter frequency responses whose centre and edge frequencies and bandwidths match those of the auditory critical-band filters. This is called a Mel-scale filter bank; its log outputs are followed by a discrete cosine transform to give the MFCC coefficients.
where L1 and U1 denote the lower and the upper cutoff frequencies of the 1st filter, and where
A1 =
where µj = (1/Nj) Σ(xi ∈ Sj) xi is the centroid of the Nj samples in the j-th cluster.
1. Compute the centroid of each cluster using an initial assignment of the feature vectors to the clusters.
2. Reassign each xi to the cluster whose centroid is closest to it.
These steps are iterated until successive iterations no longer reassign samples.
Once the VQ models are established for a target speaker, scoring consists of evaluating the distortion D for the feature vectors in the test utterance.
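The two-step iteration above is the classic k-means clustering loop; a small sketch (the 2-D blob data, seeds, and iteration count are illustrative) looks like this:

```python
import numpy as np

def kmeans_vq(X, J, iters=20, seed=0):
    """The two-step iteration from the text: centroids, then reassignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), J, replace=False)]
    for _ in range(iters):
        # step 2: reassign each x_i to the cluster with the closest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # step 1: recompute mu_j as the mean of the N_j samples in cluster j
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(J)])
    return centroids, labels

# two well-separated blobs of hypothetical 2-D feature vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(5.0, 0.1, (100, 2))])
centroids, labels = kmeans_vq(X, J=2)
print(np.sort(centroids[:, 0]).round(1))   # close to [0. 5.]
```

In a real system X would hold the cepstral or MFCC feature vectors of the enrolled speaker, and the resulting centroids form that speaker's VQ codebook.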
The basic problem of the LPC system is to determine the formants from the
speech signal. The basic solution is a difference equation, which expresses each
sample of the signal as a linear combination of previous samples. Such an equation is
called a linear predictor, which is why this is called Linear Predictive Coding.
You can detect formant tracks and pitch contour by using several methods.
The most popular method is the Linear Prediction Coding (LPC) method.
This method [5] applies an all-pole model to simulate the vocal tract. The following figure shows the flow chart of formant detection with the LPC method. Applying the window w(n) breaks the source signal s(n) into signal blocks x(n). From each signal block x(n), the coefficients of an all-pole vocal tract model are estimated by the LPC method. After calculating the discrete Fourier transform (DFT) of the coefficients A(z), peak detection on 1/A(k) produces the formants.
The following figure shows the flow chart of pitch detection with the LPC method. This method uses inverse filtering to separate the excitation signal from the vocal tract and uses the real cepstrum to detect the pitch.
The source signal s(n) first goes through a low-pass filter (LPF) and is then broken into signal blocks x(n) by applying a window w(n). From each signal block x(n), the coefficients of an all-pole vocal tract model are estimated by the LPC method. These coefficients are used to inverse-filter x(n). The resulting residual signal e(n) passes through a system which calculates the real cepstrum. Finally, the peak of the real cepstrum gives the pitch.
We can use these VIs to apply a window to the source signal and calculate the
DFT of the signal block, respectively.
The Advanced Signal Processing Toolkit also includes the Correlation and
Spectral Analysis VIs that you can use to calculate the real Cepstrum of a signal, as
shown in the Figure.
7. RESULT
8. CONCLUSION
9. FUTURE SCOPE