
Introduction

In daily life there is a need for controlled access to certain information and places for
security. Typically, such a secure identification system requires a person to use a card
(something the user has) or a PIN (something the user knows) to gain access to the
system. However, both of these methods have shortcomings: the token or secret that
grants access can be stolen, lost, or misused.
This project describes how a speaker recognition model using MFCC and VQ was
designed, built, and tested for male and female voice recognition.

The desire for a more secure identification system, in which the physical human
self is the key to the system, has led to research into biometric recognition
systems. Biometric features have two main properties. Behavioral characteristics,
such as voice and signature, are the result of body movements; in the case of voice,
the signal reflects the physical properties of the voice production organs, and the
articulatory process and the resulting speech are never exactly the same, even when
the same person utters the same sentence. Physiological characteristics refer to the
actual physical properties of a person, such as fingerprint, iris, and hand geometry
measurements.

Speaker recognition is the automatic process of identifying an unknown speaker
from an input speech signal. Like speech recognition, speaker recognition plays an
important role in signal processing. Speaker recognition systems fall into two
categories: speaker identification and speaker verification. In speaker
identification, the unknown speaker is identified from a given set of speakers
using a best-matching technique. In speaker verification, the unknown speaker's
input is compared against the model of the claimed identity, and the claim is
accepted or rejected accordingly.
This project offers a practical solution for access security. Because the
characteristics of a person's voice vary from individual to individual, the voice
can be used to provide a unique identity for a user.
Speaker recognition is the process of automatically recognizing who is speaking on
the basis of individual information included in speech waves. This technique
makes it possible to use the speaker's voice to verify their identity and control
access to services such as banking by telephone, telephone shopping, database
access services, information services, voice mail, security control for confidential
information areas, and remote access to computers.

Some of the possible applications of biometric systems include user-interface
customization and access control, such as airport check-in, building access control,
telephone banking, or remote credit card purchases. Speech technology offers many
possibilities for personal identification that are natural and non-intrusive. In
addition, it makes it possible to verify the identity of a person remotely, over
long distances, using an ordinary telephone.

A conversation between people contains a lot of information beyond the
communication of ideas. Speech also conveys the gender, emotion, attitude, state of
health, and identity of a speaker. The topic of this thesis is speaker recognition:
the task of recognizing people by their voices.
Feasibility Study
Speech recognition is the process of automatically recognizing a certain word
spoken by a particular speaker, based on individual information included in the
speech waves. This technique makes it possible to use the speaker's voice to verify
his or her identity and provide controlled access to services such as voice-based
biometrics, database access services, voice-based dialing, voice mail, and remote
access to computers.

The signal-processing front end that extracts the feature set is an important stage
in any speech recognition system. Despite the extensive efforts of researchers, the
optimum feature set has not yet been settled. There are many types of features,
derived in different ways, each with its own impact on the recognition rate. This
project presents one technique for extracting, from a speech signal, a feature set
that can be used in speech recognition systems.

The key is to convert the speech waveform to some type of parametric
representation (at a considerably lower information rate) for further analysis and
processing. This is often referred to as the signal-processing front end. A wide
range of possibilities exists for parametrically representing the speech signal for
the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency
Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and
most popular, and it is the representation used in this project. MFCCs are based on
the known variation of the human ear's critical bandwidths with frequency: filters
spaced linearly at low frequencies and logarithmically at high frequencies are used
to capture the phonetically important characteristics of speech. Another key
characteristic of speech is quasi-stationarity: the signal is stationary over short
intervals, so it is studied and analyzed using short-time, frequency-domain
analysis.

To achieve this, we first made a comparative study of the MFCC approach.
The voice-based biometric system is based on isolated, single-word recognition.
A particular speaker utters the password once in the training session so that the
system can extract and store the features of the access word. Later, in the testing
session, the user utters the password again; access is granted if there is a match.
The feature vectors unique to that speaker are obtained in the training phase and
are later used to authenticate the same speaker when he or she utters the same word
in the testing phase. At this stage an intruder can also probe the system's inherent
security by uttering the same word.
Literature Survey

The appeal of employing voice for many purposes in daily life has driven engineers
and scientists to conduct a massive amount of research and development in this
field. The aim of automatic speaker recognition (ASR) is to build a machine that
can identify a person by recognizing voice characteristics, or features, that are
unique to that person. The performance of modern recognition systems has improved
significantly thanks to various improvements in the algorithms and techniques
involved. ASR remains of great interest to researchers and engineers worldwide, and
its efficiency is still improving. This section highlights some of the important
techniques, algorithms, and research relevant to this report.

1) Feature Extraction
This module converts the speech signal into a set of feature vectors, i.e., it
reduces the dimensionality of the input speech signal. Different methods are used
for feature extraction, such as MFCC, PLP, and LPC. In this project, MFCC is used
because of its high accuracy. MFCCs are a representation of the short-term power
spectrum of a sound, based on the linear cosine transform of the log power spectrum
on a nonlinear mel scale of frequency. A block diagram of MFCC is shown in the
figure below.

The Mel-Frequency Cepstrum Coefficient (MFCC) technique is often used to create a
compact fingerprint of sound files. MFCCs are based on the known variation of the
human ear's critical bandwidths with frequency: filters spaced linearly at low
frequencies and logarithmically at high frequencies are used to capture the
important characteristics of speech. The mel-frequency scale has linear frequency
spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
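This linear-below-1-kHz, logarithmic-above behaviour is commonly approximated by the formula mel(f) = 2595 log10(1 + f/700). Although the project itself is implemented in Matlab, a minimal Python/NumPy sketch of the conversion (and its inverse) looks like this:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

Note that hz_to_mel(1000) comes out very close to 1000, the anchor point of the scale; below it the mapping is nearly linear, above it increasingly logarithmic.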

Fig. 1 shows the steps involved in MFCC. As shown in the figure, the continuous
speech signal coming from the microphone is processed over short periods of time:
it is divided into frames, each overlapping the previous one to give smooth
transitions. In the second step, a Hamming window is applied to each overlapping
frame to reduce the distortion caused by framing. After windowing, an FFT converts
the speech signal from the time domain to the frequency domain. In mel-frequency
wrapping, the spectrum of each frame is passed through a mel-scale band-pass filter
bank that mimics the human ear. In the final stage, the signal is converted back
using the Discrete Cosine Transform (DCT); the DCT is used instead of an inverse
FFT because it is more appropriate for the real-valued log spectrum.
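The steps above (framing, windowing, FFT, mel filterbank, DCT) can be sketched end to end. The following Python/NumPy toy implementation is illustrative only: the frame length (25 ms), hop (10 ms), filter count, and coefficient count are assumed values, not parameters taken from this project, and a production system would use a tested MFCC routine.

```python
import numpy as np

def mfcc_like(signal, fs, frame_len=400, hop=160, n_filt=20, n_ceps=12):
    """Toy MFCC front end: framing -> Hamming window -> |FFT| -> mel
    filterbank -> log -> DCT. Returns one row of coefficients per frame."""
    # 1. Split the signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # 2. Apply a Hamming window to each frame to reduce framing distortion.
    frames = frames * np.hamming(frame_len)
    # 3. Magnitude spectrum via FFT (zero-padded to nfft points).
    nfft = 512
    spec = np.abs(np.fft.rfft(frames, nfft))
    # 4. Triangular filters spaced evenly on the mel scale
    #    (linear below ~1 kHz in Hz terms, logarithmic above).
    mel_pts = np.linspace(0.0, 2595.0 * np.log10(1.0 + (fs / 2) / 700.0), n_filt + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((nfft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 5. Log filterbank energies, then DCT back to a cepstral representation.
    log_e = np.log(np.maximum(spec @ fbank.T, 1e-10))
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filt)))
    return log_e @ dct.T
```

With fs = 16000 the defaults correspond to 25 ms frames with a 10 ms hop, so one second of audio yields 98 frames of 12 coefficients each.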

2) Feature Match
Once the fingerprint of a speech signal, i.e., its set of feature vectors, has been
created, it is stored in a database as a speaker model. When an unknown speaker's
speech file is loaded into Matlab, its fingerprint is created as well, and its
vectors are compared against the vectors already present in the database using the
Euclidean distance; the best-matching speaker is then identified. This process is
called feature matching. Various methods are used to match the extracted voice
features to the stored models, such as Dynamic Time Warping (DTW), Vector
Quantization (VQ), and Gaussian Mixture Modeling (GMM). In this project we use
Vector Quantization.
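As a sketch of this matching step, assuming each enrolled speaker is represented by a codebook (a small matrix of codewords), the unknown speaker's feature vectors can be scored against every codebook by average nearest-codeword Euclidean distance. The names and array shapes below are illustrative, not from the project's Matlab code:

```python
import numpy as np

def identify(test_vectors, codebooks):
    """Return the enrolled speaker whose codebook gives the lowest average
    nearest-codeword Euclidean distance (VQ distortion) to the test vectors."""
    best_name, best_dist = None, np.inf
    for name, cb in codebooks.items():
        # Distance from every test vector to every codeword of this speaker.
        d = np.linalg.norm(test_vectors[:, None, :] - cb[None, :, :], axis=2)
        distortion = d.min(axis=1).mean()  # nearest codeword per vector, averaged
        if distortion < best_dist:
            best_name, best_dist = name, distortion
    return best_name, best_dist
```

For example, test vectors clustered near the origin are attributed to the speaker whose codewords also lie near the origin.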

Vector Quantization
A speaker recognition system must be able to model the distribution of the
estimated feature vectors. Since it is impractical to store every feature vector,
they are quantized into a small set of template vectors; this is vector
quantization. VQ is a process that takes a large set of feature vectors and
produces a small set representing the centroids of the distribution. These
centroids form a codebook for each speaker. In the recognition phase, the data from
the unknown speaker is compared against each speaker's codebook and the resulting
distortion is estimated; the recognition decision is based on this distortion.
Algorithms used for codebook generation include the K-means algorithm, the LBG
algorithm, SOM, and PNN.
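Codebook training can be sketched with plain K-means, the simplest of the algorithms listed above. This Python/NumPy version is illustrative; the codebook size and iteration count are assumed values:

```python
import numpy as np

def train_codebook(features, n_codewords=8, n_iter=20, seed=0):
    """Build a VQ codebook with plain K-means: each codeword ends up at the
    centroid of one cluster of training feature vectors."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise codewords from randomly chosen training vectors.
    cb = features[rng.choice(len(features), n_codewords, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest codeword.
        d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each codeword to the centroid of its cluster.
        for k in range(n_codewords):
            members = features[labels == k]
            if len(members):
                cb[k] = members.mean(axis=0)
    return cb
```

The LBG algorithm differs in that it grows the codebook by repeatedly splitting centroids; plain K-means is shown here only for brevity.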

Conceptual diagram illustrating vector quantization codebook formation.

Proposed Methodology
In this project we build and test an automatic speaker recognition system. There
are many approaches to such a program; among the best known are Linear Prediction
Coding (LPC) and Mel-Frequency Cepstrum Coefficients (MFCC). MFCC is perhaps the
strongest of these approaches and is the one used in building the program. The most
convenient platform for this is the Matlab environment, since many of the
operations we need, e.g. dct (discrete cosine transform) and fft (fast Fourier
transform), are already implemented in Matlab. Using the MFCC approach, the program
converts the user's voice into a set of coefficients, so that the voice is
transformed into a sequence of acoustic vectors. From these sequences the program
identifies the speaker by pattern recognition; specifically, Vector Quantization
(VQ) is used. VQ is a process of mapping vectors from a large vector space to a
finite number of regions in that space. Codewords are formed from the clusters of
vectors, and a collection of codewords is known as a codebook.

The program works in two similar stages:

1. Building a database of the users' voices.
2. Identifying the user.

First the program forms a database of users, converting each voice into a codebook.
Then, when a user speaks in stage 2, that voice is converted in the same way and
matched against the codebooks in the database. This is how the program identifies
its users.

Principles of Speaker Recognition

Speaker recognition can be classified into two tasks:

Identification
Speaker identification is the process of determining which registered speaker
provided a given utterance.

Verification
Speaker verification is the process of accepting or rejecting the identity claim of a
speaker.
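With the VQ approach used later in this report, verification reduces to a threshold test on the distortion against the claimed speaker's codebook. The threshold value below is an arbitrary placeholder, not a value from this project; in practice it would be tuned on held-out data:

```python
import numpy as np

def verify(test_vectors, claimed_codebook, threshold=1.0):
    """Accept the identity claim iff the average VQ distortion against the
    claimed speaker's codebook is below the decision threshold."""
    d = np.linalg.norm(test_vectors[:, None, :] - claimed_codebook[None, :, :], axis=2)
    distortion = d.min(axis=1).mean()
    return bool(distortion < threshold), distortion
```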
Figures 1 and 2 show the basic structures of speaker identification and
verification systems. The system described here is classified as a text-independent
speaker identification system, since its task is to identify the person who is
speaking regardless of what is being said. All speaker recognition systems contain
two main modules (refer to Figures 1 and 2):
Feature extraction
Feature extraction is the process that extracts a small amount of data from the
voice signal that can later be used to represent each speaker.
Feature matching
Feature matching involves the actual procedure to identify the unknown speaker by
comparing extracted features from his/her voice input with the ones from a set of
known speakers.

Continued
All speaker recognition systems have to serve two distinct phases.

Enrolment or training phase

In the training phase, each registered speaker has to provide samples of their
speech so that the system can build or train a reference model for that speaker.

Operational or testing phase

In the testing phase, the input speech is matched with the stored reference models
and a recognition decision is made.

