
Design of MATLAB-Based Automatic Speaker Recognition Systems

Jamel Price
Ali Eydgahi
Department of Engineering and Aviation Science
University of Maryland Eastern Shore

Outline

- Introduction
  - Project Description
  - Motivation
- Speech Recognition Process
  - Digital Signal Processing
  - Feature Extraction
    - Mel Frequency Cepstrum Coefficients
  - Feature Matching
  - Signal Models
    - Hidden Markov Modeling
    - Dynamic Time Warping
- Proposed MATLAB-Based Speaker Recognition System
  - Sample Output
  - Automatic Speaker Recognition Data Files
- Future Work
  - Strategic Approach

Introduction

Project Description

- This project represents one of the many design and development activities offered to undergraduate students in the areas of Science, Technology, Engineering and Mathematics (STEM).
- This presentation describes the design of an automatic speaker recognition system using the MATLAB software environment. The work was part of a NASA Langley Research Center collaboration through the Chesapeake Information Based Aeronautic Consortium (CIBAC).

Motivation

- Advances in aerospace technology have brought tremendous resources within reach of today's pilots. This research responds to the urgent need to drastically and effectively reduce cockpit interface complexity and pilot workload.
- The proposed software attempts to deliver effective and commercially attractive voice interface solutions that allow pilots to interact with their cockpit environment in a safer and more efficient manner.

Speech Recognition Process

Digital Signal Processing

- Digital signal processing (DSP) is the processing of signals by digital means. Through various processing techniques, a discrete (digital) output analogous to the analog signal is produced. Signals are processed in order to improve signal quality or to extract information.
- A digital signal processor can be replicated in software using a data-manipulation and development environment such as MATLAB, as illustrated below.
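For illustration, a minimal sketch of handling a digitized signal in MATLAB follows. The file name command.wav is an assumption; wavread was the standard WAV reader in MATLAB releases of this era.

```matlab
% Read a digitized waveform and inspect it (file name is an assumption).
[x, fs] = wavread('command.wav');   % sampled signal and its sample rate
t = (0:length(x) - 1) / fs;         % time axis in seconds
plot(t, x);                         % view the discrete-time signal
xlabel('Time (s)'); ylabel('Amplitude');
```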

- Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in speech waves. This makes it possible to use a speaker's voice to verify their identity and control access to private services.
- At the highest level, all speaker recognition systems contain two modules: feature extraction and feature matching.
  - Feature Extraction: the process of extracting unique information from speech files that can later be used to identify the speaker.
  - Feature Matching: the process of actually identifying the speaker by comparing the unknown voice data with a database of known speakers stored in the system.

Block Diagram of Speaker Recognition System

Feature Extraction Process

- A voiceprint represents the most basic, yet unique, features of the speech command in the frequency domain.
- A voiceprint is merely a matrix of numbers in which each number represents the energy or average power heard in a particular frequency band during a specific interval.
- During the feature extraction stage, a database of voiceprints is created to be used as a reference in the feature matching stage.

- Techniques used to parametrically represent a voice command for speech recognition tasks include, but are not limited to, Mel-frequency cepstrum coefficients (MFCC).
- The MFCC technique was used in this project. It is based on the known variation of the human ear's critical bandwidths with frequency: filters are spaced linearly at low frequencies, below 1000 Hz, and logarithmically at high frequencies, above 1000 Hz. A sketch of the computation is shown below.
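The following is a minimal MATLAB sketch of the MFCC computation described above, not the project's actual code. The frame length, hop size, and filter count are assumed values, and simple_mfcc and melfilterbank are hypothetical helper names.

```matlab
% Minimal MFCC sketch for a mono signal x sampled at fs = 12000 Hz,
% the rate used in this project. All parameter values are assumptions.
function c = simple_mfcc(x, fs)
    N = 256; M = 100;                        % frame length and hop size (assumed)
    K = 20;                                  % number of mel filterbank channels
    frames = buffer(x, N, N - M, 'nodelay'); % split into overlapping frames
    frames = frames .* repmat(hamming(N), 1, size(frames, 2));  % window frames
    P = abs(fft(frames, N)).^2;              % power spectrum of each frame
    P = P(1:N/2+1, :);                       % keep non-negative frequencies
    H = melfilterbank(K, N, fs);             % K x (N/2+1) triangular mel filters
    c = dct(log(H * P + eps));               % log filterbank energies -> cepstrum
    c = c(1:13, :);                          % keep the first 13 coefficients
end

function H = melfilterbank(K, N, fs)
    mel  = @(f) 2595 * log10(1 + f / 700);   % Hz -> mel
    imel = @(m) 700 * (10.^(m / 2595) - 1);  % mel -> Hz
    edges = imel(linspace(mel(0), mel(fs/2), K + 2));  % K+2 band edges in Hz
    bins = floor(edges / fs * N) + 1;        % map edges to FFT bin indices
    H = zeros(K, N/2 + 1);
    for k = 1:K                              % rising and falling triangle slopes
        H(k, bins(k):bins(k+1))   = linspace(0, 1, bins(k+1) - bins(k) + 1);
        H(k, bins(k+1):bins(k+2)) = linspace(1, 0, bins(k+2) - bins(k+1) + 1);
    end
end
```

Each column of the returned matrix is the cepstral feature vector for one frame; the matrix as a whole is the voiceprint stored in the reference database.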

MFCC Process for Feature Extraction

Feature Matching Process

- To verify a voice command, the MFCC process is performed on the unknown utterance spoken into the system.
- This newly obtained voiceprint is compared against the reference voiceprints created and stored in the database during the feature extraction stage.
- Using a pattern recognition technique, similarities and differences between the unknown signal model and the reference voiceprints are determined.

Signal Models

- Real-world processes generally produce observable outputs which can be characterized as signals. Signals can be discrete or continuous (e.g., speech) in nature. There are two types of signal models:
  - Deterministic models (Dynamic Time Warping): generally exploit some known, specific properties of the signal to determine the values of new signal parameters.
  - Statistical models (Hidden Markov Modeling): attempt to characterize only the statistical properties of the signal via a stochastic process. The likelihood of an unknown signal is computed against a given model.

Dynamic Time Warping

- A common task with continuous data is comparing one series with another. In the case where the time series have the same component shapes but do not line up in time, we must warp the time axis of one or both series to achieve better alignment.
- A warping path defines a mapping between the series in question. There are exponentially many warping paths, but we are interested in the path which minimizes the warping cost.
- The reference voiceprint with the lowest warping cost against the input voice command is declared the match. A minimal sketch of the cost computation follows.
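Below is a minimal dynamic-time-warping cost sketch; the function name and calling convention are assumptions. The inputs a and b are feature sequences such as MFCC voiceprints, with one frame per column.

```matlab
% Accumulated-cost DTW between two feature sequences (illustrative sketch).
function cost = dtw_cost(a, b)
    n = size(a, 2); m = size(b, 2);
    D = inf(n + 1, m + 1);                   % accumulated-cost matrix
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            d = norm(a(:, i) - b(:, j));     % local frame-to-frame distance
            D(i + 1, j + 1) = d + min([D(i, j + 1), D(i + 1, j), D(i, j)]);
        end
    end
    cost = D(n + 1, m + 1);                  % total cost of the optimal path
end
```

The reference voiceprint yielding the smallest dtw_cost against the unknown voiceprint is reported as the match.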

Hidden Markov Modeling

- Given the model parameters created for the reference speakers, the probability of the hidden-state sequence that could have generated a particular unknown output sequence is computed using the Viterbi algorithm.
- The Viterbi algorithm makes one key assumption: the most likely hidden sequence up to a point t depends only on the observed event at point t and the most likely sequence at point t-1.

- In order to implement a speech recognition system using HMMs, the following steps must be taken (a minimal Viterbi sketch appears after this list):
  - For each reference voice command, a Markov model must be built using parameters that optimize the observations of the word.
  - Model likelihoods for all possible reference models against the unknown utterance must be computed using the Viterbi algorithm, followed by selection of the reference with the highest model likelihood value.
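The following is a minimal log-domain Viterbi sketch, not the project's actual code; the variable names are assumptions. A is the S x S state-transition matrix, p0 the S x 1 initial state distribution, and logB an S x T matrix of log observation likelihoods for the T frames of the unknown utterance.

```matlab
% Log-likelihood of the single best hidden-state path (illustrative sketch).
function logp = viterbi_loglik(A, p0, logB)
    [S, T] = size(logB);
    V = log(p0) + logB(:, 1);                % best log-prob ending in each state
    logA = log(A);
    for t = 2:T
        % Extend the best path into each state j from its best predecessor i.
        V = max(repmat(V, 1, S) + logA, [], 1)' + logB(:, t);
    end
    logp = max(V);                           % score of the best state path
end
```

The unknown utterance is scored against each reference model in turn, and the model with the highest log-likelihood identifies the speaker.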

Proposed MATLAB-Based Speaker Recognition System

Training

- Typing main in the MATLAB command window executes the program.
- The main menu gives the user the option of training the system, testing the system, or clearing the system's database.
- Selecting Train gives the user the option of creating an entirely new reference database or adding a speaker to the existing database.
- During the training phase, analog voice data is converted to discrete numerical data using the MFCC code. These voiceprints are then automatically stored in an Excel database according to a unique identifier. A sketch of this step is shown below.
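An illustrative training-step sketch follows, reusing the simple_mfcc helper above. The recording length, the one-worksheet-per-speaker layout, and the speakerId variable are assumptions, not the project's actual design.

```matlab
% Record one utterance, extract its voiceprint, and store it in the database.
fs = 12000;                                  % 12 kHz, 16-bit, as in this project
r = audiorecorder(fs, 16, 1);                % mono recorder
recordblocking(r, 2);                        % capture a 2-second utterance (assumed)
x = getaudiodata(r);
c = simple_mfcc(x, fs);                      % compute the voiceprint
xlswrite('acoustic_data.xls', c, speakerId); % one worksheet per speaker (assumed)
```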

Reference Database in Excel

Testing

- During the testing phase, a voiceprint is created from the unknown voice data and stored as a temporary Excel file.
- If Dynamic Time Warping is selected as the pattern matching algorithm, this unknown voiceprint is compared against each voiceprint stored in the reference database, and the identifier of the voiceprint with the smallest warping cost is returned.
- If Hidden Markov Modeling is selected, an HMM is created for the unknown voiceprint as well as for each voiceprint in the database. The unknown model is then compared against the reference models, and the identifier of the voiceprint with the highest likelihood value is returned. A sketch of the DTW branch is shown below.
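The sketch below illustrates the DTW branch of the testing phase using the dtw_cost function above. The per-worksheet file layout and the numSpeakers variable are assumptions, not the project's actual code.

```matlab
% Compare the unknown voiceprint to every reference and return the best match.
unknown = xlsread('test_data.xls');          % voiceprint of the unknown utterance
best = inf; id = 0;
for k = 1:numSpeakers                        % numSpeakers: entries in the database
    ref = xlsread('acoustic_data.xls', k);   % reference voiceprint on sheet k
    c = dtw_cost(ref, unknown);
    if c < best
        best = c; id = k;                    % keep the smallest warping cost
    end
end
fprintf('Recognized speaker: %d (warping cost %.2f)\n', id, best);
```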

Dynamic Time Warping Command Window Output

Hidden Markov Modeling Command Window Output

Graphical Output

Automatic Speaker Recognition Data Files Created During the Recognition Process


- acoustic_data.xls: a workbook which contains the voiceprints of each speaker; analogous to a database.
- *.wav: the input voice waveform of each speaker, recorded at 16 bits with a sample rate of 12 kHz. Although these files are pictured here, they are normally deleted immediately after feature extraction.
- *.mat: a proprietary MATLAB file format which stores data in binary form. Hidden Markov models are stored with the .mat extension.
- test_data.xls, unknown.wav, unknown.mat (not pictured): contain the voiceprint, waveform, and Hidden Markov model of the unknown speaker, respectively. These files are deleted immediately following the testing stage.

Future Work

- Using an additional program referred to as the MATLAB Compiler, we will convert the application into self-contained C code.
- Implement the speech recognition program on Texas Instruments DSP hardware (TMS320).

Development of a Speech Recognition System

- An automatic speech recognition system is termed a speaker-independent system. Unlike a speaker recognition system, which is speaker dependent, a speaker-independent system is designed to work for any speaker of a specific language, such as English.
- Tasks will be executed upon system recognition of speaker-independent voice commands stored in the system's reference database.

Reference Database in MATLAB
