
Artificial Neural Network for Speech Recognition

Overview:

• Presenting an artificial neural network to recognize and classify speech.
• Spoken digits: "one", "two", "three", etc.
• Choosing a speech representation scheme.
• Training the perceptron.
• Results.

Representing Speech:

Problem:
• Recorded samples never produce identical waveforms; they vary in:
  - Length
  - Amplitude
  - Background noise
  - Sample rate
• However, the perceptual information relevant to speech remains consistent.

Solution:
• Extract the speech-related information.
• See: spectrogram.

Representing Speech:

[Figure: waveforms and spectrograms of two different recordings of "one"]

Spectrogram:

• Shows the change in amplitude spectra over time.
• Three dimensions:
  - X axis: time
  - Y axis: frequency
  - Z axis: color intensity represents magnitude.
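
As a rough illustration, a spectrogram like this can be plotted in MATLAB with the Signal Processing Toolbox; the file name, window length, and overlap below are assumptions for the sketch, not values from these slides.

    % Minimal sketch: plot the spectrogram of one recorded sample.
    [s, fs] = audioread('one.wav');                  % hypothetical recording
    spectrogram(s, hamming(256), 128, 256, fs, 'yaxis');
    title('Spectrogram of "one"');
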
Mel Frequency Cepstrum Coefficients:

• The spectrogram provides a good visual representation of speech, but it still varies significantly between samples.
• Cepstral analysis is a popular method for feature extraction in speech recognition applications, and can be accomplished using Mel Frequency Cepstrum Coefficient (MFCC) analysis.

Mel Frequency Cepstrum Coefficients:

• The inverse Fourier transform of the log of the Fourier transform of a signal, using the Mel-scale filterbank.
• The mfcc function returns vectors of 13 dimensions.
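
In implementations, that final inverse-transform step is usually realized as a discrete cosine transform of the log filterbank energies. A sketch of the standard formulation (the exact filterbank in Slaney's toolbox may differ in detail), where $E_k$ is the energy in the $k$-th Mel filter and $K$ is the number of filters:

    c_n = \sum_{k=1}^{K} \log(E_k)\,\cos\!\left[\frac{n\,\pi}{K}\left(k - \tfrac{1}{2}\right)\right], \qquad n = 0, 1, \dots, 12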


Network Architecture:

Input layer:
• 26 cepstral coefficients.

Hidden layer:
• 100 fully-connected hidden-layer units.
• Weights range between -1 and +1.
• Initially random.
• Remain constant.

Output:
• 1 output unit for each target.
• Limited to values between 0 and +1.
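
A minimal sketch of this setup in MATLAB; the variable names W, v, and bias match the formulas on the later slides, and the uniform initialization is an assumption consistent with the stated -1 to +1 range.

    nInput  = 26;                      % cepstral coefficients per stimulus
    nHidden = 100;                     % fully-connected hidden-layer units
    W = 2*rand(nHidden, nInput) - 1;   % input-to-hidden weights in [-1, +1], held constant
    v = 2*rand(1, nHidden) - 1;        % hidden-to-output weights, trained later
    bias = -1;                         % bias value reported on the Results slide
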
Sample Training Stimuli (Spectrograms):

[Figure: spectrograms of the training stimuli "one", "two", and "three"]

Training the network:

Spoken digits were recorded:
• Seven samples of each digit.
• "One" through "eight" recorded.
• Total of 56 different recordings with varying lengths and environmental conditions.
• Background noise was removed from each sample.


Training the network:

• Calculate the MFCC using Malcolm Slaney's Auditory Toolbox:
  c = mfcc(s, fs, fix((3*fs)/(length(s)-256)));
• This limits the frame rate so that mfcc always produces a matrix of two vectors, corresponding to the coefficients of the two halves of the sample.
• Convert the 13x2 matrix to a 26-dimensional column vector:
  c = c(:);
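
Putting both steps together, a hedged sketch of the full feature-extraction pass over the recordings; the file-naming scheme and the 26x56 stimulus matrix S are assumptions for illustration (mfcc comes from Slaney's Auditory Toolbox).

    % Build a 26x56 stimulus matrix, one column per recording.
    nSamples = 56;
    S = zeros(26, nSamples);
    for i = 1:nSamples
        [s, fs] = audioread(sprintf('digit%02d.wav', i));   % hypothetical file names
        c = mfcc(s, fs, fix((3*fs)/(length(s)-256)));       % 13x2 coefficient matrix
        S(:,i) = c(:);                                      % flatten to a 26x1 vector
    end
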
Training the network:

Supervised learning:
• Choose the intended target and create a target vector.
• 56-dimensional target vector.
• If training the network to recognize a spoken "one", the target has a value of +1 for each of the known "one" stimuli and 0 for everything else (a short sketch follows).
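
A minimal sketch of the target vector, assuming the stimuli are ordered so that the seven "one" recordings occupy the first seven columns of S:

    t = zeros(1, 56);   % one entry per training stimulus
    t(1:7) = 1;         % +1 for the seven known "one" recordings (assumed ordering)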

Train a multilayer perceptron with the feature vectors (simplified; see the sketch below):
• Select a stimulus at random.
• Calculate the response to the stimulus.
• Calculate the error.
• Update the weights.
• Repeat.
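
A hedged sketch of that loop in MATLAB, using W, v, bias, S, and t as built in the earlier sketches; the sigmoid definition, learning rate, and iteration count follow the values given on the next slides.

    sigmoid = @(x) 1./(1 + exp(-x));    % transfer function (next slide)
    gamma = 1;                          % learning rate (Results slide)
    for iter = 1:3000
        i = randi(56);                  % select a stimulus at random
        h = sigmoid(W*S(:,i) + bias);   % hidden-layer response
        o = sigmoid(v*h + bias);        % output response
        v = v + gamma*(t(i) - o)*h';    % update the output weights by the error
    end
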
Training the network:

• In a finite amount of time, the perceptron will successfully learn to distinguish between stimuli of the intended target and all other stimuli.
• Sigmoid transfer function: sigmoid(x) = 1/(1 + e^(-x)); maps values to between 0 and +1.
• Calculate the response to a stimulus:
  - Calculate the hidden layer: h = sigmoid(W*s + bias).
  - Calculate the response: o = sigmoid(v*h + bias).
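
The forward pass in isolation, as a hedged MATLAB sketch; s here is a single 26x1 feature vector, with W, v, and bias as defined earlier.

    sigmoid = @(x) 1./(1 + exp(-x));   % squashes any real value into (0, 1)
    s = S(:,1);                        % one 26x1 feature vector
    h = sigmoid(W*s + bias);           % 100x1 hidden-layer activations
    o = sigmoid(v*h + bias);           % scalar output between 0 and +1
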
Training the network:

Calculate error:
• For a given stimulus, the error is the difference between the target and the response: t - o.
• t will be either 0 or 1.
• o will be between 0 and +1.

Update weights:
• v = v_previous + γ(t - o)h^T
• v is the weight vector between the hidden-layer units and the output.
• γ (gamma) is the learning rate.
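
In MATLAB this update is a single line; note that only v changes, since the input-to-hidden weights W remain constant per the architecture slide (gamma, t(i), o, and h as computed in the training-loop sketch above).

    v = v + gamma*(t(i) - o)*h';   % h' is the transpose of the 100x1 hidden vector
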
Results:

Target = "one"
• Learning rate: +1
• Bias: -1
• 100 hidden-layer units
• 3000 iterations
• 316 seconds to learn the target

Results:

Response to unseen stimuli:
• Stimuli were produced by the same voice used to train the network, with noise removed.
• The network was tested against eight unseen stimuli corresponding to the eight spoken digits.
• It returned 1 (full activation) for "one" and zero for all other stimuli.
• Results were consistent across targets, i.e. when trained to recognize "two", "three", etc.:
  sigmoid(v*sigmoid(W*t1 + bias) + bias) == 1
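
A hedged sketch of how one such test could be run on a held-out recording; the file name is hypothetical, and sigmoid, W, v, and bias come from the training sketches.

    % Classify one unseen recording with the trained weights.
    [s, fs] = audioread('one_unseen.wav');           % hypothetical held-out sample
    c = mfcc(s, fs, fix((3*fs)/(length(s)-256)));    % same features as in training
    x = c(:);                                        % 26x1 feature vector
    o = sigmoid(v*sigmoid(W*x + bias) + bias);       % near 1 for "one", near 0 otherwise
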
Results:

Response to a noisy sample:
• The network returned a low but greater-than-zero response to a sample from which background noise had not been removed.

Response to a foreign speaker:
• The network responded with mixed results when presented with samples from speakers different from the one who produced the training stimuli.

In all cases, the error rate decreased and accuracy improved with more learning iterations.

References:

• Jurafsky, Daniel and Martin, James H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Prentice Hall.

• Golden, Richard M. (1996). Mathematical Methods for Neural Network Analysis and Design (1st ed.). MIT Press.

• Anderson, James A. (1995). An Introduction to Neural Networks (1st ed.). MIT Press.

• Hosom, John-Paul, Cole, Ron, Fanty, Mark, Schalkwyk, Johan, Yan, Yonghong, and Wei, Wei (1999, February 2). Training Neural Networks for Speech Recognition. Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology. http://speech.bme.ogi.edu/tutordemos/nnet_training/tutorial.html

• Slaney, Malcolm. Auditory Toolbox. Interval Research Corporation.