Spectral plot of a speech signal and two particular time slices showing
the energy across the various frequencies
The Task Pictorially
Example Formant Centers
Here you can see that we understand the general location, in terms of
frequency, of where formants will form

  Vowel    F1 (Hz)    F2 (Hz)
  u          320         800
  o          500        1000
             700        1150
  a         1000        1400
             500        1500
  y          320        1650
             700        1800
  e          500        2300
  i          320        2500

Unfortunately, there are a number of factors that influence the
formants, including
age, gender
excitement level and volume
what sounds are on both sides of the phoneme
the formants for the i in nine will appear differently than the
formants for the i in time!
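Since the table above gives a rough center for each vowel, a minimal sketch of vowel classification is a nearest-neighbor lookup against those (F1, F2) centers. This is an illustration, not a robust recognizer; the rows with missing vowel labels are omitted.

```python
# Sketch: classify a vowel by the nearest formant-center pair (F1, F2).
# Values come from the table above; unlabeled rows are left out.
import math

FORMANT_CENTERS = {
    "u": (320, 800),
    "o": (500, 1000),
    "a": (1000, 1400),
    "y": (320, 1650),
    "e": (500, 2300),
    "i": (320, 2500),
}

def classify_vowel(f1, f2):
    """Return the vowel whose (F1, F2) center is closest in Hz."""
    return min(FORMANT_CENTERS,
               key=lambda v: math.dist((f1, f2), FORMANT_CENTERS[v]))
```

Of course, the factors listed above (age, gender, surrounding sounds) shift real formants away from these centers, which is exactly why such a simple lookup fails in practice.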
Vowel Formant Frequencies
Consonant Place of Articulation
Co-articulation
As stated on the previous slide, the blending of one
articulatory sound into another causes variation in the
speech spectrum
The result is that there isn't a simple mapping from frequency
to phonetic unit
but instead a very complex, poorly understood situation
In speech recognition, the problem can be simplified by
insisting on discrete speech
Pauses between every pair of words
Within a word, we might model how the entire word should
appear (that is, our recognition units are words, not phonemes,
letters or syllables) or we might try to model how co-
articulation impacts the speech spectrum
But in continuous speech, the problem is magnified
because different combinations of words will create vastly
different spectra
Phonetic Dependence
Below are two waveforms created by uttering the same
vowel sound, ee as in three (on the left) and tea (on the right)
notice how dissimilar the ee portion is
the one on the right is longer and the one on the left has
higher-frequency formants
Isolated Speech Recognition
Early speech recognition concentrated on isolated
speech because continuous speech recognition
was just not possible
Even today, accurate continuous speech recognition is
extremely challenging
There are many advantages to isolated speech
recognition, but the primary ones are
The speech signal is segmented so that it is easy to
determine the starting point and stopping point of each
word
Co-articulation effects are minimized
A distinct gap (silence) appears between words
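The segmentation advantage above can be sketched as simple endpoint detection: find the stretches of signal whose short-time energy exceeds a silence threshold. The frame length and threshold here are illustrative placeholders, not values from any particular system.

```python
# Sketch: endpoint detection for isolated words via a short-time-energy
# threshold. frame_len and threshold are illustrative, not tuned values.

def find_word_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) frame-index pairs where energy exceeds threshold."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    words, start = [], None
    for idx, e in enumerate(energies):
        if e > threshold and start is None:
            start = idx                      # word onset
        elif e <= threshold and start is not None:
            words.append((start, idx))       # word offset
            start = None
    if start is not None:                    # signal ended mid-word
        words.append((start, len(energies)))
    return words
```

The distinct silent gaps between words are what make this thresholding approach workable; in continuous speech no such gaps exist.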
Early Speech Recognition
Bell labs 1952 implemented a system for isolated digit
recognition (of a single speaker) by comparing the
formants of the speech signal to expected frequencies
Training data are used
the training data are clustered
centroids are located for each cluster
each cluster is then described as a vector
the centroid + vector make up each codebook entry
see the next slide
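The clustering steps above can be sketched with k-means: cluster the training vectors, take the centroids as codebook entries, and quantize a new vector to its nearest entry. This is a generic sketch in two dimensions; a real system would cluster LPC feature vectors.

```python
# Sketch of codebook construction by k-means clustering, per the steps above.
# The feature vectors would come from LPC analysis; here they are 2-D points.
import math
import random

def build_codebook(vectors, k, iters=20, seed=0):
    """Cluster vectors and return the k centroids (the codebook entries)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:                    # assign each vector to nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:                      # recompute centroid as cluster mean
                dims = len(members[0])
                centroids[i] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dims))
    return centroids

def quantize(v, codebook):
    """Return the index of the nearest codebook entry for vector v."""
    return min(range(len(codebook)), key=lambda i: math.dist(v, codebook[i]))
```

Quantizing every input frame to a codebook index is what lets a discrete HMM treat the continuous speech signal as a sequence of symbols.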
Continued
The HMM Speech Process
Discrete speech of the 10 digits
Continuous Speech with HMM
Many simplifications made for discrete speech do not
work for continuous speech
HMMs will have to model smaller units, possibly phonemes
To reduce the search space, use a beam search
To ease the word-to-word transitions, use bigrams or trigrams
The process is similar to the previous slide where all
phoneme HMMs are searched using Viterbi, but here,
transition probabilities are included along with more
codebooks to handle the phoneme-to-phoneme and
word-to-word transitions
A 7-stage phoneme model as used in the Sphinx system (in this case, a /d/)
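The Viterbi search with beam pruning mentioned above can be sketched as follows. The model here is a toy two-state example with illustrative probabilities, not Sphinx's actual phoneme models; observations are discrete codebook indices.

```python
# Sketch: Viterbi decoding over a discrete-observation HMM with beam pruning.
# At each frame only the beam_width best-scoring states survive.
import math

def viterbi_beam(obs, states, start_p, trans_p, emit_p, beam_width=3):
    """Return (log_prob, best_path) for the observation sequence obs."""
    scores = {}                              # state -> (log prob, path so far)
    for s in states:
        p = start_p.get(s, 0) * emit_p[s].get(obs[0], 0)
        if p > 0:
            scores[s] = (math.log(p), [s])
    for o in obs[1:]:
        nxt = {}
        for s, (lp, path) in scores.items():
            for t in states:
                if trans_p[s].get(t, 0) == 0 or emit_p[t].get(o, 0) == 0:
                    continue
                cand = lp + math.log(trans_p[s][t]) + math.log(emit_p[t][o])
                if t not in nxt or cand > nxt[t][0]:
                    nxt[t] = (cand, path + [t])
        # beam pruning: keep only the top-scoring states for the next frame
        scores = dict(sorted(nxt.items(), key=lambda kv: -kv[1][0])[:beam_width])
    return max(scores.values())
```

Pruning to a beam is what keeps the search tractable when the state space grows from one word's states to all phoneme models of all candidate words.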
Continuous Speech
Search within a word compared to searching across words
CMU Sphinx
The greatest breakthrough in SR came with CMU's
Sphinx
This was a PhD dissertation
It extended the HMM work cited previously
Improvements:
3 codebooks, 256 entries per codebook
enhanced with predictive codebook creation
HMM models extended to 5 states and then 7 states for greater
variability
Amount of training reduced
Better trigram grammar
models including predictive
capabilities
Later Sphinx models used a
lexical tree structure to
reduce the search time
Multiple Codebooks Extended
Over time, the speech community has identified additional
features that are useful
New speech signal features could be used simply by adding
more codebooks
During the development of Sphinx, they were able to
experiment with new features to see what improved
performance by simply adding codebooks
For example, delta coefficients were introduced
these are like the LPC coefficients except that they track
changes in the coefficients over time
this could, among other things, lessen the impact of
co-articulation
The final version of Sphinx used 51 features in 4 codebooks
of 256 entries each
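The delta coefficients described above can be sketched as a frame-to-frame difference of each coefficient. This simple first difference is an illustration; practical systems often compute deltas by regression over a window of frames instead.

```python
# Sketch: delta coefficients as the per-frame change in each coefficient.
# A simple first difference is used here; real systems often use a
# regression over several neighboring frames.

def delta_coefficients(frames):
    """frames: list of per-frame coefficient vectors; returns the differences."""
    deltas = []
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas
```

The deltas form additional feature vectors, so they slot naturally into the multiple-codebook scheme: quantize them with their own codebook and add it alongside the existing ones.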
Senones
Another Sphinx innovation is the senone
Phonemes are often found to be the wrong level of
representation for speech primitives
The allophone is a combination of the phoneme with its
preceding and succeeding phonemes
this is a triphone which includes coarticulatory data
There are over 100,000 allophones in English, too many
to represent efficiently
The senone was developed as a response to this
It is an HMM that models the triphone by clustering
triphones into groups, reducing the number needed to
around 7000
Since they are HMMs, they are trainable
They also permit pronunciation optimization for
individual speakers through training
Neural Network Approach
We can also solve SR with a neural network
One approach is to create a NN for each word in our
lexicon
For a small-vocabulary, isolated-word system, this may work
No need to worry about finding the separation between
words or the effect that a word ending might have on the
next word beginning
Notice that NNs have fixed-size inputs
An input here will be the processed speech signal in the
form of LPC coefficients
Vowel Recognition
We could build a system that uses signal processing to derive the F1 and F2
formant frequencies for an input and then use the above network to
determine the vowel sound
Multiple networks for the various levels of the SR problem
the segmentation module is responsible for dividing the
continuous signal into segments
the unit level generates phonetic units
the word module combines possible phonetic units into words
Neural Networks Continued
There are a number of difficult challenges in solving
SR with NNs
fixed-size input
the recurrent NN is much like a multistate phonetic model in an
HMM
cannot train like an HMM
the HMM is fine-tuned to the user by having the user speak a
number of training sentences
but the NN, once trained, is forever fixed, so how can we market
a trained NN and adjust it to other speakers?
no means of representing syntax
the NN cannot use higher level knowledge such as an ATN
grammar, rules, or bigrams or trigrams
how do we represent co-articulatory knowledge?
unless our training sentences include all combinations of
phonemes, the NN cannot learn this
HMM/NN Hybrid
The strength of the NN is in its low-level recognition ability
The strength of the HMM is in matching the LPC
values to a codebook and selecting the right phoneme
Why not combine them?
A neural network is trained and used to determine the
classification of the frames rather than matching against a
codebook; thus, the system can learn to better match the
acoustic information to a phonetic classification
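The hybrid idea can be sketched by plugging per-frame network scores into Viterbi in place of codebook lookup: the network's output for each frame serves as the emission probability of each HMM state. The "network" below is a stand-in function; in a real hybrid it would be a trained NN.

```python
# Sketch: hybrid decoding where NN frame scores replace codebook-based
# emission probabilities in Viterbi. `net` is a placeholder for a trained NN
# that maps a frame to {state: probability}.
import math

def hybrid_viterbi(frames, net, states, trans_p, start_p):
    """Return the best state path, scoring emissions with net(frame)[state]."""
    scores = {s: (math.log(start_p[s]) + math.log(net(frames[0])[s]), [s])
              for s in states if start_p.get(s, 0) > 0}
    for frame in frames[1:]:
        posteriors = net(frame)              # NN output for this frame
        nxt = {}
        for s, (lp, path) in scores.items():
            for t, tp in trans_p[s].items():
                cand = lp + math.log(tp) + math.log(posteriors[t])
                if t not in nxt or cand > nxt[t][0]:
                    nxt[t] = (cand, path + [t])
        scores = nxt
    return max(scores.values())[1]
```

The HMM still supplies the temporal structure (transition probabilities and the search over paths); the network only improves the frame-level acoustic-to-phonetic match, which is exactly the division of labor described above.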