Automatic Speech Recognition

Introduction to Automatic Speech Recognition
Outline
Define the problem
What is speech?
Feature Selection
Models
  Early methods
  Modern statistical models
Current State of ASR
Future Work
The ASR Problem

There is no single ASR problem
The problem depends on many factors
Microphone: close-mic, throat-mic, microphone array, audio-visual
Sources: band-limited, background noise, reverberation
Speaker: speaker dependent, speaker independent
Language: open/closed vocabulary, vocabulary size, read/spontaneous speech
Output: transcription, speaker id, keywords
Performance Evaluation
Accuracy: percentage of tokens correctly recognized
Error Rate: inverse (complement) of accuracy
Token Type: Phones, Words*, Sentences, Semantics?
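One way to make the error rate concrete is to count the edits needed to turn the reference token sequence into the recognizer output. The sketch below assumes the common edit-distance convention (substitutions, insertions, and deletions divided by the reference length), which is not spelled out in the slides; the function names are illustrative. Tokens can be phones, words, or whole sentences.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn the token sequence ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def token_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on mat".split()
print(token_error_rate(reference, hypothesis))
```

Because insertions are counted, an error rate defined this way can exceed 100%, so it is only loosely the complement of accuracy.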

What is Speech?
Analog signal produced by humans
The speech signal can be thought of as decomposed into a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Speech Production
Speech Visualization
Feature Selection
As in any data-driven task, the data must be represented in some format
Cepstral features have been found to perform well
They represent the "frequency of the frequencies": the spectrum of the log spectrum of the signal
Mel-frequency cepstral coefficients (MFCC) are the most common variety
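The "mel" in MFCC refers to a perceptual frequency scale on which the filterbank is spaced before the cepstrum is taken. The sketch below shows only that warping step, not a full MFCC pipeline. The constants 2595 and 700 are one widely used convention and are an assumption here, as are the function names and the 8000 Hz upper limit.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (2595/700 convention, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_filters filters spaced evenly in mels."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]


for center in mel_center_frequencies(10):
    print(round(center, 1))
```

Filters spaced evenly in mels are packed closely at low frequencies and spread out at high frequencies, which roughly mirrors the resolution of human hearing.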
Where do we stand?
Defined the multiple problems associated with ASR
Described how speech is produced
Illustrated how speech can be represented in an ASR system
Now that we have the data, how do we recognize the speech?
Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at 500 Hz
Actual speech recognition systems
Originally thought to be a relatively simple task requiring a few years of concerted effort
1969: "Whither Speech Recognition?" by Pierce is published
A DARPA project ran from 1971 to 1976 in response to the statements in the Pierce article
We can examine a few general systems
Template-Based ASR

Originally only worked for isolated words
Performs best when training and testing conditions match
For each word we want to recognize, we store a template or example based on actual data
Each test utterance is checked against the templates to find the best match
Uses the Dynamic Time Warping (DTW) algorithm
Dynamic Time Warping
Create a similarity matrix for the two utterances
Use dynamic programming to find the lowest cost path
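The two steps above can be sketched as follows. This is a minimal illustration, not a production recognizer: each frame is a single number rather than a feature vector, the local measure is a distance (lower is better) rather than a similarity, and the template values and function names are made up for the example.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))


# Hypothetical one-dimensional "features" for two isolated words.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1]}
slow_yes = [1, 1, 3, 4, 4, 3, 1]  # same shape as "yes", spoken more slowly
print(recognize(slow_yes, templates))
```

The warping is what lets the slower utterance match its template exactly even though the two sequences have different lengths.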
Hearsay-II
One of the systems developed during the DARPA program
A blackboard-based system utilizing symbolic problem solvers
Each problem solver was called a knowledge source (KS)
A complex scheduler was used to decide when each KS should be called
DARPA Results
The Hearsay-II system performed much better than the two other similar competing systems
However, only one system met the performance goals of the project

The Harpy system, the one that met the goals, was also built at CMU
In many ways it was a predecessor to the modern statistical systems
Modern Statistical ASR
Acoustic Model
For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes
Two methods are commonly used

Multilayer perceptron (MLP) gives the posterior probability of a class given the data
Gaussian Mixture Model (GMM) gives the likelihood of the data given a class
Gaussian Distribution
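As a rough illustration of how a GMM scores one frame against a class, the sketch below evaluates a one-dimensional mixture in the log domain. Real acoustic models use multi-dimensional feature vectors; the scalar feature, the class names, and all parameter values here are assumptions made for the example.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a mixture of one-dimensional Gaussians."""
    terms = [math.log(w) + gaussian_log_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical models for two phone classes: (weights, means, variances).
models = {
    "aa": ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0]),
    "iy": ([0.3, 0.7], [5.0, 6.0], [1.0, 2.0]),
}
frame = 1.2
scores = {phone: gmm_log_likelihood(frame, *params)
          for phone, params in models.items()}
print(max(scores, key=scores.get))
```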
Pronunciation Model
While the pronunciation model can be very complex, it is typically just a dictionary
The dictionary contains the valid pronunciations for each word
Examples:

Cat: k ae t
Dog: d ao g
Fox: f aa k s
Language Model
Now we need some way of representing the likelihood of any given word sequence
Many methods exist, but n-grams are the most common
N-gram models are trained by simply counting the occurrences of words in a training set
Ngrams
A unigram is the probability of any word in isolation
A bigram is the probability of a word given the previous word
Higher order n-grams continue in a similar fashion
A backoff probability is used for any unseen data
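The counting and backoff ideas can be sketched with a small bigram model. The backoff used here is deliberately crude: an unseen bigram falls back to a scaled unigram estimate with a fixed weight, so the scores are not properly normalised probabilities. The weight 0.4, the sentence markers, and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence start and end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, backoff_weight=0.4):
    """Relative-frequency bigram estimate, backing off to the unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff_weight * unigrams[word] / total  # fixed-weight backoff


corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams = train_bigram(corpus)
print(bigram_score("the", "cat", unigrams, bigrams))  # seen bigram
print(bigram_score("dog", "ran", unigrams, bigrams))  # unseen, backs off
```

A word that never appears in training still scores zero under this sketch, which is why practical systems combine backoff with smoothing.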
How do we put it together?
We now have models to represent the three parts of our equation
We need a framework to join these models together
The standard framework used is the Hidden Markov Model (HMM)
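The three parts can be read off the usual Bayes decomposition of the recognition problem. The formulation below is a sketch in standard notation, assumed here rather than taken from the slides: A is the sequence of acoustic observations, Q a phone sequence, and W a word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} P(A \mid W)\, P(W)
        \approx \arg\max_{W} \sum_{Q}
          \underbrace{P(A \mid Q)}_{\text{acoustic model}}\,
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\,
          \underbrace{P(W)}_{\text{language model}}
```

The second step drops P(A), which does not depend on W. The last step assumes the acoustics depend on the words only through the phone sequence.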
Markov Model
A state model using the Markov property
The Markov property states that the future depends only on the present state
Models the likelihood of transitions between states in a model
Given the model, we can determine the likelihood of any sequence of states
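Because each step depends only on the current state, the likelihood of a state sequence is the initial probability times the product of the transition probabilities along the path. The weather states and the numbers below are a made-up example.

```python
def sequence_probability(states, initial, transition):
    """Probability of a fully observed state sequence under a Markov model."""
    prob = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]  # depends only on the previous state
    return prob


# Hypothetical two-state model.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
# 0.6 * 0.8 * 0.2
print(sequence_probability(["sunny", "sunny", "rainy"], initial, transition))
```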
Hidden Markov Model
Similar to a Markov model, except the states are hidden
We now have observations tied to the individual states
We no longer know the exact state sequence given the data
Allows for the modeling of an underlying unobservable process
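Since the state sequence is no longer known, the likelihood of the observations has to be summed over every possible hidden path. The forward algorithm does this efficiently, one time step at a time. The sketch below uses a small textbook-style model with discrete observations; the states, observations, and probabilities are illustrative assumptions.

```python
def forward_likelihood(observations, states, initial, transition, emission):
    """P(observations | model), summed over all hidden state sequences."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: hidden weather, observed activity.
states = ["rainy", "sunny"]
initial = {"rainy": 0.6, "sunny": 0.4}
transition = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
emission = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
print(forward_likelihood(["walk", "shop", "clean"],
                         states, initial, transition, emission))
```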
HMMs for ASR

First we build an HMM for each phone
Next we combine the phone models based on the pronunciation model to create word level models
Finally, the word level models are combined based on the language model
We now have a giant network with potentially thousands or even millions of states
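The first two steps can be sketched by expanding each word into a chain of phone states. The lexicon, the choice of three states per phone, the self-loop probability, and the state naming scheme below are all assumptions for illustration; the language-model step that links words together is left out.

```python
# Hypothetical lexicon and topology.
LEXICON = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}
STATES_PER_PHONE = 3


def word_states(word):
    """Concatenate the phone-level states of a word, in order."""
    states = []
    for position, phone in enumerate(LEXICON[word]):
        for k in range(STATES_PER_PHONE):
            # The position keeps repeated phones within a word distinct.
            states.append(f"{word}/{position}:{phone}/{k}")
    return states


def left_to_right_transitions(states, self_loop=0.5):
    """Each state either repeats or advances; the last one exits the word."""
    transitions = {}
    for i, state in enumerate(states):
        following = states[i + 1] if i + 1 < len(states) else "<exit>"
        transitions[state] = {state: self_loop, following: 1.0 - self_loop}
    return transitions


cat = word_states("cat")
print(len(cat), cat[0], cat[-1])
print(left_to_right_transitions(cat)[cat[-1]])
```

The "<exit>" target is where a language model would attach the start states of the possible next words, which is how the network grows so large.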
Decoding
Decoding happens in the same way as the previous example
For each time frame we need to maintain two pieces of information

The likelihood of being at any state
The previous state for every state
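These two pieces of information are exactly what the Viterbi algorithm keeps per frame: a score for the best path into each state and a backpointer to the state it came from. The sketch below uses plain probabilities on a small made-up model with discrete observations; a real decoder would typically work with log probabilities and prune unlikely states.

```python
def viterbi(observations, states, initial, transition, emission):
    """Most likely hidden state sequence and its probability."""
    # score[s]: likelihood of the best path that ends in state s
    score = {s: initial[s] * emission[s][observations[0]] for s in states}
    backpointers = []  # one dict per frame: state -> best previous state
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] * transition[r][s])
            new_score[s] = score[best_prev] * transition[best_prev][s] * emission[s][obs]
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical model: hidden weather, observed activity.
states = ["rainy", "sunny"]
initial = {"rainy": 0.6, "sunny": 0.4}
transition = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
emission = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
best_path, best_prob = viterbi(["walk", "shop", "clean"],
                               states, initial, transition, emission)
print(best_path, best_prob)
```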
State of the Art
What works well

Constrained vocabulary systems
Systems adapted to a given speaker
Systems in anechoic environments without background noise
Systems expecting read speech
What doesn't work

Large unconstrained vocabulary
Noisy environments
Conversational speech
Future Work
Better representations of audio based on humans
Better representation of acoustic elements based on articulatory phonology
Segmental models that do not rely on the simple frame-based approach
Resources
Hidden Markov Model Toolkit (HTK): http://htk.eng.cam.ac.uk/
CHIME (a freely available dataset): http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
Machine Learning Lectures: http://www.stanford.edu/class/cs229/ and http://www.youtube.com/watch?v=UzxYlbK2c7E
