Overview
Julia Hirschberg
CS 4706
(special thanks to Roberto Pierraccini)
[Figure: architecture of a spoken dialog system: speech recognition, spoken language understanding, dialog management, and speech synthesis, drawing on semantics, syntax, lexicon, morphology, and phonetics]
[Figure: the speech chain, from vocal-tract articulators to the inner ear and acoustic nerve]
James Flanagan
Bell Laboratories
[Figure: levels of representation for an utterance giving a telephone number: a semantic frame (user:Roberto (attribute:telephone-num value:7360474)), a syntactic parse (NP, VP) of "MY NUMBER IS ..." followed by the spoken digits, the word string, and its phonetic transcription]
Challenges:
Limited vocabulary
Multiple interpretations
Speaker dependency
Word variations
Word confusability
Context-dependency
Coarticulation
Noise/reverberation
Intra-speaker variability
J. R. Pierce
Executive Director,
Bell Laboratories
[Figure: template matching: an unknown word aligned against a stored template (word 7)]
Isolated words vs. connected words
Speaker dependent vs. speaker independent
Sub-word units
Fred Jelinek
Ŵ = argmax_W P(A | W) P(W)

Acoustic HMMs
[Figure: a three-state left-to-right HMM with states S1, S2, S3, self-loop probabilities a11, a22, a33, and forward transitions a12, a23]

Word tri-grams
P(w_t | w_{t-1}, w_{t-2})
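The trigram probability above is typically estimated from counts in a training corpus. A minimal sketch of the maximum-likelihood estimate, assuming a toy corpus (the sentences and padding symbols below are illustrative):

```python
from collections import Counter

def trigram_lm(corpus):
    """Maximum-likelihood trigram model:
    P(w_t | w_{t-2}, w_{t-1}) = C(w_{t-2} w_{t-1} w_t) / C(w_{t-2} w_{t-1})."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    def p(w2, w1, w):
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return p

# Toy corpus (illustrative):
corpus = [["my", "number", "is", "seven"],
          ["my", "number", "is", "three"]]
p = trigram_lm(corpus)
# p("my", "number", "is") is 1.0; p("number", "is", "seven") is 0.5
```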
Jim Baker
[Figure: timeline 1996-2004: the spoken dialog industry grows from research labs (MIT, SRI) and vendors (SpeechWorks, Nuance) into an ecosystem of application developers, tools vendors, platform integrators, hosting providers, and standards]
NUANCE Today
LVCSR Today
Large Vocabulary Continuous Speech Recognition
~20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)
09/23/16
Speech and Language Processing, Jurafsky and Martin
Task                      Vocabulary   Error (%)
Digits                    11           0.5
?                         5K           ?
?                         20K          ?
Broadcast news            64,000+      10
Conversational Telephone  64,000+      20
Task               Vocab   ASR    Human SR
Continuous digits  11      .5     .009
?                  5K      ?      0.9
?                  5K      ?      1.1
SWBD 2004          65K     20     ?

Conclusions:
Machines are about 5 times worse than humans
The gap increases with noisy speech
These numbers are rough
Paradigm:
Supervised Machine Learning + Search
The Noisy Channel Model
Ŵ = argmax_{W ∈ L} P(W | O)
  = argmax_{W ∈ L} P(O | W) P(W) / P(O)
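Since P(O) is the same for every candidate W, the decoder can drop it and compare P(O|W) P(W) directly. A minimal sketch in log probabilities, using made-up candidate hypotheses and scores (the classic "recognize speech" vs. "wreck a nice beach" confusion):

```python
import math

# Hypothetical candidates with made-up log P(O|W) (acoustic) and
# log P(W) (language model) scores; P(O) is omitted because it is
# constant across candidates and cannot change the argmax.
candidates = {
    "recognize speech":   (math.log(0.20), math.log(0.010)),
    "wreck a nice beach": (math.log(0.24), math.log(0.0001)),
}

def best_hypothesis(cands):
    """argmax_W  log P(O|W) + log P(W)."""
    return max(cands, key=lambda w: sum(cands[w]))

# best_hypothesis(candidates) is "recognize speech": its far higher
# language-model prior outweighs the slightly better acoustic score
# of the competitor.
```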
Embedded training:
Use initial phone HMMs, an orthographically transcribed corpus, and a pronunciation lexicon to create a whole-sentence HMM for each sentence in the training corpus
Iteratively retrain the transition and observation probabilities by running the training data through the model until convergence
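A much-simplified sketch of the iterate-until-convergence idea, using Viterbi (hard) re-estimation rather than full forward-backward, on a toy discrete-observation HMM; all states, symbols, and probabilities below are illustrative:

```python
import math
from collections import defaultdict

def viterbi(obs, states, logt, loge, logs):
    """Most likely state path for one observation sequence (log domain)."""
    V = [{s: logs[s] + loge[s].get(obs[0], -1e9) for s in states}]
    back = []
    for o in obs[1:]:
        col, bp = {}, {}
        for s in states:
            best = max(states, key=lambda p: V[-1][p] + logt[p].get(s, -1e9))
            col[s] = V[-1][best] + logt[best].get(s, -1e9) + loge[s].get(o, -1e9)
            bp[s] = best
        V.append(col)
        back.append(bp)
    path = [max(states, key=lambda s: V[-1][s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return path

def viterbi_train(data, states, logt, loge, logs, max_iters=10):
    """Alternate (1) aligning each sequence to states with Viterbi and
    (2) re-estimating transition/emission probabilities from the
    alignment counts, until the alignments stop changing."""
    prev = None
    for _ in range(max_iters):
        paths = [viterbi(obs, states, logt, loge, logs) for obs in data]
        if paths == prev:
            break
        prev = paths
        tcount = defaultdict(lambda: defaultdict(int))
        ecount = defaultdict(lambda: defaultdict(int))
        for obs, path in zip(data, paths):
            for a, b in zip(path, path[1:]):
                tcount[a][b] += 1
            for s, o in zip(path, obs):
                ecount[s][o] += 1
        for s in states:
            if tcount[s]:
                tot = sum(tcount[s].values())
                logt[s] = {b: math.log(n / tot) for b, n in tcount[s].items()}
            if ecount[s]:
                tot = sum(ecount[s].values())
                loge[s] = {o: math.log(n / tot) for o, n in ecount[s].items()}
    return logt, loge

# Toy two-state left-to-right HMM over symbols "a"/"b" (numbers illustrative)
states = ["S1", "S2"]
logs = {"S1": 0.0, "S2": -1e9}                      # always start in S1
logt = {"S1": {"S1": math.log(0.5), "S2": math.log(0.5)},
        "S2": {"S2": 0.0}}
loge = {"S1": {"a": math.log(0.6), "b": math.log(0.4)},
        "S2": {"a": math.log(0.4), "b": math.log(0.6)}}
data = [["a", "a", "b", "b"], ["a", "b", "b", "b"]]
logt, loge = viterbi_train(data, states, logt, loge, logs)
```

Real embedded training (e.g. Baum-Welch) uses soft state occupancies from the forward-backward algorithm instead of a single best path, but the outer retrain-until-convergence loop has the same shape.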
Grammars
Finite-state grammar, Context-Free Grammar (CFG), or semantic grammar
Search/Decoding
Find the best hypothesis W maximizing P(O|W) P(W) for the observations O
Calculate the most likely state sequence in the HMM given the transition and observation probabilities
Trace back through the state sequence to assign words to states
N-best vs. 1-best vs. lattice output
Limiting search:
Lattice minimization and determinization
Pruning: beam search
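The beam-pruning idea can be sketched as a time-synchronous search that, at each frame, drops every hypothesis scoring more than a fixed log-width below the current best; the toy HMM and observations here are illustrative, not from the lecture:

```python
import math

def beam_decode(obs, states, logt, loge, logs, beam=5.0):
    """Time-synchronous Viterbi-style search: keep one best hypothesis
    per state, and prune any hypothesis scoring more than `beam`
    (in log probability) below the frame's best."""
    hyps = {s: (logs[s] + loge[s].get(obs[0], -1e9), [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s, (score, path) in hyps.items():
            for nxt, lt in logt.get(s, {}).items():
                cand = score + lt + loge[nxt].get(o, -1e9)
                if nxt not in new or cand > new[nxt][0]:
                    new[nxt] = (cand, path + [nxt])
        best = max(sc for sc, _ in new.values())
        hyps = {s: h for s, h in new.items() if h[0] >= best - beam}
    score, path = max(hyps.values())
    return path

# Toy two-state HMM (illustrative numbers)
states = ["S1", "S2"]
logs = {"S1": 0.0, "S2": float("-inf")}
logt = {"S1": {"S1": math.log(0.5), "S2": math.log(0.5)},
        "S2": {"S2": 0.0}}
loge = {"S1": {"a": math.log(0.6), "b": math.log(0.4)},
        "S2": {"a": math.log(0.4), "b": math.log(0.6)}}
# beam_decode(["a", "a", "b"], states, logt, loge, logs) is ["S1", "S2", "S2"]
```

A narrow beam makes decoding faster but risks pruning the hypothesis that would have won later; a wide beam approaches exact Viterbi search.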
Evaluating Success
Transcription
Goal: low WER = (Subst + Ins + Del) / N * 100
Reference: "This is a test"
Hypothesis "Thesis test": (1 subst + 2 del) / 4 * 100 = 75% WER
Hypothesis "That was the dentist calling": (4 subst + 1 ins) / 4 words * 100 = 125% WER
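WER is computed from the minimum edit distance between the reference and hypothesis word strings; a minimal sketch reproducing the two examples above:

```python
def wer(ref, hyp):
    """Word error rate: (subst + ins + del) / len(ref) * 100,
    where the error counts come from minimum edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r) * 100

# wer("this is a test", "thesis test") is 75.0
# wer("this is a test", "that was the dentist calling") is 125.0
```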
Understanding
Goal: high concept accuracy: how many domain concepts were correctly recognized?
Example: "I want to go from Boston to Baltimore on September 29"

Domain concept   Value
source city      Boston
target city      Baltimore
travel date      September 29
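Concept accuracy can be sketched as a slot-by-slot comparison between the recognized frame and a reference frame; the slot names below follow the example, while the erroneous hypothesis value is made up for illustration:

```python
def concept_accuracy(reference, hypothesis):
    """Percentage of reference slots whose value was correctly recognized."""
    correct = sum(1 for slot, val in reference.items()
                  if hypothesis.get(slot) == val)
    return correct / len(reference) * 100

ref = {"source city": "Boston", "target city": "Baltimore",
       "travel date": "September 29"}
hyp = {"source city": "Boston", "target city": "Baltimore",
       "travel date": "September 9"}          # hypothetical recognition error
# concept_accuracy(ref, hyp) is 2 of 3 slots correct, about 66.7
```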
Summary
ASR today:
Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words
Relies upon many approximate techniques to translate a signal, including Finite State Transducers
ASR future:
Can we include more language phenomena in the model?
Next Class
Building an ASR system: the HTK Toolkit