Fundamentals of Speech Recognition
Lecture 1: Introduction/Overview of Automatic Speech Recognition
12/28/2009
Fundamentals of Speech
Recognition-Overview of ASR
Speech Recognition-2001
(Stanley Kubrick View in 1968)
[Figure: the speech dialog circle. ASR (Automatic Speech Recognition) turns speech into the words spoken; SLU (Spoken Language Understanding) turns the words into a meaning (example: "Billing credit"); DM & SLG (Dialog Management and Spoken Language Generation) decide what's next (actions) and generate words; TTS (Text-to-Speech Synthesis) turns those words back into speech. The diagram also contains a Data block.]
[Figure: speech -> A-to-D Converter -> Feature Analysis -> Pattern Matching -> symbols]
recognition
speaker recognition
speaker verification
word spotting
automatic indexing of speech recordings
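[Figure: Speech Production Mechanisms (the Speaker Model) produce the speech signal s(n); the Speech Recognizer consists of an Acoustic Processor followed by a Linguistic Decoder, which outputs the recognized word string \hat{W}.]

\[
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} \\
        &= \arg\max_{W} P_A(X \mid W)\, P_L(W)
\end{aligned}
\]

The slide labels the parts of the last line as Step 1 (the acoustic term P_A(X | W)), Step 2 (the language term P_L(W)) and Step 3 (the arg max over W).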
\[
P_L(W) = P_L(w_1, w_2, \ldots, w_k) = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})
\]
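The product above can be sketched for the simplest case, a history of one word (a bigram model). This is only an illustration: the tiny corpus, the `<s>` and `</s>` boundary markers, and the add-one smoothing are assumptions that do not come from the slides.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, padding each sentence with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        histories.update(padded[:-1])                 # every word that acts as a history
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, current word)
    return histories, bigrams

def sentence_log_prob(words, histories, bigrams, vocab_size):
    """log P_L(W) as a sum of log P_L(w_n | w_{n-1}), with add-one smoothing."""
    padded = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size)
        total += math.log(p)
    return total

# Made-up training data for illustration.
corpus = [["call", "home"], ["call", "the", "office"], ["call", "home", "now"]]
histories, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}

seen = sentence_log_prob(["call", "home"], histories, bigrams, len(vocab))
unseen = sentence_log_prob(["home", "call"], histories, bigrams, len(vocab))
```

A word order that occurs in the training data (`seen`) scores higher than the reversed order (`unseen`), which is the kind of preference the language model contributes to the recognizer.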
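[Figure: the speech signal s(n), carrying the word string W, passes through Speech Analysis to give the feature sequence X_n, which the Decoder maps to the recognized string.]

\[
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{P(W)\, P(X \mid W)}{P(X)} \\
        &= \arg\max_{W} P(W)\, P(X \mid W)
\end{aligned}
\]

W = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model and P(W) is the language model.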
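As a rough sketch of this decision rule, the snippet below picks the best of three candidate word strings by adding log P(X|W) and log P(W). The candidates and all scores are hypothetical numbers chosen for illustration; a real recognizer searches a far larger space.

```python
import math

# Hypothetical acoustic scores log P(X | W) for one input X.
acoustic_log_prob = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
    "recognized peach": -13.0,
}
# Hypothetical language model probabilities P(W).
language_prob = {
    "recognize speech": 0.02,
    "wreck a nice beach": 0.0001,
    "recognized peach": 0.001,
}

def decode(acoustic_log_prob, language_prob):
    """Return arg max over W of log P(X|W) + log P(W).

    P(X) is left out because it is the same for every candidate W.
    """
    return max(
        acoustic_log_prob,
        key=lambda w: acoustic_log_prob[w] + math.log(language_prob[w]),
    )

best = decode(acoustic_log_prob, language_prob)
```

Here the acoustically best candidate ("wreck a nice beach") loses to "recognize speech" once the language model term is included.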
Speech Recognition Process
[Figure: block diagram of the ASR component of the TTS / DM / ASR / SLU loop. Input Speech -> Feature Analysis (Spectral Analysis) -> Pattern Classification (Decoding, Search) -> Utterance Verification (Confidence Scores) -> recognized string \hat{W}, e.g., "Hello World" with confidence scores (0.9) (0.8). The knowledge sources shown are the Acoustic Model (HMM), the Word Lexicon and the Language Model (N-gram).]
Evaluate performance
speech testing data set
Feature Extraction
Goal: Extract robust features (information) from
the speech that are relevant for ASR.
Robustness
Problem:
Methods:
Perception Approach:
[Figure: training and testing each pass through Signal -> Features -> Model. The mismatch between the two is addressed at three levels: Enhancement (signal), Normalization (features) and Adaptation (model).]
Acoustic Model
Goal: Map acoustic features into distinct
[Figure: states of phone HMMs, labeled ih1, ih2, ih3 and z1, z2, z3.]
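[Figure: five-state left-to-right HMM. Self-transitions a22, a33, a44, a55 = 1; next-state transitions a12, a23, a34, a45; skip transitions a13, a24, a35; observation densities b1(Ot), b2(Ot), b3(Ot), b4(Ot), b5(Ot).]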
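To make the roles of a_ij and b_j(O_t) concrete, here is a minimal sketch of the forward algorithm on a five-state left-to-right model with the same transition structure as the figure. Everything numeric is an assumption for illustration: the transition values, the self-loop on state 1, the two-symbol discrete alphabet, and the convention that the model starts in state 1.

```python
import itertools

# a_ij: self-loop, next-state and skip transitions only; values are made up.
A = [
    [0.5, 0.3, 0.2, 0.0, 0.0],
    [0.0, 0.5, 0.3, 0.2, 0.0],
    [0.0, 0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0, 1.0],   # a55 = 1
]
# b_j(o) for observation symbols 0 and 1; values are made up.
B = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]

def forward(A, B, obs, start_state=0):
    """Forward algorithm: probability of the observation sequence given the model."""
    n = len(A)
    alpha = [0.0] * n
    alpha[start_state] = B[start_state][obs[0]]
    for o in obs[1:]:
        # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

likelihood = forward(A, B, [0, 0, 1, 1])
# Sanity check: the likelihoods of all length-4 sequences add up to 1.
total = sum(forward(A, B, list(o)) for o in itertools.product([0, 1], repeat=4))
```

The sum over every possible observation sequence of a fixed length is 1, which is a quick way to check that the transition and observation tables are proper probability distributions.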
Word Lexicon
Goal:
Multiple Pronunciation:
Challenges:
Language Model
Goal:
Handcrafted:
Statistical:
Challenges:
Pattern Classification
Goal:
Method:
Challenges:
[Figure: levels of the recognition hierarchy: HMM states, HMM units, Phones, Words, Sentences. The slide also shows the challenge labels Robustness, Rejection and Unlimited Vocabulary.]
[Figure: four weighted finite-state transducers (State:HMM WFST, HMM:Phone WFST, Phone:Word WFST, Word:Phrase WFST) go through Combination and Optimization to produce the Search Network.]
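[Figure: weighted finite-state transducer for pronunciations of the word "Data". Arcs are labeled input:output/weight; the arcs recoverable here are d:/1, ae:/.6, dx:/.8, t:/.2 (all with an empty output label) and ax:data/1.]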
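The arc labels can be read as input phone, output word and weight. The sketch below walks such a transducer for a given phone string and multiplies the weights along the path. The state numbering and the topology (d, then the vowel, then dx or t, then ax) are assumptions; only the arc labels and weights listed above are used, so the weights leaving the vowel state do not sum to 1.

```python
# Arcs: (source state, destination state, input phone, output word, weight).
# An empty output string stands for the empty output label.
ARCS = [
    (0, 1, "d", "", 1.0),
    (1, 2, "ae", "", 0.6),
    (2, 3, "dx", "", 0.8),
    (2, 3, "t", "", 0.2),
    (3, 4, "ax", "data", 1.0),
]
FINAL_STATE = 4

def transduce(phones, arcs=ARCS, start=0, final=FINAL_STATE):
    """Follow the arcs that match a phone sequence.

    Returns (output words, path weight), or None if the sequence is not accepted.
    Assumes at most one arc per (state, input phone) pair.
    """
    state, weight, words = start, 1.0, []
    for phone in phones:
        for src, dst, inp, out, w in arcs:
            if src == state and inp == phone:
                state, weight = dst, weight * w
                if out:
                    words.append(out)
                break
        else:
            return None   # no arc accepts this phone from the current state
    return (words, weight) if state == final else None

flap = transduce(["d", "ae", "dx", "ax"])
stop = transduce(["d", "ae", "t", "ax"])
bad = transduce(["d", "ax"])
```

Both accepted pronunciations output the word "data"; the one through the dx arc carries the larger path weight.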
[Figure: "Algorithmic Speed-up for Speech Recognition", AT&T (Algorithmic). Relative Speed (axis 0 to 30) versus Year (1994 to 2002), with curves labeled AT&T and Community. Task: North American Business; vocabulary: 40,000 words; branching factor: 85.]
Utterance Verification
Goal:
Method:
Example: "credit please"; "credit fees" with confidence scores (0.9) (0.3)
Challenges:
Rejection
Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.
Measure of Confidence: Associating word strings with a verification cost that provides an effective measure of confidence (Utterance Verification).
Effect: Improvement in the performance of the recognizer, understanding system and dialogue manager.
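One simple way to act on such a confidence measure is to compare each word's score with a threshold and reject the words that fall below it. The sketch below does this for a recognized string with per-word scores; the threshold of 0.5 and the function name are assumptions, not values from the lecture.

```python
def verify(words, scores, threshold=0.5):
    """Split a recognized word string into accepted and rejected words."""
    accepted = [w for w, s in zip(words, scores) if s >= threshold]
    rejected = [w for w, s in zip(words, scores) if s < threshold]
    return accepted, rejected

# Recognized string "credit fees" with confidence scores 0.9 and 0.3.
accepted, rejected = verify(["credit", "fees"], [0.9, 0.3])
```

With these scores "credit" is kept and "fees" is rejected, so a later stage can re-prompt or fall back instead of acting on a likely misrecognition.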
State-of-the-Art Performance?
[Figure: the recognition block diagram again, from Input Speech through Feature Extraction, Pattern Classification (Decoding, Search) and Utterance Verification to the Recognized Sentence, with the Acoustic Model, Word Lexicon and Language Model as knowledge sources.]
CORPUS                                     TYPE                       VOCABULARY SIZE       WORD ERROR RATE
Connected Digit Strings--TI Database       Spontaneous                11 (zero-nine, oh)    0.3%
Connected Digit Strings--Mall Recordings   Spontaneous                11 (zero-nine, oh)    2.0%
Connected Digit Strings--HMIHY             Conversational             11 (zero-nine, oh)    5.0%
RM (Resource Management)                   Read Speech                1000                  2.0%
ATIS (Airline Travel Information System)   Spontaneous                2500                  2.5%
NAB (North American Business)              Read Text                  64,000                6.6%
Broadcast News                             News Show                  210,000               13-17%
Switchboard                                Conversational Telephone   45,000                25-29%
Call Home                                  Conversational Telephone   28,000                40%

Note on the digit tasks: factor of 17 increase in digit error rate (0.3% to 5.0%).
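Word error rate is commonly computed by aligning the recognized string with a reference transcription and counting substitutions, deletions and insertions. The following is a minimal sketch of that computation using edit distance over words; the example strings are made up, and scoring tools used in formal evaluations apply additional text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

wer_sub = word_error_rate("one two oh nine", "one two nine nine")   # one substitution
wer_del = word_error_rate("one two oh nine", "one two nine")        # one deletion
```

Both examples contain a single error in four reference words, so each gives a word error rate of 25%.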
[Figure: vocabulary: 40,000 words; branching factor: 85]
Broadcast News
Dictation Machine
[Figure: Word Accuracy (axis 30 to 80) versus Year (1996 to 2002) for Switchboard/Call Home. Vocabulary: 40,000 words; perplexity: 85.]
[Figure: "Growth in Effective Recognition Vocabulary Size". Vocabulary Size on a logarithmic axis (1 to 10,000,000) versus Year (1960 to 2010).]
[Figure: log-log scatter plot comparing machine and human performance on the tasks Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB and SWBD, with reference lines x1, x10 and x100 and a region labeled "Machines Outperform Humans".]
Voice-Enabling Services: Technology Components
[Figure: the dialog loop. A customer voice request enters ASR (Automatic Speech Recognition), which produces the words spoken; SLU (Spoken Language Understanding) produces the meaning (example: "Billing credit"); DM & SLG (Dialog Management and Spoken Language Generation) decide what's next (actions) and generate words; TTS (Text-to-Speech Synthesis) produces the voice reply to the customer (example: "What number did you want to call?").]
Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take.
- accurate understanding can often be achieved without correctly recognizing every word
- SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
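A toy sketch of this idea is shown below: key phrases found in the recognized string are mapped to actions, and the rest of the words are ignored. The phrase table, the action names and the simple substring matching are all hypothetical simplifications; deployed systems use trained classifiers rather than a lookup table.

```python
# Hypothetical key phrases and the actions they map to (illustration only).
KEY_PHRASES = {
    "billing credit": "ROUTE_BILLING_CREDIT",
    "credit": "ROUTE_BILLING_CREDIT",
    "account balance": "ROUTE_ACCOUNT_BALANCE",
    "calling plan": "ROUTE_CALLING_PLANS",
}

def understand(recognized_text, key_phrases=KEY_PHRASES, default="ROUTE_TO_AGENT"):
    """Return the action for the longest key phrase found in the text."""
    text = recognized_text.lower()
    matches = [phrase for phrase in key_phrases if phrase in text]
    if not matches:
        return default                       # nothing understood: fall back
    return key_phrases[max(matches, key=len)]

# The last word is misrecognized, yet the key word "credit" is enough.
action = understand("yes I uh want a credit on my bull")
fallback = understand("hello there")
```

The first input is routed correctly even though not every word was recognized, which is the point made in the first bullet above.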
Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be.
- DM makes viable complex services that require multiple exchanges between the system and the customer
- dialog systems can handle user-initiated topic switching (within the domain of the application)
- spoken text string to guide the dialog forward towards a clear and well understood goal or system interaction
- goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
- time, progress towards a goal); how do you attain goals (get answers). Is there an art of dialogues? How does the User Interface play into the art/science of dialogues? Sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak: multimodal interactions with machines
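The goal stated above, combining the meaning of the current input with the interaction history, can be sketched as a small slot-filling loop. The slot names, action strings and airline-booking scenario are hypothetical; this is one simple design, not the dialog manager described in the lecture.

```python
# Hypothetical pieces of information needed before a booking can be confirmed.
REQUIRED_SLOTS = ["origin", "destination", "date"]

def next_step(history, new_meaning):
    """Merge the meaning of the current input into the history, then choose the next action."""
    history.update(new_meaning)
    for slot in REQUIRED_SLOTS:
        if slot not in history:
            return "ASK_" + slot.upper()     # prompt for the first missing item
    return "CONFIRM_BOOKING"

history = {}
first = next_step(history, {"destination": "Boston"})
second = next_step(history, {"origin": "Newark", "date": "Friday"})
```

Because the history persists across turns, the second turn completes the request and the customer is never asked again for the destination given in the first turn.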
[Figure: HMIHY(sm) call-flow timing. Labels: Sparkle Tone "Thank you for calling AT&T"; Sparkle Tone "AT&T, How may I help you?"; Network Menu; Main Menu; LD Sub-Menu. Durations shown: 10, 30, 13, 26, 58, 38, 20 and 8 seconds.]
[Figure: HMIHY call routing to destinations: Account Balance, Calling Plans, Local, Unrecognized Number, ...]
[Figure: call types Irate Customer, Rate Plan, Account Balance, Local Service, Unrecognized Number, Threshold Billing, Billing Credit]
Customer Satisfaction
- decreased repeat calls (37%)
- decreased OUTPIC rate (18%)
- decreased CCA (Call Control Agent) time per call (10%)
- decreased customer complaints (78%)
Speech Synthesis
[Figure: text -> Linguistic Rules -> DSP Computer -> D-to-A Converter -> speech]
Speech Synthesis
Synthesis of speech for effective human-machine communications:
- reading email messages over a telephone
- telematics feedback in automobiles
- talking agents for completion of transactions
- call center help desks and customer care
- handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
- announcement machines that provide things like stock quotes, airline schedules, updates of arrivals and departures of flights
Natural
Speech
Gettysburg Address:
Au Clair de la Lune
Information Kiosk
MATCH: Multimodal Access To City Help
"Are there any cheap Italian places in this neighborhood?"
MIPad Demo--Microsoft
Voice-Enabled Services
- Desktop applications -- dictation, command and control of desktop, control of document properties (fonts, styles, bullets, ...)
- Agent technology -- simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
- Voice Portals -- convert any web page to a voice-enabled site where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
- E-Contact services -- Call Centers, Customer Care (HMIHY) and Help Desks where calls are triaged and answered appropriately using natural language voice dialogues
- Telematics -- command and control of automotive features (comfort systems, radio, windows, sunroof)
- Small devices -- control of cellphones, PDAs from voice commands
[Figure: milestones in speech research along a Year axis with ticks at 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997 and 2002. Stages, in order:]
- Isolated Words: filter-bank analysis; time normalization; dynamic programming
- Medium Vocabulary, Template-based (Isolated Words; Connected Digits; Continuous Speech): pattern recognition; LPC analysis; clustering algorithms; level building
- Large Vocabulary, Statistical-based (Connected Words; Continuous Speech): hidden Markov models; stochastic language modeling
- Large Vocabulary; Syntax, Semantics (Continuous Speech; Speech Understanding): stochastic language understanding; finite-state machines; statistical learning
- Very Large Vocabulary; Semantics, Multimodal Dialog, TTS (Spoken dialog; Multiple modalities): concatenative synthesis; machine learning; mixed-initiative dialog
[Figure: continuation of the milestones along a Year axis with ticks at 2002, 2005, 2008 and 2011:]
- Dialog Systems
- Very Large Vocabulary, Limited Tasks, Arbitrary Environment: Robust Systems
- Unlimited Vocabulary, Unlimited Tasks, Many Languages: Multilingual Systems; Multimodal Speech Enabled Devices
Recognition task
Speaker characteristics
- Speaker trained
- Speaker independent
- Amount of training material
Speaking environment
- Quiet office
- Home
- Noisy surroundings (factory floor, cellular environments, speakerphones)
Human factors
- Feedback to users
- Instructions
- Requests for repeats
- Rejections
Recognition mode
- Isolated words/phrases
- Connected word sequences
- Continuous speech (essentially unconstrained)
System complexity
- Computation/hardware
- Real-time response capability
Overview of Speech
Recognition Processes
[Figure: block diagram of the speech recognition processes. Recoverable labels: Speech; Transcriptions; features (Temporal, Spectral, Cepstral, LPC; log E_n, Z_n); distance d(X,Y) and DTW; Word/Sound Models and Templates (Template, VQ, HMM, FS); Dictionary; Syntax; Recognized Input.]
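Two of the temporal features named in the diagram, short-time log energy (log E_n) and zero-crossing count (Z_n), are simple enough to sketch directly. The frame length, hop size, energy floor and the synthetic two-tone test signal below are assumptions chosen for illustration.

```python
import math

def short_time_features(samples, frame_len, hop):
    """Return a (log energy, zero-crossing count) pair for each frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)    # small floor avoids log(0) on silence
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        features.append((log_energy, crossings))
    return features

# Synthetic signal: a 100 Hz sine followed by a 1000 Hz sine, sampled at 8 kHz.
rate = 8000
low = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]
high = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(800)]
feats = short_time_features(low + high, frame_len=400, hop=400)
```

The frames taken from the higher-frequency half of the signal show many more zero crossings than the frames from the first half, while their energies are similar.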
- Training phase is required; the more training data, the better the patterns (templates, models)
- Patterns are sensitive to the speaking environment, transmission environment, transducer (microphone), etc. (This problem is known as the speech robustness problem.)
- No speech-specific knowledge is required or exploited, except in the feature extraction stage
- Computational load is (more or less) linearly proportional to the number of patterns being recognized (at least for simple recognition problems, e.g., isolated word tasks)
- Pattern recognition techniques are applicable to a range of speech units, including phrases, words, and sub-word units (phonemes, syllables, dyads, etc.)
- Extensions possible to large vocabulary, fluent speech recognition using word lexicons (dictionaries) and language models (grammars or syntax)
- Extensions possible to natural language understanding systems
template methods
clustering methods
HMM methods -- Viterbi, Forward-Backward
vector quantization (VQ) methods
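Of the HMM methods listed, Viterbi decoding finds the single most likely state sequence for an observation sequence. Below is a minimal log-domain sketch for a discrete-observation HMM; the two-state model and its probabilities are made up for illustration.

```python
import math

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence and its log probability."""
    n = len(A)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    delta = [log(pi[j]) + log(B[j][obs[0]]) for j in range(n)]
    backptrs = []
    for o in obs[1:]:
        prev = delta
        ptr, delta = [], []
        for j in range(n):
            # Best predecessor state for arriving in state j at this time step.
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        backptrs.append(ptr)
    state = max(range(n), key=lambda j: delta[j])
    best_score = delta[state]
    path = [state]
    for ptr in reversed(backptrs):               # trace the back-pointers
        state = ptr[state]
        path.append(state)
    return path[::-1], best_score

# Made-up two-state left-to-right model with two observation symbols.
A = [[0.7, 0.3], [0.0, 1.0]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [1.0, 0.0]
path, score = viterbi(A, B, pi, [0, 0, 1, 1])
```

For this model the best path stays in the first state while symbol 0 is observed and moves to the second state when symbol 1 appears.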
dynamic programming
level building
one pass method
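Dynamic programming in its simplest form, dynamic time warping (DTW) between an input and a template, can be sketched as follows. The local distance (absolute difference of scalar features), the allowed steps and the toy sequences are assumptions; level building and the one pass method extend the same recursion to connected words and are not shown.

```python
def dtw_distance(x, y, dist=lambda a, b: abs(a - b)):
    """Accumulated distance d(X, Y) along the best time alignment of x and y."""
    inf = float("inf")
    # D[i][j]: best accumulated distance aligning x[:i] with y[:j].
    D = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = dist(x[i - 1], y[j - 1]) + min(
                D[i - 1][j],      # x advances, y repeats
                D[i][j - 1],      # y advances, x repeats
                D[i - 1][j - 1],  # both advance
            )
    return D[len(x)][len(y)]

template = [1, 2, 3, 2, 1]
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same shape at half the speaking rate
other = [3, 3, 1, 1, 3]

same = dtw_distance(template, slow)
different = dtw_distance(template, other)
```

The slowed-down version of the template aligns with zero accumulated distance, while a differently shaped sequence does not, which is why time normalization matters for template matching.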
phoneme models
context dependent models
discrete, mixed, continuous density models
N-gram language models
Natural language understanding
insertions, deletions, substitutions
other factors