
ECE 259B

Fundamentals of Speech Recognition
Lecture 1
Introduction/Overview of Automatic Speech Recognition
12/28/2009

Fundamentals of Speech
Recognition-Overview of ASR

Why Digital Processing of Speech?

- digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years
- much research has been done since 1965 on the use of digital signal processing in speech communication problems
- highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS
- there are abundant applications that are in widespread use commercially

The Speech Stack


- Speech Applications: coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
- Speech Algorithms: speech-silence, voiced-unvoiced, pitch, formants
- Speech Representations: temporal, spectral, homomorphic, LPC
- Fundamentals: acoustics, linguistics, pragmatics, speech perception

Speech Recognition-2001
(Stanley Kubrick View in 1968)


Apple Navigator -- 1988


The Speech Advantage


- Reduce costs
  - reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
- New revenue opportunities
  - 24x7 high-quality customer care automation
  - access to information without a keyboard or touch-tones
- Customer retention
  - provide personal services for customer preferences
  - improve customer experience

The Speech Circle


[Figure: the speech circle. Customer voice request ("I dialed a wrong number") → ASR, Automatic Speech Recognition (words spoken) → SLU, Spoken Language Understanding (meaning: "Billing credit") → DM & SLG, Dialog Management (actions) and Spoken Language Generation (words), drawing on data to decide what's next ("Determine correct number") → TTS, Text-to-Speech Synthesis → voice reply to customer ("What number did you want to call?").]

Automatic Speech Recognition


Goal: Accurately and efficiently convert a
speech signal into a text message
independent of the device, speaker or the
environment.
Applications: Automation of complex
operator-based tasks, e.g., customer care,
dictation, form filling applications,
provisioning of new services, customer
help lines, e-commerce, etc.

Pattern Matching Problems


speech → A-to-D Converter → Feature Analysis → Pattern Matching → symbols

- speech recognition
- speaker recognition
- speaker verification
- word spotting
- automatic indexing of speech recordings

Basic ASR Formulation (Bayes Method)


[Figure: Speaker Model (Speaker's Intention → Speech Production Mechanisms → s(n)) followed by the Speech Recognizer (Acoustic Processor → Linguistic Decoder → \hat{W}).]

$$
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P_A(X \mid W)\, P_L(W)
$$

Step 1: P_A(X|W). Step 2: P_L(W). Step 3: the arg max over W.

Steps in Speech Recognition


Step 1 - Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.
Step 2 - Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 - Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine arg max over W.
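
The three steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the candidate word sequences and their probabilities are made-up toy values, and the function name `decode` is hypothetical. A real recognizer computes P_A(X|W) from HMMs and searches a far larger hypothesis space.

```python
import math

def decode(hypotheses, acoustic_logprob, language_logprob):
    """Step 3: return the W that maximizes log P_A(X|W) + log P_L(W)."""
    return max(hypotheses,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Step 1 (toy values): acoustic scores log P_A(X|W) for one utterance X.
acoustic = {
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
# Step 2 (toy values): language model scores log P_L(W).
language = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic), acoustic, language)
print(best)
```

In this toy case the acoustically preferred hypothesis loses because the language model makes it ten times less likely. Working in the log domain turns the product P_A(X|W) P_L(W) into a sum and avoids numerical underflow.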

Step 1-The Acoustic Model


- we build acoustic models by learning statistics of the acoustic features, X, from a training set where we compute the variability of the acoustic features during the production of the sounds represented by the models
- it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language; it requires too much training data for words in every possible context
- instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
- similarly we build sentences (sequences of words) by concatenating word models
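
A minimal sketch of the concatenation idea, assuming each phone model is reduced to a list of named states. The three states per phone, the function names and the one-word lexicon are illustrative assumptions, not part of the course material.

```python
def phone_states(phone, n_states=3):
    """Name the states of one phone model, e.g. 'ih' -> ih1, ih2, ih3."""
    return [f"{phone}{i}" for i in range(1, n_states + 1)]

def word_states(pronunciation):
    """Concatenate the phone models of a word's pronunciation."""
    states = []
    for phone in pronunciation:
        states.extend(phone_states(phone))
    return states

def sentence_states(words, lexicon):
    """Concatenate word models to form a sentence model."""
    states = []
    for word in words:
        states.extend(word_states(lexicon[word]))
    return states

# Hypothetical one-entry lexicon: the word "is" as the phones /IH/ /Z/.
lexicon = {"is": ["ih", "z"]}
print(sentence_states(["is"], lexicon))
```

Only the phone inventory needs training data; any word in the lexicon gets a model for free.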

Step 2-The Language Model


- the language model describes the probability of a sequence of words that form a valid sentence in the language
- a simple statistical method works well based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words, namely an N-gram language model

$$
P_L(W) = P_L(w_1, w_2, \ldots, w_k)
       = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})
$$

- where P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
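
The relative-frequency estimate can be sketched for the simplest case, a history of one word (N = 1 in the notation above). The two-sentence corpus, the `<s>` start symbol and the function names are assumptions made for illustration. A practical model would also smooth the counts so that unseen word pairs do not get zero probability.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(w_n | w_{n-1}) by relative frequency of word pairs."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            history_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: count / history_counts[pair[0]]
            for pair, count in pair_counts.items()}

def sentence_probability(model, sentence):
    """Multiply the conditional probabilities of each word given its predecessor."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= model.get((prev, cur), 0.0)  # unseen pair: probability zero
    return prob

corpus = ["yes on my credit card please", "yes on my credit please"]
model = train_bigram(corpus)
print(sentence_probability(model, "yes on my credit card please"))
```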

Step 3-The Search Problem


- the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood
- the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
- the use of methods from the field of Finite State Automata Theory provides Finite State Networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems

Basic ASR Formulation


The basic equation of Bayes rule-based speech recognition is

$$
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(W)\, P(X \mid W)}{P(X)}
        = \arg\max_{W} P(W)\, P(X \mid W)
$$

[Figure: s(n), W → Speech Analysis → X_n → Decoder → \hat{W}]

where X = X_1, X_2, ..., X_N is the acoustic observation (feature vector) sequence, \hat{W} = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model and P(W) is the language model

Speech Recognition Process

[Figure: ASR block diagram, shown inside the TTS / DM / SLU / ASR circle. Input Speech → Feature Analysis (Spectral Analysis) → Pattern Classification (Decoding, Search) → Utterance Verification (Confidence Scores) → "Hello World" (0.9) (0.8). The Acoustic Model (HMM), Word Lexicon and Language Model (N-gram) feed Pattern Classification.]

Speech Recognition Processes


- Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
- text training data set => word lexicon, word grammar (language model), task grammar
- speech training data set => acoustic models
- Evaluate performance: speech testing data set
- Training algorithm => build models from training set of text and speech
- Testing algorithm => evaluate performance from testing set of speech

Feature Extraction
Goal: Extract robust features (information) from the speech that are relevant for ASR.

Method: Spectral analysis through either a bank-of-filters or through LPC, followed by non-linearity and normalization (cepstrum).

Result: Signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).

Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Feature set for recognition: cepstral features or those from a high dimensionality space.
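
As a rough illustration of "spectral analysis followed by a non-linearity", the sketch below computes a real cepstrum for one frame. It is a swapped-in simplification: it uses a plain DFT rather than the bank-of-filters or LPC analysis named above, and the frame length, the 13 coefficients and the test signal are arbitrary assumptions.

```python
import cmath
import math

def real_cepstrum(frame, n_coeffs=13):
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive O(n^2) DFT; a real system would use an FFT.
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    # The logarithm is the non-linearity; the small floor avoids log(0).
    log_mag = [math.log(abs(s) + 1e-10) for s in spectrum]
    # Inverse DFT of the log magnitude spectrum gives the cepstrum.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n_coeffs)]

# Toy 64-sample frame: a single sinusoid (an assumption, not real speech).
frame = [math.sin(2 * math.pi * 5.3 * t / 64) for t in range(64)]
cepstrum = real_cepstrum(frame)
louder = real_cepstrum([2.0 * x for x in frame])
```

Because the logarithm turns gain into an additive constant, doubling the signal amplitude shifts only the zeroth coefficient (by log 2) and leaves the others essentially unchanged, which is one reason cepstral features are easy to normalize.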

Robustness

Problem: A mismatch in the speech signal between the training phase and testing phase can result in performance degradation.

Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization and/or model adaptation.

Perception Approach: Extract fundamental acoustic information in narrow bands of speech. Robust integration of features across time and frequency.

Methods for Robust Speech Recognition

[Figure: the training and testing paths each run Signal → Features → Model. The mismatch between them is addressed by Enhancement at the signal level, Normalization at the feature level and Adaptation at the model level.]

Acoustic Model
Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).

Hidden Markov Model (HMM): Statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone. HMMs are also assigned for modeling extraneous events.

Advantages: Powerful statistical method for dealing with a wide range of data and reliably recognizing speech.

Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM: are there data-driven models that work better for some or all vocabularies?

HMM for speech


[Figure: Phone model for "z" (/Z/): three left-to-right states z1, z2, z3. Word model for "is" (/IH/ /Z/): the two phone models concatenated, ih1, ih2, ih3, z1, z2, z3.]

Isolated Word HMM


[Figure: five-state left-right HMM. Self-transitions a11, a22, a33, a44, a55 = 1; forward transitions a12, a23, a34, a45; skip transitions a13, a24, a35; observation densities b1(Ot), b2(Ot), b3(Ot), b4(Ot), b5(Ot).]

Left-right HMM: highly constrained state sequences

Word Lexicon

Goal: Map legal phone sequences into words according to phonotactic rules. For example,
David: /d/ /ey/ /v/ /ih/ /d/

Multiple Pronunciation: Several words may have multiple pronunciations. For example,
Data: /d/ /ae/ /t/ /ax/
Data: /d/ /ey/ /t/ /ax/

Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?

Language Model
Goal: Mapping words into phrases and sentences based on task syntax.

Handcrafted: Deterministic grammars that are knowledge-based. For example,
Yes on my credit (card) please

Statistical: Compute estimate of word probabilities (N-gram model). For example,
Yes on my credit card please (the slide's figure weights two alternative word transitions 0.4 and 0.6)

Challenges: How do you build a language model rapidly for a new task?

Pattern Classification
Goal: Combine information (probabilities) from the acoustic model, language model and word lexicon to generate an optimal word sequence (highest probability).

Method: Decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenges: How do we build efficient structures (FSMs) for decoding and searching large vocabulary, complex language model tasks?
- features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
- FSM methods can compile the network to 10^8 states, 14 orders of magnitude more efficient
- What is the theoretical limit of efficiency that can be achieved?
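
A minimal sketch of Viterbi decoding on a toy three-state left-right HMM with discrete observations. The model parameters, symbol alphabet and function names are assumptions chosen for illustration; a real decoder searches a network compiled from the acoustic model, lexicon and language model rather than a single small HMM.

```python
import math

def log(p):
    """Natural log that maps zero probability to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, init, trans, emit):
    """Return (best state path, its log probability) for a discrete HMM."""
    n_states = len(init)
    score = [log(init[s]) + log(emit[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev, score, step = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best = max(range(n_states),
                       key=lambda r: prev[r] + log(trans[r][s]))
            step.append(best)
            score.append(prev[best] + log(trans[best][s]) + log(emit[s][obs]))
        backpointers.append(step)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1], best_score

# Toy left-right model: no backward transitions, must start in state 0.
init = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
emit = [[0.8, 0.1, 0.1],   # state 0 mostly emits symbol 0
        [0.1, 0.8, 0.1],   # state 1 mostly emits symbol 1
        [0.1, 0.1, 0.8]]   # state 2 mostly emits symbol 2
path, logp = viterbi([0, 0, 1, 1, 2], init, trans, emit)
print(path, math.exp(logp))
```

The cost is proportional to (number of frames) x (number of states)^2, so the size of the compiled search network directly determines decoding time.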

Unlimited Vocabulary ASR


The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:

Features → HMM states → HMM units → Phones → Words → Sentences

- For the WSJ 64,000 vocabulary, this results in a network of 10^22 bytes!
- State-of-the-art methods including fast match, multi-pass decoding and A* stack provide tremendous speed-up at a cost of increased complexity and less portability.
- Advances in weighted finite state transducers have enabled us to represent this network in a unified mathematical framework with only 10^8 bytes!

Weighted Finite State Transducers (WFST)

- Unified mathematical framework for ASR
- Efficiency in time and space

[Figure: four WFSTs (State:HMM, HMM:Phone, Phone:Word, Word:Phrase) pass through Combination and Optimization to produce the Search Network.]

Weighted Finite State Transducer

Word Pronunciation Transducer

[Figure: pronunciation transducer for the word "data", with arcs labeled input:output/weight: d:ε/1, ey:ε/.4, ae:ε/.6, dx:ε/.8, t:ε/.2, ax:data/1.]
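
The arc weights of the pronunciation transducer can be multiplied along each path to get the probability of each pronunciation of "data". The sketch assumes the arcs form four successive positions (d, then ey or ae, then dx or t, then ax); that grouping is inferred from the arc labels and is not stated explicitly in the slides.

```python
from itertools import product

# Arcs grouped by position along the path, as (phone, weight).
# The grouping into four positions is an assumption.
positions = [
    [("d", 1.0)],
    [("ey", 0.4), ("ae", 0.6)],
    [("dx", 0.8), ("t", 0.2)],
    [("ax", 1.0)],
]

def pronunciations(positions):
    """Enumerate every path and the product of its arc weights."""
    result = {}
    for path in product(*positions):
        phones = " ".join(phone for phone, _ in path)
        weight = 1.0
        for _, w in path:
            weight *= w
        result[phones] = weight
    return result

table = pronunciations(positions)
for phones, weight in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{phones}: {weight:.2f}")
```

Four pronunciations are encoded by six arcs. Sharing arcs in this way is what keeps the combined network compact.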

Algorithmic Speed-up for Speech Recognition

[Figure: relative speed (0 to 30) versus year (1994 to 2002), comparing AT&T (algorithmic), Community, and Moore's Law (hardware). Task: North American Business; vocabulary: 40,000 words; branching factor: 85.]

Utterance Verification
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.

Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:
Label:       credit  please
Recognized:  credit  fees
Confidence:  (0.9)   (0.3)

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
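
One simple way to act on per-word confidence scores is a fixed threshold. The sketch below takes the scores as given (it does not implement the hypothesis test itself); the threshold of 0.5 and the function name `verify` are assumptions.

```python
def verify(words, confidences, threshold=0.5):
    """Split recognized words into accepted and rejected by confidence."""
    accepted, rejected = [], []
    for word, score in zip(words, confidences):
        (accepted if score >= threshold else rejected).append(word)
    return accepted, rejected

# The slide's example: "credit please" recognized as "credit fees".
accepted, rejected = verify(["credit", "fees"], [0.9, 0.3])
print(accepted, rejected)
```

Raising the threshold rejects more errors but also more valid input, which is the trade-off named under Challenges.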

Rejection

Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.

Measure of Confidence: Associating word strings with a verification cost that provides an effective measure of confidence (Utterance Verification).

Effect: Improvement in the performance of the recognizer, understanding system and dialogue manager.

State-of-the-Art Performance?

[Figure: ASR block diagram, shown inside the TTS / DM / SLU / ASR circle. Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → Utterance Verification → Recognized Sentence, with the Acoustic Model, Language Model and Word Lexicon feeding Pattern Classification.]

Word Error Rates


CORPUS                                      TYPE                       VOCABULARY SIZE       WORD ERROR RATE
Connected Digit Strings--TI Database        Spontaneous                11 (zero-nine, oh)    0.3%
Connected Digit Strings--Mall Recordings    Spontaneous                11 (zero-nine, oh)    2.0%
Connected Digit Strings--HMIHY              Conversational             11 (zero-nine, oh)    5.0%
RM (Resource Management)                    Read Speech                1000                  2.0%
ATIS (Airline Travel Information System)    Spontaneous                2500                  2.5%
NAB (North American Business)               Read Text                  64,000                6.6%
Broadcast News                              News Show                  210,000               13-17%
Switchboard                                 Conversational Telephone   45,000                25-29%
Call Home                                   Conversational Telephone   28,000                40%

Factor of 17 increase in digit error rate
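
Word error rate is conventionally computed from the minimum number of substitutions, deletions and insertions needed to turn the reference transcription into the recognized one, divided by the number of reference words. The sketch below is a plain edit-distance implementation written for illustration; the function name is an assumption, and it is not the scoring tool used to produce the figures above.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("credit please", "credit fees"))  # one substitution in two words
print(round(5.0 / 0.3))  # ratio of the highest to the lowest digit error rate
```

The second line reproduces the "factor of 17" note: the digit error rate rises from 0.3% to 5.0%, a ratio of about 16.7.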


North American Business

[Figure: North American Business task. Vocabulary: 40,000 words; branching factor: 85.]

Broadcast News


Dictation Machine


Algorithmic Accuracy for Speech Recognition

[Figure: word accuracy (30 to 80) versus year (1996 to 2002) for Switchboard/Call Home. Vocabulary: 40,000 words; perplexity: 85.]

Growth in Effective Recognition Vocabulary Size

[Figure: vocabulary size (1 to 10,000,000, logarithmic axis) versus year (1960 to 2010).]

Human Speech Recognition vs ASR


Human Speech Recognition vs ASR

[Figure: log-log plot of machine error (%), 0.1 to 100, against human error (%), 0.001 to 10, with reference lines x1, x10 and x100 and a region marked "Machines Outperform Humans". Tasks plotted: Digits, RM-LM, RM-null, WSJ, WSJ-22dB, NAB-mic, NAB-omni, SWBD.]

Voice-Enabling Services: Technology Components

[Figure: the speech circle. Customer voice request ("I dialed a wrong number") → ASR (words spoken) → SLU (meaning: "Billing credit") → DM & SLG (actions and words: "Determine correct number") → TTS → voice reply to customer ("What number did you want to call?").]

Spoken Language Understanding (SLU)

Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
- accurate understanding can often be achieved without correctly recognizing every word
- SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms

Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit salient words and phrases to map high information word sequences to appropriate meaning

Performance Evaluation: Accuracy of speech understanding system on various tasks and in various operating environments

Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.

Challenges: What goes beyond simple classification systems but below full Natural Language voice dialogue systems?
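
The idea of mapping salient phrases to meanings can be sketched as a lookup. The phrase list below is hypothetical (only the call-type names appear in these slides), and a deployed system would learn phrase-to-meaning associations from data rather than use a hand-written table.

```python
# Hypothetical salient phrases mapped to call types.
SALIENT_PHRASES = {
    "wrong number": "Billing Credit",
    "balance": "Account Balance",
    "calling plan": "Rate Plan",
    "don't recognize": "Unrecognized Number",
}

def classify(recognized_text, default="Other"):
    """Return the call type of the first salient phrase found in the text."""
    text = recognized_text.lower()
    for phrase, call_type in SALIENT_PHRASES.items():
        if phrase in text:
            return call_type
    return default

print(classify("I dialed a wrong number"))
```

Because only the salient phrase has to be recognized, the rest of the utterance can contain recognition errors without changing the result.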



Dialog Management (DM)

Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
- DM makes viable complex services that require multiple exchanges between the system and the customer
- dialog systems can handle user-initiated topic switching (within the domain of the application)

Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well understood goal or system interaction

Performance Evaluation: Speed and accuracy of attaining a well defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service

Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging

Challenges: Is there a science of dialogues: how do you keep it efficient (turns, time, progress towards a goal); how do you attain goals (get answers)? Is there an art of dialogues? How does the User Interface play into the art/science of dialogues: sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak (multimodal interactions with machines).

Customer Care IVR and HMIHY


Customer Care IVR
- Sparkle Tone, "Thank you for calling AT&T": 10 seconds
- Network Menu: 30 seconds
- LEC Misdirect Announcement: 13 seconds
- Account Verification Routine: 26 seconds
- Main Menu: 58 seconds
- LD Sub-Menu: 38 seconds
- Reverse Directory Routine
Total Time to Get to Reverse Directory Lookup: 2:55 minutes!!!

HMIHY (sm)
- Sparkle Tone, "AT&T, How may I help you?": 20 seconds
- Account Verification Routine: 8 seconds
- Reverse Directory Routine
Total Time to Get to Reverse Directory Lookup: 28 seconds!!!

HMIHY (sm): How Does It Work

- Prompt is "AT&T. How may I help you?"
- User responds with totally unconstrained fluent speech
- System recognizes the words and determines the meaning of user's speech, then routes the call
- Dialog technology enables task completion

[Figure: HMIHY routes calls to Account Balance, Calling Plans, Local, Unrecognized Number, ...]

HMIHY (sm) Example Dialogs

- Irate Customer
- Rate Plan
- Account Balance
- Local Service
- Unrecognized Number
- Threshold Billing
- Billing Credit

Customer Satisfaction
- decreased repeat calls (37%)
- decreased OUTPIC rate (18%)
- decreased CCA (Call Control Agent) time per call (10%)
- decreased customer complaints (78%)

Customer Care Scenario


TTS: Closest to the Customer's Ear

[Figure: the speech circle (customer voice request → ASR → SLU → DM & SLG → TTS → voice reply to customer), here introducing the TTS component.]

Speech Synthesis

text → Linguistic Rules → DSP Computer → D-to-A Converter → speech

Speech Synthesis
Synthesis of speech for effective human-machine communications:
- reading email messages over a telephone
- telematics feedback in automobiles
- talking agents for completion of transactions
- call center help desks and customer care
- handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
- announcement machines that provide things like stock quotes, airlines schedules, updates of arrivals and departures of flights

Giving Machines High Quality Voices and Faces

U.S. English Female:
U.S. English Male:
Spanish Female:
Natural Speech

Speech Synthesis Examples


Soliloquy from Hamlet:

Gettysburg Address:

Third Grade Story:



Speech Recognition Demos


Au Clair de la Lune


Information Kiosk


Multimodal Language Processing


Unified Multimodal Experience: Access to information through voice interface, gesture or both.

Multimodal Finite State: Combination of speech, gesture and meaning using finite state technology

MATCH: Multimodal Access To City Help

Example query: "Are there any cheap Italian places in this neighborhood?"

MIPad Demo--Microsoft


Voice-Enabled Services
- Desktop applications: dictation, command and control of desktop, control of document properties (fonts, styles, bullets, ...)
- Agent technology: simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
- Voice Portals: convert any web page to a voice-enabled site where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
- E-Contact services: Call Centers, Customer Care (HMIHY) and Help Desks where calls are triaged and answered appropriately using natural language voice dialogues
- Telematics: command and control of automotive features (comfort systems, radio, windows, sunroof)
- Small devices: control of cellphones, PDAs from voice commands

Milestones in Speech and Multimodal Technology Research

[Figure: timeline with a year axis from 1962 to 2002 (ticks at 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002) and five successive eras.]

- Small Vocabulary, Acoustic Phonetics-based: Isolated Words. Methods: filter-bank analysis; time-normalization; dynamic programming
- Medium Vocabulary, Template-based: Isolated Words; Connected Digits; Continuous Speech. Methods: pattern recognition; LPC analysis; clustering algorithms; level building
- Large Vocabulary, Statistical-based: Connected Words; Continuous Speech. Methods: hidden Markov models; stochastic language modeling
- Large Vocabulary; Syntax, Semantics: Continuous Speech; Speech Understanding. Methods: stochastic language understanding; finite-state machines; statistical learning
- Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: Spoken dialog; Multiple modalities. Methods: concatenative synthesis; machine learning; mixed-initiative dialog

Future of Speech Recognition Technologies

[Figure: timeline with a year axis from 2002 to 2011 (ticks at 2002, 2005, 2008, 2011) and three successive stages.]

- Dialog Systems: Very Large Vocabulary, Limited Tasks, Controlled Environment
- Robust Systems: Very Large Vocabulary, Limited Tasks, Arbitrary Environment
- Multilingual Systems; Multimodal Speech Enabled Devices: Unlimited Vocabulary, Unlimited Tasks, Many Languages

Issues in Speech Recognition


Issues in Speech Recognition

- Input speech format
  - Isolated words/phrases
  - Connected word sequences
  - Continuous speech (essentially unconstrained)
- Recognition mode
  - Speaker trained
  - Speaker independent
  - Amount of training material
- Vocabulary size and complexity (perplexity)
  - Small (2-50 words)
  - Medium (50-250 words)
  - Large (250-2,000,000 words)
- Recognition task
  - Syntax constrained (language model)
  - Viable semantics
- Speaker characteristics
  - Highly motivated, cooperative
  - Casual
- Speaking environment
  - Quiet office
  - Home
  - Noisy surroundings (factory floor, cellular environments, speakerphones)
- Transducer and transmission system
  - High quality microphone, close talking/noise cancelling microphone
  - Telephone (carbon button/electret)
  - Switched telephone network
  - IP network (VoIP)
  - Cellular network
- Human factors
  - Feedback to users
  - Instructions
  - Requests for repeats
  - Rejections
- Tolerance for recognition errors
  - Fail soft systems
  - Human intervention on errors/confusion
  - Correction mechanisms built in
- System complexity
  - Computation/hardware
  - Real-time response capability

Overview of Speech
Recognition Processes


Overview of Speech Recognition Processes

[Figure: overview of the recognition system. Labels: Speech; Transcriptions; FS; log En, Zn; Temporal, Spectral, Cepstral, LPC Features; d(X,Y), DTW; Template, VQ, HMM; Word/Sound Models, Templates; Dictionary, Syntax; Recognized Input; Templates/Models.]

Statistical Pattern Recognition


The basic speech recognition task may be defined as follows:
- a sequence of measurements (speech analysis frames) on the (endpoint detected) speech signal of an utterance defines a pattern for that utterance
- this pattern is to be classified as belonging to one of several possible categories (classes) (for word/phrase recognition) or to a sequence of possible categories (for continuous speech recognition)
- the rules for this classification are formulated on the basis of a labeled set of training patterns or models

The type of measurement (temporal, spectral, cepstral, LPC features) and the classification rules (pattern alignment and distance, model alignment and probability) are the main factors that distinguish one method of speech recognition from another
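
The "pattern alignment and distance" style of classification can be sketched with dynamic time warping (DTW) against stored templates. To keep the example short, each frame is a single number rather than a feature vector, and the templates and labels are invented; both are assumptions for illustration only.

```python
def dtw_distance(x, y):
    """Accumulated distance of the best time alignment between x and y."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = best accumulated distance aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # frame-to-frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch y
                                     cost[i][j - 1],      # stretch x
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def classify(pattern, templates):
    """Return the label of the template closest to the pattern under DTW."""
    return min(templates, key=lambda label: dtw_distance(pattern, templates[label]))

# Invented one-dimensional templates standing in for word templates.
templates = {"rising": [0, 1, 2, 3], "falling": [3, 2, 1, 0]}
print(classify([0, 0, 1, 2, 2, 3], templates))
```

The test pattern is a slower version of the "rising" template. DTW absorbs the difference in speaking rate, so the distance to that template is zero.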

Issues in Pattern Recognition

- Training phase is required; the more training data, the better the patterns (templates, models)
- Patterns are sensitive to the speaking environment, transmission environment, transducer (microphone), etc. (This problem is known as the speech robustness problem.)
- No speech specific knowledge is required or exploited, except in the feature extraction stage
- Computational load is (more or less) linearly proportional to the number of patterns being recognized (at least for simple recognition problems, e.g., isolated word tasks)
- Pattern recognition techniques are applicable to a range of speech units, including phrases, words, and sub-word units (phonemes, syllables, dyads, etc.)
- Extensions possible to large vocabulary, fluent speech recognition using word lexicons (dictionaries) and language models (grammars or syntax)
- Extensions possible to natural language understanding systems

Speech Recognition Processes


1. Fundamentals (Lectures 2-6)
- speech production (acoustic-phonetics, linguistics)
- speech perception (auditory (ear) models, neural models)
- pattern recognition (statistical, template-based)
- neural networks (classification methods)

2. Speech/Endpoint Detection (Lecture 7)
- algorithms
- speech features

Speech Recognition Processes


3. Speech Analysis/Feature Extraction (Lectures 7-9)
- temporal parameters (log energy, zero crossings, autocorrelation)
- spectral parameters (STFT, OLA, FBS, spectrograms)
- cepstral parameters (cepstrum, Δ-cepstrum, Δ²-cepstrum)
- LPC parameters (reflection coefs, area coefs, LSP)

4. Distance/Distortion Measures (Lecture 10)
- temporal (quadratic, weighted)
- spectral (log spectral distance)
- cepstral (cepstral distance)
- LPC (Itakura distance)

Speech Recognition Processes


5. Time Alignment Algorithms (Lectures 11-12)
- linear alignments
- dynamic time warping (DTW, dynamic programming)
- HMM alignments (Viterbi alignment)

6. Model Building/Training (Lectures 13-14)
- template methods
- clustering methods
- HMM methods: Viterbi, Forward-Backward
- vector quantization (VQ) methods

Speech Recognition Processes


7. Connected word modeling (Lecture 15)
- dynamic programming
- level building
- one pass method

8. Testing/Evaluation Methods (Lecture 16)
- word/sound error rates
- dictionary of words
- task syntax
- task semantics
- task perplexity

Speech Recognition Processes


9. Large Vocabulary Recognition (Lectures 17-18)
- phoneme models
- context dependent models
- discrete, mixed, continuous density models
- N-gram language models
- Natural language understanding
- insertions, deletions, substitutions
- other factors

Putting It All Together


Speech Recognition Course Topics

What We Will Be Learning

- speech production model: acoustics, articulatory concepts, speech production models
- speech perception model: ear models, auditory signal processing, equivalent acoustic processing models
- signal processing approaches to speech recognition: acoustic-phonetic methods, pattern recognition methods, statistical methods, neural network methods
- fundamentals of pattern recognition
- signal processing methods: bank-of-filters model, short-time Fourier transforms, LPC methods, cepstral methods, perceptual linear prediction, mel cepstrum, vector quantization
- pattern recognition issues: speech detection, distortion measures, time alignment and normalization, dynamic time warping
- speech system design issues: source coding, template training, discriminative methods
- robustness issues: spectral subtraction, cepstral mean subtraction, model adaptation
- Hidden Markov Model (HMM) fundamentals: design issues
- connected word models: dynamic programming, level building, one pass method
- grammar networks: finite state machine (FSM) basics
- large vocabulary speech recognition: training, language models, perplexity, acoustic models for context dependent sub-word units
- task-oriented designs: natural language understanding, mixed initiative systems, dialog management
- text-to-speech synthesis: based on unit selection methods