Academic Documents
Professional Documents
Cultural Documents
Template-frame correlation:
    R_TX(t) = (1/n) Σ_{i=1}^{n} T(i) X(i+t)

Cross-frame correlation:
    R_XX(t_1, t_2) = (1/n) Σ_{i=1}^{n} X(i+t_1) X(i+t_2)

[Figure: template-frame and cross-frame correlation functions for voiced and unvoiced speech]
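As a sketch, the two correlation measures above can be computed directly; the function names and the NumPy formulation are illustrative, not from the original slides:

```python
import numpy as np

def template_frame_corr(T, X, t):
    """Template-frame correlation R_TX(t): average product of the
    stored template T with the frame X shifted by lag t."""
    n = len(T)
    return float(np.dot(T, X[t:t + n]) / n)

def cross_frame_corr(X, t1, t2, n):
    """Cross-frame correlation R_XX(t1, t2): average product of two
    lagged length-n segments of the frame X."""
    return float(np.dot(X[t1:t1 + n], X[t2:t2 + n]) / n)
```

For a voiced (periodic) frame, both functions peak when the lag difference equals the pitch period, which is what the DP search below exploits.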
6.345 Automatic Speech Recognition
Paralinguistic Information Processing 6
Dynamic Programming Search
Optimal solution taking into account F0 and ΔF0 constraints
Search space quantized such that ΔF/F is constant (i.e., uniform on a logarithmic frequency scale)
[Figure: DP search space, logarithmic frequency vs. time]

Cross-frame correlation: R_XX;  template-frame correlation: R_TX

    score_0(i) = R_TX^0(c_i)                                              t = 0
    score_t(i) = R_TX^t(c_i) + max_j { score_{t-1}(j) + R_XX^{t-1,t}(c_j, c_i) }   t > 0

where c_i denotes the i-th quantized F0 candidate.
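A minimal sketch of the DP recursion, assuming the correlation tables R_TX and R_XX have already been computed for the quantized candidates (the array shapes and function name are assumptions):

```python
import numpy as np

def dp_pitch_track(R_TX, R_XX):
    """DP search over quantized F0 candidates.
    R_TX[t][i]    : template-frame correlation of candidate i at frame t
    R_XX[t][j][i] : cross-frame correlation between candidate j at
                    frame t-1 and candidate i at frame t
    Returns the highest-scoring candidate index sequence."""
    T, N = R_TX.shape
    score = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    score[0] = R_TX[0]
    for t in range(1, T):
        # trans[j, i] = score of reaching candidate i from candidate j
        trans = score[t - 1][:, None] + R_XX[t]
        back[t] = trans.argmax(axis=0)
        score[t] = R_TX[t] + trans.max(axis=0)
    # backtrace from the best final candidate
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Because the candidates are spaced so that ΔF/F is constant, a single R_XX transition table shape covers the whole frequency range.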
p(x_n | S_i, class(x_n)) = score for speaker i
Score frames with class models of hypothesized phones
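As an illustration of this per-frame, class-conditional scoring (the 1-D Gaussian class models and all names here are hypothetical, not the actual SUMMIT models):

```python
import math

def speaker_score(frames, phone_classes, models):
    """Sum of per-frame log-likelihoods log p(x_n | S_i, class(x_n)).
    frames        : scalar observations x_n (illustrative)
    phone_classes : hypothesized phone label for each frame
    models        : speaker i's (mean, var) Gaussian per phone class"""
    total = 0.0
    for x, phone in zip(frames, phone_classes):
        mean, var = models[phone]
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return total
```

The speaker with the largest summed score over the utterance is selected.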
Speaker Adapted Scoring
Train speaker-dependent (SD) models for each speaker
Get best hypothesis from recognizer using speaker-independent (SI) models
Rescore hypothesis with SD models
Compute total speaker adapted score by interpolating SD score with SI score

[Diagram: test utterance → SUMMIT (SI models) → hypothesis;
 SD models for speaker i: p(x_n | u_n, S_i,SD);
 SI model: p(x_n | u_n, SI)]
[Figure: "fifty-five" — feature vectors x_1 … x_9 aligned to the hypothesized phones: f ay v f ih f t tcl iy]
p_SA(x_n | u_n, S_i) = λ_n p_SD(x_n | u_n, S_i) + (1 − λ_n) p_SI(x_n | u_n, S_i)

λ_n = c(u_n) / (c(u_n) + K)

where c(u_n) is the number of adaptation observations of unit u_n and K is an interpolation constant.
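The interpolation above can be sketched as follows (the function name and scalar inputs are illustrative):

```python
def speaker_adapted_prob(p_sd, p_si, count, K):
    """Interpolate SD and SI likelihoods for one frame.
    count : number of adaptation observations c(u_n) for unit u_n
    K     : interpolation constant.
    lam -> 1 as the adaptation count grows, so well-trained SD models
    dominate; lam -> 0 backs off to the SI model for unseen units."""
    lam = count / (count + K)
    return lam * p_sd + (1 - lam) * p_si
```

With no adaptation data (count = 0) the score falls back entirely to the SI model, so the adapted system never does worse than SI on unseen units.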
Airline  Flight  Departs    Arrives
US       279     6:00 A.M.  7:37 A.M.
NW       317     6:50 A.M.  8:27 A.M.
US       1507    7:00 A.M.  8:40 A.M.
NW       1825    7:40 A.M.  9:19 A.M.
DL       4431    7:40 A.M.  9:45 A.M.
DIALOGUE
MANAGER
Multi-modal Interfaces
Timing information is a useful way to relate inputs
  Does this mean "yes", "one", or something else?
Inputs need to be understood in the proper context
  Where is she looking or pointing while saying "this" and "there"?
  "Move this one over there"
  "Are there any over here?"
  What does he mean by "any", and what is he pointing at?
Multi-modal Fusion: Initial Progress
All multi-modal inputs are synchronized
Speech recognizer generates absolute times for words
Mouse and gesture movements generate {x,y,t} triples
Speech understanding constrains gesture interpretation
Initial work identifies an object or a location from gesture inputs
Speech constrains what, when, and how items are resolved
Object resolution also depends on information from application
[Timeline: speech "Move this one over here" aligned in time with pointing events (object) and (location)]
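A toy sketch of this timing-based fusion, assuming word start/end times from the recognizer and {x, y, t} pointer triples (the deictic-word list, the max_skew threshold, and the nearest-in-time rule are all assumptions, not the actual system):

```python
def resolve_deixis(words, gestures, max_skew=0.5):
    """Attach each deictic word to the gesture event closest in time.
    words    : list of (word, start_time, end_time) from the recognizer
    gestures : list of (x, y, t) pointer triples
    Matches the word's temporal midpoint against gesture timestamps,
    accepting matches within max_skew seconds."""
    deictic = {"this", "that", "here", "there", "one"}
    resolved = {}
    for word, start, end in words:
        if word not in deictic:
            continue
        mid = (start + end) / 2
        best = min(gestures, key=lambda g: abs(g[2] - mid))
        if abs(best[2] - mid) <= max_skew:
            resolved[(word, start)] = (best[0], best[1])
    return resolved
```

In the real system, speech understanding further constrains which gesture events are even candidates (e.g., "this one" expects an object, "over here" a location).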
Multi-modal Demonstration
Manipulating planets in a
solar-system application
Continuous tracking of
mouse or pointing gesture
Created with the SpeechBuilder utility with small changes (Cyphers, Glass, Toledano & Wang)
Standalone version runs with
mouse/pen input
Can be combined with gestures determined from vision (Darrell & Demirdjian)
OUTPUT GENERATION
DIALOGUE MANAGEMENT
CONTEXT RESOLUTION
LANGUAGE UNDERSTANDING
SPEECH RECOGNITION
MULTI-MODAL SERVER
SPEECH UNDERSTANDING
Recent Activities: Multi-modal Server
General issues:
Common meaning representation
Semantic and temporal compatibility
Meaning fusion mechanism
Handling uncertainty
GESTURE UNDERSTANDING
PEN-BASED UNDERSTANDING
Summary
Speech carries paralinguistic content:
Prosody, intonation, stress, emphasis, etc.
Emotion, mood, attitude, etc.
Speaker-specific characteristics
Multi-modal interfaces can improve upon speech-only systems
Improved person identification using facial features
Improved speech recognition using lip-reading
Natural, flexible, efficient, and robust human-computer interaction
References
C. Benoit, The intrinsic bimodality of speech communication and the synthesis of talking
faces, Journal on Communications, September 1992.
M. Covell and T. Darrell, Dynamic occluding contours: A new external-energy term for
snakes, CVPR, 1999.
F. Dellaert, T. Polzin, and A. Waibel, Recognizing emotion in speech, ICSLP, 1998.
B. Heisele, P. Ho, and T. Poggio, Face recognition with support vector machines: Global
versus component-based approach, ICCV, 2001.
T. Huang, L. Chen, and H. Tao, Bimodal emotion recognition by man and machine, ATR
Workshop on Virtual Communication Environments, April 1998.
C. Neti et al., Audio-visual speech recognition, Tech. Report, CLSP/Johns Hopkins
University, 2000.
S. Oviatt and P. Cohen, Multimodal interfaces that process what comes naturally, Comm.
of the ACM, March 2000.
A. Park and T. Hazen, ASR dependent techniques for speaker identification, ICSLP, 2002.
D. Reynolds, Speaker identification and verification using Gaussian mixture speaker
models, Speech Communication, August 1995.
P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple
features, CVPR, 2001.
C. Wang, Prosodic modeling for improved speech recognition and understanding, PhD
thesis, MIT, 2001.
E. Weinstein et al., Handheld face identification technology in a pervasive computing
environment, Pervasive, 2002.