
Primacy of Multimodal Speech Perception
Lawrence D. Rosenblum

(presented by Dominik efk)
Thesis
Multimodal speech is the primary mode of speech perception
=> Primitives not tied to any single modality (the relevant information for phonetic resolution is modality-neutral)

Evidence drawn from:
1) The ubiquity and automaticity of multimodal speech
2) Extremely early speech integration
3) The neurophysiological primacy of multimodal speech
4) Modality-neutral speech information
1.) The Ubiquity and Automaticity of Multimodal Speech
Ubiquity: visual speech perception is used both by listeners with hearing impairments and by listeners with good hearing
Automaticity: observed already in 5-month-old infants and reported in observers with various native-language backgrounds
The McGurk effect shows how the visual component of perception can influence and even override the acoustic one
=> support for gestural theories of speech primitives (as opposed to auditory-based theories)
2.) Multimodal Speech is Integrated at the Earliest Observable Stage
Possible stages of integration:
(1) the informational input
(2) before feature extraction
(3) after feature extraction
(4) after segment or word recognition
Research suggests that integration happens before segments are phonetically categorized, i.e. at stage (3), and possibly even before feature extraction, at stage (2)
3.) Neurophysiological Primacy of Multimodal Speech
Neuroimaging research: changes in visual speech information can change auditory cortex activity during audio-visual integration
fMRI research: a silent lipreading task can induce primary auditory cortex activity similar to that induced by auditory speech
Modality seems relatively unimportant to the speech perception mechanism, even at the level of the auditory cortex
These findings remain controversial => more research is necessary
4.) Modality-Neutral Speech Information
Speech information (an alternative account): composed of higher-order, time-varying patterns of energy (light, sound) whose more abstract nature allows for a common form across multiple energy arrays
The role of the sensory organs: sampling the higher-order structure in the energy range to which they are sensitive
Cross-modal integration does not occur within the perceiver but is a property of the information itself.
4.1. Specific examples of modality-neutral speech information
Higher-order, modality-neutral information
Frequency of oscillation
Time-course
Amplitude
The details of the structured energy differ across sensory modalities, yet the higher-order information is modality-independent
Informational commonalities across audio and visual speech
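To make the "common higher-order structure" idea concrete, here is a minimal, hypothetical sketch (not part of the original slides or chapter): it samples one candidate piece of modality-neutral information, the slow time-varying opening pattern, from both the acoustic and the optical energy arrays and correlates them. The inputs `audio` (mono waveform) and `lip_aperture` (inter-lip distance per video frame) are assumed to come from recording and face-tracking steps not shown here.

```python
# Hypothetical illustration: the same higher-order, time-varying pattern
# sampled from two different energy arrays (sound pressure vs. reflected light).
import numpy as np
from scipy.signal import hilbert, resample

def amplitude_envelope(audio: np.ndarray) -> np.ndarray:
    """Slowly varying acoustic energy pattern (magnitude of the analytic signal)."""
    return np.abs(hilbert(audio))

def crossmodal_correlation(audio: np.ndarray, lip_aperture: np.ndarray) -> float:
    """Correlate the acoustic envelope with the visible articulatory opening.

    A substantial correlation would illustrate the claim that both modalities
    carry the same higher-order, time-varying structure, each sensor sampling
    it in the energy range to which it is sensitive.
    """
    env = amplitude_envelope(audio)
    env = resample(env, len(lip_aperture))    # bring both series to the video rate
    env = (env - env.mean()) / env.std()      # z-score both series
    lip = (lip_aperture - lip_aperture.mean()) / lip_aperture.std()
    return float(np.mean(env * lip))          # Pearson correlation
```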
4.2. Informational similitude in auditory and visual speech: Time-varying dimensions
The salient general properties of speech perception are similar across modalities
The time-varying, dynamic dimensions of visual speech are critically informative
Isolated time-varying aspects of the signals can provide useful speech information (sine-wave speech; a rough sketch follows this list)
Isolated time-varying visual information for articulation is also perceptually salient (the point-light technique)
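As a rough, hypothetical sketch of the sine-wave speech idea (not the original researchers' procedure), the utterance can be reduced to a few sinusoids that follow its time-varying formant tracks, discarding the fine spectral detail. The arrays `formant_freqs` and `formant_amps` are assumed inputs from a formant tracker that is not shown here.

```python
# Hypothetical sketch of sine-wave speech synthesis: keep only a few sinusoids
# that follow the time-varying formant frequencies and amplitudes.
# `formant_freqs` and `formant_amps` are assumed arrays of shape
# (n_frames, n_formants) produced by a formant tracker not shown here.
import numpy as np

def sine_wave_speech(formant_freqs: np.ndarray,
                     formant_amps: np.ndarray,
                     frame_rate: float = 100.0,
                     sample_rate: int = 16000) -> np.ndarray:
    """Sum sinusoids whose frequency and amplitude follow the formant tracks."""
    n_frames, n_formants = formant_freqs.shape
    n_samples = int(n_frames * sample_rate / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t_samples = np.arange(n_samples) / sample_rate
    out = np.zeros(n_samples)
    for k in range(n_formants):
        # Interpolate the slowly varying tracks up to the audio rate.
        freq = np.interp(t_samples, t_frames, formant_freqs[:, k])
        amp = np.interp(t_samples, t_frames, formant_amps[:, k])
        # Integrate the instantaneous frequency to obtain a smooth phase.
        phase = 2 * np.pi * np.cumsum(freq) / sample_rate
        out += amp * np.sin(phase)
    return out / np.max(np.abs(out))  # normalize to avoid clipping
```

The point-light technique plays the analogous role on the visual side: only a few moving dots on the articulators are retained, yet the isolated time-varying pattern still supports perception.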



4.2. Time-varying dimensions
Time-varying speech information is more salient than static speech information
Auditory speech: the portions of the signal that change least (e.g. vowel nuclei) are less informative than the more dynamic portions influenced by coarticulation
Visual speech: analogous findings
The coarticulated/dynamic portions of visible vowels are more informative than the extreme canonical articulatory positions.
4.3. Informational similitude in auditory and visual speech: Indexical dimensions of speech
Speaker-specific information can facilitate speech perception
Auditory speech: phonetic properties of the speech signal can be used for both speech and speaker recognition
Visual speech: analogous findings
Subjects were able to recognize point-light speakers from idiolectal information (coarticulatory assimilation, rhythmicity) available in the visible gestures

4.3. Indexical dimensions of speech
Possibly, speaker-specific idiolectal information is also modality-neutral
If so, cross-modal speaker matching should be possible (i.e. matching heard voices to seen speakers)
Kamachi et al. (2003): matching performance at better-than-chance levels
Rosenblum and Nichols (2000): used the point-light technique -> subjects performed at levels significantly above chance
Smith and Rosenblum (2003): matching sine-wave sentences to point-light sentences -> preliminary results at better-than-chance levels
5.) Visible Speech and the Evolution of Spoken Language: Speculations and Predictions
MacNeilage's frame/content theory
The frame of spoken language is constructed from components of ingestive mastication
The oscillatory nature of mandibular movements during mastication provided the evolutionary support for the cyclical nature of syllabic speech
Ingestive gestures became invested with communicative potential
Communicative relevance transferred from the visuofacial gestures to vocal oscillations
5.) Speculations
Corballis's theory
Much of the evolution of language occurred in a visual medium
The first true language was likely gestural rather than spoken (gestures comprised both manual and facial articulations)
Language evolved from being primarily manual, to facial and manual, and ultimately to facial and vocal
Visuofacial gestures were a critical link from manual to audible gestures
Advantages driving the shift to vocal language: usefulness in the dark and over great distances, and freeing the hands
5.) Predictions
Traceable phylogeny of multimodal speech
Potentially, primates should show evidence of audiovisual integration of speech
Influence of visual speech on the phonological inventories of modern languages
Possible evidence: /m/ vs. /n/ - linguistic distinctions that are harder to hear are easier to see, and vice versa
Still, this remains an open question
6.) Conclusions
The primary mode of speech perception is multimodal
Ubiquitous, automatic, and supported by behavioural and neurophysiological findings
The appropriate conceptualization of speech information is in terms of modality-neutral information
Evidence: close correspondences between auditory and visual speech signals, and informational similitude across modalities
