
1.0 Introduction to Speech Processing

Speech Production
Speech Perception
Speech Analysis
Speech Synthesis
Speech Recognition
Speech Coding
Human Factors
Hearing
Texts and Web Pages
Principles of Computer Speech
I. H. Witten, Academic Press, 1982

Speech Processing
Ed. Chris Rowden, McGraw-Hill, 1992

Fundamentals of Speech Synthesis and Speech Recognition
Ed. Eric Keller, Wiley & Sons, 1994

Speech and Audio Signal Processing
Ben Gold & Nelson Morgan, Wiley, 2000

Speech Communication: Human and Machine, 2nd Edition
D. O’Shaughnessy, IEEE Press, 2000

Speech Synthesis and Recognition, 2nd Edition
John Holmes & Wendy Holmes, Taylor & Francis, 2001

Web addresses:
http://www.speech.usyd.edu.au/comp.speech COMP.SPEECH
http://svr-www.eng.cam.ac.uk/comp.speech COMP.SPEECH
http://www.cse.ogi.edu/CSLU/ CSLU
http://www.cstr.ed.ac.uk/projects/festival.html FESTIVAL
http://tcts.fpms.ac.be/synthesis/mbrola.html MBROLA
http://www.speech.psychol.ucl.ac.uk/ UCL
http://www.ISIP.MsState.Edu/resources/ MISSISSIPPI U
SPEECH
What is it?

Linguistics
Acoustics

Physiology
The Speech Chain (Denes & Pinson)

Speaker -> Linguistic level -> Physiological level -> Acoustic level -> Physiological level -> Linguistic level -> Listener
Linguistics
Units of language. What are they?

Words? Syllables? Sounds?

What are the individual sounds in language?

Phonemes. How are they defined?


Consider the words pig, dig and jig.

p, d and j distinguish the three words from each other.
We can compare all the words in a language and
determine those sounds that differentiate one word
from another. These sounds constitute the phonemes
of a language.
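
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the mini-lexicon and its transcriptions are hypothetical, written with the sound symbols used in these notes, and a real language would need a full pronouncing dictionary. Two words that differ in exactly one sound form a minimal pair, and the sounds that differ are contrastive.

```python
from itertools import combinations

# Hypothetical mini-lexicon: word -> sequence of sounds (symbols as used in these notes).
LEXICON = {
    "pig": ["p", "i", "g"],
    "dig": ["d", "i", "g"],
    "jig": ["j", "i", "g"],
    "pit": ["p", "i", "t"],
    "bit": ["b", "i", "t"],
}


def minimal_pairs(lexicon):
    """Return (word_a, word_b, sound_a, sound_b) for words differing in exactly one sound."""
    pairs = []
    for (word_a, sounds_a), (word_b, sounds_b) in combinations(lexicon.items(), 2):
        if len(sounds_a) != len(sounds_b):
            continue
        diffs = [(a, b) for a, b in zip(sounds_a, sounds_b) if a != b]
        if len(diffs) == 1:
            pairs.append((word_a, word_b, diffs[0][0], diffs[0][1]))
    return pairs


def contrastive_sounds(lexicon):
    """Collect every sound that distinguishes at least one minimal pair."""
    sounds = set()
    for _, _, a, b in minimal_pairs(lexicon):
        sounds.update((a, b))
    return sounds


for pair in minimal_pairs(LEXICON):
    print(pair)
```

In this toy lexicon the vowel i never distinguishes two words, so it does not appear among the contrastive sounds; that is a limitation of the small word list, not a statement about English.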

kop versus cap. What distinguishes the two words?
The o and the a. k and c have the same sound here.
They both belong to the same phoneme /k/.
Similarly, in cede versus seed, c and s both belong to the phoneme /s/.
How many phonemes in English?
Hazard a guess?

About 40, split roughly equally between vowels and consonants.

We’ll classify them later according to the way in which they are pronounced.
Physiology
This relates to how the sounds are produced through neural
and muscular activity.

We set air coming up from the lungs in motion using our
vocal cords and then we can channel this air through the
vocal tract using our tongue, lips, etc.

We can classify the different sounds we make according to how
we set the air in motion and how we channel the airstream
through the vocal tract.
Acoustics
This describes the generation and transmission of the sounds.
How air is set in motion.
We generate sound waves. What do they look like?
Sound Files
Formats
.wav (PCs) little endian
.au (Unix) big endian
.aiff (Macs) big endian

Sampling rates 8 kHz - 44.1 kHz

Typically 16 kHz. This means that frequencies up to 8 kHz can be
accurately represented (Nyquist Theorem).

Data
Typically 8- or 16-bit integers, but a variety of data types is supported,
including floating point, depending on the format.

16-bit signed integers are common for .wav files.
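
As a rough illustration of these points, the sketch below writes a short mono tone as 16-bit little-endian samples using Python's standard `wave` and `struct` modules, then reads the header back and derives the highest representable frequency as half the sampling rate. The tone frequency, duration and in-memory buffer are arbitrary choices for the example.

```python
import io
import math
import struct
import wave


def write_tone(fileobj, freq_hz=440.0, rate_hz=16000, seconds=0.01):
    """Write a mono, 16-bit PCM sine tone to a WAV file object."""
    n = int(rate_hz * seconds)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate_hz))
        for i in range(n)
    ]
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes per sample = 16 bits
        w.setframerate(rate_hz)
        # "<" = little endian, "h" = 16-bit signed integer
        w.writeframes(struct.pack("<%dh" % n, *samples))


def describe(fileobj):
    """Read the WAV header and samples (assumes mono, 16-bit data)."""
    with wave.open(fileobj, "rb") as w:
        rate = w.getframerate()
        width = w.getsampwidth()
        n = w.getnframes()
        raw = w.readframes(n)
    samples = struct.unpack("<%dh" % n, raw)
    return {
        "rate_hz": rate,
        "bits": 8 * width,
        "nyquist_hz": rate / 2,    # highest frequency the sampling rate can represent
        "n_samples": n,
        "peak": max(abs(s) for s in samples),
    }


buf = io.BytesIO()
write_tone(buf)
buf.seek(0)
info = describe(buf)
print(info)
```

Replacing the in-memory buffer with a file opened in binary mode gives the same result for a file on disk.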


2.0 SPEECH PRODUCTION
This is about the sounds we use when we speak and the relationships
between them. In linguistic terms, speech production is concerned
with

•phonetics – the actual speech sounds of human languages:
how they are made by moving various organs in the
vocal tract (articulatory phonetics), how they are perceived
by the human ear, and their physical properties (acoustic
phonetics).
•phonology – how the sounds are organised into meaningful
groupings (sentences). This deals with speech
at a more abstract level and is more concerned with
the thought processes involved in speech.
2.1 Anatomy of Speech Organs
The source of most speech sounds is the larynx. It contains two folds of
tissue called the vocal folds or vocal cords, which can open and shut like
a pair of fans. The gap between the vocal cords is called the glottis. As
air is forced through the glottis the vocal cords start to vibrate and
modulate the air flow. This process is known as phonation. The frequency
of vibration determines the pitch of the voice. For a male voice it is
typically in the range 50-200 Hz, whereas for a female voice it can be up
to 500 Hz.

http://www.phon.ox.ac.uk/~jcoleman/phonation.htm
Glottal Pulse (Rosenberg, JASA 49, 1971)
[Figure: glottal pulse, amplitude against time (ms), showing the opening phase, the closing phase and the closure.]

Period = 12.5 ms
Fundamental frequency = 1/0.0125 s = 80 Hz
[Figure: spectrum of the glottal pulse, intensity against frequency (Hz).]

Harmonics of the spectrum are spaced at 80 Hz, corresponding to the
pitch period of 12.5 ms.
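
The link between the 12.5 ms period and the 80 Hz harmonic spacing can be checked numerically. The sketch below builds a train of Rosenberg-style pulses (a raised-cosine opening phase, a quarter-cosine closing phase, then closure) and lists the frequencies at which the discrete Fourier transform has energy. The sampling rate and the phase lengths are assumptions chosen for the example, not values taken from Rosenberg's paper.

```python
import cmath
import math

FS = 8000        # sampling rate in Hz (assumed for this sketch)
PERIOD = 100     # samples per pitch period: 100 / 8000 s = 12.5 ms
N_OPEN = 40      # opening-phase length in samples (assumed)
N_CLOSE = 16     # closing-phase length in samples (assumed)


def rosenberg_pulse(n):
    """Value of a Rosenberg-style glottal pulse at sample n within one period."""
    if n < N_OPEN:
        return 0.5 * (1.0 - math.cos(math.pi * n / N_OPEN))
    if n < N_OPEN + N_CLOSE:
        return math.cos(math.pi * (n - N_OPEN) / (2.0 * N_CLOSE))
    return 0.0  # closure: no airflow


def pulse_train(n_periods=10):
    """Repeat the pulse for a whole number of pitch periods."""
    return [rosenberg_pulse(i % PERIOD) for i in range(n_periods * PERIOD)]


def dft_magnitude(x, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x)))


def harmonic_frequencies(x, max_hz=800):
    """Frequencies (Hz) up to max_hz, excluding 0 Hz, with non-negligible energy."""
    bin_hz = FS / len(x)
    mags = [dft_magnitude(x, k) for k in range(int(max_hz / bin_hz) + 1)]
    floor = 1e-6 * max(mags)
    return [k * bin_hz for k, m in enumerate(mags) if k > 0 and m > floor]


signal = pulse_train()
f0 = FS / PERIOD
peaks = harmonic_frequencies(signal)
print("f0 =", f0, "Hz")
print("frequencies with energy:", peaks)
```

Because the signal repeats exactly every 12.5 ms, energy appears only at whole multiples of the fundamental frequency, which is why the spectrum is a series of lines 80 Hz apart.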
The vocal tract consists of the air passages from the vocal cords
to the lips, including the nasal cavity. The vocal tract behaves like
a resonance chamber, amplifying and attenuating certain
frequencies. The shape and size of the vocal tract can be modified
by moving the lips, tongue, velum, etc. which results in
continually changing resonant frequencies. Consequently the
spectrum of the speech produced contains peaks at the resonant
frequencies, known as formants. The nasal cavity can be either included or
excluded by lowering or raising the soft palate (velum). Sounds
are classified as nasal if they are produced by the passage of air
through the nasal cavity.

http://www.exploratorium.edu/exhibits/vocal_vowels/vocal_vowels.html
http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
[Figure: spectrum of the glottal pulse filtered by the vocal tract, intensity against frequency (Hz).]

Harmonics of the spectrum are spaced at 80 Hz, corresponding to the
pitch period of 12.5 ms.
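
One way to picture the filtering is to treat a single vocal tract resonance as a two-pole digital resonator and evaluate its gain at each harmonic of the 80 Hz source. This is a simplified sketch: the resonance frequency, bandwidth and sampling rate are hypothetical values, and a real vocal tract has several resonances that change during speech.

```python
import cmath
import math

FS = 8000.0           # sampling rate in Hz (assumed)
F0 = 80.0             # fundamental frequency of the glottal source
FORMANT_HZ = 800.0    # hypothetical vocal tract resonance
BANDWIDTH_HZ = 80.0   # hypothetical bandwidth of that resonance


def resonator_gain(freq_hz):
    """Magnitude response of a two-pole resonator at freq_hz."""
    r = math.exp(-math.pi * BANDWIDTH_HZ / FS)            # pole radius
    theta = 2.0 * math.pi * FORMANT_HZ / FS               # pole angle
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / FS)    # z^-1 on the unit circle
    denom = 1.0 - 2.0 * r * math.cos(theta) * z1 + r * r * z1 * z1
    return 1.0 / abs(denom)


# The source only has energy at multiples of F0, so only those
# frequencies are shaped by the resonance.
harmonics = [k * F0 for k in range(1, 26)]                # 80 Hz ... 2000 Hz
gains = {f: resonator_gain(f) for f in harmonics}
strongest = max(gains, key=gains.get)

for f in harmonics:
    print("%6.0f Hz  gain %.2f" % (f, gains[f]))
print("most amplified harmonic:", strongest, "Hz")
```

The harmonic spacing stays at 80 Hz after filtering; only the relative heights of the harmonics change, with those nearest the resonance amplified most.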
The shape and size of the vocal tract can be changed
by means of the articulatory organs. These are

The lips
The teeth
The alveolar ridge
The palate
The velum
The tongue
The jaws

2.2 Sounds
The individual sounds of speech are known as phones but these
sounds differ slightly according to their context.

For example the words key and caw both begin with a k sound but the
tongue position is not quite the same in each case.
In fact a native speaker probably doesn't notice the difference. In this
situation k and c are said to belong to the same class of sounds and we
denote this class by the phoneme /k/. A phoneme, therefore, is an
abstract linguistic unit and is defined as the smallest contrastive unit
in the language, that is, the smallest unit that can distinguish one word
from another.
The phones that are grouped together to make up a phoneme
are called its allophones.
What distinguishes sounds?

Sounds are distinguished by the type of airstream, e.g. ingressive or egressive.

As the airstream comes up from the lungs we strategically
place one or more of the organs of speech so that

•we block the airstream
•we channel the airstream into areas where it wouldn't
normally pass
•we set the airstream into unexpected patterns of vibration


Classifying Sounds - Consonants and vowels

The classification of sounds is determined by the manner and
point of articulation.

A consonant is a speech sound produced as a result of a partial or
complete obstruction to the airstream somewhere in the vocal
tract.
Voicing does not count as obstruction.

Vowels are sounds produced with no obstruction to the airstream
as it passes through the vocal tract. Resonance is decisive in vowel
production and all vowel sounds are, therefore, voiced.
Plosives (stops)
Consonants produced as a result of a complete obstruction
somewhere in the vocal tract. Made up of three successive phases.

1. Two organs come together to form a complete closure -
obstruction or closure phase.
2. After phase 1 the lungs are still contracting so there's a build
up of pressure - compression phase.
3. The closure is released and air rushes out - release phase.

Voiceless - p (as in pit), t (as in tip), k (as in cot)
Voiced - b (as in bit), d (as in dip), g (as in got)
Fricatives
Consonants produced as a result of near obstruction somewhere in
the vocal tract. If the passage is sufficiently narrow and the pressure
behind the constriction is high enough then the airflow becomes fast
enough to generate turbulence at the end of the constriction.

Voiceless f (as in far), th (as in thin), s (as in sit),
sh (as in shut), h (as in hood)

Voiced v (as in van), dh (as in this),
z (as in zip), zh (as in pleasure)
Affricates
Special type of plosive whose release stage is slow and lax enough
to result in a perceptible fricative sound at the end of the
articulation.

Voiceless ch (as in church)
Voiced j (as in jig)
Nasals
The velum is lowered during nasal sounds so that there is airflow
out of the nostrils. The only nasal sounds in English are consonants,
during which the oral tract is completely closed at the lips (m), or
with the tongue against the alveolar ridge (n) or the velum (ng).
These sounds are all voiced.

Approximants
These form a class of sounds which lie on the border between
consonants and vowels. There's a marginal degree of constriction.
They can be sub-divided into liquids (l, r) and glides (w, y).

l is also sometimes called a lateral. All these sounds are voiced.


Consonants        Bilabial  Labio-dental  Dental  Alveolar  Palato-alveolar  Palatal  Velar  Glottal
Plosives      vl  p                               t                                   k
              vd  b                               d                                   g
Fricatives    vl            f             th      s         sh                               h
              vd            v             dh      z         zh
Affricates    vl                                            ch
              vd                                            j
Nasals            m                               n                                   ng
Laterals                                          l
Approximants      w                               r                          y
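
The consonant chart can also be held as a small lookup table, which makes it easy to ask questions such as "which consonants are voiced fricatives?". The sketch below is one possible encoding. The symbols are those used in these notes. The column assignments for l, r and w follow common textbook descriptions and are assumptions, since the chart does not state them unambiguously.

```python
# symbol -> (manner, place, voicing)
CONSONANTS = {
    "p": ("plosive", "bilabial", "voiceless"),
    "b": ("plosive", "bilabial", "voiced"),
    "t": ("plosive", "alveolar", "voiceless"),
    "d": ("plosive", "alveolar", "voiced"),
    "k": ("plosive", "velar", "voiceless"),
    "g": ("plosive", "velar", "voiced"),
    "f": ("fricative", "labio-dental", "voiceless"),
    "v": ("fricative", "labio-dental", "voiced"),
    "th": ("fricative", "dental", "voiceless"),
    "dh": ("fricative", "dental", "voiced"),
    "s": ("fricative", "alveolar", "voiceless"),
    "z": ("fricative", "alveolar", "voiced"),
    "sh": ("fricative", "palato-alveolar", "voiceless"),
    "zh": ("fricative", "palato-alveolar", "voiced"),
    "h": ("fricative", "glottal", "voiceless"),
    "ch": ("affricate", "palato-alveolar", "voiceless"),
    "j": ("affricate", "palato-alveolar", "voiced"),
    "m": ("nasal", "bilabial", "voiced"),
    "n": ("nasal", "alveolar", "voiced"),
    "ng": ("nasal", "velar", "voiced"),
    "l": ("lateral", "alveolar", "voiced"),        # place assumed
    "w": ("approximant", "bilabial", "voiced"),    # place assumed
    "r": ("approximant", "alveolar", "voiced"),    # place assumed
    "y": ("approximant", "palatal", "voiced"),
}


def classify(symbol):
    """Return (manner, place, voicing) for a consonant symbol."""
    return CONSONANTS[symbol]


def find(manner=None, place=None, voicing=None):
    """Return the sorted symbols matching every feature that is given."""
    wanted = (manner, place, voicing)
    return sorted(
        sym
        for sym, features in CONSONANTS.items()
        if all(w is None or w == f for w, f in zip(wanted, features))
    )


print(classify("sh"))
print(find(manner="fricative", voicing="voiced"))
```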

http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html
Vowels

Vowels are sounds produced with no obstruction to the airstream
as it passes through the vocal tract. Resonance is decisive in vowel
production and all vowel sounds are, therefore, voiced.

There are three main organs of speech involved in changing the
size of the air chamber. These are
the lips - rounding, spreading
the lower jaw - lowered, raised
the tongue - raised, flattened, brought forward, etc.
The tongue is the most important and the tongue position is taken
as the primary criterion in classifying vowels. The vowel zone can
be classified in terms of how high or low the tongue can be, and
how far forward or back it can be. Also, the lips can be either
rounded, neutral or spread.
Vowels/Diphthongs

Front            Middle            Back
ee as in feet    er as in heard    ar as in car
i as in did      u as in cut       o as in cod
e as in red      a as in about     or as in board
aa as in mat     oo as in wood     uu as in rude

Diphthongs
ou as in pound   ia as in deer
ie as in tie     ur as in poor
oa as in boat    ei as in their
oi as in toy     ai as in make
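[Figure: vowel chart with tongue position running front, central, back horizontally and high, mid, low vertically; ee is plotted high front, uu high back and o between mid and low.]
[Figure: vowel chart grid with columns front, central, back and rows high, mid-high, mid-low, low.]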
Vowel /uu/ as in rude: high, back, rounded
Vowel /ar/ as in car: low, back, unrounded
Vowel /ee/ as in feet: high, front, spread
/ee/ /ar/ /uu/
Wrong /r/ /o/ /ng/
Moving /m/ /uu/ /v/ /i/ /ng/
Southampton /s/ /ou/ /th/ /aa/ /m/ /p/ /t/ /a/ /n/
Sixteen /s/ /i/ /k/ /s/ /t/ /ee/ /n/
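
Transcriptions written in this slash notation are easy to process mechanically. The sketch below splits a transcription into its symbols and labels each one as a vowel, diphthong or consonant. The three symbol sets are copied from the tables in these notes, and the function names are illustrative only.

```python
import re

# Symbol inventories as listed in these notes.
VOWELS = {"ee", "i", "e", "aa", "er", "u", "a", "oo", "ar", "o", "or", "uu"}
DIPHTHONGS = {"ou", "ie", "oa", "oi", "ia", "ur", "ei", "ai"}
CONSONANTS = {
    "p", "b", "t", "d", "k", "g", "f", "v", "th", "dh", "s", "z",
    "sh", "zh", "h", "ch", "j", "m", "n", "ng", "l", "w", "r", "y",
}


def parse(transcription):
    """Split a string such as '/r/ /o/ /ng/' into ['r', 'o', 'ng']."""
    return re.findall(r"/([a-z]+)/", transcription)


def label(transcription):
    """Pair each symbol with its class, raising an error on unknown symbols."""
    labelled = []
    for sym in parse(transcription):
        if sym in VOWELS:
            kind = "vowel"
        elif sym in DIPHTHONGS:
            kind = "diphthong"
        elif sym in CONSONANTS:
            kind = "consonant"
        else:
            raise ValueError("unknown symbol: " + sym)
        labelled.append((sym, kind))
    return labelled


EXAMPLES = {
    "Wrong": "/r/ /o/ /ng/",
    "Moving": "/m/ /uu/ /v/ /i/ /ng/",
    "Southampton": "/s/ /ou/ /th/ /aa/ /m/ /p/ /t/ /a/ /n/",
    "Sixteen": "/s/ /i/ /k/ /s/ /t/ /ee/ /n/",
}

for word, transcription in EXAMPLES.items():
    print(word, label(transcription))
```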
