
452 PROCEEDINGS OF THE IEEE, VOL. 64, NO. 4, APRIL 1976

A Model of Articulatory Dynamics and Control


CECIL H. COKER, SENIOR MEMBER, IEEE

Abstract—A model of human articulation is described whose spatial and dynamic characteristics closely match those of natural speech. The model includes a controller that embodies enough articulatory "motor skill" to produce, from discrete phonetic strings, properly timed sequences of articulatory movements. Together with programs for dictionary searching and rules for duration and other phonetic variables, the model can produce reasonably acceptable synthetic speech from ordinary English text.

I. INTRODUCTION
THE VALUE of any model rests in its ability to produce, from simple inputs, large amounts of correct detail in its outputs. Here is described a model of articulation that, from a phonetic input, can produce vocal-tract area functions and excitation controls to derive sound outputs that are at least passably intelligible, and whose spectrograms are not casually distinguishable from those of natural speech.

Fig. 1. A spatial model of the articulatory system.
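The flow just described — a discrete phonetic input driving a controller and a dynamic model to produce smoothly varying articulatory trajectories — can be sketched as below. This is a minimal illustration only, not the paper's implementation; the phoneme targets, variable names, and smoothing constant are all hypothetical.

```python
# Minimal sketch of the synthesis flow: phonetic string -> stepped
# articulatory commands -> first-order "response filter" -> smooth
# articulatory trajectories. All targets and constants are hypothetical.

# Illustrative per-phoneme targets for a few articulatory variables.
TARGETS = {
    "b": {"lip_closure": 1.0},
    "o": {"lip_closure": 0.2, "lip_rounding": 0.8},
}

def smooth(prev, goal, alpha=0.3):
    """One step of a first-order filter that turns stepped commands
    into gradual articulatory motion."""
    return prev + alpha * (goal - prev)

def synthesize(phonemes, steps_per_phone=5):
    state = {"lip_closure": 0.0, "lip_rounding": 0.0}
    trajectory = []
    for phone in phonemes:
        goal = dict(state)
        goal.update(TARGETS.get(phone, {}))   # stepped command for this phone
        for _ in range(steps_per_phone):
            state = {k: smooth(state[k], goal[k]) for k in state}
            trajectory.append(dict(state))
    return trajectory

traj = synthesize(["b", "o"])  # closure rises for /b/, then relaxes
                               # while rounding rises for /o/
```

Even this toy version exhibits the key property the paper exploits: because the filter state carries over between phonemes, each articulator's trajectory depends on context, not just on the current target.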
The system includes: 1) a physical model of the vocal system, with spatial constraints very close to those of natural articulation; 2) a representation of the motional constraints of the articulators which, when moving from one stated shape to another, interpolates realistic intermediate shapes; 3) a similar model of the movements of the excitation system, including subglottal pressure, vocal cord angle and tension; and 4) a controller for this mechanism which produces, from input phonetic strings, sequences of articulatory commands which cause this dynamic system to execute properly timed articulatory motions.

II. THE STATIC ARTICULATORY MODEL

The level of detail sought in the spatial model of the articulators was to match rather closely the shapes of the human vocal tract, but not to resolve individual muscles. A diagram of the model is shown in Fig. 1. Three coordinates are used to specify location of a large central portion of the tongue and to regulate jaw movement. Making use of the fact that the tongue body moves within the mouth as a rather constant-shaped mass, this portion of the tongue is represented as a fixed circular section movable in a plane.¹ Two of the three tongue-body coordinates define vowels, and have an influence on jaw angle. A third variable produces the more rapid tongue-body movement for /g/ and similar consonants, but does not affect jaw angle. Five other parameters are significant for other consonants. Two coordinates specify closure and rounding of the lips; two others regulate raising and curling back of the tongue tip.

Another variable serves to control the general-purpose cross-section transformation. The model itself accounts for vocal-tract width except where the tract is severely constricted. To regulate the rate of change of area as any articulator begins to close off the tract, the cross-section or constriction parameter C produces a saturation of the form

A' = (A + √(A² + C²))/2.

Its most critical action is to distinguish between shapes of the tongue tip in /s/, /d/, and /l/, and to make specification of these phonemes less sensitive to context. Since saturation area is an even function of C, /s/ and /l/ may be assigned target values with opposite signs, representing tongue-tip curvature less than and greater than palate curvature, respectively. Consequently, smooth transitions between /s/ and /l/ insert "homorganic stops" /t/ or /d/ automatically.

A ninth variable represents position of the velum, but in this implementation affects only an acoustic-domain simulation of nasality. Another variable represents presumably an upper-pharyngeal constriction of some sort, as occurs in the "bunched-tongue" /ɚ/ (er) of American English. This too is presently handled as an acoustic-domain correction.

The model has three variables to control the manner of excitation of the vocal tract. One of these represents action of the arytenoid cartilages in opening and closing the vocal cords. Two others are analogous to vocal cord tension and subglottal pressure.

Several internal variables govern shape of the pharyngeal section, area at the teeth, etc., but are not independently controllable. For completeness, the model should have another variable for independent control of the upper lip. Phonemes /f/ and /v/ must currently be synthesized as bilabials. Similarly, close matching of the articulatory shapes for /ʃ/ and /ʒ/ (sh and zh) would require an additional degree of freedom.
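The saturation equation is garbled in this copy; a plausible reconstruction consistent with the stated properties (smooth limiting of area near closure, and evenness in C) is A' = (A + √(A² + C²))/2, sketched below. The exact printed form should be checked against the original; the function here is that assumed reconstruction.

```python
import math

def saturated_area(A, C):
    """Assumed reconstruction of the cross-section saturation:
    A' = (A + sqrt(A^2 + C^2)) / 2.
    For A >> |C| the area is nearly unchanged; as A falls through zero
    (the articulator pressed toward and past closure) A' decays smoothly
    toward zero instead of going negative; and A' is an even function
    of C, so /s/ and /l/ can take C targets of opposite sign."""
    return (A + math.sqrt(A * A + C * C)) / 2.0
```

With this form, while the tongue tip is pressed against the palate (A ≤ 0), a smooth transit of C through zero between the /s/-shaped and /l/-shaped extremes passes through a complete closure, consistent with the automatic insertion of homorganic /t/ or /d/ described above.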

Manuscript received August 20, 1975; revised October 31, 1975.
The author is with the Acoustics Research Department, Bell Laboratories, Murray Hill, NJ 07974.
¹The idea of a "circle within a circle" is Fujimura's [1]. For closer fits to high front and back phonemes, the palate of the present model is flatter, with the front and back corners rounded almost to the radius of the tongue mass.

III. DYNAMICS OF THE ARTICULATORY SYSTEM

A. What We Know and Do Not Know About Articulatory Dynamics

The tongue body is a volume of muscle tissue whose shape changes in response to varying internal forces. The density of

muscle tissue is very close to that of water; but the viscosity at low frequencies must be much higher than that of a nearly lossless fluid. Muscles are fibrous; their stretch and shear characteristics are direction-dependent. It is not at all clear whether the multiply interwoven fibers of the tongue that are called by single muscle names do in fact contract in unison, or whether they can be excited differently in different layers, or at different points along their lengths.

Thus for a close representation of the tongue, one could envision a set of second-degree partial differential equations in three dimensions and time, with messy-to-describe shapes, internal characteristics, excitations, and boundary conditions. But of course, the available data is inadequate to characterize such equations.

Nevertheless, even though we know little about the equations of motion of the vocal system, we know quite a lot about their solutions!

First of all, we can predict the form of the solutions. Whether the system is represented by partial differential equations or by "lumped-constant" equivalents, if linearity can be assumed, then solutions will be additive combinations of a number of "modes of response," each mode having its characteristic spatial and temporal "signature."

From available data, we can deduce quite a lot about those modes. We know the number of them: no more than about ten have dynamic activity in most English speech. The shapes of the vocal tract during most speech can be approximated closely by a model with seven to ten degrees of freedom. The model spans the vector space of nontrivial solutions.

Within reason, we can identify those modes used in speech. The lips, tongue tip, velum, glottis, and to an extent the tongue body are physically isolated parts with comparatively little or no directly connecting tissue or bone. They should have independent responses: roughly one or two modes for each independent articulator.

There is also data available to give a rather clear idea of the temporal characteristics of the modes. The spectra of speech sounds are strongly dominated by vocal-tract constrictions. And various consonants of the language use different ones of the physically independent articulators to produce their characteristic vocal-tract closures. Formants in consonantal and vowel-vowel transitions are thus distorted but reasonably clear images of the temporal responses of articulators. The temporal behavior of lip rounding can be seen in the spectrograms of /w/; bilabial closure in /b/; tongue-tip raising in /d/; tongue-body movements in vowel-vowel and /g/ transitions.

Thus we know the spatial and temporal characteristics of the modes. It remains only to construct a system which exhibits these characteristics simultaneously, and to refine our estimates of these characteristics.

The procedure we will follow is the reverse of the classical principal-axis transformation or eigenvalue problem [3], [4]. The model will be constructed in such a way that individual terms of the solution will appear explicitly as internal variables; their linear summations to form the actual responses will appear as explicit combinations of the internal variables. An approximate model can be constructed and compared with the real data. In successive iterations, the model can be manually converged to fit the real vocal system.

C. Transformations to Independent Variables

Consider a linear system whose configuration, described by a set of position coordinates y_i, or more simply the vector Y, is related to its inputs X by the transfer function H:

y_i = Σ_j h_ij x_j    (2)

Y = HX.    (3)

We will assume that H is a matrix of convolutions, and that X and Y are vectors whose elements are time waveforms or discrete time series. The procedure also applies, however, where X and Y are continuous functions over a given domain, and H is an integral or differential equation. The x's can represent forces. At the same time, if we assume the system to have restoring forces proportional to displacement, we might view the x's, with simple rescaling, as positions (target configurations).

We wish to make a change of variables, to represent the system in terms of a new set of coordinates R and inputs C whose elements are related to the old variables X and Y by the transformation P:

R = P⁻¹Y    Y = PR
C = P⁻¹X    X = PC.    (4)

Equation (3) can be rewritten

R = P⁻¹Y = P⁻¹HPP⁻¹X = P⁻¹HPC = DC.    (5)

The transfer function for the new coordinates becomes

D = P⁻¹HP

and the transformed system equations

R = DC.    (6)

The crux of the principal-axis problem is this: For a reasonably general class of physical systems (see Section III-E), there exists a transformation P which will reduce the system response equations to a diagonal matrix. Consequently, there exists a coordinate system in which each individual term of the responses R is dependent upon exactly one term of the inputs C:

r_i = d_ii c_i.    (7)

The large system of simultaneous equations can thus be replaced by a set of smaller independent equations. In a model, a large multivariate system can be realized as a group of independent subsystems:

y_i = Σ_j p_ij r_j.    (8)

D. Modeling Articulatory Movements

The classical principal-axis problem is to determine D and P from H. But in this case, H is the unknown. Only the individual components of H have direct physical counterparts, and even they are unknown. But P and D have direct associations with measurable responses. The elements of D describe the temporal characteristics of modes. The rows of P describe the

relative amounts of each mode in a particular physical coordinate j. In our particular problem, it is quite easy to find speech sequences that excite predominantly one or a very few modes of the articulatory system at a time. From data on such utterances, we can pin down both D and P. Our approach, then, is to define the model in terms of D, and work backwards, through (4)-(8), to the physical coordinates Y.

This procedure should be the same, independent of the original definition of the y's: whether, for example, they represent the parameters of this or any other sufficiently flexible "lumped-constant" model [5]-[7], the points in a quantized set of partial differential equations, or even continuous functions with linear differential or integral operators for H. For different choices of Y, transformation P changes appropriately to relate Y to R. But except for trivial matters of ordering and varying numbers of ignorable higher order terms, matrix D and variables C and R will be the same in each case. They are intrinsic properties of the physical system.

In this case, we choose Y to be the variables of our spatial model. Those variables were intentionally selected to be good guesses for the r's. We expect P to be reasonably close to the identity matrix. Ultimately, we can incorporate matrix P into the model, and use variables R as its inputs. From the outset, we will incorporate P into the definitions of phonemes, and store instead of X the independent command variables C.

Fig. 2. (a) A real system. (b) A workable approach to modeling (a).

E. Underlying Assumptions

Since muscles pull, but cannot push, there is reason to question the presumption of linearity. In Houde's X-ray data [8], linearity seems to hold. Houde's [8, Fig. 3.3.8], for example, shows a collection of tongue-body responses that have essentially the same waveshape, independent of the starting point, the direction and distance of travel.

The presumption that H is diagonalizable may or may not be a restriction, depending on the class of operations permitted in transformation P. If the elements of P were themselves allowed to be transfer functions, then there would be no restriction. But it turns out in this particular case that a real-number P produces close fits to the available data.

Diagonalizability by a real matrix normally implies that H (and therefore its inverse) is, in some sense, of a two-element kind:

H = H₁f₁ + H₂f₂    (9)

where H₁ and H₂ are matrices of real numbers and f₁ and f₂ are scalar response functions [9]. Since a more general spring-mass-damping system would require three such terms (if vector Y is to contain only positions), some simple proportionalities are implied in H and its inverse. Most likely, this follows from the uniformity of density, viscosity and stretch properties of muscle tissue.

F. Estimating Matrices D and P

An identity matrix would serve as a starting point for P. However, since P must diagonalize the system for all conditions, including the "steady state," off-diagonal terms of P may be estimated by cross-correlating the positions of extremity variables with tongue-body variables in an ensemble of static X-ray data for vowels [10] (of course, taking care to exclude rounded vowels from data used for lip variables). In this way we can establish the effects of the tongue-body-jaw assembly on the lip- and tongue-tip variables and on some internal variables such as pharynx opening. The only significant interactions left to be determined are those between lip closure and rounding.

Fortunately, the two lip variables have very different speeds of response, and so it is relatively easy to identify the response of each in spectrograms, and thus to converge rapidly in "analysis by synthesis." Fig. 3, for example, shows a spectrogram of the utterance /bo/ (beau) with superimposed exponentials for the slower poles of lip closure and rounding. There is a component of lip closure due to rounding, but not the reverse. The solution for lip variables, however, is not unique. One can trade between the coefficient of closure due to rounding and the target value for the rounding command variable. Apparently, this trade exists in real speech as well. There are variations between individuals, between dialects, and between languages as to the actual amount of lip extension in "rounded" phonemes.

The temporal responses of other modes (the terms of matrix D) can also be identified in spectrograms. The proper correspondences are found by observing which articulatory gestures dominate the speech spectrum in various configurations. If the vocal tract is not strongly constricted, we know there is close correspondence between the articulatory "vowel triangle" and the F1-F2 plane. The first formant identifies with forward motion of the tongue mass; the second formant, with downward motion. When there is a constriction, however, the cavity in front of the constriction dominates one formant; the one behind it, another. The lower of these will be F2; the higher, F3. The F1 resonance will be that of a Helmholtz cavity with the yielding wall acting to increase the effective constriction area. The square of F1 will be proportional to the product of cavity volume and constriction area plus a constant. Using this relationship, identifications can be made for F1 with the tongue-tip raising, lip closure, rounding, and tongue-body closure of /d/, /b/, /w/, and /g/.

Fig. 4 shows a spectrogram of the utterance /IgU/ in which the tongue-body vertical movements can be identified with F1. Horizontal movements are reflected very directly in the front-cavity resonance, as seen first in F3 of /I/ and then in F2 of /U/; and also in the back-cavity resonance, as seen first in F2 of /I/ and in F3 of /U/. This particular spectrogram was made from a synthesized utterance; the superimposed plot is that of the actual synthesis parameter.

G. Tongue-Body Variables: Time Varying Coefficients or Separate Parameters

Houde's data show the tongue body to have two speeds of motion: a slow one (~150 ms) for transitions not involving tongue-body consonants, and a faster one (~80 ms) for the

vertical motion in tongue-body consonants. Rather than presume the system to be time variable, the present implementation uses a separate variable for tongue-body consonants. This has several advantages. For one thing, the jaw angle is strongly correlated with tongue height in vowels, but not strongly coupled in consonants. For another, the relative timing of articulatory commands appears to be different in tongue-body consonants and in vowels. Using a separate variable allows these problems to be handled in a regular way, similar to the way of handling other consonantal gestures.

H. Interpolation Variables; A Departure from Linearity

The presumption of a linear system with constant coefficients poses only one problem. In the "rest" position, the lower lip is fixed relative to the movable jaw; and the tongue tip is fixed relative to the movable tongue body. But during their specific consonants, these variables and the consonantal tongue-body variable seek values that are fixed in relation to the fixed structure of the head. Presumably, this could be handled in an open-loop compensation procedure, by calculating the anticipated position of the movable large structure and setting the incremental consonant variable to make up the difference.

A more likely explanation depends on a nonlinear compressibility or perhaps a nonlinear feedback. As the tongue tip, for example, approaches its spatially fixed target, contact with the fixed parts of the head increases, first at the teeth and progressively against more and more of the palate. Thus, at the time of closure, the tongue tip does indeed have a fixed reference frame.

As a working expedient, the present system uses an ad hoc feed-forward compensation downstream of the dynamics. Variables controlling consonantal vocal-tract constrictions (by the lips, tongue tip, and tongue body) are used, not as actual coordinates, but as interpolation factors. When the variable is low, the corresponding coordinate follows the (actually moving) "rest position" value; when the control variable is near unity, the corresponding coordinate assumes its (actually fixed) value appropriate for the consonant. As this "switching function" is processed into smoothed transitions by its characteristic filter, the corresponding articulatory coordinate makes smooth interpolations from its movable "rest position" to its fixed "active position." The resulting system is not strictly nonlinear, but linear with variable coefficients.

Fig. 3. Spectrograms of the utterance /bo/ as in an archer's "bow," showing the separate responses of two modes of lip movement.

Fig. 4. A spectrogram of the synthesized utterance /IgU/ (vowels as in "big" and "good"). Plots of the tongue-body horizontal coordinate are superimposed on the F2-F3 transition associated with the front cavity (high to low) and back cavity (low to high).

IV. THE ARTICULATORY CONTROLLER; SEQUENCING OF ARTICULATORY COMMANDS

The shape and dynamics of the articulatory system are of course physiological properties that should be basically language independent. The sequencing of commands to this system, however, must be learned, and certainly can vary between languages.

There is every reason to expect the command structure of articulation to be complex. A human spends as much time perfecting this motor skill as he does learning to stand erect and walk, learning to play a musical instrument, or to be a reasonably accomplished gymnast.

A. Choice of Command Variables

A significant property of articulation is that, in moving from one phoneme shape to another, all articulators do not change at the same time. In consonant-consonant boundaries, for example, the essential feature of one segment is held until the essential feature of the successive segment has also been achieved. At any particular time the vocal system is usually dealing with three or more phonemes: producing the current phoneme, recovering from one or more previous ones, and preparing for one or more upcoming ones.

Articulatory overlap can look quite different in different descriptions of the vocal system. Consider two hypothetical representations, the u's and v's, that are each linear combinations of the other:

U = QV    V = Q⁻¹U.    (10)

Obviously, the two representations would be quite equivalent for the description of target positions.

But suppose that the change from one target to the next is accomplished by sequentially switching each u_i at time t_i. In terms of the v's this simple sequence is much more complex. If each u consists of a single step, then each component of V would be the summation of many such steps, each occurring at a different time t_i. In general, a simple control sequence will look simple only in its proper coordinates.

Of course, the coordinates for linguistic control must be learned; they are not intrinsic properties of the articulators. Conceivably, there could be one coordinate system in which articulatory dynamics would appear simple, and quite another in which control sequencing is simple. Fortunately, and perhaps predictably, this is not the case. Linguistic control appears to be organized around the modes of articulatory response. The reason is probably nothing more than the physical separation of articulators. But there would be tendencies for languages to align themselves around modes, even without physically separate articulators. A mode-oriented control strategy is simplest to learn; in this domain, cause and effect are most directly associated.

But regardless of reasons, the specific articulatory gestures of consonants happen also to be extrema of mode variables:

lip rounding, lip closure, raised tongue tip, retroflexed tongue tip, tongue-body-velar closure, nasalization, glottal closure and opening, etc. In terms of (2)-(10), the proper overlapping of articulatory gestures can be obtained by simply delaying or advancing the steps from one target value to another of the individual coordinates u_i.

B. Timing of Articulatory Commands

The model of articulatory control is based on an open-loop feed-forward or "rehearsal" philosophy, as opposed to a feedback principle (cf. Henke [6]). In the model, transition times are calculated directly as operating variables. Each speech segment (phoneme or subphoneme) is assigned, for each articulatory variable, two priorities: one for transitions toward the segment and one for those leaving it. The intended duration of a segment is computed open-loop from phonological rules. Thus there is an official time for the beginning and end of a segment. But the actual time to step each articulatory variable is computed to lead or lag this official time, according to the difference in priorities. Thus priority has the dimension of time (or better, time normalized by speed of the variable, but this is a trivial point).

Fig. 5. Plots of tongue-tip and lip-control variables in the /db/ sequence of "goodbye."

Fig. 5 illustrates the effects of priorities on the tongue tip and lips in the sequence /db/ of "goodbye." The lighter lines show command variables (step functions) and responses (smooth curves) that would occur with neutral priorities; heavy lines show those variables with correct priorities. The proper time for the /db/ boundary is T. Without priorities, the commands for tongue tip and lips would lead that time by amounts D_B and D_W appropriate to compensate for delays of each characteristic response filter. But since vocal-tract closure corresponds to an extreme of each articulator's movement, the release of /d/ would occur tens of milliseconds before the correct time; the closure of /b/ would occur equally late. In the correct case, since /d/ has positive priority for the tongue tip, and /b/ does not, a positive priority difference will result in a positive timing correction P_B. The actual time for the tongue-tip command step between /d/ and /b/ becomes

T_B = T - D_B + P_B    (11)

thus causing the release of /d/ to occur at the proper time. The priorities of /d/ and /b/ are reversed for the lip variable, and so the signs of the priority difference and consequent timing correction are reversed:

T_W = T - D_W + (-P_W).    (12)

The dominant effect of priorities, then, is to specify how far in the transition from one target value toward the next each articulator is to progress by the time of the intended phoneme boundary. For constriction-oriented phonemes and articulators, this point is very near one extreme of articulatory movement. If the articulatory target is chosen to overshoot closure by 10 percent, for example, the closure-producing transient should be 90 percent complete, and the release transient should be 10 percent complete, at phoneme boundaries. For phonemes where lip rounding, lip closure, tongue-tip raising, retroflexing (and anti-retroflexing), and consonantal tongue-body closure are significant, this leads to timing adjustments of ~40 percent of the rise time of each specific articulator, for both anticipation and holdover.

Since the model operates on differences of priorities, the absolute values of priorities for any given articulator are insensitive to an additive constant across all phonemes. The convention is chosen to set absolute levels so that the highest number of phonemes will have zero priority for any given articulator. This leads to negative priorities for a few phonemes that defer certain decisions to their neighbors. Negative priorities for tongue-body variables of /h/ allow that phoneme to take on the "coloration" of adjacent vowels. Similarly, negative priorities for control of glottal opening in the voiced fricatives /v/, /ð/, /z/, and /ʒ/ allow these phonemes to become pure fricatives in regions adjacent to other voiceless sounds and silence.

C. Priorities, Targets, and Distinctive Features

Linguists have long regarded speech sounds as being describable by a complex of articulatory, acoustic, or conceptual classifications called distinctive features [11], [12]. The priorities and parameter target values of this model obviously have many close parallels with widely used features. Priorities and target values for lip rounding and bilabial closure have direct counterparts in the features +rounded and +labial. The tongue-tip raising variable and its priorities correspond to the feature +alveolar.

The glottal width variable corresponds also to the feature distinction voiced/voiceless, although there is no parallel in priorities for glottal control. The tongue-tip front-back parameter in the back position matches the feature +retroflex, but the opposite extreme of that variable is equally critical for the tongue-tip dental phonemes /θ/ and /ð/ (as in "thin" and "then"). These extrema are also assigned high priority, as compared to that of /t/, /d/, etc. The tongue-body consonantal variable and its priorities identify with the feature +velar when used for the velar consonants. In the model, that variable is given nonzero targets and priorities also for the front-palatal consonants /ʃ/ and /j/ (sh and y). The tongue-body variables used to specify vowels align properly with the feature distinctions front/back and high/low, but there appears no reason to favor either extreme with a priority.

D. Context-Dependent Priorities

There is one conspicuous situation in which the sequencing of articulatory commands appears to vary according to phonemic context. That condition involves the relative timing of horizontal and vertical tongue-body motions to and from the consonant /g/. Fig. 6 is a schematic representation of the motions between /g/ and the vowels /i/, /a/, and /u/ from

Houde's data. There is a general tendency, during vocal-tract closure of /g/, for the constriction to move from the back of the palate toward the front, or at least, to never move toward the back.

After closure, the buildup of pressure behind the constriction would tend to move the tongue mass in the proper direction. But the actual effect shows up in initial movements from the vowel to /g/, well before closure occurs, and in the departure from /g/ well after release. The initial movement of /ig/ is horizontal; the final movement of /gi/ is vertical. The initial movement of /ug/ is vertical; the final movement of /gu/ is horizontal.

Presumably this implies an active adjustment in sequencing, to prevent devoicing of the consonant. When the vocal tract is closed off to produce /g/, the enclosed cavity is smaller and more hard-walled than with other consonants. In order to prevent a pressure buildup that would quickly cause /g/ to devoice, the cavity gradually enlarges, making room for air flowing through the glottis. Staggered timing of horizontal and vertical tongue-body movements acts to move the constriction forward during /g/ for most contexts, and to minimize backward movements in the most difficult contexts. Apparently, the perceptual process anticipates that this phenomenon will occur. If the circular motion is not properly modeled in synthesis, the phoneme /g/ is easily confused with /d/ in many contexts.

The circular motion could be handled in the model by giving various vowels strong asymmetries in priorities for the consonantal tongue-body variable. This would lead, however, to serious irregularities in the duration of closure. The /g/ of /ugi/, for example, would occur 50 ms before the nominal time for closure; that of /igu/ would occur 50 ms late.

The present model accounts for the circular motion by giving /g/ a context-dependent priority for horizontal tongue-body movement. If /g/ is preceded by a front vowel, its initial priority is increased by 50 ms; if by a back vowel, it is decreased by that amount. If the consonant is followed by a front vowel, the final priority is increased; if by a back vowel, it is decreased. This procedure produces exceptionally good matches to X-ray data. Fig. 7 shows the performance of the model superimposed on Houde's data for a number of utterances of /g/ with vowels /i/, /a/, and /u/.

Fig. 6. A schematic representation of tongue-body motion for the phoneme /g/ between the vowels /i/, /a/, and /u/. The circular motion is possibly a compensation to prevent buildup of pressure within the mouth that could stop vocal cord oscillation.

Fig. 7. Comparison of tongue-body motion of the model (smooth, lighter traces) with X-ray data on human articulation (data taken from Houde [8]).

E. "Intentional" Allophonic Variation

The timing adjustments previously described are functions of phonemic context only. There are other adjustments which are governed by higher level factors such as stress and the presence or absence of word boundaries (in the model, a continuously adjustable degree of boundary). The most notable of these variants is the timing of glottal closure in voiceless stops.

Umeda's paper [13, Fig. 12], [14] shows the devoicing time of /t/ measured from real speech in a number of phonological conditions. A /t/ that is both word- and stressed-syllable-initial has a devoicing time of 60-70 ms; word-medial nonstressed /t/, an average of 40 ms, depending on context; final /t/, 0 to 20 ms.

The physiological mechanism for regulating devoicing time operates through the timing of a step command from "open-glottis" to "closed-glottis." Fig. 8 shows measured data from real speech for this effect. In word- and stress-initial stops, the glottis reaches its widest opening at the time of release of the stop, and then takes 60 to 70 ms to close back to the point where significant vocal-cord vibrations occur. But in the word-final stop, the glottis never opens. Word-medial stops exhibit various intermediate degrees of glottal opening and delay of closure, depending on phonological conditions.

The initial-final contrasts of voiceless stops are obvious and well known. Less widely recognized and more subtle but similar contrasts occur in all consonants. The initial-final contrasts of fricatives involve mainly intensity. Our data show wider glottal opening for initial and stressed fricatives and, for higher stress levels, heightened subglottal pressure. The effect of subglottal pressure apparently reaches to adjacent vowels, where the portions nearest the stressed consonant can appear to be louder by two to four dB than the remainder of the vowel. Presumably, fricative intensity is also affected by the shape and degree of constriction.

In voiced consonants, initial allophones have subjectively lower intensity during the actual consonant, followed by a rapid increase, going into the vowel. As with other consonants, a portion of the vowel adjacent to the stressed conso-

from 0 to 9. This control is used variously to modify both the target values of variables and the timing priorities of variables, as appropriate for different phonemes. Internally, the control digit is translated into a fractional coefficient between -1 and 1. In order to have this coefficient produce different effects for different phonemes, each phoneme can be provided a set of allophonic incremental factors, with pointers indicating which variables they are to modify. The products of these increments and the allophonic coefficient are added to the designated target variable or priority (approaching or leaving), to produce the actual values used by the processor.
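The control-digit mechanism just described can be sketched as follows. This is an illustrative reconstruction, not the paper's program: the linear digit-to-coefficient map and the variable names are assumptions.

```python
# Sketch of the allophonic control scheme: a digit 0-9 becomes a
# coefficient in [-1, 1]; per-phoneme incremental factors, each pointing
# at a target variable or timing priority, are scaled by the coefficient
# and added to the stored values. (The linear mapping is an assumption.)

def allophonic_coefficient(digit: int) -> float:
    """Translate a control digit 0..9 into a coefficient between -1 and 1."""
    return 2.0 * digit / 9.0 - 1.0

def apply_allophone(variables: dict, increments: dict, digit: int) -> dict:
    """Add (increment * coefficient) to each variable an increment points at."""
    c = allophonic_coefficient(digit)
    adjusted = dict(variables)
    for name, delta in increments.items():
        adjusted[name] += c * delta
    return adjusted

# Hypothetical /t/ entry: a strongly initial /t/ (digit 9) widens the
# arytenoid (glottal) target; a final /t/ (digit 0) narrows it.
t_vars = {"arytenoid_area": 0.5, "closure_priority": 1.0}
t_incr = {"arytenoid_area": 0.4}
initial_t = apply_allophone(t_vars, t_incr, 9)  # arytenoid target widened
final_t = apply_allophone(t_vars, t_incr, 0)    # arytenoid target narrowed
```

In these terms, the contrast between a stressed initial /t/ and a word-final /t/ corresponds to high and low settings of the control digit.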
Fig. 9 illustrates the effect of allophonic control of the arytenoid variable of /t/. The two traces are simulations of vocal-cord area for the same two sentences as the natural-speech data of Fig. 8. Control data for the synthesis was produced by rule from printed text, and then modified slightly (changes in phoneme duration and small changes in the allophonic control variable of /t/) to more closely parallel these particular natural utterances.
G. Generating Sound Output

The model currently has two methods for generating sound output. In one, the model is used to derive area functions which, together with the model's variables for subglottal pressure, vocal-cord tension, and neutral glottal area, are used directly to control the Flanagan-Ishizaka vocal-cord-vocal-tract model [16]. This procedure, executing in a CSP-30 computer, generates one second of speech in about 6 minutes of computation.

Fig. 8. Spectrograms and optical measurements of vocal-cord opening for the utterances "pay tie" and "great eye." In the stressed /t/ of "tie" the glottis opens wide to allow air flow for a burst of turbulence. In "great eye" the vocal tract is kept closed to prevent such a burst.

The model's usual method of generating sound involves an intermediate computation of formant frequencies and amplitudes of noise and voicing. These in turn are used to control a commercial digital resonance synthesizer [17]. Formant calculations for this method begin with Webster's equation for pressure P in a hard-walled lossless vocal tract, with steady-state sinusoidal excitation:

    (1/a(x)) ∂/∂x [a(x) ∂P/∂x] + (ω²/c²) P = 0.

nant can be louder than it would be following a final or medial consonant. The mechanism for lowering the apparent intensity of the consonantal murmur, despite possibly heightened subglottal pressure, appears to be different for different types of consonants. In initial voiced stops, there appears to be a glottal readjustment that results in increased fundamental and lower harmonics, but sharply decreased fourth, fifth, and higher harmonics [15] (see Umeda's [13, fig. 11]).
Quantizing vocal-tract position x yields

    Pᵢ₊₁ = (1 − ω²Δx²/c²) Pᵢ + (Aᵢ₋₁/Aᵢ)(Pᵢ − Pᵢ₋₁).    (15)

If this vocal tract is excited at the resonance frequency of the Nth formant, pressure will peak at the glottis; there will be exactly N zero crossings of P along the length of the tract - one of them occurring precisely at the lips.
Formant frequencies are calculated by an iterative search procedure. A guess is made for the frequency, and (15) is iterated from the glottis to the lips. If the lips are reached before the correct number of zero crossings occurs, then the guess frequency is too low; if the correct number of zero crossings occurs before the iteration reaches the lips, then the guess is too high. A binary search can be used to find the correct frequency.

In nasals, the intensity change is probably the result of a continuing change in velar opening throughout the consonant. Strongly initial nasals will not nasalize the preceding vowels, and thus will begin their consonantal murmurs with characteristics quite like their place-cognate voiced stops, with intensity growing as the nasal path opens. A less extreme opposite effect occurs with final nasals.
The intensity effect in /w/ and /j/ is probably achieved mainly through regulation of the degree of constriction. The consonant /r/ achieves the initial-final contrast through lip rounding. Initial /r/ is a strongly rounded relative of /w/; final /r/ is a nonrounded relative of the vowel /ɚ/. The consonant /l/ quite possibly regulates the degree of constriction, but it also makes strong shifts in the target positions and timing priorities of the tongue body. Initial /l/ is relatively palatal, and quite deferent to its vowel neighbors. Final /l/ has strong initial priorities for the tongue body and a target position for the tongue body almost as far back as the vowel /a/. To a lesser extent, the consonant /n/ has tongue-body variations that are the reverse of /l/.
that are the reverse of /l/. Having found the hard-walledresonances, the effects of
yielding wallsare included by applying the transformation
F. Modeling “Intentional” Allophones
Initial-final contrasts are handled in the model by a one-
F, = d m (16)
dimensional attribute-in the phonetic input, a control digit where FO is the appromate F 1 frequency forthe totally

The speech code exploits these features by setting more than one articulator into motion at once. Using a form of "pulse stretching," it assures that articulatory commands last long enough for each articulator to reach its critical position, for the proper amount of time, and in the proper sequence. By this "orchestration" of overlapping commands, discrete speech segments are successfully transmitted through the sluggish multivariate vocal mechanism at peak rates exceeding 20 segments per second - more than twice the Nyquist limit for the faster variables; four or five times it for the slower ones!
The linguistic control of this system is organized around the independent characteristic modes of response of the articulators. Apparently, languages select for consonants an assortment of modes that can be discriminated perceptually, either by their spatial or temporal characteristics. Consonants /b/, /d/, and /g/, for example, are discriminated by spatial (and hence spectral) characteristics; /b/-/w/ and /d/-/j/ are discriminated by temporal characteristics. This alignment of linguistic distinctions probably simplifies language acquisition, in much the same way that it simplifies problems of modeling the articulatory process.

Fig. 9. Performance of the model for the same words as the natural speech data of Fig. 8.
This paper has presented a reasonably complete model of speech production, both of the articulatory system which so strongly constrains the speech process, and of the articulatory controller which so effectively circumvents those constraints. The model is incorporated into an overall system for speech synthesis, either from phonetic controls or from ordinary English text [20], [21]. From very simple inputs, the system can produce speech whose spectrograms are not casually distinguishable from those of natural speech. And the system generalizes well to a wide variety of utterances other than its original "training data."
Fig. 10. Two spectrograms of the utterance "We were away in September." One is real speech; the other is speech synthesis by rule.
closed yielding-wall vocal tract - about 200 Hz (cf. Sondhi [18]). Formants are constrained to a minimum separation of 175 Hz. Then bandwidths are determined by table lookup and interpolation from Dunn's data on natural speech [19].
The state of excitation for formant synthesis is determined by modeling the low-frequency behavior of the intraoral pressure. From the transglottal pressure, the neutral-glottal-area variable, and the prior state of oscillation, the current amplitude of vocal-cord oscillation and the on-off ratio of vocal-cord closure are estimated, following data obtained from the Flanagan-Ishizaka model [15]. Similarly, the amplitudes of noise and aspiration are inferred from the constriction area and glottal area, and the transconstriction and transglottal pressures.
Calculation of control data for formant synthesis, operating in a fixed-point Honeywell 516, requires about 13 s of computation for each second of speech. Fig. 10 is an example of the output of this system. One spectrogram is of natural speech; the other is of speech produced by the model, with formant synthesis.

ACKNOWLEDGMENT

I wish to thank my colleagues J. L. Flanagan and M. M. Sondhi for many valuable discussions, and O. Fujimura for collaboration on the original static model. I thank R. Houde for graciously providing digital tapes of his X-ray measurements. I am grateful to K. Browman and S. Webber, who did much of the hard work of this project, often from vague and changing specifications. I am especially grateful to N. Umeda for years of collaboration during the evolution of this model.

REFERENCES

[1] C. H. Coker and O. Fujimura, "A model for specification of vocal-tract area function," J. Acoust. Soc. Amer., vol. 40, p. 1271(A), 1966.
[2] J. Mathews and R. L. Walker, Mathematical Methods of Physics. New York: Benjamin, 1965.
[3] R. Courant and D. Hilbert, Methods of Mathematical Physics. New York: Wiley, 1965.
[4] H. Goldstein, Classical Mechanics. Reading, MA: Addison-Wesley, 1950.
[5] S. E. G. Öhman, "Numerical models of coarticulation," J. Acoust. Soc. Amer., vol. 41, pp. 310-320, 1966.
V. CONCLUSION

Spoken language is literally designed around the articulatory process - to exploit its capabilities, and to compensate for its weaknesses. The dominant weakness is slowness of articulatory movements. The exploitable features are the large number of independently controllable articulatory movements, and the fact that most of these movements have nonlinear effects on the output sound. Certain critical extremes of movement have pronounced effects; opposite extremes have little effect.

[6] W. L. Henke, "Preliminaries to speech synthesis based upon an articulatory model," in Proc. IEEE Conf. Speech Commun. Processing, pp. 170-182, 1967.
[7] J. S. Perkell, "A physiologically-oriented model of tongue activity in speech production," doctoral dissertation, Massachusetts Inst. Technol., Cambridge, 1974.
[8] R. A. Houde, "A study of tongue body motion during selected speech sounds," doctoral dissertation, Univ. Michigan, Ann Arbor, 1967.
[9] E. A. Guillemin, Theory of Linear Physical Systems. New York: Wiley, pp. 159-216, 1963.
[10] C. G. M. Fant, Acoustic Theory of Speech Production. 's-Gravenhage, The Netherlands: Mouton, 1960.
[11] N. Chomsky and M. Halle, The Sound Pattern of English. New

York: Harper & Row, 1968.
[12] G. Fant, "Distinctive features and phonetic dimensions," in Applications of Linguistics (Selected Papers of the 2nd Int. Congr. Appl. Linguistics, Cambridge, England, 1969). Cambridge, England: Cambridge Univ. Press, 1971.
[13] N. Umeda, "Linguistic rules for text-to-speech synthesis," this issue, pp. 443-451.
[14] N. Umeda and C. H. Coker, "Allophonic variation in American English," J. Phonetics, vol. 2, pp. 1-5, 1974.
[15] C. H. Coker and N. Umeda, "The importance of spectral detail in initial-final contrasts of voiced stops," J. Phonetics, vol. 3, pp. 63-68, 1975.
[16] J. L. Flanagan, K. Ishizaka, and K. Shipley, "Synthesis of speech from a dynamic model of the vocal cords and vocal tract," Bell Syst. Tech. J., vol. 54, pp. 485-506, Mar. 1975.
[17] L. R. Rabiner, L. B. Jackson, R. W. Schafer, and C. H. Coker, "A hardware realization of a digital formant synthesizer," IEEE Trans. Commun. Technol., vol. COM-19, pp. 1016-1020, Nov. 1971.
[18] M. M. Sondhi, "Model for wave propagation in a lossy vocal tract," J. Acoust. Soc. Amer., vol. 55, pp. 1070-1075, 1974.
[19] H. K. Dunn, "Methods of measuring vowel formant bandwidths," J. Acoust. Soc. Amer., vol. 33, pp. 1737-1746, 1961.
[20] C. H. Coker, "Speech synthesis with a parametric articulatory model," presented at the Kyoto Speech Symposium, 1968.
[21] C. H. Coker, N. Umeda, and C. P. Browman, "Automatic synthesis from ordinary English text," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 293-298, Feb. 1973.

Automatic Recognition of Speakers from Their Voices

BISHNU S. ATAL

Abstract-This paper presents a survey of automatic speaker recognition techniques. The paper includes a discussion of the speaker-dependent properties of the speech signal, methods for selecting an efficient set of speech measurements, results of experimental studies illustrating the performance of various methods of speaker recognition, and a comparison of the performance of automatic methods with that of human listeners. Both text-dependent as well as text-independent speaker-recognition techniques are discussed.

I. INTRODUCTION

MOST OF US ARE aware of the fact that voices of different individuals do not sound alike. This important property of speech - of being speaker-dependent - is what enables us to recognize a friend over a telephone. The ability to recognize a person solely from his voice is known as speaker recognition. Speaker recognition by human listeners is a common experience and has been known for a long time [1], [2]. More recently, with the availability of digital computers, speech scientists have wondered if automatic and objective methods can be devised to recognize a speaker uniquely from his voice [3]-[10]. In many speech applications, it is often difficult to duplicate human performance by machines. In the case of speaker recognition, fortunately, this is not true. Not only is successful speaker recognition by machines possible; presently available experimental evidence suggests that the performance of machines in many instances exceeds that of human listeners [11].

The early work on speaker recognition was almost completely limited to human listening. A considerable part of this research was motivated by the desire to produce natural-sounding speech from vocoders and similar speech-processing devices. Although vocoder-generated synthetic speech was quite intelligible, it was often deficient with respect to the recognizability of the speakers. This particular problem led to a search for the factors which convey the speaker-dependent information in speech. The principal aim of these studies was not just to determine the accuracy with which listeners could identify speakers but to answer the fundamental question: how do listeners differentiate among speakers? Unfortunately, it is not easy to find a satisfactory answer to this question. The question about perceptual bases of speaker recognition and their acoustical correlates remains largely unresolved.

The availability of digital computers for processing of speech signals probably provided the greatest impetus to research on automatic speaker recognition. The motivation for such research stemmed from both a curiosity to see if human performance could be duplicated by machines and a promise of providing new and indeed revolutionary services in many diverse fields. Efficient banking and business transactions, controlled access of a facility or information to selected individuals, and a new tool for the law-enforcement agencies are among the many possible applications of automatic speaker recognition.

This paper presents a survey of the progress achieved towards automatic speaker recognition.¹ We will review in the remainder of this introductory section some of the fundamental aspects of the problem. Why is speaker recognition possible? What are the sources of interspeaker variability and how are

Manuscript received August 15, 1975; revised October 8, 1975.
The author is with Bell Laboratories, Murray Hill, NJ 07974.

¹ The term speaker recognition as used in this paper refers to any decision-making process that uses some features of the speech signal to determine if a particular person is the speaker of a given utterance, which will include tasks such as identification, verification, discrimination, and authentication of speakers.
