4, APRIL 1976
I. INTRODUCTION
THE VALUE of any model rests in its ability to produce, from simple inputs, large amounts of correct detail in its outputs. Here is described a model of articulation that, from a phonetic input, can produce vocal-tract area functions and excitation controls to derive sound outputs that are at least passably intelligible, and whose spectrograms are not casually distinguishable from those of natural speech.
The system includes: 1) a physical model of the vocal system, with spatial constraints very close to those of natural articulation; 2) a representation of the motional constraints of the articulators which, when moving from one stated shape to another, interpolates realistic intermediate shapes; 3) a similar model of the movements of the excitation system, including subglottal pressure, vocal cord angle and tension; and 4) a controller for this mechanism which produces, from input phonetic strings, sequences of articulatory commands which cause this dynamic system to execute properly timed articulatory motions.

II. THE STATIC ARTICULATORY MODEL
The level of detail sought in the spatial model of the articulators was to match rather closely the shapes of the human vocal tract, but not to resolve individual muscles. A diagram of the model is shown in Fig. 1. Three coordinates are used to specify location of a large central portion of the tongue and to regulate jaw movement. Making use of the fact that the tongue body moves within the mouth as a rather constant-shaped mass, this portion of the tongue is represented as a fixed circular section movable in a plane.¹ Two of the three tongue-body coordinates define vowels, and have an influence on jaw angle. A third variable produces the more rapid tongue-body movement for /g/ and similar consonants, but does not affect jaw angle. Five other parameters are significant for other consonants. Two coordinates specify closure and rounding of the lips; two others regulate raising and curling back of the tongue tip.

Fig. 1. A spatial model of the articulatory system.

Another variable serves to control the general-purpose cross-section transformation. The model itself accounts for vocal-tract width except where the tract is severely constricted. To regulate the rate of change of area as any articulator begins to close off the tract, the cross section or constriction parameter C produces a saturation of the form

    $A' = \left(A + \sqrt{A^2 + C^2}\right)/2.$    (1)

Its most critical action is to distinguish between shapes of the tongue tip in /s/, /d/, and /l/, and to make specification of these phonemes less sensitive to context. Since saturation area is an even function of C, /s/ and /l/ may be assigned target values with opposite signs, representing tongue-tip curvature less than and greater than palate curvature, respectively. Consequently, smooth transitions between /s/ and /l/ insert "homorganic stops" /t/ or /d/ automatically.
A ninth variable represents position of the velum, but in this implementation affects only an acoustic-domain simulation of nasality. Another variable represents presumably an upper-pharyngeal constriction of some sort, as occurs in the "bunched-tongue" /ɝ/ (er) of American English. This too is presently handled as an acoustic-domain correction.
The model has three variables to control the manner of excitation of the vocal tract. One of these represents action of the arytenoid cartilages in opening and closing the vocal cords. Two others are analogous to vocal cord tension and subglottal pressure.
Several internal variables govern shape of the pharyngeal section, area at the teeth, etc., but are not independently controllable. For completeness, the model should have another variable for independent control of the upper lip. Phonemes /f/ and /v/ must currently be synthesized as bilabials. Similarly, close matching of the articulatory shapes for /ʃ/ and /ʒ/ (sh and zh) would require an additional degree of freedom.
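The constriction saturation described in this section can be illustrated with a short Python sketch. It assumes the saturated area has the form A' = (A + sqrt(A^2 + C^2))/2; the radicand is an assumption chosen so that the result is an even function of C, as the text requires, and the function name `saturated_area` is hypothetical rather than taken from the paper.

```python
import math


def saturated_area(area, c):
    """Assumed saturation: A' = (A + sqrt(A^2 + C^2)) / 2.

    area: unsaturated cross-sectional area A (may overshoot below zero)
    c:    constriction parameter C (only its magnitude matters)
    """
    return 0.5 * (area + math.sqrt(area * area + c * c))


if __name__ == "__main__":
    # As the tract closes, the area levels off near |C|/2 instead of
    # passing abruptly through zero.
    for a in (2.0, 1.0, 0.5, 0.0, -0.5):
        print(a, saturated_area(a, 0.2))
```

Under this assumed form, a wide-open tract is left almost unchanged, a fully closed tract saturates at |C|/2, and targets of +C and -C give identical areas, which is the property the text uses to give /s/ and /l/ opposite-signed targets.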
Manuscript received August 20, 1975; revised October 31, 1975.
The author is with the Acoustics Research Department, Bell Laboratories, Murray Hill, NJ 07974.
¹The idea of a "circle within a circle" is Fujimura's [1]. For closer fits to high front and back phonemes, the palate of the present model is flatter, with the front and back corners rounded almost to the radius of the tongue mass.

III. DYNAMICS OF THE ARTICULATORY SYSTEM
A. What We Know and Do Not Know About Articulatory Dynamics
The tongue body is a volume of muscle tissue whose shape changes in response to varying internal forces. The density of
COKER: ARTICULATORY DYNAMICS AND CONTROL 453
muscle tissue is very close to that of water; but the viscosity at low frequencies must be much higher than that of a nearly lossless fluid. Muscles are fibrous; their stretch and shear characteristics are direction dependent. It is not at all clear whether the multiply interwoven fibers of the tongue that are called by single muscle names do in fact contract in unison, or whether they can be excited differently in different layers, or at different points along their lengths.
Thus for a close representation of the tongue, one could envision a set of second-degree partial differential equations in three dimensions and time, with messy-to-describe shapes, internal characteristics, excitations and boundary conditions. But of course, the available data is inadequate to characterize such equations.
Nevertheless, even though we know little about the equations of motion of the vocal system, we know quite a lot about their solutions!
First of all, we can predict the form of the solutions. Whether the system is represented by partial differential equations or by "lumped-constant" equivalents, if linearity can be assumed, then solutions will be additive combinations of a number of "modes of response," each mode having its characteristic spatial and temporal "signature."
From available data, we can deduce quite a lot about those modes. We know the number of them: no more than about ten have dynamic activity in most English speech. The shapes of the vocal tract during most speech can be approximated closely by a model with seven to ten degrees of freedom. The model spans the vector space of nontrivial solutions.
Within reason, we can identify those modes used in speech. The lips, tongue tip, velum, glottis and to an extent the tongue body are physically isolated parts with comparatively little or no directly connecting tissue or bone. They should have independent responses, roughly one or two modes for each independent articulator.
There is also data available to give a rather clear idea of the temporal characteristics of the modes. The spectra of speech sounds are strongly dominated by vocal-tract constrictions. And various consonants of the language use different ones of the physically independent articulators to produce their characteristic vocal-tract closures. Formants in consonantal and vowel-vowel transitions are thus distorted but reasonably clear images of the temporal responses of articulators. The temporal behavior of lip rounding can be seen in the spectrograms of /w/; bilabial closure in /b/; tongue-tip raising in /d/; tongue-body movements in vowel-vowel and /g/ transitions.

An approximate model can be constructed and compared with the real data. In successive iterations, the model can be manually converged to fit the real vocal system.

C. Transformations to Independent Variables
Consider a linear system whose configuration, described by a set of position coordinates y or more simply the vector Y, is related to its inputs X by the transfer function H

    $y_i = \sum_j h_{ij} x_j$    (2)

    $Y = HX.$    (3)

We will assume that H is a matrix of convolutions, and that X and Y are vectors whose elements are time waveforms or discrete time series. The procedure also applies, however, where X and Y are continuous functions over a given domain, and H is an integral or differential equation. The x's can represent forces. At the same time, if we assume the system to have restoring forces proportional to displacement, we might view the x's, with simple rescaling, as positions (target configurations).
We wish to make a change of variables; to represent the system in terms of a new set of coordinates R and inputs C whose elements are related to the old variables X and Y by the transformation P

    $Y = PR, \qquad R = P^{-1}Y$
    $X = PC, \qquad C = P^{-1}X.$    (4)

Equation (3) can be rewritten

    $R = P^{-1}Y = P^{-1}HPP^{-1}X = P^{-1}HPC = DC.$    (5)

The transfer function for the new coordinates becomes

    $D = P^{-1}HP$

and the transformed system equations

    $R = DC.$    (6)

The crux of the principal-axis problem is this: For a reasonably general class of physical systems (see Section III-E), there exists a transformation P which will reduce the system response equations to a diagonal matrix. Consequently, there exists a coordinate system in which each individual term of the responses R is dependent upon exactly one term of the inputs C:

    $r_i = d_{ii} c_i.$    (7)

The large system of simultaneous equations can thus be replaced by a set of smaller independent equations. In a model, a large multivariate system can be realized as a group of independent
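The change of variables in Section C can be made concrete with a small Python sketch. It is a deliberate simplification: H is taken to be a symmetric 2x2 matrix of static gains rather than a matrix of convolutions, so that P can be found in closed form as a rotation. The numerical values and the helper names (`matmul`, `transpose`, `principal_axes`) are hypothetical, chosen only for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def principal_axes(h):
    """Return (P, D) with D = P^-1 H P diagonal, for a symmetric 2x2 H."""
    (a, b), (_, d) = h
    theta = 0.5 * math.atan2(2.0 * b, a - d)   # rotation that removes coupling
    c, s = math.cos(theta), math.sin(theta)
    p = [[c, -s], [s, c]]                      # orthogonal, so P^-1 = P^T
    d_mat = matmul(transpose(p), matmul(h, p))
    return p, d_mat


# Assumed coupled system: each output depends on both inputs.
H = [[2.0, 1.0], [1.0, 2.0]]
P, D = principal_axes(H)

X = [[1.0], [3.0]]                   # original inputs
Y = matmul(H, X)                     # original responses, Y = HX
C = matmul(transpose(P), X)          # transformed inputs, C = P^-1 X
R_direct = matmul(transpose(P), Y)   # transformed responses, R = P^-1 Y
# In the new coordinates each response depends on one input: r_i = d_ii c_i.
R_diag = [[D[i][i] * C[i][0]] for i in range(2)]
```

Here `R_direct` and `R_diag` agree to rounding error, which is the content of equations (5) to (7): after the transformation, each response can be computed from its own input alone.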
lip rounding, lip closure, raised tongue tip, retroflexed tongue tip, tongue-body-velar closure, nasalization, glottal closure, and
[Figure labels: COMMANDS, RESPONSES]
Fig. 7. Comparison of tongue-body motion of the model (smooth, lighter traces) with X-ray data on human articulation (data taken from Houde [8]).

Houde's data. There is a general tendency, during vocal-tract closure of /g/, for the constriction to move from the back of the palate toward the front, or at least never to move toward the back.
After closure, the buildup of pressure behind the constriction would tend to move the tongue mass in the proper direction. But the actual effect shows up in initial movements from the vowel to /g/, well before closure occurs, and in the departure from /g/ well after release. The initial movement of /ig/ is horizontal; the final movement of /gi/ is vertical. The initial movement of /ug/ is vertical; the final movement of /gu/ is horizontal.
Presumably this implies an active adjustment in sequencing, to prevent devoicing of the consonant. When the vocal tract is closed off to produce /g/, the enclosed cavity is smaller and more hard-walled than with other consonants. In order to prevent a pressure buildup that would quickly cause /g/ to devoice, the cavity gradually enlarges, making room for air flowing through the glottis. Staggered timing of horizontal

E. "Intentional" Allophonic Variation
The timing adjustments previously described are functions of phonemic context only. There are other adjustments which are governed by higher level factors such as stress and the presence or absence of word boundaries (in the model, a continuously adjustable degree of boundary). The most notable of these variants is the timing of glottal closure in voiceless stops.
Umeda's paper [13, Fig. 12], [14] shows the devoicing time of /t/ measured from real speech in a number of phonological conditions. A /t/ that is both word- and stressed-syllable-initial has a devoicing time of 60-70 ms; word-medial nonstressed /t/, an average of 40 ms, depending on context; final /t/, 0 to 20 ms.
The physiological mechanism for regulating devoicing time operates through the timing of a step command from "open-glottis" to "closed-glottis." Fig. 8 shows measured data from real speech for this effect. In word- and stress-initial stops, the glottis reaches its widest opening at the time of release of the stop, and then takes 60 to 70 ms to close back to the point where significant vocal-cord vibrations occur. But in the word-final stop, the glottis never opens. Word-medial stops exhibit various intermediate degrees of glottal opening and delay of closure, depending on phonological conditions.
The initial-final contrasts of voiceless stops are obvious and well known. Less widely recognized and more subtle but similar contrasts occur in all consonants. The initial-final contrasts of fricatives involve mainly intensity. Our data show wider glottal opening for initial and stressed fricatives and, for higher stress levels, heightened subglottal pressure. The effect of subglottal pressure apparently reaches to adjacent vowels, where the portions nearest the stressed consonant can appear to be louder by two to four dB than the remainder of the vowel. Presumably, fricative intensity is also affected by the shape and degree of constriction.
In voiced consonants, initial allophones have subjectively lower intensity during the actual consonant, followed by a rapid increase going into the vowel. As with other consonants, a portion of the vowel adjacent to the stressed consonant
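The devoicing times quoted for /t/ in Section E can be collected into a small lookup, sketched here in Python. The numeric ranges come from the text; the context labels, the use of each range's midpoint, and the assumption that devoicing time is measured from stop release to voicing onset are illustrative choices, not the paper's controller.

```python
# Devoicing-time ranges in ms for /t/, as quoted in the text.
# The dictionary keys are hypothetical labels for the three conditions.
DEVOICING_MS = {
    "word_and_stress_initial": (60.0, 70.0),
    "word_medial_unstressed": (40.0, 40.0),   # text gives an average of 40 ms
    "word_final": (0.0, 20.0),
}


def voicing_onset_ms(release_ms, context):
    """Estimate when voicing resumes after a /t/ release.

    Assumes onset = release time + midpoint of the quoted range.
    """
    low, high = DEVOICING_MS[context]
    return release_ms + 0.5 * (low + high)


if __name__ == "__main__":
    for ctx in DEVOICING_MS:
        print(ctx, voicing_onset_ms(100.0, ctx))
```

A fuller model would replace the three discrete labels with the continuously adjustable degree of boundary mentioned in the text; how that degree maps onto devoicing time is not specified in this excerpt, so no interpolation is attempted here.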
458 PROCEEDINGS OF THE IEEE, APRIL 1976
York: Harper & Row, 1968.
[12] G. Fant, "Distinctive features and phonetic dimensions," in Applications of Linguistics (Selected Papers of the 2nd Int. Congr. Appl. Linguistics, Cambridge, England, 1969). Cambridge, England: Cambridge Univ. Press, 1971.
[13] N. Umeda, "Linguistic rules for text-to-speech synthesis," this issue, pp. 443-451.
[14] N. Umeda and C. H. Coker, "Allophonic variation in American English," J. Phonetics, vol. 2-5, pp. 1-4, 1974.
[15] C. H. Coker and N. Umeda, "The importance of spectral detail in initial-final contrasts of voiced stops," J. Phonetics, vol. 3.1, pp. 63-68, 1975.
[16] J. L. Flanagan, K. Ishizaka, and K. Shipley, "Synthesis of speech from a dynamic model of the vocal cords and vocal tract," Bell Syst. Tech. J., vol. 54, pp. 485-506, Mar. 1975.
[17] L. R. Rabiner, L. B. Jackson, R. W. Schafer, and C. H. Coker, "A hardware realization of a digital formant synthesizer," IEEE Trans. Commun. Technol., vol. COM-19, pp. 1016-1020, Nov. 1971.
[18] M. M. Sondhi, "Model for wave propagation in a lossy vocal tract," J. Acoust. Soc. Amer., vol. 55, pp. 1070-1075, 1974.
[19] H. K. Dunn, "Methods of measuring vowel formant bandwidths," J. Acoust. Soc. Amer., vol. 33, pp. 1737-1746, 1961.
[20] C. H. Coker, "Speech synthesis with a parametric articulatory model," presented at a talk at Kyoto Speech Symposium, 1968.
[21] C. H. Coker, N. Umeda, and C. P. Browman, "Automatic synthesis from ordinary English text," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 293-298, Feb. 1973.
Abstract: This paper presents a survey of automatic speaker recognition techniques. The paper includes a discussion of the speaker-dependent properties of the speech signal, methods for selecting an efficient set of speech measurements, results of experimental studies illustrating the performance of various methods of speaker recognition, and a comparison of the performance of automatic methods with that of human listeners. Both text-dependent as well as text-independent speaker-recognition techniques are discussed.

I. INTRODUCTION
vices. Although vocoder-generated synthetic speech was quite intelligible, it was often deficient with respect to the recognizability of the speakers. This particular problem led to a search for the factors which convey the speaker-dependent information in speech. The principal aim of these studies was not just to determine the accuracy with which listeners could identify speakers but to answer the fundamental question: how do listeners differentiate among speakers? Unfortunately, it is not easy to find a satisfactory answer to this question.