Auditory-Perceptual Interpretation of The Vowel: Central Institute For The Deaf St. Louis, Missouri 63110

Auditory-perceptual
James D. Miller
interpretation
of the vowel
CentralInstitute for theDeaf St. Louis,Missouri63110
(Received8 October1987;accepted for publication19January1989)
The major issues in relatingacoustic waveforms of spoken vowelsto perceived vowelcategories are presented and discussed in termsof the author'sauditory-perceptual theoryof phonetic recognition. A brief historicalreviewof formant-ratiotheoryis presented, aswell asan analysis of frequency scales that havebeenproposed for description of the vowel.It is illustratedthat the monophthongal vowelsounds of AmericanEnglishcanbe represented as clustered in perceptual targetzones within a three-dimensional auditory-perceptual space (APS), and it is shownthat preliminaryversions of thesetargetzonessegregate a corpus of vowelsof American Englishwith 93% accuracy.Furthermore,it is shownthat the nonretroflex vowelsof AmericanEnglishfall within a narrowslabwithin the APS, with spread
vowels near the front of this slab and rounded vowels near the back. Retroflex vowels fall in a
distinctregionbehindthe vowel slab.Descriptions of the vowelswithin the APS are shownto be correlatedwith their descriptions in termsof dimensions of articulationand timbre. Additionally, issues relatedto talker normalization,coarticulationeffects, segmentation, pitch, transposition, and diphthongization are discussed.
PACS numbers:43.71.An, 43.71.Es, 43.70.Fq
INTRODUCTION
suredcorrelationbetweenspectralpatternsand perceived
vowels.
The purpose of thispaper is to describe a quantitative, All of the factorsmentioned aboveapplyto the special theoretical approach to vowelperception. Thisapproach is case where the speaker and listener arebothnativespeakers based ontheauditory-perceptual theory ofphonetic recogniof the same dialect of their language, wherethe speech is tion (Miller, 1984a,b, 1987a,b), and it is designed to have accurately produced, where the semantics and syntax of the the potential to provide a comprehensive andcoherent accountof the factsrelatingacoustic waveforms to perceived utterance are well known to the listener, and where the lisnot needto process extraneous noise or speech. In vowelcategories. It is hoped that thistheoretical approach tenerdoes this special case, the issues of top-down cognitive and linto vowelperception will be usefulnot only as a guideto processing are largelyremoved from consideration. solutions of theunresolved problems in thisareaofinvestiga- guistic Even so, unresolved problems related to talkers, coarticulation,but mayalsobeuseful in organizing andsummarizing tion, rate and stress, voice pitch, and segmentation remain, existing knowledge aswell' asintegrating that knowledge and each of these issues is addressed by the auditory-percepinto a general explanation of phonetic recognition. unresolved problems can Presently, it is well established that thelocations of the tual theory.It isclaimedthat these either be resolved by the theory or that the theory provides first threeprominences of the short-term spectrum of the framework thatwill leadto theirresolution. vowel waveformare highly correlatedwith the perceived aninvestigative
Since the auditory-perceptual theory has descended from the formant-ratio theory of vowel quality,and since it attenuated by a varietyof factors evenwhenconsideration is relies heavily on the logarithmic frequency scale, this paper constrained to the carefully produced speech of a single diawith sections relatedto these issues. In Sec. I, a brief lect. Includedamongtheseattenuating factorsare: differ- begins ences between talkers' vocal characteristics associated with historyof the formant-ratio theoryis given,alongwith a of its advantages andshortcomings. Next, in Sec. size, age,andgender; differences between talkers in articula- summary II, frequency scales that have been suggested for thedescriptory styleor habit;anddifferences in theacoustic expression tion of vowels are compared and discussed. Included in the of a vowel associated with large coarticulation effectsindiscussion are mel, Bark, Koenig, log frequency, and freduced by surrounding segments, with effects of speaking rate scales. Section III presents a briefdescription of the andlinguistic stress, andwithchanges in, or modulations of, quency auditory-perceptual theory of phoneticrecognition and voice pitch.Additionally, when a vowel isuttered aspartofa demonstrates how it can be applied to a sustained vowel and syllable or word, the sequence of short-term spectra often a consonant-vowel syllable. Concepts to be described inexhibitsmore than one segment that is nearlysteadystate, clude the sensory reference, sensory formants, the sensoryandsometimes it exhibits a nearly continuous change. There perceptual transformation, perceptual formants, the audiis no precise information as to whichof these spectral pattory-perceptual space, perceptual target zones, and ternsor which combinationof thesespectralpatternscomsegmentation maneuvers. These concepts are illustrated by prises theacoustic correlates of theperceived vowel. Thisis data gleaned from the literature and by measures made in the problemof segmentation, and it alsodisturbs the meacolor of the vowel. However, this correlation is known to be 2114 J. Acoust. Soc. Am.85 (5),May1989 0001-4966/89/052114-21 $00.80 1989Acoustical Society ofAmerica 2114
Downloaded 11 Sep 2012 to 146.164.3.22. Redistribution subject to ASA license or copyright; see http://asadl.org/terms
our laboratories at Central Institute for the Deaf (CID).
Why do some differently perceived vowels havesimilarformant ratios? And, how is it that different instances of the
Also, included in Sec.III is a description of preliminaryestimatesof the perceptual targetzones for the simple(that is, monophthongal)vowelsof American English.In Sec. IV, characterizations of the vowelsin the auditory-perceptual spaceare shown to be related to descriptions in terms of articulationand timbre. SectionV dealswith four separate issues in vowelperception: (a) voicepitchasa determiner of vowelcategory,(b) transposition of vowelspectra, (c) vowelswith changing formants,and (d) lability of vowelboundaries.SectionVI containsconcluding comments.
I. FORMANT-RATIO THEORY
same vowel can have very different formant ratios?The
strengths andweaknesses of formant-ratio theory asapplied to naturalspeech arebeautifully illustrated in thequantitativework of PotterandSteinberg (1950), Peterson andBarney (1952), and Peterson ( 1961).
Peterson and Barney(1952) plottedF2 values against F 1 values, as shown in Fig. 1 (seeRef.1). On thisfigure, frequency is plottedin accordance with the Koenigscale (Koenig, 1949). The linescorresponding to equalF2/F 1 ratioshavebeenadded to clarifythispresentation. It canbe seenthat, while the data for a givenvoweltend to cluster
arounda linerepresenting a givenF2/F 1ratio,thereiscon-
In the late 1800s,Richard John Lloyd ( 1890a,b;1891; 1892) formulated the formant-ratio theory of the vowel. Lloyd'sdictum wasthat like articulations produce like perceptions of vowel qualities,and that like articulationsproduce like ratios of the formants.He called his theory "the relativeresonance theory" and statedthat vowelquality dependson the intervalsbetween the resonances, not their absolutevalues.Lloyd basedhis theory on the central role of the frequencyratio in music,on acousticdifferences in the vowelsof men, women,and children,on experiments with syntheticvowels producedby exciting hand-blownglass tubes,and on studies of transposition donethroughsynthesis or by speeding and slowingthe playbackof Ediphone recordings. His critics [for example,Pipping (1893) ] were quickto respond by pointingout that certaindistinctvowels have common ratios of secondand first resonantfrequencies,that is,F 2/F 1ratios,that transposition worksonly over certainlimited ranges, that somevowelsseem to haveonly a singleformant rather than two, and that his formulasfor calculatingthe resonances of his tubeswere in error. In reply, Lloyd (1894) asserted that almostall vowels werecomposedof at leasttwo resonances and that errorsin calculating resonanceswere irrelevant to his thesis that like articulations producelike formant ratiosand like vowelpercepts. He defended hisexperiments in transposition asshowing that vowel perception was much more sensitiveto changes in formant ratiosthan to changes in absolute values that maintainappropriate ratioswhenboth kindsof change
are measured in semitones.But he did admit that, while the
siderable scatter. Also, within eachof the vowelgroups [/AA/ and /AO/], [/UH/, /UW/, and /AE/], and [/UW/,/ER/, and/EH/], theF2/F 1ratios areverysimilar. Earlier,PotterandSteinberg (1950) madesimilarobservations onthesimple vowels of American English asspoken by tenmen,tenwomen, andfivechildren. As shown in Fig. 2, theynotedstrong correlations between thepitchandformantvalues within eachvowelcategory. They alsonoted,as shown in Fig. 3, that characterizing the vowels by ratiosof themelequivalents off 1,F2, andF 3,thatis,byM 3/M 2and
M 2/M 1, almost eliminated talker differences within each
vowelcategory.Also, they statedthat ratiosof formant fre-
quencies behaved exactlythe sameasmel ratiosby way of eliminating talkerdifferences. ThusPotterandSteinberg in 1950 once again offeredstrong evidencein supportof Lloyd'sviewthat like formantratios resultin like perceived
formant ratio was the dominantfactor in vowel perception, absolute frequencies mustalsoplay a role. Statements of the formant-ratiotheory appearin the literatureeveryfew yearssinceLloyd'swork (as examples, seeChibaand Kajiyama, 1941;Potterand Steinberg, 1950; Iri, 1959; Peterson, 1961; Okamura, 1966; Minifie, 1973; Broad, 1976; and Kent, 1979), and, interestingly,the au-
thorsusually seem to beunaware of priordescriptions of the notion.Amongthese papers, Peterson's ( 1961) discussion is particularlyusefuland penetrating. The major strengthof formant-ratiotheoryis its ability to eliminateor dramatically reducedifferences in the acoustic description of vowels related to talker, age, and gender.The .major stumbling blocksto the acceptance of formant-ratiotheoryseemto be relatedto failuresto provideclearanswers to the following questions. Why doesthe transposition of vowelshold only over certain restrictedrangeson the log frequencyscale?
2115 J. Acoust.Soc. Am.,Vol. 85, No. 5, May 1989
FREQUENCY OF F2 IN CYCLES PER SECOND
FIG. 1. Scatterof vowels in F2 by F 1 plot whereboth variables are on the Koenig scale.Lines representing typical valuesof F2/F 1 for the vowels
/IY/,/EH/,/AE/, and/AA/are shown. Note that within each of the
vowel groups [/AA/ and /AO/], [/UW/, /UH/, and /AE/], and [/UW/,/ER/, and/EH/], theF2/F 1ratioscanbeverysimilar(after Fig. 8 from Petersonand Barney, 1952).
James D. Miller:Vowel perception
2115
4 500
4000
'
3500
30.00 -' :';' .'., ' '2,, ' ....
2000
1500
2soo - -, .:: ;. .." ,',.' .-..," ,/.:' L'-' : ,

,.t..
''
" ', :-'] ' '- ' -': i;'-" " ,. ' ,:. '
'
'
' I
*
'
%'
-'"
' I
'
'
F 3
iooo j
that melratiosandfrequency ratiosarenearlyidentical over the range usuallyoccupied by the first three formantsof vowels andthat the datafit eitherformulationnearlyequally well, asshownin Fig. 4. The early work of Lloyd, and the subsequent works
cited above, all indicate that use of the ratios of the center
3500
z 3 ooo
o
w 2500
2000
I1 15oo
>-
1000
700
1200 I 10001
800 i
600
frequencies of the firstthreeformants canreduce andnearly eliminatetalker differences. It is largelyfor this reasonthat suchratiosplay a prominentrole in the auditory-perceptual interpretationof the vowel. However, additionalconcepts are introducedto deal with the shortcomings of formantratio theory mentionedabove.Beforepresenting the auditory-perceptual theory, a more detaileddiscussion of frequencyscalesfollows in the next section.This sectionis meantto rationalizethe useof the log frequency scale in the auditory-perceptual theoryandto clarifythe relations of the log frequency scaleto its prominentalternatives.
400
200
II. FREQUENCY SCALESIN VOWELANALYSIS

F I
'I 0.4 :.-t_..t .:_.J _.,t .4

2OO
o'-t
i
r'-h'
c
Various scalesof frequency have been suggested for vowel analysis.Most of thesesuggestions are basedon the notion that auditory scales,such as the mel scale (Fant,
1973), the Bark scale (Zwicker and Terhardt, 1980), the
FIG. 2. The pitch and formant valuesof the simplevowelsof American English. On eachpanelfrom left to rightarethedatafor tenmen,tenwomen,andfivechildren. Within eachgroup of speakers, thedataarearranged fromlefttorightin order ofincreasing pitch. Thisisareprint ofFig. 11from Potter and Steinberg(1950).
vowels.However, Potter and Steinberg werediscouraged by the fact that neither [/AA/and/AO/] nor [/UH/and /UW/] couldbe distinguished by formant-ratiotheory. In thesecases, someother factor, perhaps akin to the absolute locationsof the formantsalong the frequencyscale,must play a role.
Later, Peterson ( 1961 ) returned to this issue.He con-
Koenig scale (Koenig, 1949), or cochlearpositionscales (Greenwood, 1961), reflect auditory analysismore accurately than others.Thesescales, whetherbasedon frequency-position mapsof the cochlea,critical-bandmeasures, or pitch scaling experiments, all tend to be linearfunctions of frequency in hertz in the low-frequency region,transitional in the mid-frequency region,and logarithmicin the highfrequency region.Consider Fig. 5, wherenormalized values of the mel, Bark, and Koenig scales are plottedagainstfrequency(of)in hertz,andboth thex andy scales arelogarithmic. The mel values were calculated from Fant's (1973)
equationfor technicalmels (TM),
TM:
( 1000/log 2)log(f/1000 q- 1).

'
( 1)
firmed the observations of Potter and Steinberg by showing that ratios of the center frequencies of the first and second formantsand of the secondand third formtntsservedto eliminate talker differences whether thesefrequencies are
The Bark values (B) were calculatedby the equationfrom

Zwicker and Terhardt (1980),
B = 13arctan(0.76f/1000) q- 3.5arctan(f/7500) 2.
(2)
The Koenigvalues(K) werecalculated from the equations
expressed in hertzor in mels.Peterson thendemonstrated (Koenig, 1949)
M3/M2
I'
L I o 3 u u, A 3
FIG. 3. The ratiosof the mel equivalents

ofF3andF2 (M3/M2) and ofF 2 and F 1
(M 2/M 1) for the data shownin Fig. 2.

Note that talker differencesare removed,
M2/MI
3
but that the pairs [/AA/and/AO/] and [/UH/ and /UW/] cannot be distinguished. Tn_ .isis a reprint of Fig. 13 from Potter and Steinberg (1950).
2116
J. Acoust.Soc. Am., Vol. 85, No. 5, May 1989
2116
4000
)-
2000
n-
,-,
1000
200
100
oo 4oo 6oo 00o ,00o 4ooo 600o
FREQUENCY (Hz)
]G. 5. The me] (closedcircles), Bark (open circles), oeni (open squares), and frequency (solidline) scales are plottedaainst frequency. TheBarkvalues aremultiplied by 117.5015 andtheKoenig values aremultipliedby 500sothatall scales havea valueof 1000for thefrequency of 1000 Hz. Note the similarityof all four scales.
-/
thesedeviations from linearity,Fig. 5 supports the ideathat, overmostof the ranges of valuesof the centerfrequencies of F 1, F2, and F3, the Koenig scale,the mel scale,the Bark scale,and frequency in hertz are nearly equivalent. A corollary of this observationis that formant ratios, whetherin mels,Barks,Koenigs,or hertz, shouldbe nearly equivalentover the rangesassociated with the vowels of American English. To test this hypothesis, we calculated these ratios for the mean data of Peterson and Barney
(1952) for nonretroflex vowels. Since there are nine vowels
and three talker groups, therewere 27 F 2/F 1 ratiosand 27

F 3/F2 ratios. The results of these calculations are shown in
40
600
000
0O4)O
FREQUENCY OF FIRST
FORMANT
FIG. 4. The frequency of the firstandsecond formants for a setof matched
vowels by men,women, andchildren speakers. Points for a givenvowel clustercloseto constant frequency ratio (dashed)or constant mel ratio (solid)lines. (Thelarge rectangle onthefigure isnotrelevant tothepresent paper. ) Thisisa reprintof Fig.2 fromPeterson ( 1961 ). (Figurereprinted
with the permission of ASHA.)
K= 0.002f for 0<1000,
(3)
K= (4.5 logf) -- 11.5 for 1000<10000.
(4)
For purposes of graphic comparison,the calculated Bark valueswere multiplied by 117.5015and the Koenig values weremultipliedby 500 sothat all scales wouldhavea valueof 1000for the frequency of 1000Hz. As canbe seen in Fig. 5, all four scales have similar slopes up to about 1000 Hz, and from 1000-4000 Hz the Koenig, mel, and Bark scalesmake the transition to a more nearly logarithmic slope.Interestingly,the changein slopefrom the range of 100-1000Hz to the rangeof 10(X)4000Hz isgreatest for the Koenig scale,intermediate for the Bark scale,and leastfor the mel scale.Otherwisesaid,the mel scaleis most nearly a linearfunctionof frequency, whilethe Bark scale is intermediate, and the Koenig scaleleastin this regard.In spiteof
2117 J. Acoust. Soc. Am., Vol. 85, No. 5, May 1989
Fig. 6. In panel (a), the logsof the ratiosof the mel valuesof the formantsare plotted againstthe logs of the frequency ratios.An extremely highcorrelation isevident,andthe conclusion of Potter and Steinberg (1950) and of Peterson ( 1961) that mel ratiosare equivalent to frequency ratiosin describing vowels is confirmed. Similar,but less precise, correlationsare notedfor the Bark transformation in panel (b) and the Koenig transformationin panel (c). Theseresults reflectthe deviations from linearity notedin the discussion of Fig. 5; that is, the greaterthe deviationfrom linearity notedon Fig. 5, the poorerthe correlation notedin Fig. 6. In contrastwith the useof ratiosor the logsof ratios,it has sometimes been suggested that differences in the variables describing F 1,F 2, andF 3 mightbe an effective metric for vowel quality. For example,Syrdaland Gopal (1986) haveshownthat certainfeaturedescriptions of vowels canbe predicted by particularvalues of differences in the Bark valuesof the centerfrequencies of the formants. Figure7 exhibits the correlations betweensuchdifferences in mels [panel (a) ], in Barks [panel (b)], and in Koenigs [panel (c)]. Here, it canbe seen that the correlations with the logsof the frequency ratiosare bestfor differences in Barksand Koenigsand a bit poorerfor differences in mels.The correlations betweenlogs of frequencyratios with Bark differences or with Koenigdifferences wouldbe nearlyperfectexceptfor a clusterof pointsassociated with theF 1 andF2 values of the vowels/UW/and/UH/. Figure 8 is includedfor completeJames D. Miller: Vowel perception 2117
1.0
0.6
2000
1750
i 500
(H3.L2)
0 = (H2.L i)
i
o
...i.. o_
0.6
0.4 -
1250
IT 1000
750
I
oo oo o
o
.i
o o
0.2
o.o
500
oo .... ....
250
o
2000 ' I ' I ' I

1750
(H3,L2).
0 , (H2.LI)'
'
]250
=1ooo
.. e ....
o
0 "!" ' o
soo
250
so po .... ....
..; ....... : ....
'.
2000 ,
.
' I ' I
: :
!
O. o
o o"i'-
1750
(H3'L2): 0 (H2,LI)
]50o
2so 'i"' 8:'"
o
-
= ]000
o...:;.. .
0.0 0.2 0.4 0.6 0.8 1.0 250 .. ..... ..... ..... ...
o
o.o o. 0.4 o. o.s .o
6. Scattc plots are show. for the rciatio.s bctwcc logarithms of foratiosexpressed i. hertz [x axesof pa.c]s (a), (b), a.d (c) ] a.d
logarithms of formant ratios expressed in mels[y axisof panel(a) ], Barks [y axisof panel(b) ], andKoenigs [y axisof panel(c) ]. The dataarethe mean values off 1,F 2,andF 3given byPeterson andBarney (1952)fornine nonretroflex vowels of AmericanEnglishfor threetalker groups. Thus thereare27 values oftheratioF2/F 1 (opencircles) andthereare27 values of theratioF3/F2 (closed circles) oneach plot.TheF2/F1 ratios for variousscales are indicatedas (H 2/L 1), while the F 3/F2 ratiosare indicated
as (H3/L 2).
LOG
FIG.7.Scatter plots areshown fortherelation between logarithms offormant ratios expressed in hertz[x axes of panels (a), (b), and(c)] and differences in formant values expressed in mels [panel(a)], in Barks X 117.5015 [panel (b)], and inKoenigs X 500 [panel (c)]. Open cir-
cles arefor F2 -- F 1 differences andclosed circles areforF3 -- F2 differences. Dataarethesame asfor Fig. 6.
ness, and it demonstrates that differences in formant values
Examination of Figs.5-8 suggests that useof differ-'

ences, ratios,or log ratiosof any of the suggested scales mightbeaboutequally effective in clustering thevowels independent of talkergroup. The degree of clustering for the
James D.Miller: Vowel perception 2118
in hertzdonotcorrelate aswellwith thelogs of thefrequency ratiosas do the differences in mels,Barks,or Koenigs shownin the previous figure.
2118 J.Acoust. Soc. Am., Vol. 85,No.5, May1989
3200 - ,
,
-
,
I
(H3,L2)
I
1444)
2800 ......... i.................................................................................... . O (H2.LI)

0.8
o IOtO
2400
...............................................................................................................
o
0.4
o
-
44
720
2000 ................................................................... ..................................... 3.._

i
7 8
555666
1600 ................................... "F'"'I ............... .......... ........ ....................

1200 ................. ......... e..Oi ...................................................................
,.o o
LOGio {M3/M2)
.o
i
,o ,o',o ,Lo ,;4

3-M2 ISOO
T '
(c)
1280
.oo 4.
r.
o.o o
1
1
22
2
NO
3
778 8
0.0 0.2 0.4 0.6 0.8 1.0
840
LOGo (FH/FL )
0.2
. 0:4
1.0
320 640 980

83-82
,
FIG. 8. A scatter plot is shownfor the relationbetween logarithms of formant ratiosanddifferences in formant values. Both scales are in hertz. Symbolsand data are the sameas in Fig. 7.
o.s
L. OGio (B/B2)
i !
'
(e)
',
(f)
1800
1200
log ratios and differences of(F 1,F 2) and (F 2,F 3) isshown

for the four scales in Fig. 9. On eachpanel,the vowelis indicated by an Arabic number ascoded in thelegend. By
0.4
%97 88
666
4OO
visual inspection, one can see that allofthe ratio descriptions

cause the like vowelsto clusterwell, with fair to goodseparation of unlike vowels.Differences also serveto clusterthe
o18 '
LOGio (K/2)
1.0
,oo 0oo ,2'oo

K3-K2
r' ' '1 I I ' i r
3200
vowels. In thecases ofthemel,Koenig, andfrequency scales,
,
1
(h)
2660
thelogratio approach seems todoa better jobofclustering

thevowels thandoes thedifference approach. In thecase of the Bark transformation, the two approaches appear to be
nearly equivalent.
21
2
3
4 4
192o
!280
?
A mathematical analysis confirms the visualimpres-
840
sions ofFig.9.Following thework ofMilleretal. (1983),we

calculated the distances between everypair of pointsin a
5 b8 6
, 0 I. I , 0. I , 0 I6 , l.O O' . 640 I ,, 1260 I 1920 I 25603000 0'%:0 2 , 0.4

LOGio (F 3/F2) F3-F2
plane. Next,wecalculated themean of thesquares of all

distances between vowel categories (ca) andthemean of
1--/IY/; 2--/IH/; 3--/EH/; 4--/AE/; 5--/AA/; 6---/AO/; the squares ofalldistances within vowel categories (owca).numbers: 8--/UW/; and9/AH/. Onevalue foreach vowel isshown for Thesquare root oftheratio ofthese mean squares, (a/ 7--/UH/; meandataof men,women,andchildren(Peterson andBarney,1952). The
FIG. 9. The clustering of vowels is shown for various expressions of the relations among F 1, F2, andF3. The vowels are indicated by Arabic
owa) 1/2, is the normalized root-mean-square distance reader should notethespacing within vowel categories in relation to that between vowel categories andis symbolized by d ms' The betweenvowel categories. larger this number, the better theoverall clustering byvowel category. Unfortunately, d ;ms can bemisleading in thefolproduced bythetwostatistics, a clear Ordering lowing way.Several vowelcategories couldoverlap com- rankings
pletely, butd s could yetbelarge if one ormore categories emerges. Whenthevowels arespecified bythelogs oftheratios of were greatly separated fromtheothers. Therefore, a second
statistic wascalculated. The mean position of each vowelin
estdistance between means (d,,) wasfound. Then,thesta-
formantcenter frequencies measured in hertzor mels,the

servedfor differences of the centerfrequencies of the for-
is observed. Clearly,inferiorclustering isothe plane under consideration was calculated, and the small- bestclustering
tistic d ,, - d,,/tra was calculated representing theminimum separation between vowelcategories expressed in
standard deviation units.
mants in Koenigs, mels, or hert.z. Intermediate c!usterieg is observed when logsof Koenigratios,Bark differences, er logsof Bark ratiosare usedas descriptors of the relt'mas
The author's preference for the log frequency scaleis
based on the literature and the results described above, on
Using thestatistics justdescribed, weranked thevowel amongthe formants.

metrics discussed above with regard to theireffectiveness in
clustering thevowel datashown in Fig. 9. The results are

shown in Table I. While thereare slightdifferences in the
the results of similar,more extensive studies(Miller et al.,
2119
J.Acoust. Soc. Am., Vol. 85,No. 5,May 1989
TABLE I. Comparison ofspectral metrics ofvowels based onclustering by

vowel category. Average
Metric
log(F3/F2),log(F2/F1 ) log(M3/M2),log(M2/M1) log(K3/K2),log(K2/K 1)
(B3--B2),(B2--B1)
d ,,
11.08 9.86 8.08
8.58
Rank d ;ms Rank

1 2 4
3
rank
1.5 1.5 3.5
4.0
2.38 2.50 2.14

1.73
2 1 3
5
log(B 3lB 2),log(B 2/B 1)

(K3--K2),(K2--K1) (M3 -- M2),(M2 -- M 1) (F3 -- F2),(F2 -- F1)
6.83
6.67 5.59 2.87
5
6 7 8
1.96
1.28 0.92 0.25
4
6 7 8
4.5
6.0 7.0 8.0
acoustic waveformto auditory-sensory dimensions. Stage2 is the transformation of the sensory data to perceptual values. In stage3, the perceptualvariablesare convertedto phonetic-linguistic categories. In stage1, it is assumed that short-termspectral analysesare performedon the incomingspeech waveform.Each spectrumcan be classified as a glottal-source spectrum,a burst-frictionspectrum,or a combinationof the two. At eachmoment,the spectral envelope patternsof the glottalsourceand burst-frictionsounds are represented as sensory responses or sensory pointers in a phonetically relevant audi-
tory-perceptual space. 2 In stage 2, these sensory responses,

or sensory pointers,are converted into a unitaryperceptual response, or perceptual pointer,that is alsolocatedin the auditory-perceptual space. The perceptual response is a hypothetical construct or intervening variablethat is based on the general notionthat speech inputsareintegrated to form a unitaryperceptual stream. In practice, thishypothetical perceptual response is calculatedby mathematicallydefined sensory-perceptual transformations that rely on the histories, trajectories, and dynamics of the sensoryresponses (Miller et al., 1988). Finally, in stage3, segmentation and categorization mechanisms that dependon the dynamics of the perceptualpointer in relationto perceptual target zones within the auditory-perceptual spaceresult in a string of category codes that correspond to the allophones of the language. All of theseconcepts are elaborated asneeded in the exposition that follows.
A. Auditory-perceptual theory applied to a sustained
vowel
1980, 1983), whichindicatedthat logsof frequency ratiosof the formantsgroupthe vowelsas well or better than other metrics,and on the generalimportance of the log frequency scalein auditory science. With regardto the generalprominence of frequency ratios,andthusthe logfrequency scale in auditory science, the following points are noted. Weber's law, which approximatelydescribes frequency discrimination overa widerangeof frequencies, impliesa log frequency
scale. In the caseof music, it is true that all musical scalesof
all cultures arebased onthe octave andthusonlogfrequency scales.Chord structures are definedin terms of frequency ratios,and thusare mosttractablein termsof log frequency. Additionally, transposition of melodyis effective on the log frequencyscale. Furthermore, Harris (1960) has shown that humanlisteners scalepitch in accordance with log frequencyaslongasthe intervalsto be scaled are small.A mel scaleis only obtainedwhen listeners are askedto scalelarge intervals.In the samearticle, Harris pointsout that the use of mel-like scaleswould result in very different musical forms basedon relatively larger intervalsthan thosecommonly used,and, furthermore,suchmusicwould haveto be monophonic, asmel-like scales will not allow harmony asit is presently conceived. From an anatomical point of view,it is notedthat frequency-position mapsof mostmammalsappearto belogarithmicovermostof their range(Greenwood, 1961). Finally, the measurement of spectra in termsof log frequencyprovidesthe opportunity to compare distances andtransition velocities in termsthat canbeeasilycompared
to those of other sounds. All of the material cited above leads
The analysisof a sustained vowel in terms of the auditory-perceptual theory will now bedescribed. The vowelwas
usto conclude that, eventhoughcompeting scales may perform quitewell in many circumstances as shownin the preceding material,the logarithmic frequency scale is quitefundamentalto hearing,and that it is, therefore,the metric of choiceuntil decisive resultsproveotherwise.
III. THE AUDITORY-PERCEPTUAL TO VOWELS THEORY APPLIED
The auditory-perceptual theoryof phonetic recognition (Miller, 1984b, 1987a, b) will now be briefly described. It is comprised of a three-stage process (Miller, 1984a), where the acoustic waveformof speech is converted by the human listenerto a stringof category codes that correspond to the allophones of the language. The theory describes the "bottom-up" aspects of phoneticperception, and it is conceptual in nature. Stage1 of the theory is the transformation of the
2120 J. Acoust.Soc. Am., Vol. 85, No. 5, May 1989
intoned by a speaker, recorded, and digitallysampled at 20 kHz. It wasthenfilteredwith a high-pass filter setto 50 Hz to remove low-frequencynoise. After high-frequency preemphasis, a short-term spectral analysis wasperformed. In our work, we routinelyusea linear predictionanalysis with 24 poles.A Hammingwindowof 24 ms wasusedand moved in 1-mssteps, andthusa spectral envelope wascalculatedfor eachmillisecond of speech. In Fig. 10,an example of a spectral envelope from the vowel/IY/is shown.In the auditory-perceptual theory,it is assumed that the listener's auditorysystem derivesa sensory-spectral envelope that is nearly equivalentto the one shown.It is then assumed that this spectrumcan be concisely represented as a sensory pointer in a phonetically relevant auditory-perceptual space (APS). The vowelspectrum is, of course, a glottal-source spectrum andwill be represented asa glottal-source sensory pointer(GSSP) in the auditory-perceptual space. To locate the spectrum in the auditory-perceptual space, the center frequencies of the first three significant prominences are identifiedas sensory formant 1, $F 1; sensory formant 2, $F2; and sensory formant 3, $F3. Furthermore,a sensory reference $R is located.As described in AppendixA, the sensory reference serves a varietyof importantfunctions in the auditory-perceptual theory,including talkernormalization and the disambiguation of certainvowels. The sensory reference is believed to depend on the talker'saverage vocal
James D. Miller:Vowel perception 2120
FRONT
dB
SIDE
9O
90
8O
80
7O
70
z
6O
!
5O
z
Y
4O
.... I
SR
.......
SF1
I
Frequency kHz
1.0
t I I I '' ,I
SF2
SF3 10.0
0.1 (REF)
FIG. 11. Sensory(top) and perceptual (bottom) pathsin the auditoryperceptualspaceare shownfor the vowel /IY/ as intoned by a male speaker(JH). These are shownin front and sideviews.The envelope of the perceptual target zone for/IY/ isshownasis the locationof the glottal source neutral point (GSNP). A pyramidis plottedon the path every 80 ms. The reducedfigure (middle) showsthe perceptualpath with the
locations of the relevant items in the
FIG. 10.A spectral envelope forthevowel/IY/as intoned by a male speaker (JH).Thesensory reference ($R) and thefirst three sensory formants ($F 1,$F2, $F3) arelocated bythevertical lines. Thevalues ofthe dimensions--xyz---of theauditory-perceptual space areindicated asthe
distances between thevertical lines. Since GMFO = 120Hz, $R = 150Hz
APS. Here, x, y, and z axesare in 0.1 logunitsandthepointof originis (0,

0,0).
byEq.(5).The values ofthe formants are $F 1= 310 Hz,$F2 = 2450 Hz,

andSF3 = 2810 Hz. Thecorresponding values ofxyzare0.060, 0.315, and
0.898, respectively.
characteristics and on appropriately filtered pitch modulations. Formost ofthepurposes ofthepresent paper, asimplified expression can beused. Thisisgiven byEq. (5)' SR = 168(GMF0/168) /3, (5)
GSNP
pyramid isshown every 80ms. Thewire-mesh volume indiin Fig. 11is theperceptual target zone for thevowel where GMFOisthegeometric mean ofthecurrent speaker's cated
voice pitch. Formale speakers, a typical value of GMFO is
/IY/.
In our earliestwork (Miller, 1984b), perceptual tar-
were represented asspheres centered onpoints in 132Hz andthe corresponding $R is 155Hz. For female getzones toestimates ofprototypical formant pospeakers, a typical value ofGMFO is223Hz and thecorre- APScorresponding sponding $R is 185Hz. Forchildren 7-10years of age, a sitionsfor each vowel. It was soondiscoveredthat the data thetarget zones tobelarge andirregularly shaped typical value ofGMFO is263 Hzand the corresponding $Ris required
195 Hz.
volumes. Theparticular zone shown for/IY/in Fig. 11is

ond-iteration zones are referred to as the I2 zones and are
Returning toFig.10,wenote thatthisspeaker's GMFO at thetimeof thespectrum was120Hz, and,therefore, the sensory reference was 150 Hz, asindicated inthefigure. The
formants $F 1,$F 2, and$F 3 have center frequencies of 310,
theresultof oursecond attemptor second iteration in defin-
ingtheperceptual target zones forsimple vowels. (Such secdetailedin Sec.III C.) It is dear that this examplefalls inside of the I2 zone for the vowel/IY/.
2450, and2810Hz, respectively. Thisspectrum islocated in

theAPS asfollows. The variable y isdefined asthelogarithmic distance from $R to $F 1:
In the uditory-perceptual theory, however, it isnot the

sensory path oritscharacteristics thatdirectly determine the perceived vowel. Rather, it istheperceptual path thatishypothesized tobecritical. A perceptual path always begins at the glottal-source neutral point,whichrepresents an abstract home position fortheperceptual response thatissimply related to thelocus of thespectrum associated witha
y = i log SF1) -- (log SR ) = log (SF1/SR ).

SF 1 to SF2:
(6)
The variable z is defined as the logarithmic distance from
z= (logSF2) -- (logSF1) ----log(SF2/SF1). (7) uniformvocaltract. In the auditory-perceptual theory,evFinally, thevariable x isdefined asthelogarithmic distance ery perceptual pathbegins andends at the glottal-source from SF2 to SF 3: neutral point(GSNP). The perceptual pathis calculated x= (logSF3) -- (logSF2) = log(SF3/SF2). (8) fromthe sensory input.The glottal-source sensory pointer defined by the sensory formants and sensory reference drives The pointassociated withthisspectrum hascoordinates response orperceptual pointer (PP) through x -- 0.060, y -- 0.315,andz = 0.898. These coordinates de- theperceptual resonantsystem.It finetheposition of theGSSP forthisspectrum. Of course, the APS in the manner of a second-order sensory pointer andtheperceptual during a sustained vowel, many such spectra areproduced, isasif theglottal-source were joined bya spring andtheAPSoffered viscous andthese arerepresented in theAPSatthetopofFig.11in pointer
frontandsideviews. Here,thesensory pathof the GSSPis resistance. Whenthe sensory pointerdisappears, a similar, but much weaker, spring is activated that attracts theperplotted foreach millisecond olintoned vowel, and adirected
2121 J. Acoust. Soc. Am., Vol. 85, No. 5, May 1989
JamesD. Miller:Vowelperception
2121
ceptualpointerbackto the glottal-source neutralpoint.The modelassumes that sensory pointersand the glottal-source neutralpointhaveverylargemasses relativeto the perc.eptual pointer,whichis arbitrarilyassigned a mass of 1.0unit. In this way, the perceptual response cannotact backon sensoryinputsor the neutralpoint.The mathematical relations between the sensory pointer (sensory response) andthe perceptualpointer (perceptualresponse) constitute an important part of the stageII or sensory-perceptual transformation of the auditory-perceptual theory (Miller et al., 1988).
Furthermore, the mathematics allows one to calculate val-
FRONT
s IDE
..............
.................
..................
.....
FIG. 12.Sensory (top) andperceptual (bottom) pathsin the auditory-perceptual space for thesyllable/LAA/produced by a female
speaker (LT). These are shownin front and
. GSNP
uesof the perceptual reference (PR) and the perceptual formantsPF 1,PF 2, andPF 3. The perceptual pathsogenerated
in the APS is shown in front and side views at the bottom of
sideviews. Thereis a breakin each pathat each millisecond. A pyramid isplotted onthe perceptual pathevery 80ms.Preliminary estimates of thetarget zones for the/L/(light 1) andthevowel/AA/are shown. Theperceptual pathcanbe interpreted asentering and
activating thesezones. The reduced(middle) figureshows the perceptual pathwith the locations of the relevant items in the APS.
Fig. 11. It is alsoshownin reducedsizein the centerof the figureto provideorientationwith the axesof APS. It canbe seenthat the perceptualpointer leavesthe glottal-source neutralpoint,entersthe I2 zonefor the vowel/IY/where it
remains for over 400 ms, and then returns to the GSNP.
In the auditory-perceptual theory, a perceptualtarget zone is activatedto issuea neural symbolor categorycode corresponding to its associated allophone only whenthe perceptual pointer performsa segmentation maneuverwithin the zone.While the exactnatureof the segmentation maneuver has yet to be formulated,it has been noted previously (Miller, 1984b; 1987a,b) that suchmaneuversare associated with decelerations andrelativelowsin velocityof theperceptual pointerandwith highcurvaturepointsalongitspath. In our samplecase,the perceptualpointer decelerates within the target zone and exhibitsa long period of low velocity. Also, the path shows highcurvatureasit goes to andreturns from its average locationinsideof the perceptual targetzone for/IY/. In terms of the auditory-perceptual theory, this stimuluswould inducethe perception of/IY/because the perceptualpointer entered the perceptualtarget zone of /IY/and executed a segmentation maneuver that caused the target zone to be activated.Also, it is assumed that the activation of the target zoneis maintainedduring the sustained low-velocity part of the path. It is alsonecessary to modelthe loudness of the perceptual response, aswell asits spectralcharacteristics. For this purpose, it is assumed that the loudness of the perceptual .pointerrapidlygrowsoversome10-70 ms at the beginning of a utterance anddecays moreslowly, perhaps overa period of 100-250ms,afterthe cessation of the sensory input (Miller et al., 1988). The valuesfor the growth and decayare consistent with knowledge of loudness integration(Shaft, 1978) and forward masking (Zwislocki, 1978). In the simple caseof the sustained vowel illustratedin Fig. 11, the implicationis that the loudness of the perceptual pointer decays to zero as the perceptual pointerreturnsto its home position,the glottal-source neutralpoint. More generally, it
described. In thiscase, a female speaker uttered thesyllable /LAA/. Thesensory path isshown inAPSatthetopofFig.
12 in front and sideviews.The I2 zonesfor the consonant
/L/and thevowel/AA/areshown. Theresultant perceptualpath isshown atthebottom ofthesame figure. Ascan be seen, theperceptual pointer follows a pathfromtheGSNP
to the/L/-target zone,to the/AA/-target zone,and, then, back to the GSNP. Segmentation maneuvers, suchas decelerations, periodsof low velocity,and sharpturns,are assumedto activate first the/L/-target zone and then the /AA/-target zone.
C. Vowels in the APS
Sincethe patternsof spectralchangeassociated with vowelsare generallyrather slow,it is oftenthe casethat the perceptual and sensory pathswill be identical,or nearlyso, in thosesegments of the pathsthat triggerthe perception of the vowel.Therefore,it is reasonable to estimate the positions of the perceptualpointer associated with the vowels directlyfrom formant valuesreportedin the literature,in which casethe judgmentsof the authorsas to whetherthe reportedformant valuesare representative of a vowel are accepted. Where the pitch is not given,valuesof 133Hz for
adult males, 225 Hz for adult females, and 263 Hz for chil-
dren 7-10 years of age are assumed.In our own work, we
listento windowed segments of a path,examine patterns of velocity andcurvature in the APS, andfinallyaverage points overa segmented regionof the perceptual path. In thisway, 406 data pointswere collected for the simple,nonretroflex vowels of American English:/IY/m24,/IH/--88,/EH/m
89, /AE/23, /AA/--24, /UH/--24, and/UW/22. /A0/25, /AH/87, These include data for men,
is hopedthat the growthanddecayof loudness mayhelpto explainwhy speech canbe "heard" duringbrief silences.
B. Auditory-perceptual theory applied to a consonantvowel syllable
A slightlymorecomplicated and interesting example of the application of the auditory-perceptual theorywill nowbe
2122 J. Acoust. Soc.Am.,Vol.85, No.5, May1989
women,and childrenin a variety of phoneticcontexts and speakingrates and from a variety of laboratories. All are from singlesyllableutterances and the sources of thesedata are described in detail in AppendixB. It wasimmediatelynoticedthat the nonretroflex vowels fall in a narrow slabin the APS. This is illustratedby showJamesD. Miller: Vowelperception 2122
rotatesthe axes.Space SLAB isoftenconvenient for examining vowels,while spaceAPS is often more convenientfor examiningother allophones. Many timesthe xyz axesare shown,but the displayis rotatedsothat the vowelslabis in
the vertical.
After collectingthe data points shownin Fig. 13 and described in Appendix B, we developed target zones,the I2 zones,to represent the clusteringof the vowelsinto distinct
regio, nsoftheAPS. Since most ofthevariability ofthepoints

was along the height (y' axis) and width (x' axis) of the vowel slab and not in the depth (z' axis), the target zones were constructedwith variable outlinesin the height and width dimensions, but with uniform, that is, rectangular, outlinesin the depthdimension. The author useda computer-aidedmethod with a resolutionof 0.01 log units to draw thesezones by hand.He alsoimposed additionalconstraints on the regularityofx'y' borders,sincethe densityof the data pointswas not sufficient to constrainthe outlines.Also, he ignoredcertain pointsbecause they seemed to be unlikely andwerethoughtto beerrors.Ifi spiteof the factthat certain pointswere excludedin creatingthe zones,all data were retainedfor purposes of displayand evaluation.By this process,a set of nonoverlappingperceptual target zones was createdthat classified the data pointswith goodaccuracy. The outlinesof hand-drawntarget zonesare shownin Fig. 14. Here, the slabcoordinates y' and x' [ seeEqs. (10) and (9) ] are shown,and the triangular outline is the intersectionof the slabcoordinates with the originalxyz coordinatesat the middle of the vowel slab.The maplike character of vowel regionsis apparent. The hand-drawn zoneswere givendepthlimitsin z' that wereconsistent with the dataand then were subjected to a softwarecontouringprocess that
created wire-mesh enclosures that fit the hand-drawn con-
Z-AXIS/
/
/
'-
/, -
..:..
++
'-...&
xgY AXES
FIG. 13.The406examples ofthenonretrofiex vowels used todevelop theI2 target zones areshown in front(top) andside (bottom) views. Theregions
associated with vowels areindicated by thelabels in the front view.The side viewdramatically illustrates the existence of the vowelslab.For the side view,the axesof the APS havebeenrotatedto bring the vowelslabto the vertical. In thisparticularorientation, thex andy axes appear to superpose.
tours with an error of 0.01 log units. These wire-mesh
For bothpanels, x, y, andz axes havetic marks every 0.1 logunits andthe
pointof originis (0, 0, 0).
ingall 406points in front (toppanel)andside(bottompanel) views in Fig. 13.The middleof the vowelslabis defined by the sumplanewherex + y + z = 1.220 0.135. This means thattheslabhasa thickness 3of about 0.156logunits
or about one-half octave. The existence of the vowel slab is
almost certainly related to vocaltractacoustics asexpressed in relations between vocaltract size,fundamental frequency,
and the value of the third formant. A related, but distinct,
geometric approach to vowelformantfrequencies can be

found in Broad and Wakita (1977). Sometimes it is useful to rotate the axes so that the vowel
slabis brought to the vertical.Thiscanbeaccomplished by the equations givenbelow:

x' -- O. 7071 (y -- x),
(9)
-0.2
-0.4
-0.6 -0.4 -0.2 0 0.2 0.4 0.6
y' -- 0.8162z -- 0.4081(x + y),

z' = 0.5772(x + y + z).
(10)
(11)
Generally, wespeak of space APSashaving coordinates xyz and space SLAB as havingcoordinates x'y'z'. Noticethat this transformation doesnot translatethe origin but only
2123 J. Acoust. Soc.Am.,Vol.85, No.5, May 1989
FIG. 14. The second iteration (I2) target zonesfor the nine nonretroflex vowels of AmericanEnglishareshownin SLAB coordinates [ seeEqs. (9)( 11) ]. In thisorientation, thez' axisis perpendicular to thex'y' plane.Note the maplikecharacter of these zones with their largeareas, irregularshapes, and abuttingboundaries.
JamesD. Miller: Vowelperception
2123
1111 ill
IY
ii i [ i .! i i iTi ! Ii
[11
Ii111111111111]
il
ltl
i i 111J
,,H+H-H+t-I-H-H-H H-t
AA
qliill!llllll'illr
qlilll!!lllll!llr
"lllill!111111111r
111111dllllllll'
qli!lilliklllll]l'
111111II]rl.I 1111 I!1111IIJIIIP III!l
1111111qlq IllrllU *PI I I 1 I I J*lidl/I
I I I
11111111
Ii iii
1111111111Ir!!lr
lklll
I.!
I-J
I 1 I
I i 1 I
d I ! [ I ! 1Ll.d I I TJ ! 1I 1 ! I I I ! I !!*lrl 1 I I ! ! I i I I I*i-1 I I Ig II11111111111111
EH
AE
FIG. 15. Data pointsand targetzonesfor the vowels/IY/,/IH/,/EH/, and/AE/are shownin front (left) and side(fight) views. The numbers and percentages of pointsenclosed in eachwire-mesh targetzonecan be visualized andcompared with thenumbers presented in TableII. See Fig. 14 for the location of the individual targetzones in APS.
FIG. 16.Data pointsandtargetzones for the vowels/AA/,/UW/,/AO/, and/ER/are shownin front (left) and side(fight) views.The numbers and percentages of pointsenclosed in eachwire-mesh targetzonecan be visualized and compared with the numbers presented in Table II. SeeFig. 14 for the locationof the individualtargetzones in APS.
enclosures are shown in front view (x'y') and side view (y'z') in Figs. 15-17. Note the irregularborders in the front view and the rectangular bordersin the sideview. A sense of theaccuracy with whichthese preliminary 12 target zonesclassifythe vowel data can be obtainedfrom Table II and from Figs. 15-17. Here, all of the 406 data pointsfor the oral, nonretrofiex vowelsare shown,aswell as an additional 29 samples of thevowel/ER/for a totalof 435 points. Overall,thepresent targetzones segregated thevowelsof AmericanEnglish with 93% accuracy. Correctclassificationrangesfrom 100% for/IY/to 82% for/UW/. TableII demonstrates thata highproportion of thepoints falls in the appropriate targetzoneor in adjacent unclaimed regions that couldbeeasily incorporated intothetargetzone. Thesedataare in accord with a varietyof studies showing that humanlisteners identifyEnglish vowels in CVC context
withapproximately 95% accuracy (see, forexample, PetersonandBarney,1952; GottfriedandStrange, 1980; Macchi, 1980). Inspection of Figs. 15-17, additionally, shows that
mostof the errorsor misses are "nearmisses," whichcanbe judgedby the fact that the wire-meshenclosures illustrated
in thefigures consist of squares that are0.01logunitson a

side,or approximately 1/30 of an octave. About one-halfof
theerrors fallbetween 0 and0.03logunits, thatis,fallwithin 1/10 of an octave of a targetzoneboundary. Thesezones are characterized asbeingverylargewith irregularand abuttingboundaries. The boundaryregions
seem to beextremelysmallin relationto the sizeof the zones.
It is the hypothesis of the auditory-perceptual theorythat targetzones with the qualities just described will accurately describe the simplevowels of eachdialectof a spoken language.However, in order to accuratelydefinethesezones,
2124
J.Acoust. Soc. Am., Vol. 85,No. 5,May 1989
AH
vowels. We are currently conductingexperimentsusing vowelssynthesized from xyz valuesfrom APS to verify the perceptual significance of the present irregularlyshaped target zones.It is clearthat eachvowelis represented by a very large classof spectralpatterns. The auditory-perceptual equivalent of Lloyd'sclaimthat like formantratiosproduce like vowel percepts is that like valuesof xyz producelike vowelpercepts. This,of course, implies that like values of the ratios (SF 1/SR), (SF2/SF 1), and (SF3/SF2) produce
likev6wel percepts. Butonemust notmake thelogical error

of inferringthat like percepts havelike valuesof xyz, since the largesizesof the perceptual target zonesdisprove such an inference evenas an approximation. It is a hypothesis of the auditory-perceptual theory that the large and irregular shapes of the perceptual targetzones will account for a substantialproportionof differences in vowelsdue to coarticulation effects as well as a substantial proportionof the variability dueto differences between talkersin articulatorystyle and habit. It alsoappears that the sensory reference concept helpsto eliminatedifferences amongtalkersdueto size,age, and gender,as well as servingto disambiguate the vowels
with similar F2/F 1 and F 3/F2 ratios noted in Sec. I. There-
fore, the auditory-perceptual approachmay lead to a relatively simpleaccount of talker differences and coarticulation effects in vowelperception.
IV. OTHER RELATED DESCRIPTIONS TO THEIR LOCI OF THE VOWELS IN THE AUDITORYAS
PERCEPTUAL
SPACE
The vowels have often been described in terms of dimen-
sions of articulation.For example, chartsin ThePrinciples of

UH
the International Phonetics4ssociation(International Phonetics Association, 1949) show the loci of vowels in relation
to a high-low dimension anda front-back dimension, which represent the positionof the "tonguehump." In an alternate presentation, the front-back dimension is plottedagainstan open-close dimension that represents thejaw openingasso-
ciated withthevowel. These charts arebased in a large part

on the x-ray photographs of Jones(Jones,1914, 1919). Another description of the vowelsin termsof variablesrelated to acoustic timbre has beensuggested by Fant (1973). In this scheme, vowelsare classified along a grave-acutedimension and a compact-diffuse dimension. Graveness is relatedto the average position of F 1 andF2 prime on a mel scale, whilecompactness isrelated to separation between F1 and F2 prime similarlyexpressed in mels (Fant, 1973,p. 188). When the oral, nonretroflex vowels of AmericanEnglish are viewedin the vowel slab,relationships similar to thoseseen in the planes just described are observed. In fact, we havenoticed that verysimplerotations of the vowelslab
FIG. 17.Datapoints andtarget zones forthevowels/AH/and/UH/are

shown in front (left) andside(right) views. The numbers andpercentages
ofpoints enclosed ineach wire-mesh target zone can bevisualized and compared with the numbers presented inTable II. See Fig.!4 forthe location of
the individualtargetzones in APS.
we will haveto make improvements in our procedures and database. Theseinclude: ( 1) better procedures for the seg-
mentation ofperceptual paths, (2) better procedures forlistener verification of pointsin unbiased listeningtasks,(3) procedures for contouring setsof like-identified pointsthat are moreprecise and objective than present techniques, and
(4) much more data. Since the I2 zones were created, we
produce relationshipssimilar to those of either thearticulatory or timbre dimensions. Thesesimilaritiesare illustrated
havebeenmakingprogress towardtheserequiredimprovements (Miller, 1987c).
The largesizes of the targetzones, alongwith their irregular boundaries, may proveuseful in accounting for muchof the measured variabilityin acoustic spectraassociated with
2125 J. Acoust. Soc.Am.,Vol.85, No.5, May 1989
in the four panels of Fig. 18. For thesefigures, the average valuesof xyz were calculatedfor eachvowel from the mean data for men,women,and childrenof Peterson and Barney (1952). The severalorganizations of the nonretroflexvowels of AmericanEnglishshownin Fig. 18 are simplerotationsof the originalxyz dimensions of APS. To aid in illustratingthispoint,the intersections of the coordinate planes
JamesD. Miller: Vowelperception 2125
TABLE II. Numberof tokens of each vowel enclosed byeach targetzone.Here,L isthetargetzonefor light-1, AUC istheadjacent unclaimed region, NAUC is the nonadjacent unclaimed region,and the numbers in parentheses includepoints in AUCs ascorrect.
Target zones
Vowels
IY IH
IY
24 "'
IH
EH
AE
AA
AO
AH
UH
UW
ER
AUC
NAUC
Total
24
Proportion correctly enclosed

1.00 0.99
.................................... 87 ............ 1 ..................
88
EH
AE
......
......
87
2
.........
21
1
...........................
............
1
2
...
"'.' 1
89
23
0.98
0.91
(0.99)
(1.00) (0.91) (1.00)
AA
AO
............
............
22
!
..................
24 .....................
24
25
0.92
0.96
AH UH
UW
...... ...............
.....................
1 1
73 2
2 20
1
......... ............
18 -" 3
6
......
87 24
22
0.84 0.83
0.82
ER Total
...... 24
87
1 91
.................. 22 26 26
77
23
18
27 27
". 3
1 10
." 1
29 435 Total Correct
0.93 92.6 403
(0.97) (94.9) (413 )
of APS,thatis,thexy,xz, andyz planes, withtheillustrated B = -- 0.3536(y -- x) -- 0.7068(z) + 0.3534(x + y), plane areshown by thetriangle plotted oneach panel. The (12) upperleft panelshows the vowelplanein slabcoordinates. H1 =0.4081(z) --0.2041(x +y) --0.6123(y--x), (13) In thiscase, x'y'z'areasdefined in Eqs.(9), (10), and ( 11), where z' is fixed at 0.704. whereB is backandH 1 is highin thisrotation. The dimensions shown in panel(c) aregiven by Thefront-back andhigh-low dimensions of panel(b)
can be calculated as
O = 0.3536(y - x) - 0.7068(z) + 0.3534(x + y),
(14)
-;-
iHIGH
1.0
0.6
","; .... '..... 'r .......
'
u..... ,,. .... 'v..... ,....;,;..
(a)
FR0! \
.
0.6
UW
UH
AH
-'"" ,o. TM X EH
'
ACK
AO '
..... '..... "" /0.6 o.8 !.o
FIG. 18. The locationsof the oral,

nonretroflex vowels of American
English areshown in fourcoordinate systems that are simplyrotations of the auditory-perceptual space.In
panel (a), the SLAB coordinates definedby Eqs. (9)-( 11) are used.In panel (b), the articulatory dimensions of Eqs.( ! 2) and (13) areused. In panel(c), thearticulatory dimensions of Eqs.(14) and( 15) areused. Finally, in panel (d), a rotation [ Eqs. (16) and (17) ] is usedthat providesdimensions similar to the grave-acuteand compact-diffuse dimensions of Fant (1973). The symbols are locatedat the mean positions of the combineddata for men,
women, and children of Peterson
-0.6
iLOW
(c)
IH EH AE
and Barney (1952).
":'11; .............. uw u.
-0.6
............... ,,.;..
ieAcK
2126
J. Acoust.$oc. Am., Vol. 85, No. 5, May 1989
James D. Miller:Vowelperception
2126
TABLEIII. Transformations ofxyz toother dimensions fordescription of

the vowels.
flex vowels,while the average for/ER/was

tokens of the vowels /UW/ and /AO/
1.02.Only a few
have values of
Variable a
x' y'
B H 1 O H 2 ,4 D
x
-- 0.7071 -- 0.4081 0.7070 0.4082 -- 0.0002 -- 0.8164
-- 0.8065 --0.1273
y
0.7071 -- 0.4081
-- 0.0002 -- 0.8164 0.7070 0.4082 0.5139 --0.6344
z
0.0000 0.8162
-- 0.7068 0.4081 -- 0.7068 0.4081 0.2925 0.7620
(x + y + z) assmallasthe largest values for tokensof/ER/. However, in regionsof the vowel slab closeto the/ER/ region,all oral vowelshave valuesof (x + y + z) that are largerthan that of any/ER/token. Thus,it is clearthat the
retroflex vowel/ER/falls behind the nonretroflex vowels of
American English. Furthermore,it is possible to calculate retroflexion asa

function of the variables$R, $F2, and $F 3. If one definesK as constant with a value of about 7.8, retroflexion should
Seetext andEqs.(9)-(17) for definitions.
H2 = 0.4081(z) -- 0.2041(x q- y) q- 0.6123(y -- x),
(15)
where O is openand H 2 is high in this context. Finally, in the lowerright panel[ panel(d) ], horizontal
and vertical axes are
peakasthegeometric meanofSF 2 and$F 3 approaches KSR and as$F2 and $F 3 approachequality.Otherwisesaid, as $F2 and $F3 approach1510, 1793, and 1983 Hz for men, women,and children,respectively, retroflexion is indicated. The measureof nasalizationutilized in the auditoryperceptual theory is based on the observations of Fant (1970), Fujimura (1962), and Stevens et al. (1987). They showthat a complexinteractionof the nasalpoleand nasal
zero with the first resonance of the oral tract results in two
A -- 0.6602(y -- x) q- 0.2925(z) -- 0.1463(x q-y),

D -- 0.7620(z) -- 0.3810(x q- y) -- 0.2534(y -- x),
where A is acute and D is diffuse.
(16)
(17)
Thesesametransformations are givenin anotherform in Table III. Each variableis calculatedby multiplying the valuesofxyz by the indicatedcoefficients and summingthe resultingproducts. Other articulatorydescriptions of vowelscanbe related
to the dimensions of the APS. Consider the retroflexion of
peaks in thelowerpart of thespectrum. While, for oral vowels,thesetwo peaksare merged,during nasalization we define the lower of thesetwo peaks as the low first sensory formant (SF 1L) and the higher of thesetwo peaksas the
highfirstsensory formant (SF 1H), independent of their

bases in vocaltract acoustics. Suchinterpretationis simple in the caseof nasalization of a vowel with a high F 1. Here, the nasal pole corresponds to $F 1L and the nasalizedF 1
corresponds to$F 1H,while thenasal zero isassociated with

the valley betweenthe two and shiftsthe nasalizedF 1 up from itsoralposition. The case of thenasalization of a vowel with a low F 1 is different.Here, the nasalpole and oral F 1 caninteract to produce a peakcorresponding to $F 1L. The nasalzero interacts with the falling spectrum to produce,
vowels. Asiswellknown, $F3 drops close to$F2 'or the

/ER/vowel ofAmerican English, andboth approach a value of about 1510 Hz for males, 1793 Hz for females,and 1983
Hz for children (Petersonand Barney, 1952). In terms of the dimensionsof APS, this meansthat the values of x and the valueof the sum (x q- y q- z) will decrease during retroflexionof a vowel.This further impliesthat the/ER/vowels
will fall behind the vowel slab defined for the nonretroflex
withincreasing frequency a valley andthena local maximum in the spectrum. This local maximum is taken as $F 1H, eventhoughit mayhaveno corresponding resonance
in the vocal tract. Otherwise said, $F 1L and $F 1H are de-
vowels. This is the case. We have used 29 values of/ER/in
creatingthe I2 zone for/ER/(see bottom of Fig. 16). As illustratedin Fig. 19,thesepointsfall behindthe vowelslab. The average valueof (x q- y q- z) was 1.22for the nonretro-
finedasthe two low-frequency peaksthat appearin the spectral envelopes of nasalized sonorants. The degree ofnasalization thenisdescribed by theequation
N = log(SF 1H/SF 1L),
(18)
where SF 1L is the low first formant and SF 1H is the high

first formant. In oral vowels,N has an effectivevalue of zero.
FIG. 19. Side view of the vowel slab
When nasalization is present, the dimensions of the APS are given by x = Iog(SF3/SF2), y = log(SF1L/SR), and z
showing thatmost of the29 datapoints fortheretroflex vowel/ER/(triangles)

fall behindthe data pointsfor the oral,
nonretroflex vowels (crosses). In this
= log(SF2/$F 1H). Bythisformulation, (x + y + z) llecomes smallduringnasalization because, in essence, the value of N is subtractedfrom the sum. That is, in the oral case,
orientation, thex andy axes appear to

superpose. Here, x, y, andz axeshavetic
(x + y + z) = log(SF3/SR ), wile, inthe case ofnasalization,
marksin 0.1 logunitsand the pointof

origin is (0, 0, 0).
x q-y q-z = log---- -- log

SR
SF 3
SF 1H
SF 1L
= log
SF 3
SR
-- N. (19)
Thus nasalizedvowelswill fall well behind the vowel slab whenN is large.
. Finally, consider lip rounding. Fant (1970) hasnoted that lip rounding tends to lowerthe formantpattern. We
haveobserved in comparing/UW/with/IY/in American Englishthat the rounded/UW/tends to betowardthe back portionof the vowelslab,whilethe spread vowel/IY/tends to be toward the front portion of the vowel slab.Jongman and Fourakis (1987) havenoteda similartrend in comparingrounded andunrounded vowels in German.Thusrounding is indicatedby a movementwithin the vowelslabfrom largervaluesof (x 4-y 4- z) to smallervalues of this variable. A pure measure of roundingin termsof the auditoryperceptual variables that is uncontaminated by other variables,suchasnasalizationand retroflexion,has not yet been derived,but it seems likely that sucha measure canbe found.
A second mechanism proposed in theoryis to havethe $R influenced by pitchmodulations. Thishypothesis is motivatedby the well-known relationships of pitchmodulation to stress anda desire, then,to incorporate suchmodulations in the sensory and perceptual pathsof an utterance. In this versionof the theory,
$R = 168(GMFO/168) 1/34- FIL(mod F0),
(20)
where thetermFIL (modF 0) ismeant to represent themodulations in F0 afterbandpass filtering. By thisapproach, the $R wouldnotbe influenced by the veryslowchanges in F0 associated with pitch declination(Cooper and Sorensen, 1981 ), nor wouldit be influenced by the rapidchanges that In this section, we have indicated how the variables of occur at voice onset and offset. However, appropriate pitch the auditory-perceptual theorycanbe relatedto other,percouldshiftthe locationof a spectrum in APS, haps morefamiliar,descriptors of thevowels. Consideration modulations and if such a shift crossed a vowel boundary, then the identiof theserelations suggests that fivevariables arenecessary to ty of the perceived vowel would be changed. Quantitative disambiguate the vowels. Theseare$R, $F 1L, $F 1H, $F 2, evaluation of thishypothesized relationis a matterfor future and$F 3. With these descriptors of the acoustic spectrum, it attention. appears that onecancharacterize vowels with regardto retroflexion,nasalization, and rounding,aswell aswith regard to tonguepositionand jaw opening.Furthermore,the five B.Transposition ofvowel Spectra
variables described above can be combined into the three
dimensions of the APS to providea concise description of vowelspectra that canbe easilyvisualized.
V. EXTENSIONS OF THE AUDITORY-PERCEPTUAL APPROACH TO OTHER ISSUES IN VOWEL PERCEPTION
Studies of the effects of transposition of vowelspectra

beganwith mechanicallysynthesized vowelsin the 1800s,
In thissection, extensions of the auditory-perceptual approachto otherissues in vowelperception arebrieflyindicated. Specifically, issues of the relationshipof voicepitch to perceived vowel category,of the transposition experiment, of diphthongs and diphthongization of vowels,and of the nature of the boundariesand vowel target zonesare discussed.
andtheir numberwasgreatlyincreased asEdison's phonograph became available in laboratories. Amongmanypublications on, or relatedto, thistopicarethose of Bell (1879), Lloyd ( 1890a,b, 1891,1892,1894),Pipping( 1893 ), Fletcher (1929), ChibaandKajiyama ( 1941), Ochaietal. ( 1955), Kurtzrock (1956), Tiffany and Bennett(1961), Klumpp
and Webster (1961), Slawson (1968), Daniloff et al. (1968), Fujisaki and Kawashima (1968), and Sekimoto
(1982). Now, if the formant-ratio theoryof vowelperception were strictlytrue, vowelidentitywouldbe maintained
A. Voice pitch and perceived vowels

It is well known that, under most conditions, the identi-
ty of a perceivedvowel depends stronglyon the formant valuesof the spectrumand is independent of voicepitch. However, under certain circumstances the voicepitch can influence a vowel's identity, even when the formants are fixed (Fant et al., 1974; Miller, 1953; Fujisaki and Kawashima, 1968;and Scott, 1976). Within the auditory-perceptual theory,two hypothetical mechanisms are includedthat would allow voice pitch, under certain circumstances, to control the perceptionof vowel category.One mechanism followsfrom the fact that a change in the geometric meanof the fundamental(GMF O) results in a change in the sensory reference(SR)--Eq. (5)--and thus the value of y, since y -- log($F 1/SR ). Therefore, a change in GMF 0 cancause a spectral representation to shiftits locationin the APS. Since
asthevowel spectrum istransposed upor down thelogarithmic frequencyaxis. The experimental resultscited above confirm thisprediction within limits,andthe remaining issues relateto the rangeof transposition andthe causes of its limits.Early apparatus, suchasthe Ediphone used by Bell (1879), suffered serious distortion astherange of transpositionwasincreased. Laterwork is less subject to interpretation in termsof distortions produced by the experimental procedures. One reasonable hypothesis is that transposition islimitedby reduced resolving powerof theauditory system in thelowandhighranges, asisexemplified in theincrease in the Weber fractionfor frequency discrimination below 500 Hz and above2000 Hz, as measured by Wier et al. (1977). Such reductions in resolving powermaycause certain spectral peaksto fuseand either shift the identification or cause
the vowelto become indistinct. Thisviewis partiallysup-
ported by thefinding that women's speech cantolerate more downwardtransposition than men's speech, while men's speech can toleratemoreupwardtransposition than womthetargetzones arelargeandchanges in F0 arecompressed en's (Klump and Webster,1961). Anotherpossible limitaby Eq. (5), mostchanges in GMFO will not change vowel tion on the transposition experiment followsfrom the audiidentity, andthisisconcordant witheveryday experience. If, tory-perceptual theory.In order to maintainvowelidentity however, thespectrum liesneara boundary of a vowel target in theauditory-perceptual theory,theratio off 1to $R must zone,thenan appropriate shiftin GMFO cancause the locus be maintainedsuchthat a vowel boundaryis not crossed. of the spectrum in APS to shiftfromonetargetzoneto an- Since most transpositionexperiments have not made other,thuschanging the identityof the perceived vowel. changes in F0 relativeto F 1 so asto maintainappropriate
J. Acoust. Soc.Am.,Vol.85, No. 5, May 1989 JamesD. Miller:Vowelperception 2128
2128
F 1/SR ratios, this condition has not been met. Therefore,
someof the observed failuresof transposition may relate to changes in the F 1/SR ratio. Perhapsfuture researchwill clarify this possibility. Eventhoughall of the conditions for successful transpositionare not yet knownand furtherexperimental work can beof interest, the results of the citedexperiments in andafter the 1950sall demonstrate that transposition holdsreasonably well over the rangesnormally encountered in the vowel
sounds of children, women, and men. Such results are con-
the apparentboundaries of vowelcategories can be shifted by a variety of experimental factorsis conclusively demonstrated.Therefore,avenues for the explanation of changes in boundaries must be provided. One suchavenue is through"top-down"processing. As suggested elsewhere(Miller, 1987b), top-downprocessing
sistentwith the auditory-perceptual interpretationof the

vowel.
C. Vowels with changing formants
Naturally produced vowels often include formant movementsas discussed by Nearey and Assmann (1986), Nearey (1989), and Strange(1989). Of course,it hasbeen
well knownthat suchmovements play a role in the perceptionof diphthongs andin theperception of obviously diphthongizedsimplevowels.In the recentwork of Nearey and Strange, however,an emphasis hasbeenplacedon the possible importanceof suchmovements in the perception of the simplevowelsevenwhenthey are perceived assuch.In the auditory-perceptual theory,diphthongs or diphthonglike sequences areto be treatedeitherasphonepairsor phone-plusglide sequences, as discussed elsewhere(Miller 1987a,b; Chang, 1987b), and in thesecases it is clear that formant movements play an importantrolein controllingwhichphoneticelements are perceived. But, it is alsotrue in the caseof simplevowelsthat the auditory-perceptual theory specifies that thepath of the perceptual pointerand the dynamics of that pointerin relationto the perceptual targetzones control the perceived categoryof the vowel.While all the implications of the theory in relation to thesefactorshave not been explored,suchwork is under way in our laboratory.
D. Boundaries of vowel target zones
canbe easily integrated into the auditory-perceptual theory in a varietyof ways.One suchmechanism provides for topdownprocessing drivenby situational contexts, suchastask demands and expectations, to influence the path of the perceptualpointer.By this mechanism, the boundaries of the perceptual targetzones remainfixed,andit isthe path of the perceptual pointer that changes. In this way, it is only the boundary"ascalculated"that changes, while the "true" inherent boundaryremainsfixed. In another explanationof observed boundaryshifts,it is proposed that temporarytarget zonescan be developed basedon the stimuli beingpresentedduring an experimental session (Miller, 1987b). Of course,whethersuchexplanations proveusefulis a matter
for future research. What is of current interest is that the
auditory-perceptual theorypositsthat boundaries of target zones for the vowelsof an adult listener's nativelanguage are
fixed, and that these are the boundaries that are relevant to
the bottom-upprocessing of clearspeech in everyday listening situations. On the other hand, under conditions other than thosejust described,the apparent boundariesof the perceptual targetzones canbeshifted by a varietyof factors, as conclusively demonstrated in the cited literature.
Vl. CONCLUDING COMMENTS
The boundaries of the perceptualtarget zonesfor the vowels,asillustratedin the figures of this paper,appearto be fixed and rigid. In the theory of bottom-up processing of speech providedby the auditory-perceptual theory,this is a correctinterpretation. Indeed,the boundaries of the perceptual target zonesof adultsare assumed to be stable.We are awarethat this view of the boundaries of vowelsappearsto contradictmany published observations, for it is clear that estimatesof vowel boundariesdetermined with synthetic
stimuli are not fixed but can be shifted in location under the
The auditory-perceptual interpretation of vowelperception hasbeendescribed. Within this framework,the sensory and perceptual pathsprovidea detaileddescription of the changesin spectralpatterns of the formantsduring the courseof an utterance.It is proposed that a segmentation mechanism based on the dynamics of these spectral patterns causes perceptualtarget zonesto issueneural symbols or categorycodesthat correspond to the vowel sounds. Furthermore, it is demonstratedthat the preliminary target zonesdescribed in this papercan classifythe presentdatabaseof American Englishvowelswith 93% accuracy. The auditory-perceptual theory is a descendent of formant-ratio theory and attempts to deal with the latter's shortcomings. A sensoryreferenceconceptis introduced that not only servesto improve talker normalization,but alsomakes theinterpretation oftheF 2/F 1andF 3/F 2 ratios
contingent onthe F 1/SR ratio. Since $Risaielatively weak

function of the averageF0 (pitch), this conceptbrings a factor akin to the absolutefrequencylocationof the formantsinto the theoryasa determiner of vowelquality,and
influence of a variety of variables. Theseinclude:( 1) stimulusrange(Ades, 1977;Goldberg,1986), (2) adaptation and anchor effects(Morse et al., 1976; Sawuschand Nusbaum, 1979;Repp et al., 1979;Sawusch et al., 1980;Fox, 1985), (3) ordereffects(Fry et al., 1962;Repp et al., 1979;Cowan
and Morse, 1986), (4) contextual effects (Millar and Ains-
thisappears to allowthedisambiguation ofvowels thathave

similarF2/F 1 and F 3/F2 ratios.The auditory-perceptual
worth, 1972;Repp et al., 1979;Macchi, 1980;Gottfried et al., 1985), and (5) experimental designand methodsemployed (Macmillan et al., 1977, 1987;Ades, 1977;Pisoni, 1973). Even thougha numberof thesestudies provideconflictingresults on similarissues, the overallimplicationthat
theory alsooffers at least a partialsolution to thecoarticulationproblem bypositing verylarge target zones withirregular borders.In this way, very large differences can exist among vowelspectra that are all associated with the same vowel phoneme. It isalso proposed thatcareful study of the dynamics of spectral change will revealthe hypothesized segmentation maneuvers thatactivate theperceptual target
zones.Furthermore,a mechanism wherevoicepitch canal-
TABLE AI. Normalization of average ratiosof F 1 to F0. a

Men Women Children
ter the perceived vowelcategory is described. Finally,the

relations between articulatorydescriptions of the vowel,in-
cludingtonguebodylocations, retroflexion, nasalization, andlip rounding withthelociof thevowels in theauditoryperceptual space, arediscussed andshown tobeamenable to quantitative description in theAPS. Eventhoughfurtherevaluation and refinement of the concepts introduced in thispaper areneeded before themerit of the auditory-perceptual approach canbe decisively assessed, it doesseemto providea usefulorganizing framework for further study of the acoustic and auditory
correlates of vowel sounds.
GMFO GMF 1
132
479
223 545
264 637
log[(GMF1)/(GMFO) /3]
y -- log(GMF 1/SR)
1.97
0.49
1.95
0.47
2.00
0.52
aDataare from Peterson and Barney (1952).
ACKNOWLEDGMENTS
This work was supported by NIH ProgramProject

Grant NS 03856, NIH Grant NS 21994, and AFOSR Grant 86-0335to CentralInstitutefor the Deaf. Many people have workedwith me at onetime or anotheroverthe pastseveral
years onthetopic of vowel perception andanalysis. Included in thisgroup are:A. MaynardEngebretson, N. RaoVemula, DeborahG. Servi,Nancy Dunlop, Robert H. Gilkey, and
Arnold H. Heidbreder.More recently,Marios S. Fourakis, Allard Jongman, Steven J. Sadoff, Frank E. Kramer,Joan A. Sereno, andLynneW. Shields havemade contributions to this work. With regardto this application of the auditory-
In this way, the absolute locationof $F 1 becomes a significantfactorin the description of the spectrum asa locus in the auditory-perceptual space. Since, asa rule of thumb,the author used125Hz asan average pitch of the adult malevoice and 225 Hz asan average pitch of the adult femalevoice,he selected their geometric mean,168Hz, asthe reference frequency. Of course, betterestimates of the meanpitchof all humanspeakers might leadto a betterreference frequency, but its value is surelyboundedbetween 150 and 200 Hz. The second goalwasto eliminate or reduce theeffects of talkerageandsexon the variableywithoutlosing theadvantageof a reference point.The solution wasto allowthereferencefrequency of 168 Hz to be shifted,depending on the talker'svocalcharacteristics. As an initial stepin this direction, the talker'smeanpitchwaschosen asa measure of his
or her vocal characteristics. This variable was selected since
perceptual theory to thevowels, HisaoM. Chang andJohn
W. Hawksdeserve special mention, astogether we collected data in our laboratory,organizeddata from the literature, and developed the I2 targetzones as well as a varietyof graphics routines for the appropriate visualization of the
it brought pitchin asa defining characteristic of thespectral patternandgavepitch a role in defining a paththroughthe auditory-perceptual space. The next stepwasto definethe sensory reference ($R) to be a function of the geometric meanof the talker'sfundadata. The detailed and constructive criticisms of the two remental frequency,such that the averageratio (SF 1/SR) viewers, Bjorn Lindblomand an anonymous reviewer, and across all glottal-source spectra wouldbeindependent of the of Editor JoanneL. Miller, are greatlyappreciated. talker'sage and sex. To estimatethe form of the desired function, the geometric meanof the firstformants(GMF 1) APPENDIX A: THE SENSORY REFERENCE andfundamental frequencies (GMFO) reported by Peterson The conceptof the sensory reference was developed and Barney (1952) were separately calculated for children, withfourgoals in mind.( 1) Onegoalwasto correct themost women,and men. It wasthen found that the equation
glaring errorof the formant-ratio theory of vowelquality,

that is,thefailureof thistheoryto distinguish certainvowels,
such as/AA/and/AO/or/UH/and/UW/, as illustrated
k ---log[(GMF1)/(GMFO) 1/3 ]
(A2)
results in a valueofk that isnearlyconstant across talkers,as
by the work of Potterand Steinberg (1950). (2) Another shown in Table AI. It can be seen that k has a value of about 1.97,which is nearly independent of talker sexor age.It is goal was tomaintain theeffective elimination oftheeffects of not claimed that the proposed function is provento be optitalkerageandsexachieved by formant-ratio theory.(3) A mal; rather, it is claimed to be a simple, working approximathird goal was to introduce a mechanism whereby voice tion to a desired function that should eliminate all differpitchcould,undercertainspecial circumstances, influence
vowel identification, as indicated in the literature (Fant et
ences in k between talkers.
al., 1974;Fujisaki and Kawashima, 1968;Miller, 1953;and Scott,1976). (4) The fourth goalwasto introduce a mechanism whereby pitch modulationsof certain frequencies couldinfluence sensory andperceptual paths,andthusinfluencethe segmentation process aswell asvowelidentity. The approach to the first goalwasto establish an absolute reference point againstwhich the positions of the first threeformants are to bejudged.This wasaccomplished [in Eq. (6) ] by definingthe variable y in the auditory-perceptual spaceby y ---log(SF1/SR).
2130 J. Acoust. Soc.Am.,Vol.85, No.5, May1989
Now, an equationwasneededthat would meetthe requirements that the sensory reference always bedefined and that it quicklyconverges to a shiftedvalueas information about the current talker's voice pitch becomes available. This is done [in Eq. (5) ] by letting
$R = 168( GMFO/168 )1 /3.
(A3)
(A1)
By this formulation,the valueof $R is shiftedfrom the absolute valueof 168 to a valueappropriate to the current talker,such thataverage value y in Eq. (A 1) isnearly identical across talker groups,as shownin the last row of Table AI. Since eachtalker'saverage pitchis fairly stable, thevowJamesD. Miller: Vowelperception 2130
els/AA/and/AO/or/UH/and/UW/can
be distin-
guished alongthey dimension evenwhentheyhavesimilar values of x and z. Furthermore, in special caseswhere changes iny canshifta vowelacross theboundary of a vowel targetzone,the identification of the vowelcanbeinfluenced by voicepitch. Thus the sensory reference conceptas presented seems to achieve the firstthreegoals mentioned in the introductionto this appendix. Our approach to thefourthgoalisindicated by Eq. (20) givenin the text; that is,
common to thegeneral environment in which judgments are being made; andshort-term factors maybeeffects of precedingstimuli or sequences of stimuli. In thecase of thesensory reference of the auditory-perceptual theory,the parameter of about168Hz isthought to bebased on long-term factors; the shiftproduced by the currenttalker'saverage pitchis conceived of asanintermediate-term factor; andthepossible effectof pitchmodulation is thoughtof asa short-term factor. Viewedin thismanner, thesensory reference of theauditory-perceptual theoryisa special case of theadaptation level concept.
APPENDIX B: VOWEL DATABASE
SR = 168(GMFO/168)/3 + FIL(mod F0),
(A4)
where FIL(mod F0) indicates modulations of F0 filtered
to eliminaterapid changes at pitch onsetand offset,as well as the very slow changes or pitch declinations that do not seemto carry informationrelevantto the allophonicstructure of an utterance. As stated in the text, this last term,
FIL(modF0), has yet to be evaluatedin our research. Nonetheless, even though it seems likely that appropriate pitchmodulations canserve to significantly alter sensory or perceptual paths,whetherpitch modulations (and which ones) can influencethe allophonicsegments of speech are
matters for future research.
The sources for the data usedin the development of the I2 target zonesare summarizedin Table BI. The entry for eachsource indicates how manydatapointswereconverted into x, y, and z coordinatesin the APS and whether the speakers were male (M), female (F), or children (C). The numbers beforeeachsource correspond to the numberof the
relevant comments that follow.
( 1) These arethemean formant values forthreegroups of speakers (men,women, andchildren) fromTableII (p.
183) in Peterson and Barney (1952).
Finally, it is notedthat the sensory reference concept is similarto the more generaladaptation levelconceptof Helson (1964). Helson'sconceptis that stimuli are judged in relationto an adaptationlevel or reference. The adaptation levelis generally thoughtto be controlled by long-termfactors, intermediate-term factors, and short-term factors.
(2) These arevalues takenfromTables I-III (pp. 1719) in Peterson ( 1961).
(3) The values are from Table IV (p. 52) in Holbrook

and Fairbanks (1962).
(4) Thesevalues are for onemale speaker from Table

VIII, list 1 (p. 21 ) in Lehiste(1962). The additionalvalues
Long-term factorsmayinclude biological structure andglobal pastexperience; intermediate termfactors maybefactors
for/ER/are theaveraged formantfrequencies listedin Ta-
TABLE BI. Sources for the435 datapoints used in thecreation of thetenperceptual targetzones.
Sourc
(1) Peterson and
IY
1M
IH
1M
EH
1M
AE
1M
AA
1M
AO
1M
AH
1M
UH
1M
UW
1M
ER
1M
Total
Barney, 1952
1F
1C
1F
1C
5M 2F
1F
1C
5M 3F
1F
1C
4M 2F
1F
1C
4M 3F
1F
1C
5M 3F
1F
1C
5M 3F
1F
1C
5M 2F
1F
1C
3M 2F
1F
1C
4M 2F
30
(2) Peterson, 1961
4M 3F
99
3C
3C
3C
3C
3C
3C
3C
3C
3C
3C
(3) Holbrook and Fairbanks, 1962
1M
1M
1M
1M
1M
1M
1M
1M
1M
1M
10
(4) Lehiste, 1962

(5) Klatt, 1980
1M
1
"
1M
1
16M
1M
1
16M
1M
1
.........
1M
1
1M
1
1M
1
16M
1M
1
.........
1M
1
8M
2
17
11
(6) CID corpus I
"
16F
16M
16F
16M
.........
.........
16F
16M
.........
.........
96
(7) CID corpusII
16F
16F
.........
14F
.........
94
(8) CID corpusIII
3M 3F
3M 3F
3M 3F
3M 3F
3M 3F
3M 3F
3M 3F
3M 3F
3M 3F
3F 3F
60
(9) CID corpus IV

Totals
1M
1F
1M
1F
1M
1F
1M
1F
1M
1F
1M
1F
1M
1F
1M
1F
1M
1F
'"
'" 18
24
88
89
23
24
25
87
24
22
29
435
2131
J. Aoust. So. Am., Vol. 85, No. 5, May 1989
2131
bleV (p. 65) for seven malespeakers. For all tokens, F 0 was
set to 133 Hz.
TheARPABET notation is used in thispaper(Lea, 1980).Correspondences with IPA of symbols usedin thispaperare:/IY/--/i/;/IH/i/x/; /EH/i/e/; /AE/--/,e/; /AA//o/; /AO//a/; /AH//,/; /UH//u/;/UW/1/u/;/ER///;/L/--/1/;/D//d/;/B//b/;
and/R//r/.
(5) Theseproposed formant valuesfor synthesis are fromTableII (p. 986) in Klatt (1980), withF0set equal to 133Hz. All values arefor nondiphthongized vowels. In addition, a second/ER/token (Table III, p. 987). is basedon the valuesfor/R/
2The concept oftheauditory-perceptual space isthatspectral shapes can be

characterized by a few, perhaps three, dimensions. Similarspaces have beensuggested previously by authorssuchas Peterson(1952), Shepard (1972), and Pols (1977). There exist numerouspossibilities for deriving these dimensions. For purposes of initiatingexploration of thisconcept, in
our work we have based our dimensions on the locations of the center fre-
(6) The CID recordingswere made in an anechoic chamber using a high-quality microphone (B&K 4179/2660). All subjectswere speakersof Midwestern AmericanEnglish.For this corpus(CID corpusI), two maleandtwo female subjects wererecorded reading listsof
CVC words where C was either/D/or/B/. The lists were
readat botha normaland a fastrateof speech. The recordingswere digitizedat 20 kHz and spectral measurements weremadeat approximately the middleof the vocalicportion of the utterance (38%-50% from vowelonset).SpecBladon (1982), McCandless (1974), Miller (1984a), and others. tral analysis wasperformed using thespeech microscope de- 3The z' axis ofslab coordinates isperpendicular tothevowel slab. Thus, by
scribedin Vemula et al. (1979). (7) CID corpus II containsthe same stimuli as CID
quencies of formantsasdescribed in the text. Clearly, spectral shapes are highlycorrelated with the locations of formantfrequencies and,therefore, suchlocations offeran attractiveapproach to the specification of spectral shape. However,it would not be surprising if relatedmetrics based on the entirespectral shape will be required in the futureasit is well knownthat "formanttracking"in continuous speech is quitedifficult.For thepresent, we carryout the requiredformanttrackingusinga varietyof software aids andhandintervention. At the sametime, we are trying to develop formant trackingalgorithms that dealwith problems of weak,merged, andmissing formants aswell asa varietyof otherproblems mentioned in the work of
Eq. (8), the center of the vowel slab has a z' of 0.704 log units (0.5772X 1.22) and the thicknessof the vowel slab is 0.156 log units
(0.5772X0.27).
corpus I, usingfour newspeakers, two maleandtwo female. Spectral analysis wasdone using ILS. LPC analysis wasperformed using a 25.6-ms Hamming window moving in 3.2-ms steps, a high-frequency preemphasis factor of 98%, and 24
poles. Thefundamental frequency wasdetermined using the cepstrally based algorithm provided by ILS. Formantvalues werepickedautomatically usingthe procedure described in
Chang (1987a). (8) CID corpus III ispart of the datareported by Four-
Ades, A. E. (1977)."Theoretical notes: Vowels, consonants, speech, and

nonspeech," Psychoanal. Rev. 84, 524-530.
Bell,A. G. (1879). "Voweltheories," Am. J. Otol.1, 163-180.
Bladon, A. (1982). "Arguments against formants inthe auditory representation ofspeech," in The Representation ofSpeech inthe Peripheral Auditory System, edited by R. Carlson andB. Granstr6m (Elsevier Biomedi-
akisandMiller (1987). Two maleandtwo female speakers

were recordedproducingthe vowelsin isolationand in sonorant context. Four tokens (two male, two female) of each
cal,Amsterdam, TheNetherlands), pp.95-1, 02. Broad, D. J. (1976)."Toward defining acoustic phonetic equivalence for
vowels,"Phonetica33, 401424.
Broad, D. J.,and Wakita, H. (1977). "Piecewise-planar representation of

vowel formant frequencies," J. Acoust. Soc. Am. 62, 1467-1473.
vowelin isolationand two tokens(onemaleandonefemale)
fromsonorant context wereincluded in thiscorpus. Spectral analysis wasdoneusingILS. LPC analysis wasperformed usinga 24-msHammingwindowmoving in 1-mssteps, a high-frequency preemphasis factorof 98%, and 24 poles. Formantvalues of the vowelwereconverted into x, y, andz values by means of Eqs. (5)-(7). The values of x, y, andz were calculatedand plotted in APS for eachmillisecond of thewaveform, thusgenerating a sequence of points, or path, through thethree-dimensional space. A portion of thepath, selected by theexperimenter, wastakenascorresponding to the vowel.Factors influencing thisselection werevariedand included: ( 1) localextremes alongthe path, (2) regions of low velocity of movement along the pathcorresponding to "steady states," and (3) regions of highpathcurvature. Finally, the experimenter listened to the segmented portionto verifyhisselection. The experimenter averaged formant values overtheselected partof thevowel path,thus computing onepointper voweltokenfor eachspeaker. (9) For CID corpusIV, two speakers were recorded readingthe vowelsin /B/-vowel-/B/ context.Spectral analysis wasdoneusing ILS. LPC analysis wasperformed
Chang, H. M. (1987a). "SWIS: See what I say; A speaker-independent

word recognition system byphoneme-oriented mapping onaphonetically-encoded auditory-perceptual speech map," unpublished Ph.D. dissertation,Washington University, St.Louis,MO.
Chang, H. M. (1987b). "Automatic detection fordiphthongs," J.Acoust.

Soc.Am. Suppl.1 82, S37. Chiba,T. andKajiyama,M. (1941). The Vowel. Its NatureandStructure (Tokyo-Kaiseikan, Tokyo,Japan).
Cooper, W.E.,and Sorensen, J.M. (1981). FundamentalFrequency inSentence Production (Springer, New York).
Cowan, N., and Morse, P.A. (1986). "The use ofauditory and phonetic
memory in vowel discrimination," J. Acoust. Soc. Am. 79, 500-507.
Daniloff, R.G.,Shriner, T. H.,and Zemlin, W.R. (1968). "Intelligibility of

vowels altered in duration andfrequency," J.Acoust. Soc. Am.44,700707.
Fant,G. (1970).Acoustic Theory of Speech Production (Mouton, 's-Gravenhage, The Netherlands).
Fant, G. (1973). Speech Sounds and Features (MIT, Cambridge, MA). Fant, G.,Carlson, R.,and Granstr6m, B. (1974). "The[e]-[4] ambiguity,"Speech Communication Seminar, Vol.3,Speech Transmission Laboratory, pp. 117-121.
Fletcher, H. (1929).Speech andHearing (VanNostrand, NewYork).

Fourakis, M., and Miller, J. D. (1987). "Measurements of vowels in isola-
tion and in sonorant context," J.Acoust. Soc. Am.Suppl. 181,S17.

Fox, R. A. (1985). "Within- andbetween-series contrast in vowelidentifi-
cation: Full-vowel versus single-formant anchors," Percept. Psychophys.

3, 223-226.
using a 12.5-ms Hamming window moving in 2.5-ms steps. Formant values werefound asdescribed for corpus III, exceptthat the experimenter picked a single pointalong the pathto characterize thevowel anddidnotattempt to listen
to the segmented point.
2132 J.Acoust. Soc. Am., Vol. 85,No.5, May1989
Fry,D. B.,Abramson, A. S.,Eimas, P. D., andLiberman, A.M., (1962).
"Theidentification anddiscrimination of synthetic vowels," Lang.

Speech 5, 171-189.
Fujimura, O. (1962)."Analysis ofnasal consonants," J. Acoust. Soc. Am.

34, 1865-1875.
Fujisaki, H., andKawashima, T. (1968)."Theroles ofpitch and higher

formants in the perception of vowels," IEEE Trans.Audio Electroa-
James D.Miller: Vowel perception
2132
coust. AV-16, 73-77.
Miller, J.D.,Engebretson, A.M.,and Vemtla, N. R. (1983). "Observationson theacoustic descriptions of vowels asspoken by children, women, and men" (unpublished). Miller, J. D., Sadoff,S. J., and Veksler,M. R. (1988). "Sensory-perceptual transformations in speech analysis,"J. Acoust. Soc.Am. Suppl. 1 83,
S70.
Goldberg, R. F. (1986). "Perceptual anchors in vowel andconsonant continua," M.S.E.E. thesis,MIT, Cambridge,MA. Gottfried,T. L., and Strange, W. (1980). "Identification of coarticulated vowels,"J. Acoust. Soc.Am. 68, 1626-1635.
Gottfried,T. L., Jenkins, J. J., and Strange, W. (1985). "Categorical discrimination of vowelsproduced in syllable contextand in isolation," Bull. Psychon. Soc.23, 101-104. Greenwood, D. D. (1961). "Auditory masking and the criticalband," J.
Acoust. Soc. Am. 33, 484-501.
Miller, R. L. (1953). "Auditory testswith synthetic vowels,"J. Acoust.

Soc. Am. 18, 114-121.
Harris,J. D. (1960). "Scaling of pitchintervals," J. Acoust. Soc.Am. 32,

1575-1581.
Helson, H. (1964). ,daptation-leoel Theory (Harper and Row, New

York).
Minifie, F. D. (1973). "Speech acoustics," in Normal ,spects of Speech, Hearing,andLanguage, edited by F. D. Minifie,T. J. Hixon,andF. Williams (Prentice-Hall,Englewood Cliffs,NJ), pp. 235-284. Morse,P. A., Kass,J. E., andTurkienicz,R. (1976). "Selective adaptation of vowels,"Percept. Psychophys. 19{2},137-143.
International PhoneticAssociation (1949). ThePrinciples of the InternationalPhonetic .dssociation (InternationalPhoneticAssociation, UniversityCollege, London,England). Iri, M. (1959). "Mathematicalmethods in phonetics," Gengo Kenkyu (Language Research)35, 23-30. Jones, D. (1914). The Pronunciation of English(CambridgeU. P., Cambridge,England).
Nearey, T. M. (1989). "Static, dynamic, andrelational factors in vowel perception," J. Acoust. Soc. Am. 85, 2088-2113. Nearey, T. M., andAssmann, P. F. (1986). "Modeling theroleof inherent spectral change in vowel identification," J. Acoust. Soc. Am. 80, 1297308.
Ochai, Y., Saito, S., and Sakai, Y. (1955). "Articulation study of speech
qualities in rotational synchronous distortion," Mere.Fac.Eng.Nagoya

Univ. 7, 40-48.
Jones, D. (1919). "Experimental phonetics anditsutilityto thelinguist,"

Proc. R. Inst. G. B. 22, 8-21.
Jongman, A., andFourakis, M. (1987). "A cross-language study of vowel spaces andinterference," J. Acoust. Soc. Am. Suppl.1 81, S66.
Kent, R. D. (1979). "Isovowellinesfor the evaluationof vowel formant structure in speech disorders," J. Speech Hear. Disord.44, 513-521.
Okamura,M. (1966). "Acousticalstudies of Japanese vowelsin children. The formantconstruction andthe developmental process," Jpn.J. Otol. (Tokyo) 69, 1198-1214. Peterson, G. E. (1952). "The information bearing elements of speech," J.
Acoust. Soc. Am. 24, 629-637.
Peterson, G. E. (1961). "Parametersof vowel quality," J. Speech Hear.

Res. 4, 10-29.
Klumpp,R. B., and Webster, J. C. (1961). "Intelligibility of time-compressed speech," J. Acoust.Soc.Am. 33, 265-267. Koenig,W. (1949). "A newfrequency scale for acoustic measurements,"
Bell Labs Rec. 27, 299-301.
Kurtzrock,G. (1956). "The effects of time andfrequency distortion upon wordintelligibility," unpublished Ph.D. dissertation, University of Illinois, Urbana, IL.
Peterson, G. E., and Barney,H. L. (1952). "Control methods usedin a studyof the vowels," J. Acoust.Soc.Am. 24, 175-184. Pipping,H. (1893). "Review and criticismof Lloyd--1890a,b, 1891, 1892."Z. franzSsische Sprache und Literatur 15, 157-171. Pisoni,D. B. (1973). "Auditory and phonetic memorycodes in the discriminationof consonants and vowels,"Percep.Psychophys. 13, 253260.
Lea, W. A. (1980). Trends in Speech Recognition (Prentice-Hall, Englewood Cliffs, NJ), p. 127.
Lloyd, R. J. (1890a). SomeResearches into the Nature of Vowel-Sound (Turner and Dunnett, Liverpool,England). Lloyd,R. J. (1890b). "Speech sounds: Their natureand causation (I),"
Phonetische Studien 3, 251-278.
Pols,L. C. W. (1977). "Spectral analysis andidentification of Dutch vowelsin monosyllabic words,"Institutefor Perception TNO, Soesterberg,
The Netherlands.
Lloyd,R. J. (1891). "Speech sounds: Theirnature andcausation (II-IV),"

PhonetischeStudien 4, 37-67, 183-214, 275-306.
Lloyd, R. J. (1892). "Speech sounds: Their nature and causation(VVII)," Phonetische Studien5, 1-32, 129-141,263-271.
Potter,R. K., and Steinberg, J. C. (1950). "Towardthe specification of speech," J. Acoust.Soc.Am. 22, 807-820. Repp,B. H., Healy,A. F., andCrowder, R. G. (1979). "Categories and context in the perception of isolated steady-state vowels," J. Exp. Psychol.:Hum. Percept.Perform.5, 129-145.
Sawusch, J. R., and Nusbaum,H. C. (1979). "Contextualeffects in vowel
perception I: Anchor-induced effects," Percept. Psychophys. 4, 292-302. Lloyd,R. J. (1894). "Replyto criticisms by Dr. Pipping." Z. franzSsische Sawusch, J. R., Nusbaum,H. C., and Schwab,E. C. (1980). "Contextual Sprache und Literatur16, 201-206. effects in vowel perception II: Evidence fortwoprocessing mechanisms," Macchi, M. J. (1980). "Identification of vowelsspoken in isolationversus Percept.Psychophys. 5, 421-434. vowels spoken in consonantal context," J. Acoust. Soc.Am. 68, 16361642. Scott,B. L. (1976). "Temporalfactorsin vowelperception," J. Acoust. Soc. Am. 60, 1354-1365. Macmillan, N. A., Kaplan,H. L., andCreelman, C. D. (1977). "The psySekimoto, S. (1982). "Perceptual normalization offrequency scale," Annu. chophysics of categorical perception," Psychol. Rev.84, 452-471.
Macmillan, N. A., Braida, L. D., and Goldberg, R. F. (1987)."Central and

peripheral processes in theperception of speech andnonspeech sounds," in ThePsychophysics of Speech Perception, editedby M.E.H. Schouten
(Martinus Nijhoff, Dordrecht,The Netherlands),pp. 28-45. McCandless, S.S. (1974). "An algorithm forautomatic formant extraction using linear prediction spectra," IEEE Trans. Acoust. Speech Signal Process.22, 135-141.
Bull. RILP 16, 95-101.
Sharf,B. (1978). "Loudness," in Handbook of Perception: Hearing,edited byE. Carterette andM. Friedman (Academic, NewYork), pp. 187-242. Shepard, R. N. (1972). "Psychological representation of speech sounds," in Human Communication:,4 UnifiedVieto, editedby P. B. DenesandE.
E. David, Jr. (McGraw-Hill, New York).
Millar, J. B., and Ainsworth,W. A. (1972). "Identification of synthetic

isolated vowels and vowels in H-D context," Acustica 27, 278-282.
Slawson, A. W. (1968). "Vowelqualityandmusical timbreasfunctions of spectrum envelope andfundamental frequency," J. Acoust. Soc. Am. 43,
87-101.
Miller, J. D. (1984a). "Auditory processing of the acoustic patternsof speech," Arch. Otolaryngol. 110, 154-159. Miller, J. D. (1984b). "implications of the auditory-perceptual theoryof phonetic recognition bythehearing impaired," ASHA Rep.# 14,45-48. Miller, J. D. (1987a). "Auditory-perceptual processing of speech waveforms,"in ,uditoryProcessing of Complex Sounds, editedby W. A. Yost and C. S. Watson (Erlbaum, Hillsdale, NJ), pp. 257-266. Miller, J. D. (1987b). "The auditory-perceptual theoryof phonetic recognition:A synopsis" (unpublished). Miller, J. D. (1987c). "Classification of vowel productions by meansof perceptual targetzones: A response to Ladefoged and Studdert-Kennedy,"J. Acoust.Soc.Am. Suppl.1 82, S82. Miller, J. D., Engebretson, A.M., andVemula, N. R. (1980). "Vowelnormalization: Differences between vowels spoken by children,women,and men," J. Acoust.Soc.Am. Suppl. 1 68, S33.
Stevens, K. N., Fant, G., and Hawkins, S. (1987). "Someacoustical and
perceptual correlates of nasal vowels," in In Honorofllse Lehiste: Ilse

Lehiste Pihendusteos, editedby R. Channon andL. Shockey (Foris,The Netherlands), pp. 241-254.
Strange, W. (1989). "Evolving theories of vowelperception," J. Acoust.

Soc. Am. 85, 2081-2087.
Syrdal, A. K., andGopal,H. S. (1986). "A perceptual model of vowelrecognition based onauditory representation ofAmerican-English vowels,"
J. Acoust. Soc. Am. 79, 1086-1100.
Tiffany,W. R., andBennett, D. N. (1961). "Intelligibility of slow-played

speech," J. Speech Hear. Res.4, 248-258. Vemula,N. R., Engebretson, A.M., Monsen,R. B., and Lauter,J. L. (1979). "A speech microscope," in Speech Communication Papers, edited by J. J. Wolf and D. H. Klatt (Acoustical Society of America,New
York), pp. 71-74.
2133
J. Acoust. Soc.Am.,Vol.85, No.5, May1989
2133
Wier, C. C., Jesteadt, W., andGreen,D. M. (1977). "Frequency discriminationas a functionof frequency and sensation level,"J. Acoust.Soc.
Am. 61, 178-184.
Soc. Am. 68, 1523-1525.
Zwicker,E., andTerhardt,E. (1980). "Analyticalexpressions for critical bandrate andcriticalbandwidths asa functionof frequency," J. Acoust.
Zwislocki, J.J. (1978)."Masking: Experimental andtheoretical aspects of simultaneous, forward, backward, andcentral masking," inHandbook of Perception: Hearing, edited by E. Carterette andM. Friedman (Academic,New York), pp. 283-336.
2134
J. Acoust. Soc.Am.,Vol.85, No.5, May 1989
2134

Auditory-Perceptual Interpretation of The Vowel: Central Institute For The Deaf St. Louis, Missouri 63110

Transféré par

Informations du document

Description originale:

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Auditory-Perceptual Interpretation of The Vowel: Central Institute For The Deaf St. Louis, Missouri 63110

Transféré par

Droits d'auteur :

Formats disponibles

Auditory-perceptual

CentralInstitute for theDeaf St. Louis,Missouri63110

(Received8 October1987;accepted for publication19January1989)

our laboratories at Central Institute for the Deaf (CID).

same vowel can have very different formant ratios?The

vowelcategory.Also, they statedthat ratiosof formant fre-

FREQUENCY OF F2 IN CYCLES PER SECOND

James D. Miller:Vowel perception

30.00 -' :';' .'., ' '2,, ' ....

2soo - -, .:: ;. .." ,',.' .-..," ,/.:' L'-' : ,

II. FREQUENCY SCALESIN VOWELANALYSIS

'I 0.4 :.-t_..t .:_.J _.,t .4

equationfor technicalmels (TM),

( 1000/log 2)log(f/1000 q- 1).

The Bark values (B) were calculatedby the equationfrom

The Koenigvalues(K) werecalculated from the equations

expressed in hertzor in mels.Peterson thendemonstrated (Koenig, 1949)

FIG. 3. The ratiosof the mel equivalents

(M 2/M 1) for the data shownin Fig. 2.

J. Acoust.Soc. Am., Vol. 85, No. 5, May 1989

James D. Miller:Vowel perception

and three talker groups, therewere 27 F 2/F 1 ratiosand 27

FIG. 4. The frequency of the firstandsecond formants for a setof matched

K= 0.002f for 0<1000,

K= (4.5 logf) -- 11.5 for 1000<10000.

2000 ' I ' I ' I

ness, and it demonstrates that differences in formant values

Examination of Figs.5-8 suggests that useof differ-'

2800 ......... i.................................................................................... . O (H2.LI)

2000 ................................................................... ..................................... 3.._

1600 ................................... "F'"'I ............... .......... ........ ....................

,o ,o',o ,Lo ,;4

320 640 980

log ratios and differences of(F 1,F 2) and (F 2,F 3) isshown

visual inspection, one can see that allofthe ratio descriptions

,oo 0oo ,2'oo

vowels. In thecases ofthemel,Koenig, andfrequency scales,

thelogratio approach seems todoa better jobofclustering

A mathematical analysis confirms the visualimpres-

sions ofFig.9.Following thework ofMilleretal. (1983),we

, 0 I. I , 0. I , 0 I6 , l.O O' . 640 I ,, 1260 I 1920 I 25603000 0'%:0 2 , 0.4

plane. Next,wecalculated themean of thesquares of all

formantcenter frequencies measured in hertzor mels,the

Using thestatistics justdescribed, weranked thevowel amongthe formants.

clustering thevowel datashown in Fig. 9. The results are

the results of similar,more extensive studies(Miller et al.,

J.Acoust. Soc. Am., Vol. 85,No. 5,May 1989

James D.Miller: Vowel perception 2119

TABLE I. Comparison ofspectral metrics ofvowels based onclustering by

Rank d ;ms Rank

2.38 2.50 2.14

log(B 3lB 2),log(B 2/B 1)

tory-perceptual space. 2 In stage 2, these sensory responses,

APS. Here, x, y, and z axesare in 0.1 logunitsandthepointof originis (0,

byEq.(5).The values ofthe formants are $F 1= 310 Hz,$F2 = 2450 Hz,

voice pitch. Formale speakers, a typical value of GMFO is

In our earliestwork (Miller, 1984b), perceptual tar-

volumes. Theparticular zone shown for/IY/in Fig. 11is

theresultof oursecond attemptor second iteration in defin-

2450, and2810Hz, respectively. Thisspectrum islocated in

In the uditory-perceptual theory, however, it isnot the

y = i log SF1) -- (log SR ) = log (SF1/SR ).

1111111qlq IllrllU PI I I 1 I I Jlidl/I

d I ! [ I ! 1Ll.d I I TJ ! 1I 1 ! I I I ! I !!lrl 1 I I ! ! I i I I Ii-1 I I Ig II11111111111111