Dissimilarity and the Classification of Female Singing
Voices: A Preliminary Study
Molly L. Erickson
Knoxville, Tennessee
Summary: Traditionally, timbre has been defined as that perceptual attribute that
differentiates two sounds when pitch and loudness are equal and thus is a measure of
dissimilarity. By such a definition, each voice possesses a set of timbres, and the
identity of any voice or voice category across different pitch–loudness–vowel
combinations must be due to an abstraction of the pattern of timbre transformation.
Using stimuli produced across the singing range by singers from different voice
categories, this study sought to examine how timbre and pitch interact in the
perception of dissimilarity. This study also investigated whether listener experience
affects the perception of timbre as a function of pitch. The resulting multidimensional
scaling (MDS) representations showed that for all stimuli and listeners, dimension 1
correlated with pitch, whereas dimension 2 correlated with spectral centroid and
separated vocal stimuli into the categories mezzo-soprano and soprano. Dimension 3
appeared highly idiosyncratic depending on the nature of the stimuli and on the
experience of the listener. Inexperienced listeners appeared to rely more heavily on
pitch in making dissimilarity judgments than did experienced listeners. The resulting
MDS representations of dissimilarity across pitch provide a glimpse of the timbre
transformation of voice categories across pitch.
Key Words: Voice classification—Perception—Timbre—Pitch—Listener experience.
Accepted for publication October 17, 2002.
This paper was presented at the 28th Annual Symposium: Care of the Professional Voice,
June 1999, Philadelphia, Pennsylvania.
From the Department of Audiology and Speech Pathology, University of Tennessee,
Knoxville, Tennessee.
Address correspondence and reprint requests to Molly Erickson, Department of Audiology
and Speech Pathology, 457 South Stadium Hall, University of Tennessee, Knoxville, TN
37996. E-mail: merickso@utk.edu
Journal of Voice, Vol. 17, No. 2, pp. 195–206
© 2003 The Voice Foundation
0892-1997/2003 $30.00+0
doi: 10.1016/S0892-1997(03)00022-5

INTRODUCTION
The accepted definition of timbre is as follows: Two tones are of different timbre if
they are judged to be dissimilar and yet have the same loudness and pitch.1 By this
definition, when pitch and loudness are equal, timbre becomes the primary auditory cue
used to identify differences and thus may be used to compare sound-producing objects.
Traditional timbre research has been guided by the notion of timbre differences
embodied in this technical definition, and it has concentrated on the acoustic factors
that underlie the perception of dissimilarity when pitch and loudness are constant or
nearly so.2–5
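
As a rough illustration of the technical definition, the following sketch builds two hypothetical harmonic tones that share a fundamental frequency and an RMS level but differ in how quickly their harmonics roll off. It is not part of the study: the function names, the roll-off values, and the use of RMS level as a stand-in for loudness and of a whole-spectrum centroid as a stand-in for timbre are all assumptions made for illustration.

```python
import math

def harmonic_tone(f0, rolloff_db_per_octave, n_harmonics=20, target_rms=0.1):
    """Return (frequencies, amplitudes) of a harmonic tone scaled to target_rms.

    Hypothetical helper: harmonic k has frequency k * f0 and an amplitude that
    falls by rolloff_db_per_octave for every octave above the fundamental.
    """
    ks = range(1, n_harmonics + 1)
    freqs = [f0 * k for k in ks]
    amps = [10 ** (-rolloff_db_per_octave * math.log2(k) / 20) for k in ks]
    scale = target_rms / rms_level(amps)
    return freqs, [a * scale for a in amps]

def rms_level(amps):
    """RMS of a sum of sinusoids at distinct frequencies with these amplitudes."""
    return math.sqrt(sum(a * a for a in amps) / 2)

def spectral_centroid(freqs, amps):
    """Amplitude-weighted mean frequency of the harmonics."""
    return sum(f * a for f, a in zip(freqs, amps)) / sum(amps)

f0 = 440.0                        # same pitch (A4) for both tones
dull = harmonic_tone(f0, 12.0)    # steep roll-off: energy concentrated low
bright = harmonic_tone(f0, 3.0)   # shallow roll-off: more high-harmonic energy

print("RMS levels:", rms_level(dull[1]), rms_level(bright[1]))
print("Centroids (Hz):", spectral_centroid(*dull), spectral_centroid(*bright))
```

Under these assumptions the two tones match in fundamental frequency and RMS level, yet their spectral centroids differ, which is the kind of residual difference the definition assigns to timbre.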
Many sound-producing objects are capable of large variations in pitch and loudness. In
spite of this great variation, not only can listeners distinguish between
sound-producing objects, but listeners can also identify sounds of varying pitch and
loudness as having been produced by the same sound-producing object. Thus, it could be
argued that some element of timbre also serves the important function of maintaining
the coherence of sound-producing objects in an ever-changing auditory world. From this
perspective, every sound source can be thought to have a timbre transformation, a set
of timbres across different pitches, loudness levels, vowels, and so on. It is the
listener’s task to abstract the timbre transformation, creating coherence of the source
in the face of timbre variation across conditions.

In the vocal pedagogy literature, the term timbre is often used in a manner contrary to
the technical definition. It is generally stated that along with pitch range and
tessitura, timbre is a major factor in the determination of a singer’s voice
category.6 Cleveland7 states that an individual singer has a characteristic timbre that
is a function of the laryngeal source and vocal tract resonances. The pedagogical use
of the term timbre implies that it is a property of the singer, not of each note sung
by the singer. This interpretation of the term timbre as used by vocal pedagogues has
historically been reflected in vocal timbre research where timbre is assumed to be
constant throughout a singer’s range, or at least throughout each of a singer’s
registers.7,8 Given the technical definition of timbre, it would be more accurate to
refer to a singer as having a set of timbres, or a timbre transformation, rather than
simply as having a timbre. By this logic, singers with similar timbre transformations
constitute members of the same voice timbre type or voice category. Given the possible
changes in timbre across the singing range, it is possible that any two voices may be
perceived as having similar timbre on one pitch–loudness–vowel combination and
dissimilar timbre on another. Thus, the categories must be elastic and may not be
reducible to one unchanging acoustic metric.

To summarize this argument, one could hypothesize that each voice possesses a set of
timbres, and the identity of any voice across different pitch–loudness–vowel
combinations must be due to an abstraction of the pattern of timbre transformation.
Singers in the same category would be characterized by similar transformations, whereas
singers in different categories would be differentiated by dissimilar timbre
transformations, not necessarily by dissimilarity at specific pitches. This
conceptualization allows us to bridge the technical definition of timbre as a
difference between sources to the conceptualization of the timbre transformation as the
“signature” of a source.

This experiment represents a preliminary attempt to test the above hypotheses.
Previously, the perception of voice classification has been examined using
forced-choice paradigms based on the traditional voice classification system of bass,
baritone, tenor, alto, mezzo-soprano, and soprano.7 Such perceptual experiments provide
information as to how listeners place vocal stimuli when provided with arbitrary
classification categories. They do not provide information concerning how timbre is
transformed across pitch within categories. The question that begs answering is this:
if provided no classification system a priori, how do listeners tend to group vocal
stimuli across pitch? Do they group vocal stimuli in a manner that supports current
classification systems based on the presumption that vocal timbre is an invariant
perceptual cue across the singing range, or do they group them in a manner that
suggests that vocal timbre undergoes a systematic transformation across the singing
range, with singers in the same vocal category displaying similar timbre
transformations? Additionally, this experiment tested whether listeners with experience
in the classification of singers differ from untrained listeners in the way they
organize vocal stimuli. Previous studies of timbre at one pitch have not revealed
differences between trained and untrained listeners.2,9 The question remains: does
experience affect the perception of the timbre transformation across pitch?

To answer these questions, this study used a dissimilarity paradigm in which listeners
judged the timbre similarity between singers at different pitches. If there is a timbre
transformation shared by singers of the same category, then the dissimilarity judgments
between different notes for the same singer ought to follow a pattern that is similar
to other singers in the same category, but different from singers in a different
category. If listener experience
affects the perception of timbre transformations, then the dissimilarity judgments
should differ between experienced and inexperienced listeners.

METHOD
Stimuli
Master’s level singers from the School of Music at the University of Tennessee,
Knoxville, provided stimuli for the experiment. All singers were members of the
University of Tennessee Opera Studio program, a training program jointly operated by
the University of Tennessee and the Knoxville Opera Company. All subjects provided
informed consent using a procedure that was previously approved by the Institutional
Review Board of the University of Tennessee, Knoxville. These subjects met the
following criteria: (1) bilateral hearing within normal limits as determined by a 20-dB
hearing screening at 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz10; (2) voice study at the
Master’s degree level or higher; and (3) no voice problems at the time of taping as
determined by a certified speech-language pathologist. Additionally, all subjects must
have been consistently categorized as a soprano or mezzo-soprano for more than five
years. Such categorization must also have been agreed on by 100% of the voice
professionals associated with the Knoxville Opera Company and the University of
Tennessee School of Music.

Two singers from each voice classification, mezzo-soprano and soprano, were recorded
singing the vowel /ɑ/ on six pitches, A3, C4, G4, B4, F5, and A5, at a comfortable
loudness level. Each singer produced a sustained /ɑ/ for approximately four seconds.
Both sopranos were characterized as lyric sopranos, whereas one mezzo-soprano was
characterized as a lyric mezzo-soprano and the other as a dramatic mezzo-soprano.

Recordings were made in a single-walled sound booth (Acoustic Systems RE-144-S, Austin,
TX). Subjects were recorded using a digital audio tape recorder (Sony PCMR500, Park
Ridge, NJ) and a Sennheiser MD 441-U (Old Lyme, CT) microphone. Subjects stood in the
center of the booth. Lip-to-microphone distance was 12 inches. A keyboard was used to
present pitches. Prior to taping, subjects were allowed to vocalize freely and become
comfortable with the recording environment.

One-second digital samples were constructed for each sung vowel. Each stimulus was
low-pass filtered at 20 kHz using a Tucker-Davis-Technologies (TDT, Gainesville, FL)
FT6 anti-aliasing filter, and then digitized at 48 kHz using a Turtle Beach Pinnacle
(Yonkers, NY) sound card. The software program Cool Edit Pro (Syntrillium Software
Corporation, Phoenix, AZ) was used to extract one-second samples from the midpoint of
the sung /ɑ/. This was accomplished by measuring the exact duration of the sung vowel,
calculating its midpoint, and then extracting a one-second segment that originated one
half-second before the midpoint and terminated one half-second after the midpoint.
Spline curve amplitude shaping functions were applied to each sample to provide ramped
onsets and offsets. Because equal loudness contours in the frequency range of these
stimuli are relatively flat, the overall amplitude of each stimulus was adjusted so
that all were of approximately equal amplitude (±3.5-dB RMS). The small variations in
dB RMS were randomly distributed throughout the stimuli so that any perceivable
variations in loudness likely did not systematically affect the outcomes.

Listeners
All listeners provided informed consent using a procedure previously approved by the
Institutional Review Board of the University of Tennessee, Knoxville. Two groups of
listeners were recruited for the dissimilarity experiment: inexperienced listeners and
experienced listeners.

Inexperienced listeners were recruited from students enrolled in introductory
psychology courses at the University of Tennessee, Knoxville. Eighteen inexperienced
listeners were recruited who met the following criteria: (1) bilateral hearing within
normal limits as determined by a 20-dB hearing screening at 500 Hz, 1000 Hz, 2000 Hz,
and 4000 Hz10; (2) no history of choral singing or vocal training; and (3) no interest
in classical vocal music or opera.

Experienced listeners were recruited from the Knoxville Choral Society and the
Knoxville Opera Company. Twelve listeners were recruited who met the following
criteria: (1) bilateral hearing within
normal limits as determined by a 20-dB hearing screening at 500 Hz, 1000 Hz, 2000 Hz,
and 4000 Hz10; and (2) bachelor’s degree or higher in a vocal arts-related discipline
(e.g., pedagogy, performance, or choral conducting) or five years of professional
experience in a vocal arts discipline.

Procedure
The 24 vocal stimuli (6 pitches × 2 classifications × 2 singers) were combined in all
possible pairs, for a total of 276 paired stimuli. Within the 276 pairs, there were
three types of timbre comparisons. In the first, participants judged the similarity
between two different pitches produced by the same voice. There were 15 comparisons for
the 6 pitches produced by each singer with 60 judgments in total. In the second,
participants judged the similarity between stimuli of the same pitch for two different
singers. There were six pitches produced by all four singers, resulting in a total of
36 judgments. In the third, participants judged the similarity between stimuli of
different pitches produced by different singers, for 180 judgments in total. On every
trial, each of the paired stimuli was presented once, separated by 500 ms of silence.
For each pairing of stimuli, A and B, order of pairing was randomized across listeners,
resulting in nearly equal numbers of instances when stimulus A was followed by stimulus
B and when stimulus B was followed by stimulus A, thus eliminating ordering effects.

The 276 paired stimuli were presented to the two groups of listeners, inexperienced and
experienced. Prior to the listening experiment, listeners were instructed to rate the
difference in quality of the two sounds. Quality was defined as that part of a sound
that distinguishes one person’s voice from another’s. Listeners rated quality
differences using a dissimilarity scale (1 to 10). They were instructed not to use
pitch in their judgments. Each listener then completed a practice session consisting of
ten pairs of stimuli that were not used in the experiment. The 276 trials were
presented in a different random order to each participant. Listeners were allowed to
replay each trial as many times as they needed. Listener judgments were recorded via a
computer interface consisting of a monitor and a mouse.

To verify the perceived vocal categories of each of the singers, upon completing the
dissimilarity experiment, all experienced listeners heard each of the 24 vocal stimuli
in isolation and were asked to rate them on a visual analog scale (0 to 100) using a
scroll bar where the left anchor was mezzo-soprano (assigned a value of 0) and the
right anchor was soprano (assigned a value of 100).

The listening experiment took place in a single-walled sound booth (Acoustic Systems
RE-144-S, Austin, TX). Stimuli were presented binaurally via a Turtle Beach Montego II
(Yonkers, NY) sound card and Sennheiser HD 545 (Old Lyme, CT) headphones. The stimuli
were presented at a comfortable listening level, approximately 65-dB SPL.

Acoustic Measures
As discussed by Bloothooft and Plomp,2 the acoustic signal produced by the voice can be
described using a production-oriented or perception-oriented point of view. From the
production-oriented point of view, the vocal instrument is generally conceptualized
using a source-filter model.11 Assuming such a model, vocal acoustics could ideally be
separated into those characteristics that are produced by the source, the vibrating
vocal folds, and those produced by the filter, the vocal tract. Traditionally, singing
voice authorities have postulated that timbre properties that distinguish singers of
different voice categories are related to properties of the filter, particularly vocal
tract length,7,12 whereas timbre properties that distinguish vocal registers,13 brassy
versus fluty quality,14 pressed versus flow phonation,15,16 or breathy versus clear
tone17 could be related to characteristics of the voice source. Thus, acoustic measures
such as long-term average spectra (LTAS) or mean formant frequency are believed to
predict vocal categories, whereas measures such as spectral slope and
harmonics-to-noise ratio are believed to be unrelated to voice category and may help
distinguish voices within categories or distinguish registers within a particular
voice.

From the perception-oriented point of view, the output vocal signal is treated as a
whole, not the sum of two distinct parts, source and filter. Such an approach makes
some sense when considering high-pitched voices where (1) the true resonance
frequencies of the vocal tract will not likely be reflected in the vocal output signal
due to the wide spacing
of harmonics and (2) there may be significant source-filter interaction.18 Using this
approach, Bloothooft and Plomp2 analyzed the vocal signal using a set of bandpass
filters. Another possible perception-oriented acoustic measure might be spectral
centroid, a measure that is frequently employed in studies of instrumental timbre, but
is not often applied to studies involving the vocal instrument.

The current experiment employed a variety of acoustic measures in an attempt to
identify those spectral cues that lead to both categorization and identification of
specific voices. Additionally, because little is known concerning the effects of
temporal parameters on the perception of vocal categories or individual voices, two
measures of vibrato were calculated. In total, four acoustic measures were computed:
amplitude of the first harmonic minus the amplitude of the second harmonic (H1-H2),
spectral centroid from 2 to 5 kHz, mean rate of frequency vibrato, and mean extent of
frequency vibrato. Each measure was calculated from the middle one second of the
original digitized voice sample. All samples were digitized at 48 kHz unless otherwise
noted.

H1-H2
This measure is believed to provide a relative indication of glottal adduction and,
therefore, source spectral slope.13,19,20 Source slope is believed to be unrelated to
voice category and may help distinguish voices within categories. Source slope is often
associated with terms such as “sharpness,”2 “brassy” versus “fluty” quality,14 or
“pressed” versus “flow” phonation.15

Spectral Centroid from 2 to 5 kHz
Singing voice researchers have suggested that vocal tract length is one of the primary
physiological predictors of voice category.12,14 Although all formants are affected by
vocal tract length, formants 1 and 2 are also greatly affected by tongue and jaw
position. Thus, it has been postulated that the location of the upper formants,
particularly formants 3 and 4, provides the perceptual cue to voice category.12,21
However, in female voices at high pitches, wide spacing of harmonics makes it unlikely
that these resonance peaks will be represented precisely in the acoustic output at
fundamental frequencies above 350 Hz.22 In such cases, formant frequencies obtained
through linear predictive coding (LPC) might have little relationship to the actual
acoustic cues present in the signal. It might be possible to measure the frequency of
upper peaks directly from the output spectrum; unfortunately, at high pitches, such
peaks are often not present. What is needed is a method for quantifying the shape of
the spectrum in the regions of formants 3 and 4. One such method, the spectral
centroid, has been used extensively by instrumental timbre researchers. Thus, for
high-pitched voices, spectral centroid in the frequency regions of 2000 to 5000 Hz
might provide a better measure of vocal tract acoustics in the regions of F3 and F4.
Spectral centroid was calculated as follows. The average spectrum was calculated from 0
Hz to 8000 Hz using a fast Fourier transform (FFT) with a bandwidth of 50 Hz. One
hundred sixty-four data points were obtained representing the spectrum from 2000 to
5000 Hz. Using these data points, the spectral centroid from 2000 to 5000 Hz was
calculated after Sandell23 using the formula:

$$\frac{\sum_{k=1}^{164} e_k f_k}{\sum_{k=1}^{164} e_k}$$

where e is the vector of 164 spectral amplitude data points between 2000 and 5000 Hz
and f is the vector of 164 spectral frequency data points between 2000 and 5000 Hz. The
resulting number is essentially a measure of central tendency and would be highly
correlated with the average of F3 and F4 when strong peaks are present in the output
signal. When no peaks are present, the measure is affected by spectral slope.

Frequency Vibrato Rate and Extent
Temporal characteristics of voice are typically overlooked in studies of vocal timbre.
Yet, the frequency modulation underlying vocal vibrato likely has perceptual
consequences.2,24 Mean frequency vibrato rate was calculated after Prame.24 Each
digitized stimulus was subjected to an autocorrelation analysis via the Corr routine of
Soundswell (Hitech Development) using a fixed window of 10 ms. The resulting waveform
representing frequency as a function of time was segmented via cursors to determine the
period of each vibrato cycle. The
inverse of each period was calculated and averaged to produce the mean frequency
vibrato rate. Frequency vibrato extent was calculated by using cursors to measure
maximum and minimum frequencies at each vibrato half-cycle. Using these data points,
mean vibrato extent in cents was calculated after Prame.25

RESULTS
Analysis of the experienced listener judgments regarding the voice category of each
stimulus suggests that although listeners were heavily influenced by pitch in judging
these categories, they also used other perceptual cues to categorize each stimulus
(Figure 1). Stimuli were perceived as increasingly soprano-like with increasing pitch;
however, at all pitches but A5, the stimuli produced by the two sopranos were heard as
more soprano-like than those produced by the two mezzo-sopranos. These results support
the notion that all of the singers in the study are in fact perceived to be members of
the vocal category assigned to them by the vocal pedagogues most familiar with their
voices.

FIGURE 1. Experienced listeners’ ratings of voice category for each of the 24 vocal
stimuli.

To test the assumption that the MDS solution obtained from the inexperienced listeners
differed from that obtained from the experienced listeners, an INDSCAL analysis (SPSS
10.0, Chicago, IL) using an ordinal distance measurement and a Euclidean metric was
performed where each group was treated as a separate subject. Group weights generated
by INDSCAL were used to determine the importance of each dimension to each group of
listeners. Group weights for each group are presented in Table 1. The first dimension
was slightly more important to the inexperienced listeners than to the experienced
listeners, whereas the second dimension was equally important to both groups. The third
dimension was more important to experienced listeners than to inexperienced listeners.
Given these differences in the MDS solutions for the two groups, separate MDS ALSCAL
analyses (SPSS 10.0) were performed. These analyses used ordinal distance measurements
and Euclidean metrics. Throughout the following results, all stress values reported
were calculated using Young’s S-stress formula 1. Also throughout the following
results, correlation alpha levels for each data set were adjusted using a Bonferroni
correction (p < 0.005).

The inexperienced listeners generated an optimal MDS solution in three dimensions
(S-stress = 0.133, R² = 0.91). The experienced listeners generated an optimal MDS
solution in three dimensions
(S-stress = 0.201, R² = 0.83). The three-dimensional solution for both groups is
presented graphically in Figure 2.

TABLE 1. Group Weights Generated by INDSCAL for Inexperienced and Experienced Listeners

Dimension   Inexperienced listeners   Experienced listeners
1           0.7507                    0.6260
2           0.5628                    0.5438
3           0.1541                    0.3273

Inexperienced Listeners
Results of the INDSCAL analysis indicate that dimensions 1 and 2 were of primary
perceptual importance to the inexperienced listeners (Figure 3A). Dimension 1 appeared
most closely associated with pitch. This dimension was monotonically related to pitch
for all four voices, and pitch, as measured in semitones, was highly correlated with
dimension 1 (R = −0.99, p < 0.001). Dimension 2 separated the two voice categories,
soprano and mezzo-soprano, at all pitches but the highest pitch, A5. Throughout the
entire pitch range, dimension 2 scale values were most highly correlated with the
acoustic measure of spectral centroid from 2 to 5 kHz (R = 0.660, p < 0.001). Visual
comparison of dimension 2 values (Figure 3A) and spectral centroid values (Figure 3B)
reveals relatively isomorphic curves in the pitch range A3 through F5. In this pitch
range, the correlation between dimension 2 and pitch was quite high (R = 0.758,
p < 0.001). Dimension 3 (Figure 3C), although contributing to an improvement in overall
model fit, did so less for the inexperienced listeners than for the experienced
listeners. None of the acoustic measures calculated in this study correlated
significantly with dimension 3. Examination of dimension 3 scale values as a function
of pitch shows that this dimension failed to discriminate categories of voices or
individual voices, throughout the pitch range. However, the dimension did separate
sopranos from mezzo-sopranos at the lowest pitch, A3. The small variations in dB RMS
seen throughout the stimuli did not correlate significantly with any dimension.

Experienced Listeners
Results of the INDSCAL analysis indicated that dimensions 1, 2, and 3 were of
perceptual importance to the experienced listeners. Plots of dimension 1 versus
dimension 2 and dimension 1 versus dimension 3 are shown in Figure 4A and C,
respectively.

As in the MDS analysis for inexperienced listeners, dimension 1 appeared most closely
associated with pitch. Here, there was a monotonic relationship between dimension 1 and
pitch throughout the entire pitch range for lyric soprano 1. For the remaining three
voices, a monotonic relationship between dimension 1 and pitch was seen for the pitch
range A3 through F5. Pitch, as measured in semitones, was highly correlated with
dimension 1 (R = −0.874, p < 0.001).

Again, dimension 2 separated the two voice categories, soprano and mezzo-soprano
(Figure 4A), at all pitches but the highest pitch, A5. Similar to the data for the
inexperienced listeners, the acoustic measure of spectral centroid from 2 to 5 kHz was
most highly correlated with dimension 2 throughout the pitch range (R = 0.687,
p < 0.001) (Figure 4B). This correlation was even stronger when the pitch range was
restricted to A3 through F5 (R = 0.858, p < 0.001).

Dimension 3 appeared to separate individual singers within voice category (Figure 4C).
For these data, the strongest acoustic correlate to dimension 3 was vibrato rate
(R = 0.651, p = 0.001). Visual comparison of dimension 3 (Figure 4C) with vibrato rate
(Figure 4D) reveals that within each voice category, the singer with the higher vibrato
rate was also the singer with the more positive scale values along dimension 3. The
small variations in dB RMS seen throughout the stimuli did not correlate significantly
with any dimension.

DISCUSSION
This paper tested the hypothesis that rather than having a characteristic “timbre,” a
singer has a set of timbres, or a timbre transformation. Therefore, singers with
similar timbre transformations would constitute members of the same voice timbre type
or voice category. Based on this hypothesis, three
predictions can be made. First, listeners will perceive dissimilarity across pitch
within the same singer. Second, any two voices, or more likely, voice categories, may
be perceived as having similar timbre at one pitch and dissimilar timbre at another.
Third, singers of the same voice category should have similar MDS representations. A
second hypothesis was that listener experience would affect the perception of
dissimilarity across pitch.

FIGURE 2. MDS perceptual space for singers as perceived by inexperienced listeners (a)
and experienced listeners (b).

For both experienced and inexperienced listeners, pitch correlated most highly with
dimension 1, in spite of instructions to disregard pitch. As there are necessarily
timbre changes associated with pitch changes, it is impossible to determine to what
extent either of these perceptual cues affected listener judgments. Regardless, it is
important to note that these listeners perceived greater dissimilarity across pitch
range within singers than they did between singers, regardless of voice category. Yet,
inspection of dimension 2 shows that listeners also perceived great similarity within
voice categories, although, as hypothesized, sopranos and mezzo-sopranos were heard to
be similar to each other at some pitches and dissimilar at others. Thus, the MDS
solution provides a visual cue to the pattern of timbre transformation that constitutes
the “signature” of each of these vocal categories. These findings provide strong
perceptual support for the existence of these theoretical categories when
conceptualized as timbre transformations, rather than as static timbre categories.

In this study, for both inexperienced and experienced listeners, dimension 2 separated
singers based on voice category and correlated moderately well with spectral centroid
from 2 to 5 kHz. Spectral centroid values were higher for both sopranos than they were
for both mezzo-sopranos at all pitches except A5, with the greatest difference between
the two groups in the mid-pitch range (G4, B4). Although spectral centroid is affected
primarily by the location of peaks in the output spectrum, it is also secondarily
affected by spectral slope. Examination of the spectra for each of the stimuli revealed
at least one visible peak in the upper frequencies for all stimuli at pitches A3
through F5, but no visible peaks in the upper frequencies for any stimuli at the pitch
A5. It is therefore likely that the spectral centroid measurements at pitches A3
through F5 were primarily affected by location of spectral peaks, whereas the spectral
centroid measurements at the pitch A5 were primarily affected by spectral slope. As
spectral centroid correlated well with dimension 2 in the pitch range A3 through F5,
these results lend support to the notion proposed by Dmitriev and Kiselev12 and
Sundberg21 that location of higher spectral peaks provides a strong perceptual cue to
vocal category. It is interesting that for the pitch A5, where no spectral peaks were
noted, listeners heard little difference between voices, regardless of category.

FIGURE 3. Two-dimensional representations of the MDS perceptual space for singers as
perceived by inexperienced listeners. Dimension 2 as a function of dimension 1 (A),
dimension 2’s acoustic correlate, spectral centroid, as a function of pitch (B),
dimension 3 as a function of dimension 1 (C). There was no acoustic correlate to
dimension 3.

This study found several differences between inexperienced and experienced listeners in
the perception of timbre as a function of pitch. First, although both experienced and
inexperienced listeners used pitch as the primary cue in this dissimilarity task,
experienced listeners’ judgments were less highly correlated with pitch than
inexperienced listeners’ judgments, suggesting that experienced listeners have
developed listening strategies that use pitch as a cue to a lesser degree. Second,
inexperienced listeners, although able to discriminate differences between voice
categories based on spectral centroid, were not able to discriminate differences
between voices within a voice category. Experienced listeners, on the other hand, were
able to discriminate between singers within voice categories seemingly using temporal
cues, as evidenced by the correlation between vibrato rate and dimension 3 for
FIGURE 4. Two-dimensional representations of the MDS perceptual space for singers as perceived by experienced listeners.
Dimension 2 as a function of dimension 1 (A), dimension 2’s acoustic correlate, spectral centroid, as a function of pitch (B),
dimension 3 as a function of dimension 1 (C), and dimension 3’s acoustic correlate, vibrato rate, as a function of pitch (D).
these listeners. Such results are not necessarily contrary to those obtained by Bloothooft and Plomp,2 who found no difference in performance between experienced and inexperienced listeners on dissimilarity tasks using vocal stimuli at one pitch with temporal characteristics removed. For the experienced listeners, the lack of a strong correlation between dimension 3 and any one acoustic measure suggests that these listeners probably use multiple idiosyncratic cues based on their own experiences and the peculiarities of the data to make distinctions between singers within a vocal category. The strongest such cue in the case of these four singers was vibrato rate.

These results show that when presented with a pair of stimuli, it is impossible for listeners to judge voice similarity independently of pitch differences. If that were possible, the multidimensional representation would consist of discrete clusters of stimuli, one for each voice or, more likely, one for each category. Yet, category differences do emerge in the spectral centroid dimension (dimension 2). In that dimension, voices in the same category tend to group at each pitch and tend to share dimension 2

coordinates of the same polarity (e.g., positive values for the sopranos and negative values for the mezzo-sopranos). Thus, although listeners could not disregard pitch differences when comparing stimuli within a singer across pitch, they were nonetheless able to hear timbre differences between categories across pitch.

It is interesting to note that experienced listeners differed from inexperienced listeners in the manner in which they weighted pitch in dimension 1. Stimuli of the same pitch tended to be scaled similarly by the inexperienced listeners across all voices, as indicated by the nearly vertical arrangement along this dimension (see Figure 3). On the other hand, for the experienced listeners in the vocal experiment, stimuli of the same pitch were not scaled similarly across all voices (see Figure 4). In fact, the experienced listeners heard greater dissimilarity across pitch range for both sopranos than they did for both mezzo-sopranos, suggesting that these listeners may have been less influenced by pitch and harmonic spacing than were the inexperienced listeners.

The fact that listeners were able to make consistent judgments for each voice in the face of large pitch differences suggests that although the timbre of an individual singer is not invariant across pitch, the singer’s timbre transformation may provide perceptual cues to the identity (or at least, category) of the singer. At the present time, it is unknown whether listeners can abstract such timbre transformations in an identification task. However, it could be speculated that through normal experience and/or explicit training, listeners come to build a model for how the timbre of one singer changes across pitch. In other words, listeners may build a unique timbre transformation for each singer or perhaps for each singing category. This is analogous to the uncanny ability of some individuals to connect baby pictures to the correct adult pictures. In both cases, the perceptual invariant must be linked to mechanical constraints. In the former case, the mechanical constraints are based on the physical properties of the laryngeal source and vocal tract.7 In the latter case, the mechanical constraints can be defined in terms of growth patterns.26 Thus, listeners may judge two vocal stimuli as sounding very different and yet maintain that they were produced by the same singer or the same category of singer. Listeners could use their knowledge of a singer’s timbre transformation to discriminate between individual singers or categories of singers, regardless of pitch differences. If this explanation is true, then these results suggest that a task comparing two pitches, although providing sufficient information to distinguish between categories of voices, does not provide the necessary perceptual information to discern timbre transformations across a wide range of pitches that can be used to identify a voice across pitch. Moreover, this explanation helps to explain why previous work using voices at one pitch did not find consistent categories.2

Studies such as this are limited by the combinatorial nature of multidimensional scaling. It was not possible to include more than four singers or more than one vowel. For example, increasing the number of singers in each category from 2 to 3 would have increased the number of paired stimuli from 276 to 630 (n stimuli yield n(n − 1)/2 distinct pairs, so 276 pairs correspond to 24 stimuli and 630 pairs to 36). Thus, although these results do not disprove the hypotheses postulated here, they cannot necessarily be generalized to the larger population of singers. Studies such as this one should be replicated with other singers and singing categories.

If listeners use timbre transformations to help identify sound sources such as voices over varying pitch and loudness, then it is important to understand how listeners construct such transformations. Using a three-note oddball task where two stimuli were produced by the same singer and a third stimulus was produced by a different singer, Erickson et al.27 found that when the stimuli produced by the same singer were separated by more than one octave, listeners were unable to construct a timbre transformation sufficient for identification of the “oddball” stimulus. Future research should focus on the ability of listeners to build such transformations based on the number of same-singer stimuli and the pitch differences between same-singer stimuli.

SUMMARY AND CONCLUSIONS
Traditionally, timbre has been defined as that perceptual attribute that differentiates two sounds when pitch and loudness are equal. By this definition, each voice or category of voices possesses a set of timbres, and the identity of any source across different pitch–loudness combinations, or in the case of singers,

across different pitch–loudness–vowel combinations, must be due to an abstraction of the pattern of timbre transformation. Using stimuli produced across the singing range by singers from different voice types, this study sought to examine how timbre and pitch interact in the perception of dissimilarity. The resulting MDS representations showed that although listeners cannot ignore pitch differences in the perception of dissimilarity, non–pitch-related timbre differences can be perceived within the context of pitch. Such representations of dissimilarity across pitch provide a glimpse of the timbre transformations associated with voice categories. For these data, timbre transformations in dimension 2 were nearly identical for singers of the same voice category, providing strong evidence for the existence of these theoretical categories.

Acknowledgements: The author would like to express her sincere gratitude to Joshua Lacey for his assistance with the experimental procedures and to Stephen Handel for his support, encouragement, and comments.

REFERENCES
1. ANSI. Psychoacoustical Terminology. S3.20. New York: American National Standards Institute; 1973.
2. Bloothooft G, Plomp R. The timbre of sung vowels. J Acoust Soc Am. 1988;84:847–860.
3. Iverson P, Krumhansl CL. Isolating the dynamic attributes of musical timbre. J Acoust Soc Am. 1993;94:2595–2603.
4. Plomp R. Timbre as a multidimensional attribute of complex tones. In: Plomp R, Smoorenburg GF, eds. Frequency Analysis and Periodicity Detection in Hearing. Leiden, The Netherlands: Sijthoff; 1970:397–414.
5. Samson S, Zatorre RJ, Ramsay JO. Multidimensional scaling of synthetic musical timbre: Perception of spectral and temporal characteristics. Can J Exp Psych. 1997;51:307–315.
6. Vennard W. Singing, the Mechanism and the Technique. New York: Fisher; 1967.
7. Cleveland TF. Acoustic properties of voice timbre types and their influence on voice classification. J Acoust Soc Am. 1977;61:1622–1629.
8. Keidar A, Hurtig R, Titze I. The perceptual nature of vocal register change. J Voice. 1987;1:223–233.
9. Lakatos S. A common perceptual space for harmonic and percussive timbres. Percept Psychophys. 2000;62:1426–1439.
10. ASHA. Guidelines for screening for hearing impairments and middle ear disorders. ASHA. 1990;32:17–24.
11. Fant G. Acoustic Theory of Speech Production. The Hague: Mouton; 1960.
12. Dmitriev L, Kiselev A. Relationship between the formant structure of different types of singing voices and the dimensions of the supraglottal cavities. Folia Phoniatr (Basel). 1979;31:238–241.
13. Sundberg J, Högset C. Voice source differences between falsetto and modal registers in counter tenors, tenors and baritones. Log Phon Vocol. 2001;26:26–36.
14. Titze IR. Principles of Voice Production. Englewood Cliffs, NJ: Prentice Hall; 1994.
15. Sundberg J, Gauffin J. Waveform and spectrum of the glottal source. In: Lindblom B, Öhman S, eds. Frontiers of Speech Communication Research, Festschrift for Gunnar Fant. London: Academic Press; 1979:301–320.
16. Gauffin J, Sundberg J. Spectral correlates of glottal voice source waveform characteristics. J Speech Hear Res. 1989;32:556–565.
17. Mori K, Blaugrund SM, Yu JD. The turbulent noise ratio: An estimation of noise power of the breathy voice using PARCOR analysis. Laryngoscope. 1994;104:153–158.
18. Hertegard S, Gauffin J. Voice source-vocal tract interaction during high-pitched female singing. In: Friberg A, Iwarsson J, Jansson E, Sundberg J, eds. Proceedings of the Stockholm Music Acoustic Conference (SMAC93), 79. Stockholm: Royal Swedish Academy of Music; 1993:177–182.
19. Holmberg EB, Hillman RE, Perkell JS, Guiod PC, Goldman SL. Comparison among aerodynamic, electroglottographic, and acoustic spectral measures of female voice. J Speech Hear Res. 1995;38:1212–1223.
20. Klatt DH, Klatt LC. Analysis, synthesis, and perception of voice quality variations among male and female talkers. J Acoust Soc Am. 1990;87:820–857.
21. Sundberg J. Perceptual aspects of singing. J Voice. 1994;8:106–122.
22. Monsen RB, Engebretson AM. The accuracy of formant frequency measurements: A comparison of spectrographic analysis and linear prediction. J Speech Hear Res. 1983;26:87–97.
23. Sandell GJ. Roles for spectral centroid and other factors in determining “blended” instrument pairings in orchestration. Music Percept. 1995;13:209–246.
24. Prame E. Measurements of the vibrato rate of ten singers. J Acoust Soc Am. 1994;96:1979–1984.
25. Prame E. Vibrato extent and intonation in professional Western lyric singing. J Acoust Soc Am. 1997;102:616–621.
26. Pittenger JB, Shaw RE, Mark LS. Perceptual information for the age level of faces as a higher-order invariant of growth. J Exp Psych Human Percept Perform. 1979;5:478–493.
27. Erickson ML, Perry S, Handel S. Discrimination functions: Can they be used to classify singing voices? J Voice. 2001;15:492–502.


Accepted for publication October 17, 2002.

Journal of Voice, Vol. 17, No. 2, 2003


Inexperienced Listeners 1 0.7507 0.6260
