Journal of Psycholinguistic Research, Vol. 25, No. 2, 1996

A Prosody Tutorial for Investigators of Auditory Sentence Processing

Stefanie Shattuck-Hufnagel 1,3 and Alice E. Turk 2

In this tutorial we present evidence that, because syntax does not fully predict the way
that spoken utterances are organized, prosody is a significant issue for studies of
auditory sentence processing. We describe the basic elements and principles of current
prosodic theory, review the psycholinguistic evidence that supports an active role for
prosodic structure in sentence representation, and provide a road map of references
that contain more complete arguments about prosodic structure and prominence.
Because current theories do not predict the precise prosodic shape that a particular
utterance will take, it is important to determine the prosodic choices that a speaker
has made for utterances that are used in an auditory sentence processing study. To
this end, we provide information about practical tools such as systems for signal
display and prosodic transcription, and several caveats which we have found useful to
keep in mind.

We thank the following for discussions of specific points or for their comments on
portions of earlier drafts, which improved the paper substantially: Ann Bradlow, Ronnie
Cann, Miriam Eckert, Merrill Garrett, Caroline Heycock, Pat Keating, Sharon Manuel,
Janet Nicol, Lisa Selkirk, Mark Steedman, and two anonymous reviewers.
1 Speech Communication Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.
2 Department of Linguistics, University of Edinburgh, Edinburgh, United Kingdom EH8
3 Address all correspondence to Stefanie Shattuck-Hufnagel, Speech Communication
Group, Research Laboratory of Electronics, 36-511 MIT, 77 Massachusetts Avenue,
Cambridge, Massachusetts 02139, or stef@speech.mit.edu.

© 1996 Plenum Publishing Corporation

Spoken utterances have characteristics that their written counterparts do not,
and it is a reasonable assumption that these characteristics have consequences for perceptual processing. As interest in the auditory processing of
spoken sentences increases, it becomes important to understand these characteristics. The past several decades have seen the emergence of theories
that attempt to account for patterns of intonation, timing and even variations
in segmental implementation in spoken utterances, in terms of a hierarchy
of prosodic constituents and prominences. This proposed prosodic structure
is separate from the morphosyntactic structure of the utterance, although
influenced by it. Several years ago, we set out to educate ourselves on the
literature that forms the basis of these claims: the linguistic theories that lay
out a hierarchy of prosodic constituents and prominences, the phonological
arguments that support the theoretical claims, and the quantitative behavioral
evidence that tests their relevance for speech and sentence processing models.
Our purpose is to summarize what we have learned in these three areas, in a
way that will be useful to others investigating spoken language processing.
Among the major lessons we have absorbed are:
• The morphosyntactic hierarchy influences the signal indirectly, via the constraints it imposes on the choices that the speaker makes among the prosodic
  possibilities for a given utterance;
• These prosodic choices are also influenced by many other factors;
• For this reason, the prosody of a particular utterance of a sentence cannot
  be predicted reliably from the text alone; thus, it is necessary to determine
  the prosodic structure that the speaker actually used for each particular
  spoken utterance;
• Current prosodic theory and transcription systems provide the tools to specify this structure, at least in a preliminary way that has already led to many
  useful insights.
The view that prosodic structure arises via the prosodic component of
the grammar, is influenced by a number of extrasyntactic factors, and (in
concert with the segmental specification of the words of the utterance) directly determines the acoustic-phonetic realization of an utterance, raises a
number of critical issues. These include how to formally represent the prosodic aspects of phonological structure, how these structures influence the
phonetic dimensions of F0, duration, amplitude and segmental quality, how
the various factors that influence the speaker's choice of prosodic structure
for a given utterance are weighted, and the mechanism by which they interact. In this tutorial we will not provide definitive answers to these questions, but will review information that is useful in thinking about them. We
begin in Section 2 with the claim that the organizational structure of an
utterance is not identical to its syntactic structure. Section 3 is a summary
and comparison of several proposed prosodic hierarchies in the literature,
along with evidence for each prosodic element. Section 4 presents evidence
that bears specifically on the hierarchical arrangement of the constituents,
including the results of studies that directly compare the efficacy of syntactic
and prosodic structure in accounting for acoustic-phonetic measures of spoken utterances. Section 5 addresses prosodic transcription, extra-syntactic
factors that influence prosodic choices, and the nature of the prosodic component of the grammar. The transcription system we describe, the recently proposed ToBI system, has proven useful in capturing the prosodic structures
behind spoken utterances in a number of recent studies. We conclude with
four Helpful Hints for taking prosody into account in auditory sentence
processing research.
A great many aspects of prosody are not dealt with in this review. For
example, we have almost nothing to say about rhythm, lexical tone, emotional and attitudinal components, or semantic and pragmatic effects. Instead, we limit ourselves to arguments that show that syntax does not fully
predict the way utterances are organized, to theories of the prosodic hierarchy that provide an alternative grammar of spoken utterance organization,
and to evidence for this proposed hierarchy, along with practical advice for
investigators interested in avoiding some of the potential pitfalls that lie in
wait for those who adventure in the prosodic jungle. Of necessity, this survey
is incomplete, since many publications of great interest cannot be cited because of space constraints, and others are appearing almost daily. But we
hope this tutorial will serve as a road map to guide the interested reader to
further sources of information.


Intuitively, spoken utterances appear to have structural constituents, i.e.,
to be divided into phrases. In this section we summarize the general correspondences between prosodic and syntactic structure revealed by early studies (2.1), examine the discrepancies that emerged from further analysis (2.2),
and touch on some of the proposals that have been made for mapping syntactic structure onto the prosodic patterns of constituency, prominence, intonation and timing observed in spoken utterances (2.3). First, however, we
address the problem of defining prosody.
Defining Prosody

A universally acceptable definition of prosody has been elusive. Some
definitions refer to the acoustic parameters whose variation is believed to
signal constituent boundaries and prominence: F0, duration, amplitude and
segment quality or reduction. However, this definition is problematical, since
these parameters also vary systematically from one segment to another in the
same prosodic context. Others use the term to refer to the phonological organization of segments into higher-level constituents and to the pattern of
relative prominences within these constituents. For example, proposed prosodic constituent hierarchies include such elements as intonational phrases,
prosodic phrases, prosodic words, clitic groups, metrical feet, etc., and proposed hierarchies of relative prominence include such contrasts as nuclear vs.
pre-nuclear pitch accents at the phrasal level, and full-vowel syllables vs.
reduced syllables at the lexical level. A third class of definition merges the
phonetic and phonological aspects of prosody, including both the higher level
organization, with its constituent boundaries and prominences, and the phonetic reflexes of this organization in the pattern of F0, duration, amplitude
and segment quality/reduction within an utterance. The evidence laid out in
this paper and elsewhere has convinced us that an acceptable definition should
include the notion of prosodic structure as an abstract entity, associated with
a separate component of the grammar, and that this component must integrate
various types of information to determine the appropriate prosodic shape of
a spoken utterance. As a working definition, we specify prosody as both (1)
acoustic patterns of F0, duration, amplitude, spectral tilt, and segmental reduction, and their articulatory correlates, that can be best accounted for by
reference to higher-level structures, and (2) the higher-level structures that
best account for these patterns. We subscribe to Beckman's (1996) observation that prosody is "a complex grammatical structure that must be parsed in
its own right"; it is "the organizational structure of speech."
2.1. General Correspondence Between Syntax and Prosody

A natural early hypothesis in generative grammar, and in the psycholinguistic studies that grew out of it in the 1960s and '70s, was that the
structural constituents of spoken utterances correspond to those predicted by
the syntax. Many studies showed that major acoustic-phonetic phenomena,
such as intonational boundaries, preboundary lengthening and pausing, tend
to occur at major syntactic boundaries, and that some aspects of phrasal
prominence patterns can also be predicted from morphosyntactic properties
(Brown & Miron, 1971; Goldman-Eisler, 1972; Klatt, 1976; Cooper & Paccia-Cooper, 1980; Chomsky and Halle, 1968).
Further evidence of the syntax-prosody link is provided by the fact that
some syntactic ambiguities can be disambiguated by the placement of spoken boundaries (Lehiste, 1974; Lehiste et al., 1976; Price et al., 1991). This
is particularly true for bracketing ambiguities like:
(1) (old) (men and women) vs. (old men) (and women),
(2) (a + b) (c) vs. (a) + (b c)
(for evidence see e.g., Streeter 1978), although less true for other types of
ambiguity involving grammatical relations or grammatical function, such as
the infamous:
(3) Visiting firemen can be a nuisance.
(4) The shooting of the hunters was terrible.
The ability of speakers to disambiguate some forms of syntactic ambiguity
by prosodic means shows that syntax imposes some constraints on prosodic
structure.
Another line of evidence comes from the fact that some syntactic constituents obligatorily require a particular intonational constituent; e.g., parentheticals, tags, and nonrestrictive relatives and other appositives are
produced with their own intonational phrase. Selkirk (1978, 1981:137) gives
examples such as
(5) In Pakistan, Tuesday, which is a weekday, is, Jane said, a holiday.
The first, third, and fifth word strings between punctuation marks in
this example are obligatory intonational phrases. Similarly, some prosodic
structures are ruled out for certain syntactic forms, as shown by the unacceptability of examples like (6c), which indicate that surface syntax imposes
constraints on the way an utterance is organized.
(6) (a) George and Mary give blood. (one intonational phrase)
(b) George and Mary, give blood. (two intonational phrases)
(c) *George, and Mary give blood. (two intonational phrases)
(d) George, and Mary, give blood. (three intonational phrases)

Finally, the form class of a word, traditionally regarded as a syntactic dimension, has been shown to affect its durational pattern (Sereno and Jongman 1995) as well as its prominence pattern (Kelly 1989).
Early studies of sentence processing, most carried out in English,
amassed a substantial body of evidence supporting the effects of surface
syntactic structure on various measures of perceptual processing; for a summary see Fodor, Bever and Garrett (1974). For example, Ladefoged and
Broadbent (1960) showed that listeners misperceive the location of a click
presented simultaneously with speech, in a way that suggests resistance to
interruption of constituents. Garrett et al. (1966) extended this method to
show that major syntactic boundaries influence the perceived location of the
click even when prosodic cues are neutralized.
Although initial evidence seemed to support the claim that syntactic
structure has a direct effect on the phonological and acoustic-phonetic shape
of an utterance, as well as on other forms of language behavior, as investigators began to examine larger corpora of utterances actually produced by
speakers, they found notable discrepancies with the results predicted by syntax. As the examples in the following section show, traditional morphosyntax is not isomorphic with the organizational structure of spoken utterances.
2.2. Discrepancies Between Syntactic and Prosodic Structure

We present two kinds of arguments that suggest discrepancies between
the syntactic parsing of an utterance and facts such as the location of perceptible or acoustically measurable boundaries. The first is that not all aspects of syntax, or even all important syntactic contrasts, are signalled in
the structure of a spoken utterance (Section 2.2.1.). The second is that many
aspects of spoken utterances cannot be predicted from the traditional morphosyntactic structure of their underlying sentences, suggesting that other
factors play a role (2.2.2.).

2.2.1. Some Important Aspects of Syntax Are Not Reflected in Prosody

As we have seen, prosodic structure can sometimes disambiguate a
syntactically-ambiguous word string. Thus, it might be a reasonable assumption that syntactic differences which distinguish one interpretation from
another are signalled by prosody. Lehiste (1973), Price et al. (1991) and
others have shown that sentences like those in (7), (8), and (9) below are
disambiguated for listeners by the speaker's choice of one prosodic structure
over another. For these pairs, the differences in syntactic structures are regularly reflected in prosodic differences of tune and phrasing. (Numbered
example utterances cited in this paper are available via anonymous ftp from
(7) Object noun phrase plus vocative vs. Complex noun phrase
(a) I'll take the eggs, Benedict.
(b) I'll take the Eggs Benedict.
(8) Left vs. right attachment
(a) When you learn gradually, you worry more.
(b) When you learn, gradually you worry more.
(9) Parenthetical main-main clause vs. nonparenthetical main-subordinate
(a) Mel knew, by the way, you were driving.
(b) Mel knew, by the way you were driving.
However, some critical syntactic distinctions cannot be systematically
resolved by differences in prosodic treatment. For sentences like those in
(10) there is no systematic difference in prosody, as measured by the ability
of listeners to determine the speaker's intended sentence. Table I, from Price
et al. (1991), shows the variation in disambiguatability across ambiguity types.

Table I. Percent of responses appropriate to one of the two possible interpretations of
syntactically ambiguous sentences, in response to (a) a spoken rendition produced in a
context appropriate for that interpretation, and (b) a spoken rendition produced in a
context appropriate for the other interpretation. Chance performance = 50%; some
structures are more reliably disambiguated than others (see Price et al., 1991 for
explanation of the ambiguity types and the list of all stimulus sentences used).
Columns: Judged (a) syntax in response to (a) prosody; Judged (a) syntax in response to (b) prosody.
Rows: Main-main vs. main-subord clause; +/- Tag Q; Far vs. near attachment; Left vs. right attachment; Particles vs. prepositions.

(10) Far vs. near attachment
Andrea moved the bottle under the bridge.
(a) bottle was elsewhere, was moved to location under bridge
(b) bottle was under bridge, was moved
John ran away with the girl wearing a blue coat.
(a) John was wearing blue coat
(b) girl was wearing blue coat
There is currently an intense debate about which aspects of syntax are
signalled in the organizational structure of utterances, and which are not.
We will provide some signposts to this discussion in 2.3. below. First, however, we discuss further the lack of isomorphism between syntactic and
prosodic organization.

2.2.2. Some Aspects of Spoken Utterance Structure Cannot Be Predicted from the Morphosyntax

In the studies discussed above in 2.1, the nearly-unanimous syntactic
phrase structure judgments of practicing linguists did not predict all the
prosodic behavior of language users. For example, Brown and Miron (1971)
measured pause durations in a passage read by a professional talker, and
were able to predict only 64% of the variance in pause duration from syntactic parsings of the passage. In fact, there are often a variety of prosodic
possibilities for the utterance of a sentence, and in some cases, these well-formed
prosodic structures appear to violate syntactic structure. We will
review the evidence first that speakers have choices, then that some of them
are not isomorphic with surface syntactic structure, and finally that other
factors play a role in determining the speaker's choice.
i. Speakers Have Multiple Prosodic Choices for a Given Syntax. A
given syntactic string can be parsed into spoken phrases in several ways. In
some cases, utterances with the same syntax are produced with different
spoken constituent boundaries (nb: we restrict ourselves here to examples
from English which can be conveyed relatively straightforwardly with traditional orthography):
Same Syntax, Different Phrasing.
(11) (a) Rebecca won their support. (one intonational phrase)
(b) Rebecca, won their support. (two intonational phrases)
(c) Rebecca won, their support. (two intonational phrases)
The range of acceptable forms illustrates that syntax does not predict
which of the available choices a speaker will select. Jun (1993) describes
similar phenomena in Korean. At best, syntactic structure may define a set
of options and constraints; apparently other factors also play a role.
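The size of the space from which a speaker chooses can be made concrete with a small sketch. The function below (the name `all_phrasings` is ours, introduced only for illustration) enumerates every way of dividing a word string into contiguous phrases. It makes no claim about which divisions are well-formed: of the eight logically possible phrasings of the string in (11), only three are listed above as attested options.

```python
from itertools import combinations


def all_phrasings(words):
    """Return every division of `words` into contiguous, non-empty phrases."""
    n = len(words)
    phrasings = []
    # A boundary can fall after any word except the last: n - 1 possible slots.
    for k in range(n):  # k = number of phrase-internal boundaries
        for cuts in combinations(range(1, n), k):
            edges = (0, *cuts, n)
            phrasings.append([words[a:b] for a, b in zip(edges, edges[1:])])
    return phrasings


words = ["Rebecca", "won", "their", "support"]
for phrasing in all_phrasings(words):
    print(" | ".join(" ".join(phrase) for phrase in phrasing))
```

For a string of n words the sketch yields 2^(n-1) candidates, which is why some combination of syntactic constraints and other factors is needed to narrow the set.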
Other variations that point to extra-syntactic effects on the prosodic
shape of an utterance include the placement of phrase-level prominences
and selection of tune type. As the examples in (12), (13), (14), and (15)
show, the placement of prominences and the choice of intonational contours
in English are not predicted by the morphosyntactic structure of an utterance.
Same Syntax, Different Tunes.
(12) (a) Are you coming to the PARty tonight? (rising intonation)
(b) Are you coming to the PARty tonight. (falling intonation)
(13) (a) Is it ALways this way? (high F0 on accented syllable)
(b) Is it ALways this way? (low F0 on accented syllable)

Same Syntax, Different Prominence Locations.
(14) (a) He gave the book to JOHN.
(b) He gave the BOOK to John.
(c) He GAVE the book to John.
(d) He GAVE the book to JOHN.
(15) (a) It's alRIGHT. (one pitch accent, i.e., phrase-level F0-cued prominence)
(b) It's AL-RIGHT! (two pitch accents)
(c) IT'S AL-RIGHT! (three pitch accents)

These examples show that speakers have options for the prosodic treatment
of a given syntactic structure. This observation itself suggests that syntax
does not entirely determine prosody. Even more compelling, however, is the
fact that some of these options divide an utterance into constituents which
appear to violate surface syntactic structure.
ii. Some Well-Formed Prosodic Choices Appear To Violate Syntactic
Structure. One way to derive a set of prosodic options would be by stipulating a condition which parses a spoken string exhaustively into any of the
set of well-formed syntactic constituents. However, if a parsing constraint
exists, it must be weakened to allow for the possibility that at least some of
the prosodic constituents do not correspond to well-formed syntactic constituents. A notorious example is the sentence:
(16) Sesame Street is brought to you by the Children's Television Workshop.
In utterances of this sentence, the 'by' can be grouped either with its following noun phrase (forming two syntactically well-motivated phrases):
(16a) Sesame Street is brought to you, by the Children's Television Workshop,
or with the preceding phrase,
(16b) Sesame Street is brought to you by, the Children's Television Workshop
in which case the left constituent does not correspond to a well-motivated
constituent in traditional syntactic theories (see Section 2.3.).
Gee & Grosjean (1983) provide a further illustration of this point. They
measured pause duration in slowed speech, and interpreted the results in
terms of hierarchical performance structures. In these structures, function
words were often grouped with adjacent content words, even in cases where
there was a major syntactic constituent boundary between the function word
and the following content word (as in the case of a subject pronoun followed
by a verb). For example, speakers' pauses divided the clause:
(17a) He brought out the objections . . .
into the two constituents
(17b) He brought out | the objections . . .
which do not correspond to the two major surface syntactic phrases of subject NP and VP. Such examples suggest that the organizational boundaries
in an utterance do not delineate the syntactic constituents of the underlying
sentence in a rigid way, either by consistently marking the same types of
syntactic boundaries or by consistently parsing the word string into well-formed syntactic constituents of various types.
Syntax also cannot provide an account of the effects of rate, length,
and symmetry on constituent boundary location, illustrated by the examples
in the next section.


iii. Evidence that Extra-Syntactic Factors Influence Prosodic Structure.

A particularly striking example which illustrates the role of non-syntactic
factors in determining prosodic constituents comes from the work of Cheng
on Chinese (1970, 1973, cited in Shih, 1986). The sentence shown in Fig. 1
has the English translation:
(18) Old Li buys good wine.
This sentence was produced in slow speech as three constituents, corresponding to:
(18a) (Old Li) (buys) (good wine)
which echo the NP-V-NP structure of the sentence. But in more rapid
speech, this sentence was produced differently, divided into constituents corresponding to
(18b) (Old Li buys) (good wine)
which are less well motivated by the syntax.
Evidence for the different constituent structure of slowly vs. rapidly
spoken utterances of this sentence is the pattern of application of the rule
that turns an underlying low tone into a rising tone when followed by a low
tone in the same constituent. This rule operates on old in slow speech but
on both old and Li in more rapid speech, showing that buys must be in the
same constituent as Li in the rapid rendition. Jun (1993) reports parallel
phenomena in Korean.

Tone key: 3 = low; 2 = rising. Rule: 3 -> 2 / __ 3 in same constituent.
Fig. 1. Comparison of tones produced in slow and fast renditions of this phrase in
Mandarin Chinese. The rule changing Tone 3 to Tone 2 when followed by a 3 in the
same constituent applied to Old but not to Li in the slow version, indicating a constituent
boundary after Li, but to both Old and Li in fast speech (but not to buys), indicating a
constituent boundary after buys. (after Cheng, 1970, 1973).
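The diagnostic logic of the tone rule can be sketched in a few lines. The sketch rests on two assumptions that are ours rather than Cheng's: that every word of the sentence carries an underlying Tone 3, and that the rule reads underlying tones only (all eligible syllables change at once), which is a simplification. The function name `apply_third_tone_rule` is hypothetical.

```python
def apply_third_tone_rule(constituents):
    """Apply 3 -> 2 / __ 3 inside each constituent, reading underlying tones.

    `constituents` is a list of constituents, each a list of (word, tone) pairs.
    The rule never looks across a constituent boundary.
    """
    output = []
    for constituent in constituents:
        surface = []
        for i, (word, tone) in enumerate(constituent):
            followed_by_3 = i + 1 < len(constituent) and constituent[i + 1][1] == 3
            surface.append((word, 2 if tone == 3 and followed_by_3 else tone))
        output.append(surface)
    return output


# Assumption: all five words have underlying Tone 3 (low).
slow = [[("Old", 3), ("Li", 3)], [("buys", 3)], [("good", 3), ("wine", 3)]]
fast = [[("Old", 3), ("Li", 3), ("buys", 3)], [("good", 3), ("wine", 3)]]

for label, phrasing in (("slow", slow), ("fast", fast)):
    tones = [tone for c in apply_third_tone_rule(phrasing) for _, tone in c]
    print(label, tones)
```

Under these assumptions the slow phrasing surfaces as 2 3 3 2 3 and the fast phrasing as 2 2 3 2 3. Only the tone of Li differs, and that difference is what locates the constituent boundary after Li in slow speech and after buys in fast speech.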
Another factor that seems to play a role is symmetry, or a balance
between the length of subconstituents of an utterance. For example, Gee and
Grosjean (1983) and Grosjean et al. (1979) report that talkers placed a performance constituent boundary in the middle of a syntactic constituent when
this resulted in a more equal partition of the elements of the utterance. That
is, the length of the prosodic subparts of a spoken utterance was determined
in part not by the syntactic structure, but by a tendency to divide the spoken
utterance into equal parts.
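A purely length-based preference of this kind can be sketched as follows. This is an illustration of the balancing tendency only, not the procedure used by Gee and Grosjean; the function name `most_balanced_split` and the choice to measure length in words (rather than syllables, for instance) are our assumptions.

```python
def most_balanced_split(words):
    """Return the index i that makes words[:i] and words[i:] closest in length."""
    if len(words) < 2:
        raise ValueError("need at least two words to place a boundary")
    # Length is counted in words here; ties go to the earliest boundary.
    return min(range(1, len(words)), key=lambda i: abs(i - (len(words) - i)))


words = ["Rebecca", "won", "their", "support"]
i = most_balanced_split(words)
print(" ".join(words[:i]), "|", " ".join(words[i:]))
```

For the string in (11) the balanced boundary falls between the verb and its object, as in (11c), rather than at the subject NP/VP boundary that syntax alone would favor.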
Another demonstration that syntax can't predict all aspects of prosody
comes from non-syntactically-structured word lists. Talkers can 'prosodify'
such lists; that is, they naturally group words into pause-delimited constituents (Suci, 1967), and some factor(s) other than syntax must be determining
these prosodic patterns. However, Suci's further finding that intertalker variability is substantially less for syntactically organized utterances than for
word lists suggests that, even though syntactic structure does not predict all
aspects of prosody, it nonetheless provides a powerful constraint on the
range of ways in which a sentence can be prosodically treated by a speaker.
Some of the discrepancies with syntactic structure that arise in prosody
also displayed themselves in studies of other forms of language behavior.
For example, Martin (1970) found that subjects asked to parse visually presented sentences into 'natural' subgroups produced hierarchical structures,
but did not always parse the sentences following traditional syntactic lines,
showing "differences between prescriptive and subjective sentence organization" (p. 159). In particular, while sentences are thought to follow a S
(VO) syntactic structure in English, subjects often gave (SV) O parsings.
This result is reminiscent of the performance structures obtained by Gee and
Grosjean (1983) from the pause duration patterns observed in slowed speech.
In 2.1. and 2.2. we have seen that, despite a general correspondence
between the surface syntactic structure of a sentence and the prosody of an
utterance of that sentence, there are many discrepancies between the two
sets of structures. It appears that extrasyntactic factors as well as syntactic
factors influence the speaker's choice of prosodic shape for an utterance.
What, then, is the best way to characterize the relationship between syntax
and prosody? This question is currently the topic of lively debate and extensive research; we address it briefly in the final portion of this section.


2.3. The Prosody/Syntax Mapping

If the organizational boundaries in an utterance are not isomorphic with
the syntactic constituents of the underlying sentence in a rigid way, either
by consistently marking the same types of syntactic boundaries or by consistently parsing the word string into well-formed syntactic constituents of
various types, what is the relationship between the syntactic and prosodic
structures of an utterance? One alternative is embodied in Selkirk's theory
of edge alignment (Selkirk, 1986, 1993; Hale & Selkirk, 1987). In current
versions of her theory, certain prosodic boundaries are constrained to align
with certain syntactic boundaries. This type of constraint (see Section 3 for
examples) has been expressed in an optimality theoretic framework (Prince
& Smolensky, 1993; McCarthy & Prince, 1995). In Optimality Theory, phonological tendencies are expressed as a set of ranked violable constraints,
whose application to the set of all possible forms yields a single optimal
output which violates the fewest and least important (lowest ranked) constraints (see Section 3 for further discussion). Optimal outputs will vary
between languages depending on the language specific ranking of the (universal) set of constraints. We do not know whether optimization plays a
similar role in the production processing mechanism for specific utterances,
i.e., whether the different factors that influence the prosodic structure of an
utterance are integrated by a similar mechanism of violability rankings. But
if this is the case, then different ways of prosodifying a given syntactic string
might be expected under different conditions. For example, in some cases
the non-syntactic factors that influence the prosodic structure of a given
utterance may be differently ranked. 1
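The evaluation step of such a system can be sketched as a lexicographic comparison of violation counts, taken in ranking order. Everything specific in the sketch is a toy assumption introduced for illustration: the two constraints (`align_subject`, `fewest_phrases`) are hypothetical stand-ins rather than constraints proposed in the literature cited here, and the candidate set is restricted to the two phrasings in (18a) and (18b) rather than all possible forms.

```python
def align_subject(candidate):
    """Hypothetical constraint: 1 violation unless some phrase ends at the
    right edge of the subject NP (here, the word "Li")."""
    return 0 if any(phrase[-1] == "Li" for phrase in candidate) else 1


def fewest_phrases(candidate):
    """Hypothetical constraint: 1 violation per prosodic phrase."""
    return len(candidate)


def optimal(candidates, ranking):
    """Return the candidate with the lexicographically smallest violation
    profile, with constraints ordered from highest to lowest ranked."""
    return min(candidates, key=lambda c: tuple(con(c) for con in ranking))


candidates = [
    (("Old", "Li"), ("buys",), ("good", "wine")),   # as in (18a)
    (("Old", "Li", "buys"), ("good", "wine")),      # as in (18b)
]

print(optimal(candidates, [align_subject, fewest_phrases]))
print(optimal(candidates, [fewest_phrases, align_subject]))
```

With the alignment constraint ranked higher, the three-phrase candidate wins; with the ranking reversed, the two-phrase candidate wins. The sketch thus shows how a reranking of the same constraints could yield different prosodic shapes for one syntactic string, without implying that speaking rate is actually encoded this way.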
Steedman (1991; see also Prevost and Steedman, 1994) has taken a
different tack. He proposes a syntax based on Categorial Grammar that defines a freer notion of surface syntactic structure, while still delivering compositional semantic interpretations for the sentence and each non-standard
constituent. This theory was originally proposed as a syntactic account of
coordinate structure and relativization, but Steedman argues that the range
of well-formed prosodic alternatives for a given syntactic structure simply
reflects the constituent structure that the Categorial Grammar makes available in order to account for such constructions, which must be captured
under any theory of grammar. For a critique of this approach, see Beckman
Jun (1993: chapter 1) discusses three possibilities for the relation between syntax and prosodic constituent structure: direct reference to syntax,
indirect reference to syntax, and syntax-independent definition of prosodic
constituents, opting for the last in the determination of Accent Phrases and
Intonation Phrases in Korean. Readers are referred to volumes such as Selkirk (1984), Ewen and Anderson (1987), Inkelas and Zec (1990) and Kaisse
(1985) for discussions of this complex issue.

1 This view raises the question of whether all the factors that appear to influence prosodic
choices, including speaking rate and constituent length, function in the grammar in the
same way (see Section 5.3. for a brief discussion).

In sum, although syntax imposes certain constraints on prosody, other
factors must be invoked to account for which aspects of syntax can and
cannot be violated, which aspects can and cannot be signalled, and which
aspects the speaker chooses to signal or not to signal in a particular utterance, as well as to account for elements of spoken utterances for which
syntax does not provide a prediction. The postulated prosodic component of
the grammar provides a candidate for the organizational structure underlying
spoken utterances that should help provide answers to these questions, and
a locus for the integration of factors from various components that have an
effect on the shape of this structure for any given utterance. In Section 3,
we describe the theory of the prosodic hierarchy.

3. THEORIES OF THE PROSODIC HIERARCHY OF


We begin this section with a general discussion of the prosodic hierarchy (3.1.), and then turn to proposals for specific prosodic constituents
(3.2.) and proposals for the hierarchy of prosodic prominences (3.3).
3.1. The Hierarchical Organization of Prosodic Constituents

With the advent of modern prosodic theories, beginning with Liberman's (1975) and Liberman and Prince's (1977) seminal work on metrical
theory, and with the integration of metrical theory and intonational theory
in a single component of the grammar, it became possible to consider the
role of specific proposals for prosodic structure. We will review four well-known prosodic hierarchies proposed by Nespor and Vogel (1986), Hayes
(1989), Beckman and Pierrehumbert (1986) and Pierrehumbert and Beckman
(1988), and Selkirk (1978; 1986 and to appear). These theories (Fig. 2) share
the view that prosodic structure consists of a hierarchy of labelled constituents, and they give an idea of the range of possibilities in the literature.
Additional important views, such as those of Halle and Vergnaud (1987),
Gussenhoven (1984), Jun (1993), Hayes (1995), and others will be referred
to in the discussion. Note that we present Selkirk's current view, which has
evolved somewhat from the widely-known version in her book, Phonology
and Syntax (1984) and more closely resembles her earlier formulation in
Selkirk (1978, 1981).

Fig. 2. Prosodic constituent hierarchies from the literature; additional important theories,
such as those of Halle and Vergnaud (1987), Liberman (1975), Liberman and Prince
(1977), Gussenhoven (1988) and others are discussed in the text.
[Figure not reproduced; surviving labels include Nespor and Vogel, Full Intonational Phrase, Major Phrase, and Intermediate Phrase.]

Fig. 3. Prosodic constituent boundaries for the utterance illustrated acoustically in Fig.
4a; digitized examples of critical utterances from the paper are available via anonymous
ftp from lexic.mit.edu.
[Figure not reproduced; surviving labels include (Full) Intonational Phrase and the words "Massachusetts Supreme".]

In general, there is good agreement over the constituents near the top
of the hierarchy, i.e., at the level of some type of intonationally defined
constituent like the (Full) Intonational Phrase.² Definitional agreement is also
good at the lowest levels of the hierarchy, i.e., for the Syllable and the Mora,
which are defined within a lexical word. However, correspondences are less
clear at the midlevels of the hierarchy, where intonational constituents meet
sublexical constituents. Confusion can arise when the same terms are used
by different theorists to refer to slightly different constituents, different terms
are used to refer to constituents at the same level in the hierarchy, and
constituents present in one proposed hierarchy are missing from others.
Some of these terminological problems are illustrated in Fig. 2. A prosodic
tree illustrating the prosodic constituents for a particular utterance from the
Boston University FM Radio News Corpus (Ostendorf et al., 1995) is shown
in Fig. 3. Before turning to a discussion of these issues, we will present one
of the fundamental differences between prosodic and traditional syntactic
hierarchies: i.e., the relationship between constituents at adjacent levels.

Formalization of Hierarchical Structure as the Strict Layering Hypothesis


The view that prosodic structure is hierarchical has been formalized in
linguistics in terms of the Strict Layering Hypothesis. In general, a constituent at one level in the hierarchy is thought to be composed exclusively of
one or more constituents from the next level down in the hierarchy, e.g.,
Full Intonational Phrases are composed of Intermediate Intonational Phrases
(Beckman & Pierrehumbert, 1986), Feet are composed of Syllables, and so
forth. The Strict Layering Hypothesis (Selkirk, 1984; Nespor & Vogel, 1986;
Selkirk, to appear; Hayes, 1984) states that:
The categories of the Prosodic Hierarchy may be ranked in a sequence C1,
C2, ... Cn such that
a) all segmental material is directly dominated by the category Cn, and
b) for all categories Ci, i ≠ n, Ci directly dominates all and only constituents
of the category Ci+1 (Hayes, 1984: 204)
In other words, a prosodic category of one level is exhaustively parsed
into constituents of the next-lower level, and those next-lower-level constituents are all of the same type. This contrasts with syntactic structure, where
a constituent may be parsed into constituents of many different types at the
next level down, including constituents of its own type.
2 Gussenhoven (1992) argues that intonationally defined constituents should not be part
of the prosodic hierarchy per se, proposing that they are defined in separate terms. In
his view, intonationally defined constituents called tone words are mapped onto the
prosodic hierarchy, where they cannot consistently be mapped onto any particular prosodic constituent (1992:89).



While the Strict Layering Hypothesis appears to hold in the great majority of cases, it is thought that some prosodic constituents may occasionally
directly dominate constituents two or even three levels down in the hierarchy, e.g., Prosodic Words may directly dominate Feet, 1 level down, along
with (optionally) Syllables, 2 levels down (Inkelas, 1989; Selkirk, to appear
and references). Selkirk (to appear) presents a case for the Minor Phrase
directly dominating a Prosodic Word as well as a Syllable (see Section
3.2.3.). Likewise, in some views, constituents are thought to directly dominate other constituents of the same type: e.g., Ladd (1986) argues that sentences which contain two or more Intonational Phrases have their own
intonation contour which is more than just the sum of the two Intonational
Phrase contours. Ladd proposes that an Intonational Phrase can directly dominate another Intonational Phrase, so that Intonational Phrases can be nested
recursively. Also, a Prosodic Word can dominate another Prosodic Word in
Selkirk (to appear) and in Inkelas (1989).
The exceptions to Strict Layering cited above suggest that this principle
is not an exceptionless rule, but instead may be at most a strong tendency.
In order to allow for the types of cases where Strict Layering does not appear
to apply, the Strict Layering Hypothesis has been broken down into a set of
four violable constraints within an optimality theory framework (Ito and
Mester, 1992); see 2.3. The four constraints are listed below as they appear
in Selkirk (to appear: 3).
(a) Headedness: A constituent of level Cj in the Prosodic Hierarchy must
dominate a constituent of level Cj-1 (i.e., of the next level down).
(b) Layeredness: A constituent of level Cj in the Prosodic Hierarchy may not
dominate a constituent of level Cj + n (i.e. of a higher level).
(c) Exhaustivity: A constituent of level Cj in the Prosodic Hierarchy may not
dominate a constituent of level Cj - (1 + n) (i.e., of more than one level
down). This constraint would be violated in the case that a Prosodic Word
dominates a Syllable, for example.
(d) Nonrecursivity: A constituent of level Cj in the Prosodic Hierarchy may
not dominate a constituent of the same level Cj.
While Headedness and Layeredness are not known to be violated, Exhaustivity and Nonrecursivity do appear to be violable. (19) shows a violation of Exhaustivity proposed in Selkirk (to appear); (20) shows a violation
of Nonrecursivity proposed in Ladd (1986).

[Examples (19) and (20): tree diagrams not recoverable; the surviving node labels are two Syl nodes. Surviving legend:]
MiP = Minor Phrase; see 3.2.1.
MP = Major Phrase, analogous to Beckman and Pierrehumbert's (1986) Full
Intonational Phrase
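
The four constraints listed in (a) through (d) can be illustrated with a small tree checker. The following Python sketch is only an illustration under stated assumptions: the category inventory and its ranking in LEVELS are hypothetical (theories differ on both), Headedness is checked over all dominated nodes, and the other three constraints are checked over immediately dominated nodes only. Nodes without children are treated as unexpanded and are not checked.

```python
from dataclasses import dataclass, field

# Hypothetical ranking, highest level first; the inventory is theory-dependent.
LEVELS = ["IntonationalPhrase", "MajorPhrase", "MinorPhrase",
          "ProsodicWord", "Foot", "Syllable"]
RANK = {name: i for i, name in enumerate(LEVELS)}


@dataclass
class Node:
    category: str
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node dominated, directly or indirectly, by `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)


def violations(node):
    """Collect (constraint, parent category, child category) over the tree."""
    found = []
    if not node.children:  # unexpanded nodes are not checked (assumption)
        return found
    parent = RANK[node.category]
    # Headedness: some dominated node must sit exactly one level down.
    if all(RANK[d.category] != parent + 1 for d in descendants(node)):
        found.append(("Headedness", node.category, None))
    for child in node.children:
        rank = RANK[child.category]
        if rank < parent:        # child belongs to a higher level
            found.append(("Layeredness", node.category, child.category))
        elif rank == parent:     # child belongs to the same level
            found.append(("Nonrecursivity", node.category, child.category))
        elif rank > parent + 1:  # child is more than one level down
            found.append(("Exhaustivity", node.category, child.category))
        found.extend(violations(child))
    return found


# A Prosodic Word directly dominating a Foot and a stray Syllable.
word = Node("ProsodicWord", [
    Node("Foot", [Node("Syllable"), Node("Syllable")]),
    Node("Syllable"),
])
print(violations(word))

# Nested Intonational Phrases, in the spirit of Ladd (1986).
nested = Node("IntonationalPhrase", [
    Node("IntonationalPhrase", [Node("MajorPhrase")]),
    Node("IntonationalPhrase", [Node("MajorPhrase")]),
])
print(violations(nested))
```

In this sketch the first tree yields a single Exhaustivity violation and the second yields two Nonrecursivity violations, while Headedness and Layeredness remain satisfied in both.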

3.2. The Hierarchy of Prosodic Constituents

Prosodic constituents have been defined in several different frameworks: in terms of the domains of phonological rules (e.g., the Utterance
is the domain of application of /r/ epenthesis in British English, Nespor and
Vogel, 1986), in terms of intonation (e.g., an Intermediate Intonational
Phrase in English is the span of a coherent intonation contour that includes
a certain pattern of phonological tonal elements, Beckman and Pierrehumbert, 1986), and in terms of rhythmic prominence (e.g., a Foot in English
consists of a full-vowel syllable followed by the unstressed syllables in the
same word). Some theorists view prosodic constituents as members of a
single hierarchy, and for presentation purposes we will adopt this view (see
Gussenhoven, 1992 for an alternative position.) Recently, studies have appeared which explicitly test the notion that e.g., intonationally-defined constituents are the domain of application of phonological rules; see Jun (1993).
In this section we discuss each proposed constituent in the prosodic
hierarchy, as summarized in Fig. 2. For each constituent, Utterance, Intonational Phrase, Phonological Phrase, Clitic Group, Foot, Syllable, and
Mora, we will present the definitions that have been proposed and some of
the evidence that has been invoked. Evidence that supports the generally
hierarchical structure of spoken utterances, without relying on the detailed
assumptions of a particular theoretical proposal, is reserved for Section 4.1.
3.2.1. The Utterance

The Utterance has been proposed as the largest unit in the prosodic
hierarchy: It is the largest span of application of phonological rules (Selkirk,
1978, 1980; Nespor & Vogel, 1986; Hayes, 1989) and its boundaries are
sometimes said to be the location of non-hesitation pauses (Hayes, 1989).
This unit often corresponds to a single syntactic sentence, but can include
two or more sentences joined into a single higher-level sentence (Selkirk,
1978, 1981). The Utterance is absent from some theories of the prosodic
hierarchy, where the largest constituent is the Full Intonational Phrase. It
may be that the Utterance and the Full Intonational Phrase are in fact different names for the same unit; unfortunately, crucial information about the
intonational contours of Utterance-sized domains is lacking, and cannot be
inferred from text alone.
This problem is illustrated by a consideration of Nespor and Vogel's
(1986) proposal of the Utterance as the domain of flapping in American
English. They cite the following example of two sentences which form a
single Utterance in English (p. 238).
(21) Turn up the heat. I'm freezing.
They argue that the final /t/ in 'heat' can be flapped in (21), but not in (22),
where the two sentences form separate Utterances.
(22) Turn up the heat. I'm Frances.
It is clear that there are two sentences in (21), but it is possible that
flapping occurs when this word string is produced as a single Full IP. Indeed,
Selkirk (1978, 1981) proposed that the Intonational Phrase is the domain of
flapping in American English. This is an example of the difficulties that
arise from arguing about prosodic structures based on written text: We can't
be sure whether the pairs of sentences in (21) or in (22) are produced as a
single Intonational Phrase when the /t/ is flapped, since we can't determine
the intonational prosody from the text alone.
3.2.2. The Intonational Phrase

It is generally agreed that the Intonational Phrase is a prosodic constituent which is intonationally defined, and is the domain of a perceptually
coherent intonational contour, or tune. According to Pierrehumbert (1980),
the Intonational Phrase contains a specified sequence of phonological elements: Nuclear Pitch Accent followed by a Phrase Accent and a Boundary
Tone (additional Prenuclear Pitch Accents are optional). Her theory differs
from previous formulations in several ways, notably in its sparse phonological specifications of the intonationally significant elements of the utterance,
and its restriction to tone level targets of High and Low, with neither Mid
tone targets, nor target movements such as Rise and Fall; apparent rises and
falls are captured by appropriate sequences of High and Low elements (see
below and Section 3.3. for a more complete summary).

Perceptually, the boundaries of an Intonational Phrase are quite clear,
even in cases where there is not a large final F0 excursion, presumably in
part because of preboundary lengthening on the final syllable (Lehiste et al.,
1976; Selkirk, 1984; Ladd & Campbell, 1991; Wightman et al., 1992; Berkovits, 1993, 1993a). The evidence is strong that Intonational Phrases are
not always isomorphic to syntactic phrases; as we saw in Section 2, the
same syntactic structure can be parsed into Intonational Phrases in several
different ways. However, there do appear to be some syntactic constraints
on Intonational Phrases. For example, parentheticals, unrestrictive relative
clauses, preposed adverbials, tag questions, expletives, vocatives, and certain
moved elements are thought to obligatorily form their own Intonational
Phrases (see Selkirk, 1978, 1981; Nespor & Vogel, 1986 and references; for
an example see (5) above). In addition to syntactic constraints, other factors
such as length of the utterance and speaking rate as well as semantic, pragmatic, and stylistic concerns may influence speakers' decisions about how
to divide an utterance into Intonational Phrases (Gee and Grosjean, 1983;
Selkirk, 1984; Nespor and Vogel, 1986; Jun, 1993).
Several authors have proposed that Intonational Phrases can be either
subdivided or combined into other intonational constituents. Ladd (1986)
raises the possibility of recursive Intonational Phrases. Beckman and Pierrehumbert (1986), on the other hand, in an expansion of Pierrehumbert's
(1980) theory, propose just two separate levels of Intonational Phrases: A
Full Intonational Phrase is parsed exhaustively into one or more Intermediate
Intonational Phrases. In their view, the Intermediate Intonational Phrase is
defined as the domain of a coherent intonational contour that includes at
least a Nuclear Pitch Accent (Prenuclear Accents are optional), and a Phrase
Tone which describes or controls the F0 contour from the Nuclear Pitch
Accent to the end of the phrase. Since a Full Intonational Phrase contains
one or more Intermediate Intonational Phrases, it therefore contains (like
Pierrehumbert's (1980) undifferentiated Intonational Phrase) at least a Nuclear Pitch Accent, a Phrase Accent (which does not itself convey prominence, but may intensify the salience of the Nuclear Pitch Accent which
precedes it), and an obligatory Boundary Tone associated with the right edge
of the phrase; a left-edge boundary tone is optional.
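
The tonal makeup just described can be read as a small sequencing grammar. The Python sketch below is a hedged illustration, not part of Beckman and Pierrehumbert's proposal: the one-letter codes (A for a Pitch Accent, P for a Phrase Accent, B for a Boundary Tone) and the function names are hypothetical, and tone values such as High and Low are deliberately left out. It checks only whether a sequence of tonal events has the shape of a Full Intonational Phrase, i.e., an optional left-edge Boundary Tone, one or more Intermediate Intonational Phrases, and an obligatory right-edge Boundary Tone.

```python
import re

# Hypothetical codes: A = Pitch Accent, P = Phrase Accent, B = Boundary Tone.
# An Intermediate Intonational Phrase is one or more accents (the last one
# being the Nuclear Pitch Accent) followed by a Phrase Accent.
INTERMEDIATE = r"A+P"
FULL_IP = re.compile(rf"B?(?:{INTERMEDIATE})+B")


def is_full_ip(tones):
    """True if the tonal events form a well-shaped Full Intonational Phrase."""
    return FULL_IP.fullmatch("".join(tones)) is not None


def nuclear_accent_positions(tones):
    """Indices of the accent immediately preceding each Phrase Accent."""
    return [i - 1 for i, t in enumerate(tones)
            if t == "P" and i > 0 and tones[i - 1] == "A"]


one_phrase = ["A", "A", "P", "B"]             # prenuclear + nuclear accent
two_phrases = ["A", "A", "P", "A", "P", "B"]  # two Intermediate Phrases
print(is_full_ip(one_phrase), is_full_ip(two_phrases))
print(is_full_ip(["A", "A", "B"]))            # no Phrase Accent
print(nuclear_accent_positions(two_phrases))
```

On this encoding a sequence that lacks a Phrase Accent or a final Boundary Tone is rejected, which mirrors the claim that both are obligatory in a Full Intonational Phrase.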
Unfortunately, for many of the examples of Intonational Phrases discussed in the literature it is impossible to know whether they are Full Intonational Phrases, or Intermediate Intonational Phrases: Cited examples are
usually printed and it is impossible to infer the prosody directly from the
text. However, a few studies have reported results for intonationally transcribed speech that captures this distinction. The Intermediate Intonational
Phrase corresponds to the domain of catathesis (also called downstep, i.e.,
F0 range lowering after an accented syllable) in Japanese and English (Beckman & Pierrehumbert, 1986), and its left edge has been described as the
locus for Early Accent Placement within the word in American English
(Horne, 1990; Shattuck-Hufnagel, 1988, 1992a, 1995; Shattuck-Hufnagel et
al., 1994) as well as for glottalization of vowel-onset words (Pierrehumbert
& Talkin, 1992; Dilley et al., 1994; Dilley and Shattuck-Hufnagel, 1995).
The Full Intonational Phrase permits glottalization of vowel-onset words at
its left edge, while the Intermediate Intonational Phrase does not (Dilley et
al., under revision).
A number of other theories define the Intonational Phrase in other ways,
including the British approach (associated with O'Connor and Arnold and
with Halliday) which defines the phrase in terms of the structural units of
head, nucleus and tail, and the Dutch approach developed by 't Hart, Collier
and colleagues which specifies a small set of rises, falls and level tones for
each language, some of which are prominence-lending. For further discussion of these views, see O'Connor and Arnold (1961), Halliday (1967),
Cruttenden (1986) especially Chapter 3, and 't Hart and Collier (1975) summarized in 't Hart et al. (1990). Additional systems have been developed
by Pike (1945), Chafe (1980) and others.
3.2.3. The Phonological Phrase

Full Intonational Phrases can be parsed into one or more Phonological
Phrases which appear to be tightly constrained by the syntax. As a result,
they have often been defined in syntactic terms, although precise syntactic
definitions differ between theorists (cf. Bickmore, 1990; Kisseberth &
Abasheikh, 1974; Kaisse, 1985; Kaisse & Zwicky, 1987 and others in Phonology Yearbook 4; Hayes, 1989; Inkelas & Zec, 1990; Selkirk, 1978, 1986;
Nespor & Vogel, 1986). Phonological Phrases also appear to be constrained
by non-syntactic factors (Cheng, 1973; Shih, 1986 for Mandarin; Silva,
1989; and Jun, 1993, for Korean), although the influence of non-syntactic
factors is less well understood.
Selkirk (1986) and Selkirk & Tateishi (1991), following McCawley
(1968), proposed that there are two types of Phonological Phrase, the Major
Phrase, and the Minor Phrase. A sentence parsed into both Major and Minor
Phrases is shown below (based on an example in Selkirk, to appear:20).

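(23) [Tree diagram only partly recoverable. The surviving labels are the tier names Maj Phr and Min Phr, the syntactic nodes Det and NP, and the bracketed word string below.]
[[Blue]AP [aphids]N]NP [[completely]AP [covered]V [[the]Det [new]AP [buds]N]NP.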

Selkirk's Major Phrase is constrained to align with either the left or
right edge (right edge in English) of a non-lexically governed syntactic maximal projection (Hale & Selkirk, 1987; Selkirk, to appear). In phrase structural syntactic theories, a maximal projection of a particular head (lexical
item) is the highest syntactic node of the same grammatical category as the
head which it dominates. (For example, in the phrase 'blue aphids', the maximal projection of the head
noun 'aphids' is the Noun Phrase (NP) 'blue aphids'). Roughly, a sister to
an adjacent head is said to be governed by that head (see Haegeman, 1993:
125 for a more precise definition of 'government'). For example, in (23),
the adjective phrase 'blue' is lexically governed since the adjacent head noun
'aphids' is its sister (i.e., both are immediately dominated by the same syntactic node). The right edge of a Major Phrase thus does not align with the
right edge of 'blue.' The subject NP, on the other hand, is not lexically
governed, and thus the right edge of the Major Phrase does align with the
right edge of this NP.
Selkirk (to appear) proposes a phonological argument for this constituent, arguing that function words in English surface in their strong form
(i.e., are not reduced) when they are Major Phrase-final, e.g.:
(24) The brook you will look at | bubbles
(taken from Selkirk (to appear: 21)), where the Major Phrase-final function word
cannot be reduced.
[Tree diagram not recoverable; the only surviving terminal is 'look'.]

Whereas Selkirk's Major Phrases are proposed to align with the edges
of entire maximal projections which are not lexically governed, her Minor
Phrase is proposed to align with either the left or right edge of heads of
maximal projections which are not lexically governed. The Minor Phrase
"groups together a phrasal head and adjacent modifiers and functional elements, as in Det Adj Noun sequences in Italian, English, French, Modern
Greek (see Nespor & Vogel, 1986; Selkirk, 1986)" (Selkirk, to appear: 20).
As shown in the above example, the Minor Phrase in English aligns with
the right edge of heads of non-lexically governed maximal projections.
In other theories (Selkirk, 1978; Nespor & Vogel, 1986; and Hayes,
1989), only a single Phonological Phrase is proposed. In some cases, the
proposed Phonological Phrase corresponds most closely with Selkirk's Major Phrase, and in other cases, the Phonological Phrase corresponds more
closely to the Minor Phrase. It remains to be seen whether both a large and
a small Phonological Phrase are used in all languages.
A correspondence between the intonationally defined Intermediate Intonational Phrase (Beckman & Pierrehumbert, 1986) and a syntactically-constrained constituent such as the Major Phrase has been suggested for
Japanese (Selkirk and Tateishi, 1988, 1991), Bengali (Hayes and Lahiri,
1991) and Korean (Jun, 1993). Selkirk and Tateishi (1988, 1991) found that
this domain coincides with the left edge of a maximal projection (which
aligns with a Major Phrase boundary). Hayes and Lahiri (1991) and Jun
(1993) suggest a correspondence between the intonational phrase and the
domain of segmental phonological phenomena which occur in constituents
roughly the size of the syntactically constrained Major Phrase.
Selkirk and Tateishi (1988, 1991) and Selkirk (to appear) suggest that
the Minor Phrase corresponds to the Accentual Phrase in Japanese (a unit
equal in size to or larger than a prosodic word in which no more than one
pitch accent occurs, Beckman and Pierrehumbert, 1986). Jun (1993) presents
evidence consistent with this view for Korean. In English it is less likely
that the Minor Phrase can be defined intonationally; it may be the case that
different types of constraints/definitions are appropriate for similar-sized
units in different languages.
3.2.4. Clitic Group
The Clitic Group contains at most one content word with (optionally)
adjacent monosyllabic function words (clitics). There is ample evidence that
monosyllabic function words (including prepositions, determiners, pronouns,
auxiliaries, modals, complementizers, and conjunctions) are realized quite
differently from content words (nouns, verbs, adjectives, some adverbs) in
continuous speech. In particular, while content words always bear some type
of lexical stress, function words often, but not always, surface in a reduced
form which is phonologically related to the full form: 'is' can surface as
'[əz]'; 'him' can surface as '[əm]'; 'have' can surface as '[v]', etc. (see
Selkirk, 1984; Inkelas and Zec, 1993). In addition, these function words
often appear to be closely grouped with an adjacent content word. For example, Grosjean et al. (1979) and Gee and Grosjean (1983) found that function words are often grouped together with an adjacent content word. That
is, talkers produced a much shorter pause between they and offered in They
offered ... than between John and asked in John asked ... The Clitic Group
is proposed to account for this close linking and potential for reduction.
The close grouping of function words with adjacent content words has
been described in several different ways. In Selkirk (to appear) and Inkelas
and Zec (1993), the Phonological Phrase (Minor Phrase in Selkirk's theory)
and Prosodic Word are used to group function words with either following
or preceding content words (see Section 3.2.3 and 3.2.5). In Hayes (1989)
and Nespor and Vogel (1986), on the other hand, a function word is always
grouped into a Clitic Group with an adjacent content word. Theories differ
as to the exact definition of the term 'clitic' and as to whether function
words group with following or preceding content words (see discussions in
Nespor and Vogel, 1986; Hayes, 1989; and Inkelas & Zec, 1993).
Selkirk's Minor Phrase provides some of the functions of the Clitic
Group proposed by Nespor and Vogel (1986) and Hayes (1989). As described in the following section, the Clitic Group contains at most one content word with (optionally) adjacent function words. According to Selkirk
(to appear: 19), the Minor Phrase can also serve to group function words
with (an) adjacent content word(s), as shown in (25):


(25) [Tree diagram not recoverable; the surviving labels are two Syl nodes over 'sto-ry'.]


The function word's lack of Prosodic Word status accounts for the fact
that a function word in this position is reduced: It is not the head of a foot,
and thus cannot be stressed.
Hayes presents phonological arguments for this constituent, citing English rules of /v/-deletion and palatalization (from Selkirk, 1972) as rules
which operate within the Clitic Group domain, e.g.:
(26) Will you {save me} a seat?
(27) {Is Sheila} coming?
In Hayes' view, the /v/ in {save me} is deleted and the /z/ in {is Sheila}
can be palatalized, since {save me} and {Is Sheila} are Clitic Groups. In the
following examples, /v/-deletion does not apply because {save} and {Mom}
are two separate Clitic Groups, and in {Carla's} {shower}, the /z/ is not
palatalized, since {Carla's} and {shower} are separate Clitic Groups.
(28) {Save} {Mom}
(29) {Carla's} {shower}
Intuitions may differ on the palatalizability of this /z/; again, arguments from
text are not compelling, if this phrase can be produced optionally as one or
two prosodic constituents, with the /z/ palatalized only when it is produced
as one.
3.2.5. The Prosodic Word

While Hayes (1989), Selkirk (1978, to appear), Beckman and Pierrehumbert (1986), and Nespor and Vogel (1986) all include the Prosodic Word
in their respective hierarchies, definitions for this constituent differ in several
key respects. Namely, they differ in how function and content words are
parsed into Prosodic Words, and also in how different types of morphemes
are parsed into Prosodic Words.
Prosodic Word (a): Function vs. Content Word Distinction

In Hayes' (1989) and Nespor and Vogel's (1986) account, function
words and content words are both parsed into prosodic words. Function
words are Prosodic Words which form Clitic Groups with adjacent content
words. In Selkirk's (to appear) and Inkelas and Zec's (1993) accounts, on
the other hand, content words, but not function words, are parsed into Prosodic Words at the output of the lexical component of the grammar. According to Selkirk (to appear), function words which procliticize are attached
to Minor Phrases during the postlexical component of the grammar (example
25); unstressed function words which encliticize at the postlexical level adjoin to a Super Prosodic Word which dominates the function word and the
Prosodic Word dominating the preceding content word (as in the phrase
need him, where the him is reduced):



(30) [Tree diagram not recoverable; surviving terminal string: ... need him ...]

Note that when the him is not reduced, it has a different structure, like
the one in (31). When function words are Major-Phrase-final in English (and
are not what Inkelas and Zec (1993) would call 'clitics'), they form Prosodic
Words on their own and as a result surface in strong form (e.g., That's the
one to look for):



(31) [Tree diagram not recoverable; surviving terminal string: ... look for ...]

Prosodic Word (b): Sublexical Components

There is much disagreement as to when Prosodic Words can be smaller
than orthographic words, i.e., when they can correspond to stems or morphemes contained within an orthographic word. Nespor and Vogel (1986),
Hayes (1989), Selkirk (to appear), Inkelas (1989), and McCarthy (1993)
discuss this issue.
Some evidence from work based on the traditional morphosyntactic
lexical word can be reinterpreted as edge effect evidence for the Prosodic
Word. For example, Cooper (1991) showed that word-initial stops show
longer acoustic closure duration and longer aspiration duration (VOT) than
word-medial stops in the same lexical stress environment. Similarly, Krakow
(1989) showed different patterns of labial and velic articulation for word-initial vs. word-medial /m/; the difference in articulatory kinematics for
word-initial vs. word-medial nasals was quantitative, rather than qualitative,
when the consonants were both in the same position in their respective
syllables (i.e. syllable-initial).
Although a range of constituents in the hierarchy have been shown to
exhibit pre-boundary lengthening (Ladd and Campbell, 1991; Wightman et
al., 1992), it is still unclear whether a Prosodic Word boundary per se induces such lengthening.
3.2.6. The Foot

The term 'Foot' has been used in the literature to refer to two distinct
types of units. The Foot referred to most often by generative phonologists
is what we will call the Within-Word Foot, sometimes called the Rhythmic
Stress Foot. In English, this Foot contains at most one lexically-stressed
syllable, followed by zero or one (or in some formulations two) reduced
syllables, and may not extend beyond a lexical word boundary. In studies
of the role of the Foot in determining lexical stress in the word (Kiparsky,
1979; Hayes, 1981; Halle and Vergnaud, 1987), the 'word' referred to is the
lexical word, and so this constituent might be called a Within-Lexical-Word
Foot.³
The Within-Word-Foot differs from the Abercrombian, Cross-Word-Boundary Foot (Abercrombie 1965, 1973), which can combine fragments
of adjacent lexical words and is thought to extend in most cases from one
pitch-accented syllable to just before the next one. For example, Abercrombie (1973: 11) gives:
(32) | Know then thy- | -self, pre- | -sume not | God to | scan | ^ |,
where the Foot represented by | -self, pre- | crosses at least a lexical word
boundary, if not other constituent boundaries as well.
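
The grouping principle behind the Cross-Word-Boundary Foot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, each syllable is supplied by hand as a (text, word index, accented) triple, and the accent marks for the line in (32) are inferred here from Abercrombie's foot boundaries rather than taken from an acoustic analysis. Any unaccented syllables before the first accent are simply collected into an initial group.

```python
def cross_word_feet(syllables):
    """Group (text, word_index, accented) triples into Abercrombian feet.

    A new foot opens at every accented syllable and runs up to, but not
    including, the next accented syllable.
    """
    feet = []
    for syllable in syllables:
        accented = syllable[2]
        if accented or not feet:
            feet.append([])
        feet[-1].append(syllable)
    return feet


def crosses_word_boundary(foot):
    """True if the foot contains syllables from more than one lexical word."""
    return len({word_index for _, word_index, _ in foot}) > 1


# Accent marks below are an assumption inferred from the bars in (32).
line = [
    ("Know", 0, True), ("then", 1, False), ("thy-", 2, False),
    ("-self", 2, True), ("pre-", 3, False),
    ("-sume", 3, True), ("not", 4, False),
    ("God", 5, True), ("to", 6, False),
    ("scan", 7, True),
]

for foot in cross_word_feet(line):
    text = " ".join(syl for syl, _, _ in foot)
    print(f"| {text} |  crosses word boundary: {crosses_word_boundary(foot)}")
```

A Within-Word Foot, by contrast, would never return True from crosses_word_boundary, since by definition it may not extend beyond a lexical word boundary.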

3 Little work has been done to investigate the status of a possible Within-Prosodic-Word
Foot, which might incorporate portions of several lexical words as in e.g.
[Tree diagram not recoverable; surviving labels: WPW Ft, WPW Ft.]



A number of studies investigate the hypothesis that the Foot plays an
active role in the processing of spoken language. For example, Turk and
Sawusch (1995) report that a Pitch Accent lengthens the duration of the
Within-Word Foot in English, Scott (1982) reports that duration lengthening
of the foot signals a prosodic boundary within the Cross-Word-Boundary
Foot in English, and Fant et al. (1991) describe a relationship between
boundary-related lengthening and the mean duration of preceding Cross-Word-Boundary Feet.
3.2.7. The Syllable

The Syllable is the unit of which Within-Word Feet are composed.
Syllables consist of a vocalic nucleus (or syllabic sonorant) along with associated preceding (onset) and/or following (coda) consonants. Discussions
of syllabification principles can be found in Pulgram (1970), Kahn (1976),
Clements and Keyser (1983), and Kenstowicz (1994), among others. Although the Syllable has regained a place in phonology after a period of
neglect, it is interesting to consider the degree to which the phenomena that
are currently accounted for in terms of the Syllable might be as well or
better described in terms of other prosodic constituents like the Foot (see
e.g. Shattuck-Hufnagel, 1992).
3.2.8. The Mora

The mora is a unit smaller than the syllable whose existence is uncontroversial in languages such as Japanese, but somewhat more controversial
in other languages such as English. It is dominated by the syllable and
dominates a segment (either a vowel or consonant) or segments. A syllable
contains at least one mora and normally contains no more than two. For
example, in Japanese, a commonly cited moraic language, C0V and V syllables are considered monomoraic, whereas C0VN (N = nasal consonant)
and C0VQ (Q = the first member of a geminate consonant) syllables are
considered bimoraic (McCawley, 1968, Otake et al., 1993).
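
The mora counts for these Japanese syllable shapes can be stated as a short Python sketch. It is an illustration only: the skeleton notation (C for an onset consonant, V for a vowel, N for a moraic nasal, Q for the first half of a geminate) follows the text, but the function name and the regular expression are hypothetical, and syllable types not mentioned above, such as those with long vowels, are deliberately rejected rather than guessed at.

```python
import re

# Zero or more onset consonants, one vowel, then an optional N or Q.
# Only the four shapes named in the text are covered (an assumption).
SKELETON = re.compile(r"C*V([NQ]?)")


def mora_count(skeleton):
    """Return 1 for C0V and V syllables, 2 for C0VN and C0VQ syllables."""
    match = SKELETON.fullmatch(skeleton)
    if match is None:
        raise ValueError(f"unsupported syllable skeleton: {skeleton!r}")
    # The onset never contributes a mora; a final N or Q adds the second one.
    return 2 if match.group(1) else 1


for shape in ["V", "CV", "CVN", "CVQ"]:
    print(shape, mora_count(shape))

# 'Nippon' as two syllables, CVQ followed by CVN.
print(sum(mora_count(s) for s in ["CVQ", "CVN"]))
```

Because the onset is matched but never counted, the sketch also reflects the observation that moraic weight depends on the rime and not on the onset.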




In many languages (e.g. Latin, English, Tübatulabal, Ojibwa, and others
cited in Halle and Vergnaud, 1987), the location of stress placement is sensitive to the internal composition of syllables. In these languages, called
quantity sensitive, syllables which contain long vowels are considered heavy
and tend to attract stress.⁴ In some languages, syllables which contain VC
rimes are also considered heavy, whereas in other languages, syllables with
VC rimes are considered light. Hyman (1985, as cited in Kenstowicz, 1994)
argues that light syllables are monomoraic, while heavy syllables are bimoraic. The mono- vs. bi-moraic nature of the syllable thus depends on
properties of the syllable rime, and does not appear to be sensitive to properties of the syllable onset. We refer the reader to McCawley (1968), Port
et al. (1987), Katada (1990) and Otake et al. (1993) for discussions of and
evidence for this unit, and to Hayes (1989a) for a theory which implicates
the mora in explanations of phonemic vowel length distinctions (quantity
sensitivity) in various languages not traditionally thought to make use of
this constituent.
Pierrehumbert and Beckman (1988) report that certain aspects of F0
slope in a Japanese Accentual Phrase are negatively correlated with the
number of Morae the phrase contains.
This review of the constituents proposed in several prosodic hierarchies
suggests that (a) there are substantial disagreements about the nature and
definition of the constituents, particularly at levels between the syllable and
the intonational phrase, and (b) there is also a modicum of common ground,
particularly at the intonational and sublexical levels. There may also be
differences across languages: It is unclear to what extent speakers of different languages make use of all the units in the hierarchy. Despite the lack of
universal agreement on the nature of the constituents in the hierarchy and
on whether all prosodic constituents belong in the same hierarchy, a number
of investigators have begun to test the hypothesis that one or another of the
proposed constituents plays a role in the representations that guide human
language behavior. We will summarize some of their findings in Section 4.
First, however, we turn to the second of the two major aspects of prosodic
structure: the pattern of relative prominences of an utterance. In addition to
proposals for the hierarchy of constituents, the prosodic literature contains
a number of proposals for the hierarchy of prominences, which we review
in the following section.

⁴ Light syllables may be stressed or unstressed, in conformity with the stress pattern of the language.

3.3. The Hierarchical Organization of Prosodic Prominences

The phonology and phonetics of prosodic prominence have received
disproportionately little attention in comparison to those of prosodic constituents, perhaps because interest in syntactic effects on prosody has been
so strong and syntax lends itself more naturally to discussion of constituents
than to consideration of prominence patterns. We will discuss some of the
theoretical treatments of prominence that have appeared. Because much of
this work deals with intonationally-marked sentence-level prominences
known as Pitch Accents, we also briefly review in this section several theories of intonation.

The Hierarchy of Prominences

The development of theories of prosodic prominence over the past 20
years can be seen as the gradual emergence of claims of greater and greater
specificity about the hierarchical nature of prominence, the factors that govern prominence assignment and the acoustic dimensions that convey it. The
generative approach to relative prominence developed in the 1960s regarded
stress as a unitary dimension, with greater levels of stress assigned to the
nuclear element of higher-level constituents, and corresponding reduction in
the stress specifications of lower-level constituents when they are combined
into larger structures. Critics pointed out that, on this view, the unbounded
nature of syntactic structures leads to an infinite number of putatively distinguishable degrees of stress, for which there is little perceptual evidence
(Lieberman, 1960).
Early generative views of prosodic prominence as stress were expressed
in tree representations, but in the 1970s and '80s, Liberman and Prince
(1977) and Prince (1983) explored a specific notational device for representing relative prominence, which they called the grid. The grid consists
of cells organized into rows and columns, with one column over each syllable and one row for each level of prominence represented in the utterance;
the number of marked cells in the column above each syllable expresses the
prominence of that syllable relative to other syllables in the grid. This approach was explored further in work by Hayes (1983) and by Selkirk (1984),
which did not incorporate information about the constituent structure of the
serially-ordered string but only about the relative prominence patterns of its
syllables. The grid provided a useful framework for thinking about rhythmic,
semantic, and other factors which can change the relative prominence structure of words and phrases when they occur in larger contexts as opposed to
citation forms. As in earlier approaches, relative prominence in the grid was
a unitary dimension: syllables associated with higher marked levels in their
grid columns have more prominence than syllables associated with lower
marked levels.
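The grid notation described above lends itself to a small worked example. The sketch below is only an illustration of the row-and-column idea, not a reproduction of any published grid: the function names are hypothetical, and the column heights given for Massachusetts (secondary stress on the first syllable, main stress on the third) are assumed values chosen for the demonstration.

```python
def render_grid(syllables, heights):
    """Draw a metrical grid: one column per syllable, one row per level.

    heights[i] is the number of marked cells ("x") in the column above
    syllables[i]; a taller column means greater relative prominence.
    """
    if len(syllables) != len(heights):
        raise ValueError("need exactly one column height per syllable")
    width = max(len(s) for s in syllables)
    rows = []
    # Build rows from the highest level down to level 1.
    for level in range(max(heights), 0, -1):
        cells = [("x" if h >= level else "").center(width) for h in heights]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(s.center(width) for s in syllables).rstrip())
    return "\n".join(rows)


def most_prominent(syllables, heights):
    """Return the syllable with the tallest grid column."""
    return syllables[heights.index(max(heights))]


# Assumed (illustrative) syllabification and column heights.
syllables = ["Mas", "sa", "chu", "setts"]
heights = [2, 1, 3, 1]

print(render_grid(syllables, heights))
print("most prominent:", most_prominent(syllables, heights))
```

Because the representation stores only column heights, it records relative prominence without any constituent structure, which is the property of the grid noted in the text.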
An alternative approach was suggested by Bolinger (1958, 1965, 1981),
who postulated different types or levels of prominence, rather than simply
greater or lesser degrees of it. He suggested that there were only two kinds
of prominence contrasts: on the one hand, full vowels vs. reduced vowels,
and on the other, pitch-accented full vowels vs. non-pitch-accented full vowels (where a Pitch Accent is an intonationally-cued phrase-level prominence). In his view, only full vowels are candidates for the phrase-level
prominence conveyed by Pitch Accents, and the placement of Pitch Accents
on certain of the full-vowel syllables of an Intonational Phrase is governed
primarily by the speaker's tendency to place one accent as early as possible
in the phrase and another as late as possible. This two-accents-per-phrase
view is embodied most persuasively for American English by the typical
intonation contour of short simple declarative statements like The sky is blue.
This contour is also described in terms of the F0 'hat pattern' in the IPO
system developed for Dutch (see 't Hart et al., 1990) and extended to American English by Maeda (1974). Bolinger's view suggests that primary and
secondary word stress differ not in degree or type of articulatory or acoustic
prominence, but in the instructions they provide for the placement of pitch
accent. That is, in English, a phrase-final accent occurs on the main-stress
full vowel of its word, with phrase-initial accent possible on a secondary
stress syllable that precedes the main stress, in words like antique or Massachusetts.
Bolinger's view is compatible with the four-level prominence system
described by Vanderslice and Ladefoged (1972) and Ladefoged (1975). Ladefoged contrasted (a) Nuclear Pitch Accented syllables, (b) Accented
Stressed syllables, (c) non-accented Stressed syllables and (d) Reduced syllables. The major difference is that Ladefoged's 'Accented' syllables were
not defined intonationally, and so his analysis did not provide a specific
account of Prenuclear Pitch Accents that occur before the intonationally-defined Nuclear Pitch Accent of an Intonational Phrase.
Evidence for Elements in the Prosodic Prominence Hierarchy
A critical prediction made by recent theories of the prosodic prominence hierarchy is that different types or levels of prominence are signalled
by different dominant acoustic cues. Beckman and Edwards (1990, 1994)
propose that Stressed syllables are distinguished from Reduced syllables by
quality, duration, and possibly amplitude, while Nuclear Pitch Accented syllables are distinguished from non-accented Full-Vowel syllables by an F0
marker. Stevens (1994) has reported that, for speakers of American English,
the glottal excitation waveform differs for the three categories of Nuclear
Accented, post-nuclear Full-Vowel (i.e., non-accented) and Reduced syllables. This result can be interpreted as support for the claim that these three
types of prominence correspond to different levels in the hierarchy. Sluijter
(1995) presents evidence that speakers of Dutch and American English distinguish between Pitch-Accented (i.e. focussed) and non-accented Full-Vowel syllables via F0 levels. More surprisingly, from the point of view of
theories that predict only a 4-way prominence contrast, she reports that
speakers of both languages distinguish primary from secondary stressed
vowels even in non-Pitch-Accented contexts, via the relative level of energy
at high frequencies. In general, the results of quantitative studies support the
view that prosodic prominence is not a single parameter, but that there are
different types or levels of prosodic prominence, each associated with a different
dominant acoustic cue or set of cues.
Another line of evidence distinguishing types or levels in the prosodic
prominence hierarchy is the finding of Shattuck-Hufnagel et al. (1994) that
Nuclear Pitch Accents almost invariably occur on the lexically-main-stressed
syllable of their word. In contrast (as predicted by several theories), Prenuclear Pitch Accents may occur on a Full-Vowel syllable earlier in the word,
as in such phrases as:
(34) the MASsachusetts legisLAtion
where upper case indicates a Pitch Accented syllable. Beckman and colleagues (1990) present similar results for phrases like
(35) CHInese anTIQUES.
Apparently, constraints on the within-word placement of Nuclear Pitch Accents differ from those on Prenuclear Accents, further supporting the theoretical distinction between these two types or levels of prominence.
3.4. Linking the Prominence and Constituent Hierarchies

In the spirit of Halle and Vergnaud (1987), Beckman and Edwards
(1990, 1994) suggest a connection between the type of prominence and the
type of prosodic constituent. They suggest that each type of prominence is
associated with a different constituent in the prosodic hierarchy, by virtue
of serving as its head, or most prominent element. For example, the head
of an Intermediate Intonational Phrase is the Nuclear Pitch Accented syllable, and the head of the Within-Word (Rhythmic Stress) Foot is the full-vowel syllable (i.e. an unreduced syllable that bears some degree of lexical
stress). To illustrate: when the word discombobulation is uttered as an Intonational Phrase, the syllable -la- may carry the Nuclear Pitch Accent of
the Intonational Phrase, and serve as its head; all three stressed syllables
dis-, -bob-, and -la- serve as the heads of their respective Within-Word
Feet, and the two stressed syllables dis- and -bob- may or may not carry
Prenuclear Pitch Accents.


This theory draws much of its power from its integration of the hierarchy of constituents with the hierarchy of relative prominences. In addition,
Beckman and Edwards provide evidence from articulatory studies to support
some of their claims (Edwards et al., 1991). However, the proposal leaves
several levels in the prosodic constituent hierarchy without well-defined
heads, and suggests no specific constituent for which Prenuclear Pitch Accents could serve as heads, at least in American English.
One final point about intonational prominence: the term 'stress' has
often been used to refer to prominence at any level in the hierarchy, e.g. to
both lexical stress and phrasal prominence. Since recent investigations suggest that the dominant acoustic cues to lexical and phrasal prominence are
different, it is useful to define the term 'stress' whenever it is used, and to
employ the term 'prominence' to refer to the generic quality shared by all levels of the hierarchy.
In Section 3 we have laid out several theoretical proposals for prosodic
constituent and prominence hierarchies, which provide an alternative description for the structural organization of utterances, beyond traditional
morphosyntactic structure. In addition, we have sampled the kinds of evidence that have been used to argue for each prosodic component. We turn
now to more general evidence for the relevance of prosodic structure in
speech production and perception.

4. FURTHER EVIDENCE FOR PROSODIC STRUCTURE

We begin with a summary of some of the methods that have been used
to evaluate prosodic theories (4.1.), before turning to examples of evidence
for the generally hierarchical nature of the organization of spoken utterances
(4.2.), and of explicit empirical comparisons between syntactic and prosodic
accounts of utterance organization (4.3.).
4.1. Types of Evidence for Prosodic Elements

As we have seen, investigators have presented evidence for specific
prosodic constituents from phonological observations, and from acoustic-phonetic measurements. Phonological evidence is of various types; for example, Selkirk (1980), Nespor and Vogel (1986) and Hayes (1989)
distinguish phonological rules which span a particular prosodic domain (Domain Span rules) and rules which apply at prosodic domain edges (Domain
Juncture and Domain Limit rules); see also Jun (1993). Quantitative acoustic-phonetic aspects of spoken utterances can also be described in terms of
domain edges and domain spans. For example, Pierrehumbert and Talkin
(1992) and Dilley et al. (1994) looked at vowel onset glottalization at the
left edge of intonational constituents, Ladd and Campbell (1991) looked at
preboundary lengthening at the right edge of intonational constituents, and
Turk and Sawusch (1995) examined the domain of lengthening associated
with pitch accents.
Evidence for the psychological reality of prosodic constituents also
comes from studies of perception, memory, and other aspects of language
behavior. Several laboratories have employed unit monitoring tasks to test
the hypothesis that the initial perceptual organization of spoken utterances
occurs in terms of prosodic constituents, and that these constituents may
differ across languages (Mehler et al., 1981; Cutler et al., 1986; and Otake
et al., 1993). Another type of evidence comes from listeners' preferences
for interruption points at constituent boundaries rather than within constituents. Several studies suggest that listeners find passages which have pauses
artificially inserted within constituents unnatural (Pilon, 1981; Wakefield et al., 1974). Gerken et al. (1994) suggest that the units whose integrity infants prefer may be prosodic constituents, although a critical
comparison of infant preferences for uninterrupted syntactic vs. prosodic
constituents has not yet been reported. Finally, the method of unit extraction
has been employed, in which the infant is familiarized with a list of spoken
words and then presented with a story that either does or does not contain
those words. The degree to which the infant prefers to listen to the story
containing the familiar words is taken as a measure of the ability to extract
those words as units (Jusczyk & Aslin, 1995). Work is in progress to determine whether the infant is extracting morphosyntactic or prosodic units.
The evaluation of phonological, acoustic-phonetic and other behavioral
evidence that bears on the psychological reality of the constituents of the
prosodic hierarchy is still in its infancy. However, taken together, currently available evidence provides considerable support for the claim that speakers
make active use of prosodic elements in the production of spoken utterances,
and that systematic variations in the phonetic realization of phonemic segments and features depend at least in part on prosodic structure. What is
the evidence that these individual prosodic constituents are organized into a
hierarchical structure, and that this hierarchy provides a better account of
the observable facts of continuous speech than does syntax?
4.2. Evidence for the Generally Hierarchical Nature of the Organizational Structure of Spoken Utterances

A variety of evidence supports the hypothesis that the organizational
structure of spoken utterances is hierarchical. For example, several studies
of prosodically labelled speech databases show that the amount of pre-boundary lengthening increases for final boundaries at increasingly higher-level prosodic constituents (Wightman et al., 1992; Ladd and Campbell,
1991). Gussenhoven and Rietveld (1992) showed that listeners are sensitive
to this variation. Other studies, focusing on constituent-initial rather than
-final phenomena, show that the articulation of constituent-initial consonants
is in some sense stronger than in other positions, and that this marker increases for initial segments at increasingly higher-level prosodic constituents
(Cooper, 1994; Dilley et al., 1994 and under revision; Krakow, 1989; Fougeron
and Keating, 1995 for English; Fougeron, 1996 for French). Still other studies provide evidence for the distinction between two adjacent levels in the
constituent hierarchy, such as Dilley et al. (1994 and under revision) for the
Full vs. Intermediate Intonational Phrase in English, and Jun (1993) for the
Accentual Phrase vs. Intonational Phrase in Korean. Taken together, these
results provide strong evidence for a hierarchical organizational structure in
spoken utterances, and provide suggestive evidence that this structure corresponds to that in the prosodic hierarchy. The evidence would be more persuasive, however, if it came from direct comparisons of the predictions made
by prosodic and syntactic structure. In the next section, we describe several
studies which provide just such comparisons.
4.3. Prosodic vs. Syntactic Structure as the Organizational Principle of Spoken Utterances

A number of studies reviewed above support the notion that the organizational structures underlying spoken utterances correspond to prosodic
rather than syntactic hierarchies. The most compelling evidence for this
claim, however, would be provided by studies that directly contrast the predictions of prosodic and syntactic structures for the same utterances. Only
a few investigations have taken this tack. Gee and Grosjean (1983) compared
their performance structures with syntactic hierarchies as predictors of pausing behavior in the same utterances. As noted above, several key differences
emerged between syntactically motivated predictions and the prosodic facts.
In the performance structures, function words (e.g. pronoun subjects) were
separated from content words by only the shortest pauses, although a syntactic major boundary divided them. Also, pauses tended to bisect sentences
into approximately equal halves: performance structures thus appeared to be
symmetrical and balanced, whereas balance is not a requirement for syntactic
structures. Gee and Grosjean proposed that these performance structures reflect prosodic, rather than syntactic, structure.
Using a different measure, Ferreira (1991) also directly compared the
predictions from prosodic and syntactic structures. Final or pre-boundary
lengthening has traditionally been thought to occur at the right edge of
syntactic constituents (Klatt, 1976; Cooper and Paccia-Cooper, 1980). Ferreira presents evidence showing that a hierarchy of syntactic constituents
cannot predict the amount of pre-boundary lengthening in a set of controlled
stimuli, and that Selkirk's version of the prosodic hierarchy appears to do a
better job. Ferreira analyzed sets of phrases like the last three in (36) (a)
and (b), which have different numbers of syntactic boundaries but the same
number of prosodic boundaries after the word cop. (See Section 3 for definitions of PWd (Prosodic Word), PPhr (Prosodic Phrase) and IntPhr (Intonational Phrase).) When the target phrase is produced in longer sentence
contexts, these three target phrases show the same degree of pre-boundary
lengthening on 'cop' compared with the control phrase shown in the first
example of each set, as predicted by their identical prosodic status. In contrast, the substantial differences in number and type of syntactic boundaries
are not reflected in the measurements of pre-boundary lengthening. This observation supports the notion that aspects of an utterance that involve timing
are best accounted for in terms of the prosodic rather than the syntactic
hierarchy.
(36) Prosodic vs. syntactic boundaries
(a) Syntactic boundaries
[The cop [who's [[ [a friend] NP] VP] S] S-overbar] NP]
[The [friendliest] AdjP cop] NP
[The friend [of [the cop] NP] PP] NP]
[The man [who's [[ [a cop] NP] VP] S] S-overbar] NP]
(b) Prosodic boundaries (according to Selkirk 1986 algorithm)
(((The cop)PWd (who's a friend)PWd)PPhr)IntPhr
(((The friendliest)PWd (cop)PWd)PPhr)IntPhr
(((The friend)PWd (of the cop)PWd)PPhr)IntPhr
(((The man)PWd (who's a cop)PWd)PPhr)IntPhr



A final set of studies directly comparing syntactic and prosodic structure as determinants of phonetic realization is described by Jun (1993) for
Korean. She found that the domains of several postlexical rules in Korean
provide support for two intonationally-defined constituents: the Intonational
Phrase and the Accentual Phrase. Jun argues that these intonational constituents cannot be syntactically defined, and thus that intonational constituents
(rather than other prosodic constituents more closely tied to the syntax) are
the appropriate description of domains of phonological rule application (see
Ewen and Anderson (1987) for further discussion).
In Sections 2, 3, and 4 we have sampled some of the evidence for the
claim that prosodic constituents and prominences provide the organizational
structure of spoken utterances. What are the implications of this pattern of
results for studies of auditory sentence processing? In Section 5 we discuss
several of the practical implications, before concluding with four specific
Helpful Hints.

5. PRACTICAL MATTERS
The proposed hierarchies of prosodic prominences and constituents described in Section 3 and supported by a body of evidence provide candidate
descriptions for some of the aspects of spoken utterances that are not fully
determined by morphosyntactic structure. In order to make use of these
proposals for studies of spoken sentence processing, it is necessary to understand what prosodic shape the speaker has given to the particular stimulus
utterances to be used in those studies. To this end, we will describe a proposed system for the transcription of prosody called ToBI, along with some
other potential methods for determining prosodic structure (5.1.), summarize
some of the extrasyntactic factors that appear to influence the speaker's
choice of prosodic structure for a particular utterance (5.2.), and touch briefly
on how the prosodic component might function in the grammar (5.3.).
5.1. Tools for Determining the Prosody of a Specific Utterance
Earlier assumptions that prosody could be predicted from the text of a
sentence alone are not supported by the evidence. Unfortunately, our understanding of the additional factors that determine the speaker's choice of
constituent boundary and prominence locations for a particular utterance is
far from complete. These two facts combined might seem to indicate a bleak
picture for investigators who want to study the effects of different prosodic
structures on sentence processing, or at least to control for these effects by
understanding the prosody of stimulus utterances. However, there are tools
available for transcribing the prosodic shape of an utterance post hoc, and
these tools make it possible to select from among candidate utterances the
ones which best fit the requirements of a given study, as well as to indicate
to a reader the prosodic shape of the utterances that were used. We will
describe one such tool, the ToBI transcription system, in some detail, and
survey a number of other proposed systems for describing the prosody of a
particular utterance.
Motivation for the Development of the ToBI Transcription
A pervasive and persistent problem in the study of prosody has been
the lack of an IPA-like system, widely accepted and practiced, for the transcription of the prosodic structure of spoken utterances. Individual theorists
and laboratories have often developed their own transcription systems, which
capture the aspects of spoken prosody that they feel are critical. But without
direct experience in the relevant laboratory, these systems are sometimes
difficult for others to understand and use. Moreover, written descriptions of
idiosyncratic transcription systems are not wholly satisfactory in their ability
to convey information to a reader, since they rely on the reader's ability to
imagine the prosodic shape from the written text which, as we have seen,
is not always possible. Including waveforms and F0 tracks is helpful in this
regard, but is not always convenient and in point of fact is not often done.
Such problems have created difficulties both for the mastery of published
transcription systems by new users, and for the comprehension of published
studies of prosody.
These problems had long prevented the development of large study
corpora of prosodically transcribed speech that can only be created by pooling data obtained from transcription efforts in many laboratories, using a
reliable and mutually intelligible transcription system. To address this issue,
a group of researchers representing more than 10 laboratories has been working toward the goal of a generally acceptable and reliably useable transcription convention since the spring of 1991.
The ToBI Transcription Convention

The proposed system, called the ToBI transcription convention (for
TOnes and Break Indices), incorporates the approach to intonation proposed
by Beckman and Pierrehumbert (1986), described above in Sections 3.2.2
and 3.3. ToBI also incorporates the suggestion of Price et al. (1991) that
boundaries between words can be marked on a scale that reflects their perceived depth or strength, i.e. that the boundary between each pair of adjacent
words can be assigned a Break Index. Price et al. originally suggested a
range from 0 to 6, which included groupings of Intonational Phrases; ToBI
uses a restricted 5-point scale, ranging from 0 (corresponding to a phonetically-adjusted inter-word boundary as in didju) through 4 (corresponding to
a Full Intonational Phrase boundary). ToBI also restricts the set of Pitch
Accent types, eliminating the bitonal [H* + L] label, and makes several
other changes from Beckman and Pierrehumbert's (1986) theory of intonation. Despite its limitations, it appears that ToBI provides a consistent system
of prosodic transcription that has already provided useful data from several
databases. The system is described in Silverman et al. (1992). A training
handbook with digitized labelled examples is available via anonymous ftp.⁵
Examples of ToBI-transcribed utterances of

(37) Wanted: Chief Justice of the Massachusetts Supreme Court

(38) Whenever a computer randomly calls them from jail

spoken by an FM radio announcer in the BU FM Radio News Corpus (Ostendorf et al., 1995) are shown in Fig. 4a and 4b; the utterance file is available via ftp as noted above (in the Fig. 3 caption). These figures also show
the signal display windows (waveform at the top, F0 at the bottom) and
transcription tiers (in the middle window, tones at the top, with orthography,
break indices, and comments in successively lower tiers) provided by the
labelling tool called Transcriber, written for use with the waves+ signal
analysis software (developed by David Talkin and available through Entropic Research Laboratory, Inc., Washington, D.C.). Transcriber is a menu-driven transcription tool, which reduces typing errors, and also includes a
Checker to flag ungrammatical transcriptions.

⁵ From ldwi@unm.edu.
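To give a sense of what a well-formedness check on a break-index tier might involve, here is a minimal sketch. It is not the Checker shipped with Transcriber, whose rules are not described here; the function name and the two rules it enforces are assumptions drawn only from the description above, namely that each boundary between adjacent words receives one Break Index and that ToBI indices range from 0 through 4.

```python
def check_break_tier(words, breaks):
    """Return a list of problems found in a break-index tier.

    words  : the orthographic words of the utterance, in order
    breaks : one break index per boundary between adjacent words
    An empty list means the tier passed both checks.
    """
    problems = []
    expected = max(len(words) - 1, 0)
    if len(breaks) != expected:
        problems.append(
            f"expected {expected} break indices for {len(words)} words, "
            f"got {len(breaks)}"
        )
    for position, index in enumerate(breaks):
        # ToBI uses a 5-point scale, 0 through 4.
        if index not in range(0, 5):
            problems.append(
                f"break index {index} at boundary {position} is outside 0-4"
            )
    return problems


# 'did you' is given index 0, echoing the phonetically adjusted 'didju'
# boundary mentioned in the text; the second value is purely illustrative.
print(check_break_tier(["did", "you", "go"], [0, 1]))
# A value of 6 is out of range for ToBI and is flagged.
print(check_break_tier(["did", "you", "go"], [0, 6]))
```

A check of this kind catches only mechanical errors; whether a given boundary deserves a 3 or a 4 remains a perceptual judgment made by the transcriber.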
Among the advantages of the ToBI system are its on-line training and
documentation materials, and the fact that it is in active use by several
laboratories, providing both mutually comprehensible transcriptions and the
opportunity to evaluate inter-transcriber reliability (Pitrelli et al., 1994). An
example of the usefulness of speech databases labelled with the same transcription system in different laboratories can be found in Dilley et al. (under
review), which compares the glottalization of vowel-initial syllables at Intonational Phrase boundaries in read radio news speech from the BU FM
Radio News Corpus (Ostendorf et al., 1995), and in spontaneous non-professional speech from the Trains Corpus (Heeman and Allen, 1995). The
two sets of utterances were ToBI-transcribed at different sites by different
transcribers, and produced similar results.
Further Developments and Alternatives

The ToBI system, with versions under development for German, Japanese, and Korean, is just one of a number of systems that are being developed and tested in different laboratories for various languages around the
world. Several transcription systems based on intonational grammars described earlier are also in active use. For example, 't Hart et al. (1990)
review the system for specifying intonational contours developed for Dutch
at the IPO at Eindhoven in the 1970s. This system is based on a small
number of elements, i.e. rises, falls, and plateaus, that combine to specify
the well-formed intonational contours of the language, and it invokes three
target F0 levels. The IPO-system components have been determined for a
number of languages, including British English, Russian and others, and
were quantified for simple declarative sentences in American English by
Maeda (1974). Another system in active use for transcribing English intonation is the British approach (O'Connor and Arnold, 1961; Halliday, 1967)
which specifies a complex Nuclear Accent element for each phrase, flanked
by a preceding Head and a following Tail. Notable alternative approaches
to intonation in American English include Pike (1945), and Chafe (1980 and
its references) which treats additional components of spontaneous spoken
prosody as well as intonation.
The discussion above emphasizes the perceptual transcription of utterance prosody, but it is also possible to make use of the growing body of
phonological and acoustic-phonetic diagnostics for prosodic constituent
boundaries of various types. For example, if the Intermediate Intonational
Phrase is the domain of Early Accent Placement within late-main-stress
words, as proposed by Beckman and Edwards (1994) and observed by Shattuck-Hufnagel et al. (1994), then the occurrence of main-stress accent vs.
early accent in the word Massachusetts might help to disambiguate an utterance of In Massachusetts hospitals don't thrive as:
(39) In MassaCHUsetts, | HOSpitals don't THRIVE

(40) In MASSachusetts HOSpitals, | don't THRIVE.

As the prosodic constraints on the application of phonological rules and on
phonetic realization become better understood, these phenomena will provide clearer information about the prosodic structure of continuous speech.
5.2. Other Factors that Influence Prosodic Structure

We treated syntax as a factor at some length in Section 2 because there
is general agreement that syntactic structure constrains prosody, although as
we have seen the exact nature of this relationship is still under discussion.
Additional factors that have been suggested as influences on the speaker's decision about what prosody to employ for a given utterance of a sentence
include focus, new vs. given information, beliefs about the
assumptions shared by two conversing speakers, and quantitative factors
such as rhythm, number of elements and speaking rate. Perhaps because
attention has been focused so intensively on the relationship between syntax
and prosody, our understanding of precisely what these other factors are,
their relative importance, and how they interact with morphosyntactic structure to determine the speaker's choice of prosody for a given utterance, is
far from complete.
Factors Influencing Placement of Constituent Boundaries
A number of quantitative factors apparently influence the speaker's
choice of prosodic boundary location. For example,


(i) Length and balance.

Several investigators have suggested that speakers have a tendency to
avoid very short prosodic constituents in order to balance, roughly, the
length of the prosodic constituents in an utterance. Klatt (1976) proposes
that syntactic constraints on prosodic phrasing may be overruled in cases
where the alignment of a prosodic boundary with a syntactic boundary
would create a constituent which is very short. He cites results of durational
measurements by Goldhor (1976) which show that single-noun subject NPs
are shorter than the same nouns occurring phrase-finally in a multiple-word
subject NP. A noun which is the only word in a phrase should actually be
longer than the same noun occurring finally within a longer phrase, since
talkers tend to increase their rate of speech when phrases contain a larger
number of words (Lehiste, 1974). Klatt argues that a single-noun NP may
get incorporated into the following phonological phrase, in which case it
would be produced with a shorter duration.
Gee and Grosjean (1983) also propose that balancing the size of constituents plays a role in determining the placement of the boundaries, citing
data from the performance structures revealed by a number of parsing-related
tasks. This factor may provide an account of an observation that has long
vexed syntacticians: when speakers who are naive with respect to explicit
discussion of syntactic structure are asked to determine the major syntactic
boundary in sentences like He wanted a green coat, they are more likely to
place the boundary after wanted than after He. This may reflect the intuition
that this sentence could be uttered as two intonational phrases, He wanted
and a green coat, more closely approximating an equal balance in the number of words and syllables in the two constituents.
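The balance intuition for He wanted a green coat can be made concrete with a small sketch. It is only an illustration of the arithmetic, not the procedure of Gee and Grosjean (1983) or Klatt (1976): it considers syllable balance alone and ignores syntax, focus, and speaking rate. The function names and the hand-assigned syllable counts are assumptions introduced here for illustration.

```python
from typing import List, Tuple


def balance_scores(syllables: List[int]) -> List[Tuple[int, int, int]]:
    """For every possible single boundary, return
    (number of words before the boundary, syllables to the left, syllables to the right)."""
    total = sum(syllables)
    scores = []
    left = 0
    for i in range(1, len(syllables)):
        left += syllables[i - 1]
        scores.append((i, left, total - left))
    return scores


def most_balanced_boundary(syllables: List[int]) -> int:
    """Return the boundary position (in words) that minimizes the difference
    in syllable count between the two resulting constituents."""
    return min(balance_scores(syllables), key=lambda s: abs(s[1] - s[2]))[0]


# Hypothetical input: syllable counts assigned by hand, one per word.
words = ["He", "wanted", "a", "green", "coat"]
syllable_counts = [1, 2, 1, 1, 1]

boundary = most_balanced_boundary(syllable_counts)
print(" ".join(words[:boundary]), "|", " ".join(words[boundary:]))
```

Under these assumptions a boundary after He yields constituents of 1 and 5 syllables, whereas a boundary after wanted yields 3 and 3, which is the split that naive speakers are reported to prefer.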
(ii) Speaking rate.
As we have seen above, in examples like Old Li buys good wine in
Chinese, speaking rate is one of the factors that influences the speaker's
choice of prosodic boundary locations. Similarly, speakers asked to produce
utterances at a slow rate in Gee and Grosjean's (1983) experiment showed
pause duration evidence for boundaries of lower-level constituents which
did not show up at faster rates. Such results illustrate again the claim that
although syntactic structure may provide constraints on the set of well-formed prosodic structures, it does not fully specify the selection among
these forms for a particular utterance.
Jun (1993) cites evidence from Korean showing that both speaking rate
and length (which she expresses as "phonological weight" in syllables)
influence the placement of Accentual Phrase and Intonational Phrase boundaries. She suggests that focus and semantic weight also influence these decisions.

Factors Influencing Placement of Phrase-Level Prominence

We have argued elsewhere (Shattuck-Hufnagel, 1992a; Shattuck-Hufnagel et al., 1994) that the pattern of Pitch Accent placement on individual
syllables within a word is influenced by at least three kinds of information:
semantic and pragmatic factors (e.g. focus, given/new status), structural factors (e.g. speakers prefer to place the initial accent of a new Intonational
Phrase on the first pitch-accentable syllable, even if this is not the main-stress syllable of the accented word, and the final accent of the IP on the
main-stress syllable; if a word contains all the accents of an IP, it tends to
be double accented, i.e. to carry both the phrase-initial and phrase-final accents), and rhythmic factors (e.g. speakers prefer not to place pitch accents
on two adjacent syllables). It is a reasonable but untested hypothesis that
the same kinds of factors govern the selection of which words within the
phrase will be Pitch Accented.

Factors Influencing Choice of Tune

The intonation contours of utterances are often highly familiar constructs; for example, a final fall in F0 is associated with simple declarative
utterances in English, and a final rise with yes/no questions. Pierrehumbert
and Hirschberg (1990) have argued that a phrase's tune, defined as its particular sequence of tonal elements, is associated with a particular meaning
when used in particular discourse contexts. They propose that:
"A speaker chooses a particular tune to convey a particular relationship between an utterance, currently perceived beliefs of a hearer ..., and anticipated
contributions of subsequent utterances." (p. 271)
and that:
"speakers use tune to specify a particular relationship between the 'propositional content' realized in the intonational phrase over which the tune is employed and the mutual beliefs of participants in the current discourse." (p.
They also suggest that tune meaning is compositional, noting that tunes
that share certain tonal features seem intuitively to share some aspects of
meaning, and cite as an example various tunes ending in a movement from
low to high F0 signifying that the current utterance will be completed by a
subsequent utterance (sometimes called a 'continuation rise').


Conspicuously absent from this preliminary list of factors influencing
prosodic decisions is the emotional state of the speaker. While an understanding of this factor is clearly important, we adopt the view that linguistic
and paralinguistic influences can be addressed separately, and choose here
to focus on linguistic influences. See Ladd (to appear) for a discussion of
this distinction. Anecdotally, it seems clear that emotional considerations
can affect the prosodic structure of an utterance, as when the single lexical
item Wowee! is produced as two separate Intonational Phrases, to express
an especially intense feeling.
Although this sketchy treatment does not do justice to the considerable
literature on factors that affect prosodic patterns, the fact of the matter is
that we are not sure what all of these factors are, we don't know much about
their relative importance and we are not sure how they interact. We are far
from understanding, for example, which of two competing factors that indicate different prosodic solutions will prevail in a given context. Leaving
these important issues to future research, we will nonetheless venture to speculate about how the various factors that affect a prosodic decision might interact.
5.3. How the Factors Determining Prosody Might Interact: The
Prosodic Component in the Grammar

If syntax is just one of several disparate factors that influence prosodic
decisions, then the language processor must provide a locus for the integration of these different factors. One such locus could be a separate prosodic
component in the phonology. We illustrate this possibility in Fig. 5, which
presents one possible view of the relationship between the prosodic component and other components of the grammar.
This picture is motivated primarily by two considerations: the need to
integrate several factors that influence phonetic facts such as duration, F0,
etc., and the hypothesis that these separate factors cannot be integrated directly in phonetic processing, but require a separate abstract phonological
An alternative view is that the phonological component is divided into
at least two subcomponents, lexical phonology and postlexical phonology.
In some proposals, discussed in Inkelas and Zec (1990), postlexical phonology corresponds to phrasal phonology, concerned with phrase-level prosody and phonological adjustments in response to combining adjacent words,
but in others, phrase-level and lexical-level phenomena may occur in either
lexical or post-lexical phonology. The important questions of where prosody
resides in the grammar of any particular language, how phonological processes divide themselves between lexical-level and phrase-level components,
and how these distinctions in the grammar relate to the syntactic component,
are not yet resolved. However, it is clear that a number of factors that
contribute to determining the prosodic structure of an utterance must be
combined at some point. We believe that resolution of this question will
hinge, to some extent, on detailed phonological and acoustic-phonetic analyses of the prosodic and segmental realization of specific utterances.

Fig. 5. One view of the role of the prosodic component of the grammar.

Although prosodic inquiry based on current theory is still in its infancy,
we believe that it already has important implications for the design of auditory sentence processing experiments. We summarize these implications
as four questions that an investigator might ask while constructing studies
of auditory sentence processing using spoken utterances.
(1) What is the prosody of the stimulus utterances?

We have seen that prosodic structure cannot be predicted from text
alone: for example, main lexical stress does not always predict the location
of pitch accent within a word, and traditional syntactic boundaries do not
always predict intonational phrase boundaries. As a result, there are many
ways that a speaker can 'prosodify' a given sentence. It is possible that
when we understand more completely the mapping between prosodic structure and other factors, what looks like a range of options will turn out to be
fully determined by these factors and the morphosyntax together, eliminating
the role of speaker choice. But at the moment, written text does not provide
an adequate idea of the prosody that will be produced, or even of the prosody
that is most likely to be produced.
Because a number of factors appear to influence the placement of both
prosodic prominence and prosodic boundaries, as well as the choice of tune,
wise investigators will not assume that they (or the readers of their papers)
can determine the prosody that was produced by the speakers from the
printed text alone. For stimulus creation, this means that pilot experiments
must be carried out to develop a context that produces the desired prosody,
or the prosody must be discovered by transcription. From painful experience
we can report that, if a particular prosodic pattern is desired, it sometimes
requires a substantial amount of pilot elicitation work to determine what
context and instructions will produce the desired prosodic result, and constant vigilance during recording sessions to ensure that the speaker does not
wander off the prosodic track. For publication of results, it means
providing the prosodic transcriptions and/or (perhaps best of all) making
the spoken versions available on-line via ftp.
(2) What is the phonology of the acoustic-phonetic signal being

Prosodic structure cannot be read directly off the signal. For example,
if we assume that the highest F0 peak in a word or phrase occurs on the
most prominent syllable, we will be wrong in many cases: English has a
Low pitch accent, which means that phrase-level prominences will not always correspond to F0 peaks.
Similarly, F0 may continue to rise after a High pitch accented syllable,
so that the actual peak occurs on a following reduced syllable; an example
of this phenomenon is shown in Fig. 6 (utterance available via ftp). Another
example: long duration is not always a cue to prominence; the final syllable
of a phrase is lengthened, even if it is a reduced syllable with the lowest
possible degree of prosodic prominence. Apparently, listeners are able to
distinguish prominence-related from boundary-related duration lengthening,
perhaps because the two factors affect different parts of the syllable (Edwards et al., 1991). As these examples show, it would not be wise to assume
that the highest F0 peaks and/or the longest duration syllables are necessarily
associated with perceptual prominence, since there is a many-to-one and
one-to-many mapping between prosodic structure and these acoustic dimensions. Data about the acoustic signal is more interpretable once the locations
of prominences and boundaries are known for the particular utterance at
hand; prosodic transcription is a useful tool for determining these facts. For
example, in studies of the acoustic correlates of Nuclear Pitch Accent, it is
important to determine that the speaker actually placed the Nuclear Pitch
Accent on the syllable to be measured.

Fig. 6. F0 track and spectrogram (time axis in ms) of the first part of an utterance of In Panama my plane was late, produced as two separate Full Intonational Phrases (digitized utterance available
via anonymous ftp from lexic.mit.edu). Note the delayed F0 peak on the reduced syllable
-na-, following the prominent syllable Pa-.

This problem can arise even for the elicitation of simple utterances for
the study of acoustic phonetic patterns. An example is the ubiquitous frame
sentence for simple target word or syllable elicitation in English, "Say targetword again." In our experience, speakers can produce even this simple
sentence with a variety of prosodic patterns, i.e. with different tunes, constituent boundary placements and prominence placements. Some examples
that we have encountered are shown in (41), where upper case indicates a
pitch-accented word and a comma indicates an intonational phrase boundary:
(41) (a) say TARGETWORD again
(c) SAY targetword AGAIN



Since a growing body of evidence suggests prosodic constituent boundaries and accents systematically affect the production of phonemic targets,
it is important to ensure that critical comparison targets are produced with
comparable prosodic patterns, or at least to know what prosodic patterns the
speaker used.
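
One way to check that comparison targets share a prosodic pattern is to group transcribed tokens by their accent and boundary placement. The sketch below assumes that each token has already been transcribed by hand in the notation of (41), with upper case marking a pitch-accented word and a comma marking an intonational phrase boundary after the preceding word. The function names are hypothetical, and the sketch does not replace a full prosodic transcription, since it records nothing about tune or accent type.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pattern = Tuple[Tuple[int, ...], Tuple[int, ...]]


def prosodic_pattern(transcribed: str) -> Pattern:
    """Return (positions of pitch-accented words,
    positions of words followed by an intonational phrase boundary)."""
    accents, boundaries = [], []
    for i, token in enumerate(transcribed.split()):
        word = token.rstrip(",")
        if word.isupper():          # upper case = pitch-accented word
            accents.append(i)
        if token.endswith(","):     # comma = intonational phrase boundary
            boundaries.append(i)
    return tuple(accents), tuple(boundaries)


def group_by_pattern(tokens: List[str]) -> Dict[Pattern, List[str]]:
    """Group transcribed tokens that share accent and boundary placement."""
    groups = defaultdict(list)
    for t in tokens:
        groups[prosodic_pattern(t)].append(t)
    return dict(groups)


# The two renditions of the frame sentence listed in (41).
renditions = ["say TARGETWORD again", "SAY targetword AGAIN"]
for pattern, members in group_by_pattern(renditions).items():
    print(pattern, members)
```

Tokens that fall into different groups were produced with different prominence or boundary placements, so acoustic measurements of the target word should not be pooled across groups without comment.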
(3) Is there a prosodic as well as a morphosyntactic interpretation of
the results?

Many experimental results can be interpreted in terms of several different types of constituents, e.g. morphosyntactic or prosodic. Most studies,
however, examine only one of these possibilities. Ideally, a given study will
compare the efficacy of both morphosyntactic and prosodic structures as
determinants of observed results. Such comparisons will provide evidence
to test the hypothesis that prosody is a separate component of the grammar.
(4) What is meant by various terms that refer to prosodic elements?

Many terms used in prosody have several possible interpretations. In
particular, the words 'stress', 'foot', 'metrical', 'word', and 'phonological
phrase' have been used to mean many different things. For example, the
term 'Foot' has been used to refer to both Within-Word (Rhythmic-Stress)
Feet and Cross-Word-Boundary Feet. Terms traditionally used to refer to
types of Cross-Word-Boundary Feet in poetry, such as 'iamb', 'trochee',
and 'anapest', have been adopted to refer to both Within-Word and Cross-Word-Boundary Feet in running speech. The term 'stress' has been used to
refer to both phrase-level and lexical-level prominence; this can cause a
problem if the user of the term has a particular type of prominence in mind,
and the reader has another. Even the seemingly innocent term 'word' is now
somewhat problematical, since it may refer to a lexical or a prosodic word
and these two meanings do not delineate the same classes of items. The
term 'Intonational Phrase' now is best used to refer to the set made up of
both Intermediate and Full Intonational Phrases; if Full Intonational Phrases
with Boundary Tones are intended, it is useful to say so explicitly. The habit
of defining terms like stress, foot, phrase, and word in each presentation and
publication is particularly important, because it means that the reader can
begin reasoning from the same point.
In sum, four hints will help in the use of spoken utterances for
auditory sentence processing studies:
1) Since prosody can't be predicted from text, specify the prosody of
stimulus utterances as it was actually produced.
2) Since prosody can't be read off the signal alone, inform acoustic
measurements by perceptual transcription of the prosodic structure of
target utterances.
3) Consider interpretations of results in terms of prosodic as well as
morphosyntactic structure.
4) Define those terms!


Abercrombie, D. (1965). Syllable quantity and enclitics in English. In Studies in Phonetics and Linguistics, Oxford University Press.
Abercrombie, D. (1973). A phonetician's view of verse structure. In W.E. Jones and J.
Laver (eds), Phonetics in Linguistics: a book of readings, London: Longman.
Beckman, M. E. (1996). The parsing of prosody. Language and Cognitive Processes (to appear).
Beckman, M. E., & Edwards, J. (1990). Lengthenings and shortenings and the nature of
prosodic constituency. In J. Kingston and M.E. Beckman (eds), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech, Cambridge:
Cambridge University Press.
Beckman, M. E., & Edwards, J. (1994). Articulatory evidence for differentiating stress
categories. In P. Keating (ed), Phonological Structure and Phonetic Form: Papers
in Laboratory Phonology III, Cambridge: Cambridge University Press.
Beckman, M. E., & Pierrehumbert, J. (1986). Intonational structure in Japanese and
English. Phonology Yearbook 3, 255-309.
Berkovits, R. (1993). Utterance-final lengthening and the duration of final stop closures.
J. Phonetics 21, 479-489.
Berkovits, R. (1993a). Progressive utterance-final lengthening in syllables with final fricatives. Language and Speech 36, 89-98.
Bickmore, L. (1990). Branching nodes and prosodic categories: Evidence from Kinyambo. In S. Inkelas and D. Zec (eds), The Phonology-Syntax Connection, U. Chicago Press, 1-18.
Bolinger, D. (1958). A theory of pitch accents in English. Word 14, 109-149.
Bolinger, D. (1965). Pitch accent and sentence rhythm. In Bolinger, D.L., Forms of
English: Accent, Morpheme, Order, Cambridge, Mass: Harvard University Press,
163 ff.
Bolinger, D. (1981). Two kinds of vowels, two kinds of rhythm. Manuscript distributed
by the Indiana University Linguistics Club, Bloomington, Indiana.
Brown, E., & Miron, M.S. (1971). Lexical and syntactic predictors of the distribution of
pause time in reading. J. Verbal Learning and Verbal Behavior 10, 658-667.
Chafe, W. (1980). ed., The Pear Stories: Cognitive, Cultural and Linguistic Aspects of
Narrative Production. Norwood, N.J.: Ablex.
Cheng, C.-C. (1970). Domain of phonological rule application. In J.M. Sadock and A.L.
Vanek (eds), Studies Presented to Robert B. Lees by his Students, Edmonton: Linguistic Research, 39-60.


Cheng, C.-C. (1973). A Synchronic Phonology of Mandarin Chinese. The Hague: Mouton.
Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. New York: Harper
and Row.
Clements, G. N. (1978). Tone and syntax in Ewe. In D.J. Napoli (ed), Elements of Tone,
Stress and Intonation, Washington, D.C.: Georgetown University Press, 21-99.
Cooper, A. M. (1991). Laryngeal and oral gestures in English /p, t, k/. Proceedings of the
XIIth International Congress of Phonetic Sciences, Aix-en-Provence, Vol. 2, 50-53.
Cooper, W. E., & Paccia-Cooper, J. (1980). Syntax and Speech. Cambridge, Mass: Harvard University Press.
Cruttenden, A. (1986). Intonation. Cambridge: Cambridge University Press.
Cutler, A., Mehler, J., Norris, D. G., & Segui, J. (1986). The syllable's differing role in
the segmentation of French and English. Journal of Memory & Language 25, 385-400.
Dilley, L., & Shattuck-Hufnagel, S. (1995). Variability in glottalization of word-onset
vowels in American English. Proc XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 4, pp. 586-589.
Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (1994). Prosodic constraints in glottalization of vowel-initial syllables in American English. JASA 95 (5-pt. 2), 2978-2979.
Dilley, L., Shattuck-Hufnagel, S., & Ostendorf, M. (under revision), Glottalization of
vowel-initial words as a function of prosodic structure.
Edwards, J., Beckman, M. E., & Fletcher, J. (1991). The articulatory kinematics of final
lengthening. JASA 89 (1), 369-382.
Ewen, C., & Anderson, J. (eds.) (1987). Phonology Yearbook 4: Syntactic Conditions on
Phonological Rules. Cambridge: Cambridge University Press.
Fant, G., Kruckenberg, A., & Nord, L. (1991). Stress patterns and rhythm in the reading
of prose and poetry with analogies to music performance. Presented at the Music,
Language, Speech, and Brain International Wenner Gren Symposium, Stockholm.
Ferreira, F. (1991). Creation of prosody during sentence production. Psychological Review 100 (2), 233-253.
Fodor, J. A., Bever, T. G., & Garrett, M. F. (1974). The Psychology of Language: An
Introduction to Psycholinguistics and Generative Grammar. New York: McGraw-Hill.
Fougeron, C. (1996). Articulation of French nasal segments depending on their prosodic
position. Presented at the January meeting of the Linguistic Society of America,
San Diego.
Fougeron, C., & Keating, P. (1995). Demarcating prosodic groups with articulation. JASA
97 (5-pt 2), 3384 and UCLA ms.
Garrett, M. F., Bever, T. G., & Fodor, J. A. (1966). The active use of grammar in speech
perception. Perception and Psychophysics 1, 30-32.
Gerken, L. A., Jusczyk, P. W., & Mandel, D. R. (1994). When prosody fails to cue
syntactic structure: Nine-month-olds' sensitivity to phonological vs. syntactic
phrases. Cognition 51, 237-265.
Gee, J. P., & Grosjean, F. (1983). Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology 15, 411-458.
Goldhor, R. S. (1976). Sentential determinants of duration in speech. MIT ms.
Goldman-Eisler, F. (1972). Pauses, clauses, sentences. Language and Speech 15, 103-113.


Grosjean, F., Grosjean, L., & Lane, H. (1979). The patterns of silence: Performance
structures in sentence production. Cognitive Psychology 11, 58-81.
Gussenhoven, C. (1984). On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris.
Gussenhoven, C. (1992). Intonational phrasing and the prosodic hierarchy. Phonologica
1988, 89-99.
Gussenhoven, C., & Rietveld, A. C. M. (1992). Intonation contours, prosodic structure
and preboundary lengthening. J. Phonetics 20, 283-303.
Haegeman, L. (1993). Introduction to Government and Binding Theory. Oxford: Blackwell.
Hale, K., & Selkirk, E. O. (1987). Government and tonal phrasing in Papago. Phonology
Yearbook 4: 151-183.
Halle, M., & Vergnaud, J.-R. (1987). An Essay on Stress. Cambridge, Mass: MIT Press.
Halliday, M. A. K. (1967). Intonation and Grammar in British English. The Hague:
't Hart, J., Collier, R., & Cohen, A. (1990). A Perceptual Study of Intonation. Cambridge:
Cambridge University Press.
Hayes, B. (1981). A metrical theory of stress rules. MIT PhD thesis, revised version
distributed by the Indiana University Linguistics Club, Bloomington, Indiana. Published by Garland Press, NY, 1985.
Hayes, B. (1983). A grid-based theory of English meter. Linguistic Inquiry 14, 357-393.
Hayes, B. (1984). The phonology of rhythm in English. Linguistic Inquiry 15, 33-74.
Hayes, B. (1989). The prosodic hierarchy in meter. In P. Kiparsky and G. Youmans
(eds.), Phonetics and Phonology, Vol 1: Rhythm and Meter. San Diego: Academic
Press. pp. 201-260.
Hayes, B. (1989a). Compensatory lengthening in moraic phonology. Linguistic Inquiry
20, 253-306.
Hayes, B. (1995). Metrical Stress Theory: Principles and Case Studies. Chicago: University of Chicago Press.
Hayes, B., & Lahiri, A. (1991). Bengali intonational phonology. Natural Language and
Linguistic Theory 9, 47-96.
Heeman, P., & Allen, J. (1995). The TRAINS 93 Dialogues. TRAINS Technical Note
94-2, University of Rochester.
Horne, M. (1990). Empirical evidence for a deletion formulation of the rhythm rule in
English. Linguistics 28, 959-981.
Hyman, L. (1985). A theory of phonological weight. Dordrecht: Foris.
Inkelas, S. (1989). Prosodic constituency in the lexicon. U. Mass Amherst PhD thesis.
Inkelas, S., & Zec, D. (1990). (eds), The Phonology-Syntax Connection. Chicago: University of Chicago Press.
Inkelas, S., & Zec, D. (1993). Auxiliary reduction without empty categories: A prosodic
account. Working Papers of the Cornell Phonetics Laboratory 8, 205-253.
Ito, J., & Mester, R.-A. (1992). Weak layering and word binarity. University of Santa
Cruz ms.
Jun, S.-A. (1993). The phonetics and phonology of Korean prosody. Ohio State University PhD thesis.
Jusczyk, P. W., & Aslin, R. N. (1995). Infants' detection of the sound patterns of words
in fluent speech. Cognitive Psychology 28, 1-23.
Kahn, D. (1976). Syllable-based generalizations in English phonology. Ms. distributed
by the Indiana University Linguistics Club, Bloomington, Indiana.


Kaisse, E. (1985). Connected Speech: The Interaction of Syntax and Phonology. Orlando:
Academic Press.
Kaisse, E. M., & Zwicky, A. M. (1987). Introduction: Syntactic influences on phonological rules. In C. Ewen and J. Anderson (eds.), Phonology Yearbook 4, Cambridge
University Press.
Katada, F. (1990). On the representation of moras: Evidence from a language game.
Linguistic Inquiry 21, 641-646.
Kelly, M. (1989). Rhythm and language change in English. J. Memory and Language
28, 690-710.
Kenstowicz, M. (1994). Phonology in Generative Grammar. Cambridge, Mass: Blackwell.
Kiparsky, P. (1979). Metrical structure assignment is cyclic. Linguistic Inquiry 10, 421-442.
Kisseberth, C., & Abasheikh, M. K. (1974). Vowel length in Chimwi:ni -- a case study
of the role of grammar in phonology. In M.M.L. Galy, R.A. Fox, and A. Bruck
(eds), Papers from the Parasession on Natural Phonology, Chicago: Chicago Linguistics Society.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and
perceptual evidence. JASA 59, 1208-1220.
Krakow, R. (1989). The articulatory organization of syllables: A kinematic analysis of
labial and velic gestures. Yale University PhD thesis.
Ladd, R. (1986). Intonational phrasing: The case for recursive prosodic structure. Phonology Yearbook 3, 311-340.
Ladd, R. (to appear 1996). Intonational Phonology. Cambridge: Cambridge University Press.
Ladd, R., & Campbell, N. (1991). Theories of prosodic structure: Evidence from syllable
duration. Proceedings of the XIIth International Congress of Phonetic Sciences, Aixen-Provence, II, 290-293.
Ladefoged, P. (1975). A Course in Phonetics. New York: Harcourt, Brace, Jovanovich.
Ladefoged, P., & Broadbent, D. E. (1960). Perception of sequence in auditory events.
Quarterly J. of Experimental Psychology 13, 162-170.
Lehiste, I. (1973). Phonetic disambiguation of syntactic ambiguity. Glossa 7, 107-121.
Lehiste, I. (1974). Interaction between test word duration and the length of utterance.
Ohio State University Working Papers in Linguistics 17, 160-169.
Lehiste, I., Olive, J. P., & Streeter, L. A. (1976). The role of duration in disambiguating
syntactically ambiguous sentences. JASA 60, 1199-1202.
Liberman, M. Y. (1975). The intonational system of English. MIT Linguistics PhD thesis.
Liberman, M. Y., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry
8, 249-336.
Lieberman, P. (1960). Some acoustic correlates of word stress in American English. JASA
32, 451-454.
Maeda, S. (1974). A characterization of fundamental frequency contours of speech. Quarterly Progress Report, MIT Research Laboratory of Electronics 114, 193-211.
Martin, E. (1970). Toward an analysis of subjective phrase structure. Psychological Bulletin 74, 153-166.
McCarthy, J. J. (1993). A case of surface constraint violation. Canadian Journal of
Linguistics 38 (2), 169-195.
McCarthy, J. J., & Prince, A. (1995). Prosodic morphology. In J. Goldsmith (ed), A
Handbook of Phonological Theory. Oxford: Basil Blackwell.

McCawley, J. D. (1968). The Phonological Component of a Grammar of Japanese. The
Hague: Mouton.
Mehler, J., Dommergues, J. Y., & Frauenfelder, U. (1981). The syllable's role in speech
segmentation. Journal of Verbal Learning and Verbal Behavior 20, 298-305.
Nespor, M., & Vogel, I. (1986). Prosodic Phonology. Dordrecht: Foris Publications.
O'Connor, J. D., & Arnold, G. F. (1961), Intonation of Colloquial English. London:
Ostendorf, M., Price, P., & Shattuck-Hufnagel, S. (1995). The Boston University radio
news corpus. Boston University ECS Technical Report ECS-95-001.
Otake, T., Hatano, G., Cutler, A., & Mehler, J. (1993). Mora or syllable? Speech segmentation in Japanese. J. Memory and Language 32, 258-278.
Pierrehumbert, J. (1980). The phonology and phonetics of English intonation. MIT Linguistics PhD thesis. Distributed by the Indiana University Linguistics Club, Bloomington, Indiana.
Pierrehumbert, J., & Beckman, M. E. (1988). Japanese Tone Structure. Cambridge,
Mass.: MIT Press.
Pierrehumbert, J., & Hirschberg, J. (1990). The meaning of intonational contours in the
interpretation of discourse. In P. Cohen, J. Morgan and M. Pollack (eds), Intentions
in Communication, Cambridge, Mass.: MIT Press.
Pierrehumbert, J., & Talkin, D. (1992). Lenition of /h/ and glottal stop. In G.J. Docherty
and D.R. Ladd (eds), Papers in Laboratory Phonology II: Gesture, Segment, Prosody, Cambridge: Cambridge University Press.
Pike, K. (1945). The Intonation of American English. Ann Arbor: University of Michigan
Pilon, R. (1981). Segmentation of speech in a foreign language. J. Psycholinguistic Research 10, 113-121.
Pitrelli, J., Beckman, M. E., & Hirschberg, J. (1994). Evaluation of prosodic transcription
labelling reliability in the ToBI framework. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP), Yokohama, Japan, VI, 123-126.
Port, R. R., Dalby, J., & O'Dell, M. (1986). Evidence for mora timing in Japanese. JASA
81 (5), 1574-1585.
Prevost, S., & Steedman, M. (1994). Specifying intonation from context for speech synthesis. Speech Communication 15, 139-153.
Price, P. J., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). The use of prosody
in syntactic disambiguation. JASA 90, 2956-2970.
Prince, A. (1983). Relating to the grid. Linguistic Inquiry 14, 19-100.
Prince, A., & Smolensky, P. (1993). Optimality theory: constraint interaction in generative grammar. Rutgers University and University of Colorado ms.
Pulgram, E. (1970). Syllable, Word, Nexus, Cursus. The Hague: Mouton.
Scott, D. R. (1982). Duration as a cue to the perception of a phrase boundary. J. Acoustical Society of America 71, 996-1007.
Selkirk, E. O. (1972). The phrase phonology of English and French. MIT Linguistics
PhD thesis, distributed by the Indiana University Linguistics Club, Bloomington
Indiana, 1981.
Selkirk, E. O. (1978). On prosodic structure and its relation to syntactic structure. In T.
Fretheim (ed), Nordic Prosody II, Trondheim: TAPIR.
Selkirk, E. O. (1980). Prosodic domains in phonology: Sanskrit revisited. In M. Aronoff
and M.-L. Kean (eds), Juncture, Anma Libri, PO Box 876, Saratoga, Calif. 107-129.


Selkirk, E. O. (1984). Phonology and Syntax: The Relation Between Sound and Structure.
Cambridge, Mass.: MIT Press.
Selkirk, E. O. (1986). On derived domains in sentence phonology. Phonology Yearbook
3, 3 7 1 4 0 5 .
Selkirk, E. O. (1993). Modularity in constraints on prosodic structure. Ms., presented at
the ESCA Workshop on Prosody, Lurid.
Selkirk, E. O. (1993a). Accent focus and given~new: The role for focus projection. U
Mass. Amherst ms.
Selkirk, E. O. (1994).
Selkirk, E. O. (to appear). The prosodic structure of function words. In J. Martin and K.
Demuth (eds), International Conference on Bootstrapping from Speech to Grammar
in Early Acquisition, Brown University, Providence RI, Hillsdale N.J.: Lawrence
Selkirk, E. O., & Shen, X. (1990). Prosodic domains in Shanghai Chinese. In S. Inkelas
and D. Zec (eds), The Phonology-Syntax Connection. Chicago: The University of
Chicago Press.
Selkirk, E. O., & Tateishi, K. (1988). Constraints on minor phrase formation in Japanese.
Papers from the Twenty-fourth Regional Meeting of the Chicago Linguistic Society,
Chicago: Chicago Linguistics Society.
Selkirk, E. O., & Tateishi, K. (1991). Syntax and downstep in Japanese. In C. Georgopoulos, and R. Ishihara (eds.), Interdisciplinary Approaches to Language. Dordrecht:
Kluwer Academic Publishing.
Sereno, J. A., & Jongman, A. (1995). Acoustic correlates of grammatical class. Language
and Speech 38, 57-76.
Shattuck-Hufnagel, S. (1988). Acoustic phonetic correlates of stress shift. JASA 84, S1,
Shattuck-Hufnagel, S. (1992). The role of word structure in segmental serial ordering.
Cognition 42, 213-259.
Shattuck-Hufnagel, S. (1992a). Stress shift as pitch accent placement: Within-word early
accent placement in American English. In Proceedings of the International Conference on Spoken Language Processing, Banff, v. 1 pp. 747-750.
Shattuck-Hufnagel, S. (1995). The importance of phonological transcription in empirical
approaches to 'stress shift' vs. 'early accent.' In B. Connell and A. Arvaniti (eds),
Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV, Cambridge:
Cambridge University Press.
Shattuck-Hufnagel, S., Ostendorf, M., & Ross, K. (1994). Stress shift and early pitch
accent placement in lexical items in American English. J. Phonetics 22, 357-388.
Shih, C.-L. (1986). The prosodic domain of tone sandhi in Chinese. UCSD PhD thesis.
Silva, D. J. (1989). Determining the domain for intervocalic stop voicing in Korean. In
S. Kuno et al. (eds.) Harvard Studies in Korean Linguistics III. Harvard University,
Cambridge, MA.
Silverman, K. (1987). The structure and processing of fundamental frequency contours.
Cambridge University PhD thesis.
Silverman, K., Beckman, M. B., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P.,
Pierrehumbert, J., & Hirschberg, J. (1992). ToBI: A standard for labeling English
prosody. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Banff, II, 867-870.
Sluijter, A. M. C. (1995). Phonetic Correlates of Stress and Accent. Holland Institute of
Generative Linguistics, Den Haag: CIP-Gegevens Koninklijke Bibliotheek, University of Leyden PhD thesis.

Sluijter, A. M. C., Shattuck-Hufnagel, S., Stevens, K. N., & van Heuven, V. (1995).
Supralaryngeal resonance and glottal pulse shape as correlates of stress and accent
in English. In Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, II, 630-633.
Sluijter, A. M. C., & van Heuven, V. J. (to appear). Effects of focus distribution, pitch
accent and lexical stress on the temporal organization of syllables in Dutch. Phonetica.
Steedman, M. (1991). Structure and intonation. Language 68, 260-296.
Stevens, K. N. (1994). Prosodic influences on glottal waveform: Preliminary data. In
Proceedings of the International Symposium on Prosody, Yokohama, 53-64.
Streeter, L. (1978). Acoustic determinants of phrase boundary perception. JASA 64,
Suci, G. (1967). The validity of pause as an index of units in language. J. Verbal Learning and Verbal Behavior 6, 26-32.
Turk, A. E., & Sawusch, J. R. (1995). The domain of the durational effects of accent.
Speech Group Working Papers, Research Laboratory of Electronics, Massachusetts
Institute of Technology, Cambridge, Mass, Vol X, 42-71.
Vanderslice, R., & Ladefoged, P. (1972). Binary suprasegmental features and transformational word-accentuation rules. Language 48, 819-836.
Wakefield, J. R., Doughtie, E. B., & Yom, L. (1974). Identification of structural components of an unknown language. J. Psycholinguistic Research 3, 262-269.
Wightman, C. W., Shattuck-Hufnagel, S., Ostendorf, M., & Price, P. J. (1992). Segmental
durations in the vicinity of prosodic phrase boundaries. JASA 91 (3), 1707-1717.