
Rhythm Metrics of Spoken Korean*

Tae-Yeoub Jang
(Hankuk University of Foreign Studies)
Tae-Yeoub Jang. 2009. Rhythm metrics of spoken Korean.
Language and Linguistics 46, 169-186. In this paper, the rhythm
characteristics of spoken Korean are investigated. Instead of the
conventional approach, in which rhythm events are examined through
researchers' impressionistic judgments and interpretation, numerical
measures are calculated so that a large amount of data can be
processed. In addition, automatic speech recognition (ASR) techniques
are adopted with a view to future application of the measured values
in practical systems. Results show that some metrics are useful for
characterizing the rhythm structure of Korean utterances in comparison
with other languages, especially those traditionally classified as
stress-based. It is also discussed whether the conventional dichotomy
of rhythm classification is valid and plausible.
Keywords: Korean rhythm, rhythm metrics, syllable-based rhythm, stress-based
rhythm
1. Introduction
The traditional binary distinction of speech rhythm, i.e.,
syllable-timed versus stress-timed (Pike 1945, Abercrombie 1967), has
been considerably weakened by phonetic studies providing experimental
evidence against inter-stress or inter-syllable isochrony (Pointon
1980, Dauer 1983, Roach 1982, among others). The terminology has also
become generalized to syllable-based and stress-based rhythm
(Laver 1994: 528), and these terms will be adopted in the current study.

* This work was supported by Hankuk University of Foreign Studies Research Fund
of 2008.
Recent studies on speech rhythm tend to focus on discovering
acoustic characteristics of rhythm that distinguish one language
from another, regardless of whether it is classified as stress-based
or syllable-based. Metrics such as %V, ΔV and ΔC (Ramus et al. 1999),
VarcoC and VarcoV (Dellwo 2006, 2008), and nPVI-V and rPVI-C (Grabe
and Low 2002) have been found useful for characterizing the rhythmic
structure of various spoken languages.
A number of languages other than English have been investigated
using one or more of these metrics, including Chinese (Lin and Wang
2007), Japanese (Murty et al. 2007), Thai and Polish (Grabe and Low
2002), Estonian (Grabe and Low 2002, Asu and Nolan 2005), and Dutch,
French and Spanish (White and Mattys 2007). These studies attempt to
show that cross-linguistic comparisons of rhythm structure are
convenient and valid when variability metrics are employed.
The speech rhythm of Korean has most frequently been categorized
as syllable-based, although it has also been classified as
stress-based (Lee 1982) or even mora-based (Cho 2004). However,
Korean rhythm has not yet been systematically investigated in terms
of the various rhythm metrics proposed relatively recently.
Consequently, in the current study, eight metrics (%V, ΔV, ΔC,
VarcoC, VarcoV, nPVI-V, rPVI-C, and speech rate) are utilized in an
attempt to characterize the rhythm structure of spoken Korean. As the
numerical results will be hard to interpret on their own, they will
be compared with corresponding values reported in previous studies on
the rhythm metrics of other languages such as English, Dutch,
Spanish, and French. This will lead to further discussion and a
tentative determination of which type of rhythm better describes
spoken Korean.
The conventional approach to rhythm investigation depends mainly
on the impressionistic judgment and interpretation of speech data.
Even most studies that characterize speech rhythm using fairly recent
rhythm metrics calculate the measures through manual data treatment.
A shortcoming of this method is that only a relatively small amount
of data can be examined. By contrast, in the current study, a large
number of speech tokens will be analyzed in a fully automatic way so
that more reliable and robust results can be derived. These results
are directly applicable to practical systems such as speech
recognition and synthesis systems and pronunciation education tools.
2. Rhythm measures
The basic idea underlying various rhythm metrics is that
stress-based languages have more variable temporal characteristics
among their vocalic and/or consonantal intervals of spoken
utterances than syllable-based languages. The phonological contrast
triggered by stressing or de-stressing syllables brings about
differences in acoustic quality, especially in temporal
characteristics, through elongation of stressed syllables and
shortening of unstressed or reduced syllables. Therefore, syllable
duration will be more variable in stress-based languages than
in syllable-based languages. The syllable structure of stress-based
languages is also expected to be more complex as assigning stress is
usually related to the weight of the syllables, for instance, by
allowing consonant clusters in the onset and/or coda position of a
syllable. This will generate greater variability among consonantal
intervals.
Eight rhythm metrics that are most frequently investigated in
many languages have been employed in the current study. The
notion of each metric can be summarized as in (1).

(1) a. %V: proportion of vocalic intervals (Ramus et al. 1999)
    b. ΔV: raw variability of vocalic intervals (Ramus et al. 1999)
    c. ΔC: raw variability of consonantal intervals (Ramus et al. 1999)
    d. VarcoC: rate-normalized variability of consonantal intervals
       (Dellwo 2006)
    e. VarcoV: rate-normalized variability of vocalic intervals
       (Dellwo 2006, White and Mattys 2007)
    f. nPVI-V: normalized Pairwise Variability Index for vocalic
       intervals (Grabe and Low 2002)
    g. rPVI-C: raw Pairwise Variability Index for consonantal
       intervals (Grabe and Low 2002)
    h. Speech rate: number of syllables per second
The metric %V is calculated by dividing the total duration of
vocalic intervals by the total duration of the utterance.
Utterance-internal pauses and other non-speech parts, which are
frequently found in non-native speech, are excluded in the current
study. The measures ΔV and ΔC are obtained by calculating the
standard deviations of vocalic and consonantal interval durations,
respectively. VarcoV and VarcoC are rate-normalised versions of ΔV
and ΔC, derived by dividing the raw values by the corresponding mean
interval duration. The PVI values represent the variability between
adjacent vocalic intervals (nPVI-V) and adjacent consonantal
intervals (rPVI-C); these pairwise indices are expected to capture
the complexity of neighboring syllables. The last metric, speech
rate, has been included both to check whether the other metrics
remain reliable rhythm-type discriminators regardless of articulation
rate and to examine the normal speech rate of Korean in comparison
with other languages.
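
For concreteness, the following Python sketch shows one way the eight measures can be computed from lists of vocalic and consonantal interval durations (in seconds) and a syllable count. It follows the definitions in (1) and the cited sources; the exact scaling conventions (e.g., reporting ΔV and rPVI-C in milliseconds and multiplying Varco and nPVI by 100) are assumptions for illustration, not details taken from this paper.

from statistics import mean, pstdev

def rhythm_metrics(voc, cons, n_syllables):
    """Compute the eight rhythm metrics from vocalic/consonantal
    interval durations in seconds. Formulas follow Ramus et al. (1999),
    Dellwo (2006) and Grabe and Low (2002); scaling factors assumed."""
    total = sum(voc) + sum(cons)
    pct_v = 100 * sum(voc) / total                     # %V
    delta_v = 1000 * pstdev(voc)                       # ΔV (ms)
    delta_c = 1000 * pstdev(cons)                      # ΔC (ms)
    varco_v = 100 * pstdev(voc) / mean(voc)            # VarcoV
    varco_c = 100 * pstdev(cons) / mean(cons)          # VarcoC
    npvi_v = 100 * mean(abs(a - b) / ((a + b) / 2)     # nPVI-V
                        for a, b in zip(voc, voc[1:]))
    rpvi_c = 1000 * mean(abs(a - b)                    # rPVI-C (ms)
                         for a, b in zip(cons, cons[1:]))
    rate = n_syllables / total                         # syllables per second
    return dict(pct_v=pct_v, delta_v=delta_v, delta_c=delta_c,
                varco_v=varco_v, varco_c=varco_c,
                npvi_v=npvi_v, rpvi_c=rpvi_c, rate=rate)

Given the interval durations of a single utterance token, the function returns one value per metric for that token; per-language figures are then obtained by averaging over tokens, as described in Section 3.3.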
The basic expectation is that the values of the variability metrics
((1b) to (1g) above) for Korean, if it is categorized as a
syllable-based language, will be lower than those of English or other
stress-based languages, presumably due to the lack of
duration-sensitive prosodic events such as stress and pitch accents.
On the other hand, %V is expected to be greater in Korean than in
stress-based languages, considering that the relatively complicated
syllable-internal consonant clusters of stress-based languages (e.g.,
three consecutive consonants in the onset or four in the coda of
English syllables) are not phonetically permissible in Korean.
Finally, the speech rate of Korean utterances, measured as the number
of syllables per unit time, is expected to be closer to that of other
syllable-based languages such as Spanish or French than to that of
stress-based languages, assuming that syllables of syllable-based
languages in general comprise fewer segments than those of
stress-based languages. These hypotheses will be tested in the
experiment.
3. Data and methods
Speech data tokens are extracted from a large database, A read
speech corpus of Seoul Korean (2003), developed and released by The
National Institute of the Korean Language.1) After sentence-unit
tokenization and other necessary preprocessing, phonetic annotation
is performed to generate detailed phone-level boundary information.
Given the size of the data, automatic segmentation and labeling
techniques are employed to avoid the excessively large amount of time
and effort required for manual annotation.

1) The corpus is publicly available at http://www.korean.go.kr.
3.1. Data
Assuming no rhythm variation across speakers' ages, I select 40
speakers ranging in age from their 20s to their 40s: 10 males and 10
females in their 20s, 10 males in their 30s, and 10 females in their
40s. The recording prompt of the corpus, according to the corpus
documentation, is composed of passages from 19 Korean fairy tales and
short novels. Each speaker read 930 sentences, and a total of 36,410
speech tokens are used in the current analysis.2)

2) The total number should be 37,200 (930 sentences x 40 speakers), but 790
tokens are excluded: files rendered erroneous by defective recording, and files
containing sentences too short (e.g., a-ni-o) for the extraction of
utterance-level rhythm structure to be appropriate.

Speech data files are stored as digital waveform files with 16-bit
quantization and a 16 kHz sampling rate.
3.2. Automatic segmentation
As the calculation of the rhythm metrics is based on the temporal
information of each phone in the speech tokens, phone-level
segmentation and annotation is the most important procedure prior to
metric calculation. As is usually the case, this is achieved by
constructing a phone-level automatic speech recognizer, whose
implementation is summarised in Table 1.
Training data        KAIST data (Park et al. 1995) with 10,863 read
                     speech tokens by 89 Korean speakers
Units                34 phoneme-like units including silence
Modeling             3-state continuous left-to-right Hidden Markov Models
Features             39-dimensional feature vectors:
                     12 MFCC + 1 energy + 13 deltas + 13 delta-deltas
System enhancement   utterance-internal pause/silence modeling;
                     dictionary expansion through pronunciation
                     variation modeling
<Table 1> Automatic phone recognizer specification
Automatic segmentation is conducted with the Hidden Markov Model
Toolkit (HTK), version 3.2 (Young et al. 2003). Figure 1 shows an
example of autolabels compared with the corresponding labels produced
by hand.

<Figure 1> Comparison between autolabels (upper tier) and hand labels (lower
tier) for a single utterance, "tuk wi-lo ol-la seoss-ta". This autolabel file
was randomly selected from the whole set of autolabels, and I performed the
corresponding hand labeling without referring to the previously generated
autolabels.
As illustrated, the two labeling methods agree closely. It has been
verified that over 94% of the autolabel boundaries lie within 10 msec
of their corresponding hand-label boundaries. This does not
necessarily mean that the autolabeler is responsible for the
remaining 6% of mismatches. In fact, errors in hand labeling tend to
be frequent and unpredictable enough that it is premature to
determine which method is more reliable, apart from the time
required. As a consequence, it is assumed in the current analysis
that autolabeling does not significantly undermine the quality of
measurement.
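
As a rough illustration of how phone boundary information of this kind can be turned into the vocalic and consonantal intervals needed in Section 2, the Python sketch below reads an HTK-style label file (start and end times in 100-nanosecond units, one phone per line) and merges runs of adjacent vowels and of adjacent consonants. The vowel label set, the file layout, and the function names are illustrative assumptions, not details taken from the recognizer described above.

# Assumed vowel label set; the actual 34 phoneme-like units of the
# recognizer are not listed in the paper.
VOWELS = {"a", "e", "i", "o", "u", "eo", "eu", "ae", "oe", "wi"}
SKIP = {"sil", "sp"}  # silence / short-pause labels break intervals

def read_htk_labels(path):
    """Yield (start_sec, end_sec, label) from an HTK-style .lab file,
    where times are given in 100 ns units."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            start, end, label = line.split()[:3]
            yield int(start) / 1e7, int(end) / 1e7, label

def vocalic_consonantal_intervals(labels):
    """Merge consecutive vowels (or consonants) into single intervals
    and return two lists of interval durations in seconds."""
    voc, cons = [], []
    current_kind, current_dur = None, 0.0
    for start, end, label in labels:
        kind = None if label in SKIP else ("V" if label in VOWELS else "C")
        if kind is not None and kind == current_kind:
            current_dur += end - start          # extend the current interval
        else:
            if current_kind == "V":
                voc.append(current_dur)
            elif current_kind == "C":
                cons.append(current_dur)
            current_kind = kind
            current_dur = end - start if kind else 0.0
    if current_kind == "V":
        voc.append(current_dur)
    elif current_kind == "C":
        cons.append(current_dur)
    return voc, cons

Pauses and silences close the current interval rather than being counted, which matches the exclusion of utterance-internal non-speech parts described in Section 2.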
3.3. Calculation of metrics
Once the phonetic labels are available, all the rhythm metrics are
calculated, again automatically: the metrics are computed for each
utterance token, and an overall value for each of the eight metrics
is then obtained by averaging the corresponding per-token values. A
program was created so that this task is performed fully
automatically, as shown in Figure 2.

<Figure 2> Metrics calculation procedure
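
The procedure in Figure 2 can be thought of as a simple loop over label files. A minimal sketch is given below, assuming the two helper functions from the earlier sketches, a directory of per-utterance label files, and a hypothetical mapping from file names to syllable counts; none of these names come from the paper itself.

from pathlib import Path
from statistics import mean, pstdev

def corpus_means(label_dir, syllable_counts):
    """Average each rhythm metric over all utterance tokens.
    `syllable_counts` maps a file stem to its syllable count, assumed
    to come from the orthographic transcription."""
    per_token = []
    for lab in sorted(Path(label_dir).glob("*.lab")):
        voc, cons = vocalic_consonantal_intervals(read_htk_labels(lab))
        metrics = rhythm_metrics(voc, cons, syllable_counts[lab.stem])
        per_token.append(metrics)
    # Mean and standard deviation of each metric across tokens
    return {k: (mean(m[k] for m in per_token),
                pstdev(m[k] for m in per_token))
            for k in per_token[0]}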
3.4. Cross-linguistic comparison
The metrics are interpreted by comparing their values with the
corresponding metric values for English, Dutch, Spanish and French
reported in White and Mattys (2007). Although there are other studies
measuring rhythm metrics for various languages, it is not appropriate
to use their results for comparison, since their methods are not
uniform: they occasionally modify the formulas of the rhythm metrics
for language-specific adaptation.

However, the validity of the present comparison is still limited,
mainly because the data used in the current experiment and the data
for the other languages in White and Mattys (2007) are not
homogeneous. No more than 30 tokens were used for each of the four
languages in their experiment, whereas a much greater number of data
tokens are analyzed here for Korean. Inevitably, the standard
deviation of each metric value for Korean will be much greater than
that for the other languages. Certainly, the comparison would have
been more parallel if the metric values for the other languages had
been calculated from a larger database, but this difference does not
make it inappropriate to compare the results, as the algorithm for
each rhythm metric is consistent across the two studies.
Nevertheless, due to the difference in data size, the cross-linguistic
comparison is performed without statistical significance tests.
4. Experimental results
Table 2 presents the metric values for Korean and the four other
languages. Each value is the mean (with standard deviation in
parentheses) over all tokens.

             Syllable-based              ?              Stress-based
             Spanish      French       Korean       English      Dutch
%V           48 (0.8)     45 (0.5)     54 (6.9)     38 (0.5)     41 (1.2)
ΔV           32 (1.9)     44 (2.2)     64 (22.5)    49 (2.2)     49 (2.6)
ΔC           40 (2.3)     51 (3.6)     49 (18.3)    59 (2.4)     49 (4.1)
VarcoV       41 (2.0)     50 (0.9)     64 (14.0)    64 (1.7)     65 (1.5)
VarcoC       46 (2.0)     44 (0.8)     59 (14.0)    47 (1.0)     44 (1.8)
nPVI-V       36 (1.6)     50 (1.8)     61 (13.2)    73 (1.2)     82 (2.4)
rPVI-C       43 (2.1)     56 (4.3)     55 (18.6)    70 (2.8)     52 (4.2)
Speech rate
(syl/sec)    8.0 (0.3)    5.6 (0.3)    6.4 (0.9)    5.2 (0.2)    6.0 (0.3)
<Table 2> Mean values (and standard deviations) of the rhythm metrics of Korean
compared with those of Spanish, French, English and Dutch. Values for languages
other than Korean are based on White and Mattys (2007).
The results show that the metrics %V and nPVI-V appear to support
the hypothesis that Korean speech rhythm is syllable-based. In
particular, the %V value of Korean is greater than that of all the
other languages considered. This indicates that Korean utterances
contain a relatively larger proportion of vocalic intervals than
English utterances, conforming to the notion that a language closer
to the ideal syllable-based type requires its syllables to contain
less complex consonant clusters than languages with the ideal
stress-based rhythm.
Unexpectedly, VarcoV for Korean is closer to the stress-based
languages, and ΔV, ΔC and rPVI-C do not seem to give useful
information for classification either. On the other hand, the speech
rate of Korean seems, at a glance, to be meaningful, as it is greater
than that of English and smaller than that of Spanish. However,
further investigation is necessary to confirm its role in
characterizing speech rhythm, as various other factors, including
speaker style, text type and recording environment, are believed to
affect speech rate to some degree. Thus, it is premature to regard
speech rate as a critical cue to the distinction between
syllable-based and stress-based rhythm.
All in all, on the assumption that Korean has a more syllable-based
than stress-based rhythm, the two metrics %V and nPVI-V are helpful
for describing its rhythm structure. Consequently, the rhythm
structures of Korean and the other four languages are represented in
Figure 3 in terms of these two rhythm measures.
<Figure 3> Rhythm structure of five languages (Korean, Spanish, French, Dutch
and English) based on the vocalic-interval rhythm metrics: the coordinates of
each point represent (nPVI-V, %V).
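
A figure of this kind can be reproduced directly from the values in Table 2. The following matplotlib sketch is one way to do so; the axis assignment (nPVI-V on the x-axis, %V on the y-axis) is an assumption based on the ordering given in the caption.

import matplotlib.pyplot as plt

# (nPVI-V, %V) pairs taken from Table 2
languages = {
    "Spanish": (36, 48),
    "French": (50, 45),
    "Korean": (61, 54),
    "English": (73, 38),
    "Dutch": (82, 41),
}

fig, ax = plt.subplots()
for name, (npvi_v, pct_v) in languages.items():
    ax.scatter(npvi_v, pct_v)
    ax.annotate(name, (npvi_v, pct_v),
                textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("nPVI-V")
ax.set_ylabel("%V")
ax.set_title("Rhythm structure of five languages")
plt.show()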
At a glance, based on an imaginary separation line, Korean appears
to be classified as a syllable-based language together with Spanish
and French. This observation, however, can be somewhat illusory when
we consider each of the two metrics independently of the other. While
%V locates Korean uppermost, placing it farther from English and
Dutch, the other metric, nPVI-V, puts Korean somewhere in the middle
of a continuum between the two extreme types of rhythm. This implies
that these two measures are not sufficient to define the rhythm class
of Korean, although it is likely that spoken Korean has a fairly
different rhythm structure from conventionally classified
syllable-based languages.
Before concluding, on the basis of these two metrics, that the
rhythm dichotomy is misleading or that the characterization of rhythm
varies depending on individual rhythm metrics, it is necessary to
check whether these metrics are seriously affected by another factor.
The most suspect factor is speech rate, as other studies have argued
that metrics such as ΔV and ΔC are susceptible to the rate of speech
(Barry et al. 2003, Dellwo and Wagner 2003). If %V and nPVI-V are
also found to change significantly with variation in speech rate,
they should not be considered to play an important role in
characterizing rhythm, as they might be vulnerable to other factors.
To perform this verification, 100 slow speech tokens (4-5 syl/sec)
and 100 fast speech tokens (6-7 syl/sec) are randomly selected, and
the metrics are calculated for each token. The significance of the
difference between the metrics obtained at the two speech rates is
then assessed with two-tailed t-tests. The results are shown in
Table 3.
                  Slow             Fast
                  (4-5 syl/sec)    (6-7 syl/sec)    p value
No. of tokens     100              100
%V                55.1             54.7             p < .73
ΔV                87.5             64.1             p < .001
ΔC                65.5             47.8             p < .001
VarcoV            57.3             66.0             p < .57
VarcoC            62.0             59.9             p < .37
nPVI-V            65.4             63.0             p < .29
rPVI-C            72.3             52.9             p < .001
Speech rate
(syl/sec)         4.69             6.44
<Table 3> Metrics and speech rate
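
As a minimal illustration of how such a comparison could be run, the sketch below applies SciPy's independent-samples t-test (two-tailed by default) to per-token %V values of the slow and fast groups; the variable names and the equal-variance default are assumptions, since the paper does not report these details.

from scipy import stats

# pct_v_slow and pct_v_fast are assumed to be lists of per-token %V
# values for the 100 slow (4-5 syl/sec) and 100 fast (6-7 syl/sec) tokens.
def compare_rates(pct_v_slow, pct_v_fast):
    """Two-tailed independent-samples t-test on %V across rate groups."""
    t_stat, p_value = stats.ttest_ind(pct_v_slow, pct_v_fast)
    return t_stat, p_value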
Notice that the values of the two metrics in question, %V and
nPVI-V, do not vary significantly with changes in speech rate, while
other metrics such as ΔV, ΔC and rPVI-C do. This implies that %V and
nPVI-V are at least insusceptible to speech rate, which considerably
affects the other rhythm metrics. Therefore, the two metrics can
safely be assumed to be useful measures for rhythm characterization,
though further, more elaborate verification will be necessary for a
more decisive conclusion.
5. Conclusion and future research
It has been shown that spoken Korean can be characterized by some,
but not all, of the rhythm measures calculated by automatic methods.
Metrics such as %V, nPVI-V and speech rate appear to be especially
useful. As the whole procedure of metric extraction is automated, the
methods and results described in this paper are expected to be
particularly useful for practical applications such as the
implementation of prosody processing systems and pronunciation
education tools.

Based on the metrics utilized in the current experiment, Korean
seems to be categorized as a language with a syllable-based rhythm
structure, but it is not as distinct from the stress-based languages
as Spanish or French are. More comprehensive comparisons with metrics
extracted from other languages are expected to give further clues as
to whether it makes better sense to describe speech rhythm in
relative terms, on a continuum between the two stereotypes of speech
rhythm, i.e., extremely syllable-timed and extremely stress-timed,
rather than in terms of the traditional, stronger version of the
binary classification. Still another possibility, not yet obvious, is
that even a weak version of the dichotomy is not appropriate, as
individual metrics can categorize a given language in quite different
ways. To come closer to a clarification, more languages need to be
tested in terms of other rhythm measures that have been employed in
previous research or that are still to be discovered in the future.
References
Abercrombie, D. (1967) Elements of General Phonetics. Edinburgh:
Edinburgh University Press.
Asu, E. L. and F. Nolan (2005) Estonian rhythm and the Pairwise
Variability Index. In Proceedings of Fonetik 2005, 29-32.
Göteborg University, Sweden.
Barry, W. J., B. Andreeva, M. Russo, S. Dimitrova and T.
Kostadinova (2003) Do rhythm measures tell us anything
about language type? In Proceedings of the 15th
International Congress of Phonetic Sciences, 2693-2696.
Barcelona.
Cho, Moon-Hwan. (2004) Rhythm typology of Korean speech.
Cognitive Processing 5: 249-253.
Dauer, R. M. (1983) Stress-timing and syllable-timing reanalyzed.
Journal of Phonetics 11: 51-62.
Dellwo, V. (2006) Rhythm and speech rate: A variation coefficient
for delta C. In P. Karnowski and I. Szigeti (eds.), Language
and Language Processing: Proceedings of the 38th
Linguistic Colloquium 231-241, Piliscsaba 2003. Frankfurt:
Peter Lang.
Dellwo, V. (2008) The role of speech rate in perceiving speech
rhythm. In Proceedings of Speech Prosody 2008, 375-378.
Campinas.
Dellwo, V. and P. Wagner (2003) Relations between language
rhythm and speech rate. In Proceedings of the 15th
International Congress of Phonetic Sciences, 471-474.
Barcelona.
Grabe, E. and E. L. Low (2002) Durational variability in speech
and the rhythm class hypothesis. Papers in Laboratory
Phonology 7: 515-546. Berlin: Mouton.
Laver, J. (1994) Principles of Phonetics. Cambridge: Cambridge
University Press.
Lee, Hyun-Bok. (1982) A phonetic study on Korean rhythm.
Malsori [Speech Sounds] 4: 31-48. The Korean Society of
Phonetic Sciences and Speech Technology (in Korean).
Lin, Hua and Qian Wang (2007) Mandarin rhythm: An acoustic
study. Journal of Chinese Linguistics and Computing
17(3): 127-140.
Murty, L., T. Otake and Anne Cutler (2007) Perceptual tests of
rhythmic similarity: I. mora rhythm. Language and Speech
50(1): 77-99.
Nazzi, T., J. Bertoncini and J. Mehler (1998) Language
discrimination by newborns: toward an understanding of the
role of rhythm. Journal of Experimental Psychology 24(3):
756-766.
Park, J., O. Kwon, D. Kim, I. Choi, H. Jeong and C. Un (1995)
Speech data collection for Korean speech recognition. The
Journal of the Acoustical Society of Korea 14(4): 74-81. (in
Korean).
Pike, K. (1945) The Intonation of American English. Ann Arbor:
University of Michigan Press.
Pointon, G. (1980) Is Spanish really syllable-timed? Journal of
Phonetics 8: 293-305.
Ramus, F., M. Nespor and J. Mehler (1999) Correlates of linguistic
rhythm in the speech signal. Cognition 73: 265-292.
Roach, P. (1982) On the distinction between stress-timed and
syllable-timed languages. In D. Crystal (ed.), Linguistic
Controversies 73-79. London: Edward Arnold.
White, L. and S. L. Mattys (2007) Calibrating rhythm: first
        language and second language studies. Journal of Phonetics
        35: 501-522.
Young, S., G. Evermann, D. Kershaw, G. Moore, J. Odell, D.
        Ollason, D. Povey, V. Valtchev and P. Woodland (2003)
        HTK Book (for version 3.2). Cambridge University Press.
Department of English Linguistics
Hankuk University of Foreign Studies
270 Imun-dong, Dongdaemun-gu, Seoul 130-791, Korea
e-mail: tae@hufs.ac.kr
Received: Aug. 31, 2009
Revised: Oct. 15, 2009
Accepted: Oct. 18, 2009
