
BOSTON UNIVERSITY COLLEGE OF ENGINEERING Thesis

PROSODY PREDICTION FOR SPEECH SYNTHESIS USING TRANSFORMATIONAL RULE-BASED LEARNING


by Cameron S. Fordyce, B.A. Russian, Middlebury College, Middlebury, Vermont, 1985

Submitted in partial fulfillment of the requirements for the degree of Master of Science 1998

Approved by First Reader Dr. Mari Ostendorf, Associate Professor, Department of Electrical and Computer Engineering, Boston University

Second Reader Dr. Lynette Hirschman, Head, Intelligent Information Access Section, MITRE, Inc.

Third Reader Dr. Nanette Veilleux, Assistant Professor, Department of Computer Science, MET, Boston University

Acknowledgments
Many people have provided advice, substantial support, or were great role models. Special thanks to Professor Mari Ostendorf for her insights and guidance. I know of no better example of a good researcher and scientist. I owe a special thanks to Dr. Lynette Hirschman for her many comments on earlier versions of this thesis. It was a course on computational linguistics taught by Lynette and John Burger, also of MITRE, that provided the inspiration for using transformational rule-based learning. Thanks go to Nanette Veilleux for her comments and support on the thesis. Thanks especially for the technical and emotional support from SPI-lab people: Rukmini Iyer, Ashvin Kannan, David Palmer, Izhak Shaik, Randy Fish, Micheal Bacchiani, Rebecca Bates, Justin Hahn, Greg Grozdits, Loretta Hawkes, and Jay Hancock. Proofreading, practice talks, and conversations that ranged from chocolate to deciphering compiler errors were necessary and appreciated. Finally, none of this would have been possible without my wife, Cristiana Cucinotta Fordyce. She kept my spirits up in the dark moments. I would like to dedicate this thesis to my father-in-law, Concetto Cucinotta. This research was supported by the NSF under grant number IRI-9528990.


Prosody Prediction for Speech Synthesis using Transformational Rule-based Learning Cameron S. Fordyce Boston University, College of Engineering, 1998 Major Professor: Mari Ostendorf, Associate Professor, Electrical and Computer Engineering

Abstract

Current speech synthesis systems produce intelligible output under conditions with low background noise and low cognitive load. However, the quality is far from natural, and intelligibility degrades significantly under less than ideal conditions. One widely agreed upon area of improvement in speech synthesis output is prosody. Prosody includes the acoustic characteristics of speech that communicate important syntactic, semantic, and discourse information about the utterance. The acoustic correlates of prosody are the pauses, fundamental frequency contours, energy, and duration changes of utterances. Typically, prosody synthesis is a two-step process, where symbolic prosodic labels such as phrase boundaries and relative emphasis are predicted from annotated text, and then the acoustic correlates are predicted from these labels combined with phonetic information. The goal of this research is to improve the prediction of symbolic prosodic labels for text-to-speech systems, specifically, the location of phrase boundaries and phrase-level emphasis (i.e., pitch accents). To date, the most successful algorithms for predicting symbolic prosodic labels are based on either handwritten rules or statistical methods. This research adopts and modifies an alternative algorithm: transformational rule-based learning (TRBL), which has had success in many natural language processing tasks. This learning algorithm is automatically trainable like statistical methods, but is less sensitive to sparse training data conditions than those methods. A second contribution of the thesis is an analysis of the interaction of phrase and accent symbols in prediction. Previous approaches have predicted these prosodic events in a serial fashion, but the order is not agreed upon.

In this study, we compare serial prediction with the two possible orders, as well as explore joint prediction. To facilitate joint prediction, the TRBL learning algorithm is combined with a multi-level feature and label representation. Experimental studies were conducted on a prosodically labeled corpus of radio news speech, predicting presence vs. absence of accent at the syllable level and three levels of phrase breaks (none, major, minor) at the word level. First, TRBL was compared to the most popular statistical method, decision trees, for the task of pitch accent location prediction with known phrase boundaries, where TRBL gave a small improvement in prediction accuracy over decision trees. Second, a distance-based metric for phrase prediction design was proposed and evaluated, arguing that this metric better describes the linguistic differences between different levels of phrase breaks than does an exact match accuracy. The results showed that the new metric results in higher prediction rates for minor phrase boundaries. Next, experiments were conducted to assess the appropriate order of serial prediction, finding that phrase structure is used in accent prediction but not vice versa, and therefore that phrase structure should be predicted first. Experiments also showed that the performance loss associated with using predicted vs. actual spoken phrase boundaries in accent prediction can be almost entirely regained when using training data labeled with predicted boundaries. A final experiment compares serial prediction with the joint prediction of pitch accents and phrase boundaries using TRBL, finding no advantage to joint prediction.

Contents

1 Introduction
2 Background
  2.1 Speech Synthesis
  2.2 Linguistic Theory of Prosody
    2.2.1 Pitch Accents
    2.2.2 Phrase Boundaries
    2.2.3 Prosodic Labeling
  2.3 Prediction Algorithms
    2.3.1 Handwritten Rules
    2.3.2 Statistical Methods
3 Corpora
  3.1 Boston University Radio News Corpus
  3.2 The Lexicon
4 Prediction Algorithms
  4.1 Overview
  4.2 Decision Trees
    4.2.1 Decision Tree Design
    4.2.2 Using Decision Trees with a Markov Assumption
  4.3 Transformational Rule-Based Learning
  4.4 Joint Prediction of Labels at Two Time Scales
5 Experiments
  5.1 Features
  5.2 Quantitative Evaluation Measures
    5.2.1 Pitch Accent Evaluation
    5.2.2 Phrase Boundary Evaluation Measures
  5.3 Pitch Accent Prediction Experiments
  5.4 Phrase Boundary Prediction Experiments
  5.5 Serial Accent and Phrase Boundary Prediction
  5.6 Joint Phrase and Accent Prediction
6 Discussion
  6.1 Summary
  6.2 Contributions
  6.3 Future Directions
A Penn Treebank Part-of-Speech Tags
B Part-of-speech Classes

List of Tables

3.1 Summary of pitch accent frequency in the Boston University Radio News Corpus.
3.2 Summary of phrase boundary frequency in the Boston University Radio News Corpus.
5.1 Summary of lexical features. These features are associated with syllables, and are used for pitch accent location prediction.
5.2 Summary of syntactic feature categories. Part-of-speech features are used in both pitch accent and phrase boundary prediction experiments. These features occur at the word level.
5.3 Summary of prosodic labels. Pitch accent labels align with syllables and phrase boundaries align with words.
5.4 Summary of pitch accent location prediction with phrase boundaries given for the Boston University Radio News Corpus. Accuracy is reported at the syllable level for an exact match with a target version. There are 3359 syllables total.
5.5 Confusion table for best decision tree.
5.6 Accuracy with different initializations using the best-case TRBL template set.
5.7 Confusion table for best TRBL rules with no "and" template. Minimum gain set to 5.
5.8 Rules learned with the constrained TRBL algorithm. "Errors" is the sum of errors corrected minus the sum of errors introduced in the training data.
5.9 Top 10 rules learned with the best-case TRBL algorithm. Where the feature entry contains two features, the "and" template was chosen, and the first value corresponds to the first feature and the second value to the second feature.
5.10 Syllable accuracy results for decreasing amounts of training data.
5.11 Summary of phrase boundary location with accuracy rates and average absolute distance given for the Boston University Radio News Corpus.
5.12 Rules for best-case TRBL using the punctuation initialization rule. Distance score is the improvement in difference between the old distance and the new distance for each rule.
5.13 Confusion table for best-case TRBL rules for phrase prediction at the word level.
5.14 Pitch accent accuracy using predicted boundaries.
5.15 Comparison of serial and joint prediction algorithms using TRBL.

List of Figures

2.1 Main components of a text-to-speech synthesis system.
2.2 An example of an F0 contour and ToBI labels [10].
4.1 A sample decision tree with output probabilities. N and P refer to the label applied to data at each terminal node (the square boxes) and the label of the majority of the data at internal nodes (ovals).
4.2 A simple diagram of the transformational rule-based learning process.
5.1 The pruned tree using the basic decision tree algorithm for pitch accent location prediction.
5.2 The pruned tree using the decision tree with a Markov assumption for pitch accent location prediction.
5.3 A comparison of predicted accents with TRBL vs. hand-labeled accents.
5.4 A comparison of phrase boundaries predicted with TRBL and hand-labeled boundaries.

Chapter 1 Introduction
The development of speech synthesis systems has reached an acceptable level with regard to the intelligibility of the speech output as judged by most users in low noise environments. However, such acoustic environments are relatively rare. In many commonly found environments, such as over telephone lines or in locations with much background noise, where a speech synthesis system might be utilized as part of a spoken dialog system, the ability to understand the output of speech synthesizers degrades very quickly [19, 53]. Even when current synthesizers produce intelligible output, the output compares poorly in "naturalness" with human speech, which directly impacts the acceptance of spoken dialog systems. Further, Boogaart and Silverman have shown that synthetic speech places more demands on the user than does human speech [14]. It follows that user acceptance of spoken dialog systems also decreases if the speech production component requires more concentration than natural speech communication between humans [53]. It has been proposed by many researchers that one important area in which the intelligibility and "naturalness" of synthesized speech can be improved is the prosody of the synthesized speech [28, 32, 46, 53]. Prosody is an important component of human speech that helps carry important semantic, syntactic, and discourse information of the utterance.

Prosody is found in the acoustic characteristics of human speech, which include pauses, fundamental frequency (F0) contours, and energy of the sounds. One aspect of prosody, the variation of the fundamental frequency contour, is often referred to as "intonation." While these characteristics have continuously varying parameters, humans are able to perceive them as discrete events in natural speech. For example, listeners can reliably transcribe phrase boundaries and pitch accents (e.g., phrase-level emphasis labels) [40]. For that reason, many synthesis systems divide prosody prediction into two stages: first predict symbolic events from annotated text, and then predict the acoustic correlates given these labels and phonetic pronunciations. Both steps are areas of on-going research. The focus of this work is on the first step, improving the prediction of abstract prosodic labels from text. There are a large number of different prosodic characteristics found in human speech that can be predicted. Focusing on predicting discrete labels, the present work addresses prediction of the location of pitch accents and the location and size of phrase boundaries from a linguistic analysis of the text input. Pitch accents mark emphasis placed by the speaker on a particular word in the utterance. Phrase boundaries mark the grouping of the words in the utterance. Since most theories of prosody have some concept of pitch accents and phrase boundaries, the work described here should be applicable to most theoretical frameworks. Pitch accents and phrase boundaries will be described in more detail in Chapter 2. The specific linguistic features used in this work will be discussed in more detail in Chapters 2 and 5. Most current approaches to predicting abstract prosodic labels for speech synthesis rely on a set of rules which assign prosodic labels to the text input based on a linguistic analysis of the input. These rules embody a "model" of prosody. Methods for creating this model rely on either manual creation of rules or automatically trainable algorithms such as decision trees. The first approach requires human experts, who rely on linguistic knowledge and data analysis, and, as a result, is very costly in development time (see Hirschberg for a discussion of human rule creation [24]).

The huge expense makes adaptation of the rules to a new application or domain prohibitive. It is also possible that the resulting rules may not capture all the specific prosodic patterns of a particular domain or style of speech. The second approach, automatic training, lowers the cost of adapting models of prosody to new domains. Automatic training can also find patterns in the inputs that human experts might miss. Automatic training has the disadvantage that it requires hand-labeled data for training, which is expensive to create. However, when compared to the cost of hand-crafting linguistic rules, labeling a corpus is potentially cheaper, since it may be automated and may benefit other research. Of the automatic methods, classification and regression trees, or decision trees, are the most widely used. Decision trees have been successful in a number of tasks such as predicting phrase breaks [37, 61], prominences or pitch accents [24, 46], and segmental durations [44]. Decision trees automatically and recursively choose one question that minimizes classification error and partitions the data, as will be discussed further in Chapter 4. The advantages of decision trees are that they choose the ordering of feature tests for prediction automatically, they permit categorical and numerical feature values, they create models that can provide linguistic intuitions about the data, and they can be used for prosody recognition [46, 59]. The primary disadvantage of decision trees lies in the training method. For each new node learned, the data available for learning a question is decreased. Thus, as later questions are asked, there are fewer data points from the corpus with which to learn an appropriate question. In a related problem, at each new question the data is partitioned, causing it to be unavailable to other nodes. This may cause the algorithm to miss some of the dependencies in the data. This work will offer another automatic method of predicting abstract prosodic labels based on transformational rule-based learning (TRBL). TRBL is currently used in the natural language processing community in a number of applications, such as part-of-speech tagging [17], prepositional phrase attachment [16], word segmentation [38], discourse processing [48], spelling correction [30], and partial parsing or chunking [43].¹

There are four advantages to TRBL. First, it allows for automatic training. Second, the resulting model is easy to examine and can provide intuitions about the underlying interactions between different aspects of the data. Third, TRBL has achieved good results with less training data than other statistical methods in some tasks. In [17], rule-based part-of-speech tagging achieved the same accuracy as hidden Markov models with a substantially smaller training corpus, 64k vs. 1M words. This characteristic is important to prosody prediction because of the dearth of prosodically annotated corpora. Fourth, in contrast to decision trees, the data set is not fragmented during the training process, allowing TRBL to capture dependencies that might not be captured using decision trees. Another advantage of TRBL is that it allows us to assess the trade-off of joint vs. serial prediction of accents and phrase boundaries. There is some evidence that these two aspects of prosody are interdependent, as will be discussed in Chapter 2, and joint prediction may better capture these dependencies. Most earlier research has predicted prosodic labels in a serial fashion due to the limitations of the prediction algorithms [24, 46]. Using TRBL, this work will include an investigation of the joint prediction of pitch accents and phrase boundaries. Since TRBL can incorporate rules operating over long time scales without an exponentially growing cost (unlike n-gram models), it is straightforward to combine prediction of both accents and phrase boundaries, which effectively operate at different time scales. In addition, TRBL will be less sensitive to the data sparsity associated with predicting the product space of the labels. In summary, the main goals of this work are: 1. to demonstrate the usefulness of TRBL for predicting phrase and accent prosodic labels for speech synthesis, and 2. to use the TRBL algorithm to assess the trade-offs of joint vs. serial phrase and accent prediction.
¹ "Chunking" is a term used by Steven Abney to denote the syntactic pre-parsing of sentences [3].

The next chapter, Chapter 2, will provide some background information on the problem of predicting prosody for speech synthesis systems. I will give an overview of speech synthesis systems, indicating where prosody prediction fits in the system. I will then discuss some basics of the linguistic theory of prosody, focusing on pitch accents and phrase boundaries. Finally, I will provide a brief overview of currently used prosody prediction algorithms for speech synthesis and motivate the development of a new approach. Chapter 3 will describe the corpus used in this research. The Boston University Radio News Corpus, a prosodically labeled corpus of radio news style speech, is used in the algorithm development. Chapter 4 will present the prediction algorithms used in this work: hand-written rules, decision trees, and TRBL. Modifications used to adapt TRBL to predicting two levels of prosodic labels will be presented. I will also discuss the advantages and disadvantages of decision trees and TRBL in more detail. Chapter 5 will first present the linguistic features used in the prediction experiments. Next, results from experiments predicting pitch accent locations and phrase boundaries alone will be presented. Serial and joint prediction results will then be discussed. In Chapter 6, I will summarize the main results and discuss their significance for prosody prediction and speech synthesis research. In particular, I will present the implications that my results have for understanding the mechanisms of prosody. This will be followed by a discussion of future work.

Chapter 2 Background
In this chapter, I will provide the background necessary to place the present research in the broader context of speech synthesis and prosody research. In Section 2.1, I will give an overview of a speech synthesis system. Section 2.2 will present some background on linguistic theory concerning prosody, focusing on pitch accents and intonational phrases. The final section of this chapter, Section 2.3, will present the most prevalent methods adopted in research and commercial speech synthesis systems to predict prosodic labels, specifically handwritten rules and statistical methods.

2.1 Speech Synthesis


There are two approaches to the synthesis of speech: the storage of pre-recorded waveforms, or "canned speech," and the synthesis of an acoustic waveform from unrestricted text. Both methods have a relatively long and varied history.¹ The first method is currently used in many telephony applications, where a caller hears pre-recorded utterances such as
¹ See [28] for a more complete account of the history of speech synthesis.

The number you requested, 555-1212, can be automatically dialed for an additional 35 cents.

in response to a request for information. Pre-recorded speech can be natural sounding and effective in many applications. This method is also relatively simple to implement. The disadvantages of pre-recorded speech are several. First, pre-recorded speech may require upwards of four megabytes of storage space for a single minute of digitized speech. Second, the entire course of a conversation, or "discourse space," must be known beforehand. Later changes to the collection of recorded speech are often difficult to make. Third, each application must be designed separately, because the recorded speech can rarely be reused in other applications. Finally, the prosody of the speech may not be natural where pre-recorded words must be spliced together. In the example above, the numbers are different for each request and must be added at the time of the response. Some systems attempted to address the limitations of this method by recording and storing smaller units of speech such as individual words or individual sounds [28]. Such an approach allows for lower storage costs and more portability, but at the cost of speed (e.g., due to the search through the inventory of recorded units for the appropriate sequence of units) and at the higher cost of "naturalness." The latter may be the most important drawback of a system that relies on pre-recorded speech. The speech resulting from the concatenation of words can sound "choppy" and unnatural. The concatenation of pre-recorded units of speech ignores very important boundary effects of speech sounds on one another (e.g., co-articulation effects) and the prosody of the utterance, which acts at time scales greater than the individual units. The second approach to the synthesis of speech is to produce an acoustic waveform from unrestricted text input by representing and/or generating sub-word units of speech. This approach addresses many of the problems with pre-recorded speech. The acoustic output is created on-the-fly and does not require storage. The set of possible outputs does not need to be known in advance, since the synthesizer can produce an infinite number of different acoustic waveforms.

Figure 2.1: Main components of a text-to-speech synthesis system. [Block diagram: Text → Text Processing → Text Annotations → Prosody & Pronunciation Prediction → Phone Labels / Prosodic Labels → Waveform Synthesis → Speech]
In this approach, the synthesizer converts input text to a waveform. There are three stages to this process: first, a linguistic analysis of the input is performed; second, the resulting information is used in the conversion of the input to phonemes and prosodic annotation of the text; and third, the phonemes and prosodic annotations are used to create an acoustic waveform. These steps are shown in Figure 2.1. In English, the input text to a TTS system can include words, Arabic and Roman numerals, abbreviations, and punctuation symbols. The first step in any TTS system is to convert the input to a uniform representation. This corresponds to the first box in Fig. 2.1. The initial treatment of the text is called "normalization." Text contains abbreviations, numerals, and formatting conventions for email, prose, or poetry. Normalization converts the text into a sequence of words. This includes converting numeric symbols to their text representation, expanding abbreviations to their full spellings (e.g., Ave. becomes avenue), and interpreting punctuation and text formatting appropriately [45]. One example of the problems found at this step is the proper representation of number sequences. The same sequence of numbers may have several different text representations, as illustrated below.

212 as part of a telephone number: two, one, two
212 as an amount: two hundred and twelve
212 as an address: two twelve
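To make the ambiguity concrete, a minimal normalization pass might look like the following Python sketch. The abbreviation table, the phone-number pattern, and the digit-by-digit reading rule are illustrative assumptions, not the normalization rules of any particular TTS system.

    import re

    # Hypothetical abbreviation table; real systems use much larger inventories.
    ABBREVIATIONS = {"Ave.": "avenue", "St.": "street", "Dr.": "doctor"}

    DIGIT_NAMES = ["zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]

    def spell_digits(token):
        # Read a digit string digit by digit, as for a telephone number.
        return ", ".join(DIGIT_NAMES[int(d)] for d in token if d.isdigit())

    def normalize(text):
        # Toy normalization: expand abbreviations and spell out tokens that
        # look like phone numbers. Choosing between "two one two" and
        # "two hundred and twelve" in general needs context not modeled here.
        words = []
        for token in text.split():
            if token in ABBREVIATIONS:
                words.append(ABBREVIATIONS[token])
            elif re.fullmatch(r"\d{3}-\d{4}", token):
                words.append(spell_digits(token))
            else:
                words.append(token)
        return " ".join(words)

    print(normalize("The number you requested is 555-1212"))
    # -> The number you requested is five, five, five, one, two, one, two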

Many current speech synthesis systems include modules specifically designed to handle different styles of text which require different normalization, such as mathematical and email text [57]. In order to accurately assign pronunciations and prosody at later stages of the synthesis process, a linguistic analysis of the text is performed as part of the text processing stage. The result of this analysis is provided with the text to the phoneme and prosody prediction module. Most TTS systems perform a surface analysis of the text to determine such syntactic information as the part-of-speech of each word. Researchers have used this feature to predict pitch accents and phrase boundaries. An advantage of this feature is that even relatively coarse classes (e.g., content vs. function word distinctions) can result in high accuracy in the pitch accent location task. There is reason to believe that other forms of syntactic information might also be useful in the prediction of prosody. For example, the identification of complex nominals can help determine the location of pitch accents [56, 46]. The identification of syntactic constituency may provide some information for the prediction of intonational phrase boundaries [21, 59]. An analysis of the discourse structure of the input text can also be performed at this stage. This type of analysis provides information concerning the relationship of words or sections of the text to other more distant words or sections. Two important examples are "givenness" and contrast. "Givenness" describes whether a word is perceived as "given," or already mentioned, in the discourse or whether it is to be perceived as new information. This characteristic has been shown to influence pitch accent placement in some styles of speech [24]. "Contrast," the use of a word or phrase to contrast with an earlier phrase or word, can also influence the prosody of speech [41]. These features are still not reliably derived from text and, as a consequence, are rarely used in speech synthesis systems.

Linguistic analysis of text is still inexact and an area of much research. Therefore, this analysis usually does not provide all the information necessary to predict prosody accurately, and this will cause a cascade of errors in the annotations sent to the waveform generator [22]. This is especially true when the text input is allowed to come from any style or domain, i.e., completely unrestricted. An alternative to TTS is a concept-to-speech (CTS) system. In CTS, text normalization and linguistic analysis are unnecessary, since the linguistic annotations (i.e., discourse, syntactic, and semantic information) are provided by a language generator with the text. In Fig. 2.1, the language generator would replace the text analysis step as well as the text input. In such a system, the input is in the form of "concepts." The output of the generator is text plus the linguistic representations used to create the text. Therefore, these annotations are more accurate than those coming from text analysis, because they can include whatever high-level linguistic, semantic, and discourse information was used by the generator to realize the text [24, 46]. Although there are few research systems that use language generators today, research efforts in this direction have been underway since the late seventies. Young and Fallside describe a CTS system as part of a database telephone interface [67]. Other research systems that include a language generator are the SOCS system [65], the MTS [55], and the VieCtoS system [6]. After the linguistic analysis has been performed, pronunciations and prosodic labels are predicted. Pronunciations are predicted using a stored dictionary in conjunction with letter-to-sound rules. The dictionary used is similar to the one described in Section 3.2. Prosodic label prediction will be described in more detail in Section 2.3. With the prosodic labels and pronunciations derived, the next and last step is the production of the acoustic waveform. There are two predominant methods for producing the waveform from the phonemes and prosodic annotations: concatenation of sub-word units (e.g., diphones), and the creation of the waveform from a parametric representation [4].

Each method usually requires some signal processing smoothing techniques following the initial stage. Currently, the most widely used waveform generation method is concatenative synthesis. This method could be viewed as an extension of the "canned" speech method described above, with important differences. In current systems, parameters extracted from small units of speech that represent individual sounds are stored. In some systems, these are LPC parameters [57]. The important issues with this method are how to create the unit inventory (i.e., whether the unit size should be uniform or variable and whether to include multiple units of the same sound that reflect prosody) and how to search the unit inventory efficiently [26, 58]. This method of waveform production requires smoothing at the unit boundaries and the imposition of prosodic characteristics. Commercial systems that employ concatenative synthesis include Truetalk, developed by Bell Labs and marketed by Entropic Research Laboratories, and Laureate, by British Telecom. Festival, created at the University of Edinburgh, is an example of a non-commercial system [11]. The second method of waveform synthesis models the production of speech as a combination of signal sources and filters. This model of speech production is a standard one in speech analysis. Speech sounds are generated by modeling the vocal tract as a filter and the vocal cords as a source. The filter and source are modified by changing certain parameters such as vocal tract length and vocal fold vibration frequencies. Commercial systems that employ parametric synthesis include DEC's DECtalk (based on Klatt's MITalk), Eloquent Technology's Delta System, and Telia's Infovox. For a more complete discussion of acoustic waveform generation for speech synthesis, the reader is directed to Klatt's overview of speech synthesis research [28].
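As a rough illustration of the source-filter view, the sketch below passes an impulse-train "glottal" source through a fixed all-pole "vocal tract" filter. The pitch value and filter coefficients are arbitrary assumptions chosen only to make the example run; a real synthesizer would vary them over time, e.g., from LPC analysis of recorded speech.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000   # sample rate in Hz
    f0 = 120     # source fundamental frequency in Hz (assumed)
    n = fs // 2  # half a second of samples

    # Source: an impulse train at the glottal pulse rate approximates voicing.
    source = np.zeros(n)
    source[::fs // f0] = 1.0

    # Filter: an arbitrary all-pole filter standing in for the vocal tract.
    # Real systems derive such coefficients from analysis of recorded speech.
    a = [1.0, -1.3, 0.8]
    speech = lfilter([1.0], a, source)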


2.2 Linguistic Theory of Prosody


This section provides a brief discussion of linguistic theory as it applies to prosody. Prosody can be described as the acoustic characteristics of speech that carry information that is not present in the words of the utterance. Emotion, discourse cues, and syntax are examples of the information that may be encoded in prosody. These characteristics can come into play at the phone, syllable, word, phrase, and higher levels. Perceptually, prosody manifests as varying durations of sounds, varying lengths of pauses, and changes in tone and intensity.

2.2.1 Pitch Accents


Relative prominence is usually perceived when a word receives more emphasis on one syllable than the surrounding syllables. This emphasis may be perceived as having lower or higher pitch, relatively longer duration, and higher energy than the surrounding syllables. In the following example, my is given the emphasis. The pitch accent is denoted by the "*" symbol.
Speaker One: Whose book is that?
Speaker Two: That's MY book.
                    *

In the "book" example, the emphasis is on my in response to speaker one's question about ownership. With a different question, the pitch accent would fall on a different word. In the example below, note that the question is about a different object, and so the placement of the pitch accent moves to the object.
Speaker One: What is that?
Speaker Two: That's my BOOK.
                       *


It should be noted that in this work the terms relative prominence and pitch accent are used interchangeably. This follows the convention introduced in [47], where Ross notes that in the Boston University Radio News Corpus, prominences and pitch accents almost always fall on the same syllable. Acoustically, prominence is cued by the energy, F0 contour, and duration of the vowel in the syllable. Prominence can be used by the speaker to "highlight" a particular word in the utterance. See Fig. 2.2 in Section 2.2.3 for an example of a high F0 contour, where a prominence falls on marianna. A pitch accent can also be cued by a low F0 contour. The placement of prominence on a word depends on many aspects of syntax, discourse, and semantics, and these dependencies are still an active area of research. When prominence is placed on a word, its placement within the word is limited by lexical features of the syllables [39]. Each syllable in a multi-syllabic word can be categorized as having primary, secondary, or no stress. A vowel with primary lexical stress means that the syllable will usually be perceived as having the emphasis if a pitch accent is assigned to the word. The primary stressed syllable also receives the prominence if the word has the nuclear accent, i.e., the last accent in the phrase. Secondary lexical stress means that the syllable may have prominence assigned to it if there are reasons for shifting the prominence from the primary lexically stressed syllable and if the syllable is before the primary lexically stressed syllable. In Shattuck-Hufnagel et al. [50], it was noted that prominences shift to earlier syllables if there is a prominence on the next word, to avoid having too many prominences together, or if the word is early in a phrase, to help mark phrase onset. Finally, a syllable in a multi-syllabic word that is not marked with lexical stress rarely receives a pitch accent. Monosyllabic words are not typically marked for stress in the dictionary and are candidates for prominence.


2.2.2 Phrase Boundaries


Another important type of prosodic event is a phrase boundary. Many researchers point out that the location and type of phrase boundaries (tones) can significantly change the meaning of a simple statement [61]. The following newspaper headline can yield either a humorous story or a serious piece of news.
(Police help dog) (bite victim.)
(Police help) (dog bite victim.)

For a speech synthesis system, a string of phonemes can have two di erent meanings that can be disambiguated by phrase boundaries.
Would you like a soup, or salad?
Would you like a super salad?

When spoken aloud, the grouping of the words helps the listener distinguish between the two sentences. Phrase boundaries are signaled by pauses, duration lengthening, and tonal changes. Prosodic phrase boundaries are not identical to syntactic boundaries, though they are related in that certain types of syntactic phrase boundaries are likely to coincide with a prosodic phrase boundary (see [21]). Most theories of prosody identify two distinct levels of phrase boundaries, an intonational phrase and an intermediate phrase [37]. The intermediate phrase is the smaller unit and has a phrase accent. The intonational phrase is made up of intermediate phrases and is marked by a tonal change at the boundary. In the "super salad" example above, the intonational phrase includes the entire sentence, while in the "soup or salad" example there are two intermediate phrases. In the following sentence, the second intonational phrase contains two intermediate phrases.
((That year)) ((Thomas Maffy,)) ((now president) (of the Massachusetts Bar Association,)) ((was Hennessy's law clerk.))

Figure 2.2: An example of an F0 contour and ToBI labels [10].

In this work, the intonational phrase boundary is referred to as a major boundary and the intermediate phrase boundary is referred to as a minor boundary. What would happen if prosody were not present in speech? The resulting speech would generally be flat and monotone. Studies have shown that, far from this being only a question of pleasantness, the absence of prosody can lead to a decrease in intelligibility [53], as illustrated by the previous examples. It is clear that predicting prosody accurately will help listeners disambiguate the output of a speech synthesizer.

2.2.3 Prosodic Labeling


Many theories of intonation specify a finite set of possible prominence and boundary markers. In this work, we will use the Tone and Break Indices (ToBI) labeling system for American English [52] to specify the inventory of intonation markers.

This prosodic labeling system consists of a tone tier, a break tier, a tier for disfluencies and other markings, and a tier for orthography. The tone tier aligns a series of tones that mark pitch accents, phrase accents, and phrase boundaries. The break tier shows the perceived phrase "break," or disjuncture, between each pair of words [59]. Figure 2.2 shows the four tiers in the following order from top to bottom: tone, orthography, break, and miscellaneous. In this work, the data has been labeled with labels in each of the three tiers, but only the tones have been used. In addition, the details of the tone labels are not used: only the locations of accents and of major and minor phrase boundaries are included. This information can be obtained from the tone tier exclusively. In several studies, high reliability has been shown in labeling speech using this system of prosodic markers [40, 46].
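A minimal data structure for the collapsed label inventory used in this work might look like the sketch below; the class and field names are my own, not part of the ToBI specification.

    from dataclasses import dataclass
    from enum import Enum

    class Break(Enum):
        NONE = 0    # ToBI break indices 0-2
        MINOR = 1   # break index 3 (intermediate phrase)
        MAJOR = 2   # break indices 4-6 (intonational phrase)

    @dataclass
    class Syllable:
        text: str
        accented: bool    # is a pitch accent located on this syllable?

    @dataclass
    class Word:
        syllables: list   # the word's Syllable objects
        boundary: Break   # phrase break following this word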

2.3 Prediction Algorithms


This section begins with some terminology definitions related to prediction. The following sections describe the two most widely used approaches for prosody prediction in speech synthesis systems, handwritten rules and decision trees, in order to provide background and motivation for the use of the proposed algorithm. The alternative explored here, TRBL, will be explained in detail in Chapter 4.

2.3.1 Handwritten Rules


Many commercial speech synthesis systems use handwritten rules to predict various elements of prosody. Handwritten rules attempt to predict the prosody of the input text with a sequence of if-then statements which test for the existence of a particular feature of the input, such as the part-of-speech of a word. An example of such a rule might be:
if the word is a function word, do not accent this word.


Handwritten rules can provide accurate predictions of prosody because they are created by human experts who can rely on experience that is not available to a computer [7]. However, the amount of time necessary for the creation of a rule set is enormous. Hirschberg reports a development time of two months for a set of rules to predict pitch accents [24]. Since prosody usually depends on the domain of the input, rewriting handwritten rules for each domain or application becomes prohibitive. Finally, human experts may miss systematic patterns in the data that a computer can learn from a corpus.
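Rules of this kind translate directly into code. The sketch below implements the content/function word rule mentioned again in Section 2.3.2; the set of function-word tags is an illustrative assumption (the classes actually used in this work appear in Appendix B).

    # Accent content words, deaccent function words. The tag set below
    # is assumed for illustration; see Appendix B for the real classes.
    FUNCTION_TAGS = {"DT", "IN", "CC", "TO", "PRP", "MD"}

    def accent_word(pos_tag):
        # Return True if the word should carry a pitch accent.
        return pos_tag not in FUNCTION_TAGS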

2.3.2 Statistical Methods


Stochastic methods for creating prosody prediction rules, or a "model" of prosody, typically follow these steps. First, training data is labeled, usually by hand, with labels that describe the prosodic characteristics present in the data. A model structure is assumed, and parameters are estimated to represent the probability of prosodic events occurring in the context of various linguistic features. This model can then be used to predict the most likely prosodic label sequence given future input to the system. Specific examples of statistical models used in prosodic label prediction are described below.

Decision Trees. Classification and regression trees have been used in previous work in prosodic prediction for speech synthesis (see [24], [46], and [59]).

This approach has the advantage over handwritten rules that the "model" is created automatically. One must only provide the algorithm with a list of potential questions or features to examine, and the system will automatically choose which features provide the highest predictive capability. Decision trees use a greedy search to decide which values of a feature provide the best distinguishing point for the data according to some pre-specified measure of goodness, and then divide the training data based on that feature value. This process continues until some stopping point is reached.

Decision tree training will be covered in more depth in Chapter 4. It is also possible to incorporate a Markov assumption into the decision tree model, which allows the inclusion of previous classifications in the prediction of prosodic events. This modification was implemented by Ostendorf and Veilleux for the task of phrase boundary prediction [37], and by Ross and Ostendorf for pitch accent prediction [46]. The addition of the Markov assumption is covered in more detail in Chapter 4. Ross found that a decision tree which included a Markov assumption predicted prominence location at the syllable level with 87.7% accuracy. These results and the following are reported on the Boston University Radio News Corpus, which is described in Chapter 3. In comparison, a simple rule-based algorithm for placing prominences based on the content vs. function word distinction achieved an accuracy of 85.2% at the syllable level [46]. A second rule-based algorithm [35], intended to take into account rhythm and stress clash, performed with 82.9% accuracy on the same task [46], demonstrating that handwritten rules do not fit all task domains well. Ostendorf and Veilleux report results for the phrase boundary location task of 70%, using a metric of the number of correct break assignments over the total number of words, where there are three possible break values (no boundary, intermediate, and intonational phrase). While decision trees have their advantages, there are also several disadvantages to the decision tree method that have led me to investigate the effectiveness of an alternative method, as described in Section 2.3 and in Chapter 4. Primarily, decision trees split the data at each node learned, thus creating sparse data problems. In addition, the data examined at each node is not available to other nodes in the tree that are not child nodes.

Other Statistical Models. The stochastic methods described above result in deterministic rules. This is also true of TRBL. However, others have employed hidden Markov models in the prosody prediction task with some success. This approach differs from the other approaches mentioned in this work in that both the training and the model are probabilistic. Black and Taylor employ HMMs in the prediction of phrase breaks [12]. They train the model using information about part-of-speech features of words and the previously predicted breaks. This approach is automatically trainable. The drawback to this approach is that it requires an enormous amount of training data. Because of the small size of the prosodically labeled corpus used in that research, it was necessary to train the models on a much larger corpus of words for the part-of-speech information. Since the model presented succeeded for prediction of a binary decision, presence vs. absence of a break, this approach faces data sparsity problems in the task of predicting a larger set of prosodic labels.
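A sketch of this kind of model is given below: the three break values are hidden states, part-of-speech tags are the observations, and Viterbi decoding recovers the most likely break sequence. All probabilities are invented for illustration; Black and Taylor's actual model and features differ.

    import numpy as np

    STATES = ["none", "minor", "major"]  # hidden break values
    # Invented transition and emission probabilities; a real system
    # would estimate these from a labeled corpus.
    trans = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.2, 0.2],
                      [0.8, 0.1, 0.1]])
    emit = {"NN": [0.5, 0.3, 0.2], "IN": [0.9, 0.08, 0.02], "VB": [0.6, 0.3, 0.1]}

    def viterbi(pos_tags):
        # Most likely break sequence given a part-of-speech tag sequence.
        T, S = len(pos_tags), len(STATES)
        delta = np.zeros((T, S))
        back = np.zeros((T, S), dtype=int)
        delta[0] = np.log(emit[pos_tags[0]])  # uniform initial distribution
        for t in range(1, T):
            scores = delta[t - 1][:, None] + np.log(trans)
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + np.log(emit[pos_tags[t]])
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return [STATES[s] for s in reversed(path)]

    print(viterbi(["NN", "IN", "NN", "VB"]))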


Chapter 3 Corpora
The present research uses the Boston University Radio News Corpus of recorded speech with annotations. This corpus consists of recorded speech, fundamental frequency data, phonetic labels, word transcriptions, part-of-speech tags for the words, and prosodic labels. The Boston University Radio News Corpus reflects broadcast radio news style speech.

3.1 Boston University Radio News Corpus


The Boston University Radio News Corpus was created to further speech synthesis and prosody research [36]. News stories broadcast by seven announcers of a local radio station were recorded. Four of the broadcast stories were re-read under controlled conditions in the laboratory by six of the announcers. All the speech was transcribed. Forced phonetic alignments were produced by the Boston University speech recognizer. All words in the transcriptions were tagged with part-of-speech labels. A subset of the data, including speech from one speaker, F2b, and the laboratory stories, was prosodically labeled. The present research uses a subset of the corpus: 34 stories (49 minutes) from one female announcer, F2b, were used for the training data.¹

This portion is referred to as "radio news," since it came from the broadcast portion of the corpus. The four stories from the controlled recordings are used as the test data. This subset of the data will be referred to as "lab news." A version of the "lab news" that was recorded by the same speaker as the training data will be referred to as the "target" version. Most evaluation schemes evaluate prediction of prosody based on exact matches between predicted and hand labels. The target version will be used for exact match evaluation. In a study of prosodic variability on the lab news portion of the corpus, Ross and Ostendorf report that, over the same text, readers showed some variability in the locations of pitch accents and phrase boundaries [46]. This suggests that a multi-version best match comparison may be more meaningful than the exact match evaluation for evaluating prosodic label prediction. However, evaluation using the other versions is beyond the scope of this thesis. The entire corpus was labeled with part-of-speech tags using a part-of-speech tagger developed by BBN [33]. Part-of-speech labeling errors were not corrected for the training data. Ross notes that the accuracy rate was very high (approximately 97%) [46]. The part-of-speech labels for the test data were hand-corrected. The data from the speaker used in this research, F2b, was hand-labeled with ToBI labels. The prosodic annotations include pitch accent tones, breaks, and phrase boundary tones. Ross and Ostendorf conducted a consistency study of human labelers and found that there was 91% agreement among labelers for presence vs. absence of pitch accents [46]. There was inter-annotator agreement of 93% for the location of phrase boundaries and 91% agreement for the location of phrase accents. This agrees with results reported in Pitrelli et al. [40]. This level of agreement demonstrates that this is a task that can be performed reliably. Pitch accent frequency is shown in Table 3.1.
¹ It should be noted that this research does not include the noisy files from F2b, in contrast to Ross [46].


Although pitch accent tones (i.e., L+H*) were labeled in the corpus, agreement among annotators was somewhat lower (60%) than for the location of the accents. Additionally, a variability study conducted by Ross and Ostendorf [46] on the lab news data shows that there is a good deal of variability for the same text read by different readers. Phrase boundary frequency is shown in Table 3.2. Phrase boundary tones were marked in the corpus using ToBI labels. Here, these labels were collapsed to three categories: major, minor, and no boundary.

                         Pitch Accented    Total
Radio News (Training)
  Words                  4679 (51%)        9179
  Syllables              4741 (31.5%)      15074
Lab News (Test)
  Words                  1008 (47.7%)      2112
  Syllables              1025 (30.5%)      3359

Table 3.1: Summary of pitch accent frequency in the Boston University Radio News Corpus.

                         Major (Breaks 4-6)  Minor (3)    None (0-2)     Total
Radio News (Training)
  Words                  1721 (18.7%)        666 (7.2%)   6792 (74%)     9179
Lab News (Test)
  Words                  416 (19.7%)         178 (7.1%)   1518 (71.8%)   2112

Table 3.2: Summary of phrase boundary frequency in the Boston University Radio News Corpus.

22

3.2 The Lexicon


The lexicon associated with the corpus is based on the MOBY lexicon [62]. For testing purposes, there are no out-of-vocabulary words, i.e., no unknown pronunciations; this lexicon provides pronunciations for all the words in both corpora. If a word in the lexicon has multiple pronunciations, the pronunciations are listed in alphabetical order. Primary and secondary lexical stress is denoted in the pronunciations, as are syllable divisions. The recognized pronunciations in the Boston University Radio News Corpus are based on forced alignment with a dictionary that allowed for phonological variation, so they do not necessarily correspond to what might be predicted from text input. Therefore, these pronunciations were replaced by the pronunciations from the MOBY lexicon. This is significant in that the recognized phones may be different from the phones for that word in the lexicon. This was done to more closely mirror the conditions in a speech synthesis system. I should note that, since there are sometimes multiple pronunciations for an individual word in the lexicon, the first pronunciation with the same number of syllables as the original recognized pronunciation and, if present, the same part-of-speech tag was arbitrarily chosen as the pronunciation for each word. This led to some mismatches between the prosodic labeling and the pronunciations, but these were not corrected. For example, if a pronunciation for the was recognized as dh iy+1, but the first of the several pronunciations in the lexicon was dh ax, this pronunciation was assigned. This will affect prosody prediction because the assigned pronunciation contains a reduced vowel, ax, and a reduced vowel does not receive pitch accents. This type of mismatch between the prosodic labeling and the lexicon occurs infrequently.
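The selection procedure just described can be summarized in a few lines; the entry format (phones, syllable count, optional part-of-speech tag) is an assumption made for illustration.

    def choose_pronunciation(entries, n_recognized_syllables, pos_tag=None):
        # Pick the first lexicon pronunciation whose syllable count matches
        # the recognized pronunciation and, if the entry carries a tag,
        # whose part-of-speech tag matches. Fall back to the first entry.
        for phones, n_syl, pos in entries:
            if n_syl == n_recognized_syllables and pos in (None, pos_tag):
                return phones
        return entries[0][0]

    # The example from the text: "the" recognized as one syllable (dh iy+1);
    # the lexicon lists dh ax first, so dh ax is assigned.
    entries = [("dh ax", 1, None), ("dh iy", 1, None)]
    print(choose_pronunciation(entries, 1))  # -> dh ax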


Chapter 4 Prediction Algorithms


This chapter presents the two algorithms used here to predict abstract prosodic labels for speech synthesis: decision trees and TRBL. I will first provide an overview of the two algorithms and their relative strengths and weaknesses in Section 4.1. Then, in Sections 4.2 and 4.3, I will describe the two prediction algorithms, decision trees with a Markov assumption and transformational rule-based learning. In the final section, Section 4.4, I will describe modifications of the TRBL algorithm to enable the joint prediction of prosodic labels.

4.1 Overview
Both the decision tree and TRBL methods share a number of advantages, as mentioned in Chapter 1. To reiterate, these methods are automatically trainable. This characteristic allows new models to be created for different domains and applications. This makes particular sense for the prediction of prosody, since different speech styles often have different patterns of prosody. Therefore, one model for the prediction of prosody may not provide the most natural output when developed on one type of speech and used on a different type of speech. Each method is also useful in identifying the important characteristics of the text that may be useful in predicting prosody.

Each method allows the inclusion of categorical and numerical features to describe the data. Categorical features are features that allow only set operations. An example of a categorical feature would be part-of-speech tags: there are 37 fixed values for this category in the Penn Treebank [31], which do not allow mathematical operations such as finding the mean or average of the values. Other examples of categorical features include identification of complex nominals, vowel quality, identification of clause structure, and lexical stress information. Numerical features can be real or integer valued. These features allow mathematical operations such as greater-than and less-than comparisons. Examples of numerical features used in this research are integer-valued variables such as the location of a syllable in a word, the number of syllables in a word, boundary strength, and distances in syllables or words from other prosodic labels. Chapter 5 discusses the features used in this research in more detail. The resulting models from each algorithm are easily readable. Decision trees may be read as a hierarchy of questions and also translated into a finite state network. The learned rule sequence resulting from the training of the TRBL algorithm can be read in much the same way. Each model may provide intuitions about the linguistic structure that influences the choice of prosody. This is contrasted with the models created using HMMs, which can be difficult to interpret. One disadvantage that these methods share is the need for an annotated corpus. Corpus annotation requires human experts to listen to recorded speech data and label significant events in the speech; in our case, these would be words and the prosodic events described in Chapter 2. This is becoming less of a problem for speech science in general as ever larger speech corpora come on-line, but it remains one in the area of prosody research. However, automatic methods could be used to label a corpus of speech as a pre-processing or "boot-strapping" measure, and human annotators can then correct any mistakes the automatic algorithm makes, as has been done for a number of other types of language corpora.

There are a number of characteristics that set decision trees apart from TRBL as a prediction algorithm. The primary difference concerns the method by which each algorithm learns new questions. These differences are discussed in more detail in the following sections.
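Before the detailed treatment in Section 4.3, the TRBL training loop can be sketched schematically as follows. The rule representation (a condition paired with a replacement label) and the toy data are assumptions; the actual templates and scoring are described later.

    def n_errors(labels, gold):
        return sum(l != g for l, g in zip(labels, gold))

    def apply_rule(rule, data, labels):
        # A transformation rewrites the current label wherever its
        # condition holds for the data point.
        condition, new_label = rule
        return [new_label if condition(x, l) else l
                for x, l in zip(data, labels)]

    def trbl_train(data, gold, initial_label, candidate_rules, min_gain=1):
        # Start from an initial labeling, then greedily append the
        # transformation that corrects the most remaining errors.
        labels = [initial_label] * len(data)
        learned = []
        while True:
            gains = [(n_errors(labels, gold)
                      - n_errors(apply_rule(r, data, labels), gold), r)
                     for r in candidate_rules]
            best_gain, best_rule = max(gains, key=lambda g: g[0])
            if best_gain < min_gain:
                return learned
            labels = apply_rule(best_rule, data, labels)
            learned.append(best_rule)

    # Toy usage: (word, POS) pairs, starting from "none" everywhere.
    data = [("dog", "NN"), ("the", "DT"), ("ran", "VB")]
    gold = ["accent", "none", "accent"]
    rules = [(lambda x, l: x[1] != "DT", "accent")]
    print(len(trbl_train(data, gold, "none", rules)))  # learns 1 rule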

4.2 Decision Trees


Decision trees are a non-parametric approach to classification and regression problems. Data is classified according to a series of hierarchical questions, usually binary, asked at each branch in the tree about features that describe the data point. As with most statistical methods, there is a training phase and a test phase. The mechanism for training a standard decision tree is reviewed briefly next; then I describe an extension developed by Ross and Ostendorf to handle label sequences [46].

4.2.1 Decision Tree Design


During the training phase, the algorithm learns a series of questions that partition the data in order to decrease the relative impurity of the data at each node. The impurity of a node is roughly related to the degree of classification accuracy. At the root node, the entire data set is searched and is partitioned according to the question that best divides the data. Questions are recursively found to partition the data until some stopping criterion is met. The end result is a hierarchy of questions beginning at the root node. An inherent problem with this algorithm is that the amount of data available at nodes lower in the tree is diminished at each new split. A second problem is that data at one node in the tree is not available to other nodes in the tree. During the testing or classification phase, each data point is given to the tree and, based on the questions in the tree, it traverses the tree until it is classified at a terminal node with the label that node carries.

It is also possible to assign probabilities to the data point at the terminal node for each of the possible labels. That is, in the case of pitch accent location prediction, a data point would receive P(α), where α ∈ {pitch present, pitch absent}. An example is seen in Fig. 4.1. This approach was successfully adopted in [46], when combined with a Markov assumption. This approach also allows decision trees to be adopted in the recognition of prosody for speech understanding applications [46, 59]. As will be explained later in this chapter, the probabilities of labels at the terminal nodes will be used in this work. Decision trees require three elements. First, there must be a set of allowable questions to ask at each node. Second, a measure for evaluating the questions must exist. Finally, the training algorithm must decide when to stop dividing the data and creating new nodes. Each of these is explained in more detail below. The allowable questions may be categorical, such as: Is x a function word? Or they may be numerical, such as: Is x ≤ 5?

It is important that this set be of reasonable size, since the question search can quickly become computationally intractable. The questions here were chosen based on current and previous research in the prediction of pitch accents and phrase boundaries. For more details, see Chapter 5. The best question at each node is chosen based on a greedy search. This method of searching is locally optimal but not globally optimal. To illustrate this point, it is possible that a question chosen higher in the tree will preclude splits at lower nodes that would have allowed greater overall decreases in impurity. Essentially, the greatest decreases in impurity at earlier nodes do not guarantee the optimal choices for the children nodes.
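To make the classification phase concrete, the following minimal Python sketch shows how a trained tree of this kind might be traversed; the Node class, its field names, and the toy question are hypothetical illustrations, not part of any software used in this thesis.

class Node:
    """A binary decision tree node; leaves carry a label distribution."""
    def __init__(self, question=None, yes=None, no=None, probs=None):
        self.question = question   # function mapping a feature dict to True/False
        self.yes = yes             # child followed when the question is answered yes
        self.no = no               # child followed otherwise
        self.probs = probs         # at leaves: {label: probability}

def classify(node, features):
    """Traverse the tree and return the leaf distribution P(label | leaf)."""
    while node.probs is None:
        node = node.yes if node.question(features) else node.no
    return node.probs

# A toy two-leaf tree for pitch accent prediction:
accented = Node(probs={"accent": 0.78, "no_accent": 0.22})
unaccented = Node(probs={"accent": 0.10, "no_accent": 0.90})
root = Node(question=lambda f: f["lexical_stress"] == "primary",
            yes=accented, no=unaccented)

print(classify(root, {"lexical_stress": "primary"}))   # {'accent': 0.78, ...}

Returning the full leaf distribution, rather than only the majority label, is what later makes a Viterbi search over label sequences possible.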

[Figure 4.1 diagram: a small decision tree showing a root node, a branch, and leaf nodes carrying output probabilities (0.780984 and 0.219016).]

Figure 4.1: A sample decision tree with output probabilities. N and P refer to the label applied to data at each terminal node (the square boxes) and the label of the majority of the data at internal nodes (ovals).

The objective impurity function is the criterion used to decide whether or not to divide the data at a node. It is related to, but not the same as, minimizing misclassification error. As the tree is being grown during the training phase, the goal is to minimize the overall impurity of the tree. At each node, the training algorithm chooses the question that minimizes the impurity of the data. The impurity function, $i(t)$, is minimized when the node contains only one class. The basic equation for measuring the decrease in impurity from a question at a node is as follows:

    \Delta i(t, s) = i(t) - P_R \, i(t_R) - P_L \, i(t_L),    (4.1)

where the question $s$ at node $t$ sends a proportion $P_R$ of the data in $t$ to the right child $t_R$ and $P_L$ to the left child $t_L$ [15]. There are four types of impurity functions which are commonly used in classification trees [15]:

Minimum error:

    i(t) = 1 - \max_{\alpha} p(\alpha | t)    (4.2)

Entropy:

    i(t) = - \sum_{\alpha} p(\alpha | t) \log p(\alpha | t)    (4.3)

Gini criterion:

    i(t) = 1 - \sum_{\alpha} p^2(\alpha | t)    (4.4)

Twoing criterion:

    i(t) = \frac{P_L P_R}{4} \left[ \sum_{\alpha} | p(\alpha | t_L) - p(\alpha | t_R) | \right]^2    (4.5)

where $p(\alpha | t)$ is the probability of class $\alpha$ at node $t$. Minimum error, equation 4.2, has been shown to provide lower overall classification accuracy rates. This work used minimum entropy as the impurity criterion; other research has shown that there is little difference in which function is chosen [47]. The stopping criterion provides the training algorithm with a constraint on the size of the tree. Possible measures include limiting the number of leaf nodes, making a node a terminal node when the incremental decrease in impurity falls below a certain value, or stopping node splitting when the number of data points available to a node drops below a certain threshold. Model order, as represented in tree size, is another important issue. This is related to the stopping criterion and to post-training pruning of the trained tree. Pruning is necessary when the trained tree models the training data too closely, making generalization to new data (i.e. the test data) difficult. However, in this research, experiments showed that this did not appear to be much of an issue: pilot experiments, in which trees were grown with 90% of the training data and validated on a held-out development subset of 10%, showed negligible differences in classification accuracy between pruned and unpruned trees.
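For concreteness, the impurity functions and the split score of equation 4.1 can be written directly in Python. This is a minimal sketch, assuming each node's class distribution is available as a dictionary of probabilities; the function names and the toy numbers are illustrative only.

import math

def entropy(p):
    # eq. 4.3: -sum_a p(a|t) log p(a|t)
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def gini(p):
    # eq. 4.4: 1 - sum_a p(a|t)^2
    return 1.0 - sum(q * q for q in p.values())

def min_error(p):
    # eq. 4.2: 1 - max_a p(a|t)
    return 1.0 - max(p.values())

def split_gain(impurity, parent, left, right, p_left):
    # eq. 4.1: i(t) - P_L i(t_L) - P_R i(t_R); a larger value is a better question
    return impurity(parent) - p_left * impurity(left) - (1.0 - p_left) * impurity(right)

# A toy split that isolates mostly-accented syllables on the left:
parent = {"accent": 0.30, "no_accent": 0.70}
left = {"accent": 0.80, "no_accent": 0.20}
right = {"accent": 0.10, "no_accent": 0.90}
print(split_gain(entropy, parent, left, right, p_left=0.3))   # positive gain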

4.2.2 Using Decision Trees with a Markov Assumption


Decision trees lack an important capability: they cannot ask questions about previous labels, because these are not available during the testing phase. Research has

shown that previous prosodic labels are an important feature in subsequent label prediction. Ross and Ostendorf introduced a modification of the traditional decision tree algorithm that allows a Markov assumption to be used [46]. A tree is trained allowing questions about previous labels. Then, during the test phase, data is introduced to the tree with the addition of all possible previous labels. The probabilities for each of the possible classifications of the data point at the terminal node are recorded. Thus, each data point will have as many possible output probabilities as there are possible previous label sequences. A Viterbi search through these possible states is then performed.

In this approach, prediction of prosodic labels becomes a problem of finding the sequence $\alpha_1^n = \{\alpha_1, \ldots, \alpha_n\}$ that is conditionally dependent on the feature vectors $y_1^n = \{y_1, \ldots, y_n\}$, using the distribution $P(\alpha_1^n | y_1^n)$. For pitch accents, the time index $i$ is over syllables, so $\alpha_i$ is the label on the $i$-th syllable, and $y_i$ is a vector of features that are computed from the whole text sequence and are relevant to the accent on the $i$-th syllable. Using the chain rule:

    P(\alpha_1^n | y_1^n) = p(\alpha_1 | y_1^n) \prod_{i=2}^{n} p(\alpha_i | \alpha_{i-1}, \ldots, \alpha_1, y_1^n).    (4.6)

In the bigram model, it is assumed that, given $\alpha_{i-1}$, $y_i$, and $y_{i-1}$, $\alpha_i$ is conditionally independent of earlier $\alpha_j$ and other $y_j$, which gives:

    P(\alpha_1^n | y_1^n) = p(\alpha_1 | y_1^n) \prod_{i=2}^{n} p(\alpha_i | \alpha_{i-1}, y_i, y_{i-1}).    (4.7)

To include the decision tree, the labels are conditioned on a set of equivalence classes which come from the decision tree: $T(\alpha_{i-1}, y_i, y_{i-1})$. This gives:

    P(\alpha_1^n | y_1^n) = p(\alpha_1 | T(\emptyset, y_1)) \prod_{i=2}^{n} p(\alpha_i | T(\alpha_{i-1}, y_i, y_{i-1})).    (4.8)

As mentioned above, each terminal node $t$ is associated with a discrete distribution which represents the conditional probabilities of each of the labels, given that the input features mapped to node $t = T(\alpha_{i-1}, y_i, y_{i-1})$.

Using the Markov assumption on the label sequence (i.e. including the previous label in the set of questions), the problem of finding the best sequence of prosodic labels is equivalent to finding the best hidden Markov model state sequence. In other words, the prediction problem can be solved efficiently using a Viterbi search algorithm. Define $A$ to be the set of possible labels, e.g. accented and unaccented for pitch accent location prediction. The prediction algorithm is then:

1. Initially, find for all $j \in A$

    \delta_1(j) = \log p(\alpha_1 = j | T(\emptyset, y_1, \emptyset)).    (4.9)

2. For each time $i = 2, \ldots, n$ and all $j \in A$ compute

    \delta_i(j) = \max_{k \in A} [ \log p(\alpha_i = j | T(k, y_i, y_{i-1})) + \delta_{i-1}(k) ],    (4.10)

and save a pointer $ptr(i, j)$ to the best previous label $k$ when $\alpha_i = j$, for all $j \in A$.

3. Find the best final label $\hat{\alpha}_n$,

    \hat{\alpha}_n = \arg\max_{k \in A} \delta_n(k),    (4.11)

and trace back for each time $i = n, \ldots, 2$ to get the predicted sequence of labels:

    \hat{\alpha}_{i-1} = ptr(i, \hat{\alpha}_i).    (4.12)
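The search above is compact enough to state in a few lines of Python. The sketch below is an illustration of equations 4.9-4.12, not the thesis implementation; tree_log_prob is a hypothetical stand-in for looking up log p(alpha_i = j | T(k, y_i, y_{i-1})) at the decision tree's terminal node.

def viterbi(y, labels, tree_log_prob):
    """Best label sequence for feature vectors y[0..n-1] (eqs. 4.9-4.12).

    tree_log_prob(k, y_i, y_prev, j) returns log p(alpha_i = j | T(k, y_i, y_prev));
    at the first position, k and y_prev are None.
    """
    n = len(y)
    # eq. 4.9: initialize delta_1(j)
    delta = [{j: tree_log_prob(None, y[0], None, j) for j in labels}]
    ptr = [{}]
    for i in range(1, n):
        delta.append({})
        ptr.append({})
        for j in labels:
            # eq. 4.10: maximize over the previous label k and save a pointer
            scores = {k: tree_log_prob(k, y[i], y[i - 1], j) + delta[i - 1][k]
                      for k in labels}
            best_k = max(scores, key=scores.get)
            delta[i][j] = scores[best_k]
            ptr[i][j] = best_k
    # eq. 4.11: best final label; eq. 4.12: trace back through the pointers
    best = max(delta[n - 1], key=delta[n - 1].get)
    path = [best]
    for i in range(n - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path))

With a sentence of m syllables and label set A, the search costs on the order of m|A|^2 tree lookups, instead of enumerating all |A|^m possible label sequences.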

4.3 Transformational Rule-Based Learning


Transformational rule-based learning is a machine learning algorithm. The algorithm learns an ordered sequence of rules that iteratively minimize the overall classification error. The resulting rule sequence, or model, can then be applied to labeling other data. Each question is chosen by a greedy search over the entire corpus. This is in contrast to the decision tree, where only the data at each node is available for question selection. The search for new rules stops when the reduction in misclassification error falls below a minimum threshold.

Brill first introduced this method to the NLP community [17] for the part-of-speech tagging task. In that task, TRBL performed on par with the contemporary stochastic methods of part-of-speech classification, often with less data (64k words vs. 1M). TRBL has also been shown to have equivalent or greater success than other statistical or rule-based algorithms in a number of diverse NLP tasks, such as prepositional phrase attachment [16], partial parsing or "chunking" [43], dialogue act identification [48], word segmentation [38], and spell-checking [30]. TRBL has three basic requirements to define the system [17]: an initial-state annotator, a set of allowable questions and transformations, and a function for ranking potential questions. The initial state of the data is an important point of research. The importance of the initial state lies in the algorithm's requirement to measure error in order to learn new rules. Starting with the least possible number of errors (e.g. using the most accurate available algorithm to label the data) may deprive the learning algorithm of the examples from which it learns new rules. This is analogous to the problem of decision trees with greedy search: an early split in the decision tree which maximizes the decrease in impurity may not allow the optimal sequence of questions to be asked in later child nodes. On the other hand, an initial state where the error is maximal, i.e. all labels are wrong, may not allow the learning algorithm to overcome the errors. An additional problem is the possibility of interaction between the initial state and template design. It is possible that the initial state might be such that no template included in the possible template set would be able to correct the errors introduced by the initial state. Ramshaw and Marcus discuss this in more detail in [42]. The prototypes or "templates" provided to the learning algorithm are analogous to the set of allowable questions in decision tree training. While, in theory, any TRBL template can be framed as a decision tree question, most commercial or freely available decision tree software does not provide a mechanism for designing questions that

combine features. Furthermore, decision trees that use questions about neighboring labels require a Markov assumption that increases the cost of prediction. However, templates must be created by hand. Template creation in this research is guided by linguistic theory as described in Chapter 2; the actual templates used in experiments are described in more detail in Chapter 5. Ramshaw and Marcus have shown that irrelevant or "bad" templates can significantly decrease labeling accuracy and increase the possibility of the algorithm over-training [42]. There is also a similar concern that the set of allowable templates be of reasonable size and complexity to limit computational costs, but also broad enough to capture important dependencies in the data. For example, if a single template combined questions about N features and each of these features had M values, the number of possible rules created from this template would be $M^N$. One approach to improving computational efficiency is to index the locations of the feature values and contexts, so that any search for a template looks at only those locations in the training data that match the context being scored. Ramshaw and Marcus implement this approach in an application of TRBL to partial parsing [43]; the indexing yields an enormous increase in the speed of the search process. Unfortunately, indexing is not as efficient with dynamic features (i.e. questions about other labels and features dependent on other labels), and so it is not used here. In order to constrain the search space, another approach was chosen: the only feature values evaluated for the templates are those that actually occur in the error space. This requires a simple comparison of the true labels and the predicted labels, and a record of where the labels differ. The last item needed to define the training algorithm is similar to the impurity function in decision trees. A score is computed for each possible instantiation of a rule by calculating the decrease in the misclassification error:

    N_{score} = N_{correct} - N_{error},    (4.13)

where $N_{error}$ is the number of errors introduced by the candidate rule and $N_{correct}$ is the number of correct changes from old labels to new labels. This metric was successfully used in [17]. However, Ramshaw and Marcus have examined other scoring metrics, such as

    N_{score} = N_{correct} - \epsilon N_{error},    (4.14)

where $\epsilon \geq 1$ serves to weight the errors. They have shown that results may be affected by the choice of metric, but that the differences were not statistically significant [42]. Equation 4.13 was chosen as the scoring metric for the pitch accent prediction experiments described in Chapter 5.

The boundary prediction problem has two possible scoring metrics. Misclassification error, eq. 4.13, is one option, but it may be preferable to have a score that weights the error of predicting no boundary where there is a major boundary more heavily than, for example, the error of predicting a minor boundary where there is a major boundary. The motivation for this is that there may be a closer relationship between the two types of boundaries, major and minor, than between either and the non-boundary. For this metric, an absolute distance measure would be applied:

    N_{score} = \sum_{i} | \beta_i^{truth} - \beta_i^{predicted} |,    (4.15)

where boundaries are $\beta_i \in \{0 = \text{none}, 1 = \text{minor}, 2 = \text{major}\}$.
Given the three components described above (the initial-state annotator, a set of templates, and a scoring function), the learning process proceeds as follows; a code sketch of this loop appears at the end of this section.

1. Label the data according to some base rules to create an initial state. This is necessary because the algorithm is error-driven, requiring a comparison between a candidate labeling of the training data and the true labels for that data.

2. For all possible rule prototypes and all possible features:

(a) Instantiate a rule from a prototype by taking values of features of the data.

(b) Compare the predicted values of applying the rule to the data in its current state with the "truth" and measure the gain in prediction accuracy (i.e. the score).

(c) Record the score of the rule.

3. Test whether we have reached the threshold of minimal gain with the new rule. If so, exit the loop.

4. Add the rule that maximizes the gain in accuracy (i.e. has the highest score as defined in eq. 4.13) or the reduction in distance, and change the candidate labels of the data according to this rule. Go to step 2.

[Figure 4.2 diagram: unlabelled data passes through the initial-state labeller, which applies the initialization rules to produce labeled data; the rule templates and the "truth" labeled data feed an error-driven learner (comparison with truth), which outputs a sequential rule set.]
Figure 4.2: A simple diagram of the transformational rule-based learning process.

The rule sets are similar to handwritten rules. The major difference is that while the prototypes are designed by hand, as hand-created rules are, the values for the features examined by a rule are learned automatically. This allows the algorithm to capture relationships that might be missed by a human expert, and to automatically learn the relative importance of rules, which may depend on speaking style.

The search space for TRBL can grow to computationally intractable sizes very quickly with the number of values per feature examined and the complexity of each

template or rule prototype. There have been several attempts to address this problem. Samuel, in the discourse tagging application, proposes Monte-Carlo sampling of features and templates [48]. Ramshaw and Marcus propose a different approach in the chunking task: they make the search faster through an indexing of the locations of features and contexts [43]. Neither approach was adopted here. There are important differences between decision trees and TRBL. TRBL requires a comparison of labels during the training phase; maintaining a full candidate labeling of the data gives the algorithm access to information about the whole sequence of labels. This is a more flexible approach than decision trees with a modified Viterbi search as described in Section 4.2, because it does not require a Markov assumption. There is some evidence that the TRBL algorithm provides resistance to over-training, i.e. fitting the training data too closely [42]. TRBL also allows more data to be examined at each iteration, since the entire set of data points sharing feature values is examined for a question; contrast this with decision trees, where the data is divided at parent nodes.
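The loop of steps 1-4, together with the scoring function of equation 4.13, can be sketched as follows. This is a schematic Python illustration under simplifying assumptions (a rule is a function returning a relabeled sequence, and a template is a function generating candidate rules from the current errors); none of these names come from an actual TRBL implementation.

def count_score(old, new, truth):
    # eq. 4.13: errors corrected minus errors introduced by a candidate rule
    corrected = sum(1 for o, n, t in zip(old, new, truth) if o != t and n == t)
    introduced = sum(1 for o, n, t in zip(old, new, truth) if o == t and n != t)
    return corrected - introduced

def trbl_train(data, truth, initial_labels, templates, threshold):
    """Learn an ordered rule list; data is a list of feature dicts."""
    labels = list(initial_labels)          # step 1: initial-state labeling
    rules = []
    while True:
        # Instantiate candidates only from feature values seen at error positions.
        errors = [i for i, (l, t) in enumerate(zip(labels, truth)) if l != t]
        candidates = [rule for template in templates
                      for rule in template(data, labels, errors)]    # step 2(a)
        scored = [(count_score(labels, rule(data, labels), truth), rule)
                  for rule in candidates]                            # steps 2(b), 2(c)
        if not scored:
            break
        gain, best = max(scored, key=lambda pair: pair[0])
        if gain < threshold:               # step 3: minimal-gain stopping test
            break
        labels = best(data, labels)        # step 4: apply the winning rule
        rules.append(best)
    return rules

For boundary prediction, count_score would be replaced by the distance improvement implied by equation 4.15, so that a rule trading a major-boundary error for a minor-boundary error still scores positively.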

4.4 Joint Prediction of Labels at Two Time Scales


In this work, there are labels that exist at two time scales. In essence, the problem is one of predicting both the sequence of syllable-level prominence labels, $\alpha_1, \ldots, \alpha_n$, and the word-level phrase break labels, $\beta_1, \ldots, \beta_m$, where $n \neq m$. For a decision tree, there are two options for joint prediction. First, one could define a product space, $A \times B$ ($\alpha_i \in A$, $\beta_j \in B$), of classes for label prediction and assign the label at the syllable level. In this case, there would be some meaningless or redundant break labels, since breaks are aligned on word boundaries and not word-internal syllables. One of the major drawbacks of decision trees is that the splitting of the data at each new node decreases the number of data points available for learning

the question for the next node. As noted earlier, this creates data sparsity, which means that the smaller the number of labels being predicted, the better. Alternatively, separate trees could be learned for the two types of labels, in which case the trees would have some sort of serial dependencies (see [46, 59]). In contrast, TRBL allows for the joint prediction of labels without creating a product space, avoiding some of the problems in designing serial label prediction with decision trees. Joint prediction introduces new issues for both template design and scoring. As the goal is to predict both the sequence of syllable-level prominence labels and the word-level phrase break labels, can templates change only one label at a time, or two? When changing two labels, can the template be written so that one need not evaluate every syllable of a word for the possible label change combinations in rule design? In either case, the issue of the scoring metric plays a role. An advantage of joint prediction is that it allows for an error measure that incorporates both types of labels,

    N_{total\,score} = \lambda (N_{pitch\,score}) + \gamma (N_{phrase\,score}),    (4.16)

which can take into account the relative importance of the different labels. Unfortunately, there have not yet been any perceptual studies to establish a basis for choosing $\lambda$ and $\gamma$. A joint metric is critical for the use of templates that change two labels at a time, or even for assessing the trade-offs of templates that change accents vs. breaks. One could arbitrarily choose equal weights, or weight the accent score by the inverse of the average number of syllables per word to balance the contribution at the word level. Instead, we chose to allow changes to only one type of label at a time, to update the label-dependent features of the other label (e.g. distances from pitch accents), and to enforce a particular ordering of labels, i.e. alternating as long as both produce viable rules. This is clearly a sub-optimal choice for joint prediction. Another issue related to having labels and features at two time scales is the efficiency of the implementation. For example, one would like to allow questions about

features at both time scales for changing accent labels. Two instantiated rules might be:

Rule A: If the word is a content word and the syllable has primary lexical stress, change the label from accent to no accent.

Rule B: If the word is a content word, change the label of the syllable with primary lexical stress from accent to no accent.

Automatic template design is more costly for Rule A than for Rule B, since all syllables must be checked to design the rule. In the second case, there need to be some standard relationships between the levels built into the system. In either case, the size of the feature representation is reduced by having one part-of-speech feature per word, rather than duplicating the information for each syllable in the word. In this thesis, we chose to associate all word-level features, such as part-of-speech and distance from a sentence boundary (as marked by punctuation), with the word. Each word then contains pointers to the syllables that it consists of, and each syllable has associated with it syllable-level information and a pointer back to the word that contains it. A final issue involves convergence. Many of the features for pitch accents and phrase boundaries are based on the other label. These features are dynamic, such as the distance to a previous pitch accent or the existence of a boundary label on a word. So, any change to these labels involves updating the feature values during the learning iteration.
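The cross-linked word/syllable representation described above might look like the following Python sketch; the record types and field names are invented for illustration and are not the thesis data structures.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Syllable:
    lexical_stress: str              # e.g. "primary", "pre-primary secondary"
    vowel_quality: str               # "tense", "lax", or "reduced"
    accent: Optional[str] = None     # current candidate pitch accent label
    word: Optional["Word"] = None    # back-pointer to word-level features

@dataclass
class Word:
    pos: str                         # part-of-speech tag, stored once per word
    dist_to_sentence_end: int        # from punctuation
    boundary: Optional[str] = None   # "major", "minor", or None
    syllables: List[Syllable] = field(default_factory=list)

# A rule that asks about both time scales reads word-level features through
# the syllable's back-pointer instead of duplicating them per syllable:
w = Word(pos="NN", dist_to_sentence_end=3)
s = Syllable(lexical_stress="primary", vowel_quality="tense", word=w)
w.syllables.append(s)
rule_fires = (s.word.pos != "IN" and s.lexical_stress == "primary")

When a rule changes a boundary on a word, only the dynamic features that depend on boundaries (e.g. distance to the previous boundary) need to be recomputed for the affected syllables.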


Chapter 5

Experiments
This chapter presents the experimental paradigm and results for experiments in predicting pitch accent and phrase boundary locations. Section 5.1 begins the chapter with a discussion of the features and values used in the prediction experiments. The metrics used to quantitatively evaluate the prediction algorithms are presented in Section 5.2. Next, in Section 5.3, I will present the results of experiments for predicting pitch accent location with four algorithms: decision trees, decision trees with a Markov assumption, a constrained version of TRBL, and an unconstrained version. In these experiments, known phrase boundary locations are used. Experiments to test the efficacy of the algorithms with different initialization rules and different amounts of training data are presented. For the best performing TRBL algorithm, results of experiments with different parameters are also presented. Section 5.4 presents results of phrase boundary prediction experiments using simple decision trees and TRBL. In these experiments, known pitch accent locations are assumed. Section 5.5 will present results on the serial prediction of pitch accents and phrase boundaries using only TRBL, building on the results from the known-label cases. In Section 5.6, results using TRBL for the joint prediction of these labels will be presented.


5.1 Features
In order to predict the locations of pitch accents and phrase boundaries, each word and syllable is described by a vector of features. The decision tree or TRBL prediction algorithms then automatically learn which features, and which values of those features, are significant in predicting when a syllable receives a pitch accent and where a phrase boundary is located. The features used here are chosen with two motivations: first, the features must be reliably derivable from the text input using currently available analysis techniques, and second, there must be some linguistic justification for the features. Previous prosody prediction research provided an indication of the most useful features and values for those features (see [12, 24, 37, 46, 61]). Features that were determined to be useful in pitch accent and phrase boundary prediction include lexical, syntactic, and discourse information. Previous research by Wang and Hirschberg [61] and Ross and Ostendorf [46] has shown that other prosodic labels are also good features. All except discourse features, which require sophisticated text processing, are used in the following experiments and are described further below.

Lexical Information:
Lexical information about the data is derived from a dictionary or lexicon. In the process of assigning a pronunciation to a word, a TTS system has access to the phones, the lexical stress of each syllable, and the syllabic division of the word. Lexical information for this research comes from a computer-readable pronunciation dictionary [62]. Once the pronunciation is assigned to a word, the lexical stress of each syllable is identified. Lexical stress is one of the more important features for pitch accent location prediction. There are two types of lexical stress that may be marked in a multi-syllabic word: primary and secondary. Syllables in a multi-syllabic word that are not marked for lexical stress typically never receive a pitch accent in normal speaking style. Primary lexical stress indicates that the syllable in the word is pronounced with more prominence than its neighbors if the word is pronounced in isolation. Every multi-syllabic word includes a primary lexically stressed syllable. Primary lexical stress is important for pitch accent prediction because a syllable with primary lexical stress is more likely to receive a pitch accent. In American English, some multi-syllabic words may have one or more syllables with a secondary lexical stress. Syllables with secondary lexical stress can also receive pitch accents, though much less often than syllables with primary lexical stress. As mentioned in Chapter 2, this occurs only when the syllable with secondary lexical stress occurs before the primary lexically stressed syllable, and is referred to as early accent placement or stress shifting (see [50]). In this research, secondary lexical stress has been divided into two values signifying location in the word with respect to the primary lexically stressed syllable. The dictionary entry for:

information, ih+2 n * f ax r * m ey+1 * sh ax n


shows primary (+1), secondary (+2), and unmarked syllables. Words with only one syllable are not typically marked with lexical stress in the dictionary, so these cases were given a separate value for this feature to differentiate them from unmarked syllables in multi-syllabic words. Two values for unmarked syllables are necessary because mono-syllabic words may receive pitch accents. Altogether, the values for the lexical stress feature are: primary lexical stress, secondary lexical stress (divided into two categories based on location with respect to the primary lexically stressed syllable), and two categories of syllables with no lexical stress marked, for multi-syllabic vs. single-syllable words. Phonetic features can also be derived by looking at the dictionary pronunciation of a word. Each vowel has a phonetic characteristic termed tense, lax, or reduced. Although this feature is not widely recognized as important for accent prediction,

Table 5.1: Summary of lexical features. These features are associated with syllables, and are used for pitch accent location prediction.

    Lexical Features                             Values                          Static/Dynamic
    Stress                                       primary,                        Static
                                                 secondary pre-primary,
                                                 secondary post-primary,
                                                 not marked (multi-syllabic),
                                                 not marked (mono-syllabic)
    Vowel Quality                                tense, lax, unaccentable        Static
    Location from end of word in syllables       integer                         Static
    Location from begin. of word in syllables    integer                         Static
    #Syllables in Word                           integer                         Static

Ross found that syllables with a tense vowel are more likely to receive a pitch accent than syllables with lax vowels [46]. The values for phonetic quality include lax ('ih', 'eh', 'ah', and 'uh'), tense ('iy', 'ey', 'ae', 'aa', 'ow', 'uw', 'ay', 'aw', and 'oy'), and reduced ('ax', 'axr', and 'er'). Finally, information about the location of the syllable in the word was found to be of predictive value in decision trees in Ross and Ostendorf [46]. Location from the beginning of the word and location from the end of the word are considered, as is another feature, the number of syllables in the word. These features are integer valued. All of the lexical features, summarized in Table 5.1, are static, meaning that the values do not change as accents or boundaries are assigned to words.

Syntactic Information:

There has been much study of the importance of syntactic information in the prediction of both pitch accents and the placement of phrase boundaries. Important features concerning syntactic constituency are not addressed, as syntactic labeling is beyond the scope of this thesis (see [37] for a more detailed discussion). In the absence of more sophisticated syntactic information from a parser, most prosody prediction algorithms have used part-of-speech information [7, 12, 24, 37, 46, 61]. This research will rely on this information for both pitch accent and phrase boundary prediction. Part-of-speech information is derived from an automatic tagger or can be derived from a dictionary (though a dictionary look-up is less accurate). The Boston University Radio News corpus is labeled with the complete set of 37 Penn Treebank part-of-speech labels. In this work, three different groupings of part-of-speech labels were used. There are two primary motivations for considering different groupings of part-of-speech labels. First, there are the limitations of the learning algorithms. Both decision trees and TRBL perform searches, and the greater the number of values for a feature, the more computationally infeasible using that feature becomes. The decision tree growing package used in earlier prediction work forced clustering of part-of-speech labels into 8 groups: while Splus allows up to 32 values per feature, CART allows only 8 values [15]. Second, there is evidence, both in this research and previous work, that a greater number of values may increase the effects of data sparsity. See [12] for a more exhaustive discussion of different clusterings of part-of-speech labels and their effects on phrase boundary location prediction. The first grouping of part-of-speech labels includes 33 of the 37 Penn Treebank labels. Labels for punctuation were not included, but punctuation was considered a separate feature (see Appendix A for a full list of these labels). This group is used primarily in the best case experiments using TRBL. The second grouping consists of eight categories: cardinal numbers, adjectives, nouns, adverbs and particles, more-accentable verbs, less-accentable verbs, and two classes of function words [46] (see Appendix B). Finally, others have suggested that a content vs. function word distinction is helpful in predicting both phrase boundary locations and pitch accents [37, 61].

Table 5.2: Summary of syntactic feature categories. Part-of-speech features are used in both pitch accent and phrase boundary prediction experiments. These features occur at the word level.

    Features                       Values                             Static/Dynamic
    POS Group I:                   Penn Treebank labels               Static
      Current, Previous,
      Second Previous, Next
    POS Group II:                  cardinals, adjectives, nouns,      Static
      Current, Previous,           adverbs and particles,
      Second Previous, Next        more-accentable verbs,
                                   less-accentable verbs,
                                   2 classes of function words
    POS Group III:                 proper nouns, content words,       Static
      Current, Previous,           more-accentable function words,
      Second Previous, Next        less-accentable function words
    Punctuation Group I            sentence final (. ? !),            Static
                                   comma (,), colons (; :), none
    Punctuation Group II           any punctuation (, ; : . ? !),     Static
                                   none

In this group, there are four values: proper noun, content word, and more-accentable and less-accentable function words [46]. These groups were used in experiments comparing decision trees to TRBL. A summary of these values is shown in Table 5.2. The table lists these features as static because they are properties of the input text and do not change as label assignment changes. The part-of-speech classes of the current, previous, second previous, and following words are all considered as features. The part-of-speech of the current word is known to be important for accent and phrase boundary prediction. The use of the part-of-speech of the neighboring words provides an indirect means of representing syntactic structure. An illustrative example is found in the phrase prediction experiments, where boundaries are more likely to be found where the current word is a content word and the following word is a function word.

Punctuation is used to assign phrase boundaries in many TTS systems. Such systems typically assign major phrase breaks where sentences end and minor breaks where commas are found. This method is simple to implement but can result in under assignment of boundaries. In this work, two classes of punctuation are considered in phrase boundary prediction. The rst is motivated by work in Black and Taylor and classi es punctuation into two values according to whether or not there is any punctuation 12]. The second group has more features: sentence nal punctuation, commas, colons, and none. This feature was treated as distinct from part-of-speech information. See Table 5.2 for a summary of punctuation classes.

The decision tree and TRBL algorithms were both allowed to ask about the prosodic labels of neighboring syllables and words. For pitch accent prediction, asking about neighboring accents allows the learning algorithms to capture the phenomena of \accent clash", the tendency of speakers to accent early syllables in a multi-syllabic word if the following word has an accent 50]. In addition, Ross and Ostendorf found that boundary location is useful for accent prediction 46], and Wang and Hirschberg found that accent location is useful for boundary prediction 61]. These prosodic features are shown in table 5.3.

Prosodic Labels:

5.2 Quantitative Evaluation Measures


Quantitative evaluation of prosodic prediction is still an area of debate. Accuracy based on a comparison between the predicted label and the true label is most often reported. Others have reported results in confusion tables. Still others have adopted pattern classification metrics such as correct detection vs. false detection [37]. This section will present the objective evaluation metrics used here.


Table 5.3: Summary of prosodic labels. Pitch accent labels align with syllables and phrase boundaries align with words.

    Feature                             Values              Static/Dynamic
    Pitch Accent:                       pitch accent,       static & dynamic
      Current, Previous,                no pitch accent
      Second Previous, Next
    Phrase Boundary:                    major, minor,       static & dynamic
      Current, Previous,                no boundary
      Second Previous, Next
    Distance to previous/next           integer             static & dynamic
      prosodic label in words
    Distance to previous                integer             static & dynamic
      prosodic label in content words

5.2.1 Pitch Accent Evaluation


In pitch accent location research, accuracy based on exact match is most often reported, where accuracy $A$ is

    A = N_{cor} / N_{tot},    (5.1)

$N_{cor}$ is the total number of correctly predicted labels, and $N_{tot}$ is the total number of labels. In early work, prediction is at the word level, with the assumption being that accent always falls on the syllable with the primary lexical stress. In the pitch accent prediction results reported here, results are reported for prediction of pitch accent locations at the syllable level. This is done to compare with earlier research in accent prediction on the radio news corpus [46]. In addition, as mentioned in Chapter 2, there is a linguistic motivation for reporting syllable-level accuracy: pitch
accents are sometimes assigned to the secondary stressed syllable to avoid accent clash. Syllables without pitch accents are much more frequent in most speech: simply assigning all syllables no accent gives an accuracy of 70.0% on the Boston University Radio News Corpus. An evaluation metric based on exact match also misses natural prosodic variation. Ross and Ostendorf found that five readers of the same passages in the Boston University Radio News corpus showed significant variation in the placement of pitch accents [46]. To address the problem that there is often more than one acceptable prosodic labeling of a sentence, even in the context of a story, Ross and Ostendorf proposed different multiple-version scoring functions. Results of a multiple-version match are not presented here, since they are beyond the scope of this thesis. In this research, exact match accuracy will be reported for the pitch accent prediction experiments.

5.2.2 Phrase Boundary Evaluation Measures


Evaluating phrase boundary prediction is somewhat more complicated than pitch accent location prediction. In this case, there are three labels to evaluate. Non-boundary labels are much more frequent than words with boundaries, so assigning all words the no-boundary label gives an exact match accuracy of 71.8% correct boundaries. As with pitch accent location prediction, there is natural variation in the placement of phrase boundaries among different speakers, which can be addressed with a multiple-version scoring criterion [37]. Again, we are not able to use this approach and so will look only at scoring the predicted boundaries using a single target version. Two measures will be presented here for phrase boundary prediction evaluation. First, exact match accuracy will be presented; equation 5.1 can be used for this task as well. Second, the average absolute distance will also be presented. This is a new

metric that is motivated by corpus analyses in [63], which found progressively longer duration lengthening at higher-level phrase breaks. In other words, there is a bigger difference between a major break and no break than between a minor break and either of the other two cases. If the three boundary labels are assigned integer values, where the major boundary receives the largest value (2) and the no-boundary label receives zero, then a distance between the true labeling and the predicted labeling can be computed rather than a percentage of correct matches, effectively weighting the different types of errors. If phrase boundaries are $\beta_i \in \{0 = \text{none}, 1 = \text{minor}, 2 = \text{major}\}$, and $N_{tot}$ is the total number of boundaries, then

    N_{score} = \frac{1}{N_{tot}} \sum_{i=1}^{N_{tot}} | \beta_i^{truth} - \beta_i^{pred} |.    (5.2)

This metric penalizes the error of assigning no boundary where there is a major boundary more than the errors of assigning no boundary for a minor boundary, or a major boundary for a minor boundary. A lower score with this metric means that the predicted labels are closer, on average, to the true labels. For example, simply labeling all locations with no boundary results in a distance score of 0.478; any result lower than this denotes an improvement. Other integer values might be chosen for the boundaries to express greater linguistic similarity between labels, but experiments with other integer mappings are beyond the scope of this thesis. For comparison with other research and metrics, a confusion table will be presented to allow for identification of error patterns in boundary prediction and to make it possible to compute other error measures from these data.
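Both evaluation measures are straightforward to compute. The sketch below is a minimal illustration with the 0/1/2 boundary coding; the printed numbers come from the toy sequence, not from the corpus.

def exact_match_accuracy(pred, truth):
    # eq. 5.1: fraction of labels predicted exactly right
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def average_absolute_distance(pred, truth):
    # eq. 5.2: mean |truth - pred| with boundaries coded 0=none, 1=minor, 2=major
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

truth = [2, 0, 0, 1, 0, 2]          # toy hand-labeled boundary sequence
all_none = [0] * len(truth)         # baseline: predict no boundary everywhere
print(exact_match_accuracy(all_none, truth))       # 0.5
print(average_absolute_distance(all_none, truth))  # 0.833...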

5.3 Pitch Accent Prediction Experiments


The first experiment is the prediction of pitch accent locations with phrase boundaries known. One goal of this experiment is to confirm the necessity of phrase boundaries in pitch accent prediction. This will allow us to choose an order for predicting both

labels serially. A second goal is to determine the efficacy of TRBL in comparison to already established prediction methods. Simple decision trees, decision trees with a Markov assumption, and TRBL are used to predict pitch accent locations. Results for each of the decision tree experiments and TRBL are presented in Table 5.4.

Table 5.4: Summary of pitch accent location prediction with phrase boundaries given, for the Boston University Radio News Corpus. Accuracy is reported at the syllable level for an exact match with a target version. There are 3359 syllables total.

    Algorithm                    Syllable Accuracy
    Content/Function             83.8%
    Simple Decision Tree         85.6%
    Decision Tree with Markov    85.2%
    TRBL                         86.0%
    Best case TRBL               86.8%

In order to provide a simple baseline, accuracy using a simple rule that assigns a pitch accent to the syllable with primary lexical stress in every content word is shown. In brief, we find that TRBL gives the best performance, particularly when more complicated rule templates are allowed. Details of these experiments follow.

Two experiments were run with the decision trees. Both decision tree experiments include the same set of features, except that the decision tree with the Markov assumption includes a question about the previous label, as described in Chapter 4. The only feature not available to the decision trees was part-of-speech group I. The trees were grown using Splus, a commercial statistical analysis package [54]. The objective function used here was minimum entropy (see Chapter 4); other impurity functions have been used in this task, but the reported difference in results was not significant [46]. Trees were grown with all of the training data. Different constraints on tree size were examined by cross-validating with 90% of the data and holding out 10% of the data. In the simple decision tree experiments, it was found that limiting

terminal nodes to 24 provided results equivalent to an unpruned tree, so this value was chosen. The most important features chosen in the simple decision tree experiment include lexical stress, part-of-speech group II, and part-of-speech group III. Other features were vowel quality, location from the beginning of the word, part-of-speech of the previous words (groups two and three), number of syllables, and boundary information (boundary of the current word and distance in words to the previous boundary). The simple tree did not choose distance in words to the succeeding boundary, boundary on the next word, boundary on the previous word, or part-of-speech information of two words back. The simple decision tree is shown in Figure 5.1. A confusion table is presented in Table 5.5. The decision tree with a Markov assumption was grown using the same parameters as the simple decision tree. The primary difference between the simple decision tree and the tree with the Markov assumption is that the question about the previous label is chosen quite high in the tree. The results differ from previous results reported in [46]. There, the Markov assumption outperforms the simple decision tree, predicting pitch accent location with 87.7% accuracy at the syllable level. Here, the simple decision tree outperformed my implementation of the Markov assumption. There are several possible reasons for this. First, only a first-order Markov assumption was implemented, and Ross found that the longer context (second order) is used when it is available. In addition, he used a simple parsing of noun phrases to include complex nominal information. Finally, his experiments were based on an older version of the corpus which included noisy portions of the data. The decision tree used with the Markov assumption is shown in Figure 5.2. Two different experiments were run with the TRBL algorithm. The experiments differ in the type of templates the TRBL algorithm is allowed to use and in the permitted features. In the first experiment, the TRBL learning algorithm was deliberately constrained to be as similar to decision trees as possible. This meant that all features except part-of-speech group I

[Figure 5.1 diagram, "Simple Decision Tree": the full pruned tree, with internal-node questions on lexical stress (LEX), part-of-speech groups (POS2, POS3), vowel quality (VOW), boundary labels (BD), location in word (LOCB), and distances to boundaries; N and P node labels with data counts at each node.]
Figure 5.1: The pruned tree using the basic decision tree algorithm for pitch accent location prediction.

[Figure 5.2 diagram: the full pruned tree with a Markov assumption; in addition to the features of Figure 5.1, questions about the previous label (PLAB) and second-previous label (PPLAB) appear, with PLAB chosen near the root.]
Figure 5.2: The pruned tree using the decision tree with a Markov assumption for pitch accent location prediction.

were examined. Only rule templates that mirrored possible questions in decision trees were allowed: for instance, templates asking about a single feature, templates allowing two values of one feature to be or'd together, and greater-than questions for numerical features. A template that examined previous labels was also included, to compare with the decision tree with the Markov assumption.

Table 5.5: Confusion table for the best decision tree.

                 Truth
    Predicted    Accent    No Accent
    Accent       867       308
    No Accent    162       2024
    Total        1029      2332

In the second experiment, the best performing templates were provided to the learning algorithm, as well as the part-of-speech group I feature. The best performing template was the and template. This template was not allowed in the first experiment, since the decision tree software available does not look at the values of two features at the same node. The and template allowed the learning algorithm to examine the values of two different features at the same time. The threshold for minimum gain, important for limiting over-training, was set to five. Different values were tried, from 2 to 7, but there were no statistically significant differences in accuracy with the best case algorithm. The rule ranking metric for the pitch accent prediction task was simple exact match accuracy: whichever instantiated rule most decreased the number of errors in the intermediate labels was chosen. Different initializations were tried in both experiments. Results for each of the three initializations for the best case experiment are presented in Table 5.6, together with the accuracy at initialization. Despite a wide range of starting points, convergence is to similar points in most cases. When the or template was excluded from the allowed templates, initialization rules

did not affect performance for either of the two TRBL experiments. When the or template was included with an initialization of all syllables as accented, the learning algorithm reached a local maximum and resulted in an accuracy of 80.6%, significantly below the best result with the one-feature template only. While accuracy did not vary much with the different initializations, the rules and the number of rules learned did differ significantly. The initialization rule that introduced the greatest number of errors, assigning all syllables pitch accents, caused the algorithm to learn eight rules. Five rules were learned with the content/function word initialization rule. In the remaining TRBL experimental results reported here, the initialization rule labeled all syllables with accents. In the experiment constraining TRBL to decision tree capabilities, eight rules were learned. The first rule corrected the initialization rule: label all syllables without lexical stress in multi-syllabic words with no pitch accent. Table 5.8 lists the eight rules learned. The best case algorithm for TRBL learned 25 rules when initialized

Table 5.6: Accuracy with different initializations using the best case TRBL template set.

    Accuracy    all no accents    all accents    content/function word
    Initial     70.0%             30.0%          83.8%
    Final       86.8%             86.7%          86.8%

Table 5.7: Confusion table for the best TRBL rules with no and template. Minimum gain set to 5.

                 Truth
    Predicted    Accent    No Accent
    Accent       896       338
    No Accent    133       1994
    Total        1029      2332

with the minimum gain set to five. The best case version allows questions about part-of-speech group I and the and template. The rule set is very similar to the rules learned in the constrained case, but the and template predominates very early in the list. Experiments showed that the best algorithm learned rules instantiated from the one-feature template and the and template. Table 5.9 lists the rules learned with the best case algorithm. With both the decision trees and the constrained TRBL algorithm, lexical stress is very important to pitch accent placement. Part-of-speech group III is also very important, and appears with the current boundary very early in the rule set. Vowel quality is chosen here as an important indicator of pitch accent placement. While other research and the decision trees grown here show that syllables with a tense vowel quality in the pre-primary secondary lexically stressed position often receive a pitch accent, the best

Table 5.8: Rules learned with the constrained TRBL algorithm. Errors is the number of errors corrected minus the number of errors introduced on the training data.

    Feature                     Value                            Old Label    New Label    Errors
    Stress                      no stress, multi-syllabic word   accent       no accent    5133
    POS III                     less-accentable function words   accent       no accent    2664
    Stress                      post-primary secondary           accent       no accent    302
    Stress                      pre-primary secondary            accent       no accent    40
    Location from end of word   5                                no accent    accent       17
    Stress                      no stress, multi-syllabic word   accent       no accent    9
    POS II of next word         cardinal numbers                 accent       no accent    9
    POS II                      misc. function words             accent       no accent    7

case TRBL shows the opposite. However, when reading the rules in Table 5.9, one must remember that the rules here are learned after the initialization rule of all pitch accents. A more detailed analysis is needed to resolve this discrepancy.

Table 5.9: Top 10 rules learned with the best case TRBL algorithm. Where the feature entry contains two features, the and template was chosen; the first value corresponds to the first feature and the second value to the second feature.

    Feature(s)     Value(s)        Old Label    New Label    Errors Corrected
    LEX            NOM             P            N            5133
    POS3, BD       LSF, NOB        P            N            2692
    LEX            POSLX2          P            N            302
    LEX, VOW       PRELX2, TNS     P            N            62
    POS2, VOW      MS1, LAX        P            N            39
    POS3, PPOS1    PR, PP          P            N            28
    BD, NPOS2      NOB, CDS        P            N            24
    LEX, NPOS2     PRELX2, CDS     N            P            15
    POS1           IN              P            N            13
    LEX, NPOS2     PRELX2, MS2     P            N            13

TRBL has shown advantages when used with smaller amounts of training data than stochastic training algorithms in other prediction tasks, as discussed in Chapter 4. To test this claim for the pitch accent prediction task, all algorithms were provided with varying amounts of the training data. The training data was divided at paragraph divisions to give 100%, 66%, and 33% sets. Each algorithm was then trained with these amounts of data, and the resulting models were measured against the full target version. Table 5.10 shows these results. Interestingly, all models maintained a relatively high accuracy score. In fact, there was no statistically significant degradation of accuracy scores. This may be due to the fact that the radio style of speaking is highly accented, and simple content/function word rules alone give relatively good performance. The following example, in Figure 5.3, is from the labnews data and demonstrates

the predicted and hand-labeled pitch accents using the best case TRBL algorithm. The third phrase, beginning with "of the Massachusetts...", contains too many pitch accents, with two accents predicted for "Massachusetts". While this labeling is plausible, the example highlights the possibility for TRBL rules to over-accent. The TRBL algorithm did not choose any rule templates that used neighboring accent labels, which would be needed to capture the phenomenon of accent clash that might correct this example. A possible reason is that the neighbor-label rule was implemented in a left-to-right fashion, and it may have been more effective operating right-to-left. The over-accenting in the fifth phrase ("the SJC's...") is another example.

Table 5.10: Syllable accuracy results for decreasing amounts of training data.

    Amount of Training           100%     66%      33%
    Simple Decision Tree         85.6%    85.9%    85.9%
    Decision Tree with Markov    85.2%    84.9%    84.9%
    TRBL, constrained            86.0%    85.5%    85.5%
    Best case TRBL               86.7%    86.4%    86.7%

5.4 Phrase Boundary Prediction Experiments


Phrase boundary prediction experiments with known accent locations were also carried out, in order to determine the best order for the serial prediction experiments. Features involving pitch accents were derived from the hand-labeled pitch accents in the corpus. Results are presented for phrase boundary prediction using simple decision trees and TRBL. Each algorithm was given the same set of features. The features did not include part-of-speech group I or previous phrase boundary labels. Table 5.11 shows results based on the absolute distance criterion, compared with accuracy measured as the percentage of exactly correct boundaries. Note that improved performance corresponds to a lower average distance and higher exact match accuracy.

Figure 5.3: A comparison of predicted accents with TRBL vs. hand-labeled accents.
|| = major phrase boundary, | = minor phrase boundary * = pitch accent, - = no accent

Wanted: || Chief Justice | of the Massachusetts Supreme Court.|| true: * * * * * * * * * * *

pred.: *

In April, | the SJC's current leader || Edward Hennessy || reaches true: pred.: * * --*-* * * * * * * - - * * -

the mandatory retirement age of seventy, || and a successor | true: pred.: * - - * - - - * - - * - * - * - * * -

is expected to be named in March. || true: pred.: - - - * - * * *

Table 5.11: Summary of phrase boundary location prediction, with accuracy rates and average absolute distance, for the Boston University Radio News Corpus.

    Algorithm              Distance    Accuracy
    All no boundary        0.478       71.9%
    Punctuation            0.266       76.9%
    Simple Decision Tree   0.253       84.1%
    Constrained TRBL       0.239       82.3%
    TRBL                   0.235       82.6%

Neither algorithm asked questions about pitch accents, so these features were not included in other experiments with the TRBL algorithm. This result also motivated the choice to predict phrase boundaries first and then pitch accents in the experiments described in Section 5.5. The simple decision tree did provide higher exact match accuracy (84.1% vs. 82.6%) but did not predict any minor boundaries. Further, the distance measure shows that despite the higher exact match accuracy, the absolute distance between predicted and true labels is actually higher than with TRBL. It should be noted that since optimization of rule choice in TRBL uses distance, exact match accuracy need not be optimized as well. Similarly, the decision tree is not designed to minimize distance; minimum entropy is better suited to the exact match criterion. The constrained algorithm did not produce significantly worse results than the best case for the learner (82.3% vs. 82.6% exact match accuracy). The rules learned in the best case phrase prediction algorithm are shown in Table 5.12.

Table 5.12: Rules for the best case TRBL using the punctuation initialization rule. The distance score is the improvement (the difference between the old distance and the new distance) for each rule; boundary labels are coded 0 = none, 1 = minor, 2 = major.

    Feature(s)    Value(s)    Old Label    New Label    Distance
    POS2          MS2         1            0            0.0068
    POS2          AJS         1            2            0.0053
    NPOS2         NNS         1            0            0.0047
    POS2          LS          1            0            0.0039
    POS2          ADS         1            0            0.0025
    NPOS2         LS          1            2            0.0018
    POS2          MS1         1            0            0.0011
    PPPOS2        CDS         1            2            0.0008
    POS2          CDS         1            0            0.0010
    PPOS2         MS1         1            2            0.0005

Table 5.13: Confusion table for the best case TRBL rules for phrase prediction at the word level.

                 Truth
    Predicted    Major    Minor    None
    Major        274      43       37
    Minor        50       55       65
    None         92       80       1416
    Total        416      178      1518

The phrase boundary labels for TRBL were initialized in several ways. As in pitch accent prediction, all boundary locations were assigned the most prevalent label, no boundary (71.9% accuracy). Then, all positions with sentence-final punctuation were assigned a major boundary. Finally, all boundary locations at a content word followed by a function word were assigned a minor boundary. This resulted in an accuracy score of 81.1%. Here, again, the choice of initialization rules did not have a great impact on either overall accuracy or the distance measure, though it did impact the number of rules learned. Figure 5.4 presents a comparison of phrase boundaries predicted with the best case TRBL algorithm and the hand-labeled phrase boundaries. TRBL misses one important phrase boundary: there is a major phrase boundary following "Wanted" which is orthographically marked by a colon. This was most likely missed in training because of the sparsity of examples in the training data that contain colons (only three instances). Minor phrase boundaries are erroneously inserted between "age" and "of", "expected" and "to", and "named" and "in". This is most likely due to the initialization rule which inserts minor boundaries at each content/function word boundary; TRBL did not, in this case, learn any rules that corrected the errors in these locations. A solution to this may be increased syntactic information from a parser. The missing boundary between "leader" and "Edward" might be fixed with features about names in complex nominals. The major vs. minor

Table 5.14: Pitch accent accuracy using predicted boundaries. Boundaries Accent Rules Accuracy Hand-labeled Original 86.8% Predicted Original 86.3% Predicted Retrained 86.7% prediction di erences are not real errors in the sense that the hand-labeled phrase boundaries of other versions of this data read by di erent speakers matched what TRBL predicted. \leader" and \Edward".
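The three initializations described above are simple deterministic passes over the corpus. A minimal sketch, assuming hypothetical word fields for sentence-final punctuation and the content/function word distinction:

def initialize_boundaries(words):
    """Most prevalent label first, then the punctuation and
    content/function word initialization rules."""
    labels = ["none"] * len(words)                     # most prevalent label
    for i, word in enumerate(words):
        if word.final_punct in {".", "?", "!"}:        # hypothetical field
            labels[i] = "major"
        elif (word.is_content and i + 1 < len(words)   # hypothetical field
              and not words[i + 1].is_content):
            labels[i] = "minor"
    return labels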

5.5 Serial Accent and Phrase Boundary Prediction


This section describes the results of serial prediction of phrase boundaries and pitch accents. Results from the single label prediction experiments described above indicated that phrase boundaries should be predicted first, since no questions about pitch accents were chosen in the phrase boundary prediction experiment. Accuracy of boundary prediction is therefore not affected, but pitch accent prediction must rely on predicted, and hence less reliable, versions of the boundary features found useful in Section 5.3. The pitch accent step was evaluated in two ways. First, the best case rules were used with the predicted boundaries, instead of the hand-labeled boundaries, to predict the pitch accents. Second, the pitch accent rules were retrained with the predicted boundary information, obtained by running the boundary prediction rules on the training corpus. Table 5.14 shows the accuracy for the pitch accent prediction. Using the predicted rather than known boundaries degrades performance, as one might expect, but the difference is small. However, most of the performance loss is regained by retraining the rules using predicted boundaries.
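The serial pipeline can be summarized in a short sketch; the rule objects and the accent initializer are hypothetical carryovers from the earlier sketches. Retraining corresponds to learning the accent rules on a training corpus whose boundary features come from the boundary rules rather than from hand labels.

def serial_predict(words, boundary_rules, accent_rules):
    """Phrase boundaries first, then pitch accents that may use them."""
    boundaries = initialize_boundaries(words)
    for rule in boundary_rules:                # ordered rule list from TRBL
        boundaries = rule.apply(words, boundaries)
    accents = initialize_accents(words)        # hypothetical initializer
    for rule in accent_rules:                  # retrained on predicted boundaries
        accents = rule.apply(words, accents, boundaries)
    return boundaries, accents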


Figure 5.4: A comparison of phrase boundaries predicted with TRBL and hand-labeled boundaries.
|| = major boundary, | = minor boundary
[Figure: the passage "Wanted: Chief Justice of the Massachusetts Supreme Court. In April, the SJC's current leader Edward Hennessy reaches the mandatory retirement age of seventy, and a successor is expected to be named in March." annotated with true and predicted boundary markers.]

5.6 Joint Phrase and Accent Prediction


TRBL provides a framework that allows for joint prediction of pitch accents and phrase boundaries while avoiding the problems of data sparsity that can occur with decision trees. Results presented in Section 5.3 suggest that there is much redundancy in the features used for pitch accent prediction, so sparsity may not be as much of a concern, at least for the Radio News corpus. Nevertheless, it is still of interest to investigate joint prediction. There are several approaches to learning rules that allow two label sets. These approaches can be divided into two areas: template design and rule choice. Chapter 4 discusses these approaches in more detail. Rule choice metrics are another area complicated by labels occurring at two levels. Finally, the complexity of feature interdependence can cause problems of convergence: many of the features at the syllable level are derived from the phrase boundaries and vice versa. The approach taken here was to force the alternation of rules that ask about each label. Thus, one learning iteration was devoted to choosing the best phrase boundary rule and the next iteration was devoted to choosing the best pitch accent rule. After each iteration, the relevant feature values are updated. This is clearly not an optimal choice, but it is more computationally feasible than the alternatives. Unfortunately, due to time constraints, the other alternatives were not tried in this research. The joint learning algorithm was initialized with all pitch accents for syllables and the punctuation rule for phrase boundaries. The thresholds for learning were chosen from the serial experiment thresholds (a gain of 4 for syllables, and 0.0001 absolute distance for phrases). Serial prediction of phrase boundaries followed by pitch accent relabeling using rules learned with predicted phrase boundaries (86.7% pitch accent accuracy) significantly outperformed joint prediction (80.2% pitch accent accuracy). The poor performance suggests that the strict alternation of rule choice leads to poor local optima, so initialization is an important issue.
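A sketch of the forced alternation under the stated thresholds; the candidate scoring helpers and the corpus object are hypothetical, and termination handling is simplified relative to the experiments:

def joint_learn(corpus, max_iters, accent_gain_min=4, dist_gain_min=1e-4):
    """Alternate between boundary and accent rule selection each iteration."""
    rules = []
    for iteration in range(max_iters):
        if iteration % 2 == 0:
            rule, gain = best_boundary_rule(corpus)   # scored by distance
            if rule is None or gain < dist_gain_min:
                break
        else:
            rule, gain = best_accent_rule(corpus)     # scored by accuracy gain
            if rule is None or gain < accent_gain_min:
                break
        corpus.apply(rule)    # relabel and refresh derived features
        rules.append(rule)
    return rules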

Table 5.15: Comparison of serial and joint prediction algorithms using TRBL.

    Method                   Accent Accuracy   Boundary Distance
    Best Serial Prediction   86.7%             0.235
    Joint Prediction         80.2%             0.251


Chapter 6 Discussion
6.1 Summary
Prosody production in speech synthesis systems remains a difficult problem. One step towards improving the production of prosody in these systems is to improve the prediction of abstract prosodic labels such as pitch accent locations and phrase boundaries. It is desirable to have an algorithm that is automatically trainable and portable. This research presents a new algorithm based on TRBL, which is automatically trainable and offers some advantages over the prevailing methods of prosody prediction such as hand-written rules and decision trees. In this chapter, I summarize the major results and contributions of this thesis. I will also present some future research directions.

6.2 Contributions
The first major contribution is the introduction of a new automatically trainable prediction algorithm, transformational rule-based learning, to the very small group of prosody prediction algorithms. TRBL has been shown to provide results comparable to the best prediction methods currently in use today, decision trees, for the tasks of pitch accent and phrase boundary prediction. This method also allows the joint prediction of two different types of prosodic labels. Although we did not find an advantage to joint prediction, a less constrained implementation might lead to improved performance.

The prediction work carried out here confirms the importance of features derived from text analysis in prosodic label prediction. The features chosen by both the trees and the TRBL algorithm demonstrate that features such as detailed part-of-speech and distinctions in location for secondary lexical stress are important predictors of pitch accent location. The use of several groupings of part-of-speech values showed the limitations of each of the groupings. Rule sets reported in Chapter 5 demonstrated that the use of differing sets of values helps prediction. In contrast to Hirschberg [24], we did not find that accents were useful in phrase prediction using the TRBL algorithm. Further research is needed.

A second contribution is a new metric for quantitative evaluation of phrase boundary prediction, the average absolute distance. This metric has linguistic motivations and describes improvements in multi-label phrase boundary prediction more meaningfully than exact match accuracy or a confusion table. In our experiments, we found that many more minor phrase boundaries are predicted with this metric; no minor phrase boundaries are predicted using the decision tree with the exact match criterion. This absolute distance metric can also be modified to reflect changes in our understanding of different levels of phrase boundaries. For example, if a closer relationship between major and minor boundaries is posited, the integer values could be changed so that the distance between these two labels is less than the distance from either to no boundary. Phrase boundary labels might then be: {major = 5, minor = 4, no boundary = 0}.
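The metric is simple to compute once each label is assigned an integer value; the mapping is the tunable part. A sketch with a configurable mapping (the default values here are an assumption), showing how the alternative mapping penalizes a missed minor boundary more heavily than a major/minor confusion:

def avg_abs_distance(predicted, truth, values=None):
    """Average absolute distance under a configurable label mapping."""
    values = values or {"major": 2, "minor": 1, "none": 0}  # assumed default
    total = sum(abs(values[p] - values[t]) for p, t in zip(predicted, truth))
    return total / len(truth)

# Missing a minor boundary entirely:
print(avg_abs_distance(["none"], ["minor"]))                        # 1.0
print(avg_abs_distance(["none"], ["minor"],
                       {"major": 5, "minor": 4, "none": 0}))        # 4.0
# Confusing major and minor costs 1.0 under both mappings.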


6.3 Future Directions


This section discusses a few possible directions for future research that build on the work in this thesis.

Modifications to TRBL:

There are several modifications to the learning algorithm that were suggested by the experiments carried out in this research. First, potential rules that looked at previous labels were never chosen in the experiments. The template was implemented in a manner that looked in a right-to-left direction, and ordering in such templates has been shown to be significant in other research. Second, the templates were designed to look only at the immediate left neighbor, and a larger context may be necessary for prosody prediction as suggested in Ross's work [47]. Brill uses templates that examine two or three labels on either side for part-of-speech tagging [17]. In the experiments on pitch accent prediction, the and template was chosen most often in the best case experiments; adding more templates of this type, such as a template that and's three features together, might provide more accuracy. Third, templates that allow questions about both features and labels should be tried. In contrast, the or template hurt performance in a number of experiments. This template might benefit from a more efficient implementation as in decision tree design: in decision trees, any size subset of values may be included in the question, whereas in the or template here, only two values were allowed. A sketch of these template variants is given at the end of this subsection.

As more features and more templates are considered for the learning algorithm, a method for constraining the search must be found. Samuel proposes sampling potential rules randomly through a Monte-Carlo method [48]. Also, several schemes have been advanced to decrease the amount of time necessary to measure the score for each rule. Ramshaw and Marcus mention an indexing scheme that dramatically decreases this phase of the learning iteration [42]. Indexing was not used here because of the dynamic features, but it may be possible to use a partial indexing of the data for the static features.
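The template variants mentioned above can be sketched as predicate builders; the per-word feature representation (a dictionary of feature values) is hypothetical and not the thesis code:

def and_template(*feature_value_pairs):
    """Fires only when every (feature, value) pair matches; an and over
    three features is the proposed extension."""
    def matches(word_features):
        return all(word_features.get(f) == v for f, v in feature_value_pairs)
    return matches

def or_template(feature, value_a, value_b):
    """Fires when the feature takes either of two values; decision-tree
    style questions would allow arbitrary value subsets instead."""
    def matches(word_features):
        return word_features.get(feature) in (value_a, value_b)
    return matches

# Example with hypothetical feature names:
rule = and_template(("POS", "NN"), ("STRESS", "primary"), ("PPOS", "JJ"))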

Improved Input Features:

As mentioned above, the learning algorithm should benefit from more complex rule templates and additional features. In particular, the prediction of phrase boundaries should benefit from increased syntactic information. Automatic syntactic parsers are rapidly improving in both accuracy and depth of information, and this information could be included. Partial parsing or "chunking", first proposed by Steve Abney in [3], creates a sequence of bracketing that differs from traditional syntactic bracketing strategies in that there is no nesting of brackets. It is likely that improved performance would be obtained from including this information. The identification of complex nominals was not explicitly undertaken here. Instead, template designs that allowed the examination of part-of-speech labels on the current, previous, and second previous word were used. Yet, explicit identification of complex nominals might be more useful than relying on part-of-speech information alone. Multiple levels of part-of-speech labels were used in this research; more study on the impact of different sets of part-of-speech labels, as was done in [12], and on when these labels are used separately and together, would be useful. Finally, text generation may also provide more syntactic as well as discourse information. Discourse information has been shown to improve performance, though with small gains, in some prosodic prediction tasks such as pitch accent prediction [24].


Appendix A Penn Treebank Part-of-Speech Tags


1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Comparative adjective
9. JJS Superlative adjective
10. LS List item marker
11. MD Modal
12. NN Singular noun
13. NNS Plural noun
14. NP Singular proper noun
15. NPS Plural proper noun
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Comparative adverb
22. RBS Superlative adverb
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb (base form)
28. VBD Past tense verb
29. VBG Gerund or present participle verb
30. VBN Past participle verb
31. VBP Present tense verb, other than third person singular
32. VBZ Present tense verb, third person singular
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

Appendix B Part-of-speech Classes


1. Cardinal numbers: CD
2. Adjectives: JJ, JJR, JJS
3. Nouns: NN, NNS, NP, NPS
4. Adverbs: RB, RBR, RBS, RP
5. More-accentable verbs: VBG, VBN
6. Less-accentable verbs: VB, VBD, VBP, VBZ
7. Miscellaneous Group One: PDT, PP, WP, WRB
8. Miscellaneous Group Two: CC, DT, EX, FW, IN, LS, MD, PP$, SYM, TO, UH, WDT, WP$


Bibliography
[1] Abe, M. (1997). "Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System." In Progress in Speech Synthesis. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, eds. New York: Springer-Verlag. 495-510.

[2] Aberdeen, J. (1996). "Brill Part-of-Speech Rule Format." MITRE Corporation. Part of course materials for CS545 taught by Profs. Hirschman and Burger, Boston.

[3] Abney, S. P. (1991). "Parsing by Chunks." In Principle-based Parsing: Computation and Psycholinguistics, R. C. Berwick, S. P. Abney, and C. Tenny, eds. Norwell, MA: Kluwer Academic Publishers. 257-279.

[4] Allen, J., Hunnicutt, M. S., Klatt, D., et al. (1987). From Text to Speech: The MITalk System. New York: Cambridge UP.

[5] Allen, J. F., Schubert, G., Ferguson, G., et al. (1995). "The Trains Project: A Case Study in Building a Conversational Planning Agent." Journal of Experimental and Theoretical AI, 7:7-48.

[6] Alter, K., Buchberger, E., et al. (1996). "VieCtoS: The Vienna Concept to Speech System." In Natural Language and Speech Technology. D. Gibbon, ed. Berlin: Mouton de Gruyter.

[7] Arnfield, S. (1996). "Word Class Driven Synthesis of Prosodic Annotations." Proceedings of the International Conference on Spoken Language Processing.

[8] Bachenko, J., and Fitzpatrick, E. (1990). "A Computational Grammar of Discourse-neutral Prosodic Phrasing in English." Computational Linguistics, 16(3). 155-170.

[9] Barbosa, P., and Bailly, G. (1997). "Generation of Pauses Within the z-score Model." In Progress in Speech Synthesis. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, eds. New York: Springer-Verlag. 365-377.

[10] Beckman, M., and Ayers Elam, G. (1997). "Guidelines for ToBI Labelling." Version 3. Ohio State University. Available at http://julius.ling.ohio-state.edu/Phonetics/E_ToBI/etobi_homepage.html.

[11] Black, A., and Taylor, P. (1997). "Festival Speech Synthesis System: System Documentation (1.1.1)." Human Communication Research Centre Technical Report HCRC/TR-83. Available at http://www.cstr.ed.ac.uk/projects/festival/papers.html.

[12] Black, A., and Taylor, P. (1998). "Assigning Phrase Breaks from Part-of-Speech Sequences." Computer Speech and Language, 12. 99-117.

[13] Bolinger, D. (1986). Intonation and Its Parts. Stanford, CA: Stanford University Press.

[14] Boogaart, T., and Silverman, K. (1992). "Evaluating the Overall Comprehensibility of Speech Synthesizers." Proceedings of the International Conference on Spoken Language Processing.

[15] Breiman, L. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

[16] Brill, E., and Resnik, P. (1994). "A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation." Proceedings of the Sixteenth International Conference on Computational Linguistics.

[17] Brill, E. (1995). "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging." Computational Linguistics, 21(4). 543-565.

[18] Campbell, N., and Black, A. (1997). "Prosody and the Selection of Source Units for Concatenative Synthesis." In Progress in Speech Synthesis. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, eds. New York: Springer-Verlag. 279-290.

[19] Delogu, C., Paoloni, A., et al. (1996). "Spectral Analysis of Synthetic Speech and Natural Speech with Noise over the Telephone Line." Proceedings of the International Conference on Spoken Language Processing.

[20] Eskenazi, M. (1993). "Trends in Speaking Styles Research." Proceedings of the 3rd European Conference on Speech Communication and Technology. 501-509.

[21] Gee, J., and Grosjean, F. (1983). "Performance Structures: A Psycholinguistic and Linguistic Appraisal." Cognitive Psychology, 15.

[22] Grote, B., Hagen, E., and Teich, E. (1996). "Matchmaking: Dialogue Modeling and Speech Generation Meet." Proceedings of the 8th International Workshop on Natural Language Generation.

[23] Hirai, T., Iwahashi, N., Higuchi, N., and Sagisaka, Y. (1996). "Automatic Extraction of F0 Control Rules Using Statistical Analysis." In Progress in Speech Synthesis. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, eds. New York: Springer-Verlag. 333-346.

[24] Hirschberg, J. (1993). "Pitch Accent in Context: Predicting Prominence from Text." Artificial Intelligence, 63. 305-340.

[25] Horne, M., and Filipsson, M. (1996). "Implementation and Evaluation of a Model for Synthesis of Swedish Intonation." Proceedings of the International Conference on Spoken Language Processing.

[26] Hunt, A., and Black, A. (1996). "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database." Proceedings of the International Conference on Acoustics, Speech and Signal Processing.

[27] Jongenburger, W., and van Bezooijen, R. (1992). "Text-to-Speech Conversion for Dutch: Comprehensibility and Acceptability." Proceedings of the International Conference on Spoken Language Processing. 1135-1138.

[28] Klatt, D. (1987). "Review of Text-to-Speech Conversion for English." Journal of the Acoustical Society of America, 82(3).

[29] MacNeil, R. Wordstruck: A Memoir. Read by the author. Directed by Arthur Luce Klein. Audio cassette, 8 cassettes. Used by arrangement with Viking Penguin, New York.

[30] Mangu, L., and Brill, E. (1997). "Automatic Rule Acquisition for Spelling Correction." International Conference on Machine Learning.

[31] Marcus, M. P., and Santorini, B. (1993). "Building a Very Large Natural Language Corpora: The Penn Treebank." Computational Linguistics, 19(2).

[32] McKeown, K. R., and Moore, J. D. (1995). "Spoken Language Generation." In The State of the Art in Human Language Technology, Ch. 5.4.

[33] Meteer, M., Schwartz, R., and Weischedel, R. (1991). "POST: Using Probabilities in Language Processing." Proceedings of the International Joint Conference on Artificial Intelligence.

[34] Mixdorff, H. (1998). "Intonation Patterns of German - Model-based Quantitative Analysis and Synthesis of F0 Contours." PhD thesis, Dresden Technical University.

[35] Monaghan, A. I. C. (1990). "Rhythm and Stress-shift in Speech Synthesis." Computer Speech and Language, 4. 71-78.

[36] Ostendorf, M., Price, P., and Shattuck-Hufnagel, S. (1994). "The Boston University Radio News Corpus." Boston University Technical Report ECE-95-001.

[37] Ostendorf, M., and Veilleux, N. (1994). "A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location." Computational Linguistics, 20(1). 27-52.

[38] Palmer, D. (1997). "A Trainable Rule-Based Algorithm for Word Segmentation." Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '97). 321-328.

[39] Pierrehumbert, J. (1980). "The Phonology and Phonetics of English Intonation." PhD thesis, Massachusetts Institute of Technology.

[40] Pitrelli, J., Beckman, M., and Hirschberg, J. (1994). "Evaluation of Prosodic Transcription Labelling Reliability in the ToBI Framework." Proceedings of the International Conference on Spoken Language Processing. 123-126.

[41] Prevost, S. (1996). "Modeling Contrast in the Generation and Synthesis of Spoken Language." Proceedings of the International Conference on Spoken Language Processing.

[42] Ramshaw, L. A., and Marcus, M. P. (1996). "Exploring the Nature of Transformation-Based Learning." In The Balancing Act: Combining Symbolic and Statistical Approaches to Language. J. L. Klavans and P. Resnik, eds. Cambridge, MA: MIT Press. 135-156.

[43] Ramshaw, L. A., and Marcus, M. P. (1995). "Text Chunking Using Transformation-Based Learning." Third ACL Workshop on Very Large Corpora.

43] Ramshaw, L. A. and Marcus, M. P. (1995). \Text Chunking using Transformation-Based Learning." Third ACL Workshop on Very Large Corpora. 44] Riley, M. (1992). \Tree-based Modelling of Segmental Durations." In Talking Machines: Theories, Models, and Designs. G. Bailly, C. Benoit, and T.R. Sawallis, eds. New York: Elsevier Science Publishers. 265-274. 45] Riley, M. and Sproat, R. (1994). \Text Analysis Tools in Spoken Language Processing", A tutorial presented at 32nd Annual Meeting of the Association for Computational Linguistics. 46] Ross, K. and Ostendorf, M. (1996). \Prediction of Abstract Prosodic Labels for Speech Synthesis." Computer Speech and Language, 10. 155-185. 47] Ross, K. (1995). \Modeling of Intonation for Speech Synthesis." PhD thesis, Boston University. 48] Samuel, K., Carberry, S., and Vijay-Shanker, K. (1998). \Computing Dialogue Acts from Features with Transformation-Based Learning." Presented at AAAI 1998 Spring Symposium on Applying Machine Learning to Discourse Processing, March 24. 49] Satta, G. and Brill, E. (1994). \E cient Transformation-Based Parsing." Association for Computational Linguistics. 50] Shattuck-Hufnagel, S., Ostendorf, M., and Ross, K. (1994). \Stress shift and early pitch accent placement in lexical items in American English." Journal of Phonetics, 22. 357-388. 51] Shih, C., and Ao, B. (1997). \Duration Study for the Bell Laboratories Mandarin Text-to-Speech System." In Progress in Speech Synthesis. Jan P.H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirshberg, eds. New York: Springer-Verlag. 383-397. 78

[52] Silverman, K., et al. (1992). "ToBI: A Standard for Labeling Prosody." Proceedings of the International Conference on Spoken Language Processing. 867-870.

[53] Silverman, K., Kalyanswamy, A., Silverman, J., Basson, S., and Yashchin, D. (1993). "Synthesizer Intelligibility in the Context of a Name-and-Address Information Service." Proceedings of the 3rd European Conference on Speech Communication and Technology. 2169-2172.

[54] S-Plus: Reference Manual, Vols. 1 and 2. (1993). Seattle, WA: MathSoft Inc.

[55] Spyns, P., Deprez, F., Van Tichelen, L., and Van Coile, B. (1997). "Message-to-Speech: High Quality Speech Generation for Messaging and Dialogue Systems." In Proceedings of the Concept to Speech Generation Systems Workshop, ACL '97. 11-16.

[56] Sproat, R. (1990). "Stress Assignment in Complex Nominals for English Text-to-Speech." In Proceedings of the European Speech Communication Association Workshop on Speech Synthesis, Autrans, France.

[57] TrueTalk Reference Manual: Customizing Text-to-Speech Behavior. (1995). Washington, D.C.: Entropic Research Laboratory, Inc.

[58] Takeda, K., Abe, K., and Sagisaka, Y. (1992). "On the Basic Scheme and Algorithms in Non-uniform Unit Speech Synthesis." In Talking Machines: Theories, Models, and Designs. G. Bailly, C. Benoit, and T. R. Sawallis, eds. New York: Elsevier Science Publishers. 93-106.

[59] Veilleux, N. M. (1994). "Computational Models of the Prosody/Syntax Mapping for Spoken Language Systems." PhD thesis, Boston University.

[60] Vilain, M., Burger, J., and Hirschman, L. (1996). Personal correspondence.

[61] Wang, M. Q., and Hirschberg, J. (1992). "Automatic Classification of Intonational Phrase Boundaries." Computer Speech and Language, 6.

[62] Ward, G. (1991). The Moby Pronunciator. The Institute for Language Speech and Hearing, The University of Sheffield, UK. A version is available at http://www.dcs.shef.ac.uk/research/ilash/Moby/.

[63] Wightman, C., Shattuck-Hufnagel, S., Ostendorf, M., and Price, P. (1992). "Segmental Durations in the Vicinity of Prosodic Phrase Boundaries." Journal of the Acoustical Society of America, 91(3). 1707-1717.

[64] Wightman, C., and Talkin, D. (1997). "The Aligner: Text-to-Speech Alignment Using Markov Models." In Progress in Speech Synthesis. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, eds. New York: Springer-Verlag. 313-320.

[65] Yamashita, Y., and Mizoguchi, R. (1992). "SOCS: A Speech Output System from Concept Representation." Proceedings of the International Conference on Acoustics, Speech and Signal Processing.

[66] van Santen, J. P. H. (1993). "Perceptual Experiments for Diagnostic Testing of Text-to-Speech Systems." Computer Speech and Language, 7. 49-100.

[67] Young, S. J., and Fallside, F. (1979). "Speech Synthesis from Concept: A Method for Speech Output from Information Systems." Journal of the Acoustical Society of America, 66(3).

[68] Zue, V., Glass, J., et al. (1997). "Jupiter: The Displayless Weather Domain." Summary of Research, Laboratory for Computer Science, Spoken Language Systems, Massachusetts Institute of Technology. 20-21.

