
Industrial Parsing of Software Manuals
Editors:
Richard F. E. Sutcliffe
University of Limerick
Heinz-Detlev Koch
University of Heidelberg
Annette McElligott
University of Limerick
Dedicated to
Dr. A. Daly Briscoe
Contents

1. Industrial Parsing of Software Manuals: an Introduction
   1.1 Introduction
   1.2 IPSM Test Corpus
       1.2.1 Why Software Manuals?
       1.2.2 The 600 Utterance Corpus
       1.2.3 The 60 Utterance Subset
   1.3 Analysis of Parser Performance
       1.3.1 Three Phases of Analysis
       1.3.2 Analysis of Particular Constructs
       1.3.3 Coverage
       1.3.4 Efficiency
       1.3.5 Accuracy of Analysis
   1.4 Structure of the Book
       1.4.1 Introductory Chapters
       1.4.2 Parsing Chapters
       1.4.3 Appendices
   1.5 Discussion
   1.6 Acknowledgements
   1.7 References

2. Dependency-Based Parser Evaluation: a Study with a Software Manual Corpus
   2.1 Introduction
   2.2 Dependency-Based Evaluation
   2.3 Manual Normalization of Parser Outputs
   2.4 Automated Transformation from Constituency to Dependency
   2.5 Conclusion
   2.6 References

3. Comparative Evaluation of Grammatical Annotation Models
   3.1 Introduction
   3.2 Diversity in Grammars
   3.3 An Extreme Case: the `Perfect Parser' from Speech Recognition
   3.4 The Corpus as Empirical Definition of Parsing Scheme
   3.5 Towards a MultiTreebank
   3.6 Vertical Strip Grammar: a Standard Representation for Parses
   3.7 EAGLES: A Multi-Layer Standard for Syntactic Annotation
       3.7.1 (a) Bracketing of Segments
       3.7.2 (b) Labelling of Segments
       3.7.3 (c) Showing Dependency Relations
       3.7.4 (d) Indicating Functional Labels
       3.7.5 (e) Marking Subclassification of Syntactic Segments
       3.7.6 (f) Deep or `Logical' Information
       3.7.7 (g) Information about the Rank of a Syntactic Unit
       3.7.8 (h) Special Syntactic Characteristics of Spoken Language
       3.7.9 Summary: a Hierarchy of Importance
   3.8 Evaluating the IPSM Parsing Schemes against EAGLES
   3.9 Summary and Conclusions
   3.10 References

4. Using ALICE to Analyse a Software Manual Corpus
   4.1 Introduction
   4.2 Description of Parsing System
       4.2.1 Preprocessing
       4.2.2 Parsing
       4.2.3 Postprocessing
   4.3 Parser Evaluation Criteria
   4.4 Analysis I: Original Grammar, Original Vocabulary
   4.5 Analysis II: Original Grammar, Additional Vocabulary
   4.6 Analysis III: Modified Grammar, Additional Vocabulary
   4.7 Converting Parse Tree to Dependency Notation
   4.8 Summary of Findings
   4.9 References

5. Using the English Constraint Grammar Parser to Analyse a Software Manual Corpus
   5.1 Introduction
   5.2 Description of Parsing System
       5.2.1 Sample Output
       5.2.2 System Architecture
       5.2.3 Implementation
   5.3 Parser Evaluation Criteria
       5.3.1 Towards General Criteria
       5.3.2 Remarks on the Present Evaluation
       5.3.3 Current Evaluation Setting
   5.4 Analysis I: Original Grammar, Original Vocabulary
       5.4.1 Observations about Morphological Analysis and Disambiguation
       5.4.2 Observations about Syntax
   5.5 Analysis II: Original Grammar, Additional Vocabulary
       5.5.1 Observations about Morphological Disambiguation
   5.6 Analysis III: Altered Grammar, Additional Vocabulary
       5.6.1 Observations about Morphological Disambiguation
       5.6.2 Observations about Syntax
   5.7 Converting Parse Tree to Dependency Notation
   5.8 Summary of Findings
   5.9 References

6. Using the Link Parser of Sleator and Temperly to Analyse a Software Manual Corpus
   6.1 Introduction
   6.2 Description of Parsing System
   6.3 Parser Evaluation Criteria
   6.4 Analysis I: Original Grammar, Original Vocabulary
       6.4.1 Pre-Processing
       6.4.2 Results
   6.5 Analysis II: Original Grammar, Additional Vocabulary
   6.6 Analysis III: Altered Grammar, Additional Vocabulary
   6.7 Converting Parse Tree to Dependency Notation
   6.8 Summary of Findings
   6.9 References

7. Using PRINCIPAR to Analyse a Software Manual Corpus
   7.1 Introduction
   7.2 Description of Parsing System
       7.2.1 Parsing by Message Passing
       7.2.2 Implementation
   7.3 Parser Evaluation Criteria
   7.4 Analysis I: Original Grammar, Original Vocabulary
       7.4.1 Setting-Up the Experiment
       7.4.2 Results
       7.4.3 Causes of Errors
   7.5 Analysis II: Original Grammar, Additional Vocabulary
   7.6 Analysis III: Altered Grammar, Additional Vocabulary
   7.7 Converting Parse Tree to Dependency Notation
   7.8 Summary of Findings
   7.9 References

8. Using the Robust Alvey Natural Language Toolkit to Analyse a Software Manual Corpus
   8.1 Introduction
   8.2 Description of Parsing System
       8.2.1 The Basic ANLT
       8.2.2 The Robust ANLT
   8.3 Parser Evaluation Criteria
   8.4 Analysis I: Original Grammar, Original Vocabulary
       8.4.1 Pre-Processing
       8.4.2 Results
   8.5 Analysis II: Original Grammar, Additional Vocabulary
   8.6 Analysis III: Altered Grammar, Additional Vocabulary
   8.7 Converting Parse Tree to Dependency Notation
   8.8 Summary of Findings
   8.9 References

9. Using the SEXTANT Low-Level Parser to Analyse a Software Manual Corpus
   9.1 Introduction
   9.2 Description of Parsing System
       9.2.1 Preparsing Processing
       9.2.2 Parsing
       9.2.3 List Recognition
   9.3 Parser Evaluation Criteria
   9.4 Analysis I: Original Grammar, Original Vocabulary
   9.5 Analysis II: Original Grammar, Additional Vocabulary
   9.6 Analysis III: Altered Grammar, Additional Vocabulary
   9.7 Converting Parse Tree to Dependency Notation
   9.8 Summary of Findings
   9.9 References

10. Using a Dependency Structure Parser without any Grammar Formalism to Analyse a Software Manual Corpus
    10.1 Introduction
    10.2 Description of Parsing System
    10.3 Parser Evaluation Criteria
    10.4 Analysis I: Original Grammar, Original Vocabulary
    10.5 Analysis II: Original Grammar, Additional Vocabulary
    10.6 Analysis III: Altered Grammar, Additional Vocabulary
    10.7 Converting Parse Tree to Dependency Notation
    10.8 Summary of Findings
    10.9 References

11. Using the TOSCA Analysis System to Analyse a Software Manual Corpus
    11.1 Introduction
    11.2 Description of Parsing System
        11.2.1 The TOSCA Analysis Environment
        11.2.2 The Tagger
        11.2.3 The Parser
    11.3 Parser Evaluation Criteria
    11.4 Analysis I: Original Grammar, Original Vocabulary
        11.4.1 Efficacy of the Parser
        11.4.2 Efficiency of the Parser
        11.4.3 Results
    11.5 Analysis II: Original Grammar, Additional Vocabulary
    11.6 Analysis III: Altered Grammar, Additional Vocabulary
    11.7 Converting Parse Tree to Dependency Notation
    11.8 Summary of Findings
    11.9 References

Appendix I. 60 IPSM Test Utterances
Appendix II. Sample Parser Outputs
Appendix III. Collated References
Index
1
Industrial Parsing of Software Manuals: an Introduction

Richard F. E. Sutcliffe1
Annette McElligott
University of Limerick

Heinz-Detlev Koch
University of Heidelberg

1 Addresses: Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland. Tel: +353 61 202706 (Sutcliffe), +353 61 202724 (McElligott), Fax: +353 61 330876, Email: richard.sutcliffe@ul.ie, annette.mcelligott@ul.ie. Lehrstuhl für Computerlinguistik, Karlstraße 2, D-69117 Heidelberg, Deutschland. Tel: +49 6221 543511, Fax: +49 6221 543 242, Email: koch@novell1.gs.uni-heidelberg.de. We are indebted to the National Software Directorate of Ireland under the project `Analysing Free Text with Link Grammars' and to the European Union under the project `Selecting Information from Text (SIFT)' (LRE-62030) for supporting this research. This work could not have been done without the assistance of Denis Hickey, Tony Molloy and Redmond O'Brien.
1.1 Introduction
Parsing is the grammatical analysis of text. A parser is a computer
program which can carry out this analysis automatically on an input
provided in machine readable form. For example, given an utterance
such as
L22: Select the text you want to protect.

a parser might produce the following output:


[s, [vp, [v, select],
         [np, [det, the],
              [n, text],
              [rc, [pro, you],
                   [vp, [v, want],
                        [ic, [to, to],
                             [v, protect]]]]]]]


This can be interpreted as saying that the input was a sentence com-
prising a single verb phrase, that the verb phrase consisted of the verb
`select' followed by a noun phrase, that the noun phrase comprised a
determiner `the', a noun `text' and a relative clause, and so on. To
understand the output it is necessary to know what each non terminal
(`s', `vp' etc.) means, and in precisely what kinds of structure each can
occur within the output. This in turn requires an understanding of the
linguistic formalism on which the parser is based.
The task of language engineering is to develop the technology for
building computer systems which can perform useful linguistic tasks such
as machine assisted translation, text retrieval, message classification and
document summarisation. Such systems often require the use of a parser
which can extract specific types of grammatical data from pre-defined
classes of input text.
There are many parsers already available for use in language engi-
neering systems. However, each uses a different linguistic formalism and
parsing algorithm, and as a result is likely to produce output which is
different from that of other parsers. To make matters worse, each is likely
to have a different grammatical coverage and to have been evaluated
using different criteria on different test data. To appreciate the point,
study Appendix II where you will find eight analyses of the utterance
L22. None of these bears any resemblance to the one shown above.
Suppose you wish to build a language engineering system which re-
quires a parser. You know what syntactic characteristics you want to
extract from an utterance but you are not interested in parsing per se.
Which parsing algorithm should you use? Is there an existing parser
which could be adapted to the task? How difficult will it be to convert
the output of a given parser to the form which you require? What kind
of coverage and accuracy can you expect? This book sets out to provide
some initial answers to these questions, taking as its starting point one
text domain, that of software instruction manuals.
The book is derived from a workshop Industrial Parsing of Soft-
ware Manuals (IPSM'95) which was held at the University of Limerick,
Ireland, in May 1995. Research teams around the world were invited
to participate by measuring the performance of their parsers on a set
of 600 test sentences provided by the organisers. The criteria to be
used for measuring performance were also specified in advance. Eight
groups from seven countries took up the challenge. At the workshop,
participants described their parsing systems, presented their results and

outlined the methods used to obtain them.


One finding of IPSM'95 was that the articles produced for the proceedings (Sutcliffe, McElligott & Koch, 1995) were rather disparate,
making direct comparison between systems difficult. Each group had
conducted a slightly different form of analysis and the results were reported using tables and figures in a variety of configurations and formats.
To take the aims of the workshop further, and to make the information
available to a wider audience, each group was asked to carry out a more
tightly specified analysis by applying their parser to a subset of the original IPSM corpus and presenting their findings in a standard fashion.
The results of this second phase of work are contained in the present
volume.
Another issue which developed out of the workshop relates to stan-
dardisation of parse trees. Each parser used in IPSM'95 produces a
different type of output. This makes direct comparisons of performance
difficult. Moreover, it is an impediment to structured language engineering, as we have already noted. In an ideal situation it would be
possible to link existing tools such as lexical analysers, part-of-speech
taggers, parsers, semantic case frame extractors and so on in various
ways to build different systems. This implies both that the tools can
be physically linked and that output data produced by each one can be
made into a suitable input for the next component in the chain. Physical
linkage is difficult in itself but has been addressed by such paradigms as
GATE (Cunningham, Wilks & Gaizauskas, 1996). What can be done
about the widely differing outputs produced by parsers?
Dekang Lin has on a previous occasion suggested that any parse can
be converted at least partially into a dependency notation and that this
form could comprise a standard by which the output from different sys-
tems could be compared (Lin, 1995). The idea was discussed in detail
at the workshop and in consequence each group was requested to inves-
tigate the extent to which a dependency system could capture the data
produced by their parser.
In the remainder of this introduction we describe in more detail the
objectives and background of the IPSM project. In Section 1.2 we justify
the use of computer manual texts as the basis of the study, describe the
characteristics of the test data which was used, and explain exactly how
it was produced. Section 1.3 outlines the three phases of analysis which
were carried out on each parser, the kinds of information which were
determined for each phase, and the means by which this was presented
in tabular form. Section 1.4 describes the structure of the book and in
particular explains the set of standard sections which are used for all the
parsing chapters. Finally, Section 1.5 briefly discusses the findings of the
project as a whole.

Type    Dynix             Lotus             Trados
        Count  Selected   Count  Selected   Count  Selected
S       117    12         091    09         135    14
IMP     032    03         068    07         041    04
IVP     001    00         018    02         000    00
3PS     006    01         005    01         000    00
PVP     004    00         013    01         010    01
NP      040    04         005    00         012    01
QN      000    00         000    00         002    00
Total   200    20         200    20         200    20

Table 1.1: IPSM Corpus broken down by utterance type and source document. Each column marked `Count' shows the number of utterances of the given type which occurred in the software manual shown in the first row. Each column marked `Selected' shows the number of these which were used in the reduced set of 60 utterances. Examples of the various utterance types are shown in Table 1.2.
Type  Example
S     Typically, there are multiple search menus on your system, each of which is set up differently.
IMP   Move the mouse pointer until the I-beam is at the beginning of the text you want to select.
IVP   To move or copy text between documents
3PS   Displays the records that have a specific word or words in the TITLE, CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which fields have been included in each index.
PVP   Modifying the Appearance of Text
NP    Automatic Substitution of Interchangeable Elements
QN    What do we mean by this?

Table 1.2: Examples of the utterance types used in Table 1.1.

1.2 IPSM Test Corpus


1.2.1 Why Software Manuals?
Many studies on parsing in the past have been carried out using test
material which is of little practical interest. We wished to avoid this by
selecting a class of documents in which there is a demonstrated com-
mercial interest. Software instruction manuals are of crucial importance
to the computer industry generally and there are at least two good reasons for wishing to parse them automatically. The first is in order to
translate them into different languages. Document translation is a major
part of the software localisation process, by which versions of a software
product are produced for different language markets. The second reason
is in order to create intelligent on-line help systems based on written
documentation. SIFT (Hyland, Koch, Sutcliffe and Vossen, 1996) is just
one of many projects investigating techniques for building such systems
automatically.

1.2.2 The 600 Utterance Corpus


Having decided on software documentation, three instruction manuals
were chosen for use in IPSM. These were the Dynix Automated Library
Systems Searching Manual (Dynix, 1991), the Lotus Ami Pro for Win-
dows User's Guide Release Three (Lotus, 1992) and the Trados Trans-
lator's Workbench for Windows User's Guide (Trados, 1995). A study
had already been carried out on Chapter 5 of the Lotus manual, which
contained 206 utterances.2 For this reason it was decided to use 200
utterances from each of the three manuals, making a total of 600. This
corpus was then used for the initial analysis carried out by the eight
teams and reported on at the IPSM workshop.

1.2.3 The 60 Utterance Subset


Following the workshop, we wished to carry out a more detailed and
constrained study on the eight parsers in order to allow a more precise
comparison between them. Unfortunately it was not feasible for all the
teams to undertake a detailed study on the entire 600 utterance corpus.
For this reason a 60 utterance subset was created. The following method
was used to achieve this:
1. Each utterance in the original set of 600 was categorised by type
using the classes Sentence (S), Imperative (IMP), Infinitive Verb
Phrase (IVP), Third Person Singular (3PS), Progressive Verb
Phrase (PVP), Noun Phrase (NP) and Question (QN). The anal-
ysis is shown in Table 1.1 with examples of each type shown in
Table 1.2.
2. A selection was made from each manual for each utterance type
such that the proportion of that type in the 60 utterance subset
was as close as possible to that in the original 600 utterance corpus.
2 We use the term utterance to mean a sequence of words separated from other
such sequences, which it is desired to analyse. Some such utterances are sentences.
Others (for example headings) may comprise a single verb phrase (e.g. `Proofing a
Document'), a noun phrase (e.g. `Examples of Spell Check') or some other construct.

The 60 selected utterances can be seen in Appendix I.
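As an illustration of this proportional selection, the following short sketch computes the quota for each utterance type from the Dynix counts of Table 1.1. It is not the procedure actually used by the organisers, and the rounding rule is an assumption made for the example.

# Proportional (stratified) selection sketch. Counts are the Dynix column
# of Table 1.1; the rounding behaviour is assumed, not taken from the book.
counts = {'S': 117, 'IMP': 32, 'IVP': 1, '3PS': 6, 'PVP': 4, 'NP': 40, 'QN': 0}
subset_size, corpus_size = 20, 200     # 20 of the 60 utterances per manual

quota = {t: round(c * subset_size / corpus_size) for t, c in counts.items()}
print(quota)   # {'S': 12, 'IMP': 3, 'IVP': 0, '3PS': 1, 'PVP': 0, 'NP': 4, 'QN': 0}

Under these assumptions the computed quotas match the `Selected' column for Dynix in Table 1.1.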

1.3 Analysis of Parser Performance


1.3.1 Three Phases of Analysis
Each participant was asked to carry out a study to determine how well
their parser was able to extract certain categories of syntactic informa-
tion from the set of 60 test utterances. Three phases of analysis were
requested. In Analysis I, the parser had to be used with its original
grammar and lexicon. It was permissible however to alter the lexical
analysis component of the system. For Analysis II, the lexicon of the
parser could be augmented but no changes to the underlying grammar
were allowed. Finally, Analysis III allowed changes to both the lexicon
and the grammar.
The purpose of the three phases was to gain insight into how robust
the different systems were and to provide lower and upper bounds for
their performance in the task domain. Because of the diversity of parsing
methods being used, the criteria for each phase had to be interpreted
slightly differently for each system. Such differences are discussed in the
text where they arise.
Participants were requested to provide their results in the form of
a series of standard tables. The precise analysis which was carried out,
together with the format of the tables used to present the data, is described in the following sections.

1.3.2 Analysis of Particular Constructs


The first piece of information provided for each parser is a table showing
which forms of syntactic analysis it could in principle carry out. These
forms are explained below using the example utterance `If you press
BACKSPACE, Ami Pro deletes the selected text and one character to
the left of the selected text.'.
- A: Verbs recognised: e.g. recognition of `press' and `deletes'.
- B: Nouns recognised: e.g. recognition of `BACKSPACE', `text', `character' and `left'.
- C: Compounds recognised: e.g. recognition of `Ami Pro'.
- D: Phrase boundaries recognised: e.g. recognition that `the selected text and one character to the left of the selected text' is a noun phrase.

- E: Predicate-Argument relations recognised: e.g. recognition that the argument of `press' is `BACKSPACE'.
- F: Prepositional phrases attached: e.g. recognition that `to the left of the selected text' attaches to `one character' and not to `deletes'.
- G: Co-ordination/Gapping analysed: e.g. recognition that the components of the noun phrase `the selected text and one character to the left of the selected text' are `the selected text' and `one character to the left of the selected text', joined by the coordinator `and'.
In each chapter, the above information is presented in Tables X.1
and X.2, where X is the chapter number.

1.3.3 Coverage
An indication of the coverage of the parser is given in Tables X.3.1, X.3.2
and X.3.3. Each is in the same format and shows for each of the three
sets of utterances (Dynix, Lotus and Trados) the number which could
be accepted. A parser is deemed to accept an utterance if it can produce
some analysis for it. Otherwise it is deemed to reject the utterance. The
three tables present this data for Phases I, II and III respectively.

1.3.4 Efficiency
An indication of the efficiency of the parser is given in Tables X.4.1,
X.4.2 and X.4.3. Each is in the same format and shows for each of the
three sets of utterances (Dynix, Lotus and Trados) the total time taken
to attempt an analysis of all utterances, together with the average time
taken to accept, or to reject, an utterance. Once again the three tables
correspond to Phases I, II and III.
The type of machine used for testing is also specified in each chapter.
While these tables only constitute a guide to performance, it is still
worth noting that parse times for different systems vary from fractions
of a second on a slow machine up to hours on a very fast one. The
reason for including both average time to accept and average time to
reject is that many systems are much slower at rejecting utterances than
at accepting them. This is because a parser can accept an utterance as
soon as it finds an interpretation of it, whereas to reject it, all possible
interpretations must first be tried.
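By way of illustration only (this is not code from any of the participating groups, and the timing figures are invented), the three quantities reported in each of Tables X.4.1 to X.4.3 can be derived from per-utterance records like this:

# Hypothetical per-utterance results: (seconds taken, utterance accepted?).
results = [(0.4, True), (1.2, True), (7.5, False), (0.3, True)]

total_time = sum(t for t, _ in results)
accept_times = [t for t, ok in results if ok]
reject_times = [t for t, ok in results if not ok]

avg_accept = sum(accept_times) / len(accept_times) if accept_times else 0.0
avg_reject = sum(reject_times) / len(reject_times) if reject_times else 0.0
print(f"total {total_time:.1f}s, avg accept {avg_accept:.2f}s, avg reject {avg_reject:.2f}s")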

1.3.5 Accuracy of Analysis


Tables X.5.1, X.5.2 and X.5.3 provide an analysis of the ability of the
parsing system to perform the syntactic analyses A to G which were
discussed earlier. Once again the tables correspond to the three phases
of the study.
The way in which the percentages in Tables X.5.1, X.5.2 and X.5.3
are computed is now defined. If an utterance can be recognised, then
we compute its scores as follows: First, we determine how many instances
of the particular construction it has, calling the answer I. Second, we
determine how many of those are correctly recognised by the parser,
calling the answer J.
If an utterance can not be recognised then we determine its scores
as follows: First, we determine how many instances of the particular
construction it has, calling the answer I. Second, J is considered to have
the value zero, because by definition the parser did not find any instances
of the construction.
We now determine the figure in the table for each column as follows.
First, we compute the sum of the Is for each utterance u, Σu Iu, and the
sum of the Js for each utterance u, Σu Ju. Second, we compute the value:

    (Σu Ju / Σu Iu) × 100
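Expressed as a short sketch (an illustration of the definition above, not code used in the study), each utterance contributes a pair (I, J), with J forced to zero for rejected utterances, and the table entry is the ratio of the two sums:

# (I, J) pairs for one construction and one parser; J is 0 for any
# utterance the parser rejected. The values below are illustrative only.
pairs = [(3, 2), (1, 1), (2, 0)]

score = 100.0 * sum(j for _, j in pairs) / sum(i for i, _ in pairs)
print(f"{score:.0f}%")   # (2 + 1 + 0) / (3 + 1 + 2) * 100 = 50%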
In considering these tables, and indeed X.1 and X.2 also, it is im-
portant to bear in mind that the interpretation of each analysis can
differ from system to system, depending on the grammatical formalism
on which it is based. Even relative to a particular formalism, the results
of the analysis can vary depending on what is considered the `correct'
interpretation of particular utterances. However, these tables do give an
indication of how well the various systems perform on different types of
syntactic analysis. In addition, problems of their interpretation relative
to particular systems are fully discussed in the accompanying texts.

1.4 Structure of the Book


1.4.1 Introductory Chapters
Two introductory chapters preface those relating to specific parsers. The
first, by Dekang Lin, justifies the use of a dependency notation as a basis
for parser evaluation, and assesses the extent to which it is applicable to
the output of the eight parsers described in this book. The second intro-
ductory chapter, by Eric Atwell, is a comparative analysis of the output
data produced by the IPSM parsers, relating this both to dependency
notation and other forms which have been proposed as standards. The

Lin and Atwell chapters taken together are an attempt to move forward
the debate relating to the standardisation of parse data to facilitate both
the evaluation of parsers and their integration into language engineering
systems.

1.4.2 Parsing Chapters


Each parser is discussed in a separate chapter which is organised around
a fixed set of headings. The content of each section is outlined below:
- Introduction: A brief introduction to the chapter.
- Description of Parsing System: An outline of the parsing system, including the algorithms used and the underlying linguistic formalisms involved.
- Parser Evaluation Criteria: A discussion of any parser-specific issues which had to be addressed during the process of evaluation. This is an important topic because not all criteria were applicable to all parsers. For example it is not possible to measure the accuracy of prepositional phrase attachment if a system is not designed to identify prepositional phrases.
- Analysis I: Original Grammar, Original Vocabulary: The results of Analysis I when applied to the parser. (See Section 1.3.1 above for discussion.)
- Analysis II: Original Grammar, Additional Vocabulary: The results of Analysis II when applied to the parser. (See Section 1.3.1 above for discussion.)
- Analysis III: Altered Grammar, Additional Vocabulary: The results of Analysis III when applied to the parser. (See Section 1.3.1 above for discussion.)
- Converting Parse Tree to Dependency Notation: A discussion of the problems encountered when an attempt was made to translate parse trees into a dependency form.
- Summary of Findings: A general summary of the findings relating to the parser study as a whole.
- References: A list of bibliographic references. These are also collated at the end of the volume.

1.4.3 Appendices
Appendix I lists the 60 test sentences which were used for the analysis
described in this book. Appendix II gives a sample of parse trees as
produced by the eight parsers. Finally, Appendix III is a collated list of
all bibliographic references which occur within the book.

1.5 Discussion
In this section we make some concluding remarks relating to the project
as a whole. Firstly, carrying out the work within a single text domain
has proved useful in a number of respects. One of the most interesting
findings of the analysis of utterance type in the original IPSM Corpus
(Table 1.1) is that 43% of utterances are not in fact sentences at all.
Nevertheless we wish to be able to analyse them accurately. This implies
that an effective robust parser must not be tied to traditional notions of
grammaticality.
While much of the corpus is regular, constructs occasionally occur
which can not reasonably be analysed by any parser. The ability to
return partial analyses in such cases is extremely valuable.
Secondly, progress has been made towards our original goal of pro-
viding a direct comparison between different parsers. For example the
parse trees of Appendix II provide much useful information regarding
the characteristics of the different systems which goes beyond what is
discussed in the text. On the other hand, the range of parsing algorithms
presented here is extremely wide which means that there are very few lin-
guistic assumptions common to all systems. For example, when we talk
about a `noun phrase' each participant conjures up a different concept.
Direct comparisons between systems are therefore difficult.
Tables X.5.1, X.5.2 and X.5.3 provide useful and interesting data
regarding the efficacy of the different parsers. However, each participant
has had to make a different set of linguistic assumptions in order to
provide this information. Ideally we would like to have constrained the
process more and to have based the results on a larger set of utterances.
This might be accomplished in future by focusing on a task such as
predicate-argument extraction which is closely related to parsing and
can also be assessed automatically.
In conclusion, IPSM has proved to be an interesting and constructive
exercise.

1.6 Acknowledgements
This introduction would not be complete without acknowledging the help
of many people. The most important of these are:
- The copyright holders of the Dynix, Lotus and Trados software manuals, for allowing extracts from their documents to be used in the research;
- Helen J. Wybrants of Dynix Library Systems, Michael C. Ferris of Lotus Development Ireland and Matthias Heyn of Trados GmbH for making the manuals available in machine-readable form;
- The National Software Directorate of Ireland (Director Barry Murphy, PAT Director Seamus Gallen) for funding some of the work under the project `Analysing Free Text with Link Grammars';
- DGXIII/E5 of the European Commission who provided travel assistance to some of the participants under the project SIFT (LRE-62030) (Programme Manager Roberto Cencioni, Project Officers Chris Garland and Lidia Pola);
- Denis Hickey, Tony Molloy and Redmond O'Brien who solved innumerable technical problems at Limerick relating to organisation of the IPSM workshop;
- The contributors to this volume, all of whom carried out two analyses and wrote two completely different articles describing their results.

1.7 References
Cunningham, H., Wilks, Y., & Gaizauskas, R. (1996). GATE - a Gen-
eral Architecture for Text Engineering. Proceedings of the 16th Con-
ference on Computational Linguistics (COLING-96).
Dynix (1991). Dynix Automated Library Systems Searching Manual.
Evanston, Illinois: Ameritech Inc.
Hyland, P., Koch, H.-D., Sutcliffe, R. F. E., & Vossen, P. (1996). Se-
lecting Information from Text (SIFT) Final Report (LRE-62030 De-
liverable D61). Luxembourg, Luxembourg: Commission of the Euro-
pean Communities, DGXIII/E5. Also available as a Technical Report.
Limerick, Ireland: University of Limerick, Department of Computer
Science and Information Systems.
Lin, D. (1995). A dependency-based method for evaluating broad-
coverage parsers. Proceedings of IJCAI-95, Montreal, Canada, 1420-
1425.
Lotus (1992). Lotus Ami Pro for Windows User's Guide Release Three.
Atlanta, Georgia: Lotus Development Corporation.
Sutcliffe, R. F. E., Koch, H.-D., & McElligott, A. (Eds.) (1995). Proceed-
ings of the International Workshop on Industrial Parsing of Software
Manuals, 4-5 May 1995, University of Limerick, Ireland (Technical
Report). Limerick, Ireland: University of Limerick, Department of
Computer Science and Information Systems, 3 May, 1995.
Trados (1995). Trados Translator's Workbench for Windows User's
Guide. Stuttgart, Germany: Trados GmbH.
2
Dependency-Based Parser
Evaluation: A Study with
a Software Manual Corpus
Dekang Lin1
University of Manitoba

1 Address: Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada, R3T 2N2. Tel: +1 204 474 9740, Fax: +1 204 269 9178, Email: lindek@cs.umanitoba.ca.
2.1 Introduction
With the emergence of broad-coverage parsers, quantitative evaluation
of parsers becomes increasingly more important. It is generally accepted
that parser evaluation should be conducted by comparing the parser-
generated parse trees (we call them answers) with manually constructed
parse trees (we call them keys). However, how such comparison should
be performed is still subject to debate. Several proposals have been put
forward (Black, Abney, Flickenger, Gdaniec, Grishman, Harrison, Hin-
dle, Ingria, Jelinek, Klavans, Liberman, Marcus, Roukos, Santorini &
Strzalkowski, 1991; Black, Lafferty & Roukos, 1992; Magerman, 1994),
all of which are based on comparison of phrase boundaries between an-
swers and keys.
There are several serious problems with the phrase boundary eval-
uations. First, the ultimate goal of syntactic analysis is to facilitate
semantic interpretation. However, phrase boundaries do not have much
to do with the meaning of a sentence. Consider the two parse trees for a
sentence in the software manual corpus shown at the top of Figure 2.1.
There are four phrases in the answer and three in the key, as shown at
the bottom of the figure.

Answer:

(CP
  (Cbar
    (IP
      (NP
        (Det A)
        (Nbar
          (N BIB)
          (CP
            Op[1]
            (Cbar
              (IP
                (NP
                  (Nbar
                    (N summary)))
                (Ibar
                  (VP
                    (Vbar
                      (V
                        (V_NP
                          (V_NP screen)
                          t[1]))))))))))
      (Ibar
        (VP
          (Vbar
            (V appears)))))))

Key:

(CP
  (Cbar
    (IP
      (NP
        (Det A)
        (Nbar
          (N BIB summary screen)))
      (Ibar
        (VP
          (Vbar
            (V appears)))))))

Phrases in Answer:                Phrases in Key:

summary screen
BIB summary screen                BIB summary screen
A BIB summary screen              A BIB summary screen
A BIB summary screen appears      A BIB summary screen appears

Figure 2.1: Two parse trees of \"A BIB summary screen appears\".

According to the phrase boundary method proposed in Black, Abney
et al. (1991), the answer has no crossing brackets, 100% recall and 75%
precision, which are considered to be very good scores. However, the
answer treats \"screen\" as a verb and \"summary screen\" as a relative
clause modifying the noun \BIB." This is obviously a very poor analysis
and is unlikely to lead to a correct interpretation of the sentence. Therefore,
parse trees should be evaluated according to more semantically relevant
features than phrase boundaries.
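For readers unfamiliar with these measures, the following sketch (ours, not Lin's) reproduces the scores quoted above from the phrase lists at the bottom of Figure 2.1: recall is the proportion of key phrases that also appear in the answer, and precision is the proportion of answer phrases that also appear in the key. A crossing bracket would be an answer phrase that partially overlaps a key phrase without either containing the other; there are none in this example.

# Phrases from Figure 2.1. In practice phrases are compared as word-index
# spans rather than strings, but strings are enough for this example.
answer = {"summary screen", "BIB summary screen",
          "A BIB summary screen", "A BIB summary screen appears"}
key = {"BIB summary screen", "A BIB summary screen",
       "A BIB summary screen appears"}

matched = answer & key
recall = 100.0 * len(matched) / len(key)        # 3/3 = 100%
precision = 100.0 * len(matched) / len(answer)  # 3/4 = 75%
print(f"recall {recall:.0f}%, precision {precision:.0f}%")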
Another problem with phrase boundary evaluation is that many differences in phrase boundaries are caused by systematic differences between
different parsing schemes or theories. For example, Figure 2.2 shows two
parse trees for the same sentence. The first one is from the SUSANNE
corpus (Sampson, 1995), the second one is the output of PRINCIPAR
(Lin, 1994). Although the two parse trees look very different, both of
them are correct according to their own theory.

a. SUSANNE parse tree:

(S
  (Ns:s
    (AT The)
    (Nns (NP1s Maguire))
    (NN1n family))
  (Vsu
    (VBDZ was)
    (VVGv setting))
  (R:n (RP up))
  (Ns:o
    (AT1 a)
    (JJ separate)
    (NNL1cb camp))
  (R:p (RL nearby)))

b. PRINCIPAR parse tree:

(CP (Cbar (IP
  (NP
    (Det The)
    (Nbar
      (N Maguire)
      (N family)))
  (Ibar
    (Be was)
    (VP (Vbar
      (V (V_[NP]
        (V_[NP] setting up)
        (NP
          (Det a)
          (Nbar
            (AP (Abar (A separate)))
            (N camp)))))
      (AP (Abar (A nearby)))))))))

Figure 2.2: Two different phrase structure analyses of the same sentence.

An evaluation scheme should not arbitrarily prefer one and penalize the other.

2.2 Dependency-Based Evaluation


In Lin (1995), we proposed a dependency-based evaluation method.
Since semantic dependencies are embedded in syntactic dependencies,
the results of the dependency-based evaluation are much more meaning-
ful than those of phrase boundary methods. Furthermore, it was shown
in Lin (1995) that many systematic differences among different theories
can be eliminated by rule-based transformation on dependency trees.
In the dependency-based method, the parser outputs and treebank
parses are first converted into dependency trees (Mel'cuk, 1987), where
every word is a modifier of exactly one other word (called its head or
modifiee), unless the word is the head of the sentence, or the head of a
fragment of the sentence in the case where the parser failed to find a
complete parse. Figures 2.3a and 2.3b depict the dependency trees
corresponding to Figures 2.2a and 2.2b respectively. An algorithm for
transforming constituency trees into dependency trees was presented in
Lin (1995).
A dependency tree is made up of a set of dependency relationships.
A dependency relationship consists of a modifier, a modifiee and
(optionally) a label that specifies the type of the dependency relationship.

Figure 2.3: Example dependency trees. (a) A dependency tree converted from the SUSANNE parse of Figure 2.2a; (b) a dependency tree converted from the PRINCIPAR parse of Figure 2.2b.

Since a word may participate as the modifier in at most one dependency
relationship, we may treat the modifiee in a dependency relationship as
the tag assigned to the modifier. Parser outputs can then be scored
on a word-by-word basis similar to the evaluation of the part-of-speech
tagging results.
For each word in the answer, we can classify it into one of the four
categories:
- if it modifies the same word in the answer and in the key, or it modifies no other word in both the answer and the key, it is considered to be correct.
- if it modifies a different word in the answer than in the key, it is considered to be incorrect.
- if the word does not modify any word in the answer, but modifies a word in the key, then it is missing a modifiee.
- if the word does not modify any word in the key, but modifies a word in the answer, then it has a spurious modifiee.
For example, if we compare the two dependency trees in Figures 2.3a
and 2.3b, all the words are correct, except the word nearby which has
different modifiees in the key and in the answer (was vs. setting).
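A minimal sketch of this word-by-word scoring (our illustration of the scheme just described, not Lin's implementation): each tree is reduced to a mapping from a word to its modifiee, with None for a word that modifies nothing.

def score_dependencies(answer, key):
    # answer and key map each word to its modifiee (None = modifies nothing).
    tallies = {"correct": 0, "incorrect": 0, "missing": 0, "spurious": 0}
    for word, a in answer.items():
        k = key.get(word)
        if a == k:                 # same modifiee, or no modifiee in both
            tallies["correct"] += 1
        elif a is None:            # no modifiee in answer, but one in key
            tallies["missing"] += 1
        elif k is None:            # modifiee in answer, but none in key
            tallies["spurious"] += 1
        else:                      # different modifiees
            tallies["incorrect"] += 1
    return tallies

# The word `nearby' from Figures 2.3a/2.3b (the other, correct words omitted):
print(score_dependencies({"nearby": "setting"}, {"nearby": "was"}))
# {'correct': 0, 'incorrect': 1, 'missing': 0, 'spurious': 0}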

2.3 Manual Normalization of Parser Outputs

Table 2.1 shows the output formats of the parsers that participated in
the IPSM workshop.

Parser Dependency Constituency Other

ALICE X
ENGCG X
LPARSER X
PRINCIPAR X X
RANLT X
SEXTANT X
DESPAR X
TOSCA X

Table 2.1: Output format of IPSM'95 parsers
Given that the parsing literature is dominated by constituency-based
parsers and all the large tree banks used constituency grammars, it is
surprising to find that there are more dependency-based parsers in the
workshop than constituency-based ones.
In order to apply the dependency-based method to evaluate the par-
ticipating parsers, the workshop participants conducted an experiment
in which each participant manually translated their own parser outputs
for a selected sentence into a dependency format similar to what was used
in PRINCIPAR. For dependency-based parsers, this is quite straightfor-
ward. Essentially the same kind of information is encoded in the outputs
of these parsers. The distinctions are mostly superficial. For example, in
both SEXTANT and DESPAR, words are assigned indices. Dependency
relationships are denoted by pairs of word indices. SEXTANT uses an in-
teger (0 or 1) to specify the direction of the dependency relationship,
whereas DESPAR uses an arrow to indicate the direction. In PRINCI-

DT the 1 --> 2 [
NNS contents 2 --> 6 + SUB
IN of 3 --> 2 ]
DT the 4 --> 5 [
NN clipboard 5 --> 3 +
VBP appear 6 --> 11 ]
IN in 7 --> 6 -
DT the 8 --> 10 [
JJ desired 9 --> 10 +
NN location 10 --> 7 +
. . 11 --> 0 ]

DESPAR Output
94 NP 2 The the DET 0 1 1 (content) DET
94 NP* 2 contents content NOUN 1 0
94 NP 2 of of PREP 2 1 4 (clipboard) PREP
94 NP 2 the the DET 3 1 4 (clipboard) DET
94 NP* 2 Clipboard clipboard NOUN 4 1 1 (content) NNPREP
94 VP 101 appear appear INF 5 1 1 (content) SUBJ
94 NP 3 in in PREP 6 1 9 (location) PREP
94 NP 3 the the DET 7 1 9 (location) DET
94 NP 3 desired desire PPART 8 0
94 NP* 3 location location NOUN 9 2 8 (desire) DOBJ
5 (appear) IOBJ-in
94 -- 0 . . . 10 0

SEXTANT Output
(
(The ~ Det < contents spec)
(contents content N < appear subj)
(of ~ P_ > contents adjunct)
(the ~ Det < Clipboard spec)
(Clipboard ~ N > of comp)
(appear ~ V *)
(in ~ P_ > appear adjunct)
(the ~ Det < location spec)
(desired ~ A < location adjunct)
(location ~ N > in comp)
(. )
)

PRINCIPAR Output
Figure 2.4: The dependency trees for the sentence \The contents
of the Clipboard appear in the desired location."

PAR, the dependency relationships are specified using relative positions
instead of absolute indices of words.

Besides the superficial distinctions between the dependency formats,
significant differences do exist among the representations in SEXTANT,
DESPAR and PRINCIPAR. For example, in SEXTANT, the preposition \"of\" in \"of the clipboard\" is a modifier of \"clipboard,\" whereas in
DESPAR and PRINCIPAR, \"clipboard\" is a modifier of \"of.\" These
differences, however, can be eliminated by the transformations on
dependency trees proposed in Lin (1995).
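As an illustration of the kind of rule involved (our sketch, not the actual transformation rules of Lin (1995), and with label handling simplified), the following fragment rewrites a SEXTANT-style analysis, in which the preposition depends on its object noun, into the DESPAR/PRINCIPAR-style analysis, in which the object noun depends on the preposition:

def normalise_prepositions(deps, pos):
    # deps maps each word to (head, label); pos maps each word to its tag.
    out = dict(deps)
    for word, (head, label) in deps.items():
        if pos.get(word) == "PREP" and pos.get(head) == "NOUN":
            noun_head, noun_label = deps[head]   # the noun's old head
            out[word] = (noun_head, label)       # the preposition attaches there
            out[head] = (word, noun_label)       # the noun now depends on the preposition
    return out

deps = {"of": ("clipboard", "PREP"), "clipboard": ("contents", "NNPREP")}
pos = {"of": "PREP", "clipboard": "NOUN", "contents": "NOUN"}
print(normalise_prepositions(deps, pos))
# {'of': ('contents', 'PREP'), 'clipboard': ('of', 'NNPREP')}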
Experiments on manual translation of constituency trees into dependency trees were also conducted. The main concern was that some
important information gets lost when the trees are translated into
dependency trees, for example the positions of traces and the feature
values of the nodes in the parse trees. Some of the participants felt that
the parsers cannot be fairly compared when this information is thrown
away. Since the feature values, like category symbols, vary a lot from
one parser/grammar to another, the loss of this information is not due to
the transformation into dependency trees, but a necessary consequence
of comparing different parsers. The loss of information about the positions of traces, on the other hand, is a legitimate concern that still needs
to be addressed.

2.4 Automated Transformation from Constituency to Dependency

An algorithm for converting constituency trees into dependency trees
An algorithm for converting constituency trees into dependency trees
is presented in Lin (1995). The conversion algorithm makes use of a
conversion table that is similar to Magerman's Tree Head Table for de-
termining heads (lexical representatives) in CFG parse trees (Magerman,
1994, pp. 64-66).
An entry in the conversion table is a tuple:
(<mother> [<direction>] (<child-1> <child-2> ... <child-k>))

where <mother>, and <child-1>, <child-2>, ..., <child-k> are conditions
on category symbols and <direction> is either `r' (default) or `l'. If a
node in the constituency tree satisfies the condition <mother>, then its
head child is determined in the following two steps:

1. Find the first condition <child-i> in <child-1>, <child-2>, ..., and
<child-k> such that one of the child nodes of the node satisfies <child-i>;

2. If <direction> is l, then the head child is the first child of the node
that satisfies the condition <child-i>; or if <direction> is r or is absent,
it is the last child of the node that satisfies the condition <child-i>.

(N2+/DET1a a
(N2-
(N1/APMOD1
(A2/ADVMOD1/- (A1/A (A/COMPOUND bib summary)))
(N1/N screen))))

Figure 2.5: A RANLT parse tree

The condition (reg-match <reg-exp> ... <reg-exp>) is satisfied if
a prefix of the category symbol matches one of the regular expressions.
For example:

(reg-match "[0-9]+\$" "[A-Z]")

returns true if its argument is an integer or a capitalized word. If the
condition is an atom, then it is a shorthand for the condition (reg-match
<atom>), e.g., N is a shorthand for (reg-match N).
For example, the following table was used to convert RANLT parse
trees into dependency trees.
((N (N))
(V l ((lexical) V))
(S l (V S))
(A (A))
(P l (P))
((t) l ((t))))

The entry (N (N)) means that the head child of a node whose label
begins with N, such as N2- and N2+/DET1a, is the rightmost child whose
label begins with N. The entry (V l ((lexical) V)) means that the
head child of a node beginning with V is the leftmost lexical child (i.e.,
a word in the sentence), or the leftmost child beginning with V if it
does not have a lexical child. The condition (t) is satisfied by any node.
Therefore, the last entry ((t) l ((t))) means that if a node's label
does not match any of the above entries, then its head child is its
leftmost child.
Consider the parse tree in Figure 2.5. The node N2+/DET1a has
two children: a lexical node \a" and N2-. The entry (N (N)) in the
conversion table dictates that N2- is the head child. Similarly, the head
child of N2- is N1/APMOD1 and the head child of N1/APMOD1 is N1/N. The
head child of N1/N is its only child: the lexical node screen. Once the
head child of each node is determined, the lexical head of the node is the

lexical head of its head child and the dependency tree can be constructed
as follows: for each pair of a head child and a non-head child of a node,
there is a dependency relationship between the lexical head of the head
child and the lexical head of the non-head child.
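The following sketch is our reading of the procedure just described, applied to the RANLT conversion table and the parse tree of Figure 2.5. The condition matcher is simplified (prefix matching stands in for reg-match), and the fallback when no child condition matches (taking the leftmost child) is our assumption rather than something stated in the text.

# A node is (label, [children]); a leaf is just a word string.
tree = ("N2+/DET1a", ["a",
        ("N2-", [("N1/APMOD1",
                  [("A2/ADVMOD1/-", [("A1/A", [("A/COMPOUND", ["bib", "summary"])])]),
                   ("N1/N", ["screen"])])])])

# Simplified RANLT conversion table: (mother prefix, direction, child conditions).
TABLE = [("N", "r", ["N"]), ("V", "l", ["(lexical)", "V"]), ("S", "l", ["V", "S"]),
         ("A", "r", ["A"]), ("P", "l", ["P"]), ("(t)", "l", ["(t)"])]

def matches(cond, node):
    if cond == "(t)":
        return True
    if cond == "(lexical)":
        return isinstance(node, str)
    return (not isinstance(node, str)) and node[0].startswith(cond)

def lexical_head(node, deps):
    if isinstance(node, str):          # a leaf is its own lexical head
        return node
    label, children = node
    direction, child_conds = "l", ["(t)"]
    for mother, d, conds in TABLE:     # first table entry whose mother matches
        if matches(mother, node):
            direction, child_conds = d, conds
            break
    head_child = None
    for cond in child_conds:           # first condition matched by some child
        hits = [c for c in children if matches(cond, c)]
        if hits:
            head_child = hits[0] if direction == "l" else hits[-1]
            break
    if head_child is None:             # assumed fallback: leftmost child
        head_child = children[0]
    head_word = lexical_head(head_child, deps)
    for child in children:             # non-head children modify the head word
        if child is not head_child:
            deps.append((lexical_head(child, deps), head_word))
    return head_word

deps = []
print(lexical_head(tree, deps), deps)
# screen [('summary', 'bib'), ('bib', 'screen'), ('a', 'screen')]

Under these assumptions the word screen heads the whole phrase, with a and bib modifying screen and summary modifying bib; a fuller implementation following Lin (1995) would use reg-match conditions and handle lexical children and compounds explicitly.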
The algorithm was also applied to TOSCA parses. An example of a
TOSCA parse tree is shown as follows:
<tparn fun=UTT cat=S att=(act,decl,indic,intr,pres,unm)>
<tparn fun=SU cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(indef)> A
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> BIB summary screen
</tparn></tparn>
<tparn fun=V cat=VP att=(act,indic,intr,pres)>
<tparn fun=MVB cat=LV att=(indic,intr,pres)> appears
</tparn></tparn></tparn>

The nodes in the parse are annotated with functional categories (fun),
syntactic categories (cat) and attribute values (att). We first used the
following UNIX sed script to transform the representation into LISP-like
list structures and remove functional categories and attribute values:
sed -e '
/(/s//*LRB*/g
/)/s//*RRB*/g
/\(<tparn .* \)cat=\([^ ]*\)[^>]*>/s//(\2/g
/<\/tparn>/s//)/g
'

The output of the sed script for the above example parse is the following:
(S
(NP
(DTP
(ART A
))
(N BIB summary screen
))
(VP
(LV appears
)))

The transformed parse tree is then converted to dependency structure
with the following conversion table:

((N (COORD N))
 (V (COORD V))
 (S l (COORD V S))
 (P l (COORD P))
 (CL l (COORD V N))
 ((t) l ((t))))

The resulting dependency tree is:


(
(A ~ ART < screen)
(BIB ~ U < screen)
(summary ~ U < screen)
(screen ~ N < appears)
(appears ~ LV *)
)

2.5 Conclusion
Parser evaluation is a very important issue for broad-coverage parsers.
We pointed out several serious problems with the phrase boundary based
evaluation methods and proposed a dependency based alternative. The
dependency-based evaluation not only produces more meaningful scores,
but also allows both dependency and constituency based parsers to be
evaluated. We used the RANLT and TOSCA outputs as examples to
show that constituency based parses can be automatically translated
into dependency trees with a simple conversion table. This provides
further evidence that the dependency-based evaluation method is able
to accommodate a wide range of parsers.

2.6 References
Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Har-
rison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman,
M., Marcus, M., Roukos, S., Santorini, B., & Strzalkowski, T. (1991).
A procedure for quantitatively comparing the syntactic coverage of
English grammars. Proceedings of the Speech and Natural Language
Workshop, DARPA, February 1991, 306-311.
Black, E., Lafferty, J., & Roukos, S. (1992). Development and evaluation
of a broad-coverage probabilistic grammar of English-language com-
puter manuals. Proceedings of ACL-92, Newark, Delaware, 185-192.
Lin, D. (1994). Principar - an efficient, broad-coverage, principle-based
parser. Proceedings of COLING-94, Kyoto, Japan, 482-488.

Lin, D. (1995). A dependency-based method for evaluating broad-coverage
parsers. Proceedings of IJCAI-95, Montreal, Canada, 1420-1425.
Magerman, D. M. (1994). Natural Language Parsing as Statistical Pat-
tern Recognition. Ph.D. thesis, Stanford University.
Mel'cuk I. A. (1987). Dependency syntax: theory and practice. Albany,
NY: State University of New York Press.
Sampson, G. (1995). English for the Computer: the SUSANNE Corpus
and Analytic Scheme. Oxford, UK: Clarendon Press.
3
Comparative Evaluation of
Grammatical Annotation Models
Eric Steven Atwell1
University of Leeds

3.1 Introduction
The objective of the IPSM Workshop was to empirically evaluate a num-
ber of robust parsers of English, in essence by giving each parser a com-
mon test-set of sentences, and counting how many of these sentences each
parser could parse correctly. Unfortunately, what counts as a `correct'
parse is different for each parser, as the output of each is very different in
both format and content: they each assume a different grammar model
or parsing scheme for English. This chapter explores these differences in
parsing schemes, and discusses how these differences should be taken into
account in comparative evaluation of parsers. Chapter 2 suggests that
one way to compare parser outputs is to convert them to a dependency
structure. Others (e.g. Atwell, 1988; Black, Garside & Leech, 1993)
have advocated mapping parses onto simple context-free constituency
structure trees. Unfortunately, in mapping some parsing schemes onto
1 Address: Centre for Computer Analysis of Language And Speech (CCALAS),
Artificial Intelligence Division, School of Computer Studies, The University of Leeds,
LEEDS LS2 9JT, Yorkshire, England. Tel: +44 113 2335761, Fax: +44 113 2335468,
Email: eric@scs.leeds.ac.uk, WWW: http://agora.leeds.ac.uk/ccalas/. I gratefully
acknowledge the UK Engineering and Physical Sciences Research Council (EPSRC)
for funding the AMALGAM project; the UK Higher Education Funding Councils'
Joint Information Systems Committee New Technologies Initiative (HEFCs' JISC
NTI) for funding the NTI-KBS/CALAS project and my participation in the IPSM
Workshop, and the EU for funding my participation in the 1996 EAGLES Text
Corpora Working Group Workshop. I also gratefully acknowledge the contributions
of co-researchers on the AMALGAM project, John Hughes and Clive Souter, and the
various contributors to the AMALGAM MultiTreebank including John Carroll, Alex
Fang, Geoffrey Leech, Nelleke Oostdijk, Geoffrey Sampson, Tim Willis, and (last but
not least!) all the contributors to this book.

this kind of `lowest common factor', a lot of syntactic information is lost;


this information is vital to some applications.
The differences between parsing schemes are a central issue in the
project AMALGAM: Automatic Mapping Among Lexico-Grammatical
Annotation Models. The AMALGAM project at Leeds University is
investigating the problem of comparative assessment of rival syntac-
tic analysis schemes. The focus of research is the variety of lexico-
grammatical annotation models used in syntactically-analysed Corpora,
principally those distributed by ICAME, the International Computer
Archive of Modern English based at Bergen University. For more de-
tails, see Atwell, Hughes and Souter (1994a, 1994b), Hughes and Atwell
(1994), Hughes, Souter and Atwell (1995), Atwell (1996), AMALGAM
(1996) and ICAME (1996).
Standardisation of parsing schemes is also an issue for the European
Union-funded project EAGLES: Expert Advisory Group on Language
Engineering Standards (EAGLES, 1996). Particularly relevant is the
`Final Report and Guidelines for the Syntactic Annotation of Corpora'
(Leech, Barnett & Kahrel, 1995);2 this proposes several layers of recom-
mended and optional annotations, in a hierarchy of importance.

3.2 Diversity in Grammars


The parsers in this book are diverse, in that they use very different algorithms to find parse-trees. However, to a linguist, the differences in
underlying grammars or parsing schemes are more important. The differences are not simply matters of representation or notation (although
these alone cause significant problems in evaluation, e.g. in alignment).
A crucial notion is delicacy or level of detail in grammatical classification. This chapter explores some possible metrics of delicacy, applied to
comparative evaluation of the parsing schemes used in this book.
Delicacy of parsing scheme clearly impinges on the accuracy of a
parser. A simple evaluation metric used for parsers in this book is to
count how often the parse-tree found is `correct', or how often the `cor-
rect' parse-tree is among the set or forest of trees found by the parser.
However, this metric is unfairly biased against more sophisticated gram-
mars, which attempt to capture more fine-grained grammatical distinc-
tions. On the other hand, this metric would favour an approach to
syntax modelling which lacks this delicacy. Arguably it is not sensible
to seek a scale of accuracy applicable across all applications, as different
applications require different levels of parsing; see, for example, Souter
2 DISCLAIMER: My description of the EAGLES guidelines for the syntactic an-
notation of corpora is based on the PRE-RELEASE FINAL DRAFT version of this
Report, dated July 31st 1995; the nal version, due for publication in 1996, may
include some changes.
Grammatical Annotation Models 27

and Atwell (1994). For some applications, a skeletal parser is sucient,


so we can demand high accuracy: for example, n-gram grammar mod-
elling for speech or script recognition systems (see next section); parsing
corpus texts prior to input to a lexicographer's KWIC workbench; or
error-detection in Word Processor text. For these applications, parsing
is simply an extra factor or guide towards an improved `hit rate' - all
could still work without syntactic analysis and annotation, but perform
better with it. Other applications require detailed syntactic analysis,
and cannot function without this; for example, SOME (but by no means
all!) NLP systems assume that the parse-tree is to be passed on to a se-
mantic component for knowledge extraction, a process requiring a richer
syntactic annotation.

3.3 An Extreme Case: the `Perfect Parser' from Speech Recognition

The variability of delicacy is exemplified by one approach to parsing
which is widely used in Speech And Language Technology (SALT). Most
large-vocabulary English speech recognition systems use a word N-gram
language model of English grammar: syntactic knowledge is captured
in a large table of word bigrams (pairs), trigrams (triples), ... N-grams
(see surveys of large-vocabulary speech recognition systems, e.g. HLT
Survey, 1995; comp.speech, 1996). This table is extracted or learnt from
a training corpus, a representative set of texts in the domain of the
speech recogniser; training involves making a record of every N-gram
which appears in the training text, along with its frequency (e.g. in this
Chapter the bigram `recognition systems' occurs 4 times). The `grammar'
does not make use of phrase-structure boundaries, or even word-classes
such as Noun or Verb. The job of the `parser' is not to compute a parse-
tree for an input sentence, but to estimate a syntactic probability for the
input word-sequence. The `parser' is guaranteed to come up with SOME
analysis (i.e. syntactic probability estimate) for ANY input sentence; in
this sense it is a `perfect' parser, outperforming all the other parsers in
this book.
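
By way of illustration, the sketch below shows the core of such a word
N-gram `parser' in Python; the toy training corpus and the smoothing
constant are invented for the example and are not taken from any of the
systems discussed in this book. It never rejects an input: every word
sequence receives some probability estimate.

from collections import Counter

# Toy training corpus standing in for the recogniser's domain texts.
TRAINING = [
    "select the text you want to protect",
    "select the text you want to copy",
    "save the file before you exit",
]

bigrams = Counter()
unigrams = Counter()
for sentence in TRAINING:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def sentence_probability(sentence, smoothing=0.01):
    # Never fails: unseen bigrams get a small smoothed probability,
    # so ANY input word sequence receives an estimate.
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocabulary = len(unigrams) + 1
    p = 1.0
    for w1, w2 in zip(words[:-1], words[1:]):
        p *= (bigrams[(w1, w2)] + smoothing) / (unigrams[w1] + smoothing * vocabulary)
    return p

print(sentence_probability("select the text you want to protect"))
print(sentence_probability("protect want you text the select"))   # much lower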
However, this sort of `parsing' is inappropriate for many IPSM ap-
plications, where the assumption is that some sort of parse-tree is to be
passed on to a semantic component for knowledge extraction. In lin-
guistic terms, the Speech Recognition grammar model has insufficient
delicacy (or no delicacy at all!).

3.4 The Corpus as Empirical Definition of Parsing Scheme

A major problem in comparative evaluation of parsing schemes is pinning
down the DEFINITIONS of the parsing schemes in question. Generally
the parser is a computer program which can at least in theory be di-
rectly examined and tested; we can evaluate the algorithm as well as
the output. Parsing schemes tend to be more intangible and ephemeral:
generally the parsing scheme exists principally in the mind of the ex-
pert human linguist, who decides on issues of delicacy and correctness
of parser output. For most of the syntactically-analysed corpora covered
by the AMALGAM project, we have some `manual annotation hand-
book' with general notes for guidance on definitions of categories; but
these are not rigorously formal or definitive, nor are they all to the same
standard or level of detail. For the AMALGAM project, we were forced
to the pragmatic decision to accept the tagged/parsed Corpus itself as
definitive of the tagging/parsing scheme for that Corpus. For example,
for Tagged LOB, (Johansson, Atwell, Garside & Leech, 1986) constitutes
a detailed manual, but for the SEC parsing scheme we have to rely on a
list of categories and some examples of how to apply them; so we took
the LOB and SEC annotated corpora themselves as definitive examples
of respective syntactic analysis schemes.
Another reason for relying on the example data rather than ex-
planatory manuals is the limitation of the human mind. Each lexico-
grammatical annotation model for English is so complex that it takes an
expert human linguist a long time, months or even years, to master it.
For example, the definition of the SUSANNE parsing scheme is over 500
pages long (Sampson, 1995). To compare a variety of parsing schemes
via such manuals, I would have to read, digest and comprehensively
cross-reference several such tomes. Perhaps a couple of dozen linguists
in the world could realistically claim to be experts in two rival Corpus
parsing schemes, but I know of none who are masters of several. I have
been forced to the conclusion that it is unreasonable to ask anyone to
take on such a task (and I am not about to volunteer myself!).
This pragmatic approach is also necessary with the parsing schemes
used in this book. Not all the parsing schemes in use have detailed
definition handbooks, as far as I am aware; at the very least, I do not
have access to all of them. So, comparative evaluation of parsing schemes
must be based on the small corpus of test parse-trees presented at the
IPSM workshop. Admittedly this only constitutes a small sample of
each parsing scheme, but hopefully the samples are comparable subsets
of complete grammars, covering the same set of phrase-types for each
parsing scheme. This should be sufficient to at least give a relative
indicator of delicacy of the parsing schemes.

3.5 Towards a MultiTreebank


One advantage of the IPSM exercise is that all parsers were given the
same sentences to parse, so we have directly-comparable parses for given
sentences; the same is not true for ICAME parsed corpora, also called
treebanks. Even if we assume that, for example, the Spoken English
Corpus (SEC) treebank (Taylor & Knowles, 1988) embodies the defini-
tion of the SEC parsing scheme, the Polytechnic of Wales (POW) tree-
bank (Souter, 1989) defines the POW parsing scheme, etc, there is still a
problem in comparing delicacy across parsing schemes. The texts parsed
in each treebank are different, which complicates comparison. For any
phrase-type or construct in the SEC parsing scheme, it is not straight-
forward to see its equivalent in POW: this involves trawling through
the POW treebank for similar word-sequences. It would be much more
straightforward to have a single text sample parsed according to all the
different schemes under investigation, a MultiTreebank. This would al-
low for direct comparisons of rival parses of the same phrase or sentence.
However, creation of such a resource is very difficult, requiring the co-
operation and time of the research teams responsible for each parsed
corpus and/or robust parser.
A first step towards a prototype MultiTreebank was achieved in the
Proceedings of the IPSM workshop, which contained the output of sev-
eral parsers' attempts to parse half a dozen example sentences taken from
software manuals. Unfortunately each sentence caused problems for one
or more of the parsers, so this mini-MultiTreebank has a lot of `holes' or
gaps. As an example for further investigation, I selected one of the short-
est sentences (hence, hopefully, most grammatically straightforward and
uncontroversial), which most parsers had managed to parse:
Select the text you want to protect.

To the example parses produced by IPSM participants, I have been
able to add parses conformant to the parsing schemes of several large-
scale English treebanks, with the assistance of experts in several of these
parsing schemes; see AMALGAM (1996).

3.6 Vertical Strip Grammar: a Standard Representation for Parses

Before we can compare delicacy in the way two rival parsing-schemes
annotate a sentence, we have to devise a parsing-scheme-neutral way
of representing rival parse-trees, or at least of mapping between the
schemes. I predict that most readers will be surprised by the wide diver-
sity of notation used by the parsers taking part in the IPSM workshop;
I certainly was. This can only confuse attempts to compare underlying
grammatical classification distinctions.
This is a major problem for the AMALGAM project. Even Corpora
which are merely wordtagged (without higher syntactic phrase bound-
aries marked) such as BNC, Brown etc, are formatted in a bewildering
variety of ways. As a `lowest common factor', or rather, a `lowest com-
mon anchor-point', each corpus could be visualised as a sequence of word
+ wordtag pairs. Even this simplification raises problems of incompat-
ible alignment and segmentation. Some lexico-grammatical annotation
schemes treat various idiomatic phrases, proper-name-sequences, etc as
a single token or `word'; whereas others split these into a sequence of
words to be assigned separate tags. Some parsing schemes split off cer-
tain affixes as separate lexemes or tokens requiring separate tags; while
others insist that a `word' is any character-sequence delimited by spaces
or punctuation.
However, putting this tokenisation problem to one side, it is useful to
model any wordtagged Corpus as a simple sequence of word + wordtag
pairs. This can be used to build N-gram models of tag-combination
syntax. For full parses, the words in the sentence still constitute a `lowest
common anchor point', so we have considered N-gram-like models of
parse-structures. For example, take the EAGLES basic parse-tree:
[S[VP select [NP the text [CL[NP you NP][VP want [VP to
protect VP]VP]CL]NP]VP] . S]

Words are `anchors', with hypertags between them showing opening
and/or closing phrase boundaries. These hypertags are inter-word gram-
matical tokens alternating with the words, with a special NULL hypertag
to represent absence of inter-word phrase boundary:
[S[VP
select
[NP
the
NULL
text
[CL[NP
you
NP][VP
want
[VP
to
NULL
protect
VP]VP]CL]NP]VP]
.
S]
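
The alternating word/hypertag sequence above can be derived mechan-
ically from the bracketed parse. The following short Python sketch is
my own illustration (it is not part of any IPSM parser): it treats every
whitespace-separated token containing a bracket as hypertag material,
and emits NULL at word boundaries that carry no phrase boundary.

PARSE = ("[S[VP select [NP the text [CL[NP you NP][VP want [VP to "
         "protect VP]VP]CL]NP]VP] . S]")

def word_hypertag_sequence(parse):
    # Bracket material contains '[' or ']'; everything else is a word token.
    out, pending = [], []
    for token in parse.split():
        if "[" in token or "]" in token:
            pending.append(token)
        else:
            if out and not pending:   # word boundary with no phrase boundary
                out.append("NULL")
            out.extend(pending)
            pending = []
            out.append(token)
    out.extend(pending)               # trailing closing brackets
    return out

for item in word_hypertag_sequence(PARSE):
    print(item)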

When comparing rival parses for the same sentence, we can `can-
cel out' the words as a common factor, leaving only the grammatical
information assigned according to the parsing scheme. So, one way to
normalise parse-structures would be to represent them as an alternating
sequence of wordtags and inter-word structural information; this would
render transparent the amount and delicacy of structural classificatory
information. This would allow us to try quantitative comparison metrics,
e.g. the length of the hypertag-string.
However, this way of building an N-gram like model is heavily reliant
on phrase structure bracketing information, and so is not appropriate
for some IPSM parsing schemes, those with few or no explicit phrase
boundaries. The problem is that all the parses do have WORDS in
common, but not all have inter-word bracketing information. An N-
gram-like model which has states for words (but not inter-word states)
may be more general. A variant N-gram-like model which meets this
requirement is a Vertical Strip Grammar (VSG), as used in the Vertical
Strip Parser (O'Donoghue, 1993). In this, a parse-tree is represented as
a series of Vertical Strips from root to leaves. For example, given the
syntax tree:

S________________________________
| |
VP___ |
| | |
| NP_______ |
| | | | |
| | | CL__ |
| | | | | |
| | | NP VP___ |
| | | | | | |
| | | | | VP__ |
| | | | | | | |
select the text you want to protect .

This can be chopped into a series of Vertical Strips, one for each path
from root S to each leaf:
S S S S S S S S
VP VP VP VP VP VP VP .
select NP NP NP NP NP NP
the text CL CL CL CL
NP VP VP VP
you want VP VP
to protect

This Vertical Strip representation is highly redundant, as the top of
each strip shares its path from the root with its predecessor. So, the
VSG representation only records the path to each leaf from the point of
divergence from the previous Strip:

S
VP .
select NP
the text CL
NP VP
you want VP
to protect
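
Both the full strips and this compact form can be computed from an
explicit tree. The sketch below is an illustrative reconstruction in Python
(not code from the Vertical Strip Parser itself): it represents the example
tree as nested tuples, generates one root-to-leaf strip per word, and then
drops the prefix each strip shares with its predecessor, printing one
compact strip per line.

TREE = ("S",
        ("VP", "select",
         ("NP", "the", "text",
          ("CL",
           ("NP", "you"),
           ("VP", "want",
            ("VP", "to", "protect"))))),
        ".")

def strips(node, path=()):
    # One root-to-leaf strip per word.
    if isinstance(node, str):
        yield path + (node,)
    else:
        label, *children = node
        for child in children:
            yield from strips(child, path + (label,))

def compact(all_strips):
    # Keep only the part of each strip below its divergence from the previous one.
    previous = ()
    for strip in all_strips:
        shared = 0
        while (shared < len(strip) - 1 and shared < len(previous) - 1
               and strip[shared] == previous[shared]):
            shared += 1
        yield strip[shared:]
        previous = strip

for strip in compact(list(strips(TREE))):
    print(" ".join(strip))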

This VSG representation captures the grammatical information tied
to each word, in a compact normalised form. Output from the various
parsers can likewise be mapped onto an N-gram-like normalised VSG
form:
Sentence:
select the text you want to protect .

ALICE:

SENT VP-INF
AUX SENT INF-MARK VP-INF
? NP NP VP-ACT to protect
select DET NOUN you want
the text

ENGCG:

@+FMAINV @DN> @OBJ @SUBJ @+FMAINV @INFMARK> @-FMAINV .


V DET N PRON V INFMARK V
select the text you want to protect .
The ENGCG output is unusual in that it provides very detailed word-
category labelling for each word, but only minimal structural informa-
tion. In the above I have omitted the wordclass subcategory information,
e.g.
select: <*> <SVO> <SV> <P/for> V IMP VFIN

LPARSER:

O B
W D C S TO I
v the n you want to v
select text protect

PRINCIPAR:

VP
Vbar
V
V_NP
V_NP NP
select Det Nbar
the N CP
text Op[1]
Cbar
IP
NP Ibar
Nbar VP
N Vbar
you V
V_CP
V_CP CP
want Cbar
IP
PRO
Ibar
Aux VP
to Vbar
V
V_NP
V_NP
protect
t[1]

PLAIN:

ILLOC
command
PROPOS
* DIR_OBJ1
imperat DETER * ATTR_ANY
select definit singula rel_clause
text PRED
SUBJECT * DIR_OBJ2
you present clause
want PROPOS
to protect

RANLT:

VP/NP
select N2+/DET
the N2-
N1/INFM
N1/RELM VP/TO
N1/N S/THATL to VP/NP
text S1a protect
N2+/PRO VP/NP TRACE1
you want E
TRACE1
E

SEXTANT:

VP NP NP VP --
INF 3 * * INF TO 4 .
select DET 1 PRON want to SUBJ .
DET DOBJ you INF
the NOUN protect
text

DESPAR:

8 3 1 5 3 7 5 0
VB DT NN PP VBP TO VB .
select the text you want to protect .

TOSCA:

Unfortunately this was one of only a couple of IPSM test sentences
that the TOSCA parser could not parse, due to the syntactic phe-
nomenon known as `raising': according to the TOSCA grammar, both
the verbs `select' and `protect' require an object, and although in some
deep sense `the text' is the object of both, the TOSCA grammar does
not allow for this construct. However, the TOSCA research team have
kindly constructed a `correct' parse for our example sentence, to com-
pare with others, by parsing a similar sentence and then `hand-editing'
the similar parse-tree. This includes very detailed subclassification in-
formation with each label (see Section 3.7.5, which includes the TOSCA
`correct' parse-tree). For my VSG normalisation I have omitted this:
NOFU,TXTU
UTT,S PUNC,PM
V,VP OD,NP .
MVB,LV DT,DTP NPHD,N NPPO,CL
Select DTCE,ART text SU,NP V,VP OD,CL
the NPHD,PN MVB,LV TO,PRTC LV,VP
you want to MVB,LV
protect

3.7 EAGLES: A Multi-Layer Standard for Syntactic Annotation

This standard representation is still crude and appears unfair to some
schemes, particularly dependency grammar which has no grammatical
classes! Also, it assumes the parser produces a single correct parse-tree
- is it fair to parsers (e.g. RANLT) which produce a forest of possible
parses? It at least allows us to compare parser outputs more directly,
and potentially to combine or merge syntactic information from different
parsers.
Mapping onto a standard format allows us to focus on the substantive
differences between parsing schemes. It turns out that delicacy is not
a simple issue, as different parsers output very different kinds or levels
of grammatical information. This brings us back to our earlier point:
parsing schemes should be evaluated with respect to a given application,
as different applications call for different levels of analysis.
To categorise these levels of grammatical analysis, we need a taxon-
omy of possible grammatical annotations. The EAGLES Draft Report
on parsing schemes (Leech, Barnett & Kahrel, 1995) suggests that these
layers of annotation form a hierarchy of importance, summarised in Table
3.1 at the end of this section.
The Report does not attempt formal definitions or stipulate stan-
dardised labels to be used for all these levels, but it does give some
illustrative examples. From these I have attempted to construct the
layers of analysis for our standard example sentence.

3.7.1 (a) Bracketing of Segments


The Report advocates two formats for representing phrase structure,
which it calls Horizontal Format and Vertical Format; see Atwell (1983).
In both, opening and closing phrase boundaries are shown by square
brackets between words; in horizontal format the text reads horizontally
down the page, one word per line, while in vertical format the text reads
left-to-right across the page, interspersed with phrase boundary brackets:
[[ select [ the text [[ you ][ want [ to protect ]]]]] . ]

3.7.2 (b) Labelling of Segments


This can also be represented compactly in vertical format:
[S[VP select [NP the text [CL[NP you NP][VP want [VP to
protect VP]VP]CL]NP]VP] . S]

The EAGLES report recommends the use of the categories S (Sen-
tence), CL (Clause), NP (Noun Phrase), VP (Verb Phrase), PP (Prepo-
sitional Phrase), ADVP (Adverb Phrase), ADJP (Adjective Phrase).
Although the EAGLES standard does not stipulate any obligatory syn-
tactic annotations, these phrase structure categories are recommended,
while the remaining layers of annotation are optional. Thus the above
EAGLES parse-tree can be viewed as a baseline `lowest common factor'
target for parsers to aim for.

3.7.3 (c) Showing Dependency Relations


The Report notes that: "as far as we know, the ENGCG parser is the
only system of corpus annotation that uses dependency syntax", which
makes the ENGCG analysis a candidate for the de-facto EAGLES stan-
dard for this layer. However, the dependency analysis is only partial
- the symbol > denotes that a word's head follows, and only two such
dependencies are indicated for our example sentence:
> >
select the text you want to protect .
The report cites three traditional ways of representing dependency
analyses graphically; however, the first cited traditional method, us-
ing curved arrows drawn to link dependent words, is equivalent to the
DESPAR method using word-reference numbers:
8 3 1 5 3 7 5 0
1 2 3 4 5 6 7 8
select the text you want to protect .
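
For readers who want to manipulate such analyses, the word-reference
encoding unpacks into explicit dependent-head pairs in a few lines; the
Python fragment below simply re-reads the DESPAR analysis shown
above and is purely illustrative.

words = "select the text you want to protect .".split()
heads = [8, 3, 1, 5, 3, 7, 5, 0]     # head position of each word; 0 = root

for position, (word, head) in enumerate(zip(words, heads), start=1):
    governor = "ROOT" if head == 0 else words[head - 1]
    print(f"{position}: {word} <- {governor}")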

3.7.4 (d) Indicating Functional Labels


The report cites SUSANNE, TOSCA and ENGCG as examples of pars-
ing schemes which include syntactic function labels such as Subject,
Object, Adjunct. In TOSCA output, every node-label is a pair of Func-
tion,Category; for example, SU,NP labels a Noun Phrase functioning as
a Subject. In the ENGCG analysis, function is marked by @:
@+FMAINV @DN> @OBJ @SUBJ @+FMAINV @INFMARK> @-FMAINV .
select the text you want to protect .

3.7.5 (e) Marking Subclassification of Syntactic Segments

Example subclassification features include marking a Noun Phrase as
singular, or a Verb Phrase as past tense. The TOSCA parser has one of
the richest systems of subclassification, with several subcategory features
attached to most nodes, lowercase features in brackets:
NOFU,TXTU()
UTT,S(-su,act,imper,motr,pres,unm)
V,VP(act,imper,motr,pres)
MVB,LV(imper,motr,pres){Select}
OD,NP()
DT,DTP()
DTCE,ART(def){the}
NPHD,N(com,sing){text}
NPPO,CL(+raisod,act,indic,motr,pres,unm,zrel)
SU,NP()
NPHD,PN(pers){you}
V,VP(act,indic,motr,pres)
MVB,LV(indic,motr,pres){want}
OD,CL(-raisod,-su,act,indic,infin,motr,unm,zsub)
TO,PRTCL(to){to}
V,VP(act,indic,infin,motr)
MVB,LV(indic,infin,motr){protect}
PUNC,PM(per){.}
The ENGCG parsing scheme also includes subclassification features
at the word-class level:

"select" <*> <SVO> <SV> <P/for> V IMP VFIN


"the" <Def> DET CENTRAL ART SG/PL
"text" N NOM SG
"you" <NonMod> PRON PERS NOM SG2/PL2
"want" <SVOC/A> <SVO> <SV> <P/for> V PRES -SG3 VFIN
"to" INFMARK>
"protect" <SVO> V INF

3.7.6 (f) Deep or `Logical' Information


This includes traces or markers for extraposed or moved phrases, such as
capturing the information that `the text' is not just the Object of `select'
but also the (raised) Object of `protect'. This is captured by the features
+raisod and -raisod in the above TOSCA parse-tree; by cross-indexing
of Op[1] and t[1] in the PRINCIPAR parse; and by (TRACE1 E) in the
RANLT parse.
3.7.7 (g) Information about the Rank of a Syntactic Unit
The Report suggests that "the concept of rank is applied to general cat-
egories of constituents, words being of lower rank than phrases, phrases
being of lower rank than clauses, and clauses being of lower rank than
sentences". This is not explicitly shown in most of the parser outputs,
beyond the common convention that words are in lowercase while higher-
rank units are in UPPERCASE or begin with an Uppercase letter. How-
ever, I believe that the underlying grammar models used in PRINCIPAR
and RANLT do include a rank hierarchy of nominal units: NP-Nbar-N
in PRINCIPAR, NP-N2-N1-N in RANLT.
3.7.8 (h) Special Syntactic Characteristics of Spoken Language
This layer includes special syntactic annotations for "a range of phe-
nomena that do not normally occur in written language corpora, such
as blends, false starts, reiterations, and filled pauses". As the IPSM
test sentences were written rather than spoken texts, this layer does not
apply to us. However, we have successfully applied the TOSCA and
ENGCG parsers to spoken text transcripts at Leeds in the AMALGAM
research project.

Layer Explanation
(a) Bracketing of segments
(b) Labelling of segments
(c) Showing dependency relations
(d) Indicating functional labels
(e) Marking subclassification of syntactic segments
(f) Deep or `logical' information
(g) Information about the rank of a syntactic unit
(h) Special syntactic characteristics of spoken language

Table 3.1: EAGLES layers of syntactic annotation, forming a
hierarchy of importance.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 3.2: Characteristics used in IPSM parser evaluation.

3.7.9 Summary: a Hierarchy of Importance


Table 3.1 summarises the EAGLES layers of syntactic annotation, which
form a hierarchy of importance. No parsing scheme includes all the layers
(a)-(g) shown in the table; different IPSM parsers annotate with different
subsets of the hierarchy.

3.8 Evaluating the IPSM Parsing Schemes against EAGLES

For the IPSM Workshop, each parsing scheme was evaluated in terms
of "what kinds of structure the parser can in principle recognise". Each
of the chapters after this one includes a table showing which of the
characteristics in Table 3.2 are handled by the parser.
These characteristics are different from the layers of annotation in
the EAGLES hierarchy, Table 3.1. They do not so much characterise
the parsing scheme, but rather the degree to which the parser can apply
it successfully. For example, criterion F does not ask whether the parsing
scheme includes the notion of Prepositional Phrase (all except DESPAR
do, although only PRINCIPAR and TOSCA explicitly use the label PP);
rather it asks whether the parser is `in principle' able to recognise and
attach Prepositional Phrases correctly. Furthermore, most of the char-
acteristics relate to broad categories at the `top' layers of the EAGLES
hierarchy.

Layer a b c d e f g Score
ALICE yes yes no no no no no 2
ENGCG no no yes yes yes no no 3
LPARSER no no yes yes no no no 2
PLAIN yes yes no yes no no no 3
PRINCIPAR yes yes yes no no yes yes 5
RANLT yes yes no no no yes yes 4
SEXTANT yes yes yes yes no no no 4
DESPAR no no yes no no no no 1
TOSCA yes yes no yes yes yes no 5

Table 3.3: Summary Comparative Evaluation of IPSM Gram-
matical Annotation Models, in terms of EAGLES layers of syn-
tactic annotation. Each cell in the table is labelled yes or no to
indicate whether an IPSM parsing scheme includes an EAGLES
layer (at least partially). Score is an indication of how many layers
a parser covers.
Table 3.3 is my alternative attempt to characterise the rival parsing
schemes, in terms of EAGLES layers of syntactic annotation. Each IPSM
parsing scheme is evaluated according to each EAGLES criterion; and
each parsing scheme gets a very crude overall `score' showing how many
EAGLES layers are handled, at least partially.
Note that this is based on my own analysis of output from the IPSM
parsers, and I may have misunderstood some capabilities of the parsers.
PRINCIPAR is unusual in being able to output two parses, to give both
Dependency and Constituency analysis; I have included both in my anal-
ysis, hence its high `score'. The TOSCA analysis is based on the `hand-
crafted' parse supplied by the TOSCA team, given that their parser
failed with the example sentence; I am not clear whether the automatic
parser can label deep or `logical' information such as the raised Object
of protect.

3.9 Summary and Conclusions


In this chapter, I have attempted the comparative evaluation of IPSM
grammatical annotation models or parsing schemes. The first problem
is that the great variety of output formats hides the underlying sub-
stantive similarities and differences. Others have proposed mapping all
parser outputs onto a Phrase-Structure tree notation, but this is ar-
guably inappropriate to the IPSM evaluation exercise, for at least two
reasons:
1. several of the parsers (ENGCG, LPARSER, DESPAR) do not out-
put traditional constituency structures, and
2. most of the parsers output other grammatical information which
does not `fit' and would be lost in a transformation to a simple
phrase-structure tree.
The chapter by Lin proposes the alternative of mapping all parser out-
puts to a Dependency structure, but this is also inappropriate, for similar
reasons:
1. most of the parsers do not output Dependency structures, so to
force them into this minority representation would seem counter-
intuitive; and
2. more importantly, most of the grammatical information output by
the parsers would be lost in the transformation: dependency is only
one of the layers of syntactic annotation identified by EAGLES.
In other words, mapping onto either constituency or dependency
structure would constitute `degrading' parser output to a lowest common
factor, which is a particularly unfair evaluation procedure for parsers
which produce `delicate' analyses, covering several layers in the EAGLES
hierarchy.
As an alternative, I have transformed IPSM parser outputs for a sim-
ple example sentence onto a compromise Vertical Strip Grammar format,
which captures the grammatical information tied to each word, in a com-
pact normalised form. The VSG format is derived from a constituent-
structure tree, but it can accommodate partial structural information
as output by the ENGCG and LPARSER systems. The VSG format
is NOT intended for use in automatic parser evaluation experiments, as
the VSG forms of rival parser outputs are still clearly different,
not straightforwardly comparable. The VSG format is intended as a
tool to enable linguists to compare grammatical annotation models, by
factoring out notational from substantive differences.
The EAGLES report on European standards for syntactic annotation
identifies a hierarchy of levels of annotation. Transforming IPSM parser
outputs to a common notation is a useful exercise, in that it highlights
the differences between IPSM parsing schemes. These differences can be
categorised according to the EAGLES hierarchy of layers of importance.
Table 3.3 in turn highlights the fact that no IPSM parser produces a
`complete' syntactic analysis, and that different parsers output different
(overlapping) subsets of the complete picture.

Layer a b c d e f g Score
ALICE 7 6 0 0 0 0 0 13
ENGCG 0 0 5 4 3 0 0 12
LPARSER 0 0 5 4 0 0 0 9
PLAIN 7 6 0 4 0 0 0 17
PRINCIPAR 7 6 5 0 0 2 1 21
RANLT 7 6 0 0 0 2 1 16
SEXTANT 7 6 5 4 0 0 0 22
DESPAR 0 0 5 0 0 0 0 5
TOSCA 7 6 0 4 3 2 0 22

Table 3.4: Summary Comparative Evaluation of IPSM Gram-
matical Annotation Models, weighted in terms of EAGLES hier-
archy of importance. Each cell in the table is given a weighted
score if the IPSM parsing scheme includes an EAGLES layer (at
least partially). Score is a weighted overall measure of how many
layers a parser covers.
One conclusion is to cast doubt on the value of parser evaluations
based purely on success rates, speeds, etc. without reference to the com-
plexity of the underlying parsing scheme. At the very least, whatever
score each IPSM parser achieves should be modified by a `parsing scheme
coverage' factor. Table 3.3 suggests that, for example, the PRINCIPAR
and TOSCA teams should be given due allowance for the richer an-
notations they attempt to produce. A crude yet topical3 formula for
weighting scores for success rate could be:

overall-score = success-rate * (parsing-scheme-score - 1)

However, I assume this formula would not please everyone, particu-
larly the DESPAR team! This weighting formula can be made even more
controversial by taking the hierarchy of importance at face value, and
re-assigning each yes cell in Table 3.3 a numerical value on a sliding
scale from 7 (a) down to 1 (g), as in Table 3.4. The TOSCA, SEXTANT
and PRINCIPAR parsing schemes appear to be "best" as they cover
more of the "important" layers of syntactic annotation.

3 At the time of writing, UK university researchers are all busy preparing for the
HEFCs' Research Assessment Exercise: all UK university departments are to have
their research graded on a scale from 5 down to 1. RAE will determine future HEFCs'
funding for research; a possible formula is: Funding-per-researcher = N*(Grade-1),
where N is a (quasi-)constant.
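
As a check on these figures, the weighted scores of Table 3.4 can be
recomputed mechanically from the yes/no cells of Table 3.3 using the
7-to-1 sliding scale, and the crude overall-score formula can be applied
on top. The Python sketch below is illustrative only: the success rate
passed to the formula is a made-up placeholder, and `parsing-scheme-score'
is here taken to be the unweighted Table 3.3 score, which is one possible
reading of the formula.

WEIGHTS = dict(zip("abcdefg", (7, 6, 5, 4, 3, 2, 1)))

TABLE_3_3 = {
    "ALICE":     "yes yes no no no no no",
    "ENGCG":     "no no yes yes yes no no",
    "LPARSER":   "no no yes yes no no no",
    "PLAIN":     "yes yes no yes no no no",
    "PRINCIPAR": "yes yes yes no no yes yes",
    "RANLT":     "yes yes no no no yes yes",
    "SEXTANT":   "yes yes yes yes no no no",
    "DESPAR":    "no no yes no no no no",
    "TOSCA":     "yes yes no yes yes yes no",
}

def weighted_score(cells):
    # Sum of layer weights for the layers marked yes (as in Table 3.4).
    return sum(WEIGHTS[layer]
               for layer, cell in zip("abcdefg", cells.split())
               if cell == "yes")

def overall_score(success_rate, cells):
    # The crude formula from the text, with the unweighted Table 3.3 score.
    scheme_score = sum(cell == "yes" for cell in cells.split())
    return success_rate * (scheme_score - 1)

for parser, cells in TABLE_3_3.items():
    print(f"{parser:10} {weighted_score(cells):2}")
# e.g. with a purely hypothetical success rate of 0.5:
print("ALICE overall-score:", overall_score(0.5, TABLE_3_3["ALICE"]))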
A more useful conclusion is that prospective users of parsers should
not take the IPSM parser success rates at face value. Rather, to repeat
the point made in Section 3.2, it is not sensible to seek a scale of accuracy
applicable across all applications. Different applications require different
levels of parsing. Prospective users seeking a parser should first decide
what they want from the parser. If they can frame their requirements in
terms of the layers of annotation in Table 3.1, then they can eliminate
parsers which cannot meet their requirements from Table 3.3. For exam-
ple, the TOSCA parser was designed for use by researchers in Applied
Linguistics and English Language Teaching, who require a complex parse
with labelling similar to grammar conventions used in ELT textbooks.
In practice, of the IPSM participants only the TOSCA parser produces
output suitable for this application, so its users will probably continue
to use it regardless of its comparative `score' in terms of accuracy and
speed.
To end on a positive note, this comparative evaluation of grammatical
annotation schemes would not have been possible without the IPSM
exercise, which generated output from a range of parsers for a common
test corpus of sentences. It is high time for more linguists to take up
this practical, empirical approach to comparing parsing schemes!

3.10 References
AMALGAM. (1996). WWW home page for AMALGAM.
http://agora.leeds.ac.uk/amalgam/
Atwell, E. S. (1983). Constituent Likelihood Grammar. ICAME Journal,
7, 34-67. Bergen, Norway: Norwegian Computing Centre for the
Humanities.
Atwell, E. S. (1988). Transforming a Parsed Corpus into a Corpus
Parser. In M. Kyto, O. Ihalainen & M. Rissanen (Eds.) Corpus Lin-
guistics, hard and soft: Proceedings of the ICAME 8th International
Conference (pp. 61-70). Amsterdam, The Netherlands: Rodopi.
Atwell, E. S. (1996). Machine Learning from Corpus Resources for
Speech And Handwriting Recognition. In J. Thomas & M. Short
(Eds.) Using Corpora for Language Research: Studies in the Honour
of Geoffrey Leech (pp. 151-166). Harlow, UK: Longman.
Atwell, E. S., Hughes, J. S., & Souter, D. C. (1994a). AMALGAM:
Automatic Mapping Among Lexico-Grammatical Annotation Models.
In J. Klavans (Ed.) Proceedings of ACL workshop on The Balancing
Act: Combining Symbolic and Statistical Approaches to Language (pp.
21-28). Somerset, NJ: Association for Computational Linguistics.
Atwell, E. S., Hughes, J. S., & Souter, D. C. (1994b). A Unified Multi-
Corpus for Training Syntactic Constraint Models. In L. Evett & T.
Rose (Eds.) Proceedings of AISB workshop on Computational Lin-
guistics for Speech and Handwriting Recognition. Leeds, UK: Leeds
University, School of Computer Studies.
Black, E., Garside, R. G., & Leech, G. N. (Eds.) (1993). Statistically-
driven Computer Grammars of English: the IBM / Lancaster Ap-
proach. Amsterdam, The Netherlands: Rodopi.
comp.speech. (1996). WWW home page for comp.speech Frequently
Asked Questions. http://svr-www.eng.cam.ac.uk/comp.speech/
EAGLES. (1996). WWW home page for EAGLES.
http://www.ilc.pi.cnr.it/EAGLES/home.html
HLT Survey. (1995). WWW home page for the NSF/EC Survey of the
State of the Art in Human Language Technology.
http://www.cse.ogi.edu/CSLU/HLTsurvey/
Hughes, J. S., & Atwell, E. S. (1994). The Automated Evaluation of
Inferred Word Classifications. In A. Cohn (Ed.) Proceedings of Euro-
pean Conference on Artificial Intelligence (ECAI'94) (pp. 535-539).
Chichester, UK: John Wiley.
Hughes, J. S., Souter, D. C., & Atwell, E. S. (1995). Automatic Ex-
traction of Tagset Mappings from Parallel-Annotated Corpora. In
E. Tzoukerman & S. Armstrong (Eds.) Proceedings of Dublin ACL-
SIGDAT workshop `From text to tags: issues in multilingual language
analysis'. Somerset, NJ: Association for Computational Linguistics.
ICAME. (1996). WWW home page for ICAME.
http://www.hd.uib.no/icame.html
Johansson, S., Atwell, E. S., Garside, R. G., & Leech, G. N. (1986). The
Tagged LOB Corpus. Bergen, Norway: Norwegian Computing Centre
for the Humanities.
Leech, G. N., Barnett, R., & Kahrel, P. (1995). EAGLES Final Report
and Guidelines for the Syntactic Annotation of Corpora (EAGLES
Document EAG-TCWG-SASG/1.5, see EAGLES WWW page). Pisa,
Italy: Istituto di Linguistica Computazionale.
O'Donoghue, T. (1993). Reversing the process of generation in systemic
grammar. Ph.D. Thesis. Leeds, UK: Leeds University, School of
Computer Studies.
Sampson, G. (1995). English for the Computer: the SUSANNE Corpus
and Analytic Scheme. Oxford, UK: Clarendon Press.
Souter, C. (1989). A Short Handbook to the Polytechnic of Wales Corpus
(Technical Report). Bergen, Norway: Bergen University, ICAME,
Norwegian Computing Centre for the Humanities.
Souter, D. C., & Atwell, E. S. (1994). Using Parsed Corpora: A review
of current practice. In N. Oostdijk & P. de Haan (Eds.) Corpus-based
Research Into Language (pp. 143-158). Amsterdam, The Netherlands:
Rodopi.
Taylor, L. J., & Knowles, G. (1988). Manual of information to accom-
pany the SEC corpus: The machine readable corpus of spoken English
(Technical Report). Lancaster, UK: University of Lancaster, Unit for
Computer Research on the English Language.
4
Using ALICE to Analyse a
Software Manual Corpus
William J. Black1
Philip Neal
UMIST

4.1 Introduction
The ALICE parser (Analysis of Linguistic Input to Computers in En-
glish) was developed for use in the CRISTAL project concerned with
multilingual information retrieval (LRE project P 62-059). It is designed
eventually to be used in open-ended situations without restrictions on
user vocabulary, but where fragmentary analysis will be acceptable when
complete parses turn out not to be possible. At all stages of its develop-
ment, the emphasis has been on robustness at the expense of full gram-
matical coverage, and on a small lexicon augmented by morphological
analysis of input words.

4.2 Description of Parsing System


The grammatical framework is Categorial Grammar (CG) (Wood, 1993),
in which the words of a sentence are analysed as functions and arguments
of each other by analogy with the structure of a mathematical equation.
In the same way that the minus sign in "-2" is a functor forming a nu-
meral out of a numeral and the equals sign in "2 + 3 = 5" one forming an
equation from two numerals, categorial grammar analyses the adjective
in "poor Jemima" as a functor forming a noun (phrase) out of a noun
and the transitive verb in "France invaded Mexico" as a functor forming
a sentence out of two nouns. The category of a function from a noun to
a noun is symbolised as n/n and a transitive verb as (s\n)/n, where the
direction of the slash represents the direction in which a functor seeks its
arguments. Arguments are given feature structures to force agreement
of number and gender, and to handle phenomena such as attachment
and gapping.

1 Address: Centre for Computational Linguistics, University of Manchester Insti-
tute of Science and Technology, Sackville Street, PO Box 88, Manchester M60 1QD,
UK. Tel: +44 161 200 3096, Fax: +44 161 200 3099, Email: bill@ccl.umist.ac.uk.
It is a property of Categorial Grammar that the syntactic potential
of words is represented entirely in the lexicon. The current parsing unit
only employs the two standard rules of left and right function application.
Rules of composition have been tried out in some versions of the parser
and may well be restored at a later stage.
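
As an illustration of how far these two application rules take such a
lexicon, the following toy Python sketch (a reconstruction for exposition,
not the ALICE implementation, and using the simplified category (s\n)/n
assumed above for a transitive verb) derives a sentence from "France
invaded Mexico" by forward and backward application.

N, S = "n", "s"
TV = ("/", ("\\", S, N), N)          # transitive verb, (s\n)/n (assumed notation)
LEXICON = {"France": N, "Mexico": N, "Jemima": N,
           "invaded": TV, "poor": ("/", N, N)}

def forward(left, right):
    # X/Y  Y  =>  X   (right function application)
    if isinstance(left, tuple) and left[0] == "/" and left[2] == right:
        return left[1]

def backward(left, right):
    # Y  X\Y  =>  X   (left function application)
    if isinstance(right, tuple) and right[0] == "\\" and right[2] == left:
        return right[1]

vp = forward(LEXICON["invaded"], LEXICON["Mexico"])   # (s\n)/n + n  ->  s\n
print(backward(LEXICON["France"], vp))                # n + s\n      ->  s
print(forward(LEXICON["poor"], LEXICON["Jemima"]))    # n/n + n      ->  n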
ALICE operates in three phases: preprocessing, parsing and post-
parsing.

4.2.1 Preprocessing
Preprocessing is concerned with tokenisation and morphological analy-
sis. The purpose of the tokeniser is to recognise the distinction between
punctuation and decimal points, construct single tokens for compound
proper names, attempt to recognise dates, and other similar tasks: its
effect is to partition a text into proto-sentences.
The morphological analyser finds the likely parts of speech, and where
appropriate the inflected form of each token in the current sentence
string. An attempt is made to guess the part of speech of all content
words from their morphological and orthographic characteristics. A very
limited use is made of the position of words in the sentence. No use is
currently made of syntactic collocations (n-grams).
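
A toy illustration of this kind of orthography-driven guessing is given
below; the suffix rules and tag names are invented for the example and
do not reproduce ALICE's actual heuristics.

# Purely illustrative suffix/shape heuristics for unknown content words.
SUFFIX_RULES = [
    ("ation", "noun"), ("ness", "noun"), ("ment", "noun"),
    ("ing", "verb-or-adjective"), ("ed", "verb-or-adjective"),
    ("ly", "adverb"), ("ous", "adjective"), ("s", "noun-or-verb"),
]

def guess_pos(token):
    # Guess a part of speech from orthographic shape alone.
    if token[:1].isupper():
        return "proper-noun"
    if token.isdigit():
        return "number"
    for suffix, tag in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return tag
    return "noun"                      # default guess for content words

for w in ["Trados", "translation", "segments", "fuzzy", "quickly", "1995"]:
    print(w, "->", guess_pos(w))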
A final sub-phase of preprocessing inserts NP gaps into the well-
formed substring table at places where they could be used in relative
clauses and questions. Between stage I and stage III, some attempt was
made to extend the use of gaps to the analysis of complex conjunctions.
In future development of ALICE it is intended to make much more exten-
sive use of local constraints on the same principle as Constraint Grammar
(Voutilainen & Jarvinen, 1996).

4.2.2 Parsing
The parsing system has the following characteristics:
- It is based on a well-formed substring table (a chart but not an
active chart).
- Parsing proceeds bottom up and left to right, with edges added
to the data structure as rules are completed (rather than when
"fired" as in the Active Chart parsing algorithm).
- Rules are in Chomsky Normal Form; that is, each rule has exactly
two daughters.
- Term Unification is supported as the means of expressing feature
constraints and the construction of phrasal representations.
- The parser uses the predicate-argument analysis to construct a
semantics for an input string in which lexemes, and some purely
grammatical features, are treated as quasi-logical predicates with
coinstantiated variables. This was originally regarded as the prin-
cipal output, to be used in conjunction with an inference system
able to construct from it a disambiguated scoped logical represen-
tation. None of this has been included in the delivered parses.
The lexicon when used in stage I of the tests had about 800 word en-
tries, and about 1200 word entries by stage III of the tests. The content
vocabulary was mostly drawn from the financial domain, and the modifi-
cations between the stages of the tests were motivated by the demands
of the CRISTAL project rather than the tests themselves. The number
of types of CG sign represented (roughly comparable to rules in a phrase
structure formalism) was 117 at stage I and 139 at stage III.

4.2.3 Postprocessing
In addition to the default output based on predicate-argument structure,
the postprocessing phase is able to extract a surface syntactic tree, which
is useful for debugging the grammar. For the convenience of those who
prefer to read trees rather than predicate-argument structure, the surface
tree nodes bear labels corresponding to the CG signs actually used. This
post-parsing translation is supported by a set of rules putting CG into
correspondence with Phrase Structure Grammar (PSG), but there are
about half as many such rules as there are category types in the lexicon,
so some categories or category-feature combinations are conflated.
Generally, a sentence will have multiple parses or no complete parse.
In the latter case the postprocessor extracts from the chart a set of well-
formed fragments. Currently, this is done from right to left, picking
out the longest fragment and recursively extracting from that point left-
wards. This heuristic, and minor variations on it which have been tried,
is not all that it might be (it characteristically fails to identify an entire
verb group where the main complement has not been identi ed). The
introduction of rules of composition should alter the operation of this
rule.

Characteristic A B C D E F G
ALICE yes yes yes yes yes yes yes

Table 4.1: Linguistic characteristics which can be detected by
ALICE. See Table 4.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 4.2: Letter codes used in Tables 4.1, 4.5.1 and 4.5.3

4.3 Parser Evaluation Criteria


We have evaluated the parser at two stages of its development: firstly
its state shortly after the IPSM meeting in March 1995 (certain changes
were made after the meeting, but linguistic coverage was not modified),
and secondly its state in December 1995. The suggestion that improve-
ments in the lexicon should be tested separately against the original
grammar and against an improved grammar is not appropriate to the
categorial formalism, in which almost all the syntactic information is
incorporated into the lexicon. We have thus prepared two analyses but
numbered them I and III for ease of comparison with the other results
presented in this book.
ALICE recognises all seven characteristics prescribed in Table 4.1, in
the sense that it attempts to recognise them and often succeeds.
We have claimed that ALICE accepted all 60 sentences of the test
set: this means that for each sentence it produced either a full parse or
a set of fragmentary parses spanning the whole sentence. Since it is not
anticipated that ALICE will ever produce complete, accurate parses for
all sentences of unrestricted input, we have concentrated on improve-
ments which will produce accurate fragments and some accurate full
parses rather than aiming for full parses at all costs.
The parsing times were obtained on a Sun workstation (ALICE is
also portable to PC). They do not include the time taken to convert
default output to tree notation.
Output in the form of tree notation was used to evaluate the success
of the parser in recognising the prescribed features. At various points,
decisions had to be made about what counted as a given feature.
We have taken the category of verb to include auxiliary verbs, and
also present and past participles where these are constituents of a verb
phrase but not otherwise. Thus "the sun is rising" contains two verbs,
while "the rising sun" contains none.
In the category of noun we have included: verb nouns such as "hunt-
ing" in a phrase like "the hunting of the snark", proper nouns, and also
phrases like "Edit/Cut" in sentence L24. We have excluded pronouns
(though these have the same category as noun in our parsing scheme)
and also nouns used to modify other nouns such as "source" in "source
sentence".
We have taken compounds to mean strings, mainly names, which
are indivisible units from the grammatical point of view but are writ-
ten with white space between them; examples in the IPSM test set are
"Word for Windows 6.0" (T2) and "Translate Until Next Fuzzy Match"
(T8). ALICE attempts to identify such phrases in preprocessing, using
capitalisation as the main clue.
A number of questions arise in evaluating the analysis of phrase
boundaries. Our categorial formalism commits us to a subject-predicate
analysis which is equivalent to a set of PSG rules of binary form only,
so that for instance a conjunction of sentences has to be analysed as the
application of \and" to one sentence to produce a modi er of the other
sentence; but from a grammatical point of view it is an arbitrary choice
which sentence is seen as modifying the other. Other examples of arbi-
trary choices involve attachment, where very often there is no semantic
difference between different ways of bracketing a sequence of noun or
verb modifiers (the problem of "spurious ambiguity"). It is thus not
possible to compare an actual parse with an ideal parse, because there is
often more than one correct way of parsing a sentence. Furthermore, our
formalism commits us to deep rather than shallow nesting of phrases in-
side each other, so that errors are very liable to be propagated upwards:
for instance, where a verb fails to attach to one of three arguments, the
three phrases in which the argument should have been nested will all be
wrongly analysed.
We have therefore chosen to assign scores to boundaries between
words. Any two neighbouring words will belong to exactly two differ-
ent phrases which are the immediate constituents of exactly one third
phrase. Take for instance the sentence (with brackets indicating phrase
structure)
((We) (expect (that ((Major Porter) (will (wear (his medals)))))))

Here the boundary between "Porter" and "will" is the division between
the immediate constituents of "Major Porter will wear his medals", and
every such boundary between words is the boundary between exactly
two immediate constituents of a third phrase. To evaluate ALICE, we
ask ourselves whether "Porter" has been assigned to an immediate con-
stituent of the correct category (noun phrase) and whether "will" has
been assigned to an immediate constituent of the correct category (verb
phrase). We do not ask whether those constituents have themselves been
given their own correct boundaries. A false fragmentary parse
((We) (expect (that Major)))
((Porter) (will (wear (his (medals)))))

would score an error for the failed attachment of "Major" to "Porter",
but a correct score for the boundary between "Porter" and "will", pro-
vided that "Porter" on its own is still analysed as a noun phrase.
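
The sketch below makes this scoring procedure concrete in Python. It is
an illustration of the procedure described rather than the evaluation
script actually used: the category labels added to the example trees are
plausible but invented, and leaves are treated as constituents labelled by
their own word.

# Trees are (label, [children]); leaves are plain words.
GOLD = ("S", [("NP", ["We"]),
              ("VP", ["expect",
                      ("CL", ["that",
                              ("S", [("NP", ["Major", "Porter"]),
                                     ("VP", ["will",
                                             ("VP", ["wear",
                                                     ("NP", ["his", "medals"])])])])])])])

# The false fragmentary parse: "Major" wrongly attached in the first fragment.
FRAGMENTS = [
    ("S", [("NP", ["We"]),
           ("VP", ["expect", ("CL", ["that", ("NP", ["Major"])])])]),
    ("S", [("NP", ["Porter"]),
           ("VP", ["will", ("VP", ["wear", ("NP", ["his", "medals"])])])]),
]

def label(node):
    return node if isinstance(node, str) else node[0]

def width(node):
    return 1 if isinstance(node, str) else sum(width(c) for c in node[1])

def boundary_pairs(trees, start=0, out=None):
    # boundary position -> (category left of the boundary, category right of it)
    out = {} if out is None else out
    for tree in trees:
        if not isinstance(tree, str):
            children = tree[1]
            pos = start
            for left, right in zip(children, children[1:]):
                pos += width(left)
                out[pos] = (label(left), label(right))
            pos = start
            for child in children:
                boundary_pairs([child], pos, out)
                pos += width(child)
        start += width(tree)
    return out

gold, test = boundary_pairs([GOLD]), boundary_pairs(FRAGMENTS)
for b in sorted(gold):
    g, t = gold[b], test.get(b, (None, None))
    print(b, "left:", "ok" if t[0] == g[0] else "error",
          "right:", "ok" if t[1] == g[1] else "error")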
In evaluating the analysis of subject-predicate relations we have con-
sidered whether adjectives are correctly attached to the nouns which
they modify and whether verbs take the correct noun phrase arguments.
Where a verb has more than one argument we have given a correct score
for each argument attached and an incorrect score for each argument not
attached. The category adjective is here taken to include present and
past participles modifying a noun; the category verb does not include
the verb "be".
We count a correct prepositional phrase attachment where we have
correct scope relations in a string of prepositional phrases, and where a
single prepositional phrase correctly attaches to a noun or a verb (in-
cluding the verb "be"). We have excluded sentence modifiers like "for
example" from the analysis.
A coordinate construction is counted as correct where two noun
phrases or two verb phrases are correctly joined by a conjunction into
one phrase of the same type. The comma as used in "Hell, Hull and
Halifax" is counted as a conjunction. Our grammar identifies a gapping
construction in relative clauses and in some non-sentential coordinations;
where the existence of such a construction is suspected, the preparser
inserts a dummy noun-phrase which can turn a relative clause into a sen-
tence and a conjoined transitive verb into an intransitive verb. Where
these constructions correctly unify with a dummy noun-phrase we count
a correct analysis; where they do not, we count an incorrect analysis.

4.4 Analysis I: Original Grammar, Original Vocabulary

The results of the first analysis using ALICE with its original grammar
and vocabulary are shown in the following tables.
Number Accept Reject % Accept % Reject
Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 4.3.1: Phase I acceptance and rejection rates for ALICE.

Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 1676 26.3 0.0
Lotus 904 15.1 0.0
Trados 1269 21.2 0.0
Total 3849 19.4 0.0

Table 4.4.1: Phase I parse times for ALICE. The first column
gives the total time to attempt a parse of each sentence.

Char. A B C D E F G Avg.
Dynix 83% 88% 100% 50% 59% 0% 0% 54%
Lotus 62% 81% 22% 45% 57% 0% 8% 39%
Trados 80% 88% 71% 50% 65% 0% 0% 42%
Average 75% 86% 64% 48% 60% 0% 3% 45%

Table 4.5.1: Phase I Analysis of the ability of ALICE to recog-
nise certain linguistic characteristics in an utterance. For example
the column marked `A' gives for each set of utterances the per-
centage of verbs occurring in them which could be recognised.
The full set of codes is itemised in Table 4.2.

4.5 Analysis II: Original Grammar, Additional Vocabulary

As was explained above, modifications to the lexicon of ALICE also
affect the grammar: we therefore only performed a single evaluation of
improvements made between March and December 1995. This meant
that Analysis II could not take place.
Number Accept Reject % Accept % Reject
Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 4.3.3: Phase III acceptance and rejection rates for ALICE.

Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 2357 39.3 0.0
Lotus 1623 27.1 0.0
Trados 4678 78.0 0.0
Total 8658 48.1 0.0

Table 4.4.3: Phase III parse times for ALICE. The first column
gives the total time to attempt a parse of each sentence.

Char. A B C D E F G Avg.
Dynix 75% 77% 100% 53% 58% 42% 4% 58%
Lotus 79% 90% 22% 59% 66% 65% 19% 57%
Trados 92% 88% 71% 82% 66% 39% 6% 63%
Average 82% 85% 64% 65% 63% 49% 10% 59%

Table 4.5.3: Phase III Analysis of the ability of ALICE to recog-
nise certain linguistic characteristics in an utterance. For example
the column marked `A' gives for each set of utterances the per-
centage of verbs occurring in them which could be recognised.
The full set of codes is itemised in Table 4.2.

4.6 Analysis III: Modified Grammar, Additional Vocabulary

We describe here the performance of a modified version of the ALICE
system. It should be emphasised, however, that the modifications made
during this period were designed to improve the performance of the
parser as used in the CRISTAL project: neither grammar nor vocab-
ulary were tailored to produce better results on the IPSM corpus. The
power of ALICE to recognise and correctly categorise individual words
remained much the same, but a new treatment of agreement phenom-
ena resulted in a substantial improvement in the recognition of phrase
boundaries and correct attachment of modifiers. Work in the area of co-
ordination and gapping yielded much less progress; an attempt to recog-
nise more kinds of coordination by assigning multiple syntactic categories
to conjunctions increased the size of the chart and slowed down parsing
times (particularly of the long sentences in the Dynix corpus). Future
versions of ALICE will attempt to identify coordination phenomena in
the preparsing stage.

4.7 Converting Parse Tree to Dependency Notation

Unfortunately it was not possible to carry out the dependency conversion
study using the ALICE data.

4.8 Summary of Findings


ALICE demonstrates the ability of a robust parser with a restricted
lexicon to produce acceptable partial parses of unrestricted text. It also
shows that the combination of categorial type-assignment with
morphology-based preprocessing has the potential to rival constraint-based systems
in recognising the category of individual words on the basis of form and
limited information about context. Our results do not yet approach
the accuracy of over 90% often claimed for commercial parsers and tag-
gers, but we find our performance encouraging and anticipate that future
modifications will continue the kind of progress we have demonstrated
here.

4.9 References
Wood, M. M. (1993). Categorial Grammars. London, UK: Routledge.
Voutilainen, A. & Jarvinen, T. (1996). Using English Constraint Gram-
mar to Analyse a Software Manual Corpus. In R. F. E. Sutcli e,
H.-D. Koch & A. McElligott (Eds.) Industrial Parsing of Software
Manuals. Amsterdam, The Netherlands: Editions Rodopi.
5
Using the English Constraint
Grammar Parser to Analyse a
Software Manual Corpus
Atro Voutilainen1
Timo Jarvinen
University of Helsinki

5.1 Introduction
This chapter reports using the English Constraint Grammar Parser ENG-
CG (Karlsson, 1990; Voutilainen, Heikkila & Anttila, 1992; Karlsson,
Voutilainen, Heikkila & Anttila, 1995) in the analysis of three computer
manual texts, 9,033 words in all. Close attention is paid to the problems
the texts from this category posed for ENGCG, and the main modifica-
tions to the system are reported in detail.
First, ENGCG is outlined. Then three experiments are reported on
the test texts: (i) using ENGCG as such; (ii) using the lexically updated
ENGCG; and (iii) using the lexically and grammatically updated ENG-
CG. A summary of the main findings concludes the paper.
Our experiences are reported against the whole 9,033-word manual
corpus made available to the participants of the workshop. However,
some of the tables also report ENGCG's performance against a subset of
the sentences to render the system more commensurable with the other
systems reported in this book.
1 Address: Research Unit for Multilingual Language Technology, Department of
General Linguistics, P.O. Box 4, FIN-00014 University of Helsinki, Finland. Tel:
+358 0 191 3507 (Voutilainen), +358 0 191 3510 (Jarvinen), Fax: +358 0 191
3598, Email: Atro.Voutilainen@Helsinki.FI, Timo.Jarvinen@Helsinki.FI. Acknowl-
edgements: The Constraint Grammar framework was first proposed by Fred Karlsson.
The original ENGCG description was designed by Atro Voutilainen, Juha Heikkila
and Arto Anttila. The twol program for morphological analysis was implemented by
Kimmo Koskenniemi; Pasi Tapanainen has made the current CG parser implemen-
tation.
Work on the lexicon and the grammar for morphological disambigua-
tion was carried out and reported by Voutilainen; work on syntax was
carried out and reported by Jarvinen.

5.2 Description of Parsing System


The English Constraint Grammar Parser ENGCG is a rule-based system
for the shallow surface-syntactic analysis of Standard Written English of
the British and American varieties. ENGCG is based on the Constraint
Grammar framework originally proposed by Karlsson (1990). New Con-
straint Grammar descriptions are emerging for Finnish, Swedish, Dan-
ish, Basque, German, Swahili and Portuguese; most of them are already
quite extensive though as yet unpublished.

5.2.1 Sample Output


For the sentence This means that the human annotator needs to consider
only a small fraction of all the cases, ENGCG (Version from March 1995)
proposes the following analysis:
"<*this>"
"this" <*> DET CENTRAL DEM SG @DN> ;; determiner
"this" <*> PRON DEM SG @SUBJ ;; subject
"<means>"
"mean" V PRES SG3 VFIN @+FMAINV ;; finite main verb
"means" N NOM SG/PL @SUBJ
"<that>"
"that" <**CLB> CS @CS ;; subordinating conjunction
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<human>"
"human" <Nominal> A ABS @AN> ;; adjectival attribute
"<annotator>"
"annotator" <DER:or> N NOM SG @SUBJ
"<needs>"
"need" <SVO> <SV> V PRES SG3 VFIN @+FMAINV
"<to>"
"to" INFMARK> @INFMARK> ;; infinitive marker
"<consider>"
"consider" V INF @-FMAINV ;; nonfinite main verb
"<only>"
"only" ADV @ADVL ;; adverbial
"<a>"
"a" <Indef> DET CENTRAL ART SG @DN>
"<small>"
"small" A ABS @AN>
"<fraction>"
"fraction" N NOM SG @OBJ ;; object
"<of>"
"of" PREP @<NOM-OF ;; postmodifying ``of''
"<all>"
"all" <Quant> DET PRE SG/PL @QN> ;; quantifier
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<cases>"
"case" N NOM PL @<P ;; preposition complement
"<$.>"

Each input word is given in angle brackets. Indented lines contain
the base form and morphosyntactic tags. For instance, cases is analysed
as a noun in the nominative plural, and syntactically it is a preposition
complement @<P. Sometimes the constraint-based ENGCG leaves an
ambiguity pending, e.g. means above was left morphologically ambigu-
ous. On the other hand, most of the words usually retain the correct
morphological and syntactic analysis.
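
The cohort format is straightforward to post-process. The following
Python sketch (an illustration only, not part of the ENGCG distribution)
reads output in this shape and collects, for each input word, its surviving
readings as (base form, tag string) pairs, reporting whether any
ambiguity remains.

SAMPLE = '''"<*this>"
"this" <*> DET CENTRAL DEM SG @DN>
"this" <*> PRON DEM SG @SUBJ
"<means>"
"mean" V PRES SG3 VFIN @+FMAINV
"means" N NOM SG/PL @SUBJ
"<$.>"
'''

def read_cohorts(text):
    # Returns a list of (wordform, [(base form, tags), ...]) cohorts.
    cohorts = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            cohorts.append((stripped[2:-2], []))          # a new input word
        elif stripped.startswith('"') and cohorts:
            base, _, tags = stripped[1:].partition('"')   # a reading line
            cohorts[-1][1].append((base, tags.strip()))
    return cohorts

for word, readings in read_cohorts(SAMPLE):
    status = "ambiguous" if len(readings) > 1 else "resolved"
    print(word, "-", status, "-", readings)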

5.2.2 System Architecture


ENGCG consists of the following sequentially applied modules:
1. Tokenisation
2. Lookup of morphological tags
(a) Lexical component
(b) Guesser
3. Resolution of morphological ambiguities
4. Lookup of syntactic tags
5. Resolution of syntactic ambiguities
The rule-based tokeniser identifies punctuation marks and word-like units (e.g. some 7,000 different multiword idioms and compounds). It also splits enclitic forms into grammatical words. Lookup of morphological analyses starts with lexical analysis. The lexicon and morphological description are based on Koskenniemi's Two-Level Model (Koskenniemi, 1983). The lexicon contains some 90,000 entries, each of which represents all inflected and central derived word forms.
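Read procedurally, the five modules form a pipeline in which each stage narrows the set of readings produced by the previous one. The following is a minimal sketch of that control flow only (our own illustration with stub components and a toy lexicon; it is not the ENGCG implementation):

# Minimal sketch of the five-stage control flow described above.
# The stub lexicon, guesser default and rule lists are illustrative assumptions.

LEXICON = {"the": {"DET"}, "means": {"V PRES SG3", "N SG/PL"}}

def tokenise(text):                                   # 1. tokenisation
    return text.lower().split()

def morphological_lookup(token):                      # 2. lexicon, then guesser
    return set(LEXICON.get(token, {"N SG"}))          # guesser: default nominal reading

def run_constraints(cohorts, rules):                  # 3. and 5. rule-based disambiguation
    for rule in rules:
        cohorts = rule(cohorts)
    return cohorts

def syntactic_lookup(cohorts):                        # 4. add all possible syntactic tags
    return [(w, r, {"@SUBJ", "@OBJ", "@NN>"}) for w, r in cohorts]

def engcg_pipeline(text, morph_rules=(), syntax_rules=()):
    cohorts = [(t, morphological_lookup(t)) for t in tokenise(text)]
    cohorts = run_constraints(cohorts, morph_rules)
    cohorts = syntactic_lookup(cohorts)
    return run_constraints(cohorts, syntax_rules)

print(engcg_pipeline("the means"))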
The lexical description uses some 140 morphological tags. Those categories which have been ranked as having part-of-speech status are listed in Table 5.6. Further grammatical information is also provided. For example, the finer ENGCG classifiers for determiners may be seen in Table 5.7.
The task of the lexical analyser is to assign all possible analyses to each recognised word. Many words receive more than one analysis, e.g.
"<that>"
"that" <**CLB> CS @CS
"that" DET CENTRAL DEM SG @DN>
"that" ADV @AD-A>
"that" PRON DEM SG
"that" <NonMod> <**CLB> <Rel> PRON SG/PL
This cohort is ambiguous due to five competing morphological analyses.

The lexicon represents some 95-99.5% of all word form tokens in running text, depending on text type. The remaining words receive morphological analyses from a heuristic rule-based module (the guesser), whose rules mainly consult word shape, giving a nominal analysis if none of the form rules apply.
The next operation is resolution of morphological ambiguities. For this, the system uses the rule-based Constraint Grammar Parser. The parser uses a set of constraints, typically of the form Discard alternative reading X in context Y. After disambiguation, optimally only the correct alternative survives as the final analysis.
The constraints are usually partial and negative paraphrases of form definitions of syntactic constructs such as the noun phrase or the finite verb chain. The English grammar for morphological disambiguation contains about 1,200 `grammar-based' constraints plus an additional 200 heuristic ones for resolving some of those ambiguities that the best 1,200 constraints are unable to resolve.2
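To make the rule form concrete, the following is a minimal sketch of how constraints of the kind "discard reading X in context Y" can be applied to cohorts of readings. It is our own illustration, not the Constraint Grammar formalism or engine, and the cohort format, tag strings and the single toy rule are assumptions:

# Minimal sketch of "discard reading X in context Y" disambiguation.
# Cohorts, tags and the example constraint are illustrative, not real ENGCG rules.

def apply_constraints(cohorts, constraints):
    """cohorts: list of (word, set_of_readings); constraints: functions that
    return the readings to discard at position i, never removing the last one."""
    for i, (word, readings) in enumerate(cohorts):
        for constraint in constraints:
            doomed = constraint(cohorts, i)
            survivors = readings - doomed
            if survivors:                      # never discard the last reading
                readings = survivors
        cohorts[i] = (word, readings)
    return cohorts

def no_finite_verb_after_determiner(cohorts, i):
    """Toy rule: discard a finite-verb reading immediately after a determiner."""
    if i > 0 and "DET" in cohorts[i - 1][1]:
        return {"V PRES SG3 VFIN"}
    return set()

sentence = [("the", {"DET"}), ("means", {"V PRES SG3 VFIN", "N NOM SG/PL"})]
print(apply_constraints(sentence, [no_finite_verb_after_determiner]))
# -> [('the', {'DET'}), ('means', {'N NOM SG/PL'})]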
After morphological disambiguation, the next lookup module is activated. A simple program introduces all possible syntactic tags as alternatives to each word. In the worst case, more than ten alternatives can be introduced for a single morphological reading, for instance:
"<*searching>"
"search" <*> <SVO> <SV> <P/for> PCP1 @NPHR @SUBJ
@OBJ @I-OBJ @PCOMPL-S @PCOMPL-O @APP @NN>
@<P @<NOM-FMAINV @-FMAINV @<P-FMAINV @AN>
2 The remaining few ambiguities can be resolved e.g. manually or by using a statistical tagger, if necessary.
Characteristic A B C D E F G
ENGCG yes yes yes/no yes/no yes/no yes/no yes/no

Table 5.1: Linguistic characteristics which can be detected by the ENGCG. See Table 5.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 5.2: Letter codes used in Tables 5.1, 5.5.1, 5.5.2 and 5.5.3.
Thus, an "-ing" form can serve in the following syntactic functions: stray NP head; subject; object; indirect object; subject complement; object complement; apposition; premodifying noun; premodifying adjective; preposition complement; postmodifying nonfinite verb; nonfinite verb as a preposition complement; other nonfinite verb.
Finally, the parser consults a syntactic disambiguation grammar. The rule formalism is very similar to the above-outlined one; the only difference is that only syntactic tags are discarded (rather than entire morphological analyses). The present English Constraint Grammar contains some 300 context-sensitive mapping statements and 830 syntactic constraints (Jarvinen, 1994).

5.2.3 Implementation
ENGTWOL lexical analyser has been implemented by Kimmo Kosken-
niemi (1983), the disambiguator and parser by Pasi Tapanainen. The
ENGCG parser has been implemented for PCs, MacIntoshes and dif-
ferent workstations. On a Sun SparcStation 10/30, ENGCG analyses
about 400 words per second, from preprocessing through syntax. The
system is available, and it can also be tested over the network. Send an
empty e-mail message to engcg-info@ling.helsinki. for further details,
or contact the authors.
5.3 Parser Evaluation Criteria

5.3.1 Towards General Criteria
Ideally, the parser produces only correct morphological and syntactic analyses. Usually this would mean that each word in a sentence gets one correct morphological and syntactic analysis. In practice, 100% success has not been reached in the analysis of running text. Various measures can be used for indicating how close to the optimum the parser gets at various levels of analysis. One intuitive term pair is ambiguity rate and error rate. Ambiguity rate could be defined as the percentage of words with more than one analysis, error rate as the percentage of words without a correct analysis. A more controversial issue in parser evaluation is how we actually determine whether the parser has produced a correct analysis. Manual checking of the parser's output may be unreliable, e.g. because some misanalyses can remain unnoticed.
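Given a benchmark of correct analyses, both measures are simple to compute. The following is a minimal sketch of the two definitions (our own illustration, not the evaluation software used for ENGCG); the data format, pairing each word's surviving analyses with its benchmark analyses, is an assumption:

# Sketch of ambiguity rate and error rate over an analysed corpus.
# Each item pairs the parser's surviving analyses with the benchmark analyses.

def ambiguity_rate(corpus):
    """Percentage of words with more than one surviving analysis."""
    return 100.0 * sum(len(proposed) > 1 for proposed, _ in corpus) / len(corpus)

def error_rate(corpus):
    """Percentage of words whose surviving analyses miss every correct one."""
    return 100.0 * sum(not (proposed & correct) for proposed, correct in corpus) / len(corpus)

corpus = [({"DET"}, {"DET"}),
          ({"V PRES SG3", "N PL"}, {"V PRES SG3"}),   # ambiguous but not erroneous
          ({"N SG"}, {"V IMP"})]                      # unambiguous but erroneous
print(ambiguity_rate(corpus), error_rate(corpus))     # both about 33.3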
The method recently used in evaluating ENGCG starts with the following routine for preparing the benchmark corpus. First, the text, totally new to the system and sufficiently large (at least thousands of words long), is analysed with the morphological analyser. Then two experts independently disambiguate the ambiguities by hand. The potential differences between the two disambiguated corpora are then determined automatically. The experts then discuss the differences to determine whether they are due to a clerical error, incompleteness of the coding manual, or a genuine difference of opinion. On the basis of these negotiations, the final version of the benchmark corpus is prepared. In the case of disagreement or genuine ambiguity, multiple analyses are accepted per word; otherwise only one analysis is given per word. The error rate of the parser is determined from the comparison between the parser's output and the benchmark corpus. The ambiguity rate can be determined directly from the parser's output.
Our experiences (cf. Voutilainen & Jarvinen, 1995) indicate that at the morphological level, an interjudge agreement of virtually 100% can be reached; e.g. multiple analyses were needed only three times in a corpus of 8,000 words (even these were due to genuine ambiguity, as agreed by both judges in the experiment). ENGCG syntax seems to be similar except that about 0.5% of words seem to be syntactically genuinely ambiguous, i.e. multiple syntactic analyses are given to 0.5% of all words.
Tables 5.8 and 5.9 summarise results from recent test-based evaluations (Karlsson, Voutilainen, Anttila & Heikkila, 1991; Voutilainen, Heikkila & Anttila, 1992; Voutilainen, 1993; Voutilainen & Heikkila, 1994; Tapanainen & Jarvinen, 1994; Jarvinen, 1994; Tapanainen & Voutilainen, 1994; Voutilainen, 1994; Voutilainen, 1995a; Karlsson et al., 1995). The test texts were new to the system and they were taken from newspapers, journals, technical manuals and encyclopaedias.
5.3.2 Remarks on the Present Evaluation

The characteristics referred to in Table 5.1 are not really dichotomous. Some remarks concerning them are in order. Generally, the percentages in the collated tables do not show the amount of ambiguity, but refer to the number of cases where the correct reading is retained in the analysis.
5.3.2.1 Verbs
Verbs are recognised. Participles in verbal or nominal functions can be recovered only after syntactic analysis. The verb count also includes finite verb forms (VFIN), infinitives (INF) and participles in verbal functions (@-FMAINV, @<P-FMAINV and @<NOM-FMAINV).
5.3.2.2 Nouns
All nouns are recognised in the morphological disambiguation, with the exception of "-ing" forms whose nominal vs. verbal function is determined only during syntactic analysis.
5.3.2.3 Compounds
The ENGTWOL lexicon contains some 6,000 compound entries, most of them nominals. Most of the remaining compounds can be recognised using the syntactic premodifier and head function tags. The compounds occurring in the computer manuals were not added to the lexicon; updating the lexicon would probably have improved the system's accuracy.
5.3.2.4 Predicate-Argument relations
In ENGCG analysis, predicate-argument relations are implicitly represented, using morphosyntactic labels. Seven different relations for surface NP heads are distinguished: subject, formal subject, object, indirect object, subject complement, object complement and object adverbial (@SUBJ, @F-SUBJ, @OBJ, @I-OBJ, @PCOMPL-S, @PCOMPL-O and @O-ADVL, respectively). For example, in the Dynix data (20 sentences), our benchmark contained 59 of the above-mentioned argument labels. The syntactic analysis discarded 2 correct labels and left the analysis ambiguous in 4 cases.
However, the relations are not explicit; to make them so would require some additional work which is probably not possible in the current ENGCG framework. Further discussion is in Section 5.7.
5.3.2.5 Prepositional Phrase attachment

Prepositional phrases are never attached explicitly. The syntactic ambiguity is between the adverbial (@ADVL), or high attachment, reading and the postmodifying (@<NOM), or low attachment, reading. This ambiguity is resolved when it can be done reliably.
There were 840 prepositions in the original data, including the preposition "of", which has a syntactic label of its own when postmodifying (@<NOM-OF). The results are shown in Table 5.10. Approximately 29% of the prepositions remain two-way ambiguous in the above sense. This means that the overall accuracy of the attachments is quite high. Only one change was needed in the rules for handling PP attachment in the computer manual texts.
5.3.2.6 Coordination and gapping
Simple coordination is handled quite reliably in most cases. Complex coordination is most probably highly ambiguous, but the correct function should be among the alternatives.
Gapping is not handled explicitly by the ENGCG rules, but quite often the correct syntactic function is present in the output.
The original Dynix data contains 87 simple or complex coordinations, and 75 of them were analysed correctly. More specifically, only a few errors result if the coordinated items belong to the same part of speech.
In 12 coordinations, one or more errors were found. In most of these cases, a kind of apposition was involved, e.g. parenthetical structures within a sentence: .. using the Previous Title (PT) and Next Title (NT) commands.
Long lists of coordinated items are also difficult. Quite often they lead to several consecutive errors. For instance, the sentence There are four standard authority searches: author, subject, and series or uniform title originally contained four errors: all listed items were tagged as subject complements (@PCOMPL-S).
It seems, however, that list structures can be handled by a large number of rules describing different types of coordinated items. The problem with coordination in Constraint Grammar is that the same rule schema must be duplicated for each syntactic function. So far, only very simple coordinations are described thoroughly.
5.3.3 Current Evaluation Setting

The performance of ENGCG on the three computer manual texts was evaluated by automatically comparing the parser's outputs to a manually prepared benchmark version of the test texts. The benchmark corpus was created as specified above, except that each part of the corpus was analysed only by one person (due to lack of time). The corpus is likely to contain some tagging errors due to the annotator's inattention, though probably most of the tagging errors could be identified and corrected during the evaluation of the parser.
5.4 Analysis I: Original Grammar, Original Vocabulary

5.4.1 Observations about Morphological Analysis and Disambiguation
The unmodified ENGCG morphological disambiguator was applied to the three texts, 9,033 words in all. The ambiguity statistics can be seen in Table 5.11, while the error sources are shown in Table 5.12.
Compared to previous test results with different texts, these texts proved very hard for the ENGCG lexicon and morphological disambiguator. Firstly, the rate of remaining ambiguity was higher than usual (for instance, the typical readings/word rate after grammar-based morphological disambiguation is between 1.04 and 1.09, while here it was 1.11).
An even more alarming result was the number of errors (missing correct tags). While previous reports state an error rate of 0.2-0.3% for ENGCG up to grammar-based morphological disambiguation, here the error rate was as much as 1.3%.
5.4.1.1 Lexical errors
Out of the 60 errors due to the lexicon and lexical guesser, 9 were of a domain-generic sort: the contextually appropriate morphological analysis was not given as an alternative even though there was nothing peculiar or domain-specific in the text's use of the word. Some examples:
• Displays records from an index that begin with and match your search request alphabetically, character-for-character, left-to-right [A / ADV].
• An authority search allows you to enter an authority heading that the system matches alphabetically, character-by-character, left-to-right [A / ADV].
• Alphabetical title searching allows you to view a browse [PRES / N] list of titles in alphabetical order.
• If you perform an alphabetical title search, the system displays an alphabetical browse [PRES / N] list of titles that most closely match your search request.
Number Accept Reject % Accept % Reject
Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 5.3.1: Phase I acceptance and rejection rates for the ENGCG.

Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 0.8 0.04 N.A.
Lotus 0.5 0.03 N.A.
Trados 0.8 0.04 N.A.
Total 2.1 0.04 N.A.

Table 5.4.1: Phase I parse times for the ENGCG. The first column gives the total time to attempt a parse of each sentence.

Char. A B C D E F G Avg.
Dynix 98% 100% 0% 97% 97% 083% 80% 92%
Lotus 98% 099% 0% 98% 98% 100% 82% 96%
Trados 98% 100% 0% 96% 94% 071% 60% 86%
Average 98% 100% 0% 97% 96% 082% 77% 92%

Table 5.5.1: Phase I Analysis of the ability of the ENGCG to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 5.2.

 A [ABBR NOM / DET] browse [PRES / N] list of titles appears:


 A standard authority search accesses the heading from the SUB-
JECT, AUTHOR, or SERIES eld on the BIB record and displays
an alphabetical browse [PRES / N] list of corresponding subjects,
authors, or series.
For instance, left-to-right was analysed as an adjective although it
was used as an adverb. (Notice that these errors sometimes trigger
the `domino e ect': new errors come up because of previous ones. For
instance, the determiner A was analysed as an abbreviation because
browse missed the noun analysis.) Of the 60 lexical errors, the remaining
ENGCG 67

51 were due to special domain vocabulary. Some examples:


• When you have finished, enter " SO [ADV / N] " to return to the search menu.
• If this is not what you intended, you may be able to use Undo [PRES / N] to retrieve the text.
• Press [N / IMP] BACKSPACE [PRES / N] to delete characters to the left of the insertion point, or press [INF,PRES,N / IMP] DEL to delete characters to the right of the insertion point.
• If you want to display multiple documents, each in its own window, deselect Close [A,ADV / N,IMP] current file.
• Undo [IMP / N] and Revert [PRES / N] do not reverse any action once editing changes have been saved using File/Save, CTRL+S, or Auto Timed Save [PRES / N].
• Select the desired Undo [IMP / N] level.
• Choose File/Revert to Saved to use revert [PRES / N].
• Just mark them with your mouse, and copy them to the target field by using the Copy and Paste functions from Word's Edit [PRES / N] menu.
• Simply enter the path and filename of the abbreviation file into the edit [INF / N] box of this dialog.
• To add style names to the list, simply click on the Add [INF / N] button, then enter the name of the paragraph style representing the text that should be left untranslated - in our example, " DoNotTranslate " -, then click on [ADV / PREP] OK [ADV / N].
• The most intuitive button to start TM mode is (Open Get [PRES / N]).
• The TWB1 button, also labeled Translate Until [CS / PREP] Next Fuzzy Match, tells the Workbench to do precisely this.
Sometimes a particular word was given a part-of-speech use that should not even be allowed in a domain-generic lexicon, e.g. the noun use of revert.
Sometimes a word sequence, e.g. a clause, was used as a noun phrase, typically a name, e.g. Translate Until Next Fuzzy Match.
5.4.1.2 A solution
An automatic routine was designed, and a prototype was implemented for the generation of a domain lexicon. In particular, a module was implemented for recognising nouns on the basis of contextual or orthographic information. The following routine was used:
1. Find words without nominal analyses that appear to be nominals on the basis of contextual or orthographic information. This was implemented as a small Constraint Grammar: first, a flex scanner introduces a new reading to words with a verb (but without a noun) analysis, cf. allows and browse below:
..
"<allows>"
"allow" UNCONVENTIONAL N NOM PL
"allow" <SVOO> <SVO> V PRES SG3 VFIN @+FMAINV
"<you>"
"you" <NonMod> PRON PERS NOM SG2/PL2
"you" <NonMod> PRON PERS ACC SG2/PL2
"<to>"
"to" PREP
"to" INFMARK> @INFMARK>
"<view>"
"view" <SVO> V SUBJUNCTIVE VFIN @+FMAINV
"view" <SVO> V IMP VFIN @+FMAINV
"view" <SVO> V INF
"view" <SVO> V PRES -SG3 VFIN @+FMAINV
"view" N NOM SG
"<a>"
"a" <Indef> DET CENTRAL ART SG @DN>
"<browse>"
"browse" UNCONVENTIONAL N NOM SG
"browse" <SV> <SVO> V SUBJUNCTIVE VFIN @+FMAINV
"browse" <SV> <SVO> V IMP VFIN @+FMAINV
"browse" <SV> <SVO> V INF
"browse" <SV> <SVO> V PRES -SG3 VFIN @+FMAINV
..
2. Then a small set of constraints was written for discarding unlikely UNCONVENTIONAL candidates before the main grammar is used. Only the most obvious candidates are accepted for further analysis. Sometimes this minigrammar fully disambiguates a word with an initial UNCONVENTIONAL analysis:
..
"<allows>"
"allow" <SVOO> <SVO> V PRES SG3 VFIN @+FMAINV
"<you>"
"you" <NonMod> PRON PERS NOM SG2/PL2
"you" <NonMod> PRON PERS ACC SG2/PL2
"<to>"
"to" PREP
"to" INFMARK> @INFMARK>
"<view>"
"view" <SVO> V SUBJUNCTIVE VFIN @+FMAINV
"view" <SVO> V IMP VFIN @+FMAINV
"view" <SVO> V INF
"view" <SVO> V PRES -SG3 VFIN @+FMAINV
"view" N NOM SG
"<a>"
"a" <Indef> DET CENTRAL ART SG @DN>
"<browse>"
"browse" UNCONVENTIONAL N NOM SG
..
3. Finally, the main grammar is used for casual disambiguation.
The purpose of this extra module is to correct cases where ENGCG would always fail. The minigrammar was written using ten million words of The Times newspaper text as the empirical basis. A particular concern was to avoid introducing new errors into the system (i.e. misanalysing cases that could be correctly analysed using ENGCG alone). Writing this grammar of 16 constraint rules took about two hours.
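The core idea of step 1 above can be pictured as follows. This is a minimal sketch of that step only, adding a provisional UNCONVENTIONAL noun reading to verb-only words; the cohort format and the helper function are our illustrative assumptions rather than the actual flex scanner:

# Sketch of step 1: give words that have only verb readings a provisional
# UNCONVENTIONAL noun reading, to be filtered later by a small constraint set.
# The data format is an illustrative assumption.

def add_unconventional_nouns(cohorts):
    for word, readings in cohorts:
        has_verb = any(r.startswith("V ") for r in readings)
        has_noun = any(r.startswith("N ") for r in readings)
        if has_verb and not has_noun:
            readings.add("UNCONVENTIONAL N NOM SG")
    return cohorts

cohorts = [("a", {"DET CENTRAL ART SG"}),
           ("browse", {"V IMP VFIN", "V INF", "V PRES -SG3 VFIN"})]
print(add_unconventional_nouns(cohorts))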
This module was tested against 1.5 million words from the Economist
newspaper. The system introduced noun analyses to 167 words, e.g.
.. is a *waddle* to and from ..
.. of Catherine *Abate*, New York's commissioner ..
.. to be a *blip* rather than the beginning ..
.. the Rabbi of *Swat*.
.. the *drivel* that passes for ..
.. the *straggle* of survivors ..
.. such as Benny *Begin*, son of Menachem ..
Only once in the analysis of these 167 cases did the system fail in
the sense that the desired analysis was missing after disambiguation,
namely:
.. novel about his roots, `Go *Tell* it on the mountain'.
where Tell presumably should have been analysed as an imperative rather than as a noun.
As regards the analysis of the IPSM texts: this module assigned 28 new noun readings, all of them correctly. However, only some of the relevant tokens were identified by this module, namely those with the most obviously `noun-like' local syntactic context, e.g. edit in the sequence the edit. Some other occurrences of these lexemes remained uncorrected because of a `weaker' context, e.g. situations where there is an adjective between the erroneously analysed word and its determiner.
To improve the situation, the new prototype was enriched with a lexicon update mechanism that makes lexical entries for those words that have been identified as domain nouns. Then this automatically updated lexicon is applied to the whole text, and this time all occurrences get a noun analysis as one alternative, e.g. browse in
"<an>"
"an" <Indef> DET CENTRAL ART SG @DN>
"<alphabetical>"
"alphabetical" A ABS
"<browse>"
"browse" N NOM SG
"browse" <SV> <SVO> V SUBJUNCTIVE VFIN @+FMAINV
"browse" <SV> <SVO> V IMP VFIN @+FMAINV
"browse" <SV> <SVO> V INF
"browse" <SV> <SVO> V PRES -SG3 VFIN @+FMAINV
This time, the module assigned 35 new noun readings, all of them correctly.
This automatic module, which only assigns noun readings to words with verb analyses, did not identify other kinds of lexical errors; the remaining 25 cases were taken care of by manually updating the domain lexicon. For instance, the frequently occurring SO and OK were given noun analyses.3 Also some multiword names were added to the domain lexicon, e.g. translate until next fuzzy match.
3 An obvious long-term solution would be to extend the above module to other classes too.
5.4.2 Observations about Syntax

First, some remarks on the input format. ENGCG contains an effective preprocessing module which is capable of detecting sentence boundaries in running text. The requirements for the input texts are very flexible. It is assumed that all sentence fragments are either markup coded or separated from the text passages, e.g. by two or more blank lines.
We assumed that each batch of the manual data contained 200 individual passages which can be either full sentences or sentence fragments (headings, terms or even individual words). But it seems that in some cases one sentence was quite arbitrarily put into two passages, especially in the Lotus data, e.g.
Today's date
Permanently inserts the current system date.
Analysing these lines separately leads to incorrect analysis, because the distinction between a sentence and a sentence fragment is crucial for syntactic analysis. If there is no finite or infinite verb in a sentence, the set of possible tags for nouns is reduced to @NN>, @NPHR and @APP, i.e. premodifying noun, stray NP head and apposition.
The data could not have been handled by the preprocessor either, because the same data also included consecutive lines which seemed more like a heading and an independent sentence, e.g.
Understanding text formatting and text enhancements
You can change the way text appears in a document.
5.4.2.1 Problems in the original grammar

There were 309 errors in the first round of syntactic analysis of the original computer manual data. The main sources of these errors can be classified as follows:
1. Immature syntactic rule components
2. Errors in ENGCG syntactic rules
3. Leaks in heuristic syntactic rules
4. Inappropriate input
It is possible to speak about different components or subparts of the syntactic description, though in the ENGCG description there is considerable interaction between the rules. What is meant by immaturity is that if there are no rules for picking up the frequent obvious cases of a certain grammatical function, then other, more heuristic rules tend to discard these labels, and the result is an error.
The most immature part of the ENGCG syntax seemed to be the recognition of appositions. Consider for instance For example, an author authority search on the name "Borden" displays the index list ... Here, both name and Borden got the subject label @SUBJ.
Consider These lists must be in "Text Only" format, that is, standard ANSI Windows text. Here, the standard ANSI Windows text seemed to be an apposition. These kinds of apposition, preceded by for example, for instance and that is, were recurrent, but they could be easily handled by the syntax due to the overt marking.
Six of the added rules were concerned with the recognition of appositions, which seemed to be a very typical feature of software manuals.
A couple of plain errors were found among the newly-written rules. Heuristic rules were also used in syntactic analysis. As their name implies, they are more prone to leak when a complex structure is encountered.

5.5 Analysis II: Original Grammar, Addi-


tional Vocabulary
5.5.1 Observations about Morphological Disambigua-
tion
After the lexical corrections described above, the enriched ENGCG (up
to heuristic disambiguation) still made 85 errors. These were due to the
constraints (or the grammatical peculiarities of the texts).
5.5.1.1 Typical problems with grammar or text
The grammar-based constraints for morphological disambiguation dis-
carded the correct reading 43 times. The most frequent problem types
were the following:
1. Underrecognition of imperatives (18 cases)
 If you search for a hyphenated term, and if you choose to not
enter the hyphen, be [INF / IMP]4 sure that you enter the
term as separate words.
 For example, if you are searching for the title, \ The Silver
Chair, " simply enter [PRES / IMP] \ Silver Chair. "
 When you want to reverse an action, choose [INF,PRES /
IMP] Edit/Undo.
2. Clauses without an apparent subject (10 cases)
4 The word-form \be" was analysed as an in nitive; it should have been analysed
as an imperative.
ENGCG 73

Number Accept Reject % Accept % Reject
Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 5.3.2: Phase II acceptance and rejection rates for the ENGCG.

Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 0.8 0.04 N.A.
Lotus 0.5 0.03 N.A.
Trados 0.8 0.04 N.A.
Total 2.1 0.04 N.A.

Table 5.4.2: Phase II parse times for the ENGCG. The first column gives the total time to attempt a parse of each sentence.

Char. A B C D E F G Avg.
Dynix 98% 100% 0% 97% 97% 083% 80% 92%
Lotus 98% 099% 0% 98% 98% 100% 82% 96%
Trados 98% 100% 0% 96% 94% 071% 60% 86%
Average 98% 100% 0% 97% 96% 082% 77% 92%

Table 5.5.2: Phase II Analysis of the ability of the ENGCG to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 5.2. Note that the figures given for Phase I and Phase II analyses are the same, because the test with the additional vocabulary was carried out separately, for morphological disambiguation only.
• Displays [N / PRES] the records that have a specific word or words in the TITLE, CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which fields have been included in each index.
• Permanently inserts [N / PRES] the current system date [PRES / N].
• Inserts [N / PRES] a date that is updated to the current system date each time you open the document and display that page.
3. Names (some cases)
• For information, refer to [INFMARK> / PREP] " To set User Setup defaults " in Chapter 3.
4. Short centre-embedded relative clauses without a relative pronoun (some cases)
• Text you copy remains [N / PRES] on the Clipboard until you copy or cut other text, data, or a picture.
• The level you specify [INF / PRES] determines the number of actions or levels Ami Pro can reverse.
5. before (some cases)
• The letter before [CS / PREP] the full stop is uppercase.
• The character before [CS / PREP] the full stop is a number.
5.5.1.2 Solutions
Using compound nouns for avoiding grammar errors. Some errors would have been avoided if certain compound nouns had been recognised, e.g.
• For example, a title keyword search on " robot " might yield a high number of matches, as would an author authority search [INF / N] on " Asimov, Isaac. "
If author authority search had been recognised as a compound noun during lexical analysis, the misanalysis of search would not have occurred.
A system was made for recognising compound nouns in the text and updating the lexicon with them. The architecture of this system is the following:
1. Extract unambiguous modifier-head sequences of the form "one or two premodifiers + a nominal head" from the text using NPtool, a previously developed noun phrase detector (Voutilainen, 1993). Note that NPtool was also enriched with the above-outlined mechanism for adding noun readings to words described as verbs only in the original ENGCG lexicon (cf. Section 5.4.1.2 above). Here is an unedited NPtool sample from the IPSM manuals (from a list of almost 400 modifier-head sequences):
abbreviation file
abbreviation list
above example
above segment
accelerated search
accelerated search command
accelerated searching
accelerated searching abbreviation
adapted translation
alphabetical browse list
alphabetical heading
alphabetical list
alphabetical listing
alphabetical order
alphabetical search
alphabetical title
alphabetical title search
alphabetical title search option
ami pro
2. Convert these into lexical entries; update the lexicon and the multiword construction identifier.
3. Use the updated system. This time, every occurrence of a previously attested compound will be recognised as such. An example:
"<*for=example>"
"for=example" <*> ADV ADVL @ADVL
"<$,>"
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<search_menu>"
"search_menu" N NOM SG
"<in>"
"in" PREP
"<your>"
"you" PRON PERS GEN SG2/PL2 @GN>
"<*public_*access_module>"
"public_access_module" <*> N NOM SG
"<may>"
"may" V AUXMOD VFIN @+FAUXV
..
The underlying empirical observation is that word sequences that have been attested as compound nouns (even once) almost always behave as compound nouns (Voutilainen, 1995b). In principle this means that if a compound noun is unambiguously analysed as such even once, this analysis should prevail whenever a similar word sequence is encountered again in the text, irrespective of whether ENGCG is able to analyse this construction unambiguously on the basis of purely grammatical criteria. The expected consequences of using this technique are (i) fewer misanalyses and (ii) fewer remaining ambiguities.
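The caching idea behind this technique can be sketched as follows. The underscore joining mirrors the output format shown above (cf. search_menu), but the function and the toy attested list are our illustrative assumptions, not NPtool or the ENGCG preprocessor:

# Sketch of compound-noun caching: once a modifier-head sequence has been
# attested, later occurrences are joined into a single token. The attested
# list would come from an NPtool-style extractor over the same text.

def join_attested_compounds(tokens, attested):
    """tokens: list of word strings; attested: set of lower-cased word tuples."""
    out, i = [], 0
    while i < len(tokens):
        for length in (3, 2):                       # longest match first
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in attested:
                out.append("_".join(tokens[i:i + length]))
                i += length
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

attested = {("search", "menu"), ("alphabetical", "title", "search")}
print(join_attested_compounds("the search menu in your module".split(), attested))
# -> ['the', 'search_menu', 'in', 'your', 'module']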
Grammar modifications. Most of the other modifications were domain-generic. For instance, the account of imperatives was improved by correcting those few constraints that made the misprediction. To ensure that the modifications were domain-generic, other texts from different domains were also used for testing the modified grammar.
Some heuristic constraints were also corrected, though here more errors were tolerated to keep the heuristics effective.
The only potentially domain-specific modification concerned the treatment of apparently subjectless clauses. In the workshop's input, each utterance was given on a line of its own. What often appeared to correspond to the missing subject was actually on a line of its own, e.g.:
Alphabetical title.
Displays titles in an index that contain the word or term you enter.
Here, the unmodified version of ENGCG discarded the present tense reading of Displays.
For the correct analysis of sentence-initial present tense verbs, a small subgrammar (two constraints only) was written, using texts from The Times newspaper as the empirical basis. This subgrammar selects the present tense verb reading in the third person singular as the correct alternative whenever the context conditions are satisfied. Let us examine one of the constraints:
(@w =! (PRES SG3)
(NOT *-1 VERB/PREP)
(NOT -1 UNAMB-PL-DET)
(NOT -1 SEMICOLON/COLON)
(1C DET/A/GEN/NUM/ACC)
(NOT 1 <1900>)
(NOT 2 TO)
(NOT *1 HEADER))
The constraint reads something like this: "Select the present tense verb reading in the third person singular as the correct analysis if the following context conditions are satisfied:
• There are no verbs or prepositions in the left-hand context,
• The first word to the left is not an unambiguous plural determiner,
• The first word to the left is not a semicolon or a colon,
• The first word to the right is an unambiguous determiner, adjective, genitive, numeral or accusative pronoun,
• The first word to the right does not signify a year,
• The second word to the right is not TO,
• There are no header markup codes in the right-hand context."
The fourth context condition identifies a likely beginning of a noun phrase in the immediate right-hand context, a typical slot for a verb. A noun phrase can also take this position, but this chance is diminished by the other negative context conditions.
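As a rough procedural paraphrase of these context conditions (our own illustration, not the Constraint Grammar rule engine; the coarse feature names are assumptions), the constraint can be thought of as a predicate over a tokenised utterance:

# Rough paraphrase of the sentence-initial present-tense constraint above.
# Each token is a set of coarse features; the feature names are illustrative.

def select_pres_sg3(tokens, i):
    left, right = tokens[:i], tokens[i + 1:]
    return (not any({"VERB", "PREP"} & t for t in left)            # no verb/prep to the left
            and (not left or not {"UNAMB-PL-DET", "SEMICOLON", "COLON"} & left[-1])
            and bool(right) and bool({"DET", "A", "GEN", "NUM", "ACC"} & right[0])
            and "YEAR" not in right[0]
            and (len(right) < 2 or "TO" not in right[1])
            and not any("HEADER" in t for t in right))

utterance = [{"PRES", "SG3", "N"}, {"DET"}, {"N"}]   # e.g. "Displays the records"
print(select_pres_sg3(utterance, 0))                  # -> True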
The subgrammar was also tested against a sample from The Economist newspaper. The Constraint Grammar parser optionally leaves a trace of every rule application in its output, so observing the constraints' predictions is straightforward. A total of 474 predictions were examined: the grammar made 10 false predictions, i.e. about 98% of the predictions were correct.
In this section, we reported only corrections to the disambiguation grammar. The other task, writing new constraints for resolving remaining ambiguities, is not addressed here. A likely solution would be to write a set of domain constraints with good performance in the computer manual domain. However, developing a realistic grammar for this purpose would require more texts from this domain than our little 9,000-word corpus.5
5.6 Analysis III: Altered Grammar, Additional Vocabulary

5.6.1 Observations about Morphological Disambiguation
The lexically and grammatically updated ENGCG morphological disambiguator was applied to the original texts. The results are shown in Tables 5.13 and 5.14.
5 However, Kyto and Voutilainen (1995) show that the idea of a first-pass domain grammar is reasonable both in terms of the grammarian's work load and accuracy; their experiments are based on `backdating' the ENGCG grammar for the analysis of 16th century British English.
Number Accept Reject % Accept % Reject
Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 5.3.3: Phase III acceptance and rejection rates for the ENGCG.

Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 0.8 0.04 N.A.
Lotus 0.5 0.03 N.A.
Trados 0.8 0.04 N.A.
Total 2.1 0.04 N.A.

Table 5.4.3: Phase III parse times for the ENGCG. The first column gives the total time to attempt a parse of each sentence.

Char. A B C D E F G Avg.
Dynix 098% 100% 0% 98% 098% 83% 090% 94%
Lotus 100% 099% 0% 99% 098% 93% 082% 95%
Trados 098% 100% 0% 98% 100% 75% 100% 95%
Average 098% 100% 0% 98% 099% 82% 091% 95%

Table 5.5.3: Phase III Analysis of the ability of the ENGCG to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 5.2.
The amount of ambiguity became somewhat smaller as far as morphological analysis and grammar-based morphological disambiguation are concerned. This is probably mainly due to the use of compound nouns in the analysis. The effect of the heuristic constraints decreased considerably as a result of the corrections: the overall ambiguity rate is somewhat higher than originally.
The error rate fell very low. Almost all of the errors were due to the heuristic constraints.
Feature Description Example
A adjective small
ABBR abbreviation Ltd.
ADV adverb soon
CC coordinating conjunction and
CS subordinating conjunction that
DET determiner any
INFMARK> infinitive marker to
INTERJ interjection hooray
N noun house
NEGPART negative particle not
NUM numeral two
PCP1 ing form writing
PCP2 ed/en form written
PREP preposition in
PRON pronoun this
V verb write

Table 5.6: Categories in ENGCG which are considered to have part-of-speech status.

Feature Description Example
<**CLB> clause boundary which
<Def> definite the
<Genord> general ordinal next
<Indef> indefinite an
<Quant> quantifier some
ABS absolute form much
ART article the
CENTRAL central determiner this
CMP comparative form more
DEM demonstrative determiner that
GEN genitive whose
NEG negative form neither
PL plural few
POST postdeterminer much
PRE predeterminer all
SG singular much
SG/PL singular or plural some
SUP superlative form most
WH wh-determiner whose

Table 5.7: The finer ENGCG classifiers for determiners.
5.6.2 Observations about Syntax
Table 5.15 compares the original ENGCG syntax to the changes made in Phase III. All figures are given for the original computer manual data. Note that the syntax is evaluated against morphologically correct input. This means that the error rate is approximately one percent better than it would have been without this two-stage testing procedure.
The error rate diminished most radically in the Lotus text. The reason for this improvement was that it contained many recurrent structures which were not handled properly by the ENGCG parser. Because some of the changes made the rules more restrictive, there was also a marginal rise in the ambiguity rate as a result. This loss could be regained by using additional levels of less restrictive, heuristic rules. A newly implemented facility makes it possible to use up to 255 successive levels, thus making it possible to choose between less reliable but slightly more ambiguous and more reliable but less ambiguous output.
The ambiguities (866 words) are distributed over part-of-speech classes as seen in Table 5.16.
Syntactic ambiguities for nouns are difficult to handle without considerable effort because the existing constraints on them are numerous and rather interdependent. Therefore, the first thing to tackle is the PP attachment ambiguity. It is a relatively isolated syntactic phenomenon, but lexical information also seems necessary for making correct attachment decisions. Extending the ENGCG lexicon to contain corpus-based collocational information is obviously the next thing to do.
When counting the effort put into the changes, two different things must be taken into account:
• preparing the benchmark corpora: 3 workdays
• finding errors in the corpus, writing and testing new rules: 2 workdays
For the third part of this experiment 15 new rules were added, so that the syntactic part now comprises 1,154 rules (310 mapping rules and 844 syntactic disambiguation rules).
The modification of the syntax was mainly concerned with diminishing the error rate, not resolving the remaining ambiguities. Some rules that seemed to leak were corrected by making the context conditions more restrictive. There were 24 such changes or additions to context conditions. Therefore, some of the correct rule applications were also blocked, which led to a slight increase in the ambiguity rate. The increase was about 0.5%, as seen in Table 5.15.
With the modifications made to the syntax, a certain stagnation phase could be observed: an improvement in one set of sentences brought about an error or two in others.
Ambiguity Readings/ Error
Rate (%) Word Rate (%)
Lexicon + Guesser 35-50 1.7-2.2 0.1
Constraints 03-07 1.04-1.09 0.2-0.3
Heuristics 02-04 1.02-1.04 0.3-0.6

Table 5.8: Summary of the results of recent test-based evaluations of ENGCG disambiguation. Ambiguity rate indicates the percentage of (morphologically or syntactically) ambiguous words. Readings/word indicates how many (morphological or syntactic) readings there are per word, on average. Error rate indicates the percentage of those words without an appropriate (morphological or syntactic) reading. The figures are cumulative, e.g. the errors after the lookup of syntactic ambiguities include earlier errors due to morphological analysis and disambiguation.

Ambiguity Readings/ Error
Rate (%) Word Rate (%)
Syntactic Lookup 40-55 3.0-4.5 0.3-0.8
Constraints 10-15 1.15-1.25 1.5-2.5

Table 5.9: Summary of the results of recent test-based evaluations of ENGCG syntax. See explanations above.
All but two of the rules are general and, because they have also been tested against the benchmark corpora, which comprise some 45,000 words, they have proved to be valid as ENGCG rules. For instance, the constraint
(@w =s! (@APP)
(0 N)
(1 CC/COMMA)
(*-1 COLON LR1)
(LR1 NPL))
means that for a noun, the apposition (@APP) reading is to be accepted if there is a coordinating conjunction or comma immediately to the right, and if somewhere to the left there is a colon which is immediately preceded by a plural noun.
Other frequent error types were due to text-specific word sequences, e.g. compounds. The proper way to treat them is probably not in the syntax, but in preprocessing (at least in the ENGCG system).
The first type is illustrated by a sentence from the Trados data:
I III
Correct 820 829
Ambiguous 243 247
Incorrect 020 011

Table 5.10: Preposition attachment ambiguity with original syntax (I) and with modified syntax (III).

Readings Readings/Word
Lexicon + Guesser 19409 2.15
Constraints 10029 1.11
Heuristics 09374 1.04

Table 5.11: Ambiguity statistics for the unmodified ENGCG morphological disambiguator when applied to the three texts.

Errors Cumulative Cum. Error
Rate (%)
Lexicon 58 058 0.64
Lexical Guesser 02 060 0.66
Grammar 58 118 1.31
Heuristic Rules 27 145 1.61

Table 5.12: Error sources for the unmodified ENGCG morphological disambiguator when applied to the three texts.
Now press the key combination [Alt] + [1].
The second type is compounds containing plural premodifiers. Since plural premodification is very restricted in ENGCG syntax, one or more errors resulted from the items:
Options command
Windows word processor
Windows text format
Non-Translatable Paragraphs... menu item
SERIES field
It seems possible to use the capitalisation of the plural noun as a syntactic cue which allows the premodification. But here too, a better solution is to add these items as compound nouns to the lexicon.
Readings Readings/Word
Lexicon + Guesser 17319 1.92
Constraints 10006 1.11
Heuristics 09479 1.05

Table 5.13: Ambiguity statistics for morphological disambiguation (Phase III).

Errors Cumulative Cum. Error
Rate (%)
Lexicon 00 00 0.00
Lexical Guesser 00 00 0.00
Grammar 01 01 0.01
Heur. Constraints 14 15 0.17

Table 5.14: Error sources for morphological disambiguation (Phase III).
5.7 Converting Parse Tree to Dependency Notation

Due to morphological and syntactic ambiguity in the test sentence That is, these words make the source sentence longer or shorter than the TM sentence., the analysis shown below cannot be converted mechanically into a dependency structure. Analysis III below contains all correct labels. Note, however, that even those syntactic labels which indicate the direction of the head (> or <, to the right or to the left, respectively) are underspecific, so that they do not state explicitly which word is the head. If the direction is not specified, the tag is even more underspecific, e.g. the correct unambiguous subject label (@SUBJ) says exactly that the word is the head of some subject NP, i.e. that somewhere in the sentence there is a main verb as a head.
What is needed is therefore some additional processing (i) to specify all dependency links between legitimate syntactic labels and (ii) to disambiguate, i.e. remove all illegitimate syntactic tags.
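As a highly simplified illustration of task (i) only, a directional tag such as @DN> can be read as an instruction to link a word to the nearest plausible head in the indicated direction. The sketch below is our own toy illustration under that reading, with an invented head test; it is not a proposal for a complete conversion, which, as the next paragraph argues, would be far from this simple:

# Toy sketch of reading directional tags as dependency links: @...> attaches
# to the nearest possible head to the right, @<... to the nearest one to the
# left. The head test and the single-tag-per-word format are assumptions.

def link_directional_tags(words):
    """words: list of (form, tag) pairs; returns (dependent_index, head_index) links."""
    links = []
    for i, (form, tag) in enumerate(words):
        if tag.endswith(">"):                      # e.g. @DN>, @AN>, @NN>
            heads = range(i + 1, len(words))
        elif tag.startswith("@<"):                 # e.g. @<P, @<NOM
            heads = range(i - 1, -1, -1)
        else:
            continue
        for j in heads:
            if words[j][1] in ("HEAD-N", "HEAD-V"):   # toy notion of a possible head
                links.append((i, j))
                break
    return links

words = [("the", "@DN>"), ("source", "@NN>"), ("sentence", "HEAD-N")]
print(link_directional_tags(words))                 # -> [(0, 2), (1, 2)]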
Both of the aforementioned tasks are nontrivial, and it is therefore extremely speculative to try to convert the representation into an explicit dependency structure. A possible solution to the reduction of syntactic ambiguity is suggested by Tapanainen and Jarvinen (1994), who report success rates between 88% and 94% in various text samples when the syntactic ambiguity is reduced to zero. Note that a unique syntactic dependency structure does not require that all morphological ambiguities are removed, e.g. a finite verb (@+FMAINV) may be either in the present tense or a subjunctive, but still syntactically unique.

Data Dynix Lotus Trados
Phase I III I III I III
Errors 83 54 76 24 136 66
% 2.9 1.9 3.1 1.0 3.8 1.9
Ambiguous Words 273 283 283 304 322 327
% 9.5 9.9 11.6 12.3 9.0 9.2
Corr. Unamb. 65 67 75 71 44 49
Corr. 145 161 150 177 117 150
Words 2868 2459 3558
MSL 14.3 12.7 17.8

Table 5.15: Results of the syntactic disambiguation. Codes: I = initially; III = modified syntax; Corr. = Number of correct sentences; Corr. Unamb. = Number of correct and unambiguous sentences; MSL = mean sentence length.

N PREP PCP1 V PCP2 NUM PRON A
383 247 67 60 50 26 15 10

Table 5.16: Syntactic ambiguities within morphological classes.

In the current output, the ambiguities are present because it is not possible to disambiguate some local contexts without the risk of several successive errors as a domino effect. Consider the noun/verb ambiguity in the word sentence. Let us suppose that the guess is that sentence is a noun. After that decision the current syntax would choose the word source as a premodifying noun (@NN>), and the subject (@SUBJ) reading would be rejected as an illegitimate option. If, on the other hand, the main verb reading for sentence had been chosen, the grammar would also accept the subject label for source and probably the adverb (@ADVL) reading for the word longer, i.e. three successive errors.
"<That=is>"
"that=is" <*> ADV ADVL @ADVL
"<$,>"
"<these>"
"this" DET CENTRAL DEM PL @DN>
"<words>"
"word" N NOM PL @SUBJ
ENGCG 85

"<make>"
"make" V PRES -SG3 VFIN @+FMAINV
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<source>"
"source" N NOM SG @SUBJ @NN>
"<sentence>"
"sentence" N NOM SG @OBJ
"sentence" V INF @-FMAINV
"sentence" V PRES -SG3 VFIN @+FMAINV
"<longer>"
"long" ADV CMP ADVL @ADVL
"long" A CMP @PCOMPL-O
"<or>"
"or" CC @CC
"<shorter>"
"short" A CMP @PCOMPL-S @PCOMPL-O @<NOM
"<than>"
"than" PREP @<NOM
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<*t*m>"
"t*m" <*> ABBR NOM SG @NN>
"<sentence>"
"sentence" N NOM SG @<P
"<$.>"
5.8 Summary of Findings

The computer manuals were surprisingly hard even for this extensively tested and modified system.
Promising techniques for domain lexicon construction were proposed. They should be extended and elaborated further. Still, it is likely that some manual lexicon updating will remain necessary in the future.
Most other errors turned out to be due to the grammar, rather than the text.
At least some text-specific phenomena appear to be manageable by means of simple subgrammars that can be applied before the main grammar.
Modifying the lexicon and grammar is easy and took relatively little time (a couple of days in all).
No new domain-generic constraints were written for morphological disambiguation. These could probably be written to improve the result; this remains to be tested with larger material.
5.9 References
Jarvinen, T. (1994). Annotating 200 million words: the Bank of English project. Proceedings of COLING-94, Kyoto, Japan, Vol. 1.
Karlsson, F. (1990). Constraint Grammar as a Framework for Parsing Running Text. Proceedings of COLING-90, Helsinki, Finland, Vol. 3.
Karlsson, F., Voutilainen, A., Anttila, A., & Heikkila, J. (1991). Constraint Grammar: a Language-Independent System for Parsing Unrestricted Text, with an Application to English. In Natural Language Text Retrieval: Workshop Notes from the Ninth National Conference on Artificial Intelligence (AAAI-91). Anaheim, CA: American Association for Artificial Intelligence.
Karlsson, F., Voutilainen, A., Heikkila, J., & Anttila, A. (Eds.) (1995). Constraint Grammar. A Language-Independent System for Parsing Unrestricted Text. Berlin, Germany, New York, NY: Mouton de Gruyter.
Koskenniemi, K. (1983). Two-level Morphology. A General Computational Model for Word-form Production and Generation (Publication No. 11). Helsinki, Finland: University of Helsinki, Department of General Linguistics.
Kyto, M., & Voutilainen, A. (1995). Backdating the English Constraint Grammar for the analysis of English historical texts. Proceedings of the 12th International Conference on Historical Linguistics, ICHL, 13-18 August 1995, University of Manchester, UK.
Tapanainen, P., & Jarvinen, T. (1994). Syntactic analysis of natural language using linguistic rules and corpus-based patterns. Proceedings of COLING-94, Kyoto, Japan, Vol. 1.
Tapanainen, P., & Voutilainen, A. (1994). Tagging accurately: Don't guess if you know. Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, Stuttgart, Germany.
Voutilainen, A. (1993). NPtool, a Detector of English Noun Phrases. Proceedings of the Workshop on Very Large Corpora, Ohio State University, Ohio, USA.
Voutilainen, A. (1994). A noun phrase parser of English. In Robert Eklund (Ed.) Proceedings of `9:e Nordiska Datalingvistikdagarna', Stockholm, Sweden, 3-5 June 1993. Stockholm, Sweden: Stockholm University, Department of Linguistics and Computational Linguistics.
Voutilainen, A. (1995a). A syntax-based part of speech analyser. Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.
Voutilainen, A. (1995b). Experiments with heuristics. In F. Karlsson, A. Voutilainen, J. Heikkila & A. Anttila (Eds.) Constraint Grammar. A Language-Independent System for Parsing Unrestricted Text. Berlin, Germany, New York, NY: Mouton de Gruyter.
Voutilainen, A., & Heikkila, J. (1994). An English constraint grammar (ENGCG): a surface-syntactic parser of English. In U. Fries, G. Tottie & P. Schneider (Eds.) Creating and using English language corpora. Amsterdam, The Netherlands: Rodopi.
Voutilainen, A., Heikkila, J., & Anttila, A. (1992). Constraint Grammar of English. A Performance-Oriented Introduction (Publication 21). Helsinki, Finland: University of Helsinki, Department of General Linguistics.
Voutilainen, A., & Jarvinen, T. (1995). Specifying a shallow grammatical representation for parsing purposes. Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland, 1995.
6
Using the Link Parser of Sleator and Temperley to Analyse a Software Manual Corpus
Richard F. E. Sutcliffe1
Annette McElligott
University of Limerick
6.1 Introduction
The Link Parser (LPARSER) is a public domain program capable of analysing a wide range of constructions in English (Sleator and Temperley, 1991). The system works with a Link Grammar (LG) which is a lexicon of syntagmatic patterns. During parsing, the syntagmatic requirements of each lexeme must be satisfied simultaneously, and in this respect LPARSER is similar to the PLAIN system (Hellwig, 1980). The object of the study outlined in this article was to establish the efficacy of the Link Parser for analysing the utterances contained in technical instruction manuals of the kind typically supplied with PC software.
The work was conducted on a corpus of 60 utterances obtained from three different manuals, the Dynix Automated Library Systems Searching Manual (1991), the Lotus Ami Pro for Windows User's Guide Release Three (1992) and the Trados Translator's Workbench for Windows
1 Address: Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland. Tel: +353 61 202706 (Sutcliffe), +353 61 202724 (McElligott), Fax: +353 61 330876, Email: richard.sutcliffe@ul.ie, annette.mcelligott@ul.ie. We are indebted to the National Software Directorate of Ireland under the project `Analysing Free Text with Link Grammars' and to the European Union under the project `Selecting Information from Text (SIFT)' (LRE-62030) for supporting this research. We also thank Microsoft who funded previous work investigating the use of LPARSER in computer assisted language learning (Brehony, 1994). We also acknowledge gratefully the help of Michael C. Ferris of Lotus Development Ireland. This work could not have been done without the assistance of Denis Hickey, Tony Molloy and Redmond O'Brien.
Characteristic A B C D E F G
Link Parser yes yes yes yes no yes yes

Table 6.1: Linguistic characteristics which can be detected by LPARSER. See Table 6.2 for an explanation of the letter codes.

Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 6.2: Letter codes used in Tables 6.1, 6.5.1, 6.5.2 and 6.5.3.
User's Guide (1995). The aim was to determine the following information:
• how many of the utterances could be recognised by LPARSER,
• how accurate the analyses returned were,
• what changes could be made to the system to improve coverage and accuracy in this domain.
Longer-term aims are to integrate LPARSER into a language engineering workbench and to develop methods for extracting information from the linkages returned.
The work was divided into three phases. In the first phase, an attempt was made to analyse the corpus using the LPARSER system with its original lexicon, using minimal pre-processing. In the second phase, an initial pass was made through the utterances converting each instance of a multiple word term into a single lexeme, and some new entries were added to the lexicon. In the third phase, some existing entries in the lexicon were changed. In the next section we briefly describe the characteristics of LPARSER. After this we describe the work carried out in Phases I, II and III before summarising the results.
6.2 Description of Parsing System

Link Parsing is based upon syntagmatic links between terminal symbols in the language. Each link has a type and joins two classes of word. In the following example, a link of type D joins a word of class determiner to a word of class noun:
+---D---+
| |
the software.n
The direct object of a verb phrase is linked to the verb by a link of type O:
type O:
+-------O------+
| +---D---+
| | |
install.v the software.n
Thirdly, the subject of a sentence can be connected to the verb by a link of type S:
link of type S:
+-------O------+
+---S--+ +---D---+
| | | |
you install.v the software.n
All grammatical constructions are captured by the use of simple binary links between pairs of words in the utterance. The result of a parse is thus always a set of binary relations which can be drawn as a graph. In many cases the analysis is in fact a tree, as the above examples show.
The process of parsing with a link grammar constitutes the search for a linkage of the input sentence which has the following properties:
• The links must not cross.
• All words of the sentence are connected together.
• The linking requirements of each word are met.
The linking requirements for each word are expressed as a pattern. The basic link parsing system comes with a set of around 840 syntagmatic patterns which are linked to a lexicon of 25,000 entry points. The existing grammar can handle a wide range of phenomena, including complex coordination, separated verb particles, imperatives, some topicalisations, punctuation, compound nouns and number agreement.
The system comes with a parser which uses a very efficient search algorithm to determine all legal analyses of an input sentence. In addition, the parser employs a number of heuristics to order the multiple analyses returned for ambiguous inputs by their probability of correctness. Thus the most promising parses are returned to the user first. During each of the three phases of this study, results were computed by analysing the first parse tree returned by the LPARSER system.
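The first two linkage conditions listed above can be checked mechanically for a candidate set of links. The following is a small sketch of such a check (our own illustration, not part of LPARSER), representing a linkage as pairs of word positions; checking the third condition would additionally require the per-word linking patterns:

# Sketch of checking the first two linkage conditions above: links are
# (i, j) word-index pairs. This illustrates the conditions, not LPARSER.

def is_valid_linkage(num_words, links):
    # 1. Links must not cross: no two links (a, b) and (c, d) with a < c < b < d.
    ordered = [tuple(sorted(l)) for l in links]
    for a, b in ordered:
        for c, d in ordered:
            if a < c < b < d:
                return False
    # 2. All words must be connected together (one connected component).
    reachable = {0}
    changed = True
    while changed:
        changed = False
        for a, b in ordered:
            if (a in reachable) != (b in reachable):
                reachable |= {a, b}
                changed = True
    return reachable == set(range(num_words))

# "you install the software": S(0,1), O(1,3), D(2,3)
print(is_valid_linkage(4, [(0, 1), (1, 3), (2, 3)]))   # -> True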
6.3 Parser Evaluation Criteria
Prior to undertaking the three phases of analysis we will clarify what
we understood by the linguistic characteristics speci ed in Tables 6.1
and 6.2. The utterance `Scrolling changes the display but does not move
the insertion point.' will be used to illustrate the points. Firstly, when
counting verbs, we include auxiliaries. Thus in the example the verbs
are `changes', `does' and `move'. The nouns are `Scrolling', `display',
`insertion' and `point'. Naturally we only count a word as a noun if it is
serving as a noun in the particular context in which it is being observed.
Thus `changes' is not a noun. The same applies for verbs, meaning that
`display' is not a verb. `not' is not counted as a verb and neither are
nominalisations like `Scrolling'.
The number of nouns and verbs in each utterance was established by
tagging the corpus with the Brill Tagger (Brill, 1993) and correcting the
output by hand.
We took category C to mean compound nouns only. Thus we did
not consider compound verbs (such as `write down') in any phase of the
analysis. In Phase I no compound analysis was allowed before parsing.
However, LPARSER can analyse compound nouns quite well.
In our analysis there are deemed to be two types of phrase: noun
phrases (NPs) and prepositional phrases (PPs). Consider the utterance
`The contents of the Clipboard appear in the desired location'. Here, the
NPs are `The contents of the Clipboard' and `the desired location', while
the PPs are `of the Clipboard' and `in the desired location'. Verb groups
were not considered in the analysis as they tended to be very simple in the
test corpus. In the utterance `Choose Edit/Cut or Edit/Copy to place the selected text on the Clipboard.', the PP `on the Clipboard' should attach to `place' because the Clipboard is the desired destination of `the selected text'. If `on the Clipboard' attached to the preceding NP, it would be because it was being used to specify a particular type of selected text which was to be placed in some unknown location. For our purposes the first interpretation is deemed correct.
Much ambiguity exists in relation to phrase attachment. The main
situation in which an attachment ambiguity occurs is where the verb
group is followed by a simple NP followed by a PP, as in `Ami Pro
provides three modes for typing text'. Here `for typing text' may at-
tach either to the verb `provides' or to the simple NP following it `three
modes'. Compare the following parse trees:

+-------O------+
+-MP+---S---+ +---D--+--M--+---M--+---O---+
| | | | | | | |
Ami Pro provides.v three modes.n for typing.v text.n

+---------EV---------+
+-------O------+ |
+-MP+---S---+ +---D--+ +---M--+---O---+
| | | | | | | |
Ami Pro provides.v three modes.n for typing.v text.n

This example shows a case in which we consider that either attachment is correct: a particular mode being picked out in the first tree and a particular means of providing something in the second tree. Thus the parser would be deemed to have attached the PP correctly irrespective of which parse was produced first. In this study it was decided to include constructions such as `for typing text' in Category F because, while they are not strictly PPs, they exhibit identical attachment behaviour. In order to perform the analysis, each such attachment problem was inspected and one or more of the possible attachments was selected as being correct.
As the LPARSER does not directly recognise predicate-argument
relations, this analysis was not performed.
In the final category the forms of coordination considered were those
using `and', `or' and `,'. For each utterance including such a coordination,
a decision was made as to what the correct analysis should be. This was
then compared with the candidate parse tree which was thus judged
correct or incorrect. In the utterance `Write down any information you
need, or select the item if you are placing a hold.', `or' is grouping two
separate clauses, `Write down any information you need,' and `select the
item if you are placing a hold.'.
When performing calculations for all phases, the number of each characteristic correctly identified by the LPARSER system was divided by the number that should have been correctly identified. Results in all cases were multiplied by 100 in order to produce a percentage.
When performing calculations for Tables 6.4.1, 6.4.2 and 6.4.3 the
time required to load the system was omitted. The time to load the
LPARSER system using a SPARC II with 32 MB is 9.2s.

6.4 Analysis I: Original Grammar, Original Vocabulary

6.4.1 Pre-Processing
Phase I of the study comprised an analysis of the test sentences using the parsing system in its original form. Only the process of lexical analysis could be tailored to suit the task. Various transformations were performed on each input utterance before it was submitted to LPARSER for analysis. The changes affect punctuation, quotations, the use of the ampersand and minus sign, and material in round brackets.
Firstly, LPARSER does not recognise punctuation other than the full
stop and comma. In particular, two clauses separated by a colon can not
be recognised as such, even if each clause can be recognised separately.
For this reason, any colon occurring in an input utterance was deleted.
Furthermore, if that colon occurred in the middle of the utterance, the
text was split into two utterances at that point, and each half was placed
on a separate line to be processed separately.
Secondly, the LPARSER system cannot handle utterances containing quotations such as `The "View" tab in Word's Options dialog'. Therefore each such quotation was transformed into a single capital S which LPARSER will assume is a proper name and analyse accordingly.
Next, there is the problem of ampersands and minus signs. An ampersand is occasionally used to group two constituents together (e.g. `Drag & Drop'). Any such ampersand was changed into `and' (e.g. `Drag and Drop'). (Note that in Phase II, compound processing was carried out before ampersand substitution to avoid the corruption of terms.) Lexical analysis in LPARSER is not designed for material in round brackets such as `("No match")'. Such material can be syntactically anomalous (as here) and moreover can be inserted almost anywhere in what is otherwise a grammatical utterance. All material in brackets was thus deleted from the texts and no further attempt was made to analyse it.
Capitalisation causes many problems. By default, any word starting
with a capital will be considered as a proper name in LPARSER unless
it occurs at the start of a sentence. This allows the system to recognise
unknown terms such as `Ami Pro' with surprising felicity. However, a
section heading in a document may be capitalised even though it contains
normal vocabulary as well as terms (e.g. `Alphabetical Title Search' is
really `alphabetical title search'). In this case incorrect recognition can
result. As an initial solution to this problem it was decided to convert the first word of any utterance other than a sentence or NP to lower case but to leave the case of all other words as found. The case of the initial word in either a sentence or an NP was left unaltered.
Finally, one characteristic of the original LPARSER system is that it can only recognise complete sentences and not grammatically correct constituents such as NPs (e.g. `Byzantine empire') or infinitive verb phrases (IVPs) (e.g. `To use keyboard shortcuts to navigate a document'). Because such constructions occur frequently in the test corpus, a simple addition was made to any utterance which was not a sentence in order to transform it into one. The exact transformations and additions used for each type of utterance can be seen as follows (a sketch of the overall pre-processing appears after the list):
1. `:' deleted and utterance split in two if necessary,
2. quotation `"..."' changed into `S',
3. `&' changed into `and' (in Phase I only),
4. bracketed material `(...)' deleted,
5. sentence or verb phrase submitted with no change,
6. infinitive verb phrase prefixed with `it is',
7. third person singular prefixed with `it',
8. progressive verb phrase prefixed with `it is',
9. noun phrase prefixed with `it is' and then `it is the'.
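
A minimal Python sketch of this pre-processing, written by us for illustration rather than taken from the study, might look as follows; the utterance type is assumed to be known in advance, and the double prefixing of noun phrases is reduced to the single prefix `it is the'.

import re

# A minimal sketch, assuming the utterance type is supplied from outside.
# It applies the Phase I transformations described in the text.
def preprocess(utterance, utype):
    # 4. Delete bracketed material.
    text = re.sub(r"\([^)]*\)", "", utterance)
    # 2. Replace each quotation with a single capital S.
    text = re.sub(r'"[^"]*"', "S", text)
    # 3. Replace ampersands with 'and' (Phase I ordering).
    text = text.replace("&", "and")
    # 1. Delete colons, splitting the utterance if the colon is internal.
    parts = [p.strip() for p in text.split(":") if p.strip()]
    # 5-9. Prefix incomplete constituents to turn them into sentences
    # (noun phrases: only the second prefix, 'it is the', is shown here).
    prefix = {"sentence": "", "verb_phrase": "",
              "infinitive_vp": "it is ", "third_singular": "it ",
              "progressive_vp": "it is ", "noun_phrase": "it is the "}[utype]
    return [prefix + p for p in parts]

print(preprocess('To use keyboard shortcuts to navigate a document',
                 "infinitive_vp"))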

6.4.2 Results
Data concerning the acceptance rates for Phase I may be seen in
Table 6.3.1. Overall 40% of input utterances could be recognised. By
this we mean simply that a linkage could be produced for this proportion
of the utterances. Of the 60% rejected (i.e., 36 utterances), 58% (i.e.,
21 utterances) failed the parsing process as they contained one or more
words which were not in the lexicon.
The times required to process utterances in Phase I may be seen in
Table 6.4.1. The Total Time to Parse entry of 5626s for Dynix text
is artificially inflated by one utterance, `Displays the records that have a specific word or words in the TITLE, CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which fields have been included in each index.'. LPARSER finds 4424 parses of this utterance,
a process which takes 4206s. If this utterance had been deleted the time
to analyse the remaining 19 would be 1420s. It is a characteristic of
LPARSER that the occasional highly ambiguous input is slow to parse
while the majority of utterances are analysed very quickly.
The next step was to determine the accuracy of the analyses. The
issue of attachment is important for us because we wish to construct ac-
curate semantic case frames from parse trees for use in applications such

Number Accept Reject % Accept % Reject
Dynix 20 09 11 45 55
Lotus 20 09 11 45 55
Trados 20 06 14 30 70
Total 60 24 36 40 60

Table 6.3.1: Phase I acceptance and rejection rates for LPARSER.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 5626 211.3 0.60
Lotus 0920 001.0 0.03
Trados 3290 002.9 0.50
Total 9836 071.7 0.38

Table 6.4.1: Phase I parse times for LPARSER using a SPARC II with 32 MB. The first column gives the total time to attempt a parse of each sentence. See the text for a discussion of the Dynix times.
Char. A B C D E F G Avg.
Dynix 088% 100% 100% 094% 0% 088% 100% 81%
Lotus 100% 100% 100% 100% 0% 100% 083% 83%
Trados 100% 100% 100% 092% 0% 090% 100% 83%
Average 096% 100% 100% 095% 0% 093% 094% 82%

Table 6.5.1: Phase I Analysis of the ability of LPARSER to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 6.2.

as information retrieval. The results of the ability of LPARSER to recognise certain linguistic characteristics can be seen in Table 6.5.1. These
results show that the LPARSER is excellent at recognising nouns and
nearly as good at recognising verbs. While compound analysis was not
performed in this phase a check was made to see if LPARSER recognised
sequences of nouns that would constitute noun compounds in subsequent
phases, for example, `research computer centers'.
In the Dynix text, 10 utterances contained PPs of which 94% were
recognised. In the Lotus text 11 contained PPs, all of which were recog-
nised. In the Trados text 14 contained PPs of which 92% were recognised.

Having undertaken this analysis, it was observed that LPARSER almost always attaches a PP to the closest constituent, which is not always correct.
An analysis of coordination was also carried out as part of Phase I. A number of different coordination types were identified in the test corpus. The most commonly occurring types of coordination are noun (e.g. `SHIFT+INS or CTRL+V'), verb (e.g. `move or copy'), determiner (e.g. `1, 2, 3, or 4'), adjective (e.g. `longer or shorter') and NP (e.g. `libraries, research computer centers, and universities'). The main finding is that LPARSER is extremely good at handling coordination. In many cases, the system is only wrong because the coordination is influenced by the attachment of a PP which sometimes happens to be incorrect.
A final point to note here is that the overall averages (i.e., the last
column in Table 6.5.1) are distorted due to the fact that LPARSER does
not recognise predicate-argument relations.

6.5 Analysis II: Original Grammar, Additional Vocabulary
The following steps were undertaken in this stage. Firstly, the input text was subjected to compound processing. Each utterance was searched for potential compounds and the longest possible instances found were replaced by a single lexeme. Thus, for example, the two words `Ami Pro' would be replaced by a single lexeme. Each such compound was then given its own lexical entry, providing it with the syntagmatic linkage of an existing word. Thus, for example, `Ami Pro' could be made equivalent to `John' (a proper name).
Secondly, the corpus was run through the parser to determine all
words not in the lexicon. Each such word was then added to the lexicon.
As with the terminology, an appropriate word in the existing lexicon
was selected and its template used verbatim as the linkage requirements
for the new word. No new syntagmatic templates were added in this
phase. Neither did we allow a word which was already in the lexicon
to be moved from one template to another. The corpus was then run
through the parser again.
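
As a rough illustration of the compound processing step (our own sketch, not the tool actually used in the study), the following Python function replaces the longest matching multi-word terms in an utterance with single lexemes; the term list and the underscore joining convention are invented for the example.

# A minimal sketch, assuming a known list of multi-word terms.
# The longest candidate term matching at each position wins.
TERMS = ["Ami Pro", "insertion point", "keyword search"]

def compound_process(utterance):
    tokens = utterance.split()
    result, i = [], 0
    while i < len(tokens):
        match = None
        for term in sorted(TERMS, key=lambda t: -len(t.split())):
            n = len(term.split())
            if tokens[i:i + n] == term.split():
                match = term
                break
        if match:
            result.append(match.replace(" ", "_"))  # one lexeme per term
            i += len(match.split())
        else:
            result.append(tokens[i])
            i += 1
    return " ".join(result)

print(compound_process("Ami Pro moves the insertion point"))
# -> 'Ami_Pro moves the insertion_point'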
The results for this phase can be seen in Tables 6.3.2, 6.4.2 and 6.5.2.
The following observations were made from these results. Firstly, by performing compound processing and adding missing entries to the lexicon, the proportion of utterances recognised by LPARSER increased by 25 percentage points to an overall total of 65%. Secondly, parsing times were much improved for those utterances that could be recognised, while utterances for which no linkage could be found took on average more time to parse than in Phase I. Once again the Dynix utterance `Displays the records

Number Accept Reject % Accept % Reject
Dynix 20 15 05 75 25
Lotus 20 14 06 70 30
Trados 20 10 10 50 50
Total 60 39 21 65 35

Table 6.3.2: Phase II acceptance and rejection rates for LPARSER.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 5124 204.1 0.9
Lotus 0860 000.6 0.1
Trados 0650 000.2 0.5
Total 6634 068.3 0.5

Table 6.4.2: Phase II parse times for LPARSER. The first column gives the total time to attempt a parse of each sentence. See the text for a discussion of the Dynix times.
Char. A B C D E F G Avg.
Dynix 092% 100% 100% 094% 0% 093% 100% 83%
Lotus 100% 100% 100% 100% 0% 100% 094% 85%
Trados 100% 100% 100% 095% 0% 094% 100% 86%
Average 097% 100% 100% 096% 0% 096% 098% 84%

Table 6.5.2: Phase II Analysis of the ability of LPARSER to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 6.2.

that have a specific word or words in the TITLE, CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which fields have been included in each index.' increased the total time to parse, in this case from 1050s to 5124s. Comparing Table 6.4.2 with 6.4.1 we can see that it now takes a longer period of time to reject an utterance than it did in Phase I. This is due to the fact that an utterance is rejected more quickly if one or more words are not in the lexicon than if all possible candidate linkages must first be enumerated. Thirdly, the ability of LPARSER to recognise certain linguistic characteristics has stayed the same or improved on all fronts, the most significant being its ability to recognise PP attachment.

A final point to note here is that the overall averages (i.e., the last
column in Table 6.5.2) are distorted due to the fact that LPARSER does
not recognise predicate-argument relations.

6.6 Analysis III: Altered Grammar, Additional Vocabulary
This phase allowed for alterations to the parsing algorithms, grammar or
vocabulary. In practice, the only change which was made to LPARSER
was to delete existing entries in the original lexicon for certain words, and
to insert new ones for them. This amounted to changing the syntagmatic
template associated with a word from one existing pattern to another.
No new patterns were added to the system during our analysis. In other
words, the fundamental `grammar' we were evaluating was exactly that
provided by Sleator and Temperley.
The main changes made to the lexicon were thus as follows:
1. Altering certain nouns from common nouns to proper nouns. This affects the linkage requirements in relation to preceding determiners.
2. Changing certain verbs so that intransitive as well as transitive
usages were allowed.
Perhaps the most surprising discovery was that while `right' was in
the lexicon as a noun, `left' was only present as a verb.
The results from this phase can be seen in Tables 6.3.3, 6.4.3 and
6.5.3. This time our results show an 82% acceptance rate with the
Dynix results attaining a 90% acceptance rate. The improvement in per-
formance has however increased the time to parse an utterance whereas
the time required to reject an utterance is similar to that of Phase II.
In terms of recognising certain linguistic characteristics we see improve-
ments again on all fronts, most notably those in relation to the ability
of the LPARSER to recognise verbs, phrase boundaries and PP attach-
ment. As previously stated in the results of Phases I and II the values in
the final column in Table 6.5.3 are distorted by the fact that LPARSER
does not recognise predicate-argument relations.

6.7 Converting Parse Tree to Dependency Notation
The output from LPARSER is a set of binary relations between terminal
symbols. It is thus already in a dependency format and needs no further

Number Accept Reject % Accept % Reject
Dynix 20 18 02 90 10
Lotus 20 16 04 80 20
Trados 20 15 05 75 25
Total 60 49 11 82 18

Table 6.3.3: Phase III acceptance and rejection rates for LPARSER.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 5374 204.3 0.9
Lotus 1101 000.7 0.1
Trados 0850 000.4 0.6
Total 7325 068.5 0.5

Table 6.4.3: Phase III parse times for LPARSER. The first column gives the total time to attempt a parse of each sentence. See the text for a discussion of the Dynix times.
Char. A B C D E F G Avg.
Dynix 098% 100% 100% 100% 0% 100% 100% 85%
Lotus 100% 100% 100% 100% 0% 100% 096% 85%
Trados 100% 100% 100% 098% 0% 098% 100% 85%
Average 099% 100% 100% 099% 0% 099% 099% 85%

Table 6.5.3: Phase III Analysis of the ability of LPARSER to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 6.2.

conversion. As it happens, LPARSER was unable to parse the utterance `That is, these words make the source sentence longer or shorter than the TM sentence.'.
It is interesting to note that while LPARSER returns a graph which
contains no non-terminal symbols, it is drawn by the system in a tree-
like fashion. When analysing linkages, the authors found that they were
reasoning in terms of the traditional concepts of phrase structure gram-
mar such as `noun phrase', `prepositional phrase', `verb group' and so
on. These appear to be present in the output while in fact they are only
imagined.

6.8 Summary of Findings


Our main findings are, firstly, that lexical analysis, pre-processing and capitalisation cause many difficulties in the technical domain. Punctuation can occur within a lexeme making it hard to be sure that the input is tokenised correctly. Some of the utterances in this analysis were in fact section headings or other labels. These are capitalised in a fashion that makes it difficult to distinguish between terms and ordinary vocabulary. Additionally, there are difficulties imposed by quotations and bracketed expressions whose analysis is also to a large extent a matter of pre-processing. While these problems sound relatively minor their effect on the accuracy of analysis can be considerable.
Secondly, LPARSER is extremely good at handling coordination.
Moreover the analysis of coordinated constructions is often correct if
constituents such as PPs are attached correctly.
Thirdly, when LPARSER encounters prepositional attachment phenomena it almost always returns first the analysis which attaches the PP to the closest constituent. This is often wrong. However, the selectional restrictions of verbs in the computer domain appear to be tightly constrained. Thus, for example, each verb takes only PPs heralded by certain prepositions. We have carried out an initial study, with good results, to determine whether this information could be used to attach constituents as a post-processing operation (Sutcliffe, Brehony and McElligott, 1994).
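
A post-processing pass of the kind just described might look roughly like the following Python sketch (our own illustration, not the method of Sutcliffe, Brehony and McElligott); the verb-preposition preference table is invented and dependencies are reduced to (dependent, head, relation) triples.

# A minimal sketch, assuming a table of the prepositions each verb
# prefers.  A PP attached to a noun is moved to a verb that prefers
# the preposition heading the PP.
VERB_PREPS = {"place": {"on", "in"}, "install": {"on"}}

def reattach(deps, pos):
    # deps: list of [dependent, head, relation]; pos: word -> part of speech
    for d in deps:
        dep, head, rel = d
        if (rel == "prep" and pos.get(head) == "noun"
                and any(dep in preps for preps in VERB_PREPS.values())):
            for word, tag in pos.items():
                if tag == "verb" and dep in VERB_PREPS.get(word, set()):
                    d[1] = word        # reattach the PP to the verb
                    break
    return deps

deps = [["text", "place", "obj"], ["on", "text", "prep"],
        ["Clipboard", "on", "pobj"]]
pos = {"place": "verb", "text": "noun", "on": "prep", "Clipboard": "noun"}
print(reattach(deps, pos))   # 'on' now attaches to 'place'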
Fourthly, the LPARSER is a remarkably efficient system. A typical sentence will be analysed in less than one second on a SPARC II. The time taken to parse a sentence is dependent on its length and degree of ambiguity, as with many other systems. This means that the occasional long and highly coordinated sentence can be very slow to analyse, because each possible analysis is effectively being computed.
Fifthly, LPARSER has certain weaknesses. Movement phenomena (e.g. topicalisation), certain forms of ellipsis (in particular gapping), idioms and adverbial constructions all cause problems for the system. For example, `Certain abbreviations may work at one prompt but not at another.' does not work whereas `Certain abbreviations may work at one prompt but they may not work at another prompt.' will parse correctly. LPARSER is unable to handle preposed infinitive complements, for example, `To perform an accelerated search, follow these instructions'. Generally, the syntagmatic approach to syntax is at its weakest when the order of constituents is least constrained. Adverbials are particularly problematic because they can occur in many different places in an utterance. Essentially each possible position of such a construct has to be catered for by a separate potential linkage in a syntagmatic template. Luckily the technical manual domain is one in which most of the above phenomena are rare.
In conclusion, we find LPARSER to be an impressive system. The parser is efficient in operation and moreover the syntagmatic lexicon
supplied with it can handle many complex constructions occurring in
the software manual domain.

6.9 References
Brehony, T. (1994). Francophone Stylistic Grammar Checking using
Link Grammars. Computer Assisted Language Learning, 7(3). Lisse,
The Netherlands: Swets and Zeitlinger.
Brill, E. (1993). A Corpus-Based Approach to Language Learning. Ph.D.
Dissertation, Department of Computer and Information Science, Uni-
versity of Pennsylvania.
Dynix (1991). Dynix Automated Library Systems Searching Manual.
Evanston, Illinois: Ameritech Inc.
Hellwig, P. (1980). PLAIN - A Program System for Dependency Analysis and for Simulating Natural Language Inference. In L. Bolc (Ed.), Representation and Processing of Natural Language (271-376). Munich, Germany; Vienna, Austria; London, UK: Hanser & Macmillan.
Lotus (1992). Lotus Ami Pro for Windows User's Guide Release Three.
Atlanta, Georgia: Lotus Development Corporation.
Sleator, D. D. K., & Temperley, D. (1991). Parsing English with a
Link Grammar (Technical Report CMU-CS-91-196). Pittsburgh, PA:
Carnegie Mellon University, School of Computer Science.
Sutcliffe, R. F. E., Brehony, T., & McElligott, A. (1994). Link Grammars
and Structural Ambiguity: A Study of Technical Text. Technical Note
UL-CSIS-94-15, Department of Computer Science and Information
Systems, University of Limerick, December 1994.
Trados (1995). Trados Translator's Workbench for Windows User's
Guide. Stuttgart, Germany: Trados GmbH.
7
Using PRINCIPAR to Analyse a
Software Manual Corpus
Dekang Lin1
University of Manitoba

7.1 Introduction
Government-Binding (GB) Theory (Chomsky, 1981, 1986) is a linguistic
theory based on principles, as opposed to rules. Whereas rule-based
approaches spell out the details of different language patterns, principle-
based approaches strive to capture the universal and innate constraints
of the world's languages. The constraints determine whether a sentence
can be assigned a legal structure. If it can, the sentence is considered
grammatical; otherwise, it is considered somehow deviant. A sentence
that can be assigned more than one structural representation satisfying
all the constraints is considered syntactically ambiguous.
Principle-based grammars offer many advantages (e.g. modularity, universality) over rule-based grammars. However, previous principle-based parsers are inefficient due to their generate-and-test design. In Lin (1993) and Lin (1994) we presented an efficient, broad-coverage, principle-based parser for the English language, called PRINCIPAR. It avoids the generate-and-test paradigm and its associated problems of inefficiency. Principles are applied to the descriptions of structures instead of to structures themselves. Structure building is, in a sense, deferred until the descriptions satisfy all principles.
In this chapter, we analyze the results of applying PRINCIPAR to
sentences from software manuals. Our experiments show that, contrary
to the belief of many researchers, principle-based parsing can be efficient and have broad coverage.
1 Address: Department of Computer Science, University of Manitoba, Winnipeg,
Manitoba, Canada, R3T 2N2. Tel: +1 204 474 9740, Fax: +1 204 269 9178, Email:
lindek@cs.umanitoba.ca.

[Figure: the input text passes through the Lexical Analyzer (drawing on the Lexicon) to produce lexical items, which the Message Passing GB Parser (driven by the Grammar Network) turns into a parse forest; the Parse Tree Retriever then extracts parse trees. The diagram distinguishes processing modules, static data, dynamic data and data flow.]
Figure 7.1: The architecture of PRINCIPAR.

7.2 Description of Parsing System


Figure 7.1 shows the architecture of PRINCIPAR. Sentence analysis is
divided into three steps. The lexical analyzer first converts the input
sentence into a set of lexical items. Then, a message passing algorithm
for principle-based parsing is used to construct a shared parse forest.
Finally, a parse tree retriever is used to enumerate the parse trees.

7.2.1 Parsing by Message Passing


The parser in PRINCIPAR is based on a message-passing framework pro-
posed by Lin (1993) and Lin and Goebel (1993), which uses a network
to encode the grammar. The nodes in the grammar network represent
grammatical categories (e.g., NP, Nbar, N) or subcategories, such as
V:NP (transitive verbs that take NPs as complements). The links in the
network represent relationships between the categories. GB-principles
are implemented as local constraints attached to the nodes and perco-
lation constraints attached to links in the network. Figure 7.2 depicts
a portion of the grammar network for English.
There are two types of links in the network: subsumption links
and dominance links.
- There is a subsumption link from A to B if A subsumes B. For example, since V subsumes V:NP and V:CP, there is a subsumption link from V to each one of them.
- There is a dominance link from node A to B if B can be immediately

[Figure: nodes include CP, Cbar, IP, C, Ibar, CPSpec, I, AP, PP, NP, VP, Abar, Pbar, Det, Nbar, Vbar, A, P, N, V, V:NP and V:CP; the legend distinguishes adjunct dominance, complement dominance, specifier dominance, head dominance, specialization and barrier links.]

Figure 7.2: A portion of the grammar network for English used by PRINCIPAR.

dominated by A. For example, since an Nbar may immediately dominate a PP adjunct, there is a dominance link from Nbar to PP.
A dominance link from A to B is associated with an integer id that determines the linear order between B and other categories dominated by A, and a binary attribute to specify whether B is optional or obligatory.2
2 In order to simplify the diagram, we did not label the links with their ids in Figure 7.2. Instead, the precedence between dominance links is indicated by their starting points, e.g., C precedes IP under Cbar since the link leading to C is to the left of the link leading to IP.
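
To make the data structure concrete, the following Python sketch (ours, not PRINCIPAR's C++ implementation) shows one way a fragment of such a grammar network could be represented; the node names come from Figure 7.2, while the representation itself is invented.

# A minimal sketch of a grammar network fragment, assuming dominance
# links carry an order id and an 'obligatory' flag as described above.
subsumption = {"V": ["V:NP", "V:CP"], "N": []}

dominance = {
    # parent: list of (child, order_id, obligatory)
    "Cbar": [("C", 1, False), ("IP", 2, True)],
    "Nbar": [("N", 1, True), ("PP", 2, False)],   # PP adjunct is optional
    "Vbar": [("V", 1, True), ("NP", 2, False)],
}

def ordered_children(parent):
    # Children of a node in the linear order fixed by the integer ids.
    return [c for c, _id, _obl in sorted(dominance[parent], key=lambda x: x[1])]

print(ordered_children("Cbar"))   # ['C', 'IP']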

Input sentences are parsed by passing messages in the grammar network. The nodes in the network are computing agents that communicate with each other by sending messages in the reverse direction of the links in the network. Each node has a local memory that stores a set of items. An item is a triplet that represents a (possibly incomplete) X-bar structure:

<str, att, src>,

where str is an integer interval [i,j] denoting the i'th to j'th word in the input sentence; att is the attribute values of the root node of the X-bar structure; and src is a set of source messages from which this item is combined. The source messages represent immediate constituents of the root node. Each node in the grammar network has a completion predicate that determines whether an item at the node is "complete," in which case the item is sent as a message to other nodes in the reverse direction of the links.
When a node receives an item, it attempts to combine the item with items from other nodes to form new items. Two items

<[i1,j1], A1, S1> and <[i2,j2], A2, S2>

can be combined if

1. their surface strings are adjacent to each other: i2 = j1 + 1.
2. their attribute values A1 and A2 are unifiable.
3. the source messages come via different links: links(S1) ∩ links(S2) = ∅, where links(S) is a function that, given a set of messages, returns the set of links via which the messages arrived.

The result of the combination is a new item:

<[i1,j2], unify(A1, A2), S1 ∪ S2>.

The new item represents a larger X-bar structure resulting from the combination of the two smaller ones. If the new item satisfies the local constraint of the node it is considered valid and saved into the local memory. Otherwise, it is discarded. A valid item satisfying the completion predicate of the node is sent further as messages to other nodes.
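
The combination step just described can be pictured with the following Python sketch (our own simplification, not PRINCIPAR's code); attribute unification is reduced to merging dictionaries that agree on shared keys, and source links are plain sets.

# A minimal sketch of item combination, assuming items are
# (span, attributes, source_links) and unification is dictionary merge.
def unify(a1, a2):
    if any(k in a2 and a2[k] != v for k, v in a1.items()):
        return None                      # conflicting attribute values
    return {**a1, **a2}

def combine(item1, item2):
    (i1, j1), a1, s1 = item1
    (i2, j2), a2, s2 = item2
    if i2 != j1 + 1:                     # 1. spans must be adjacent
        return None
    merged = unify(a1, a2)               # 2. attributes must unify
    if merged is None or s1 & s2:        # 3. sources via different links
        return None
    return ((i1, j2), merged, s1 | s2)

det  = ((2, 2), {"num": "sg", "def": "+"}, {"spec-link"})
nbar = ((3, 3), {"num": "sg"}, {"head-link"})
print(combine(det, nbar))   # a larger item spanning words 2..3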
The input sentence is parsed in the following steps.
Step 1: Lexical Look-up: Retrieve the lexical entries for all the words
in the sentence and create a lexical item for each word sense. A lexical
item is a triple: <[i,j], avself, avcomp>, where [i,j] is an interval denoting
the position of the word in the sentence; avself is the attribute values of
the word sense; and avcomp is the attribute values of the complements
of the word sense.
Step 2: Message Passing: For each lexical item <[i,j], avself, avcomp>, create an initial message <[i,j], avself, ∅> and send this message to the grammar network node that represents the category or subcategory of the word sense. When the node receives the initial message, it may
forward the message to other nodes or it may combine the message with
other messages and send the resulting combination to other nodes. This
initiates a message passing process which stops when there are no more
messages to be passed around. At that point, the initial message for the
next lexical item is fed into the network.
Step 3: Build a Shared Parse Forest: When all lexical items have
been processed, a shared parse forest for the input sentence can be built
by tracing the origins of the messages at the highest node (CP or IP),
whose str component is the whole sentence. The parse forest consists of
the links of the grammar network that are traversed during the tracing
process. The structure of the parse forest is similar to Billot and Lang
(1989) and Tomita (1986), but extended to include attribute values.
PRINCIPAR is able to output the parse trees in both constituency
format and the dependency format. See Figures 7.3 and 7.4 for examples.
A constituency tree is represented as nested lists, which is a fairly
standard way of representing trees and needs no further explanation.
A dependency tree is denoted by a sequence of tuples, each of which
represents a word in the sentence. The format of a tuple is:
(word root cat position modifiee relationship)
where
word is a word in the sentence;
root is the root form of word; if root is "=", then word is in root form;
cat is the lexical category or subcategory of word; V:IP is the subcategory of verbs that take an IP as the complement; V:[NP] is the subcategory of verbs that take an optional NP as a complement;
modifiee is the word that word modifies;
position indicates the position of modifiee relative to word. It can take one of the following values: {<, >, <<, >>, <<<, ..., *}, where < (or >) means that the modifiee of word is the first occurrence to the left (or right) of word; << (or >>) means modifiee is the second occurrence to the left (or right) of word. If position is `*', then the word is the head of the sentence;
relationship is the type of the dependency relationship between modifiee and word, such as subj (subject), adjn (adjunct), comp1 (first complement), spec (specifier), etc.
For example, in the dependency tree of Figure 7.4, "may" is the root of the sentence; "abbreviations" is a modifier (subject) of "may" and "Certain" is a modifier (adjunct) of "abbreviations."
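
The tuple format lends itself to simple downstream processing. The Python sketch below (ours, not part of PRINCIPAR) reads tuples like those in Figure 7.4 and extracts (word, head, relation) triples; root forms, categories and position markers are ignored for simplicity.

# A minimal sketch: turn PRINCIPAR-style dependency tuples into
# (word, head, relation) triples.  Non-dependency lines (punctuation,
# conjunctions) are skipped.
def read_dependencies(output):
    triples = []
    for line in output.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue
        fields = line[1:-1].split()
        if len(fields) >= 6:
            word, _root, _cat, _position, modifiee, relation = fields[:6]
            triples.append((word, modifiee, relation))
        elif len(fields) == 4 and fields[3] == "*":
            triples.append((fields[0], None, "head"))
    return triples

sample = """(Certain ~ A_[CP] < abbreviations adjunct)
(abbreviations abbreviation N < may subj)
(may ~ I *)
(work ~ V_[NP] > may pred)"""
print(read_dependencies(sample))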

(S
(CP (Cbar (IP
(NP (Nbar
(AP (Abar (A (A_[CP] Certain))))
(N abbreviations)))
(Ibar
(Aux may)
(VP (Vbar
(V (V_[NP] work))
(PP
(PP (Pbar (P
(P at)
(NP (Nbar
(AP (Abar (A one)))
(N prompt))))))
but
(PP
(AP (Abar (A not)))
(Pbar (P
(P at)
(NP (Nbar (N another)))))))))))))
.)

Figure 7.3: The constituency output given by PRINCIPAR for the sentence `Certain abbreviations may work at one prompt but not at another.'.

(
(Certain ~ A_[CP] < abbreviations adjunct)
(abbreviations abbreviation N < may subj)
(may ~ I *)
(work ~ V_[NP] > may pred)
(at ~ P_ > work adjunct)
(one ~ A < prompt adjunct)
(prompt ~ N > at comp)
(but )
(not ~ A < at spec)
(at ~ P_ > at conj)
(another ~ N > at comp)
(. )
)

Figure 7.4: The dependency output given by PRINCIPAR for the sentence `Certain abbreviations may work at one prompt but not at another.'.

7.2.2 Implementation
PRINCIPAR has been implemented in C++ and runs on Unix and MS
Windows. There are about 37,000 lines of C++ code in total. The
composition of the code is as follows:
Utilities (23k lines): The utility classes include
- Container classes, such as lists, hash tables, and associations,
- Discrete structures, such as BitSet, Partially Ordered Set, and Graph,
- Attribute Value Vectors and various types of attribute values,
- a LISP-like script language interpreter that allows functions to be defined in C++ and called in LISP-like expressions,
- Lexicon and lexical retrieval,
- Lexical analysis: morphological analyzer, pattern matching on lexical items.
Abduction (5k lines): The message passing algorithm employed in PRINCIPAR is an instantiation of a generic message passing algorithm for abduction, which takes as inputs a set or sequence of nodes in a network and returns a minimal connection between the nodes in a network representing the domain knowledge (Lin, 1992). The application of abduction in different domains amounts to different instantiations of local and percolation constraints. The same algorithm has also been used in plan recognition (Lin & Goebel, 1991) and causal diagnosis (Lin & Goebel, 1990).
Principle-based Parsing (5k lines): The language-independent components account for 4k lines, with the rest being English-specific.
Graphical User Interface (4k lines): This is an optional component.
The GUI on X-Windows was implemented in InterViews.
The grammar network for English consists of 38 nodes and 73 links,
excluding nodes that represent subcategories (e.g., V:NP, V:PP) and the
links that are adjacent to them. These nodes and links are added to the
network dynamically according to the results of lexical retrieval.
The lexicon consists of close to 110,000 root entries. The lexical en-
tries come from a variety of sources, such as Oxford Advanced Learner's
Dictionary (OALD), Collins English Dictionary (CED), proper name
lists from the Consortium of Lexical Research as well as hand-crafted
entries.

Characteristic A B C D E F G
PRINCIPAR yes yes yes yes yes yes yes
Table 7.1: Linguistic characteristics which can be detected by
PRINCIPAR. See Table 7.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 7.2: Letter codes used in Tables 7.1, 7.5.1, 7.5.2 and 7.5.3.

7.3 Parser Evaluation Criteria


We adopted the methodology proposed in Lin (1995) to evaluate the 60
selected sentences out of the 600 sentence software manual corpus.
The key parses for the sentences are obtained by manually correcting
PRINCIPAR outputs for the sentences. The evaluation program is used
to identify the differences between the answers and keys. The differences
are shown by inserting the correct dependency relationships after an
incorrect dependency relationship in the answer. Figure 7.5 shows part
of a sample output given by the evaluation program.
The output indicates that there are two errors in the parse. The first one is that the clause headed by "using" is parsed as a relative clause modifying "suggestions." In the key parse, it is an adjunct clause modifying "proceed." The second error is the attachment of the preposition "with."
The percentage numbers reported in this paper are obtained by manually examining the output and classifying the errors into the categories in Table 7.2. For the above sentence, the first error is an incorrect attachment of the clause "using ..." and cannot be classified into any of the categories in Table 7.2. The second error is classified as an F-error (incorrect prepositional phrase attachment).
The error categories are defined as follows:
A. Verbs recognised: the percentage of verbs in the key that are recognized as verbs by the parser.

(
(The ~ Det < following spec)
(following ~ N < are subj)
(are be I *)
(suggestions suggestion N_[CP] > are pred)
(on ~ P_ > suggestions adjunct)
(how ~ A < to spec)
(to ~ I > on comp)
(proceed ~ V_[CP] > to pred)
(when ~ A < using adjunct)
(using use V_NP > suggestions rel)
; (using use V_NP > proceed adjunct)
(the ~ Det < Translator spec)
(Translator ~ N < Workbench spec)
('s )
(Workbench ~ N > using comp1)
(together ~ A < with spec)
(with ~ P_ > Workbench adjunct)
; (with ~ P_ > using adjunct)
(Word ~ U < 6.0 pre-mod)
(for ~ U < 6.0 pre-mod)
(Windows ~ U < 6.0 pre-mod)
(6.0 "Word for Windows 6.0" N > with comp)
(. )
)

Figure 7.5: Some sample output of the program used to evaluate PRINCIPAR analyses.

B. Nouns recognised: the percentage of nouns in the key that are recognized as nouns by the parser.
C. Compounds recognised: the percentage of compound nouns in the key that are recognized as a noun phrase by the parser.
D. Phrase Boundaries recognised: the percentage of words that are assigned the same head word in the answer and in the key.
E. Predicate-Argument Relations identified: the percentage of subject or complement relations in the key that are correctly recognized.
F. Prepositional Phrases attached: the percentage of PPs correctly attached.

(
(choose ~ V_IP *)
(the ~ Det < menu spec)
(Options... ~ N < menu noun-noun)
(menu ~ N < from subj)
(item ~ A < from spec)
(from ~ P_ > choose comp1)
(Word ~ N < menu spec)
('s )
(Tools ~ N < menu noun-noun)
(menu ~ N > from comp)
)

Figure 7.6: An example parse showing how a single error may belong to several error categories.

G. Coordination/Gapping analysed: the percentage of 'conj' relationships correctly recognized.
A single error may belong to zero or more of the above categories. This is illustrated in the parse of Figure 7.6, where the word "item" is treated as an adverb instead of a noun. It causes an error in category B ("item" is not recognized as a noun), category C ("menu item" is not recognized as a compound noun), and category E ("menu item" is not recognized as the complement of "choose").
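
The comparison of answer and key parses can be pictured with the Python sketch below (our own illustration, not the evaluation program used here); dependencies are reduced to a mapping from each word to its (head, relation) pair, and the per-category classification of errors is left out.

# A minimal sketch: score an answer parse against a key parse by
# counting words whose (head, relation) pair matches the key.
def score(answer, key):
    # answer, key: dicts mapping word -> (head, relation)
    correct = sum(1 for w, hr in key.items() if answer.get(w) == hr)
    return 100.0 * correct / len(key)

key    = {"using": ("proceed", "adjunct"), "with": ("using", "adjunct"),
          "suggestions": ("are", "pred")}
answer = {"using": ("suggestions", "rel"), "with": ("Workbench", "adjunct"),
          "suggestions": ("are", "pred")}
print(score(answer, key))   # 33.3...: only 'suggestions' agrees with the key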

7.4 Analysis I: Original Grammar, Original Vocabulary
7.4.1 Setting-Up the Experiment
In Analysis I, the sentences were parsed with the original grammar and
lexicon. The input sentences are stored one per line. PRINCIPAR con-
tains a sentence boundary recognizer. However, some of the sentences do
not have sentence ending punctuation marks. Therefore, an empty line,
which is one of the sentence boundaries recognized by PRINCIPAR, is in-
serted between every pair of sentences. No other manual pre-processing
was performed.
7.4.2 Results
When PRINCIPAR fails to parse a complete sentence, it retrieves the
largest parse fragments that cover the whole sentence. The fragments

Number Accept Reject % Accept % Reject
Dynix 20 20 00 100 00
Lotus 20 20 00 100 00
Trados 20 20 00 100 00
Total 60 60 00 100 00

Table 7.3.1: Phase I acceptance and rejection rates for PRINCIPAR.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 07.6 0.38 N.A.
Lotus 06.0 0.30 N.A.
Trados 08.8 0.44 N.A.
Total 22.4 0.37 N.A.

Table 7.4.1: Phase I parse times for PRINCIPAR using a Pentium-90 PC with 24MB. The first column gives the total time to attempt a parse of each sentence.
Char. A B C D E F G Avg.
Dynix 98% 96% 96% 90% 90% 96% 92% 94%
Lotus 93% 96% 91% 88% 84% 88% 83% 89%
Trados 98% 97% 90% 83% 90% 86% 67% 87%
Average 96% 96% 92% 87% 88% 90% 81% 90%

Table 7.5.1: Phase I Analysis of the ability of PRINCIPAR to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 7.2.

are concatenated together to form the top-level constituents of the parse tree. In constituency trees, the parse fragments are child nodes of a dummy root node labelled S. For example, the following parse tree means that the parser found a Complementizer Phrase (CP) fragment that spans the front part of the sentence and a Verb Phrase (VP) fragment that spans the rest of the sentence, but failed to combine them together:

(S
(CP
...
)
(VP
...
))

When dependency format is used, the fragmented trees are indicated by multiple root nodes. For example:
To perform an accelerated search, follow these instructions:
The reason for the failure to parse the whole sentence is that in the grammar network purpose clauses are modifiers of IPs rather than VPs, whereas imperative sentences are analysed as VPs. Therefore, the parser would fail to attach a purpose clause to an imperative sentence.
With the mechanism to retrieve parse fragments, the parser always
produces an analysis for any input sentence. Therefore, the acceptance
rates in Table 7.3.1 are all 100%.
Table 7.4.1 reports the timing data in the experiment. The times
reported are in seconds on a Pentium 90MHz PC with 24M memory
running Linux.
The percentages of various linguistic characteristics are reported in
Table 7.5.1.

7.4.3 Causes of Errors


The causes of parser errors can be classified into the following categories:
- Insufficient lexical coverage was responsible for most of the errors the parser made, even though the lexicon used by the parser is fairly large. For example, the lexical entry for "make" did not include the use of the word as in "make it available." The lack of domain specific compound nouns in the lexicon is responsible for many misanalyses of compound nouns, such as "keyword search" and "insertion point," in which the words "search" and "point" were treated as verbs.
- Insufficient grammatical coverage is the cause of several parser failures:
  - wh-clauses as the complements of prepositions, for example:
    The following are suggestions on how to proceed when using the Translator's Workbench together with Word for Windows 6.0.

  - adverbs between verbs and their complements, for example:
    The TWB1 button, also labeled Translate Until Next Fuzzy Match, tells the Workbench to do precisely this.
  - free relative clauses, for example:
    Another important category of non-textual data is what is referred to as "hidden text."
  - incomplete coverage of appositives, for example:
    If the Workbench cannot find any fuzzy match, it will display a corresponding message ("No match") in the lower right corner of its status bar and you will be presented with an empty yellow target field.
    The grammar only included appositives denoted by commas, but not by parentheses.

7.5 Analysis II: Original Grammar, Additional Vocabulary
In Analysis II, we augmented the system's lexicon with about 250 phrasal words in the corpus. We also made corrections to some of the entries in the lexicon. For example, in our original lexicon, the word "date" is either a noun or a transitive verb. We added in the noun entry that the word "date" may take a clause as an optional complement. This modification allows the sentence

Permanently inserts the date the current document was created

to be parsed correctly. Altogether there are 523 new or modified entries in Analysis II. Approximately one person week was spent on Analysis II.
The timing data for Analysis II is shown in Table 7.4.2. The parse time for Analysis II is a little better than Analysis I, even though a larger lexicon was used. The reason is that some of the compound nouns in the additional lexical entries contain an attribute +phrase. For example, the entry for "author authority search" is:
(author authority search
(syn (N +phrase))
)

which means that it is a common noun (N) with the +phrase attribute. There is a lexical rule in PRINCIPAR such that if a lexical item contains the attribute +phrase, then all lexical items that span a smaller surface string are removed. When the phrase "author authority search"

appears in a sentence, this lexical rule will remove the lexical items representing the meanings of the individual words in the phrase: "author," "authority," and "search." As far as the parser is concerned, the phrase becomes one word with a single meaning, instead of three words each with multiple meanings. This explains why the parser is slightly faster with the additional lexicon.
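
The effect of the +phrase rule can be illustrated with the following Python sketch (our own simplification, not PRINCIPAR's lexical analyzer); lexical items are represented as spans over word positions with a single flag standing in for the +phrase attribute.

# A minimal sketch of the +phrase lexical rule: if an item carries the
# +phrase attribute, drop every item whose span lies strictly inside it.
def apply_phrase_rule(items):
    # items: list of (start, end, entry, has_phrase_attr)
    keep = []
    for it in items:
        s, e, _, _ = it
        covered = any(p and ps <= s and e <= pe and (ps, pe) != (s, e)
                      for ps, pe, _, p in items)
        if not covered:
            keep.append(it)
    return keep

items = [(0, 2, "author authority search/N", True),
         (0, 0, "author/N", False), (0, 0, "author/V", False),
         (1, 1, "authority/N", False),
         (2, 2, "search/N", False), (2, 2, "search/V", False)]
for it in apply_phrase_rule(items):
    print(it)   # only the three-word compound noun survives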
Table 7.5.2 shows the performance measures for the linguistic characteristics. All of the measures have improved over Analysis I. The most visible improvement is in compound noun recognition, which achieved 100% correct for all three sets of test sentences. The improvements in other categories are mostly consequences of the better treatment of compound nouns. For example, in "author authority search," both "author" and "search" can be either a noun or a verb. When the lexicon contains an entry for the compound noun "author authority search," the lexical items for the verb meanings of "author" and "search" will be removed. Therefore, it is not possible for the parser to mistakenly take them as verbs.

The improvement on compound nouns is much larger for the Lotus (91% → 100%) and Trados (90% → 100%) sentences than for the Dynix (96% → 100%) sentences. Correspondingly, the overall Analysis II improvement for the Lotus (89% → 97%) and Trados (87% → 91%) sentences is much more significant than for the Dynix (94% → 95%) sentences.

7.6 Analysis III: Altered Grammar, Additional Vocabulary

Analysis III was not performed.

7.7 Converting Parse Tree to Dependency Notation

Since the parser can output dependency trees directly, no conversion was necessary.

7.8 Summary of Findings


Our experiments show that PRINCIPAR is very efficient. Acceptable speed was achieved on low end workstations. The syntactic coverage of PRINCIPAR is also found to be adequate, especially if the lexicon is augmented with domain specific vocabularies. The additional lexical

Number Accept Reject % Accept % Reject
Dynix 20 20 00 100 00
Lotus 20 20 00 100 00
Trados 20 20 00 100 00
Total 60 60 00 100 00

Table 7.3.2: Phase II acceptance and rejection rates for PRINCIPAR.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 07.4 0.37 N.A.
Lotus 06.4 0.32 N.A.
Trados 07.8 0.39 N.A.
Total 21.6 0.36 N.A.

Table 7.4.2: Phase II parse times for PRINCIPAR using a Pentium-90 PC with 24MB. The first column gives the total time to attempt a parse of each sentence.
Char. A B C D E F G Avg.
Dynix 098% 099% 100% 91% 94% 96% 092% 96%
Lotus 100% 100% 100% 95% 92% 92% 100% 97%
Trados 098% 099% 100% 90% 97% 89% 067% 91%
Average 099% 099% 100% 92% 94% 92% 086% 95%

Table 7.5.2: Phase II Analysis of the ability of PRINCIPAR to recognise certain linguistic characteristics in an utterance. For example the column marked `A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 7.2.

entries not only reduce the error rate, but also improve the efficiency slightly.

7.9 References
Billot, S., & Lang, B. (1989). The structure of shared forests in ambigu-
ous parsing. Proceedings of ACL-89, Vancouver, Canada, June 1989,
143-151.
Chomsky, N. (1981). Lectures on Government and Binding. Cinnamin-
son, NJ: Foris Publications.
Chomsky, N. (1986). Barriers. Cambridge, MA: MIT Press, Linguistic
Inquiry Monographs.
Lin, D. (1992). Obvious Abduction. Ph.D. thesis, Department of Com-
puting Science, University of Alberta, Edmonton, Alberta, Canada.
Lin, D. (1993). Principle-based parsing without overgeneration. Pro-
ceedings of ACL-93, Columbus, Ohio, 112-120.
Lin, D. (1994). Principar - an efficient, broad-coverage, principle-based parser. Proceedings of COLING-94, Kyoto, Japan, 482-488.
Lin, D., & Goebel, R. (1990). A minimal connection model of abductive
diagnostic reasoning. Proceedings of the 1990 IEEE Conference on
Artificial Intelligence Applications, Santa Barbara, California, 16-22.
Lin, D., & Goebel, R. (1991). A message passing algorithm for plan
recognition. Proceedings of IJCAI-91, Sydney, Australia, 280-285.
Lin, D., & Goebel, R. (1993). Context-free grammar parsing by message
passing. Proceedings of the First Conference of the Pacific Association
for Computational Linguistics, Vancouver, British Columbia, 203-211.
Tomita, M. (1986). Efficient Parsing for Natural Language. Norwell,
Massachusetts: Kluwer Academic Publishers.
8
Using the Robust Alvey Natural
Language Toolkit to Analyse a
Software Manual Corpus
Miles Osborne1
University of York

8.1 Introduction
Within the last decade there has been considerable research devoted
to the problem of parsing unrestricted natural language (e.g. Alshawi,
1992; Black, Garside & Leech, 1993; Magerman, 1994). By unrestricted,
we mean language that is in everyday use. Examples of unrestricted lan-
guage can be found in such places as requirement documents, newspaper
reports, or software manuals. If unrestricted language can be success-
fully parsed, then we will be a lot closer to achieving long-term goals
in machine translation, document summarising, or information extrac-
tion. Research on parsing unrestricted language encompasses a variety
of approaches, from those that are statistically-based to those that are
logico-syntactically-based. Each approach has its own strengths and
weaknesses. For example, statistically-based systems are usually able to
assign a parse to each string that they encounter, irrespective of how
badly formed that string might be. However, the price to pay for this
coverage is that such parses are often shallow. More traditional logico-
syntactic approaches usually do reveal, in greater detail, the syntactic
structure of the sentence, but often fail to account for all of the sentences
that they ought to.
1 Address: Computer Laboratory, New Museums Site, Pembroke Street, Cam-
bridge CB2 3QG, UK. Tel: +44 1223 334617, Fax: +44 1223 334678, Email:
Miles.Osborne@cl.cam.ac.uk. The author would like to thank the Intelligent Systems
Group at York and the SIFT Project LRE-62030 for providing travel assistance to
enable the author to participate in the IPSM'95 workshop. This work was supported
by the DTI/SERC Proteus Project IED4/1/9304.

The choice regarding which approach to take depends upon the do-
main. Here at York, as part of the Proteus Project (which is concerned
with dealing with changing requirements in safety-critical systems), we
are investigating using natural language processing to support the task
of creating clear requirements documents and their translation into a for-
mal representation (Duffy, MacNish, McDermid & Morris, 1995; Burns, Duffy, MacNish, McDermid & Osborne, 1995). By `clear', we mean,
among other things, that syntactic and semantic ambiguities are de-
tected and resolved. Creating clear documents is only part of the task.
We are also concerned that the user, which in our case will be an engi-
neer, can express his or her requirements in as natural a way as possible.
To carry out our task requires us to parse (relatively) unrestricted nat-
ural language. Since we need to reject strings as being ill-formed (these
reduce the clarity of a document), and map those sentences that are
parsed into logical forms, we have decided to adopt a syntax-orientated
approach when dealing with controlled languages. We have therefore
decided to use the Alvey Natural Language Toolkit (ANLT) (Grover,
Briscoe, Carroll & Boguraev, 1993). The Toolkit makes a clear distinc-
tion between sentences and strings, and has a semantic component. It
is therefore an obvious choice as the basis of our Controlled Language
system. Choosing to use the Toolkit confronts us with the practicali-
ties of parsing unrestricted language: we have to address the problem
of making a brittle system robust. In this paper, within the context of
parsing software manuals, we present a series of experiments showing
how various extensions to ANLT help overcome the brittleness problem.
We refer to the extended versions of the system as the Robust Alvey
Natural Language Toolkit (RANLT). The aim is to use the experience
of parsing software manuals to boost the robustness of our requirement
document processing system.

Section 8.2 briefly presents ANLT, and the modifications we made to it to produce RANLT. These modifications enable RANLT to deal with unknown words, to deal with unparsable sentences, and to have
a higher throughput than does the original ANLT. Section 8.3 outlines
the criteria which had to be applied when evaluating RANLT within
the context of the IPSM workshop. Sections 8.4 to 8.6 then present the
results of this evaluation, which was restricted to Analysis I. The issue
of converting a RANLT parse tree into dependency form is addressed in
Section 8.7. Finally, Section 8.8 concludes the paper by discussing these
results, and pointing the way forward for parsing unrestricted naturally
occurring language.

Characteristic A B C D E F G
RANLT yes yes yes yes no no no

Table 8.1: Linguistic characteristics which can be detected by RANLT. See Table 8.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 8.2: Letter codes used in Tables 8.1 and 8.5.1.

8.2 Description of Parsing System


Here, we describe the basic ANLT. We then go on to present a series of modifications we have made to it.

8.2.1 The Basic ANLT


ANLT was developed in three parallel projects, at the Universities of
Cambridge, Edinburgh, and Lancaster, from 1987 to 1993.2 It consists
of:
- A wide-covering grammar and associated semantic component (for British English).
- A morphological analyser and associated lexicon.
- An optimised chart parser.3
The grammar is written in a meta-grammatical formalism, based upon
Generalised Phrase Structure Grammar (Gazdar, Klein, Pullum & Sag,
1985), which is then automatically compiled into an object grammar.
The object grammar is expressed in a syntactic variant of the Definite Clause Grammar formalism. An example (object) rule is:
2 See John Carroll's thesis for an in-depth description of the ANLT system and the
problem of parsing unrestricted language (Carroll, 1993).
3 ANLT also contains a non-deterministic LR parser, designed to parse sentences
or NPs. Since we need to parse phrases of all categories, we did not use this parser.

S[-INV, +FIN, CONJ NULL, VFORM NOT, BEGAP -, PAST @12, PRD @13,
AUX @14, COMP NORM, SLASH @19, PRO @92, WH @39, UB @40,
EVER @41, COORD -, AGR N2[+NOM, NFORM @20, PER @21,
PLU @22, COUNT @23], UDC -, ELLIP -] -->
N2[-PRD, -POSS, +NOM, -NEG, +SPEC, CONJ NULL, BEGAP -,
SLASH NOSLASH, NFORM @20, PER @21, PLU @22, COUNT @23,
PN @, PRO @, PROTYPE @, PART -, AFORM @, DEF @, ADV @,
NUM @, WH @39, UB @40, EVER @41, COORD @, EFL @,
QFEAT NO, DEMON @, COADV @, KIND @],
VP[+FIN, CONJ NULL, VFORM NOT, H +, BEGAP -, PAST @12,
PRD @13, AUX @14, NEG @, SLASH @19, PRO @92, COORD -,
AGR N2[+NOM, NFORM @20, PER @21, PLU @22, COUNT @23],
ELLIP -]

This can be paraphrased as the rule S → NP VP. Within the object rules, each category consists of a list of feature-value pairs. Feature `values' of the form @x represent variables that might be instantiated during parsing. Within the grammar there are 84 features.
The meta-grammar, when compiled, becomes 782 object rules. As an indication of the coverage of the grammar, Taylor et al. have used the grammar to parse 96.8% of 10,000 noun phrases taken from the Lancaster-Oslo-Bergen Corpus (Taylor, Grover & Briscoe, 1989).
Known oversights in the grammar's coverage include no treatment of
parenthetical constructions, a limited account of punctuation, and a va-
riety of inadequacies relating to topics such as co-ordination, gapping,
complex ellipsis, and so on.
The grammar assigns deep, relatively detailed parses to sentences
that it generates. As an example of the parses assigned to sentences,
ANLT, when parsing the sentence:
who is the abbot with?
produces the following result:

90 Parse>> who is the abbot with

2810 msec CPU, 3000 msec elapsed


556 edges generated
1 parse

((who) (is (the ((abbot))) (((with (E))))))

This has the following parse tree:



(S/NP_UDC2 (N2+/PRO who)


(VP/BE is
(N2+/DET1a the
(N2 (N1/N abbot)))
(PRD3 (P2/P1 (P1/NPa with
(NP E))))))

Here, the node labelled S/NP_UDC2 refers to a sentence with a preposed
NP; the node labelled N2+/PRO refers to a pronoun; the node labelled
VP/BE refers to a "be" followed by a predicative argument; the node
labelled N2+/DET1a is an NP; the nodes labelled N2, N1 and N are nouns
of bar levels two, one and zero; the node labelled PRD3 is a PP; the nodes
labelled P2 and P1 are prepositional categories of bar levels two and one;
finally, the node labelled NP is an NP.
The morphological analyser is capable of recognising words as being
morphological variants of a base lexeme. For example, the lexicon con-
tains the lexeme abdicate. When presented with the word abdicates, the
analyser recognises this as the third person singular form of the verb:
104 Parse>> who abdicates in the abbey
--- abdicates: 170 msec CPU

2860 msec CPU, 3000 msec elapsed


392 edges generated
1 parse

((who) (((abdicate +s) (E)) (((in (the ((abbey))))))))

Using a morphological analyser therefore reduces the size of the lexicon.
That is, the lexicon does not have to contain explicit entries for each
morphological variation of a word. ANLT contains a lexicon of about
40,000 lexemes, which were semi-automatically derived from the
Longman Dictionary of Contemporary English. Note that the analyser
only deals with sequences of letters, possibly containing hyphens, and
certain end-of-sentence markers. There is no treatment of punctuation
or Arabic numbers.
The chart parser works in a bottom-up manner and computes all
possible parses for sentences within the language defined by the object
grammar. Bottom-up chart parsing is a well-known algorithm and does
not need to be discussed here. One novelty of ANLT's parser is a mech-
anism to help reduce the time and space costs associated with parsing
ambiguous sentences. Naive all-paths parsers, which compute distinct
parse trees for each distinct syntactic analysis of a sentence, are quickly
swamped by the vast numbers of parses that wide-coverage grammars
assign to naturally occurring sentences. Hence, ANLT's parser has a
packing mechanism which stores local trees only once. With this mech-
anism, the space and time requirements associated with parsing highly
ambiguous sentences are drastically reduced. Consequently, the parser
is capable of parsing relatively long sentences, containing thousands of
analyses.

8.2.2 The Robust ANLT


In order to deal with unrestricted language, several problems need to be
addressed:
 Lexical incompleteness: what do we do when the system encounters
an unknown word? The basic ANLT simply reports an error when
it encounters an unknown word in a sentence.
 Parse selection: how do we reduce the number of spurious parses?
Even with a packing mechanism, sentences with an extremely large
number of parses will cause ANLT to crash.
 Acceptable turnaround time: how do we ensure that the parser
does not spend an inordinate amount of time searching for a parse?
ANLT has no concept of timing out and sometimes takes thirty
minutes or more to find a parse.
 Ill-formedness: what do we do when the system encounters some
sentence not within the language generated by the grammar? As
with unknown words, ANLT will give up when it cannot generate
some sentence.
Solving these problems helps reduce the brittleness of ANLT and makes
it more able to deal with unrestricted language. Hence, we call such
a system robust. Robustness is a vague term, but here we mean that
the system operates within space bounds associated with real machines,
and fails gracefully when encountering an unknown word or unparsable
sentence. Note that we still wish our system to reject hopelessly ill-
formed sentences, and wish the system to return one or more parses
for sentences that are ill-formed, but still intelligible. Hence, we do not
intend robustness to imply always returning a parse for some sentence.
For the problem of lexical incompleteness, we have coupled ANLT
with a publicly available stochastic tagger (Cutting, Kupiec, Pedersen
& Sibun, 1992). A stochastic tagger is a program that, after being
trained on a sequence of tagged sentences, assigns tags to unseen, un-
tagged sentences. For example, if in the training set the sequence "the
man" was tagged as a determiner followed by a noun, then in the unseen
text, a similar sequence, such as "the boy", would receive the same tag
sequence. The tagger acts as a fall-back mechanism for cases when the
system encounters some unknown word. For such cases, the system tags
the sentence, and then looks up the tag of the word in an associated
tag-ANLT conversion lexicon. That is, each tag is associated with a set
of ANLT lexemes, which are fed to the parser in lieu of an entry for the
word itself. We constructed the tag-ANLT conversion lexicon as follows.
A chapter from the author's thesis was tagged. Then, for each tag in
the tagset used by the tagger, all the words in the chapter receiving that
tag were collected together. These words were then looked up in the
ANLT lexicon. The looked-up word senses were then paired with the
tag. These pairings then formed the tag-ANLT lexicon. For example,
ANLT contains the rules and lexical items to parse the sentence:
I am in the abbey
Suppose now that it did not contain an entry for the word car. Or-
dinarily, ANLT would reject the sentence:
I am in the car
on the grounds of lexical incompleteness. However, if we allow ANLT to
use a stochastic tagger and tag the sentence, then the word "car" would
be tagged as NN (which is a singular noun). Within the tag-ANLT
lexicon, the tag NN has the entries:
(nn "" (N (COUNT -) (GROUP +)) nn ())
(nn "" (N (COUNT +)) nn ())

That is, the tag NN is either a countable or an uncountable noun.


The entries also contain as semantics the logical constant nn. With these
entries, ANLT can then parse the sentence as desired.
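This fallback amounts to a two-level lookup: the main lexicon is consulted first, and only if the word is absent are the entries associated with its tag substituted. The following Python sketch illustrates that control flow only; the dictionary contents and entry format are simplified stand-ins, not the actual ANLT data structures.

# Minimal sketch of the tag-based fallback described above. The two
# dictionaries are toy stand-ins for the ANLT lexicon and the tag-ANLT
# conversion lexicon; only the lookup order is the point.

ANLT_LEXICON = {
    "the":   [("the", "DET", {})],
    "in":    [("in", "P", {})],
    "abbey": [("abbey", "N", {"COUNT": "+"})],
}

TAG_LEXICON = {
    # As with the NN entries shown above: a countable or uncountable noun.
    "NN": [("nn", "N", {"COUNT": "-", "GROUP": "+"}),
           ("nn", "N", {"COUNT": "+"})],
}

def lexical_entries(word, tag):
    """Return entries for a word, falling back to its tag when unknown."""
    if word in ANLT_LEXICON:
        return ANLT_LEXICON[word]
    return TAG_LEXICON.get(tag, [])

# "car" is not in the lexicon, so the NN entries are used in its place.
print(lexical_entries("car", "NN"))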
ANLT does have a parse selection mechanism (Briscoe & Carroll,
1991). Parse selection mechanisms filter out implausible parses for a
given sentence. They therefore reduce the chance of the parser being
swamped by implausible parses. Unfortunately, this mechanism is un-
available for research use. Hence, because we have not yet implemented
this device, we force the parser to halt when it has produced the first n
(n = 1) packed parse trees for the sentence in question. Note that this
is not the same as saying halt after producing the first n parse trees.
We place a resource bound upon the parser: it will halt parsing when
m edges have been constructed. From a practical perspective, this keeps
the parser from growing too large (and so thrashing, or eventually
crashing). From a slightly more theoretical perspective, this represents the
idea that if a sentence has a parse, then that parse can be found quickly.
Otherwise, the extra effort is simply wasted work. Previous workers
have used resource bounds in a similar way (Magerman & Weir, 1992;
Osborne & Bridge, 1994). Note that our use of resource bounds only
makes sense for parsers, such as ANLT's chart parser, which have a form
of `best-first' search.
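The interaction of the two halting conditions can be pictured as a small driver loop over the stream of edges a chart parser builds. The sketch below is an illustration written over an abstract edge stream rather than ANLT's parser itself; the 25,000-edge budget matches the figure used in the experiments reported in Section 8.3.

# Sketch of a resource-bounded parse: stop as soon as enough packed
# analyses spanning the whole input are found, or when the edge budget
# is exhausted. The edge stream stands in for a real chart parser.

def bounded_parse(edge_stream, spans_input, max_edges=25000, max_parses=1):
    found = []
    for n, edge in enumerate(edge_stream, start=1):
        if spans_input(edge):
            found.append(edge)
            if len(found) >= max_parses:
                return found          # accept: first packed parse found
        if n >= max_edges:
            return []                 # reject: resource bound exceeded
    return []                         # reject: chart exhausted

# Toy usage: edges are labelled spans (label, start, end) over five words.
edges = iter([("N2", 0, 2), ("VP", 2, 5), ("S", 0, 5)])
print(bounded_parse(edges, lambda e: e[1] == 0 and e[2] == 5))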
Finally, we have augmented ANLT with a form of error-correction.
The idea is that if ANLT encounters a string that it cannot parse, it will
try to relax some grammatical constraints and see if the string can then
be parsed. For example, if the system encounters a number disagreement,
then it will try to relax the constructions that enforce number agreement.
Relaxation is not new (see, e.g. Hayes, 1981; Weischedel, 1983; Fain,
Carbonell, Hayes & Minton, 1985; Douglas, 1995; Vogel & Cooper, 1995)
and has been used to deal with ill-formed sentences, and to generate error
reports for such sentences. Within a feature-based formalism, one way
to achieve this relaxation is to use a form of default unification (Shieber,
1986; Bouma, 1992).
Default unification can be thought of as a way of overwriting
inconsistent information within a feature structure, such that ordinary
unification will succeed. For example, the unification of

  [N +, V -, Past +]  ⊔  [N +, V -, Past -]

will ordinarily fail. However, if, by default, the Past feature can be
overwritten, then default unification will succeed:

  [N +, V -, Past +]  ⊔!  [N +, V -, Past -]  =  [N +, V -]

The default unification of two feature structures A and B is written as
A ⊔! B. In our implementation, we have a set of features that can be
overwritten (the defaults). The default unification of two features (with
the same name, but potentially different instantiations) that are within
this set is a variable. Default unification therefore succeeds for cases
when certain (default) features are inconsistent, but fails for cases when
the inconsistency lies within other features. For example, in the previous
example, we have not allowed the N feature to be overwritten, and hence
inconsistent values of N will lead to an ordinary unification failure as
well as a default unification failure.
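A flat approximation of this behaviour is easy to state over feature dictionaries. In the sketch below, feature structures are plain maps from feature names to atomic values and the relaxable set is given explicitly; the real ANLT categories use term unification over fixed-arity structures with variables, so this is only an illustration of the overwriting idea.

# Sketch of default unification over flat feature structures.
# A clash on a feature in RELAXABLE is overwritten with a variable
# (here None); a clash on any other feature makes unification fail.

RELAXABLE = {"Past"}          # the set of default (relaxable) features

def default_unify(a, b, relaxable=RELAXABLE):
    result = dict(a)
    for feature, value in b.items():
        if feature not in result or result[feature] == value:
            result[feature] = value
        elif feature in relaxable:
            result[feature] = None        # inconsistent default: variable
        else:
            return None                   # inconsistent non-default: fail
    return result

print(default_unify({"N": "+", "V": "-", "Past": "+"},
                    {"N": "+", "V": "-", "Past": "-"}))
# -> {'N': '+', 'V': '-', 'Past': None}
print(default_unify({"N": "+"}, {"N": "-"}))   # -> None (N not relaxable)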
We use default unification to model the set of features that we allow
to be relaxed when, under ordinary circumstances, a sentence cannot be
parsed. Sentences might not be parsed either due to undergeneration,
or due to the sentence being ungrammatical. Hence default unification
can be used to deal with both reasons for parsing failure.
After the parser fails to parse a sentence, that sentence is reparsed,
but this time relaxing a set of designated features. We then collect any
relaxed parses produced. Note that the actual choice of which features
to relax is crucial: select too many features and the parser will suffer
from a combinatorial explosion of spurious parses; select too few and the
parser will fail to find a relaxed parse that it ought to. Deciding upon
the features to relax is non-trivial within the GPSG-style framework we
use. This is because the features interact in complex ways. For example,
how does the N feature relate to the PFORM feature? The only answer
is somewhere within the 782 object rules. Also, the large number of
features (84) makes an automated search for the optimal set of features
to relax impractical. In the experiments, we empirically found a set
of features that, when relaxed, did not lead to an unacceptable number
of extra edges being generated. This set is therefore ad hoc.
Since we use term unification, the arity of the feature structures plays
a role in determining if two structures will unify. This is independent of
the set of features that can be relaxed. Hence, our implementation of
default unification will only work for feature structures that are broadly
similar: for example, a feature structure for a VP cannot be default
unified with (say) a feature structure for a determiner. Given that arity
matters within the grammar, and hence arity differences cannot be
construed as being accidental, using arity helps ensure that relaxation,
roughly speaking, preserves the basic structure of parse trees. That is,
even while relaxing, a VP remains a VP, and is not relaxed into a PP.
Relaxation is only part of the solution for unparsable sentences. For
some cases, the grammar will undergenerate; for other cases, the sen-
tence may contain extra words, or missing words. Mellish presents ap-
plicable work dealing with missing words and simple forms of ill-formed
constructs (Mellish, 1989). Osborne describes work dealing with un-
dergeneration (Osborne, 1994). Future work will extend ANLT with
components that deal with these aspects of processing unrestricted lan-
guage. We believe that our use of relaxation preserves the robustness
(within limits) approach used in this work: only certain sentences can
be relaxed; others are best rejected.

8.3 Parser Evaluation Criteria


For an indication of parsing time, we noted how long our system took
to parse the Trados sentences. Throughout, we used a Sparc 10, with 96
Mb of memory, and AKCL Lisp. We have not considered whether the system
returned the desired parse, given that we simply selected the first
parse produced. Also, we have not attempted any crossing-rate metrics
(Sundheim, 1991), given that we do not have a set of benchmark parses
for the sentences our system parsed. Throughout these experiments,
we used a resource bound of 25,000 edges and stopped parsing when 1
packed node had been found dominating the fragment being parsed.
While all three analyses were carried out on the original corpus of 600
sentences as reported at the workshop, practical difficulties connected
with the author moving from York to Aberdeen meant that only Analysis
I could be carried out on the subset of 60 test utterances used for this
volume.

8.4 Analysis I: Original Grammar, Original Vocabulary
8.4.1 Pre-Processing
As was previously stated in Section 8.2.1, ANLT does not deal, at the
morphological level, with Arabic numbers or most punctuation marks.
ANLT also does not deal, at the syntactic level, with parentheticals.
Hence, sentences need to be pre-processed prior to parsing. We therefore
wrote a Lex program to pre-process the raw sentences.
This program mapped Arabic numbers into the lexeme "number",
removed punctuation marks that did not terminate a sentence, and also
mapped other characters (such as >) into the lexeme "symbol". As the
morphological analyser is case-sensitive, we also mapped all upper-case
letters to lower-case letters. More controversially, the program splits the
raw sentences into fragments. Each fragment corresponds to a string of
words terminated by a punctuation mark. For example, the raw sentence
(taken from the Dynix corpus):
Depending on where you are on the system, you use different procedures
to start a search
would be pre-processed into the fragments:
depending on where you are on the system
you use different procedures to start a search
The reason for this is two-fold. Since ANLT ignores most punctua-
tion, punctuated sentences such as the one above would be automatically
rejected. Furthermore, some of the punctuated sentences are very long,
and in their raw state, would either lead to a very large number of
parses, or would cause the parser to thrash (or both). Given the lack of
a treatment of punctuation, the desire to reduce the sentence length, and
the idea that punctuation marks delimit phrasal boundaries (Osborne,
1995), we took the step of chopping the raw, punctuated sentences into
unpunctuated fragments. The advantages of chopping sentences are that
we do not need a detailed treatment of punctuation within the grammar,
and that we have a (reasonably) motivated way of segmenting punctu-
ated sentences (short sentences are easier to process and more likely to be
parsed). The disadvantages are that the fragments may not always
correspond to phrases; also, the task of joining the fragments back together
remains. There are several ways that one could integrate punctuation
and parsing. See Nunberg (1990), Briscoe (1994) and Jones (1994) for
discussions on issues relating to punctuation and parsing.
           Number  Accept  Reject  % Accept  % Reject
Dynix          36      27       9        75        25
Lotus          39      22      17        56        44
Trados         34      17      17        50        50
Total         109      66      43        61        39

Table 8.3.1: Phase I acceptance and rejection rates for RANLT.
Input sentences were divided into fragments separated by punctuation
and then analysed individually. It is for this reason that
the total number of utterances listed is 109 rather than 60.
         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix           4185            N.A.            N.A.
Lotus           4749            N.A.            N.A.
Trados          8378            N.A.            N.A.
Total          17312            N.A.            N.A.

Table 8.4.1: Phase I parse times for RANLT. The first column
gives the total time to attempt a parse of each sentence. As the
parser does not give times for individual sentences, the second
and third columns are left blank. Note, however, that RANLT is
generally quicker to reject a sentence than to accept it.
Char.      A    B    C    D    E   F   G   Avg.
Dynix     73%  56%  13%  45%  0%  0%  0%   46%
Lotus     54%  56%  65%  78%  0%  0%  0%   63%
Trados    40%  34%  24%  19%  0%  0%  0%   29%
Average   55%  49%  34%  47%  0%  0%  0%   46%

Table 8.5.1: Phase I Analysis of the ability of RANLT to recognise
certain linguistic characteristics in an utterance. For example
the column marked `A' gives for each set of utterances the percentage
of verbs occurring in them which could be recognised. The
full set of codes is itemised in Table 8.2. The average in column
eight is that of characteristics A to D only.

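As an illustration of the pre-processing described in this section, the sketch below mimics the behaviour of the Lex program in Python: Arabic numbers become the lexeme "number", a few other characters become "symbol", the text is lower-cased, and the sentence is chopped into punctuation-free fragments. The character classes used here are guesses for illustration, not the original Lex rules.

import re

# Sketch of the Section 8.4.1 pre-processing: map Arabic numbers to
# "number", some special characters to "symbol", lower-case the text,
# and split it into fragments at punctuation marks.

def preprocess(sentence):
    s = sentence.lower()
    s = re.sub(r"\d+", " number ", s)          # e.g. 3  -> number
    s = re.sub(r"[<>*#@]", " symbol ", s)      # e.g. >  -> symbol
    fragments = re.split(r"[,;:.!?()]+", s)    # chop at punctuation
    return [" ".join(f.split()) for f in fragments if f.strip()]

print(preprocess("Depending on where you are on the system, "
                 "you use different procedures to start a search."))
# -> ['depending on where you are on the system',
#     'you use different procedures to start a search']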

8.4.2 Results
The first point to note in considering the RANLT parser is that there are
certain characteristics which it cannot extract from an utterance. The
capabilities of the system are summarised in Table 8.1. The system has
no parse selection mechanism, so it makes no sense to ask whether
PPs are attached correctly, or whether coordination and gapping are
correctly treated.
The results of parsing the 60 test utterances can be seen in Tables
8.3.1, 8.4.1 and 8.5.1. When studying Table 8.3.1 it is important to
bear in mind that RANLT works with a version of the sentences that
has been transformed into fragments at the pre-processing stage. Thus
the analysis is in terms of 109 fragments rather than the original 60
utterances.
Timings for the parser are shown in Table 8.4.1. As RANLT does not
give timings for individual sentences, it was not possible to determine
the times to accept or reject each one. Thus two columns in the table
are left blank.
The measured ability of the system to recognise the constructs A to D
is summarised in Table 8.5.1. It should be noted that the average figures
shown in the right-hand column are the averages of the characteristics
A to D only. Characteristics E to G are excluded because RANLT cannot
handle them.

8.5 Analysis II: Original Grammar, Additional Vocabulary
Analysis II was not carried out on the set of 60 utterances, although it
was performed on the original 600-sentence corpus; see the proceedings
of the workshop for more details.

8.6 Analysis III: Altered Grammar, Additional Vocabulary
Again, Analysis III was not carried out on the set of 60 utterances,
although it was performed on the original 600-sentence corpus; see the
proceedings of the workshop for more details.

8.7 Converting Parse Tree to Dependency Notation
The organisers of the workshop wanted a parse expressed in dependency
form. We understand `dependency form' to mean an unlabelled tree that
spans the sentence, such that all intermediate non-terminal nodes and
gaps are deleted.
The suggested sentence was as follows:
That is these words make the source sentence longer or shorter than the
TM sentence
The first parse our system produced was:
(S1a (N2+/PRO that)
(VP/BE_NP is
(N2+/N1PROa
(N1/POST_APMOD1
(N1/RELMOD2 (N1/PRO2 these)
(S/THATLESSREL
(S1a (N2+/N2-a (N2- (N1/N words)))
(VP/OR_BSE (MSLASH1a) make
(N2+/DET1a the (N2- (N1/N source)))
(VP/NP sentence (TRACE E))))))
(A2/COORD2 (A2/ADVMOD1/- (A1/A longer))
(CONJ/A2 or
(A2/COMPAR1 (A1/A shorter)
(P2/P1
(P1/NPa than
(N2+/DET1a the
(N2- (N1/N (N/COMPOUND1 tm sentence)))))))))))))

In dependency form, this parse becomes:


(that
(is these words
(make (the source sentence))
(longer (or shorter
(than
(the (tm sentence)))))))

As can be seen, in the dependency tree, much information is lost:
the gaps are no longer present and constituency is impoverished. Using
dependency trees to compare systems is therefore only a very weak
measure.
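Mechanically, the conversion amounts to walking the labelled tree, discarding node labels and gaps, and flattening away brackets that become empty or trivial. The sketch below does this over nested Python lists standing in for the Lisp-style trees shown above; it illustrates the idea rather than reproducing the converter actually used.

# Sketch of collapsing a labelled parse tree into the unlabelled
# dependency form used above: node labels and gaps (E) are dropped,
# and brackets left with a single child are flattened.
# Trees are nested lists whose first element is the node label.

def to_dependency_form(tree):
    if isinstance(tree, str):
        return None if tree == "E" else tree              # drop gaps
    children = [to_dependency_form(c) for c in tree[1:]]  # drop the label
    children = [c for c in children if c not in (None, [])]
    if not children:
        return None
    return children[0] if len(children) == 1 else children

example = ["S1a", ["N2+/PRO", "that"],
                  ["VP/BE_NP", "is",
                      ["N2+/DET1a", "the",
                          ["N2-", ["N1/N", "source"]]]]]
print(to_dependency_form(example))
# -> ['that', ['is', ['the', 'source']]]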

Figure 8.1: The distribution of all fragments by length
(x-axis: fragment length, 0-40; y-axis: number of fragments, 0-140).

8.8 Summary of Findings


In conclusion, ANLT is capable of assigning parses to most of the soft-
ware manual fragments. Also, the coverage of the grammar is, at least
for the software manual fragments, inversely related to fragment length.
This is shown by the following graphs. If we examine the distribution of
all the fragments by length (Figure 8.1) then we can see that most of the
fragments are short. If we examine the distribution of those fragments
that were parsed (in this case in Analysis I) then we can see that most
of the parsed fragments are short (Figure 8.2). For amplification, if we
make a graph of fragment length by percentage of such fragments that
were parsed (Figure 8.3), then we can see that coverage is approximately
linear with respect to fragment length.
This finding is also backed up by the mean lengths of the fragments
parsed and those not parsed. In all of the experiments, the fragments
parsed were on average shorter than the fragments overall. Furthermore,
those fragments that were rejected all had a mean length greater
than that of the fragments overall.
Figure 8.2: The distribution of all parsed fragments by length
(x-axis: fragment length, 0-30; y-axis: number of fragments, 0-120).

This result, that the chance of parsing a fragment is related to its
length, reflects the computational fact that the amount of space required
to parse a sentence is exponential with respect to sentence length4 and
hence longer sentences are more likely to be abandoned due to exceeding
a resource bound. Also, it reflects the linguistic fact that the longer the
fragment, the more likely it is that the fragment will contain a novel
construct, or a word used in some novel way. Hence, it is important to
find ways of chunking long sentences into
shorter, more manageable fragments. Using punctuation is one way of
achieving this.
Analysing the errors (i.e. those fragments that did not receive an
ordinary parse) in the Dynix corpus, we found the following:
70.66% of the errors were due to inadequacies in either the ANLT
lexicon, or in the lexicon constructed for the corpus.
17.33% of the errors were due to parenthetical constructions.
4 Packing mechanisms can only give polynomial space and time bounds when the
grammar contains n-ary rules, for some fixed n. The ANLT formalism imposes no
such restriction upon the number of categories that a rule can have.

Figure 8.3: The percentage of all parsed fragments by length
(x-axis: fragment length, 0-30; y-axis: percentage parsed, 0-100).

5.33% of the errors were due to ill-formed fragments.
4% of the errors were due to examples of American English
constructions being parsed with a grammar for British English.
2.67% of the errors were either due to mistakes in pre-processing
the sentences into fragments, or were due to sentences containing
idioms.
Hence, when using the same resource bounds, and dealing with the lexi-
cal errors, we might expect to be able to parse at least 85% of the Dynix
fragments. Given that the corpora were all from the same genre, we do
not expect this error analysis to be substantially different for the other
corpora.
Interestingly enough, two of the software manual fragments were in
American English. This did not present a problem to ANLT, even though
it used a grammar of British English. Most of the differences were lexical,
and not syntactic.
Since the grammar clearly has wide coverage, and given the differences
between Analysis I and Analysis II, it is evident that a major obstacle to
parsing unrestricted, naturally occurring language is creating a suitable
lexicon. Not only is this labour intensive, but it is also error-prone.
Unfortunately, the stochastic tagger we used does not promise to be the
solution to this problem. What is needed is a tagger that uses a richer
tagset.
For further work, we shall investigate the addition of punctuation to
ANLT, methods of locating features to relax and minimising the amount
of redundant re-parsing involved, and ways of reducing the lexical incom-
pleteness problem.

8.9 References
Alshawi, H. (Ed.) (1992). The CORE Language Engine. Cambridge,
MA: MIT Press.
Black, E., Garside, R., & Leech, G. (Eds.) (1993). Statistically-driven
computer grammars of English: the IBM-Lancaster approach. Amsterdam,
The Netherlands: Rodopi.
Bouma, G. (1992). Feature Structures and Nonmonotonicity. Computa-
tional Linguistics, 18(2). (Special Issue on Inheritance I.)
Briscoe, T. (1994). Parsing (with) Punctuation etc. (Technical Report).
Grenoble, France: Rank Xerox Research Centre.
Briscoe, T., & Carroll, J. (1991). Generalised Probabilistic LR Parsing of
Natural Language (Corpora) with Unification-based Grammars (Tech-
nical Report Number 224). Cambridge, UK: University of Cambridge,
Computer Laboratory.
Burns, A., Duffy, D., MacNish, C., McDermid, J., & Osborne, M.
(1995). An Integrated Framework for Analysing Changing Require-
ments (PROTEUS Deliverable 3.2). York, UK: University of York,
Department of Computer Science.
Carroll, J. (1993). Practical Unification-based Parsing of Natural Lan-
guage. Ph.D. Thesis, University of Cambridge, March 1993.
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical
part-of-speech tagger. Proceedings of the Third Conference on Applied
Natural Language Processing, ANLP, 1992.
Douglas, S. (1995). Robust PATR for Error Detection and Correction.
In A. Schoter and C. Vogel (Eds.) Edinburgh Working Papers in Cog-
nitive Science: Nonclassical Feature Systems, Volume 10 (pp. 139-
155). Unpublished.
Duffy, D., MacNish, C., McDermid, J., & Morris, P. (1995). A framework
for requirements analysis using automated reasoning. In J. Iivari, K.
Lyytinen and M. Rossi (Eds.) CAiSE*95: Proceedings of the Seventh
Advanced Conference on Information Systems Engineering (pp. 68-
81). New York, NY: Springer-Verlag, Lecture Notes in Computer
Science.
Fain, J., Carbonell, J. G., Hayes, P. J., & Minton, S. N. (1985). MUL-
TIPAR: A Robust Entity Oriented Parser. Proceedings of the 7th
Annual Conference of the Cognitive Science Society, 1985.
Gazdar, G., Klein, E., Pullum, G. K., & Sag, I. A. (1985). General-
ized Phrase Structure Grammar. Cambridge, MA: Harvard University
Press.
Grover, C., Briscoe, T., Carroll, J., & Boguraev, B. (1993). The Alvey
Natural Language Tools Grammar (4th Release) (Technical report).
Cambridge, UK: University of Cambridge, Computer Laboratory.
Hayes, P. J. (1981). Flexible Parsing. Computational Linguistics, 7(4),
232-241.
Jones, B. E. M. (1994). Can Punctuation Help Parsing? 15th Interna-
tional Conference on Computational Linguistics, Kyoto, Japan.
Magerman, D. M. (1994). Natural Language Parsing as Statistical Pat-
tern Recognition. Ph.D. thesis, Stanford University, February 1994.
Magerman, D., & Weir, C. (1992). Efficiency, Robustness and Accuracy
in Picky Chart Parsing. Proceedings of the 30th ACL, University of
Delaware, Newark, Delaware, 40-47.
Mellish, C. S. (1989). Some chart-based techniques for parsing ill-formed
input. ACL Proceedings, 27th Annual Meeting, 102-109.
Nunberg, G. (1990). The linguistics of punctuation (Technical Report).
Stanford, CA: Stanford University, Center for the Study of Language
and Information.
Osborne, M. (1994). Learning Unification-based Natural Language Gram-
mars. Ph.D. Thesis, University of York, September 1994.
Osborne, M. (1995). Can Punctuation Help Learning? IJCAI95 Work-
shop on New Approaches to Learning for Natural Language Process-
ing, Montreal, Canada, August 1995.
Osborne, M., & Bridge, D. (1994). Learning unification-based gram-
mars using the Spoken English Corpus. In Grammatical Inference
and Applications (pp. 260-270). New York, NY: Springer Verlag.
Shieber, S. M. (1986). An Introduction to Unification-Based Approaches
to Grammar (Technical Report). Stanford, CA: Stanford University,
Center for the Study of Language and Information.
Sundheim, B. M. (1991). Third Message Understanding Evaluation and
Conference (MUC-3): Methodology and Test Results. Natural Lan-
guage Processing Systems Evaluation Workshop, 1-12.
Taylor, L. C., Grover, C., & Briscoe, E. J. (1989). The Syntactic Reg-
ularity of English Noun Phrases. Proceedings, 4th European Associa-
tion for Computational Linguistics, 256-263.
Vogel, C., & Cooper, R. (1995). Robust Chart Parsing with Mildly
Inconsistent Feature Structures. In A. Schoter and C. Vogel (Eds.)
Edinburgh Working Papers in Cognitive Science: Nonclassical Feature
Systems, Volume 10 (pp. 197-216). Unpublished.
Weischedel, R. M. (1983). Meta-rules as a Basis for Processing Ill-formed
Input. Computational Linguistics, 9, 161-177.
9
Using the SEXTANT Low-Level
Parser to Analyse a Software
Manual Corpus
Gregory Grefenstette1
Rank Xerox Research Centre, Grenoble

9.1 Introduction
Parsers are used to attack a wide range of linguistic problems, from
understanding mechanisms universal to all languages to describing specific
constructions in particular sublanguages. We view the principal requirements
of an industrial parser to be those of robustness and accuracy.
Robustness means that the parser will produce results given any type
of text, and accuracy means that these results contain few errors without
missing the important syntactic relations expressed in the text. The
question of what is important varies, of course, with the use to which
the output is put. For example, if the parser were to be used as the first
step in database access of structured data, then fine-grained syntactic relations
between anaphora and referents and between principal and subordinate
clauses would be necessary to produce correct database queries. In what
follows, we consider that the output will be used for some type of information
retrieval or terminological extraction, and what is important are
the relations between the information-bearing words.
In this chapter, we present a low-level, robust parser, used in SEXTANT
(Grefenstette, 1994), a text exploration system, which has been
used over large quantities of text to extract simple, common syntactic
structures. This parser is applied here to a small sample of technical
documentation: a corpus of 600 utterances obtained from three different
manuals, the Dynix Automated Library Systems Searching Manual, the
manuals, the Dynix Automated Library Systems Searching Manual, the
1 Address: Rank Xerox Research Centre, 6 chemin de Maupertuis, Meylan, France.
Tel: +33 76 615082, Fax: +33 76 615099, Email grefen@xerox.fr.

Lotus Ami Pro for Windows User's Guide Release Three and the Tra-
dos Translator's Workbench for Windows User's Guide. A minimal set
of binary dependency relations is defined for evaluating the parser output,
and output from the different versions of the parser implemented is
compared to a manually parsed portion of the test collection.
The parser described here2 was designed not to explore any particular
linguistic theory, nor to provide a complete hierarchical description of a
given sentence. It was developed to extract frequently occurring binary
syntactic dependency patterns in order to compare word use in a large
corpus.
The data derived from the parser was used in a number of exper-
iments in automatic thesaurus construction (Grefenstette, 1994), and
more recently as a front-end to lexicographic tools exploring corpora
(COMPASS, 1995; Grefenstette and Schulze, 1995).
The SEXTANT parser, similar to a chunking parser (Abney, 1991),
is based on ideas described by Debili (1982) in which a number of finite-state
filters were proposed for recognizing verbal chains and noun chains,
and for extracting binary dependency relations from tagged text. Sec-
tion 9.2 describes an implementation of these ideas for English.

9.2 Description of Parsing System


The parsing system used here is very rudimentary. It uses an essentially
finite-state approach to the problem of parsing, based on regular expressions
describing syntagmatic chains and then heuristically drawing relations within
2 Rank Xerox is pursuing research in finite-state parsing which uses regular expressions
involving permissible syntactic patterns over the whole sentence (Chanod,
1996). For example, one rule states that a principal clause cannot contain nouns
tagged as subject both preceding and following the main verb. This is possible since
our finite-state morphological analyzers provide possible syntactic as well as grammatical
tags. These rule-based regular expressions are composed with a finite-state
network representing all possible interpretations of the sentence in order to eliminate
most interpretations. The system currently being developed for French is similar in
philosophy to that developed by the University of Helsinki in their English Constraint
Grammar (Voutilainen, Heikkila & Anttila, 1992) but with the significant difference
that rules are applied to finite-state networks rather than to sequences of tagged
words, allowing a considerable reduction in the number of rules to be developed.
The result of this finite-state parser currently being developed will be an input
string tagged with parts of speech and with syntactic roles such as `Subject', `Main
Verb', `Object of Preposition', etc. These syntactically tagged strings will then be
fed into a complete Lexical Functional Grammar, concurrently under development at
Xerox PARC and the Rank Xerox Research Centre.
The parser described in this chapter was a much simpler finite-state parser,
driven by three data files defining precedence rules. After the work described in
this chapter, these files were replaced by equivalent finite-state expressions, and the
whole parser moved from a C-based program into a cascade of finite-state transducers
and filters (Grefenstette, 1996).

original string: `He's late (again!)' she said. `What?'


tokenized result: ` He's late ( again ! ) ' she said .
` What ? '

Figure 9.1: Example of a tokenized sequence, divided into sentences
and tokens.

and between syntagmatic groups. It can also be seen as a sequence of
linear passes over the text, each time marking and remarking the words
with different information.
The system can be divided into two parts: a text preparation se-
quence, and a parsing sequence. Text preparation is described below in
Section 9.2.1 and the parser in Section 9.2.2.

9.2.1 Preparsing Processing


The input text to be parsed must first be processed in a number of ways.
The next six subsections describe this preprocessing.

9.2.1.1 Tokenization
Tokenization divides the input string into tokens and sentences. After
this point, all other treatment is confined to individual sentences. There
are no inter-sentential relations created by this parser, such as would be
found in an anaphor-resolving system or discourse parser.
Finding sentence boundaries in text entails finding word boundaries
in order to decide which token-terminating periods correspond to full
stops. This problem is not easy to solve exactly, but heuristics (Grefenstette
and Tapanainen, 1994) with high success rates can be found.
These heuristics were incorporated into a finite-state tokenizer in our
system.3
In the technical documentation that was provided as test suites for
the IPSM'95 conference, although full sentences appeared one per line,
there was no indication of when an isolated line was a section heading,
or the heading of a list whose elements followed on succeeding lines. A
choice was made to add a period to the end of each input line that did
not already have one, before treatment. This period-adding was the only
IPSM'95-specific modification made to the original text files.
3 A similar tokenizer programmed with the generic Unix tool lex, is given in Ap-
pendix 1 of Grefenstette (1994).
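The full-stop decision can be illustrated with a deliberately small heuristic: a token-final period, exclamation mark or question mark ends a sentence unless the token is a known abbreviation. The sketch below works over already-separated tokens and uses an invented abbreviation list; the actual finite-state tokenizer handles word boundaries and many more cases.

# Toy sketch of the sentence-splitting heuristic: a token ending in
# ".", "!" or "?" closes a sentence unless it is a known abbreviation.
# The abbreviation list is illustrative only.

ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "vs."}

def split_sentences(tokens):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok[-1] in ".!?" and tok not in ABBREVIATIONS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

print(split_sentences("You can search by keyword , e.g. title words . "
                      "Press ENTER .".split()))
# -> [['You', 'can', 'search', 'by', 'keyword', ',', 'e.g.', 'title',
#      'words', '.'], ['Press', 'ENTER', '.']]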

tokenized text: Any existing text automatically moves .

morphologically analyzed text:


any any+Psg any+Dsg any+Adj
existing exist+Vprog
text text+Nsg
automatically automatically+Adv
moves move+Vsg3 move+Npl
. .

Figure 9.2: The morphological analyzer assigns all appropriate
parts of speech to each token. This is the extent of the lexical
information used by the parser. There is no information, for instance,
on subcategorization or valency.

9.2.1.2 Name recognition


Our name recognition module uses capitalization conventions in English
to join proper names into a single token. Tokens produced by the preceding
step are passed through a filter4 that joins together non-sentence-initial
sequences of capitalized words into proper name units (e.g. "You tell Ami Pro
that ..." is rewritten "You tell Ami-Pro that ..."). This simple filter is
only employed for non-sentence-initial capitals, so a sentence beginning
with "Ami Pro ..." remains untouched. A second loop that would pick
up these sentence-initial ambiguities might be developed but this was
not implemented here.
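The filter can be approximated in a few lines: within a sentence, any run of capitalized tokens that does not start at the first position is joined into a single hyphenated token. The sketch below mirrors the awk filter only in spirit; sentence-initial capitals are deliberately left alone, as described above.

# Sketch of the proper-name filter: join non-sentence-initial runs of
# capitalized tokens into one hyphenated token ("Ami Pro" -> "Ami-Pro").
# Sentence-initial capitals are left untouched.

def join_proper_names(tokens):
    out, i = [], 0
    while i < len(tokens):
        if i > 0 and tokens[i][:1].isupper():
            j = i
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            run = tokens[i:j]
            out.append("-".join(run) if len(run) > 1 else run[0])
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_proper_names("You tell Ami Pro that the file is open .".split()))
# -> ['You', 'tell', 'Ami-Pro', 'that', 'the', 'file', 'is', 'open', '.']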

9.2.1.3 Morphological analysis


Morphological analysis attaches all possible parts of speech to each
token in a sentence. Our system uses finite-state morphological analyzers
created using two-level rules (Karttunen, Kaplan & Zaenen, 1992). This
representation allows a compact storage of a language's lexicon in a
finite-state transducer (Karttunen, 1994). These transducers can possess
loops, allowing a small finite representation for the infinite vocabulary
of agglutinative languages. Lookup of a word is also extremely fast, since
it is simply the following of a path through an automaton. See Figure
9.2 for a small sample. When a word is not found in the lexicon, it is
guessed during the tagging process by another guessing transducer. The
transducers provide morphological information, lemmatized form, and
part of speech tags for each token.
4 The awk code for this filter is also found in Appendix 1 of Grefenstette (1994).

In/IN Insert/NN mode/NN ,/CM you/PPSS insert/VB text/NN


at/IN the/AT position/NN of/IN the/AT insertion/NN
point/NN and/CC any/DTI existing/NN text/NN automatically/RB
moves/VBZ ./SENT

Figure 9.3: Text tagged by the XSoft tagger. The tagset is
slightly different from that used in the morphological analyzer.
This is characteristic of a layered approach to natural language
processing in which different lexical perspectives are used at different
times. For example, the SEXTANT parser described below
uses yet another reduced tagset.

9.2.1.4 Tagging
Tagging chooses one part-of-speech tag for each token. Our tokenized
text is passed through a Hidden Markov Model tagger (Cutting, Kupiec,
Pedersen & Sibun, 1992).5 This tagger6 calculates the most likely path
through the possible tags attached to the words of the sentence, using a
probability model built by reducing entropy of tag transitions through
an untagged training corpus. The version of the tagger that we use also
includes tokenizing and morphological analyzing transducers. We used
our tokenizer one extra time here so that name recognition is performed
before calling the tagger.
9.2.1.5 Lemmatization
Lemmatization attaches a lemmatized form of the surface token to the
tagged token. The tagger that we use does not yet include an option
for outputting the lemmatized version of words, even though this infor-
mation is included in the morphological transducer that it employs. In
order to obtain the lemmatized forms (which are not actually necessary
for the parser but are useful for later tasks), we developed a transducer
that converts tagged forms of words into lemmas. This lemma
appears in the parse output along with the surface form; see Figure 9.6
below.
5 XSoft now commercializes a very fast C-based version of such taggers for En-
glish, French, German, Spanish, Italian, Dutch, and Portuguese. These taggers in-
clude tokenizers. Contact Ms. Daniella Russo at Xerox XSoft for more information,
russo@xsoft.xerox.com, Tel: +1 415 813 6804, Fax: +1 415 813 7393. These taggers
can be tested at http://www.xerox.fr/grenoble/mltt/Mos/Tools.html.
6 The Common Lisp source code for version 1.2 of the Xerox part-of-speech tagger
is available for anonymous FTP from parcftp.xerox.com in the file pub/tagger/tagger-
1-2.tar.Z. Another freely available English tagger, developed by Eric Brill, uses
rules based on surface strings and tags. This can be found via anonymous ftp to
ftp.cs.jhu.edu in pub/brill/Programs and pub/brill/Papers.

In/PREP/in Insert/NOUN/insert mode/NOUN/mode ,/CM/,


you/PRON/you insert/INF/insert text/NOUN/text at/PREP/at
the/DET/the position/NOUN/position of/PREP/of the/DET/the
insertion/NOUN/insertion point/NOUN/point and/CC/and
any/DET/any existing/NOUN/existing text/NOUN/text
automatically/ADV/automatically moves/ACTVERB/move ././.

Figure 9.4: The parser receives a simplified tagset and surface
and lexical forms of the input words.

9.2.1.6 Tag simplification


Tag simplification reduces the number of tags fed to the parser. The
tags provided by the morphological transducer, similar to the original
Brown tagset (Francis & Kucera, 1982) for English, include information
about number and person, as well as part-of-speech. For the vocabulary
in the test corpus, the tagger returns 64 different tags: ABN ABX AP
AT BE BED BEDZ BEG BEN BER BEZ CC CD CM CS DO DOD
DOZ DT DTI DTS DTX EX HV HVG HVZ IN JJ JJR JJT MD NN
NN$ NNS NOT NP NP$ NPS NR OD PN PP$ PPL PPO PPS PPSS
PPSSMD PUNCT RB RBC RN SENT TO UH VB VBD VBG VBN
VBZ WDT WP$ WPO WPS WRB. The parser uses a tagset that is
simpler, reducing the above tags to the following twenty (given with
their frequency in the test corpus): ACTVERB (256) ADJ (574) ADV
(262) AUX (182) BE (191) CC (311) CM (449) CS (359) DET (1289)
INF (770) INGVERB (187) NOMPRON (5) NOUN (2701) NUM (96)
PPART (214) PREP (856) PRON (403) RELCJ (7) TO (219).7 This is
possible because the parser does not need to distinguish, for example, be-
tween a personal possessive pronoun and a determiner since no anaphor
resolution is performed. See Figure 9.4 for a sample of simplified tags
sent to the parser.
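Operationally, the reduction is a many-to-one table from the Brown-style tags to the parser's classes. The partial mapping below is read off Figures 9.3 and 9.4 and is therefore incomplete; it is given only to make the mechanism concrete.

# Partial sketch of the tag simplification step: a many-to-one mapping
# from the Brown-style tagger tags to the parser's reduced tagset.
# The pairs below are taken from Figures 9.3 and 9.4; the full table
# is larger and is not reproduced here.

SIMPLIFY = {
    "IN": "PREP", "AT": "DET", "DTI": "DET",
    "NN": "NOUN", "PPSS": "PRON",
    "VB": "INF", "VBZ": "ACTVERB",
    "RB": "ADV", "CC": "CC", "CM": "CM",
}

def simplify(tagged_tokens):
    """Map (word, tag) pairs onto the reduced tagset, keeping unknown
    tags unchanged."""
    return [(word, SIMPLIFY.get(tag, tag)) for word, tag in tagged_tokens]

print(simplify([("In", "IN"), ("Insert", "NN"), ("mode", "NN"),
                (",", "CM"), ("you", "PPSS"), ("insert", "VB")]))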
9.2.2 Parsing
At this point each sentence is ready to be parsed. Each token is decorated
with a simplified part-of-speech tag and a lemmatized form.
The parser takes the tagged text as input and adds additional syntac-
tic tags and dependencies to each word via a number of passes from right
7 The tags stand for Active Verb, Adjective, Adverb, Auxiliary Verb (other than
forms of to be), All Forms of to be, Coordinating Conjunctions, Commas, Determin-
ers, Infinitives (other than to be), -ing verb forms (other than being), there, Nouns,
Numbers, Past Participles (other than been), Prepositions (other than to), Pronouns,
Relative Conjunctions, and to.

actvb adv aux be inf ingvb ppart prep to


actvb 1 1 1 1 1 1
adv 1 1 1 1 1 1 1
aux 1 1 1 1 1
be 1 1 1 1 1 1 1 1
inf 1 1 1 1 1 1
ingvb 1 1 1 1 1 1 1
ppart 1 1 1 1 1 1 1
prep 1 1
to 1 1 1
begin: AUX PREP TO INF ACTVB BE PPART
end: INF ACTVERB BE INGVB PPART AUX

Figure 9.5: Verbal chain precedence matrix used in parsing experiments.
A word tagged with the name of a row can be followed
by a word tagged with a name of a column if the corresponding
matrix entry is 1. For example, an infinitive (INF) can be followed
by an adverb (ADV) but not by an auxiliary verb (AUX) in the
same verbal chain. Note, all forms of the verb `to be' are tagged
BE.

to left and from left to right over the sentence. The parser's construction
was based on the image of an ideal English sentence being a collection
of verb chains with each verb chain being balanced by a preceding noun
phrase corresponding to its subject and a succeeding noun phrase corre-
sponding to its object, with prepositional phrases and adverbs scattered
throughout.
9.2.2.1 Noun chain identification
Noun chain identification isolates and incrementally numbers all sequences
of nouns that enter into noun phrases and prepositional phrases.
A regular expression is built which recognizes maximal length noun
chains. One of the three data files supplied to the parser contains a
list of part-of-speech categories which are possible beginnings of a noun
chain; another list of possible endings; and a third list, a linked list rep-
resentation of a matrix, stating for each category, what other categories
can follow it while still staying in the same noun chain. For example,
determiners, numbers, nominal pronouns, adjectives, prepositions are
possible noun chain initiators, but not adverbs or coordinating conjunc-
tions. The third list of noun chain continuators states that an adjective

can be followed by a noun, but a pronominal noun cannot be followed by
anything. A matrix similar to the one shown for verb chains in Figure
9.5 was manually created for determining noun chains with the simplified
tags shown in Section 9.2.1.6.
Once a noun chain is isolated, possible heads of noun phrases and
prepositional phrases are marked in the following way. Another pass
is made through the noun chain. Whenever a preposition, a comma, a
conjunction or the end of the noun chain is reached, the most recently
seen element in a user-declared set, called the ReceiveSet,8 is marked as
a head. If the subchain did not contain an element in this set, then the
rightmost element is marked as a head. For example, in "the door frame
corner in the wall", "corner" and "wall" will be marked as heads; in "with
the same", even if "same" is marked as an ADJ, it will be marked as a
head since no other candidate was found. In Figure 9.6 these elements
marked as heads are labeled NP*.
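The chain-building step can be sketched as a scan driven by small tables of the kind just described: chain starters, a successor relation, and the ReceiveSet used to mark heads. The tables in the sketch below are a toy reconstruction from the prose, not the parser's actual data files.

# Toy sketch of noun chain identification: grow a chain while the
# successor table allows it, then mark heads before prepositions,
# commas, conjunctions and at the end of the chain.

STARTERS = {"DET", "NUM", "ADJ", "PREP", "NOUN"}
FOLLOWS = {                       # which tag may follow which in a chain
    "DET": {"ADJ", "NUM", "NOUN"},
    "ADJ": {"ADJ", "NOUN"},
    "NUM": {"NOUN"},
    "NOUN": {"NOUN", "PREP"},
    "PREP": {"DET", "NUM", "ADJ", "NOUN"},
}
RECEIVE_SET = {"NOUN"}            # preferred head categories

def noun_chains(tagged):
    chains, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in STARTERS:
            j = i
            while (j + 1 < len(tagged)
                   and tagged[j + 1][1] in FOLLOWS.get(tagged[j][1], set())):
                j += 1
            chains.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1
    return chains

def mark_heads(chain):
    heads, last_candidate, previous_word = [], None, None
    for position, (word, tag) in enumerate(chain):
        if position > 0 and tag in {"PREP", "CM", "CC"}:
            heads.append(last_candidate or previous_word)
            last_candidate = None
        if tag in RECEIVE_SET:
            last_candidate = word
        previous_word = word
    heads.append(last_candidate or previous_word)   # end of the chain
    return heads

chain = [("the", "DET"), ("translation", "NOUN"),
         ("of", "PREP"), ("this", "DET"), ("example", "NOUN")]
print(noun_chains(chain))
print(mark_heads(chain))   # -> ['translation', 'example']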

9.2.3 List Recognition


As a specific development for this technical documentation test set, given
the large numbers of lists in the IPSM utterances, a list identification
pass was added for the test described in Analysis III (Section 9.6). List
identification breaks apart noun sequences at commas except when the
noun chain is recognized as a conjunctive list. If a noun chain contains
commas, but no conjunctions, then all the commas are marked as chain
boundaries. If the chain contains commas and conjunctions then the
commas before the conjunctions are retained and those after the con-
junction are considered as noun chain boundaries.
9.2.3.1 Left-to-right attachment
Dependency relations are created within noun phrases using finite-state
describable filters, which can be seen as working from left-to-right or
right-to-left9 over the tagged text. In order to create these filters, sets
of tags are declared in the parser data file mentioned in the footnote in
Section 9.2.2.1 as being possible dependents. These are currently the
tags DET, ADJ, NUM, NOUN, PREP, PPART, and TO. Determiners
8 This parser reads three files when it starts: one defining noun chains, one defining
verb chains, and one declaring what tags can correspond to dependent elements in a
noun phrase, what tags can correspond to head elements in a noun chain, and what
tags correspond to prepositions, conjunctions, past participles, pronouns, etc. In this
way, the parser is independent of a particular tag set and deals with classes.
9 Indeed in the version of the parser described here, which was implemented in a
program simulating finite-state filters in C code, these filters were really applied from
left to right and then from right to left by the control structures in the program.
Since then, we have re-written these filters as true finite-state expressions which are
compiled and applied using Xerox's finite-state compiler technology.

52 ------------------
During the translation of this example , the Workbench should ignore
the second sentence when moving from the first sentence to the third
one .
52 NP 2 During during PREP 0 1 2 (translation) PREP
52 NP 2 the the DET 1 1 2 (translation) DET
52 NP* 2 translation translation NOUN 2 0
52 NP 2 of of PREP 3 1 5 (example) PREP
52 NP 2 these this DET 4 1 5 (example) DET
52 NP* 2 examples example NOUN 5 1 2 (translation) NNPREP
52 -- 0 , , CM 0
52 NP 3 the the DET 7 1 8 (workbench) DET
52 NP* 3 Workbench workbench NOUN 8 0
52 VP 101 should should AUX 9 0
52 VP 101 ignore ignore INF 10 1 8 (workbench) SUBJ
52 NP 4 the the DET 11 1 13 (sentence) DET
52 NP 4 second second NUM 12 1 13 (sentence) ADJ
52 NP* 4 sentence sentence NOUN 13 1 10 (ignore) DOBJ
52 -- 0 when when CS 14 0
52 NP* 5 moving move INGVB 15 0
52 NP 5 from from PREP 16 1 19 (sentence) PREP
52 NP 5 the the DET 17 1 19 (sentence) DET
52 NP 5 first first NUM 18 1 19 (sentence) ADJ
52 NP* 5 sentence sentence NOUN 19 1 15 (move) IOBJ-from
52 NP 5 to to PREP 20 1 23 (one) PREP
52 NP 5 the the DET 21 1 23 (one) DET
52 NP 5 third third NUM 22 1 23 (one) ADJ
52 NP* 5 one one NUM 23 1 19 (sentence) ADJ
52 -- 0 . . . 24 0

Figure 9.6: SEXTANT sample parse. The first column is simply the sentence
number. The fourth column is the original surface form, and the fifth
is the lemmatized form. The second column states whether the token is part
of a noun chain (NP), a verb chain (VP) or other (--). NP* means that the
word can be a noun or prepositional phrase head. The third column numbers
the chain (starting from one for noun chains, from 101 for verb chains). The
sixth column (e.g. PREP) gives the simplified tag for the word. The seventh
column is just the number of the word in the sentence. The eighth column
describes with how many words the current word is in a subordinate dependency
relation. The description of these dependency relations follows in fields
nine onwards. The structure of this description is dominating-word-number
(dominating-word) relation. For example, the word number 10 ignore is tied
to the word number 8 workbench which is its subject (SUBJ).

and prepositions will attach to the first head to their right. The other
tags will attach to the first element to their right in the ReceiveSet, as
well as the first element marked as a head to their right. The relationship
created will depend upon the tags of each element. For example, in

SUBJ    for an active verb, what word is marked as its subject.
DOBJ    for a verb, what word is its direct object.
IOBJ    for a verb, what other word is a prepositional adjunct or
        argument.
ADJ     for an adjective, what word does it modify.
NN      for a noun, what other noun does it modify.
NNPREP  for a noun, head of a prepositional phrase, what other noun is
        this noun attached to.

Figure 9.7: Binary dependency relations retained for initial evaluation
of the parser.

"the door frame corner in the wall", "the" will attach to "corner", in a
determiner relation. "Door" will attach to "frame" and to "corner", as a
noun-noun modifier. "In" and "the" will attach to "wall", as determiner
and prepositional modifiers.
When a preposition attaches to a head, the head becomes marked
with this information and it will no longer be available as a subject or
direct object.
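Within a chain, these attachments can be approximated by a single left-to-right scan that links each dependent to the next marked head on its right. The sketch below covers only that head attachment (determiners, adjectives, modifying nouns and prepositions) over one noun chain with pre-marked heads; the additional attachment to the nearest ReceiveSet element, and the real filter set, are omitted.

# Sketch of left-to-right attachment inside a noun chain: dependents
# attach to the first head to their right, with the relation name
# depending on the dependent's tag (see Figure 9.7).
# `chain` is a list of (word, tag, is_head) triples.

RELATION = {"DET": "DET", "ADJ": "ADJ", "NUM": "ADJ", "NOUN": "NN",
            "PREP": "PREP"}

def attach_left_to_right(chain):
    relations = []
    for i, (word, tag, is_head) in enumerate(chain):
        if is_head or tag not in RELATION:
            continue
        for head_word, _tag, head_flag in chain[i + 1:]:
            if head_flag:                       # first head to the right
                relations.append((RELATION[tag], word, head_word))
                break
    return relations

chain = [("the", "DET", False), ("door", "NOUN", False),
         ("frame", "NOUN", False), ("corner", "NOUN", True),
         ("in", "PREP", False), ("the", "DET", False),
         ("wall", "NOUN", True)]
print(attach_left_to_right(chain))
# -> [('DET', 'the', 'corner'), ('NN', 'door', 'corner'),
#     ('NN', 'frame', 'corner'), ('PREP', 'in', 'wall'),
#     ('DET', 'the', 'wall')]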

9.2.3.2 Right-to-left attachment


Right to left attachment attaches prepositional phrase heads to the head
immediately preceding them. Gibson and Pearlmutter (1993) have cal-
culated that this heuristic of attaching low is correct two-thirds of the
time. During the verb attachment phase, prepositional phrases, already
attached to noun phrases, can be ambiguously attached to preceding
verbs as well.

9.2.3.3 Verb chain identification

Verb chain identification isolates and collects verbal chains, then marks
the heads of each verbal chain and determines its mode. Figure 9.5 shows
the precedence matrix used to isolate verb chains. Mode is determined
by the presence of forms of the verb to be and by -ing forms. The
head of the verb chain is chosen to be the last verb in the chain. For
example in the chain "wanted to see", see is chosen as the head, and the
subject of "wanted" will be the subject of "see". This simplifying choice
was made to group all complex verb modifications as forms of modal
auxiliaries since the parser was designed to extract relations between
content-bearing words, and it was thought that if someone "wanted to see", for
example, then that person was capable of "seeing". Proper treatment of
infinitive phrases is beyond the linear treatment of this parser.

9.2.3.4 Verb attachment


The preceding passes have accounted for all the nouns attached to other
nouns, or modified by prepositions, and have chosen heads for each ver-
bal chain. At this point, some nouns are not dependent on any others.
Starting from left to right, each active verbal chain will span out back-
wards to pick up unattached nouns as subjects, and then to its right
to pick up unattached nouns as direct objects. When the verbal chain
is passive, a preceding free noun will be considered as a direct object.
Finding agentive phrases headed by by is not attempted.
When the main verb is a form of the verb to be (i.e., the verb chain
is attributive), a dependency relationship is created between the object
and the subject of the verb. The first prepositionally modified noun to
a verb chain's right will be attached as an argument/adjunct, the parser
making no distinction between these two classes.
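The free-noun step can be sketched for the active-voice case only: scan left of a verb chain head for an unattached noun to act as subject, and right for an unattached noun to act as direct object. The passive, attributive and prepositional-argument behaviour described above is omitted, and the data layout is invented for the example.

# Sketch of verb attachment (active voice only): for each verb chain
# head, the nearest unattached noun to its left becomes SUBJ and the
# nearest unattached noun to its right becomes DOBJ.
# `words` is a list of (word, tag, already_attached) triples and
# `verb_positions` gives the indices of verb chain heads.

def attach_verbs(words, verb_positions):
    relations = []
    for v in verb_positions:
        verb = words[v][0]
        for i in range(v - 1, -1, -1):               # scan leftwards
            word, tag, attached = words[i]
            if tag == "NOUN" and not attached:
                relations.append(("SUBJ", verb, word))
                break
        for i in range(v + 1, len(words)):           # scan rightwards
            word, tag, attached = words[i]
            if tag == "NOUN" and not attached:
                relations.append(("DOBJ", verb, word))
                break
    return relations

words = [("the", "DET", True), ("workbench", "NOUN", False),
         ("should", "AUX", True), ("ignore", "INF", True),
         ("the", "DET", True), ("second", "NUM", True),
         ("sentence", "NOUN", False)]
print(attach_verbs(words, [3]))
# -> [('SUBJ', 'ignore', 'workbench'), ('DOBJ', 'ignore', 'sentence')]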

9.2.3.5 Gerund attachment


Gerund attachment makes local attachments of gerunds appearing in
noun chains to nearby noun heads. The result of this and all preceding
steps can be seen in Figure 9.6.

9.3 Parser Evaluation Criteria


Figure 9.6 shows an example of the parse of a sentence from the Trados
corpus. The binary relations that we think a parser
should be able to extract from this sentence are shown in Figure 9.8.
The actual parser was initially evaluated by calculating three numbers
for each sentence:
1. Correct relations extracted, CORRECT
2. Incorrect relations extracted, INCORRECT
3. Missing correct relations, MISSING
With these numbers, we can calculate the precision of the parser as the
number of correct relations divided by the total number of relations
extracted:

  precision = CORRECT / (INCORRECT + CORRECT)

and the recall of the parser as the number of correct relations divided
by the total number of relations that should have been extracted:

  recall = CORRECT / (MISSING + CORRECT)

During the translation of this example , the Workbench should ignore
the second sentence when moving from the first sentence to the third
one .

NNPREP     example    translation
SUBJ       workbench  ignore
DOBJ       ignore     sentence
ADJ        second     sentence
IOBJ-from  move       sentence
IOBJ-to    move       one
ADJ        first      sentence
ADJ        third      one

Figure 9.8: We consider that a parser should extract this set of
binary relations from the given sentence.

Correct Incorrect Missing


NNPREP example translation
SUBJ workbench ignore
DOBJ ignore sentence
ADJ second sentence
IOBJ-from move sentence
IOBJ-to move one
ADJ first sentence
ADJ one sentence
ADJ third one

Figure 9.9: Our parser returned the relations given in the first
two columns for the sentence of Figure 9.8, of which we consider
that one is incorrect. The binary relation listed in the third col-
umn was missing.

For example, for the sentence given in Figure 9.8, our parser returned
the binary relations shown in Figure 9.9, eight of which we consider
correct, one incorrect and one missing. For this sentence, the precision
is 8/9 or 89%, as is its recall.
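Treating the reference and returned relations as sets makes this bookkeeping explicit. The sketch below uses a three-relation toy example rather than the full relation sets of Figures 9.8 and 9.9.

# Sketch of the evaluation arithmetic over sets of binary relations.

def score(gold, returned):
    correct = len(gold & returned)
    incorrect = len(returned - gold)
    missing = len(gold - returned)
    precision = correct / (correct + incorrect)
    recall = correct / (correct + missing)
    return correct, incorrect, missing, precision, recall

gold = {("SUBJ", "workbench", "ignore"), ("DOBJ", "ignore", "sentence"),
        ("ADJ", "second", "sentence")}
returned = {("SUBJ", "workbench", "ignore"), ("DOBJ", "ignore", "sentence"),
            ("ADJ", "one", "sentence")}
print(score(gold, returned))
# -> (2, 1, 1, 0.666..., 0.666...)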
In a first phase of testing, we used our original SEXTANT parser
on the IPSM'95 test beds. Due to time constraints, we only evaluated
the parser results on the first 130 sentences of the LOTUS corpus. For
each sentence, we drew up a list of the relations that we thought
the parser should return from the list given in Figure 9.7, and compared
those to the actual relations returned from the parser. For this
evaluation we used an output format of the parser as shown in Figure
9.8. Over these 130 sentences, we calculated that the original parser re-

Characteristic   A    B    C    D    E    F    G
SEXTANT          yes  yes  yes  yes  yes  yes  yes/no

Table 9.1: Linguistic characteristics which can be detected by
the SEXTANT Parser. See Table 9.2 for an explanation of the
letter codes.

Code  Explanation
A     Verbs recognised
B     Nouns recognised
C     Compounds recognised
D     Phrase Boundaries recognised
E     Predicate-Argument Relations identified
F     Prepositional Phrases attached
G     Coordination/Gapping analysed

Table 9.2: Letter codes used in Tables 9.1 and 9.5.3.

turned 432 correct binary dependency relations, 186 incorrect, and 248
missing. In terms of percentages, this means that, under the evaluation
conditions stated above, the original parser had a precision of
70% (432/618) and a recall rate of 64% (432/680) for binary dependency
relations. Remember that these precision and recall rates are only for
binary relations and cannot be directly compared to the much harder
problem of finding the correct hierarchical parse of the entire sentence.
Tables 9.1 and 9.5.3 use an IPSM-wide evaluation scheme whose
codes appear in Table 9.2.10 In these tables a number of linguistic
characteristics are identified with the letters A to G. Our interpretation
of these characteristics follows. The letters A and B indicate whether
verbs and nouns are correctly identified. In our case, since we use a
tagger that makes these choices before parsing, these characteristics
evaluate the accuracy of this tagger. Table 9.5.3 indicates that the
tagger was functioning as well as reported in Cutting, Kupiec, Pedersen
& Sibun (1992). Letter C corresponds to compounds recognized. By this
we understood both proper names, such as "Ami Pro", as discussed in
Section 9.2.1.2, as well as proper attachment of noun compounds (marked
NN in Figure 9.7) during parsing, such as in "insertion point". We
counted the parser as successful if the attachment was correctly
indicated. This does not answer the terminological question of whether
all noun-noun modifications are compounds.
For "phrase boundaries", indicated with the letter D, we noted
10 We were only able to completely evaluate our Phase III parser using these criteria.

whether our maximal noun chains and verb chains end and begin where
we expected them to. For example, in the fragment "is at the beginning
of the text you want to select" we expected our parser to divide the
text into four chains: "is", "at the beginning of the text", "you", and
"want".
For predicate-argument relations (E) we counted the relations mentioned
as SUBJ and DOBJ in Figure 9.7.
The letter F concerns prepositional phrase attachment. Our parser
attaches prepositional phrases to preceding nouns or verbs, and in the
case of a prepositional phrase following a verb, it will attach the
prepositional phrase ambiguously to both. We counted a correct attachment
when one of these was right, supposing that some later processing
would decide between them using semantic knowledge not available to
the parser. Here are some examples of errors: in "Select the text to
be copied in the concordance window..." the parser produced "copy in
window" rather than "select in window". In "certain abbreviations may
work at one prompt but not at another" the parser produced only "work
at prompt" but not "work at another". This was counted as one success
and one failure.
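The scoring rule for characteristic F can be stated compactly. The sketch below is our illustration only, using a hypothetical representation in which each case pairs the set of heads proposed by the parser with the head we expected.

# Illustrative scoring of (possibly ambiguous) PP attachments: a case
# counts as correct when the expected head is among the proposed heads.

def count_correct(cases):
    # cases: list of (proposed_heads, expected_head)
    return sum(expected in proposed for proposed, expected in cases)

cases = [
    ({"copy"}, "select"),   # "copied in the concordance window": wrong head
    ({"work"}, "work"),     # "work at one prompt": correct
    (set(), "work"),        # "work at another": attachment never produced
]
print(count_correct(cases), "of", len(cases))   # 1 of 3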

9.4 Analysis I: Original Grammar, Original Vocabulary
The original grammar was developed principally with newspaper text
and scientific writing in mind, i.e. linguistic peculiarities linked to
dialogue and questions were not treated. The technical documentation
text treated here has two distinguishing characteristics: frequent use of
the imperative verb form, e.g. "Press ENTER to ...", and common use
of lists, e.g. in describing different ways to perform a given action.
These two characteristics violate the idealized view of balanced
verb-noun phrases described in Section 9.2.2. Both characteristics
prompted the parser modification described below in Section 9.6.
In order to evaluate the original parser on this text, a decision must
be made as to what an ideal parser should return. In keeping with the
idea of an industrial parser, we adopt the stringent requirement that the

Number Accept Reject % Accept % Reject


Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 9.3.1: Phase I acceptance and rejection rates for the SEX-
TANT Parser.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 0005 0.3 0.0
Lotus 0005 0.3 0.0
Trados 0005 0.3 0.0
Total 0015 0.3 0.0

Table 9.4.1: Phase I parse times for the SEXTANT Parser using
a SPARC 20 with 192 Megabytes. The first column gives the total
time to attempt a parse of each sentence.

ideal parser should only return one parse for each sentence. Our parser
does not return a global tree structure but rather draws binary labeled
relations between sentence elements. Some elements such as introductory
prepositional phrases are left unattached, cf. Fidditch (Hindle, 1993).
We decided therefore to use as evaluation criteria the six binary relations
shown in Figure 9.7. These relations are only a minimal set of the types
of relations that one would ask of a parser. For example, relations such
as those between adjectives and verbs, e.g. ready to begin, are missing,
as are those between words and multi-word structures.

9.5 Analysis II: Original Grammar, Additional Vocabulary
No additional vocabulary was added to the system. This phase was
empty.

Number Accept Reject % Accept % Reject


Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 9.3.3: Phase III acceptance and rejection rates for the
SEXTANT Parser.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 0005 0.3 0.0
Lotus 0005 0.3 0.0
Trados 0005 0.3 0.0
Total 0015 0.3 0.0

Table 9.4.3: Phase III parse times for the SEXTANT Parser
using a SPARC 20 with 192 MB. The first column gives the total
time to attempt a parse of each sentence.
Char. A B C D E F G Avg.
Dynix 98% 98% 75% 93% 83% 54% 31% 76%
Lotus 97% 97% 92% 91% 79% 57% 08% 74%
Trados 99% 97% 83% 88% 70% 65% 00% 72%
Average 98% 97% 83% 91% 77% 59% 13% 74%

Table 9.5.3: Phase III Analysis of the ability of the SEXTANT
Parser to recognise certain linguistic characteristics in an utter-
ance. For example the column marked `A' gives for each set of
utterances the percentage of verbs occurring in them which could
be recognised. The full set of codes is itemised in Table 9.2.

9.6 Analysis III: Altered Grammar, Additional Vocabulary
The principal parser change was the introduction of heuristics to
recognize conjunctive lists. The original parser took a simple view of
the sentence as a sequence of noun phrases and prepositional phrases
interspersed with verbal chains. The technical text used as a test bed
had a great number of conjunctive lists, e.g., "With a mouse, you can
use the vertical scroll arrows, scroll bar, or scroll box on the right side
of the screen to go forward or backward in a document by lines, screens,

                  Correct  Incorrect  Missing  Precision  Recall

Original Parser     432       186        248       70%       64%
Modified Parser     550       116        130       83%       81%

Table 9.6: Predicate-argument recognition improvement from
the Phase I parser to the Phase III parser, which no longer ignores
commas and which identifies nominal list structures.

or pages."
In order to treat these constructions, the parser was modified in four
ways. First, commas were reintroduced into the text, and were allowed
to appear within a nominal chain. Commas and all other punctuation had
been stripped out in the original parser. Second, rather than having
only one head marked, nominal chains were allowed to have multiple
heads. A head could appear before a comma or a conjunction, as well as
at the end of the chain. Third, an additional pass went through all the
nominal chains in the sentence and split the chain at commas if the chain
did not contain a conjunction. Fourth, the search for subjects and objects
of verb chains was modified to allow for multiple attachments. These
modifications took one and a half man-days to implement and test.
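As an illustration of the third modification, the following sketch splits a nominal chain at commas only when the chain contains no conjunction. The (word, tag) list representation and the tag names are our own simplification, not SEXTANT's actual data structures.

# Sketch of the comma-splitting pass over a nominal chain, represented
# here as a list of (word, tag) pairs; "CC" marks a conjunction.

def split_chain(chain):
    if any(tag == "CC" for _, tag in chain):
        return [chain]                 # conjunctive lists are left intact
    parts, current = [], []
    for word, tag in chain:
        if word == ",":
            if current:
                parts.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        parts.append(current)
    return parts

chain = [("insertion", "NN"), ("point", "NN"), (",", ","),
         ("status", "NN"), ("bar", "NN")]
print(split_chain(chain))   # two nominal chains, split at the comma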
A number of errors found in the original parser were due to tagging
errors. For the modified parser, we decided to retag five words: press,
use, and remains were forced to be verbs, bar(s) was forced to be a noun,
and the tag for using was changed to a preposition. Other words caused
problems, such as select which was often tagged as an adjective, or toggle
whose only possible tag was a verb, but these were not changed. Since
the tagger does not use lexical probabilities, words like copy and hold
sometimes appeared as nouns, eliminating all possible binary relations
that would have derived from the verb, as well as introducing incorrect
nominal relations. This might be treated by reworking the tagger
output, using corpus-specific data. This was not done here except for the
five words mentioned. The results obtained must be regarded in light of
this minimalist approach.
When these changes were included in the parser, the same 130 sentences
from LOTUS were reparsed, and the results from this modified
parser11 were recomputed. The results are given in Table 9.6, which
shows an improvement in precision to 83% (550/666) and an improvement
in recall to 81% (550/680).
11 The parser modifications consisted only in incorporating limited list
processing rather than any other corpus-specific treatment.

9.7 Converting Parse Tree to Dependency Notation
The output of the SEXTANT parser as shown in Figure 9.6 is not a
parse tree but a forest of labeled binary trees. Each output line with a
number greater than zero in the eighth column can be converted into a
binary labeled tree using columns nine and beyond. For example, the
tenth line (as numbered in column seven), corresponding to "ignore" in
this figure, shows that the subject (SUBJ) of "ignore" is the word in
line eight, "Workbench". The "sentence" in line thirteen is marked as
the direct object (DOBJ) of the word "ignore" in line ten. The word
"sentence" in line nineteen is in a binary tree labeled "IOBJ-from" with
the word "move" in line fifteen. These three examples could be rewritten
as:
subject(Workbench,ignore)
direct-object(ignore,sentence)
verb-prep(move,from,sentence)
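A minimal sketch of this conversion follows. Since Figure 9.6 is not reproduced here, we assume a simplified record per word (its line number, the word, a relation label and the line number of the related word); the real SEXTANT column layout differs, but the mapping into labelled binary relations is the same in spirit.

# Hypothetical, simplified SEXTANT-style records:
# (line, word, label, other_line); other_line == 0 means no relation.

records = [
    (8,  "Workbench", "SUBJ",      10),
    (10, "ignore",    None,         0),
    (13, "sentence",  "DOBJ",      10),
    (15, "move",      None,         0),
    (19, "sentence",  "IOBJ-from", 15),
]

words = {line: word for line, word, _, _ in records}

def to_relations(records):
    rels = []
    for _line, word, label, other in records:
        if other > 0 and label is not None:
            if label == "SUBJ":
                rels.append(("subject", word, words[other]))
            elif label == "DOBJ":
                rels.append(("direct-object", words[other], word))
            elif label.startswith("IOBJ-"):
                prep = label.split("-", 1)[1]
                rels.append(("verb-prep", words[other], prep, word))
    return rels

print(to_relations(records))
# [('subject', 'Workbench', 'ignore'),
#  ('direct-object', 'ignore', 'sentence'),
#  ('verb-prep', 'move', 'from', 'sentence')]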

9.8 Summary of Findings


In all, 1.5 man-weeks were spent on this task. The principal strengths of
this parser are its robustness (it always returns a parse), its generality
(no domain-specific information in the lexicon), and its speed (all 600
sentences are parsed in under 41 seconds of CPU time on a SPARC 20).
Its weaknesses are many. It cannot deal with unmarked embedded
clauses, i.e. in "... the changes you make appear ...", "make appear"
is not recognized as two verbal chains. Tagging errors cannot be
recovered, so errors in tagging are propagated throughout the parse.12
Being designed to work on declarative sentences, it can misread the
subject of imperative sentences.13 There is no attempt to identify the
subject of an infinitive verb phrase which is not part of an active verb
chain, e.g., in "... you want to go ...", "you" will be identified as a
subject of "go" but not in "you will be able to go". Progressive verb
phrases are succinctly handled by seeing if a noun precedes the verb and
calling it a subject if it does. Questions are not treated. Gaps are not
recognized or filled; many words are simply thrown away, e.g. adverbs.
No relations
12 Though this is a general problem in any system in which parsing and
tagging are independent.
13 The tagger was trained on untagged text with few imperatives and itself
does not return an imperative verb tag, so imperative verbs are marked as
active verbs, infinitives or nouns (especially following conjunctions).

are created between adjectives and verbs, e.g. "ready to begin" yields
nothing. Being word-based, the parser provides no relations between a
word and a larger syntactic unit, such as between "refer" and a title as
in "refer to Understanding formatting".
In other words, the level of analysis returned by this parser is of little
utility for higher linguistic tasks, for example automatic translation,
that require a more complete analysis of the sentence. It might serve,
however, for lower-level tasks such as terminology and subcategorization
extraction, or for information retrieval.

9.9 References
Abney, S. (1991). Parsing by Chunks. In R. Berwick, S. Abney & C.
Tenny (Eds.) Principle-Based Parsing. Dordrecht, The Netherlands:
Kluwer Academic Publishers.
Chanod, J. P. (1996). Rules and Constraints in a French Finite-State
Grammar (Technical Report). Meylan, France: Rank Xerox Research
Centre, January.
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A Practi-
cal Part-of-Speech Tagger. Proceedings of the Third Conference on
Applied Natural Language Processing, April, 1992.
COMPASS (1995). Adapting Bilingual Dictionaries for online Compre-
hension Assistance (Deliverable, LRE Project 62-080). Luxembourg,
Luxembourg: Commission of the European Communities.
Debili, F. (1982). Analyse Syntaxico-Semantique Fondee sur une Acqui-
sition Automatique de Relations Lexicales-Semantiques. Ph.D. The-
sis, University of Paris XI.
Francis, W. N., & Kucera, H. (1982). Frequency Analysis of English.
Boston, MA: Houghton Mifflin Company.
Gibson, E., & Pearlmutter, N. (1993). A Corpus-Based Analysis of
Constraints on PP Attachments to NPs (Report). Pittsburgh, PA:
Carnegie Mellon University, Department of Philosophy.
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discov-
ery. Boston, MA: Kluwer Academic Press.
Grefenstette, G. (1994). Light Parsing as Finite State Filtering.
Proceedings of the Workshop `Extended Finite State Models of Language',
European Conference on Artificial Intelligence, ECAI'96, Budapest
University of Economics, Budapest, Hungary, 11-12 August, 1996.
Grefenstette, G., & Schulze, B. M. (1995). Designing and Evaluating
Extraction Tools for Collocations in Dictionaries and Corpora (De-
liverable D-3a, MLAP Project 93-19: Prototype Tools for Extracting
Collocations from Corpora). Luxembourg, Luxembourg: Commission
of the European Communities.
Grefenstette, G., & Tapanainen, P. (1994). What is a Word, What
is a Sentence? Problems of Tokenization. Proceedings of the 3rd
Conference on Computational Lexicography and Text Research, COM-
PLEX'94, Budapest, Hungary, 7-10 July.
Hindle, D. (1993). A Parser for Text Corpora. In B. T. S. Atkins & A.
Zampolli (Eds.) Computational Approaches to the Lexicon. Oxford,
UK: Clarendon Press.
Karttunen, L. (1994). Constructing Lexical Transducers. Proceedings
of the 15th International Conference on Computational Linguistics,
COLING'94, Kyoto, Japan.
Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992). Two-Level Morphol-
ogy with Composition. Proceedings of the 14th International Con-
ference on Computational Linguistics, COLING'92, Nantes, France,
23-28 August, 1992, 141-148.
Voutilainen, A., Heikkila, J., & Anttila, A. (1992). A Lexicon and Con-
straint Grammar of English. Proceedings of the 14th International
Conference on Computational Linguistics, COLING'92, Nantes,
France, 23-28 August, 1992.
10
Using a Dependency Structure Parser without any Grammar
Formalism to Analyse a Software Manual Corpus
Christopher H. A. Ting1
Peh Li Shiuan
National University of Singapore
DSO

10.1 Introduction
When one designs algorithms for the computer to parse sentences, one is
asking the machine to determine the syntactic relationships among the
words in the sentences. Despite considerable progress in the linguistic
theories of parsing and grammar formalisms, the problem of having a
machine automatically parse the natural language texts with high accu-
racy and eciency is still not satisfactorily solved (Black, 1993). Most
existing, state-of-the-art parsing systems in the world rely on grammar
rules (of a certain language) expressed in a certain formal grammar for-
malism such as the (generalized) context free grammar formalism, the
tree-adjoining grammar formalism and so on. The grammar rules are
written and coded by linguists over many years. Alternatively, they
can be "learned" from an annotated corpus (Carroll & Charniak, 1992;
Magerman, 1994). In any case, the underlying assumption of using a
particular grammar formalism is that most, if not all, of the syntactic
1 Address: Computational Science Programme, Faculty of Science, National Uni-
versity of Singapore, Lower Kent Ridge Road, Singapore 0511. Tel: +65 373 2016,
Fax: +65 775 9011, Email: thianann@starnet.gov.sg. C. Ting is grateful to Dr How
Khee Yin for securing a travel grant to attend the IPSM'95 Workshop. He also thanks
Dr Paul Wu and Guo Jin of the Institute of System Science, National University of
Singapore for a discussion based on a talk given by Liberman (1993).

constructs of a certain language can be expressed by it. But is this
really true? Is a grammar formalism indispensable to parsing algorithms?
In this work, we propose a novel, hybrid approach to parsing that
does not rely on any grammar formalism (Ting, 1995a; Ting, 1995b; Peh
& Ting, 1995). Based on an enhanced hidden Markov model (eHMM), we
built a parser, DESPAR, that produces a dependency structure for an
input sentence. DESPAR is designed to be modular. It aims to be:
- accurate;
- capable of handling a vocabulary of unlimited size;
- capable of processing sentences written in various styles;
- robust;
- easy for non-linguists to build, fine-tune and maintain;
- fast.
The linguistic characteristics of DESPAR are tabulated in Table 10.1.
The verbs and nouns are recognized by the part-of-speech tagger, the
compounds by a noun phrase parser which is also statistical in nature,
the phrase boundaries by a segmentation module and the attachment and
coordination by the synthesis module. The predicate-argument structure
is analyzed by a rule-based module which reads out the subject, the
object, the surface subject, the logical subject and the indirect object
from the noun phrases and adjectives governed by the verbs. In other
words, these rules are based on the parts of speech of the nodes governed
by the predicates, and their positions relative to the predicates.
We tested DESPAR on a large amount of unrestricted text. In
all cases, it never failed to generate a parse forest, and to single out
a most likely parse tree from the forest. DESPAR was also tested on
the IPSM'95 software manual corpus. In the following, we present the
analysis results after a brief description of the parsing system (Section
10.2) and the parser evaluation criteria (Section 10.3).

10.2 Description of Parsing System


In 1994, we were given the task of building a wide-coverage parser. Not
knowing any grammar formalism, we wondered whether the conventional
wisdom that "parsing is about applying grammar rules expressed in a
certain grammar formalism" was an absolute truth. We began to explore
ways of re-formulating the problem of parsing so that we could build a
parser without having to rely on any formal grammar formalism.
By coincidence, we came to know of the success of using an HMM to
build a statistical part-of-speech tagger (Charniak, Hendrickson, Jacobson
& Perkowitz, 1993; Merialdo, 1994), and of Liberman's idea of viewing

Characteristic A B C D E F G
DESPAR yes yes yes yes yes yes yes

Table 10.1: Linguistic characteristics which can be detected by
DESPAR. See Table 10.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 10.2: Letter codes used in Tables 10.1, 10.4.1 and 10.4.2.

dependency parsing as some kind of "tagging" (Liberman, 1993). If
dependency parsing were not unlike tagging, would it not be a good
idea to model it as some hidden Markov process? The only worry, of
course, was whether the Markov assumption really holds for parsing.
What about the long-distance dependency? How accurate could it be?
Does it require a large annotated corpus of the order of a few million
words? Nobody seemed to know the answers to these questions.
We then began to work out a few hidden Markov models for this pur-
pose. Through a series of empirical studies, we found that it was possible,
and practical, to view dependency parsing as some kind of tagging, pro-
vided one used an enhanced HMM. It is an HMM aided by a dynamic
context algorithm and the enforcement of dependency axioms. Based on
this enhanced model, we constructed DESPAR, a parser of dependency
structure. DESPAR takes as input a sentence of tokenized words, and
produces a most likely dependency structure (Mel'cuk, 1987). In addi-
tion, it also captures other possible dependency relationships among the
words. Should one decide to unravel the syntactic ambiguities, one can
return to the forest of parse trees and select the second most likely one
and so on.
An advantage of analysing dependency structure is probably its rel-
ative ease in extracting the predicate-argument structure from the parse
tree. The version of dependency structure that we use is motivated by
the need to enable non-professional linguists to participate in annotating
the corpus. The building of DESPAR involves the following:

Figure 10.1: An overview of the flow of processing in DESPAR.
Given a tokenized sentence, "He cleaned a chair at the canteen",
it is first tagged as "PP VB DT NN IN DT NN". These codes correspond
to the parts of speech of the words in the sentence. Based
on these parts of speech, the computer is to arrive at the dependency
structure of the sentence shown in the figure. The pronoun PP,
noun NN and the preposition IN are linked (directly) to the verb
VB as they are the subject, object and the head of the prepositional
phrase respectively. They are said to depend on the verb as
they provide information about the action "cleaned". The result
of parsing is a dependency parse tree of the sentence.

- obtain a large part-of-speech (POS) corpus,
- develop a statistical POS tagger,
- invent an unknown word module for the POS tagger,
- develop a computational version of dependency structure,

[Figure 10.2 traces the example "He cleaned a chair at the canteen ."
through part-of-speech tagging, nomination of candidate governors by
the dynamic context algorithm, pruning away of invalid candidates by
the axioms of dependency structure, and the hidden Markov process,
yielding the dependency tree rooted at "cleaned".]

Figure 10.2: An illustration of the enhanced HMM. After tagging,
the dynamic context algorithm searches for the likely governors
for each part of speech. The axioms of the dependency
structure are employed to prune away invalid candidates. Once
all the statistical parameters are estimated, we use the Viterbi
algorithm to pick up the dependency tree with the maximum
likelihood from the forest.

- build up a small corpus of dependency structures,
- invent a hidden Markov theory to model the statistical properties of
  the dependency structures,
- invent a dynamic contextual string matching algorithm for generating
  possible and most probable parses,
- incorporate syntactic constraints into the statistical engine,

[Figure 10.3 traces the example "For example , you can use an accelerated
search command ." through the stages: tokenization of the sentence,
part-of-speech tagging, noun phrase bracketing, dependency parsing of
the noun phrases, segmentation of the sentence, dependency parsing of
the segments, and synthesis of the parsed segments.]

Figure 10.3: DESPAR operating in the divide-and-conquer
mode. After the part-of-speech tagging, it performs noun phrase
parsing and submits each noun phrase for dependency parsing.
Then it disambiguates whether the comma is prosodic, logical
conjunctive or clausal conjunctive. The string of parts of speech
is segmented and each segment is parsed accordingly. The parsed
segments are then synthesized to yield the final analysis.

- develop a rule-based segmentation module and synthesis module to
  divide-and-conquer the parsing problem.

An overview of the flow of processing is illustrated in Figure 10.1.
The corpus-based, statistical approach to building the parser has
served us well. By now, the effectiveness of the corpus-based, statistical
approach is well documented in the literature. The attractive feature of
this approach is that it has a clear cut division between the language-
dependent elements and the inferencing operations. Corpora, being the
data or collections of examples used by people in the day-to-day usage
of the language, together with the design of the tags and annotation

Figure 10.4: The parse tree produced by DESPAR for sentence
L8 in Analysis I, `Scrolling changes the display but does not move
the insertion point .'. All the words were attached correctly.

symbols, are the linguistic inputs for building up the system. Statistical
tools such as the hidden Markov model (HMM), the Viterbi algorithm
(Forney, 1973) and so on, are the language-independent components. If
one has built a parser for English using English corpora, using the same
statistical tools, one can build another parser for French if some French
corpora are available. The main modules of our system are described
below.

Part-of-Speech Tagger
With an annotated Penn Treebank version of the Brown Corpus and
the Wall Street Journal (WSJ) Corpus (Marcus, Marcinkiewicz, & Santorini,
1993), we developed a statistical tagger based on a first-order (bigram)
HMM (Charniak, Hendrickson, Jacobson & Perkowitz, 1993). Our tagger
gets its statistics from the Brown Corpus' 52,527 sentences (1.16 million
words) and the WSJ Corpus' 126,660 sentences (2.64 million words).

Figure 10.5: The parse tree produced by DESPAR for sentence
T125 in Analysis I, `If the Workbench cannot find any fuzzy
match, it will display a corresponding message ( "No match " ) in
the lower right corner of its status bar and you will be presented
with an empty yellow target field.'. In this sentence, the tokenizer
made a mistake in not detaching " from No. The word "No was
an unknown word to the tagger and it was tagged as a proper noun
by the unknown word module. The tokens ( and empty were
attached wrongly by DESPAR. Though these were minor mistakes,
they were counted as errors because they did not match their
respective counterparts in the annotated corpus.

Unknown Word Module


To make the tagger and parser robust against unknown words, we de-
signed an unknown word module based on the statistical distribution of
rare words in the training corpus. During run-time, the dynamic context
algorithm estimates the conditional probability of the POS given that
the unknown word occurs in the context of a string of POSs of known
words. Then we apply the Viterbi algorithm again to disambiguate the

Figure 10.6: The parse tree produced by DESPAR for sentence
L113 in Analysis I, `To move or copy text between documents .'
Here, the tagger tagged `copy' wrongly as a noun, which was fatal
to the noun phrase parser and the dependency parser.

POS of the unknown word. With the unknown word module installed,
our parser effectively has an unlimited vocabulary.

Computational Dependency Structure


With the aim of maintaining consistency in annotating the corpus, we
standardized a set of conventions for the annotators to follow. For in-
stance, we retain the surface form of the verb and make other words
which provide the tense information of the verb depend on it.2 The con-
ventions for the dependency relationships of punctuation marks, delimiters,
dates, names etc. are also spelled out.

Dependency Parser
We manually annotated a small corpus (2,000 sentences) of dependency
structures, and used it to estimate the statistical parameters of a
first-order enhanced HMM for parsing. The key idea is to view parsing as
2 Examples of these are the modals "will", "can" etc., the copula, the
infinitive indicator "to" and so on.

Figure 10.7: The parse tree produced by DESPAR for sentence
D84 in Analysis I, `For example , you can use an accelerated search
command to perform an author authority search or a title keyword
search .'. The highlight of this example is the attachment of `perform'.
In our version of computational dependency structure, this
is a fully correct parse. This is so because we can re-order the sentence
as `For example , to perform an author authority search or a
title keyword search you can use an accelerated search command .'

if one is tagging the dependency "codes" for each word in the sentence.
The dependency structure of a sentence can be represented or coded
by two equivalent schemes (see the appendix). These dependency codes
now become the states of the HMM. To reduce the perplexity, we also
use the dynamic context algorithm that estimates the conditional prob-
abilities of the dependency codes given the contexts of POSs during run
time. And to enhance the performance of the system, we also make
use of the axioms of dependency structures to throw out invalid can-
didate governors and to constrain possible dependency parse trees. It
is worthwhile to remark that the dynamic context algorithm and the

language-independent axioms are critical in the dependency "tagging"


approach to parsing. The HMM aided by the dynamic context algo-
rithm and the axioms, called the enhanced HMM (eHMM), is the novel
statistical inference technique of DESPAR.
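To make the "parsing as tagging" idea concrete, here is a minimal first-order Viterbi sketch in which each word is assigned one of a set of candidate dependency codes (states). The transition and emission scores, and the pruning of candidates performed in DESPAR by the dynamic context algorithm and the dependency axioms, are abstracted behind placeholder functions; the candidate code strings in the toy call are illustrative stand-ins for the (g, o) codes described in the appendix. This illustrates only the inference step, not DESPAR's actual implementation.

import math

# Minimal first-order Viterbi over candidate dependency codes.
# candidates[i] is the (already pruned) list of possible codes for word i;
# trans(p, c) and emit(c, w) stand in for the statistical parameters
# estimated from the annotated corpus and are assumed to be positive.

def viterbi(words, candidates, trans, emit):
    n = len(words)
    best = [{} for _ in range(n)]      # best[i][code] = (log score, back-pointer)
    for code in candidates[0]:
        best[0][code] = (math.log(emit(code, words[0])), None)
    for i in range(1, n):
        for code in candidates[i]:
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans(p, code))
                 + math.log(emit(code, words[i])), p)
                for p in candidates[i - 1])
            best[i][code] = (score, prev)
    # Trace back the highest-scoring sequence of codes.
    code = max(best[-1], key=lambda c: best[-1][c][0])
    path = [code]
    for i in range(n - 1, 0, -1):
        code = best[i][code][1]
        path.append(code)
    return list(reversed(path))

# Toy usage with uniform scores (illustration only).
words = ["He", "cleaned", "a", "chair"]
cands = [["(VB,+1)"], ["(ROOT,0)"], ["(NN,+1)"], ["(VB,-2)"]]
print(viterbi(words, cands, trans=lambda p, c: 1.0, emit=lambda c, w: 1.0))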
Noun Phrase Parser
Using the eHMM, we have also succeeded in building an (atomic) noun-
phrase parser (Ting, 1995b). A key point in our method is a representa-
tion scheme of noun phrases which enables the problem of noun phrase
parsing to be formulated also as a statistical tagging process by eHMM.
The noun phrase parser requires only 2,000 sentences for estimating the statisti-
cal parameters; no grammar rules or pattern templates are needed. Our
experimental results show that it achieves 96.3% on the WSJ Corpus.

Divide-and-Conquer Module
The divide-and-conquer module is designed to enhance the effectiveness
of the parser by simplifying complex sentences before parsing. It par-
titions complex sentences into simple segments, and each segment is
parsed separately. The rule-based segmentation module decides where
to segment based on the outcome of a disambiguation process (Peh &
Ting, 1995). The noun phrase bracketing provided by the noun phrase
parser is also used in this module. Finally, a rule-based synthesizer glues
together all the segments' parse trees to yield the overall parse of the
original complex sentence.
The working mode of the parser is illustrated in Figure 10.2 and
Figure 10.3.
All the program code of the parser system was written in-house in
Unix C. Currently, the parser system runs on an SGI machine under
IRIX 5.3 with the following configuration:
- 1 150 MHz IP19 Processor
- CPU: MIPS R4400 Processor Chip Revision: 5.0
- FPU: MIPS R4010 Floating Point Chip Revision: 0.0
- Data cache size: 16 Kbytes
- Instruction cache size: 16 Kbytes
- Secondary unified instruction/data cache size: 1 Mbyte
- Main memory size: 256 Mbytes, 1-way interleaved

10.3 Parser Evaluation Criteria


Since dependency parsing in our approach is about assigning a depen-
dency code to each word, we can evaluate the accuracy of the parser's

outputs in the same way as we evaluate the performance of the tagger.


We evaluate the performance of DESPAR at the word level and at the
sentence level, defined as follows:
- word level: a word is said to be tagged correctly if the answer
  given by the system matches exactly that of the annotated corpus.
- sentence level: a sentence is said to be recognized correctly if
  the parse tree given by the system matches exactly that of the
  annotated corpus.
The word level is a somewhat lenient evaluation criterion. On the
other extreme, sentence level is very stringent. It favours short sentences
and discriminates against long sentences. If there is just one tag of a
word in the sentence that does not match exactly that of the annotated
corpus, the whole sentence is deemed to be analyzed wrongly by the
system. It may be an over-stringent criterion, because a sentence may
have more than one acceptable parse. Scoring a low value at the sentence
level is no indication that the parser is useless.
The accuracy of the noun phrase parser is evaluated according to the
exact match of the beginning and the ending of noun phrase brackets
with those in the annotated corpus. For example, if the computer returns
[w1 w2] [w3 w4] w5 w6 [w7 w8] and the sentence in the annotated corpus
is [w1 w2 w3 w4] w5 w6 [ w7 w8 ], then there are two wrong noun phrases
and one correct one.
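As an illustration, the following sketch computes the three measures under the simplifying assumption that a parse is one dependency code per word and a noun phrase is a (start, end) pair of word indices; the data formats and function names are ours, not DESPAR's.

# Illustrative evaluation measures over hypothetical data formats.

def word_level_accuracy(system_codes, gold_codes):
    # One code (or tag) per word; exact match required per word.
    correct = sum(s == g for s, g in zip(system_codes, gold_codes))
    return correct / len(gold_codes)

def sentence_level_accuracy(system_sents, gold_sents):
    # A sentence counts only if every code in it matches the annotation.
    exact = sum(s == g for s, g in zip(system_sents, gold_sents))
    return exact / len(gold_sents)

def np_bracket_score(system_nps, gold_nps):
    # Noun phrases as (start, end) index pairs; exact begin/end match.
    system_nps, gold_nps = set(system_nps), set(gold_nps)
    return len(system_nps & gold_nps), len(system_nps - gold_nps)

# The bracketing example from the text:
# system [w1 w2] [w3 w4] w5 w6 [w7 w8] versus annotated [w1 w2 w3 w4] w5 w6 [w7 w8]
print(np_bracket_score([(1, 2), (3, 4), (7, 8)], [(1, 4), (7, 8)]))  # (1, 2): one correct, two wrong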
These measures account for the consistency between the system's
outputs and the human annotation. It could happen that the tag of a
particular word was annotated wrongly. As a result, even though the
system produces the correct result, it is counted as wrong, because it
does not match the tag in the corpus. We estimate the corpora to be
contaminated with 3 to 6% "noise". These measures therefore give a
lower bound on how well a system can perform in terms of producing the
really correct outputs.

10.4 Analysis I: Original Grammar, Original Vocabulary
After having received the 600 sentences in 3 files from the organizers
of the IPSM'95 Workshop, we tokenized the sentences, that is, we used a
computer program to detach punctuation marks, quotation marks,
parentheses and so on from the words, and the isolated tokens were
retained in the sentences. This was the only pre-processing we did.
Then, we annotated the 600 sentences in a bootstrapping manner.
The resulting IPSM'95 Corpus (see Appendix A for a sample) becomes

Number Accept Reject % Accept % Reject


Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 10.2.1: Phase I acceptance and rejection rates for
DESPAR.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 209 10.5 N.A.
Lotus 185 09.3 N.A.
Trados 224 11.2 N.A.
Total 618 10.3 N.A.

Table 10.3.1: Phase I parse times for DESPAR. The first
column gives the total time (seconds) to parse 20 sentences in
each file. The last column is not applicable (N.A.) to DESPAR.
Char. A B C D E F G Avg.
Dynix 098% 97% 97% 98% 84% 85% 080% 91%
Lotus 096% 96% 90% 92% 83% 61% 067% 84%
Trados 100% 98% 94% 99% 86% 70% 100% 92%
Average 098% 97% 94% 96% 84% 72% 082% 89%

Table 10.4.1: Phase I Analysis of the ability of DESPAR to
recognise certain linguistic characteristics in an utterance. For
example the column marked `A' gives for each set of utterances the
percentage of verbs occurring in them which could be recognised.
The full set of codes is itemised in Table 10.2.

the standard for checking against the outputs of our tagger, noun-phrase
parser and dependency structure parser.3 Though we spared no effort
to make sure that the Corpus was free of errors, we estimate the IPSM'95
Corpus to contain 2 to 4% noise.
For Analysis I, the POS tagger was trained on the Penn Treebank's
Brown Corpus and the Wall Street Journal Corpus, while the noun
phrase parser and the dependency structure parser were trained on a
small subset of it.
3 These, together with the unknown word module and the divide-and-conquer
module, form a total system called DESPAR.

Dynix Error Total Accuracy


POS (word level) 10 343 97.1 %
Dependency (word level) 44 343 87.2 %
IN (word level) 04 027 85.2 %
CC (word level) 02 010 80.0 %
POS (sentence level) 09 020 55.0 %
Dependency (sentence level) 15 020 25.0 %
Lotus Error Total Accuracy
POS (word level) 13 289 95.5 %
Dependency (word level) 51 289 82.4 %
IN (word level) 10 026 61.5 %
CC (word level) 04 012 66.7 %
POS (sentence level) 10 020 50.0 %
Dependency (sentence level) 14 020 30.0 %
Trados Error Total Accuracy
POS (word level) 11 389 097.2 %
Dependency (word level) 56 389 085.6 %
IN (word level) 14 047 070.2 %
CC (word level) 00 004 100.0 %
POS (sentence level) 09 020 055.0 %
Dependency (sentence level) 16 020 020.0 %

Table 10.5.1: A detailed breakdown of the performance of
DESPAR at the word level and the sentence level for Analysis I.

For each test sentence, DESPAR will always select one parse tree out
of the forest generated by the dynamic context algorithm. The selection
of one parse tree is carried out by the Viterbi algorithm in the enhanced
HMM framework. In this sense, all the sentences can be recognized
(i.e. parsed), although not exactly as those in the annotated corpus.
DESPAR is absolutely robust; it produces parses for sentences which
contain grammatical errors, even random strings of words.4
Figure 10.4 shows a parse tree of the test sentence L8 which was
analysed correctly at the sentence level by DESPAR, although some of
the words were tagged wrongly by the part-of-speech tagger.
Figure 10.5 shows that DESPAR is tolerant of some errors in the
tokenization. The current version of DESPAR uses only the parts of
4 We designed DESPAR not with the intention of using it as a grammar checker.
Rather, we wanted DESPAR to be very robust. Currently we use it for a machine
translation project, with other natural language applications in the pipeline.

Number Accept Reject % Accept % Reject


Dynix 20 20 0 100 0
Lotus 20 20 0 100 0
Trados 20 20 0 100 0
Total 60 60 0 100 0

Table 10.2.2: Phase II acceptance and rejection rates for
DESPAR.
Total Time Average Time Average Time
to Parse (s) to Accept (s) to Reject (s)
Dynix 208 10.4 N.A.
Lotus 185 09.3 N.A.
Trados 225 11.3 N.A.
Total 618 10.3 N.A.

Table 10.3.2: Phase II parse times for DESPAR. The first
column gives the total time (seconds) to parse 20 sentences in each
file. The last column is not applicable (N.A.) to DESPAR.
Char. A B C D E F G Avg.
Dynix 098% 98% 97% 98% 86% 86% 080% 92%
Lotus 096% 97% 95% 94% 85% 62% 067% 85%
Trados 100% 99% 94% 99% 86% 70% 100% 93%
Average 098% 98% 95% 97% 86% 73% 083% 90%

Table 10.4.2: Phase II Analysis of the ability of DESPAR to
recognise certain linguistic characteristics in an utterance. For
example the column marked `A' gives for each set of utterances the
percentage of verbs occurring in them which could be recognised.
The full set of codes is itemised in Table 10.2.

speech to perform dependency parsing. If the tagger tags wrongly, the
dependency parser is still able to parse correctly in cases where a noun
is mistaken for a proper noun, a noun for an adjective, and so on. However,
if a noun is tagged wrongly as a verb or vice versa, the analysis by
DESPAR is usually unacceptable, as in Figure 10.6.
Figure 10.7 gives a flavour of the computational version of dependency
structure we adopt.
A detailed summary of the performance of DESPAR at the word level
and the sentence level is in Table 10.5.1.

Dynix Error Total Accuracy


POS (word level) 09 343 97.4 %
Dependency (word level) 40 343 88.3 %
IN (word level) 04 028 85.7 %
CC (word level) 02 010 80.0 %
POS (sentence level) 08 020 60.0 %
Dependency (sentence level) 15 020 25.0 %
Lotus Error Total Accuracy
POS (word level) 09 289 96.9 %
Dependency (word level) 51 289 82.4 %
IN (word level) 10 026 61.5 %
CC (word level) 04 012 66.7 %
POS (sentence level) 07 020 65.0 %
Dependency (sentence level) 14 020 30.0 %
Trados Error Total Accuracy
POS (word level) 09 389 097.7 %
Dependency (word level) 55 389 085.9 %
IN (word level) 14 047 070.2 %
CC (word level) 00 004 100.0 %
POS (sentence level) 08 020 060.0 %
Dependency (sentence level) 16 020 020.0 %

Table 10.5.2: A detailed breakdown of the performance of
DESPAR at the word level and the sentence level for Analysis II.

10.5 Analysis II: Original Grammar, Additional Vocabulary
For Analysis II, we re-trained our POS tagger. The training corpora
were the Brown Corpus, the WSJ Corpus, and the IPSM'95 Corpus itself.
That was all we did to incorporate "additional vocabulary". We did not
re-train either the noun phrase parser or the dependency parser. We also
did not use the lists of technical terms distributed by the organizers. The
only difference from Analysis I is that now all the words are known to
the tagger.
We ran the tagger and the parsers on the 3 files again as we did for
Analysis I. The results are tabulated below.
Since most of the mistakes which the tagger made were not fatal,
in the sense that there were only isolated instances where a verb was
mistaken for a noun or vice versa, the performance of DESPAR in

Analysis II was not significantly different from that in Analysis I. While
the tagger registered a hefty 20% error reduction, the dependency
parser improved by only a 3% error reduction. These figures show that
one need not feed DESPAR with additional vocabulary for it to perform
reasonably well. The unknown word module, though making mistakes
in tagging unknown nouns as proper nouns and so on, is sufficient for
the approach we take in tackling the parsing problem.
A detailed summary of the performance of DESPAR at the word level
and the sentence level is in Table 10.5.2.

10.6 Analysis III: Altered Grammar, Additional Vocabulary
Analysis III was not carried out on DESPAR.

10.7 Converting Parse Tree to Dependency Notation
The issue of conversion to dependency notation was not addressed for
the DESPAR system.

10.8 Summary of Findings


The results show that one can analyse the dependency structure of a
sentence without using any grammar formalism; the problem of depen-
dency parsing can be formulated as a process of tagging the dependency
codes. When tested on the IPSM'95 Corpus, DESPAR is able to produce a
parse for each sentence with an accuracy of 85% at the word level.
Its performance can be improved by having a collocation module to
pre-process the sentence before submitting it to DESPAR for analysis.
To attain higher accuracy, it is also desirable to have some module that
can process dates, times, addresses, names etc.
As no formal grammar formalism is used, it is relatively easy to
maintain the parser system by simply providing it with more corpora.
Our current corpus for training the eHMM has only 2,000 sentences. If
a dependency-code corpus of the order of millions of sentences in size
is available, it will be interesting to see how far the enhanced HMM,
namely, HMM + dynamic context + dependency structure axioms can
go.
Another dimension for improvement is to further develop the statistical
inference engine, the eHMM. The current eHMM is based on first-order

(i.e., bigram) state transition. We expect the system to do better if we
use second-order (trigram) transition and other adaptive models to tune
the statistical parameters.
In conclusion, we remark that deviation from the established Chom-
sky mode of thinking is both fruitful and useful in opening up a new
avenue for creating a practical parser. We also show that it is feasible
to model dependency structure parsing with a hidden Markov model
supported by a dynamic context algorithm and the incorporation of de-
pendency axioms.

10.9 References
Black, E. (1993). Parsing English By Computer: The State Of The
Art (Internal report). Kansai Science City, Japan: ATR Interpreting
Telecommunications Research Laboratories.
Carroll, G., & Charniak, E. (1992). Two Experiments On Learning
Probabilistic Dependency Grammars From Corpora (TR CS-92-16).
Providence, RI: Brown University, Department of Computer Science.
Charniak, E., Hendrickson C., Jacobson, N., & Perkowitz M. (1993).
Equations for Part-of-Speech Tagging. Proceedings of AAAI'93, 784-
789.
Forney, D. (1973). The Viterbi Algorithm. Proceedings of the IEEE, 61,
268-278.
Liberman, M. (1993). How Hard Is Syntax. Talk given in Taiwan.
Magerman, D. (1994). Natural Language Parsing As Statistical Pattern
Recognition. Ph.D. Thesis, Stanford University.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building
a Large Annotated Corpus of English: The Penn Treebank. Compu-
tational Linguistics, 19, 313-330.
Mel'cuk, I. A. (1987). Dependency Syntax: Theory and Practice. Stony
Brook, NY: State University of New York Press.
Merialdo, B. (1994). Tagging English Text With A Probabilistic Model.
Computational Linguistics, 20, 155-171.
Peh, L. S., & Ting, C. (1995) Disambiguation of the Roles of Commas
and Conjunctions in Natural Language Processing (Proceedings of the
NUS Inter-Faculty Seminar). Singapore, Singapore: National Univer-
sity of Singapore.
Ting, C. (1995a). Hybrid Approach to Natural Language Processing
(Technical Report). Singapore, Singapore: DSO.
Ting, C. (1995b). Parsing Noun Phrases with Enhanced HMM (Pro-
ceedings of the NUS Inter-Faculty Seminar). Singapore, Singapore:
National University of Singapore.

Appendix A: Samples of dependency structures in the IPSM'95 Corpus
#D77
IN If 1 --> 10 -
PP you 2 --> 3 [ SUB
VBP need 3 --> 1 ]
DT the 4 --> 5 [
NP BIB 5 --> 3 +
NN information 6 --> 3 + OBJ
, , 7 --> 6 ]
PP you 8 --> 10 [ SUB
MD can 9 --> 10 ]
VB create 10 --> 22 -
DT a 11 --> 14 [
VBN saved 12 --> 14 +
NN BIB 13 --> 14 +
NN list 14 --> 10 + OBJ
CC and 15 --> 10 ]
VB print 16 --> 15 -
PP it 17 --> 16 [
RP out 18 --> 16 ]
WRB when 19 --> 16 -
PP you 20 --> 21 [ SUB
VBP like 21 --> 19 ]
. . 22 --> 0 -

#L30
NP Ami 1 --> 2 [
NP Pro 2 --> 3 + SUB
VBZ provides 3 --> 10 ]
JJ additional 4 --> 6 [
NN mouse 5 --> 6 +
NNS shortcuts 6 --> 3 + OBJ
IN for 7 --> 6 ]
VBG selecting 8 --> 7 -
NN text 9 --> 8 [ OBJ
. . 10 --> 0 ]

#T18
DT The 1 --> 5 [
" " 2 --> 3 +
NN View 3 --> 5 +
" " 4 --> 3 +
NN tab 5 --> 11 +
IN in 6 --> 5 ]
NP Word 7 --> 8 [
POS 's 8 --> 9 +
NNS Options 9 --> 10 +
VBP dialog 10 --> 6 ]
. . 11 --> 0 -
The sentences are taken one each from the 3 files, Dynix, Lotus and
Trados respectively. The first field in each line is the part of speech
(see Marcus, Marcinkiewicz & Santorini, 1993, for an explanation of the
notation symbols), the second field is the word, the third is the serial
number of the word, and the fourth is an arrow denoting the attachment of
the word to the fifth field, which is the serial number of its governor.
For example, in the first sentence, word number 1, "If", is attached
to word number 10, "create". It is apparent that there is a one-to-one
mapping from this scheme of representing the dependency structure to
the dependency parse tree. As a convention, the end-of-sentence
punctuation is attached to word number 0, which means that it does not
depend on anything in the sentence. This scheme of coding the dependency
structure of the sentence makes it easy for a human to annotate
or verify; one just needs to edit the serial number of the governor of each
word.
The sixth field of each line is the atomic noun phrase symbol associated
with each location in the sentence. There are 5 noun phrase symbols,
defined as follows:
[: start of noun phrase.
]: end of noun phrase.
": end and start of two adjacent noun phrases.
+: inside the noun phrase.
-: outside the noun phrase.
A conventional and perhaps more intuitive way of viewing these symbols
is to write the sentence horizontally, shift each symbol by half a
word to the left and omit the + and -. For example: [ Ami Pro ] provides
[ additional mouse shortcuts ] for selecting [ text ] .
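The conversion just described can be written down directly. The sketch below is our illustration: each symbol is read as standing immediately before its word, '+' and '-' are dropped, and '"' closes one phrase and opens the next; it reproduces the #L30 example above. (A phrase still open at the very end of a sentence would need a trailing ']', a case not handled here.)

# Convert per-word noun phrase symbols into horizontal bracketing.

def bracket(words, symbols):
    out = []
    for word, sym in zip(words, symbols):
        if sym == "[":
            out.append("[")
        elif sym == "]":
            out.append("]")
        elif sym == '"':
            out.extend(["]", "["])
        out.append(word)
    return " ".join(out)

words   = ["Ami", "Pro", "provides", "additional", "mouse", "shortcuts",
           "for", "selecting", "text", "."]
symbols = ["[", "+", "]", "[", "+", "+", "]", "-", "[", "]"]
print(bracket(words, symbols))
# [ Ami Pro ] provides [ additional mouse shortcuts ] for selecting [ text ] .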
The seventh field of each line is the argument of the predicate to
which it is attached. We use SUB to indicate the subject, OBJ the object,
S_SUB the surface subject, L_SUB the logical subject, and I_OBJ the
indirect object.
The dependency structure coded in this manner is equivalent to a
dependency tree. Another way of representing the same structure is via
the tuple (g, o), where g is the POS of the governor and o is the
relative offset of the governor. In other words, instead of using the
serial number, the governor of each word is represented as (g, o). We
use (g, o) as the state of the eHMM when parsing the dependency
structure.
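The equivalence of the two coding schemes can be made concrete with a few lines of code. The sketch below, our illustration only, converts the serial-number representation of sentence #L30 into (g, o) pairs, taking o to be the governor's position minus the dependent's; word 0 (the end-of-sentence attachment) is given a dummy code.

# Convert governor serial numbers into (g, o) codes: g is the POS of the
# governor, o its offset relative to the dependent (an assumption about
# how the offset is measured).

def to_codes(tokens):
    # tokens: list of (pos, word, serial, governor_serial), serials 1-based
    pos_by_serial = {serial: pos for pos, _, serial, _ in tokens}
    codes = []
    for _pos, word, serial, gov in tokens:
        if gov == 0:
            codes.append((word, ("ROOT", 0)))     # attached to word 0
        else:
            codes.append((word, (pos_by_serial[gov], gov - serial)))
    return codes

# Sentence #L30 from the samples above.
tokens = [
    ("NP", "Ami", 1, 2), ("NP", "Pro", 2, 3), ("VBZ", "provides", 3, 10),
    ("JJ", "additional", 4, 6), ("NN", "mouse", 5, 6),
    ("NNS", "shortcuts", 6, 3), ("IN", "for", 7, 6),
    ("VBG", "selecting", 8, 7), ("NN", "text", 9, 8), (".", ".", 10, 0),
]
for word, code in to_codes(tokens):
    print(word, code)   # e.g. Ami ('NP', 1): governed by the NP one word to its right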


11
Using the TOSCA Analysis System to Analyse a Software
Manual Corpus
Nelleke Oostdijk1
University of Nijmegen

11.1 Introduction
The TOSCA analysis system was developed by the TOSCA Research
Group for the linguistic annotation of corpora.2 It was designed to fa-
cilitate the creation of databases that could be explored for the purpose
of studying English grammar and lexis. Such databases can serve a dual
use. On the one hand they can provide material for descriptive studies,
on the other they can serve as a testing ground for linguistic hypotheses.
In either case, the annotation should meet the standard of the state of
the art in the study of English grammar, and should therefore exhibit
the same level of detail and sophistication. Also, the descriptive notions
and terminology should be in line with state-of-the-art descriptions of
the English language.
In the TOSCA approach, the linguistic annotation of corpora is
viewed as a two-stage process. The first is the tagging stage, in which
each word is assigned to a word class, while additional semantico-syntactic
information may be provided in the form of added features as
appropriate. The parsing stage is concerned with the structural analysis
of the tagged utterances. Analysis is carried out by means of a
grammar-based parser.
1 Address: Nelleke Oostdijk. Dept. of Language and Speech, University of Ni-
jmegen. P.O. Box 9103, 6500 HD Nijmegen, The Netherlands. Tel: +31 24 3612765,
Fax: +31-24-3615939, Email: oostdijk@let.kun.nl.
2 The text in the introductory sections in this paper, describing the TOSCA analy-
sis system, is an adapted version of Aarts, van Halteren and Oostdijk (1996). Thanks
are due to Hans van Halteren for his help in preparing the final version of this paper.

The analysis system is an ambitious one in two respects. Not only


should the analysis results conform to the current standards in descrip-
tive linguistics, but it is required that for each corpus sentence the
database should contain only the one analysis that is contextually ap-
propriate. It will be clear that whatever contextual (i.e. semantic, prag-
matic and extra-linguistic) knowledge is called upon, input from a hu-
man analyst is needed. This is the major reason why the analysis system
calls for interaction between computer and human analyst. However, for
reasons of consistency, human input should be minimized. Since con-
sistency is mainly endangered if the human analyst takes the initiative
in the analysis process, it is better to have the linguist only react to
prompts given by automatic processes, by asking him to choose from a
number of possibilities presented by the machine. This happens in two
places in the TOSCA system: once after the tagging stage (tag selection)
and once after parsing (parse selection).
Finally, it should be pointed out that one requirement that often
plays a role in discussions of automatic analysis systems has not been
mentioned - the robustness of the system. In our view, in analysis sys-
tems aiming at the advancement of research in descriptive linguistics the
principle of robustness should play a secondary role. A robust system
will try to come up with some sort of analysis even for (often marginally
linguistic) structures that cannot be foreseen. Precisely in such cases we
think that it should not be left to the system to come up with an answer,
but that control should be passed to the linguist.
So far the TOSCA analysis system has been successfully applied in
the analysis of (mostly written, British English) corpus material that
originated from a range of varieties, including fiction as well as
non-fiction. Through a cyclic process of testing and revising, the formal
grammar underlying the parser, which was initially conceived on the
basis of knowledge from the literature as well as intuitive knowledge
about the language, has developed into a comprehensive grammar of
the English language. It is our contention that, generally speaking, for
those syntactic structures that belong to the core of the language, the
grammar has reached a point where there is little room for improvement.3
However, as novel domains are being explored, structures are encoun-
tered that so far were relatively underrepresented in our material. It
is especially with these structures that the grammar is found to show
omissions. Since the system has not before been applied to the domain
of computer manuals, and instructive types of text in general have been
underrepresented in our test materials, the experiment reported on in
this paper can be considered a highly informative exercise.
3 Remaining lacunae in the grammar often concern linguistically more marginal
structures. While their description is not always unproblematic, quite frequently it
is also unclear whether the grammar should describe them at all (see Aarts, 1991).

Characteristic A B C D E F G
TOSCA yes yes yes yes yes yes yes

Table 11.1: Linguistic characteristics which can be detected by
TOSCA. See Table 11.2 for an explanation of the letter codes.
Code Explanation
A Verbs recognised
B Nouns recognised
C Compounds recognised
D Phrase Boundaries recognised
E Predicate-Argument Relations identified
F Prepositional Phrases attached
G Coordination/Gapping analysed

Table 11.2: Letter codes used in Tables 11.1 and 11.5.1.

11.2 Description of Parsing System


In the next section we first give an outline of the TOSCA analysis
environment, after which two of its major components, viz. the tagger
and the parser, are described in more detail.

11.2.1 The TOSCA Analysis Environment


As was observed above, the TOSCA system provides the linguist with an
integrated analysis environment for the linguistic enrichment of corpora.
In principle the system is designed to process a single utterance from
beginning to end. The successive stages that are distinguished in the
annotation process are: raw text editing, tokenization and automatic
tagging, tag selection, automatic parsing, selection of the contextually
appropriate analysis, and inspection of the final analysis. For each of
these steps the linguist is provided with menu options, while there are
further options for writing comments and for moving from one utterance
to the next. However, it is not necessary to use the environment interface
for all the steps in the process. Since the intermediate results have a
well-defined format, it is possible to use other software for specific steps.
A much-used time-saver is a support program which starts with a complete
sample in raw text form, tags it and splits it into individual utterances,
so that the linguist can start at the tag selection stage.
During the tagging stage each word in the text is assigned a tag
indicating its word class and possibly a number of morphological and/or

semantico-syntactic features. The form of the tags is conformant with


the constituent labels in the syntactic analysis and is a balance between
succinctness and readability. The set of word classes and features is
a compromise between what is needed by the parser and what was felt
could be easily handled by linguists.
After the automatic tagger has done its work, the linguist can use
the tag selection program to make the final selection of tags. He is
presented with a two-column list, showing the words and the proposed
tags. A mouse click on an inappropriate tag calls up a list of alternative
tags for that word from which the correct tag can then be selected. In the
case of ditto tags, selection of one part of the tag automatically selects
the other part(s). If the contextually appropriate tag is not among the
list of alternatives, it is possible to select (through a sequence of menus)
any tag from the tag set. Since this is most likely to occur for unknown
words, a special Top Ten list of the most frequent tags for such unknown
words is also provided. The tag selection program can also be used to
add, change or remove words in the utterance in case errors in the raw
text are still discovered at this late stage. Finally, the program allows
the insertion of syntactic markers (see below) into the tagged utterance.
During the automatic parsing stage, all possible parses for the utter-
ance, given the selected tags and markers, are determined. The output
of this stage, then, is a set of parse trees. Before they are presented to
the linguist for selection, these parse trees are transformed into analysis
trees which contain only linguistically relevant information. The human
linguist then has to check whether the contextually appropriate analysis
is present and mark it for selection. Since it is not easy to spot the
difference between two trees and because the number of analyses is sometimes
very large, it is impractical to present the trees one by one to the
linguist. Instead, the set of analysis trees is collapsed into a shared forest.
The differences between the individual trees are represented as local
differences at specific nodes in the shared forest. This means that selection
of a single analysis is reduced to a small number of (local) decisions.
For this selection process a customized version of the Linguistic Data
Base program (see below) is used. The tree viewer of this program shows
a single analysis, in which the selection points are indicated with tilde
characters, as shown in Figure 11.1. The linguist can focus on these
selection points and receives detailed information on the exact
differences between the choices at these points, concerning function, category
and attributes of the constituent itself and the functions, categories and
boundaries of its immediate constituents. In most cases this information
is sufficient for making the correct choice. Choices at selection points
are never final; it is always possible to return to a selection point and
change one's choice. Only when the current set of choices is pronounced
to represent the contextually appropriate parse are the choices fixed.
[Figure 11.1 shows an analysis selection screen for the utterance "He
was worried about his father.": the selection point is marked with tilde
characters at "worried", and the two candidate analyses listed below the
tree differ in whether "about his father" is included in the subject
complement or analysed as a separate adverbial.]

Figure 11.1: An analysis selection screen

The resulting analysis is stored in a binary format for use with the stan-
dard version of the Linguistic DataBase system (cf. van Halteren & van
den Heuvel, 1990) and in an ASCII/SGML format for interchange with
other database systems.
If an utterance cannot be parsed, either because it contains construc-
tions which are not covered by the grammar or because its complexity
causes time or space problems for the parser, a tree editor can be used to
manually construct a parse. Restrictions within the tree editor and sub-
sequent checks ensure that the hand-crafted tree adheres to the grammar
as much as possible.

11.2.2 The Tagger


The first step of the tagging process is tokenization, i.e. the identifica-
tion of the individual words in the text and the separation of the text
into utterances. The tokenizer is largely rule-based, using knowledge
about English punctuation and capitalization conventions. In addition,
statistics about e.g. abbreviations and sentence initial words are used to
help in the interpretation of tokenization ambiguities. Where present,
a set of markup symbols is recognized by the tokenizer and, if possible,
used to facilitate tagging (e.g. the text unit separator <#>).
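To make the flavour of such a rule-based tokenizer concrete, here is a
small, self-contained Python sketch; the abbreviation list, the regular
expression and the splitting heuristics are our own simplifications and
are not the TOSCA tokenizer.

import re

# Illustrative sketch only: a rule-based utterance splitter in the spirit
# of the tokenizer described above.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "cf.", "viz.", "Dr.", "Mr."}

def split_utterances(text):
    tokens = text.split()
    utterances, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        ends_sentence = (
            re.search(r'[.!?]["\')]?$', tok)        # sentence-final punctuation
            and tok not in ABBREVIATIONS            # ...but not an abbreviation
            and (i + 1 == len(tokens)               # end of text, or
                 or tokens[i + 1][:1].isupper())    # next word is capitalized
        )
        if ends_sentence:
            utterances.append(" ".join(current))
            current = []
    if current:
        utterances.append(" ".join(current))
    return utterances

print(split_utterances("Select the text. Press SHIFT+INS, e.g. to paste it."))
# ['Select the text.', 'Press SHIFT+INS, e.g. to paste it.']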

Next, each word is examined by itself and a list of all possible tags for
that word is determined. For most words, this is done by looking them
up in a lexicon. In our system we do not have a distinct morphologi-
cal analysis. Instead we use a lexicon in which all possible word forms
are listed. The wordform lexicon has been compiled using such diverse
resources as tagged corpora, machine readable dictionaries and exper-
tise gained in the course of years. It currently contains about 160,000
wordform-tag pairs, covering about 90,000 wordforms.
Even with a lexicon of this size, though, there is invariably a non-
negligible number of words in the text which are not covered. Rather
than allowing every possible tag for such words (and shifting the problem
to subsequent components), we attempt to provide a more restricted list
of tags. This shortlist is based on specific properties of the word, such
as type of first character (upper case, lower case, number, etc.) and the
final characters of the word. For example, an uncapitalized word ending
in -ly can be assumed to be a general adverb. The statistics on such
property-tag combinations are based on suffix morphology and on the
tagging of hapax legomena in corpus material.
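The sketch below illustrates the combination of lexicon lookup with a
property-based shortlist for unknown words; the lexicon entries, tag
names and suffix rules are invented for the example and do not reproduce
the TOSCA wordform lexicon or tagset.

# Illustrative sketch: lexicon lookup with a property-based fallback for
# unknown words.  Entries and rules are invented for the example.
LEXICON = {
    "select": ["V(montr,pres)", "V(montr,infin)"],
    "text":   ["N(com,sing)"],
    "the":    ["ART(def)"],
}

def candidate_tags(word):
    entry = LEXICON.get(word.lower())
    if entry:
        return entry
    # Unknown word: build a shortlist from properties of the word form.
    if word[:1].isupper():
        return ["N(prop,sing)"]                  # capitalized -> proper noun
    if word[:1].isdigit():
        return ["NUM(card)"]                     # starts with a digit
    if word.endswith("ly"):
        return ["ADV(ge)"]                       # uncapitalized word in -ly
    if word.endswith("ing"):
        return ["V(montr,ingp)", "N(com,sing)"]  # -ing forms are ambiguous
    return ["N(com,sing)", "V(montr,pres)"]      # most frequent tags otherwise

for w in ["Select", "quickly", "scrolling", "Clipboard"]:
    print(w, candidate_tags(w))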
The last step in the tagging process is the determination of the con-
textually most likely tag. An initial ordering of the tags for each word
is given by the probability of the word-tag pair, derived from its fre-
quency in tagged corpora. This ordering is then adjusted by examining
the direct context, the possible tags of the two preceding and the two
following words. The final statistical step is a Markov-like calculation
of the `best path' through the tags, i.e. the sequence of tags for which
the compound probability of tag transitions is the highest. The latter
two steps are both based on statistical information on tag combinations
found in various tagged corpora. The choice of the most likely tag is
not done purely by statistical methods, however. The final word is given
to a rule-based component. This component tries to correct observed
systematic errors of the statistical components.
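A standard way to picture this `best path' step is a Viterbi-style search
over tag sequences, as in the simplified Python sketch below; the bigram
transition table and lexical probabilities are invented toy values, and the
real component additionally looks at two words of context on either side
and applies rule-based corrections.

# Simplified bigram 'best path' search (a toy sketch, not the TOSCA tagger).

def best_path(word_tags, transition_p):
    """word_tags: one list of (tag, P(word|tag)) pairs per word."""
    best = {"<s>": (1.0, ["<s>"])}               # last tag -> (prob, best path)
    for candidates in word_tags:
        new_best = {}
        for tag, p_word in candidates:
            for prev, (p, path) in best.items():
                q = p * transition_p.get((prev, tag), 1e-6) * p_word
                if tag not in new_best or q > new_best[tag][0]:
                    new_best[tag] = (q, path + [tag])
        best = new_best
    prob, path = max(best.values(), key=lambda v: v[0])
    return path[1:], prob

# Toy example: "you select the text" with invented probabilities.
word_tags = [[("PRON", 1.0)],
             [("V", 0.7), ("N", 0.3)],
             [("ART", 1.0)],
             [("N", 0.8), ("V", 0.2)]]
transitions = {("<s>", "PRON"): 0.5, ("PRON", "V"): 0.6, ("PRON", "N"): 0.1,
               ("V", "ART"): 0.5, ("N", "ART"): 0.2, ("ART", "N"): 0.7,
               ("ART", "V"): 0.01}
print(best_path(word_tags, transitions))   # (['PRON', 'V', 'ART', 'N'], ...)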
The tagset we employed in the analysis of the three computer manuals
reported on here consists of around 260 tags, with a relatively high degree
of ambiguity: when we compare our tagset to other commonly used
tagsets such as the Brown tagset (which has only 86 tags), we find that
with the Brown tagset the number of tags for a given word ranges from
1 to 7 and fully 60% of the words or tokens in the text appear to
be unambiguous, while with our tagset the number of tags ranges from
1 to 33 and only 29% of the words are unambiguous.

11.2.3 The Parser


The TOSCA parser is grammar-based, i.e. the parser is derived from a
formal grammar. The rules of the grammar are expressed in terms of
rewrite rules, using a type of two-level grammar, viz. Affix Grammar
over Finite Lattices (AGFL) (Nederhof & Koster, 1993).

c VP VERB PHRASE (cat operator, complementation, finiteness, INDIC, voice):
 f OP OPERATOR (cat operator, finiteness, INDIC),
 n FURTHER VERBAL OPTIONS (cat operator, complementation, INDIC, voice1),
 n establish voice (cat operator, voice, voice1).

n FURTHER VERBAL OPTIONS (cat operator, INTR, INDIC, ACTIVE):
 .

n FURTHER VERBAL OPTIONS (cat operator, complementation, INDIC, voice):
 n A ADVERBIAL OPTION,
 n next aux expected (cat operator, next cat),
 n expected finiteness (cat operator, finiteness),
 f AVB AUXILIARY VERB (next cat, finiteness, INDIC),
 n FURTHER VERBAL OPTIONS (next cat, complementation, INDIC, voice1),
 n establish voice (next cat, voice1, voice);
 n A ADVERBIAL OPTION,
 n expected finiteness (cat operator, finiteness),
 f MVB MAIN VERB (complementation1, finiteness, INDIC, voice),
 n establish voice (cat operator, ACTIVE, voice),
 n reduced compl when passive (voice, complementation1, complementation).

Figure 11.2: Some example rules in AGFL.4

This formalism
and the parser generator that is needed for the automatic conversion
of a grammar into a parser, were developed at the Computer Science
Department of the University of Nijmegen.5 The parser is a top-down
left corner recursive backup parser.
In our experience, linguistic rules can be expressed in the AGFL for-
malism rather elegantly: the two levels in the grammar each play a dis-
tinctive role and contribute towards the transparency of the description
and resulting analyses. An outline of the overall structure is contained
in the first level rules, while further semantico-syntactic detail is found
on the second level. Thus generalizations remain apparent and are not
obscured by the large amount of detail. Some example rules in AGFL
are given in Figure 11.2.

4 As parse trees are transformed into analysis trees the prefixes that occur in these
rules are used to filter out the linguistically relevant information. The prefix f iden-
tifies the function of a constituent, c its category, while n is used to indicate any
non-terminals that should not be included in the analysis tree.
5 AGFL comes with a Grammar Workbench (GWB), which supports the devel-
opment of grammars, while it also checks their consistency. The AGFL formal-
ism does not require any special hardware. The parser generator, OPT, is rela-
tively small and runs on regular SPARC systems and MS-DOS machines (386 and
higher). Required hard disk space on the latter type of machine is less than 1
MB. AGFL was recently made available via FTP and WWW. The address of the
FTP site is ftp://hades.cs.kun.nl/pub/agfl/ and the URL of the AGFL home page
is: http://www.cs.kun.nl/agfl/
In these rules the first level describes the (indicative) verb phrase in
terms of an operator which may be followed by further verbal elements,
i.e. auxiliary verbs and/or a main verb. The first level description is
augmented with the so-called affix level. The affixes that are associated
with the verb phrase record what type of auxiliary verb realizes the
function of operator (e.g. modal, perfective, progressive or passive),
what complementation (objects and/or complements) can be expected
to occur with the verb phrase, whether the verb phrase is finite or non-
finite, and whether it is active or passive. The predicate rules that are
given in small letters (as opposed to the other first level rules for which
capital letters are used) are rules that are used to impose restrictions on
or to effect the generation or analysis of a particular affix value elsewhere.
For example, the predicate rule `next aux expected' describes the co-
occurrence restrictions that hold with regard to subsequent auxiliary
verbs.
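The effect of a predicate rule of this kind can be pictured as a small
table of permitted successors, as in the sketch below; the categories and
the ordering (modal before perfective before progressive before passive
before the main verb) follow the usual description of the English verb
phrase, and the code is our own illustration rather than the TOSCA
affix grammar.

# Illustrative sketch only: which verbal category may follow a given
# operator in an English verb phrase.
NEXT_ALLOWED = {
    "modal":       {"perfective", "progressive", "passive", "main"},
    "perfective":  {"progressive", "passive", "main"},
    "progressive": {"passive", "main"},
    "passive":     {"main"},
}

def well_formed(verb_chain):
    """Check that each auxiliary is followed only by permitted categories."""
    for current, following in zip(verb_chain, verb_chain[1:]):
        if current != "main" and following not in NEXT_ALLOWED[current]:
            return False
    return verb_chain[-1] == "main"      # the chain must end in a main verb

print(well_formed(["modal", "perfective", "passive", "main"]))  # True
print(well_formed(["progressive", "perfective", "main"]))       # False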
The objective of the formalized description is to give a full and ex-
plicit account of the structures that are found in English, ideally in
terms of notions that are familiar to most linguists. As such, the formal
grammar is interesting in its own right. The descriptive model that is
employed in the case of the TOSCA parser is based on that put forward
by Aarts and Aarts (1982), which is closely related to the system found
in Quirk, Greenbaum, Leech and Svartvik (1972). This system is based
on immediate constituent structure and the rank hierarchy. Basically,
the three major units of description are the word, the phrase and the
clause/sentence. As was said above, words are assigned to word classes
and provided with any number of features, which may be morphologi-
cal, syntactic or semantic in character. They form the building blocks
of phrases; in principle, each word class can function as the head of a
phrase. Every unit of description that functions in a superordinate con-
stituent receives a function label for its relation to the other elements
within that constituent. On the level of the phrase we find function labels
like head and modifier, on the level of the clause/sentence we find sen-
tence functions like subject and object. The relational concepts that are
expressed in function labels are essentially of three types: subordination
and superordination (e.g. in all headed constituents), government (e.g.
in prepositional phrases and in verb complementation), and concatena-
tion (in compounding on word level, in apposition and in coordination).
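The kind of information that the descriptive model attaches to a
constituent can be pictured with a small data structure like the one
below; the class itself is our own sketch, but the labels in the example
are taken from the analyses shown in this chapter ("he said").

from dataclasses import dataclass, field
from typing import List

# Sketch only: every constituent carries a function label, a category
# label and a set of features; words appear at the leaves.
@dataclass
class Node:
    function: str                  # e.g. "SU", "V", "OD", "NPHD"
    category: str                  # e.g. "NP", "VP", "N", "LV"
    features: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    word: str = ""                 # filled in at the leaves only

subject = Node("SU", "NP", ["pers", "sing"],
               [Node("NPHD", "PN", ["pers", "sing"], word="he")])
verb = Node("V", "VP", ["act", "indic", "intr", "past"],
            [Node("MVB", "LV", ["indic", "intr", "past"], word="said")])
clause = Node("UTT", "S", ["act", "decl", "indic", "intr", "past"],
              [subject, verb])
print(clause.function, clause.category, clause.features)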
The analysis result is presented to the linguist in the form of a la-
belled tree. Unlike the tree diagrams usually found in linguistic studies,
the trees grow from left to right. With each node at least function and
category information is associated, while for most nodes also more de-
tailed semantico-syntactic information is provided. An example analysis
is given in Figure 11.3. The information that is found at the nodes of
the tree is as follows: function-category pairs of labels are given in cap-
ital letters (e.g. "SU,NP" describes the constituent as a subject that is
realized by a noun phrase), while further detail is given in lower case.
Lexical elements are found at the leaves of the tree.

[Figure 11.3 shows the left-to-right analysis tree for the utterance
"'It's wonderful how Friary gets back on the job,' he said, by way of
finding some casual remark.", with a function-category pair and further
semantico-syntactic features at each node.]

Figure 11.3: Example analysis.

The present grammar comprises approximately 1900 rewrite rules (in
AGFL) and has an extensive coverage of the syntactic structures that
occur in the English language. It describes structures with unmarked
word order, as well as structures such as cleft, existential and extraposed
sentences, verbless clauses, interrogatives, imperatives, clauses and sen-
tences in which subject-verb inversion occurs and/or in which an object
or complement has been preposed. Furthermore, a description is pro-
vided for instances of direct speech, which includes certain marked and
highly elliptical clause structures, enclitic forms, as well as some typical
discourse elements (e.g. formulaic expressions, connectives, reaction sig-
nals). A small number of constructions has not yet been accounted for.
These are of two types. First, there are constructions the description
of which is problematic given the current descriptive framework. An
example of this type of construction is constituted by clauses/sentences
showing subject-to-object [1], or object-to-object raising [2].
[1] Who do you think should be awarded the prize?
[2] Who do you think they should award the prize?
The traditional analysis of these structures is arrived at by postu-
lating some kind of deep structure. In terms of the present descriptive
framework, however, only one clausal level is considered at a time, while
the function that is associated with a constituent denotes the role that
this constituent plays within the current clause or sentence. The analy-
sis of [1] is problematic since Who, which by a deep structure analysis
can be shown to be the subject of the embedded object clause, occurs
as a direct object in the matrix clause or sentence. In [2] Who must
be taken to be the indirect object of award in the object clause. On
the surface, however, it appears as a direct object of think in the matrix
clause. Second, there are constructions that occur relatively infrequently
and whose description can only be given once we have gained sufficient
insight into their nature, form and distribution of occurrence or which,
so far, have simply been overlooked. In handbooks on English grammar
the description of these structures often remains implicit or is omitted
altogether.
The parser is designed to produce at least the contextually appro-
priate analysis for each utterance. But apart from the desired analysis,
additional analyses may be produced, each of which is structurally cor-
rect but which in the given context cannot be considered appropriate.
Therefore, parse selection constitutes a necessary step in the analysis
process. In instances where the parser produces a single analysis the
linguist checks whether this is indeed the contextually appropriate one,
while in the case of multiple analyses the linguist must select the one he
nds appropriate. Although it has been a principled choice to yield all
possible analyses for a given utterance, rather than merely yielding the
one (statistically or otherwise) most probable analysis, the overgenera-
tion of analyses has in practice proved to be rather costly. Therefore, in
order to have the parser operate more efficiently (in terms of both com-
puter time and space) as well as to facilitate parse selection, a number
of provisions have been made to reduce the amount of ambiguity. Prior
to parsing, the boundaries of certain constituents must be indicated by
means of syntactic markers. This is partly done by hand, partly auto-
matically. For example, the linguist must insert markers for constituents
like conjoins, appositives, parenthetic clauses, vocatives and noun phrase
postmodifiers. As a result, certain alternative parses are prohibited and
will not play a role in the parsing process. In a similar fashion an auto-
matic lookahead component contributes to the efficiency of the parsing
process. In the tagged and syntactically marked utterance, lookahead
symbols are inserted automatically. These are of two types: they either
indicate the type of sentence (declarative, interrogative, exclamatory or
imperative), or they indicate the potential beginnings of subclauses. In
effect these lookahead symbols tell the parser to bypass parts of the
grammar. Since, in spite of the non-ambiguous tag assignment of an
utterance, its syntactic marking and the insertion of lookahead symbols,
the analysis result may still be ambiguous, a (rule-based) filter component
has been developed which facilitates parse selection by filtering out (upon
completion of the parsing process) intuitively less likely analyses. For a
given utterance, for example, analyses in which marked word order has
been postulated are discarded automatically when there is an analysis in
which unmarked word order is observed.
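A toy version of the lookahead step is sketched below; the heuristics
(an utterance-initial verb tag signals an imperative, and '#' is placed
before tags that can open a subclause) are our own simplifications of
what the text describes, not the actual TOSCA component.

# Illustrative sketch only: insert an utterance-type symbol (d/m) and
# '#' before potential subclause beginnings; tag names partly invented.
SUBCLAUSE_STARTERS = {"CONJUNC(subord)", "PRON(rel)", "PRON(inter)",
                      "ADV(inter)"}

def insert_lookahead(tags):
    utterance_type = "m" if tags and tags[0].startswith("V(") else "d"
    out = [utterance_type]
    for tag in tags:
        if tag in SUBCLAUSE_STARTERS:
            out.append("#")
        out.append(tag)
    return out

# "Select the text if you want"
tagged = ["V(montr,infin)", "ART(def)", "N(com,sing)",
          "CONJUNC(subord)", "PRON(pers)", "V(montr,pres)"]
print(insert_lookahead(tagged))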
The selection of the contextually appropriate analysis for a given
utterance is generally fairly straightforward. Moreover, since a formal
grammar underlies the parser, consistency in the analyses is also war-
ranted, that is, up to a point: in some instances, equally appropriate
analyses are offered. It is with these instances that the linguist has to be
consistent in the selection he makes from one utterance to the next. For
example, the grammar allows quantifying pronouns to occur as predeter-
miner, but also as postdeterminer. As predeterminers they precede items
like the article, while as postdeterminers they typically follow the article.
While there is no ambiguity when the article is present, a quantifying
pronoun by itself yields an ambiguous analysis.

11.3 Parser Evaluation Criteria


For the IPSM'95 workshop that was held at the University of Limer-
ick in May 1995 participants were asked to carry out an analysis of the
IPSM'95 Corpus of Software Manuals. The corpus comprised some 600
sentences. The material was to be subjected to analysis under varying
circumstances. In the experiment three phases were distinguished and
for each phase the material had to be re-analysed. During phase I of the
experiment the system made use of its original grammar and vocabulary,
while in phases II and III changes to the lexicon and grammar were per-
mitted. The findings of the participants were discussed at the workshop.
In order to facilitate comparisons between the different systems, for the
present paper participants were instructed to report on their findings on
the basis of only a small subset (60 sentences) of the original corpus. In
our view the number of sentences to be included is unfortunately small:
as our results show, the success rate is very much influenced by chance.
While on the basis of the subset of 60 sentences we can claim the success
rate to be 91.7% (on average), the results over the full corpus are less
favourable (88.3% on average). Therefore, in order to provide a more
accurate account of the system and its performance, we have decided
to include not only our findings for the subset of 60 sentences, but also
those for the full corpus.
The structure of the remainder of this paper, then, is as follows: Sec-
tions 11.4, 11.5 and 11.6 describe the procedure that was followed in each
of the different phases of the experiment. In Section 11.7 a description
is given of the way in which output from the TOSCA system can be
converted to a dependency structure. A summary of our findings in the
experiment is given in Section 11.8. Section 11.9 lists the references.

11.4 Analysis I: Original Grammar, Original Vocabulary
For the first analysis of the material we ran the TOSCA system without
any adaptations on an MS/DOS PC with a 486DX2/66 processor and
16Mb of RAM. Each of the three samples (hereafter referred to as LO-
TUS, DYNIX and TRADOS) was tagged and then further processed. As
a preparatory step to the tagging stage, we inserted text unit separators
to help the tokenizer to correctly identify the utterance boundaries in
the texts.6 This action was motivated by the observation that the use of
punctuation and capitalization in these texts does not conform to com-
mon practice: in a great many utterances there is no final punctuation,
while the use of quotes and (esp. word-initial) capital letters appears to
be abundant. These slightly adapted versions of the raw texts were then
submitted to the tagger. As a result of the (automatic) tagging stage
the texts were divided into utterances, and with each of the tokens in an
utterance the contextually most likely tag was associated.
6 In fact, at the beginning of each new line in the original text a text unit separator
was inserted.
          Number   Accept   Reject   % Accept   % Reject
Dynix         20       20        0      100.0        0.0
Lotus         20       18        2       90.0       10.0
Trados        20       20        0      100.0        0.0
Total         60       58        2       96.7        3.3

Table 11.3.1: Phase I acceptance and rejection rates for TOSCA.
The machine used was a 486DX2/66 MS/DOS PC. The figures
presented here are the number of utterances for which the system
produces an analysis (not necessarily appropriate).

          Total Time     Average Time    Average Time
          to Parse (s)   to Accept (s)   to Reject (s)
Dynix            258             12.9
Lotus           1065             59.1             0.5
Trados           673             33.7
Total           1996             34.4             0.2

Table 11.4.1: Phase I parse times for TOSCA. The machine
used was a 486DX2/66 MS/DOS PC with 1.6 Mb of memory.

Char.      A      B      C      D      E      F      G     Avg.
Dynix    100%   100%   100%   100%   100%   100%   100%    100%
Lotus     85%    92%    94%    91%    88%    92%   100%     92%
Trados   100%   100%   100%   100%   100%   100%   100%    100%
Avg.      95%    97%    98%    97%    96%    97%   100%     97%

Table 11.5.1: Phase I Analysis of the ability of TOSCA to recog-
nise certain linguistic characteristics in an utterance. For example
the column marked `A' gives for each set of utterances the per-
centage of verbs occurring in them which could be recognised.
The full set of codes is itemised in Table 11.2.

A characterization of the three texts in terms of the language variety
exemplified, the number of words, tokens, and utterances and the mean
utterance length is given in Table 11.6.2. The material in the selected
subset can be characterized as in Table 11.6.1.
Since the parser will only produce the correct analysis (i.e. the con-
textually appropriate one) when provided with non-ambiguous and fully
correct input, tag correction constitutes a necessary next step in the
analysis process. Therefore, after the texts had been tagged, we then
proceeded by checking, and where necessary correcting, the tagging of
each utterance and inserting the required syntactic markers. As had
been expected, a great many utterances required tag correction and
syntactic marker insertion. With regard to the need for tag correction,
the texts did not greatly differ: for approximately 20-25 per cent of the
utterances a fully correct tagging was obtained, while in the remainder
of the utterances minimally one tag needed to be corrected.7

          Number   Accept   Reject   % Accept   % Reject
Dynix         20       20        0      100.0        0.0
Lotus         20       17        2       85.0       10.0
Trados        20       18        0       90.0        0.0
Total         60       55        2       91.7        3.3

Table 11.3.1a: Phase I acceptance and rejection rates for
TOSCA. The machine used was a 486DX2/66 MS/DOS PC. The
figures presented here are the number of sentences for which the
contextually appropriate analysis is produced.
marker insertion was required with the majority of utterances. In this
respect the LOTUS text was least problematic: in 35 per cent of the
utterances no syntactic markers were required (cf. the DYNIX and the
TRADOS texts in which 32.5 and 25 per cent respectively could remain
without syntactic markers). More interesting, however, are the differ-
ences between the texts when we consider the nature of the syntactic
markers that were inserted. While in all three texts coordination ap-
peared to be a highly frequent phenomenon, which explains the frequent
insertion of the conjoin marker, in the LOTUS text only one other type
of syntactic marker was used, viz. the end-of-noun-phrase-postmodifier
marker. In the DYNIX and TRADOS texts syntactic markers were also
used to indicate apposition and the occurrence of the noun phrase as
adverbial.
After an utterance has been correctly tagged and the appropriate
syntactic markers have been inserted, it can be submitted to the parser.
As the analyst hands back control to the system, the words are stripped
from the utterance and the tags together with the syntactic markers
7 The success rate reported here is by TOSCA standards: a tag is only considered
to be correct if it is the one tag that we would assign. All other taggings are counted
as erroneous, even though they may be very close to being correct (as for example
the tagging of a noun as a N(prop, sing) instead of N(com, sing), or vice versa), or
subject to discussion (monotransitive vs. complex transitive). The major problem
for the tagger was constituted by the compound tokens that occurred in the texts.
The list of terms provided here was of little use since it contained many items that
in terms of our descriptive model are not words but (parts of) phrases.
                      Lotus          Dynix          Trados
Language Variety      Am. English    Am. English    Eur. English
                      with Am.       with Am.       with Am.
                      Spelling       Spelling       Spelling
No. of Words          256            293            340
No. of Tokens         296            340            386
No. of Utterances     20             20             20
Mean Utt. Length
(in No. of Tokens)    14.8           17.0           19.3

Table 11.6.1: Characterization of the texts (subset of 60 sen-
tences).

                      Lotus          Dynix          Trados
Language Variety      Am. English    Am. English    Eur. English
                      with Am.       with Am.       with Am.
                      Spelling       Spelling       Spelling
No. of Words          2477           2916           3609
No. of Tokens         2952           3408           4221
No. of Utterances     200            200            207
Mean Utt. Length
(in No. of Tokens)    14.7           17.0           20.4

Table 11.6.2: Characterization of the texts (original IPSM'95
Corpus).

are put to the automatic lookahead component which inserts two types
of lookahead symbol: one that indicates the type of the utterance (d =
declarative, m = imperative), and one that indicates possible beginnings
of (sub)clauses (#). Upon completion of the insertion of lookahead sym-
bols, the parser is then called upon. In Figure 11.4 an example is given
of an utterance as it occurs (1) in its original format and (2) in its tagged
and syntactically marked format, including the lookahead symbols that
were added by the automatic lookahead component.

Original format of the utterance:

The level you specify determines the number of actions or levels
Ami Pro can reverse.

Input format for the parser:

d ART(def) N(com,sing) # PRON(pers) V(montr,pres)
MARK(enppo) V(montr,pres) # ART(def) N(com,sing)
PREP(ge) MARK(bcj) N(com,plu) MARK(ecj) CONJUNC(coord)
MARK(bcj) N(com,plu) MARK(ecj) MARK(enppo) # N(prop,sing)
AUX(modal,pres) V(montr,infin) MARK(enppo) PUNC(per)

Figure 11.4: Example utterance. For explanation of codes see
Figure 11.5.

ART(def)            definite article
AUX(modal,pres)     modal auxiliary, present tense
CONJUNC(coord)      coordinating conjunction
N(com,sing)         singular common noun
N(prop,sing)        singular proper noun
PREP(ge)            general preposition
PRON(pers)          personal pronoun
PUNC(per)           punctuation, period
V(montr,pres)       monotransitive verb, present tense
V(montr,infin)      monotransitive verb, infinitive
MARK(bcj)           beginning-of-conjoin marker
MARK(ecj)           end-of-conjoin marker
MARK(enppo)         end-of-noun-phrase postmodifier marker

Figure 11.5: Explanation of codes used in Figure 11.4. With the
tags, capitalized abbreviations indicate word class categories, while
features are between brackets (using small letters). The syntactic
markers are labelled MARK, while their nature is indicated by
means of the information given between brackets.

11.4.1 Efficacy of the Parser

In the light of our experiences with the analysis of other types of text,
the efficacy of the parser with this particular text type was somewhat
disappointing. While the success rate (i.e. the percentage of utterances
for which a contextually appropriate analysis is yielded) for fiction texts
ranges from 94 to 98 per cent, the overall success rate with the three
texts under investigation is 88.3% on average, ranging from 85 per cent
for the DYNIX text to 91.5 per cent for the LOTUS text. Note that if
we only take the subset into account, the success rate is 91.7 per cent
on average, ranging from 85 per cent for the LOTUS text to 100 per
cent for the DYNIX text. A breakdown of the analysis results is given
in Table 11.7.2 (for the original, full IPSM'95 Corpus) and Table 11.7.1
(for the subset of 60 utterances).
As is apparent from the breakdown in Table 11.7.2, the success rate,
especially in the case of the DYNIX and TRADOS texts, is very much
negatively influenced by the percentage of utterances for which the pars-
ing stage did not yield a conclusive result. For up to 6 per cent of the
utterances (in the DYNIX text) parsing had to be abandoned after the
allotted computer time or space had been exhausted.8 If we were to
correct the success rate with the percentage of utterances for which no
conclusive result could be obtained under the present conditions, assum-
ing that a PC with a faster processor and more disk space would alleviate
our problems, the result is much more satisfactory, as is shown in Table
11.8.

8 Here it should be observed that the problem did not occur while parsing the
utterances of the subset.

                    Lotus            Dynix            Trados
# Analyses      # Utts  % Utts   # Utts  % Utts   # Utts  % Utts
Parse Failure        2    10.0        0     0.0        0     0.0
Erroneous            1     5.0        0     0.0        2    10.0
Inconclusive         0     0.0        0     0.0        0     0.0
1                    5    25.0        7    35.0        3    15.0
2                    5    25.0        3    15.0        4    20.0
3                    0     0.0        2    10.0        1     5.0
4                    2    10.0        2    10.0        2    10.0
5                    1     5.0        0     0.0        1     5.0
6                    0     0.0        0     0.0        1     5.0
>6                   4    20.0        6    30.0        6    30.0

Table 11.7.1: Breakdown of analysis results (success rate and
degree of ambiguity; subset of 60 sentences).

                    Lotus            Dynix            Trados
# Analyses      # Utts  % Utts   # Utts  % Utts   # Utts  % Utts
Parse Failure        9     4.5       13     6.5        5     2.4
Erroneous            6     3.0        5     2.5        8     3.9
Inconclusive         2     1.0       12     6.0       11     5.3
1                   50    25.0       71    35.5       52    25.1
2                   53    26.5       23    11.5       40    19.3
3                    7     3.5        9     4.5        8     3.9
4                   15     7.5       16     8.0       17     8.2
5                    5     2.5        1     0.5        5     2.4
6                    9     4.5       13     6.5        8     3.9
>6                  44    22.0       37    18.5       53    25.6

Table 11.7.2: Breakdown of analysis results (success rate and
degree of ambiguity; original IPSM'95 Corpus).

            Lotus    Dynix    Trados
TOSCA(1)    91.5%    85.0%    88.4%
TOSCA(2)    92.5%    91.5%    93.7%

Table 11.8: Success rate in parsing the texts (original IPSM'95
Corpus; TOSCA(1) gives the strict success rate, while TOSCA(2)
gives the corrected success rate).
The degree of ambiguity of the three texts appears much higher than
that observed in parsing other types of text. For example, with fiction
texts on average for approximately 48 per cent of the utterances the
parser produces a single analysis, some 63 per cent of the utterances
receive one or two analyses, and 69 per cent receive up to three analyses.
For the computer manual texts these figures are as given in Tables 11.9.1
and 11.9.2.
An examination of the utterances for which the parser failed to yield
an analysis revealed the following facts:
- all parse failures in the LOTUS text and half the failures in the
DYNIX text could be ascribed to the fact that the grammar un-
derlying the parser does not comprise a description of structures
showing raising. Typical examples are:
[3] Move the mouse pointer until the I-beam is at the beginning of
the text you want to select.
[4] Select the text you want to move or copy.
[5] The type of search you wish to perform.
- the parser generally fails to yield an analysis for structures that
do not constitute proper syntactic categories in terms of the de-
scriptive model that is being employed; occasionally the parser will
come up with an analysis, which as is to be expected, is always er-
roneous. For example, in the DYNIX text the parser failed in the
analysis of the following list-items:
[6] From any system menu by entering \S" or an accelerated search
command.
[7] From the item entry prompt in Checkin or Checkout by entering
\.S".
- apart from the two types of structure described above, there did
not appear to be any other systematic cause for failure. Especially
in the TRADOS text it would seem that parse failure should be
ascribed to omissions in the grammar, more particularly where
the description of apposition is concerned (in combination with
coordination).

No. of Analyses    Lotus    Dynix    Trados
Single             25.0%    35.0%    15.0%
One or two         50.0%    50.0%    35.0%
One to three       50.0%    60.0%    40.0%

Table 11.9.1: Degree of ambiguity (subset of 60 sentences).

No. of Analyses    Lotus    Dynix    Trados
Single             25.0%    35.5%    25.1%
One or Two         51.5%    37.0%    44.4%
One to Three       55.0%    41.5%    48.3%

Table 11.9.2: Degree of ambiguity (original IPSM'95 Corpus).
The percentage of utterances for which the parser did not yield the
contextually appropriate analysis (i.e. where only erroneous analyses
were produced) was relatively high when compared to our experiences
with other types of text. On examination we found that this was only in
part due to omissions in the grammar. A second factor appeared to be
overspecification: in an attempt to reduce the amount of ambiguity as
much as possible, a great many restrictions were formulated with regard
to the co-occurrence of consecutive and coordinating categories. Some of
these now proved to be too severe. An additional factor was constituted
by the filter component that comes into operation upon completion of
the parsing process and which, as was explained above, serves to auto-
matically filter out intuitively less likely analyses. In a number of
instances the parser would yield the correct analysis, but this would then
be discarded in favour of (an) analysis/-es that, at least by the
assumptions underlying the filter component, was/were considered to be
more probable.
Omissions in the grammar were found to include the following:
- the grammar does not describe the possible realization of the prepo-
sitional complement by means of a wh-clause; this explains the
erroneous analysis of utterances such as [8] and [9]:
[8] The following are suggestions on how to proceed when using the
Translator's Workbench together with Word for Windows 6.0.
[9] This has an influence on how your Translation Memory looks
and how the target-language sentences are transferred from Trans-
lation Memory to the yellow target field in WinWord 6.0 in the
case of a fuzzy match.

                 Lotus            Dynix            Trados
CPU secs      # Utts  % Utts   # Utts  % Utts   # Utts  % Utts
t ≤ 5             15    75.0       15    75.0       14    70.0
5 < t ≤ 10         1     5.0        0     0.0        1     5.0
10 < t ≤ 15        0     0.0        1     5.0        1     5.0
15 < t ≤ 20        1     5.0        1     5.0        0     0.0
20 < t ≤ 25        0     0.0        1     5.0        0     0.0
25 < t ≤ 30        0     0.0        0     0.0        0     0.0
t > 30             3    15.0        2    10.0        4    20.0

Table 11.10.1: Breakdown of parsing times (subset of 60 sen-
tences).

                 Lotus            Dynix            Trados
CPU secs      # Utts  % Utts   # Utts  % Utts   # Utts  % Utts
t ≤ 5            114    57.0      110    55.0      114    55.1
5 < t ≤ 10        28    14.0       17     8.5       24    11.6
10 < t ≤ 15        7     3.5       12     6.0        9     4.4
15 < t ≤ 20        5     2.5        5     2.5        4     1.9
20 < t ≤ 25        2     1.0        3     1.5        9     4.4
25 < t ≤ 30        8     4.0        1     0.5        4     1.9
t > 30            38    19.0       52    26.0       43    20.8

Table 11.10.2: Breakdown of parsing times (original IPSM'95
Corpus).
- the grammar does not describe the realization of an object comple-
ment by means of a wh-clause; this explains the erroneous analysis
of utterances such as [10]:
[10] Place the insertion point where you want to move the text.
A typical example of an utterance for which only an erroneous anal-
ysis was yielded as a result of overspecification is given in [11]:
[11] This is a sample sentence that should be translated, formatted
in the paragraph style "Normal".
While the grammar describes relative clauses and zero clauses as pos-
sible realizations of the function noun phrase postmodifier, restrictions
have been formulated with regard to the co-occurrence of consecutive
postmodifying categories. One of the assumptions underlying these re-
strictions is that zero clauses always precede relative clauses, an assump-
tion which here is shown to be incorrect.
The filter component typically failed with utterances for which it was
assumed that an analysis postulating a sentence as the category realizing
the utterance was more probable than one in which the utterance was
taken to be realized by a prepositional phrase. For example, while on
the basis of the rules contained in the grammar the correct analysis
was produced for the utterance in [12], it was discarded and only the
erroneous analysis remained.
[12] From any system menu that has "Search" as a menu option.

11.4.2 Efficiency of the Parser


The parsing times recorded in Tables 11.10.2 and 11.10.1 are the times it
took the parser to parse the utterance after it had been properly tagged
and syntactically marked.
At first glance there do not appear to be any major differences
between the three texts: the proportion of utterances for which a result
is yielded within 5 seconds is similar for all three texts. However, the
proportion of utterances for which it takes the parser more than 30 sec-
onds to produce a result is much higher in the DYNIX text than it is
in the other two texts (26 per cent vs. 19.0 and 20.8 per cent respec-
tively). As we saw above, the DYNIX text differs from the other two
texts in other respects as well: it has the highest percentage of utter-
ances for which no conclusive result is obtained, while the percentage of
successfully parsed utterances that receives a single analysis stands out
as well. It would appear that an explanation for the observed di erences
may be sought in the fact that the length of utterances in the DYNIX
text varies a great deal, a fact which does not emerge from the mean
length of the utterances recorded above (cf. Table 11.6.2). Although the
relationship between the length of utterances on the one hand and the
success rate and eciency in parsing on the other is not straightforward,
the complexity of utterances is generally found to be greater with longer
utterances so that even when the analysis result is not extremely am-
biguous, the amount of ambiguity that plays a role during the parsing
process may be problematic.

11.4.3 Results
This section will present the linguistic characteristics of the TOSCA
system and the results based on the subset of the original corpus. As
was described in Section 11.2.3 the TOSCA parser is based upon a formal
grammar. This grammar has an extensive coverage and describes most
linguistic structures. The descriptive model is essentially a constituency-
model, in which the word, the phrase and the clause/sentence form the
major units of description. Each constituent is labelled for both its
function and its category. The relational concepts that are expressed in
function labels are of three types: subordination and superordination,
government, and concatenation. Table 11.1 lists some of the constituents
that the parser can in principle recognise.
The following observations are in order:
1. the TOSCA system employs a two stage analysis model in which
a tagging stage precedes the parsing stage. Each tag that results
from the (automatic) tagging is checked and if necessary corrected
before the parser is applied. Therefore, while the parser recognises
verbs, nouns and compounds (they occur after all as terminal sym-
bols in the formal grammar from which the parser is derived), any
ambiguity that arises at the level of the word (token) is actually
resolved beforehand, during the tagging stage and subsequent tag
selection.
2. PP-attachment and the analysis of coordinations and instances of
gapping in the TOSCA system is not problematic due to the fact
that the user of the system must insert syntactic markers with cer-
tain constituents. Thus the conjoins in coordinations are marked,
as are prepositional phrases (and indeed all categories) that func-
tion as noun phrase postmodifiers.
The TOSCA parser is fairly liberal. We permit the parser to over-
generate, while at the same time we aim to produce for any given (ac-
ceptable) utterance at least the contextually appropriate analysis. For
the present subset, we are not entirely successful (cf. Table 11.3.1). As
many as 58 of the 60 utterances (96.7%) receive some kind of analysis; for
55 utterances (91.7%) the contextually appropriate analysis is actually
present (cf. Table 11.3.1a).
The two utterances for which the parser fails to produce an analysis
are both instances of raising. Raising also explains one of the instances
in which no appropriate analysis was obtained.
In Table 11.4.1 the performance of the parser is given in terms of
the total time (in CPU seconds) that it took the parser to attempt to
produce a parse for each of the utterances. In the third column (headed
"avg. time to accept") the average time is listed that it took the parser
to produce a parse when the input was found to be parsable, while the
fourth column ("avg. time to reject") lists the average time that was
required to determine that the input could not be parsed. The average
times are hardly representative: for example, in the Lotus text there
is one utterance for which it took the parser 973 seconds to produce a
parse as a result of which the reported average time is much worse than
it is for the two other texts. Moreover, the average times obscure the
fact that as many as 44 (i.e. 73.2 per cent) of the 60 utterances were
actually parsed within 5 seconds (cf. Table 11.10.1).
Finally, in Table 11.5.1 the accuracy of the TOSCA parser has been
determined by computing the percentage of the number of instances in
which the parser could identify various constructions. Here it should be
pointed out that for utterances for which the parser failed to produce an
analysis, none of the constructions that occurred in it could contribute
towards the success (in other words, if for some reason, the input could
not be parsed it was assumed that the identification of all constructions
that occurred in it failed).
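The scoring rule can be made explicit with a few lines of code; the data
in the example below are invented and serve only to show that a parse
failure makes every construction in that utterance count as missed.

# Illustrative sketch of the scoring rule: constructions only count as
# recognised if the utterance received a parse at all.
def recognition_rate(utterances, construction):
    """utterances: list of dicts with 'parsed' (bool) and per-construction
    (occurring, recognised) counts."""
    occurring = recognised = 0
    for utt in utterances:
        n_occ, n_rec = utt["constructions"].get(construction, (0, 0))
        occurring += n_occ
        recognised += n_rec if utt["parsed"] else 0   # failure -> nothing counts
    return 100.0 * recognised / occurring if occurring else 100.0

sample = [
    {"parsed": True,  "constructions": {"verb": (2, 2)}},
    {"parsed": False, "constructions": {"verb": (1, 1)}},   # parse failure
]
print(recognition_rate(sample, "verb"))    # 66.7 rather than 100.0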

11.5 Analysis II: Original Grammar, Additional Vocabulary
A separate second analysis in which the system was used with its original
grammar and additional vocabulary was, in view of the design of the
TOSCA system, not deemed to make any sense. As was explained above,
the system requires non-ambiguous and fully correct input, that is, with
each token the contextually appropriate tag must be associated in order
for the parser to yield the correct analysis. In fact, one can say that
the proposed second analysis was already carried out during the first
analysis as incorrect tags were being replaced by correct ones.
What would have been interesting in the TOSCA setting is to have
the tagger train on texts from this specific domain. As the relative proba-
bilities of the wordform-tag combinations are adjusted, the performance
of the tagger for this type of text is bound to improve. However, for
lack of time as well as of proper test materials this experiment was not
undertaken.

11.6 Analysis III: Altered Grammar, Additional Vocabulary
With regard to the grammar, a number of alterations can be envisaged.
Basically these are of two types: alterations that, on implementation,
will extend the coverage of the grammar/parser, and alterations that
will contribute towards an improvement of the efficiency of the parser.
Alterations of the first type are constituted by, for example, a revision or
adaptation of the descriptive model, and the formulation of additional
rules. Since there appeared to be few omissions in the grammar, and a
revision of the descriptive model is not something that can be carried
out overnight, we did not pursue this any further. The efficiency of the
parser can be improved by adapting the grammar so that it contains
only those rules that are relevant to this specific domain. The yield,
however, will be relatively small: the automatic lookahead component
already serves to bypass parts of the grammar. One small experiment
that we did carry out consisted in removing the syntactic markers from
the grammar. The motivation for undertaking this experiment was that
the insertion of syntactic markers considerably increases the need for
intervention and therefore the amount of manual work one has to put
in. Removal of the syntactic markers from the grammar can be viewed
as a step towards further automating the system as a whole. Unfortu-
nately, the outcome of the experiment only confirmed the present need
for syntactic markers. An attempt at running the parser without requir-
ing any syntactic markers failed: a test run of an adapted version of the
parser on the LOTUS text showed the parser to be extremely inefficient
so that for a great many utterances a conclusive result could no longer
be obtained.

11.7 Converting Parse Tree to Dependency Notation
At the IPSM'95 workshop, it was decided to attempt a translation of
the parse of a selected sentence into a dependency structure format for
all parsing systems that do not already yield such a format. It was
argued that dependency structure analyses are better suited for the cal-
culation of precision and recall measures (cf. Lin, this volume). An
attempt to translate the analyses produced by different systems would
test the feasibility of a general translation and measurement mechanism.
A translation from a constituent structure analysis, however, is certainly
not straightforward. One problem is that, in any marked relation be-
tween two items in the dependency analysis, one item must be defined
as dominating and the other as dependent. For several objects in a con-
stituent structure analysis this is not normally the case, so that rules
have to be defined which determine dominance for the translation and
which thus add (possibly gratuitous) information. On the other hand,
information may be lost since constituent structure analysis allows the
binding of information to units larger than a single word, whilst in the
translation there is not always a clear binding point for such information.
We have attempted to provide a translation of the analysis yielded by
the TOSCA system into a rather richer dependency structure analysis
than agreed, in order to prevent the loss of information we deem impor-
tant. The original analysis and the translation of the example sentence
are shown in Figures 11.6 and 11.7.

[Figure 11.6 shows the left-to-right constituent structure analysis of
the utterance "That is, these words make the source sentence longer or
shorter than the TM sentence.", with a function-category pair and
further features at each node.]

Figure 11.6: Constituent structure.


Since we are not currently interested in the translation other than for
the purpose of this experiment, we have not implemented a translation
algorithm but have translated the analysis by hand. However, we have
tried to adhere to the following general rules. For all headed phrases, the
head becomes the dominating element and the other constituents become
dependent upon it. For unheaded phrases or clauses, one constituent is
chosen to fill the role of the head: the main verb for verb phrases (i.e.
auxiliaries, adverbials and main verb), the preposition for prepositional
phrases, the main verb (the translation of the verb phrase) for clauses,
the coordinator for coordination and the subordinator in subordination.
For multi-token words, the last token becomes the dominating element
and the preceding tokens become dependent. Each dependency link is
marked with an indication of the type of the relation which is derived
from the function label of the dependent element at the level of the phrase
or clause which is collapsed. In the case of multi-token words, the link
is labelled as a premodifier of whatever the word as a whole represents.
In addition, each element by itself is marked with information derived
from all constituents in which it is the dominating element.

That      AVPR  →  is
is        CON   ⇒  <discourse>   ADV(connec)
,         PM    ⇒  <discourse>   PUNC(comma)
these     DT    →  words         PRON(dem, plu)
words     SU    →  make          N(com, plu)
make      UTT   ⇒  <discourse>   V(act, cxtr, indic, pres)
the       DT    →  sentence      ART(def)
source    NPPR  →  sentence
sentence  OD    →  make          N(com, sing)
longer    CJ    →  or            ADJ(comp)
or        CO    →  make          CONJN(coor, adj, comp)
shorter   CJ    →  or            ADJ(comp)
than      AJPO  →  or            CONJN(cl, act, indic, red, sub)
the       DET   →  sentence      ART(def)
TM        NPPR  →  sentence
sentence  SU    →  than          N(com, sing)
.         PM    ⇒  <discourse>   PUNC(period)

Figure 11.7: Dependency notation.
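Although the translation was carried out by hand, the head-selection
rules lend themselves to a mechanical formulation. The Python sketch
below is our own illustration under those rules (head table, tree encoding
and example are simplified and partly invented); it is not the translation
procedure actually used.

# Sketch only: constituency nodes are (function, category, children) with
# a plain string at the leaves; labels in the example follow Figure 11.7.

def head_child(category, children):
    """Pick the child that becomes the dominating element."""
    preferred = {"VP": "MVB", "PP": "P", "S": "V", "CL": "V",
                 "COORD": "COOR", "SUBP": "SBHD"}
    wanted = preferred.get(category)
    for child in children:
        if wanted and child[0] == wanted:
            return child
        if child[0].endswith("HD"):              # ordinary headed phrases
            return child
    return children[-1]                          # fallback: last constituent

def to_dependencies(node, governor="<discourse>", label=None, links=None):
    links = [] if links is None else links
    function, category, children = node
    label = label or function                    # label of the collapsed phrase
    if isinstance(children, str):                # leaf: a single word
        links.append((children, label, governor))
        return children, links
    head = head_child(category, children)
    head_word, _ = to_dependencies(head, governor, label, links)
    for child in children:
        if child is not head:
            to_dependencies(child, head_word, None, links)
    return head_word, links

# The noun phrase "the sentence" as subject, hung off <discourse> here
# purely for illustration.
tree = ("SU", "NP", [("DT", "DTP", [("DTCE", "ART", "the")]),
                     ("NPHD", "N", "sentence")])
print(to_dependencies(tree)[1])
# [('sentence', 'SU', '<discourse>'), ('the', 'DT', 'sentence')]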
Apart from extending the format to include extra information, we
also deviate from the original proposal in that we assume an additional
level dominating the utterance as a whole, called "<discourse>", on
which the main elements of the utterance are dependent. Alternatively,
we could make punctuation and connectives dependent on the verb or
vice versa. Neither alternative, however, is particularly appealing.
Our rules lead to a consistent but not always satisfactory translation.
The choice of the dominating element is sometimes arbitrary, as is illus-
trated in the treatment of "that is". Another problem is constituted by
the lack of binding points for information. This is illustrated in "than
the TM sentence", where clausal information has to be bound to the sub-
ordinator and "the TM sentence" has to be called a subject of "than".
Both problems make themselves felt in "longer or shorter", where "or"
is chosen as dominating rather than having "or shorter" dependent on
"longer" or "longer or" on "shorter". In the current solution "longer"
and "shorter" are equal, as we think they should be, but "or" has to be
labelled with adjectival information.
From this experience we conclude that the use of this translation
is undesirable except for the express purpose of comparison with other
systems. The problems that occur within this single example sentence
demonstrate clearly that even a translation for comparison has its pit-
falls. Only when the designers of the comparison define very clearly how
each construction (e.g. coordination) is to be represented will it be pos-
sible for all participants to come to a translation which is comparable.
If all information in the translation needs to be compared, it can only
contain elements from the greatest common divisor of all systems, which
means that a lot of richness of the original is lost. We at least would be
disappointed if we were to be judged on such a meagre derivative of our
analyses.

11.8 Summary of Findings


If we take the three texts under investigation to be representative of
the domain of computer/software manuals, application of the TOSCA
system to this domain can, on the whole, be considered successful. There
were not too many problems in getting the parser to parse the utterances.
The coverage of the grammar proved satisfactory: only a few structures
were encountered that pointed to omissions in the grammar. While we
had expected the grammar to fall short especially with structures that
so far in our own test materials had been relatively underrepresented
(such as imperatives), this was not the case.
The experiment once more confirmed what earlier experiences with
other types of text had already brought to light: the system clearly shows
the effects of a compromise between efficiency and robustness on the one
hand, and coverage and detail on the other. As we pointed out above,
both efficiency and robustness are considered to be only of secondary
importance. In the design of the system linguistic descriptive and ob-
servational adequacy are taken to be the primary objectives. However,
practical considerations force us to somehow optimize the analysis pro-
cess. This explains the present need for intervention at various points
in the process, viz. tag selection, syntactic marking and selection of the
contextually appropriate analysis. As a result, operation of the present
system is highly labour-intensive.
Judging from our experiences in carrying out the experiment it would
appear that the performance of the TOSCA system when applied to a
specific (restricted) domain could be improved by a number of domain-
specific adjustments to some of its components. For example, the devel-
opment and implementation of a domain-specific tokenizer and tagger
would yield a better performance of the tagger and reduce the need for
tag correction. The efficiency of the parser might be increased by re-
stricting the grammar to the specific domain. While the present gram-
mar/parser seeks to cover the whole of the English language and must
allow for all possible structural variation, a parser derived from a domain-
specific grammar could be much more concise and far less ambiguous.
As for the filter component that is used to automatically filter out the
intuitively less likely analyses, this appears to be domain-dependent:
while usually in running text the sentence analysis of an utterance is
intuitively more likely than its analysis as a phrase or clause, in
instructive types of text this is not necessarily true.

11.9 References
Aarts, J. (1991). Intuition-based and observation-based grammars. In
K. Aijmer & B. Altenberg (Eds.) English Corpus Linguistics (pp.
44-62). London, UK: Longman.
Aarts, F., & Aarts, J., (1982). English Syntactic Structures. Oxford,
UK: Pergamon.
Aarts, J., de Haan, P., & Oostdijk, N., (Eds.) (1993). English lan-
guage corpora: design, analysis and exploitation. Amsterdam, The
Netherlands: Rodopi.
Aarts, J., van Halteren, H., & Oostdijk, N., (1996). The TOSCA anal-
ysis system. In C. Koster & E. Oltmans (Eds.) Proceedings of the
First AGFL Workshop (Technical Report CSI-R9604, pp. 181-191).
Nijmegen, The Netherlands: University of Nijmegen, Computing Sci-
ence Institute.
Aijmer, K. & Altenberg, B., (Eds.) (1991). English Corpus Linguistics.
London, UK: Longman.
Dynix (1991). Dynix Automated Library Systems Searching Manual.
Evanston, Illinois: Ameritech Inc.
van Halteren, H. & van den Heuvel, T., (1990). Linguistic Exploitation
of Syntactic Databases. The use of the Nijmegen Linguistic DataBase
Program. Amsterdam, The Netherlands: Rodopi.
Lotus (1992). Lotus Ami Pro for Windows User's Guide Release Three.
Atlanta, Georgia: Lotus Development Corporation.
Nederhof, M. & Koster, K., (1993). A customized grammar workbench.
In J. Aarts, P. de Haan & N. Oostdijk, (Eds.) English language cor-
pora: design, analysis and exploitation (pp. 163-179). Amsterdam,
The Netherlands: Rodopi.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J., (1972). A Gram-
mar of Contemporary English. London, UK: Longman.
Trados (1995). Trados Translator's Workbench for Windows User's
Guide. Stuttgart, Germany: Trados GmbH.
Appendix I
The 60 IPSM Test Utterances

Dynix Test Utterances


If your library is on a network and has the Dynix Gateways product,
patrons and staff at your library can use gateways to access information
on other systems as well.

For example, you can search for items and place holds on items at other
libraries, research computer centers, and universities.

Typically, there are multiple search menus on your system, each of which
is set up differently.

The search menu in the Circulation module may make additional search
methods available to library staff.

For example, an alphabetical search on the word "Ulysses" locates all
titles that contain the word "Ulysses."

Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

For example, you can use an accelerated search command to perform an
author authority search or a title keyword search.

The search abbreviation is included in parentheses following the search
name:

A system menu

Any screen where you can enter "SO" to start a search over

Certain abbreviations may work at one prompt but not at another.

To perform an accelerated search, follow these instructions:

That item's full BIB display appears.

Write down any information you need, or select the item if you are
placing a hold.

Alphabetical Title Search.

Enter the line number of the alphabetical title search option.

A BIB summary screen appears, listing the titles that match your entry:

When you access the BIB record you want, you can print the screen, write
down any information you need, or select the item if you are placing a
hold.

The cursor symbol (>) appears on the alphabetical list next to the
heading that most closely matches your request.

Byzantine empire

Lotus Test Utterances


When you are editing a document, you want to be able to move quickly
through the pages.

Scrolling changes the display but does not move the insertion point.

To use keyboard shortcuts to navigate a document

Move the mouse pointer until the I-beam is at the beginning of the text
you want to select.

For information, refer to "Undoing one or more actions" in this chapter.

Ami Pro provides three modes for typing text.

In Insert mode, you insert text at the position of the insertion point
and any existing text automatically moves.

If you press BACKSPACE, Ami Pro deletes the selected text and one
character to the left of the selected text.

You can disable Drag & Drop.

If you want to move the text, position the mouse pointer anywhere in the
selected text and drag the mouse until the insertion point is in the
desired location.

The contents of the Clipboard appear in the desired location.

To move or copy text between documents

Choose Edit/Cut or Edit/Copy to place the selected text on the Clipboard.

If the document into which you want to paste the text is already open,
you can switch to that window by clicking in it or by choosing the Window
menu and selecting the desired document.

Press SHIFT+INS or CTRL+V.

Select the text you want to protect.

Permanently inserts the date the current document was created.

You can select Off, 1, 2, 3, or 4 levels.

When you want to reverse an action, choose Edit/Undo.

Modifying the Appearance of Text



Trados Test Utterances


The following are suggestions on how to proceed when using the
Translator's Workbench together with Word for Windows 6.0.

Another important category of non-textual data is what is referred to as
"hidden text."

Alternatively, choose the Options... menu item from Word's Tools menu.

Thus, you will make sure that you see all the information that the
Workbench manages during translation:

Always put one abbreviation on a line, followed by a period.

During the translation of this example, the Workbench should ignore the
second sentence when moving from the first sentence to the third one.

In Word, you can immediately recognize a 100% match from the color of the
target field.

The TWB1 button, also labeled Translate Until Next Fuzzy Match, tells the
Workbench to do precisely this.

You can also use the shortcut [Alt] + [x] on the separate numeric keypad
to start this function.

That is, these words make the source sentence longer or shorter than the
TM sentence.

Likewise, if something has been left out in the source sentence, you will
have to delete the corresponding parts in the suggested translation as
well.

Automatic Substitution of Interchangeable Elements

If the Workbench cannot find any fuzzy match, it will display a
corresponding message ("No match") in the lower right corner of its
status bar and you will be presented with an empty yellow target field.

Then go on translating until you want to insert the next translation.

Select the text to be copied in the Concordance window, usually the
translation of the sentence part that you have searched for.

The same goes for formatting:

Making Corrections

If you would like to make corrections to translations after their initial
creation, you should always do this in TM mode so that the corrections
will be stored in Translation Memory as well as in your document.

But consider the following example where text is used within an index
entry field:

If a perfect or fuzzy match is found, the Workbench will again
automatically transfer its translation into the target field in WinWord.
Appendix II
Sample Parser Outputs
The following pages contain the output from the eight parsing systems
for five test utterances drawn from the set of 60 in Appendix I. Very
slight modifications had to be made to the format of some parse trees
to allow them to fit on the page. Those trees affected are marked with
a star.
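
Several of the outputs below, DESPAR's in particular, give one token per
line together with the position of its head, so they can be read
mechanically. The following is a minimal sketch in Python, assuming the
column layout visible in the samples (tag, word, position, "-->", head
position, marker); the helper name read_despar and the exact layout are
assumptions made for illustration, not part of any parser's distribution.

# Illustrative sketch: read DESPAR-style dependency lines such as
#   "VB enter 1 --> 11 -"
# into (position, word, head) triples. Layout assumed from the samples.
def read_despar(lines):
    triples = []
    for line in lines:
        parts = line.split()
        if "-->" not in parts:
            continue                  # skip lines such as "#016"
        arrow = parts.index("-->")
        word = parts[arrow - 2]       # token form
        pos = int(parts[arrow - 1])   # 1-based token position
        head = int(parts[arrow + 1])  # 0 marks the root
        triples.append((pos, word, head))
    return triples

if __name__ == "__main__":
    sample = ["VB enter 1 --> 11 -", "NN number 4 --> 1 + OBJ"]
    for pos, word, head in read_despar(sample):
        print(pos, word, "-> head", head)

Run on the D20 sample below, such a reader simply reports the head
position of each token, with 0 standing for the root of the utterance.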

ALICE: D20*
Enter the line number of the alphabetical title search option.

("Enter" "the" "line" "number" "of" "the" "alphabetical" "title" "search"


"option")

;;; GC: 292490 words [1169960 bytes] of dynamic storage in use.


;;; 166260 words [665040 bytes] of free storage available before a GC.
;;; 625010 words [2500040 bytes] of free storage available if GC is
;;; disabled.

Parsing time: 34.832098 seconds.


String length: 10
Chart size: 91
spanning edges are: NIL
Chart size: 91

Fragment No. 1
>From 0 To 1

(NOUN "Enter")
Fragment No. 2
>From 1 To 4

(NP (DET "the") (NOUN (ADJ "line") (NOUN "number")))


Fragment No. 3
>From 4 To 10

(NP-MOD (NP-CONJ "of")


(NP (DET "the")
(NOUN (ADJ "alphabetical")
(NOUN (ADJ "title") (NOUN (ADJ "search") (NOUN "option"))))))

ENGCG: D20 (Analysis I)


Enter the line number of the alphabetical title search option.

"<Enter>"
"enter" <*> <SVO> <SV> <P/in> <P/for> V IMP VFIN @+FMAINV
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<line>"
"line" N NOM SG @NN>
"<number>"
"number" N NOM SG/PL @OBJ
"<of>"
"of" PREP @<NOM-OF
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<alphabetical>"
"alphabetical" <DER:ic> <DER:al> A ABS @AN>
"<title>"
"title" N NOM SG @NN> @<P
"<search>"
"search" N NOM SG @NN>
"<option>"
"option" N NOM SG @<P
"<$.>"

LPARSER: D20 (Analysis I)


Enter the line number of the alphabetical title search option.

Accepted
Unique linkage, cost vector = (0, 0, 13)

+--------------------J-------
| +------------------D------
+---------O---------+ | | +-------------A-
| +------D------+ | | | +----
+---W--+ | +---AN--+--M--+ | | |
| | | | | | | | |
///// enter.v the line.n number.n of the alphabetical.a title.n

------------+
------------+
------------+
---AN-------+
+---AN--+
| |
search.n option

PRINCIPAR: D20
Enter the line number of the alphabetical title search option.

;; time: 0.3 sec.


;; Fragments: 1
(S
(VP (Vbar (V (V_[NP]
(V_[NP] Enter)
(NP
(Det the)
(Nbar
(N line number)
(PP (Pbar (P
(P of)
(NP
(Det the)
(Nbar
(AP (Abar (A alphabetical)))
(N title)
(N search option))))))))))))
.)

RANLT: D20 (Analysis I)


Enter the line number of the alphabetical title search option.

(VP/NP enter
(N2+/DET1a the
(N2-
(N1/PPMOD (N1/N (N/COMPOUND1 line number))
(P2/P1
(P1/NPa of
(N2+/DET1a the
(N2-
(N1/N
(N/COMPOUND2 alphabetical
(N/COMPOUND1 title (N/COMPOUND1 search option))))))))))))

SEXTANT: D20 (Analysis I)*


Enter the line number of the alphabetical title search option.

162 VP 101 Enter enter INF 0 0


162 NP 2 the the DET 1 1 3 (number) DET
162 NP 2 line line NOUN 2 1 3 (number) NN
162 NP* 2 number number NOUN 3 1 0 (enter) DOBJ
162 NP 2 of of PREP 4 1 9 (option) PREP
162 NP 2 the the DET 5 1 9 (option) DET
162 NP 2 alphabetical alphabetical ADJ 6 2 7 (title) ADJ
9 (option) ADJ
162 NP 2 title title NOUN 7 2 8 (search) NN
9 (option) NN
162 NP 2 search search NOUN 8 1 9 (option) NN
162 NP* 2 option option NOUN 9 2 3 (number) NNPREP
0 (enter) IOBJ-of
162 -- 0 . . . 10 0

DESPAR: D20 (Analysis I)


Enter the line number of the alphabetical title search option.

#016
VB enter 1 --> 11 -
DT the 2 --> 4 [
NN line 3 --> 4 +
NN number 4 --> 1 + OBJ
IN of 5 --> 4 ]
DT the 6 --> 10 [
JJ alphabetical 7 --> 8 +
NN title 8 --> 9 +
NN search 9 --> 10 +
NN option 10 --> 5 +
. . 11 --> 0 ]

TOSCA: D20 (Analysis I)*


Enter the line number of the alphabetical title search option.

<tparc>
1 analysis in 2 seconds with TOSCA-ICE/V0.3.950102
</tparc>
<tparn fun=- cat=TXTU>
<tparn fun=UTT cat=S att=(-su,act,imper,motr,pres,unm)>
<tparn fun=V cat=VP att=(act,imper,motr,pres)>
<tparn fun=MVB cat=LV att=(imper,motr,pres)> Enter
</tparn></tparn>
<tparn fun=OD cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> line number
</tparn>
<tparn fun=NPPO cat=PP>
<tparn fun=P cat=PREP> of
</tparn>
<tparn fun=PC cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> alphabetical title search
option
</tparn></tparn></tparn></tparn></tparn>
<tparn fun=PUNC cat=PM att=(per)> .
</tparn></tparn>

ALICE: D22*
Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

("Displays" "the" "records" "that" "have" "a" "specific" "word" "or"


"words" "in" "the" "TITLE" "," "CONTENTS" "," "SUBJECT" "," "or"
"SERIES" "fields" "of" "the" "BIB" "record" "," "depending" "on"
"which" "fields" "have" "been" "included" "in" "each" "index")

;;; GC: 297762 words [1191048 bytes] of dynamic storage in use.


;;; 160988 words [643952 bytes] of free storage available before a GC.
;;; 619738 words [2478952 bytes] of free storage available if GC is
;;; disabled.

Parsing time: 101.074216 seconds.


String length: 36
Chart size: 267
spanning edges are: NIL
Chart size: 267

Fragment No. 1
>From 0 To 1

(NP "Displays")
Fragment No. 2
>From 1 To 3

(NP (DET "the") (NOUN "records"))


Fragment No. 3
>From 3 To 4

(DET "that")
Fragment No. 4
>From 4 To 5

(UNK-CAT "have")
Fragment No. 5
>From 5 To 8

(POSTMOD (UNK-CAT "a") (NOUN (ADJ "specific") (NOUN "word")))


Fragment No. 6
>From 8 To 10

((NOUN MOD) (UNK-CAT "or") (NOUN "words"))


Fragment No. 7
>From 10 To 11

(PREP "in")
Fragment No. 8
>From 11 To 12

(DET "the")

Fragment No. 9
>From 12 To 13

(NP "TITLE")
Fragment No. 10
>From 13 To 14

(UNK-CAT ",")
Fragment No. 11
>From 14 To 15

(NP "CONTENTS")
Fragment No. 12
>From 15 To 16

(UNK-CAT ",")
Fragment No. 13
>From 16 To 17

(NP "SUBJECT")
Fragment No. 14
>From 17 To 18

(UNK-CAT ",")
Fragment No. 15
>From 18 To 21

((NOUN MOD) (UNK-CAT "or") (NOUN (ADJ "SERIES") (NOUN "fields")))


Fragment No. 16
>From 21 To 25

(NP-MOD (NP-CONJ "of")


(NP (DET "the") (NOUN ((NOUN MOD) "BIB") (NOUN "record"))))
Fragment No. 17
>From 25 To 26

(UNK-CAT ",")
Fragment No. 18
>From 26 To 27

(PRESP "depending")
Fragment No. 19
>From 27 To 28

(UNK-CAT "on")
Fragment No. 20
>From 28 To 31

(UNK-CAT (NP (DET "which") (NOUN "fields")) (UNK-CAT "have"))


Fragment No. 21
>From 31 To 33

(VP-PASS (AUX "been") (PPART "included"))


Fragment No. 22
>From 33 To 36

(SENT-MOD (UNK-CAT "in") (NP (DET "each") (NOUN "index")))



ENGCG: D22 (Analysis I)


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

"<Displays>"
"display" <*> <SVO> V PRES SG3 VFIN @+FMAINV
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<records>"
"record" N NOM PL @OBJ
"<that>"
"that" <NonMod> <**CLB> <Rel> PRON SG/PL @SUBJ
"<have>"
"have" <SVO> <SVOC/A> V PRES -SG3 VFIN @+FMAINV
"<a>"
"a" <Indef> DET CENTRAL ART SG @DN>
"<specific>"
"specific" <DER:ic> A ABS @AN>
"<word>"
"word" N NOM SG @OBJ
"<or>"
"or" CC @CC
"<words>"
"word" N NOM PL @OBJ
"<in>"
"in" PREP @<NOM @ADVL
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<TITLE>"
"title" <*> N NOM SG @<P
"<$,>"
"<CONTENTS>"
"content" <*> N NOM PL @SUBJ @APP @<P @<NOM
"<$,>"
"<SUBJECT>"
"subject" <*> N NOM SG @APP @<P @<NOM
"<$,>"
"<or>"
"or" CC @CC
"<SERIES>"
"series" <*> N NOM SG/PL @NN>
"<fields>"
"field" N NOM PL @SUBJ @APP @<P
"<of>"
"of" PREP @<NOM-OF
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<BIB>"
"bib" <*> N NOM SG @NN>
"<record>"
"record" <SVO> <SV> V PRES -SG3 VFIN @+FMAINV
"record" N NOM SG @<P

"<$,>"
"<depending>"
"depend" <SV> PCP1 @-FMAINV
"<on>"
"on" PREP @ADVL
"<which>"
"which" DET CENTRAL WH SG/PL @DN>
"<fields>"
"field" N NOM PL @SUBJ
"<have>"
"have" <SVO> <SVOC/A> V PRES -SG3 VFIN @+FAUXV
"<been>"
"be" <SVC/N> <SVC/A> PCP2 @-FAUXV
"<included>"
"include" <SVO> <P/in> PCP2 @-FMAINV
"<in>"
"in" PREP @ADVL
"<each>"
"each" <Quant> DET CENTRAL SG @QN>
"<index>"
"index" N NOM SG @<P
"<$.>"

LPARSER: D22 (Analysis I)


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

Accepted (4424 linkages, 192 with no P.P. violations)


Linkage 1, cost vector = (0, 1, 72)

+--------------------------------------------------------
+-------------------------------------------X------------
| +------------------------------
| | +------------O---
| +------O------+------B------+ +-------D----
+-CL-+---S--+ +---D--+---C---+--Z--+ | +----A
| | | | | | | | |
///// it displays.v the records.n that have.v a specific.a

------------------X---------------------------------------------
-------------------------------+
X------------------------------+
--------+ +------------------X-------------
---+ | +---J---+ | +------------+-----AN-
---+----+----+--M--+ +-DP-+ | +----+----+ +--+----+
| | | | | | | | | | | | |
word.n or words.n in the TITLE , CONTENTS , SUBJECT , or SERIES

-----------------------------+
|
|
----+ +------J------+ |
----+ | +-----D----+ +-------M------+-----CL----+
+--M--+ | +--AN--+--X-+---M--+ +-J-+ +---S
| | | | | | | | | |
fields.n of the BIB record.n , depending.v on which fields.n

+----J----+
---+--T--+---V---+--EV--+ +--D--+
| | | | | |
have.a been included.v in each index.n

PRINCIPAR: D22*
Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

;; time: 2.02 sec.


;; Fragments: 1
(S
(VP (Vbar (V (V_NP
(V_NP Displays)
(NP
(Det the)
(Nbar
(N records)
(CP
Op[1]
(Cbar
(C that)
(IP
t[1]
(Ibar (VP (Vbar (V (V_NP
(V_NP have)
(NP
(NP
(Det a)
(Nbar
(AP (Abar (A specific)))
(N word)))
or
(NP (Nbar
(N words)
(PP (Pbar (P
(P in)
(NP
(NP
(Det the)
(Nbar (N TITLE)))
,
(NP
(NP (Nbar (N CONTENTS)))
,
(NP
(NP (Nbar (N SUBJECT)))
,
or
(NP (Nbar
(N SERIES fields)
(PP (Pbar (P
(P of)
(NP
(Det the)
(Nbar (N BIB record))))))))))
,))))

(VP (Vbar (V (V_CP


(V_CP depending on)
(CP
(NP[2]
(Det which)
(Nbar (N fields)))
(Cbar (IP
t[2]
(Ibar
(Have have)
(Be been)
(VP (Vbar
(V (V_NP
(V_NP included)
t[2]))
(PP (Pbar (P
(P in)
(NP
(Det each)
(Nbar (N index))))))
)))))))))))))))))))))))))))
.)

RANLT: D22 (Analysis I)*


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

(N1/PPMOD
(N1/RELMOD1
(N1/VPMOD1
(N1/RELMOD1 (N1/N displays)
(S/THATLESSREL
(S1b (N2+/DET1a the (N2- (N1/N records))) (A2/ADVMOD1/- (A1/A that))
(VP/MOD1 (VP/NP have (TRACE1 E))
(X2/MOD3
(N2+/PART5
(N2+/N1PROa
(N1/PPMOD
(N1/COORD2A (N1/PN a)
(N1/APMOD1 (A2/ADVMOD1/- (A1/A specific)) (N1/N word))
(CONJ/N1 or (N1/N words)))
(P2/P1
(P1/NPa in
(N2+/APPOS/2
(N2+/DET1a the
(N2/COORD3A
(N2-
(N1/RELMOD1 (N1/N title)
(S/THATLESSREL
(S1a (N2+/N2-a (N2- (N1/N contents)))
(VP/NP subject (TRACE1 E))))))
(CONJ/N2 or (N2- (N1/N series)))))
(N2+/PN fields))))))
of (N2+/ADJ1 the
(A2/ADVMOD1/-
(A1/A
(A/COMPOUND bib record))))))))))
(VP/INTR depending))
(S/ADVBLa1 (X2/MOD2 (P2/P1 (P1/NPb on (N2+/PRO which))))
(S2 (N2+/N2-a (N2- (N1/N fields)))
(VP/HAVE have
(VP/BE_PRD been
(PRD2
(A2/ADVMOD1/-
(A1/A included))))))))
(P2/P1
(P1/NPa in
(N2+/QUA
(A2/ADVMOD1/-
(A1/A each)) (N2- (N1/PN index))))))

SEXTANT: D22 (Analysis I)*


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

51 VP 101 Displays display ACTVERB 0 0


51 NP 2 the the DET 1 1 2 (record) DET
51 NP* 2 records record NOUN 2 1 0 (display) DOBJ
51 -- 0 that that CS 3 0
51 VP 102 have have INF 4 1 2 (record) SUBJ
51 NP 3 a a DET 5 1 7 (word) DET
51 NP 3 specific specific ADJ 6 1 7 (word) ADJ
51 NP* 3 word word NOUN 7 1 4 (have) DOBJ
51 NP 3 or or CC 8 0
51 NP* 3 words word NOUN 9 1 4 (have) DOBJ
51 NP 3 in in PREP 10 1 12 (TITLE) PREP
51 NP 3 the the DET 11 1 12 (TITLE) DET
51 NP* 3 TITLE TITLE NOUN 12 2 9 (word) NNPREP
4 (have) IOBJ-in
51 -- 0 , , CM 13 0
51 NP* 4 CONTENTS CONTENTS NOUN 14 0
51 -- 0 , , CM 15 0
51 NP* 5 SUBJECT SUBJECT NOUN 16 0
51 NP 5 , , CM 17 0
51 NP 5 or or CC 18 0
51 NP* 5 SERIES SERIES NOUN 19 0
51 VP 103 fields field ACTVERB 20 2 19 (SERIES) SUBJ
16 (SUBJECT) SUBJ
51 NP 6 of of PREP 21 1 24 (record) PREP
51 NP 6 the the DET 22 1 24 (record) DET
51 NP 6 BIB BIB NOUN 23 1 24 (record) NN
51 NP* 6 record record NOUN 24 1 20 (field) IOBJ-of
51 -- 0 , , CM 25 0
51 NP* 7 depending depend INGVERB 26 1 32 (include) DOBJ
51 -- 0 on on PREP 27 0
51 -- 0 which which CS 28 0
51 VP 105 fields field ACTVERB 29 0
51 VP 105 have have INF 30 0
51 VP 105 been been BE 31 0
51 VP 105 included include PPART 32 0
51 NP 9 in in PREP 33 1 35 (index) PREP
51 NP 9 each each DET 34 1 35 (index) DET
51 NP* 9 index index NOUN 35 1 32 (include) IOBJ-in
51 -- 0 . . . 36 0

DESPAR: D22 (Analysis I)


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

VBZ displays 1 --> 37 -


DT the 2 --> 3 [
NNS records 3 --> 1 + OBJ
WDT that 4 --> 3 "
VBP have 5 --> 1 ]
DT a 6 --> 8 [
JJ specific 7 --> 8 +
NN word 8 --> 5 + OBJ
CC or 9 --> 8 ]
NNS words 10 --> 9 [
IN in 11 --> 8 ]
DT the 12 --> 13 [
NN title 13 --> 11 +
, , 14 --> 13 ]
NNS contents 15 --> 14 [
, , 16 --> 15 ]
NN subject 17 --> 16 [
, , 18 --> 17 ]
CC or 19 --> 17 -
NN series 20 --> 21 [
NNS fields 21 --> 19 +
IN of 22 --> 21 ]
DT the 23 --> 25 [
NN bib 24 --> 25 +
NN record 25 --> 22 +
, , 26 --> 25 ]
VBG depending 27 --> 5 -
IN on 28 --> 27 -
WDT which 29 --> 30 [
NNS fields 30 --> 28 +
VBP have 31 --> 33 ]
VBN been 32 --> 33 -
VBN included 33 --> 27 -
IN in 34 --> 33 -
DT each 35 --> 36 [
NN index 36 --> 34 +
. . 37 --> 0 ]

TOSCA: D22 (Analysis I)*


Displays the records that have a specific word or words in the TITLE,
CONTENTS, SUBJECT, or SERIES fields of the BIB record, depending on which
fields have been included in each index.

<tparc>
47 analyses in 93 seconds with TOSCA-ICE/V0.3.950102
</tparc>
<tparn fun=- cat=TXTU>
<tparn fun=UTT cat=S att=(-su,act,decl,indic,motr,pres,unm)>
<tparn fun=V cat=VP att=(act,indic,motr,pres)>
<tparn fun=MVB cat=LV att=(indic,motr,pres)> Displays
</tparn></tparn>
<tparn fun=OD cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,plu)> records
</tparn>
<tparn fun=NPPO cat=CL att=(act,indic,motr,pres,rel,unm)>
<tparn fun=SU cat=NP>
<tparn fun=NPHD cat=PN att=(rel)> that
</tparn></tparn>
<tparn fun=V cat=VP att=(act,indic,motr,pres)>
<tparn fun=MVB cat=LV att=(indic,motr,pres)> have
</tparn></tparn>
<tparn fun=OD cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(indef)> a
</tparn></tparn>
<tparn fun=NPPR cat=AJP att=(attru)>
<tparn fun=AJHD cat=ADJ att=(attru)> specific
</tparn></tparn>
<tparn fun=NPHD cat=COORD>
<tparn fun=CJ cat=N att=(com,sing)> word
</tparn>
<tparn fun=COOR cat=CONJN att=(coord)> or
</tparn>
<tparn fun=CJ cat=N att=(com,plu)> words
</tparn></tparn></tparn>
<tparn fun=A cat=PP>
<tparn fun=P cat=PREP> in
</tparn>
<tparn fun=PC cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,plu)> TITLE, CONTENTS, SUBJECT, or
SERIES fields
</tparn>
<tparn fun=NPPO cat=PP>
<tparn fun=P cat=PREP> of
</tparn>

<tparn fun=PC cat=NP>


<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> BIB record
</tparn></tparn></tparn></tparn></tparn></tparn></tparn>
<tparn fun=PUNC cat=PM att=(comma)> ,
</tparn>
<tparn fun=A cat=PP>
<tparn fun=P cat=PREP> depending on
</tparn>
<tparn fun=PC cat=CL att=(indic,indrel,intr,pass,pres,unm)>
<tparn fun=SU cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=PN att=(inter)> which
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,plu)> fields
</tparn></tparn>
<tparn fun=V cat=VP att=(indic,intr,pass,perf,pres)>
<tparn fun=OP cat=AUX att=(indic,perf,pres)> have
</tparn>
<tparn fun=AVB cat=AUX att=(indic,pass,pastp)> been
</tparn>
<tparn fun=MVB cat=LV att=(indic,motr,pastp)> included
</tparn></tparn>
<tparn fun=A cat=PP>
<tparn fun=P cat=PREP> in
</tparn>
<tparn fun=PC cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTPE cat=PN att=(sing,univ)> each
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> index
</tparn></tparn></tparn></tparn></tparn></tparn>
<tparn fun=PUNC cat=PM att=(per)> .
</tparn></tparn>

ALICE: L20*
Press SHIFT+INS or CTRL+V.

("Press_SHIFT" "+" "INS" "or" "CTRL" "+" "V")

;;; GC: 341468 words [1365872 bytes] of dynamic storage in use.


;;; 117282 words [469128 bytes] of free storage available before a GC.
;;; 576032 words [2304128 bytes] of free storage available if GC is
;;; disabled.

Parsing time: 30.606719 seconds.


String length: 7
Chart size: 99
spanning edges are: NIL
Chart size: 99

Fragment No. 1
>From 0 To 3

(NP (NP-MOD "Press_SHIFT") (NP (NP-MOD "+") (NP "INS")))


Fragment No. 2
>From 3 To 7

(NP-MOD (NP-CONJ "or") (NP (NP-MOD "CTRL") (NP (NP-MOD "+") (NP "V"))))

ENGCG: L20 (Analysis I)


Press SHIFT+INS or CTRL+V.

"<Press>"
"press" <*> <SVO> <SV> V IMP VFIN @+FMAINV
"<SHIFT+INS>"
"SHIFT+INS" <?> <**> <NoBaseformNormalisation> N NOM SG/PL @OBJ
"<or>"
"or" CC @CC
"<CTRL+V>"
"CTRL+V" <?> <**> N NOM SG @OBJ
"<$.>"

LPARSER: L20 (Analysis I)


Press SHIFT+INS or CTRL+V.

Accepted
Unique linkage, cost vector = (0, 0, 1)

+-------O-------+
+---W--+ +------+----+
| | | | |
///// press.v SHIFT+INS or CTRL+V

PRINCIPAR: L20
Press SHIFT+INS or CTRL+V.

;; time: 0.05 sec.


;; Fragments: 1
(S
(VP (Vbar (V (V_NP
(V_NP Press)
(NP
(NP (Nbar (N SHIFT+INS)))
or
(NP (Nbar (N CTRL+V))))))))
.)

RANLT: L20 (Analysis I)


Press SHIFT+INS or CTRL+V.

(VP/NP press
(N2/COORD3A (N2+/N2-a (N2- (N1/N shiftins)))
(CONJ/N2 or (N2+/N2-a (N2- (N1/N ctrlv))))))

SEXTANT: L20 (Analysis I)*


Press SHIFT+INS or CTRL+V.

93 VP 101 Press press INF 0 0


93 NP* 2 SHIFT-INS SHIFT-INS NOUN 1 1 0 (press) DOBJ
93 NP 2 or or CC 2 0
93 NP* 2 CTRL-V CTRL-V NOUN 3 1 0 (press) DOBJ
93 -- 0 . . . 4 0

DESPAR: L20 (Analysis I)


Press SHIFT+INS or CTRL+V.

VB press 1 --> 5 -
NP shift+ins 2 --> 1 [ OBJ
CC or 3 --> 2 ]
NP ctrl+v 4 --> 3 [
. . 5 --> 0 ]

TOSCA: L20 (Analysis I)


Press SHIFT+INS or CTRL+V.

<tparc>
2 analyses in 2 seconds with TOSCA-ICE/V0.3.950102
</tparc>
<tparn fun=- cat=TXTU>
<tparn fun=UTT cat=S att=(-su,act,imper,motr,pres,unm)>
<tparn fun=V cat=VP att=(act,imper,motr,pres)>
<tparn fun=MVB cat=LV att=(imper,motr,pres)> Press
</tparn></tparn>
<tparn fun=OD cat=COORD>
<tparn fun=CJ cat=NP>
<tparn fun=NPHD cat=N att=(prop,sing)> SHIFT + INS
</tparn></tparn>
<tparn fun=COOR cat=CONJN att=(coord)> or
</tparn>
<tparn fun=CJ cat=NP>
<tparn fun=NPHD cat=N att=(prop,sing)> CTRL + V
</tparn></tparn></tparn></tparn>
<tparn fun=PUNC cat=PM att=(per)> .
</tparn></tparn>

ALICE: L22*
Select the text you want to protect.

("Select" "the" "text" "you" "want" "to" "protect")

;;; GC: 311968 words [1247872 bytes] of dynamic storage in use.


;;; 146782 words [587128 bytes] of free storage available before a GC.
;;; 605532 words [2422128 bytes] of free storage available if GC is
;;; disabled.

Parsing time: 30.436594 seconds.


String length: 7
Chart size: 86
spanning edges are: NIL
Chart size: 86

Fragment No. 1
>From 0 To 5

(SENT (SENT-MOD (UNK-CAT "Select") (NP (DET "the") (NOUN "text")))


(SENT (VP-ACT (NP "you") (V-TR "want")) (NP NULL-PHON)))
Fragment No. 2
>From 5 To 7

(SENT-MOD (UNK-CAT "to") (NP "protect"))



ENGCG: L22 (Analysis I)


Select the text you want to protect.

"<Select>"
"select" <*> <SVO> <SV> <P/for> V IMP VFIN @+FMAINV
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<text>"
"text" N NOM SG @OBJ
"<you>"
"you" <NonMod> PRON PERS NOM SG2/PL2 @SUBJ
"<want>"
"want" <SVOC/A> <SVO> <SV> <P/for> V PRES -SG3 VFIN @+FMAINV
"<to>"
"to" INFMARK> @INFMARK>
"<protect>"
"protect" <SVO> V INF @-FMAINV
"<$.>"

LPARSER: L22 (Analysis I)


Select the text you want to protect.

Accepted
Unique linkage, cost vector = (0, 0, 4)

+-----O-----+---------B---------+
+---W---+ +--D--+--C-+--S-+-TO+--I--+
| | | | | | | |
///// select.v the text.n you want to protect.v

PRINCIPAR: L22
Select the text you want to protect.

;; time: 0.13 sec.


;; Fragments: 1
(S
(VP (Vbar (V (V_NP
(V_NP Select)
(NP
(Det the)
(Nbar
(N text)
(CP
Op[1]
(Cbar (IP
(NP (Nbar (N you)))
(Ibar (VP (Vbar (V (V_CP
(V_CP want)
(CP (Cbar (IP
PRO
(Ibar
(Aux to)
(VP (Vbar (V (V_NP
(V_NP protect)
t[1]))))))))))))))))))))))
.)

RANLT: L22 (Analysis I)


Select the text you want to protect.

(VP/NP select
(N2+/DET1a the
(N2-
(N1/INFMOD
(N1/RELMOD1 (N1/N text)
(S/THATLESSREL (S1a (N2+/PRO you) (VP/NP want (TRACE1 E)))))
(VP/TO to (VP/NP protect (TRACE1 E)))))))

SEXTANT: L22 (Analysis I)*


Select the text you want to protect.

134 VP 101 Select select INF 0 0


134 NP 2 the the DET 1 1 2 (text) DET
134 NP* 2 text text NOUN 2 1 0 (select) DOBJ
134 NP* 3 you you PRON 3 0
134 VP 102 want want INF 4 0
134 VP 102 to to TO 5 0
134 VP 102 protect protect INF 6 1 3 (you) SUBJ
134 -- 0 . . . 7 0

DESPAR: L22 (Analysis I)


Select the text you want to protect.

VB select 1 --> 8 -
DT the 2 --> 3 [
NN text 3 --> 1 + OBJ
PP you 4 --> 5 " SUB
VBP want 5 --> 3 ]
TO to 6 --> 7 -
VB protect 7 --> 5 -
. . 8 --> 0 -

TOSCA: L22 (Analysis I)


Select the text you want to protect.

Cannot be parsed due to raising.



ALICE: T6*
That is, these words make the source sentence longer or shorter than the
TM sentence.

("That" "is" "," "these" "words" "make" "the" "source" "sentence"


"longer" "or" "shorter" "than" "the" "TM" "sentence")

;;; GC: 349602 words [1398408 bytes] of dynamic storage in use.


;;; 109148 words [436592 bytes] of free storage available before a GC.
;;; 567898 words [2271592 bytes] of free storage available if GC is
;;; disabled.

Parsing time: 46.330783 seconds.


String length: 16
Chart size: 142
spanning edges are: NIL
Chart size: 142

Fragment No. 1
>From 0 To 1

(DET "That")
Fragment No. 2
>From 1 To 2

(UNK-CAT "is")
Fragment No. 3
>From 2 To 3

(UNK-CAT ",")
Fragment No. 4
>From 3 To 10

(SENT
(VP-ACT (V-TR (NP (DET "these") (NOUN "words")) (V-BITR "make"))
(NP (DET "the") (NOUN (ADJ "source") (NOUN "sentence"))))
(NP "longer"))
Fragment No. 5
>From 10 To 12

((NOUN MOD) (UNK-CAT "or") (NOUN "shorter"))


Fragment No. 6
>From 12 To 16

(NP (NP-MOD "than") (NP (DET "the")


(NOUN ((NOUN MOD) "TM") (NOUN "sentence"))))

ENGCG: T6 (Analysis III)*


That is, these words make the source sentence longer or shorter than the
TM sentence.

"<That=is>"
"that=is" <*> ADV ADVL @ADVL
"<$,>"
"<these>"
"this" DET CENTRAL DEM PL @DN>
"<words>"
"word" N NOM PL @SUBJ
"<make>"
"make" <SVC/A> <SVOC/N> <SVOC/A> <into/SVOC/A> <SVO> <SV>
<InfComp> <P/of> <P/for> V PRES -SG3 VFIN @+FMAINV
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<source>"
"source" N NOM SG @NN>
"<sentence>"
"sentence" N NOM SG @OBJ
"<longer>"
"long" A CMP @PCOMPL-O
"<or>"
"or" CC @CC
"<shorter>"
"short" A CMP @PCOMPL-S @PCOMPL-O @<NOM
"<than>"
"than" PREP @<NOM
"<the>"
"the" <Def> DET CENTRAL ART SG/PL @DN>
"<TM>"
"tM" <*> ABBR NOM SG @NN>
"<sentence>"
"sentence" N NOM SG @<P
"<$.>"

LPARSER: T6 (Analysis III)


That is, these words make the source sentence longer or shorter than the
TM sentence.

Not accepted (no linkage exists)



PRINCIPAR: T6
That is, these words make the source sentence longer or shorter than the
TM sentence.

;; time: 0.39 sec.


;; Fragments: 3
(S
(NP (Nbar (N That)))
is
,
(CP (Cbar (IP
(NP
(Det these)
(Nbar (N words)))
(Ibar (VP (Vbar (V (V_IP
(V_IP make)
(IP
(NP
(Det the)
(Nbar (N source sentence)))
(Ibar (AP
(AP (Abar (A longer)))
or
(AP (Abar
(A shorter)
(PP (Pbar (P
(P than)
(NP
(Det the)
(Nbar (N TM sentence)))))))))))))))))))
.)

RANLT: T6 (Analysis I)
That is, these words make the source sentence longer or shorter than the
TM sentence.

(S1a (N2+/PRO that)


(VP/BE_NP is
(N2+/N1PROa
(N1/POST_APMOD1
(N1/RELMOD2 (N1/PRO2 these)
(S/THATLESSREL
(S1a (N2+/N2-a (N2- (N1/N words)))
(VP/NP_NP(MSLASH1b) make
(N2+/DET1a the (N2- (N1/N (N/COMPOUND1 source sentence))))
(TRACE1 E)))))
(A2/COORD2 (A2/ADVMOD1/- (A1/A longer))
(CONJ/A2 or
(A2/COMPAR1 (A1/A shorter)
(P2/P1
(P1/NPa than
(N2+/DET1a the (N2- (N1/N (N/COMPOUND1 tm sentence)))))))))))))

SEXTANT: T6 (Analysis I)*


That is, these words make the source sentence longer or shorter than the
TM sentence.

116 -- 0 That that DET 0 0


116 VP 101 is is BE 1 0
116 -- 0 , , CM 2 0
116 NP 3 these these DET 3 1 4 (word) DET
116 NP* 3 words word NOUN 4 0
116 VP 102 make make INF 5 1 4 (word) SUBJ
116 NP 4 the the DET 6 1 8 (sentence) DET
116 NP 4 source source NOUN 7 1 8 (sentence) NN
116 NP* 4 sentence sentence NOUN 8 1 5 (make) DOBJ
116 NP 4 longer long ADJ 9 1 11 (short) ADJ
116 NP 4 or or CC 10 0
116 NP* 4 shorter short ADJ 11 2 8 (sentence) ADJ
5 (make) DOBJ
116 -- 0 than than CS 12 0
116 NP 5 the the DET 13 1 15 (sentence) DET
116 NP 5 TM TM NOUN 14 1 15 (sentence) NN
116 NP* 5 sentence sentence NOUN 15 0
116 -- 0 . . . 16 0

DESPAR: T6 (Analysis I)
That is, these words make the source sentence longer or shorter than the
TM sentence.

WDT that 1 --> 6 [ SUB


VBZ is 2 --> 1 ]
, , 3 --> 2 -
DT these 4 --> 5 [
NNS words 5 --> 6 + SUB
VBP make 6 --> 17 ]
DT the 7 --> 9 [
NN source 8 --> 9 +
NN sentence 9 --> 6 + OBJ
RBR longer 10 --> 6 ]
CC or 11 --> 10 -
JJR shorter 12 --> 11 -
IN than 13 --> 12 -
DT the 14 --> 16 [
JJ tm 15 --> 16 +
NN sentence 16 --> 13 +
. . 17 --> 0 ]
TOSCA: T6 (Analysis I)
That is, these words make the source sentence longer or shorter than the
TM sentence.

<tparc>
2 analyses in 3 seconds with TOSCA-ICE/V0.3.950102
</tparc>
<tparn fun=- cat=TXTU>
<tparn fun=UTT cat=S att=(act,cxtr,decl,indic,pres,unm)>
<tparn fun=A cat=CON> That is
</tparn>
<tparn fun=PUNC cat=PM att=(comma)> ,
</tparn>
<tparn fun=SU cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=PN att=(dem,plu)> these
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,plu)> words
</tparn></tparn>
<tparn fun=V cat=VP att=(act,cxtr,indic,pres)>
<tparn fun=MVB cat=LV att=(cxtr,indic,pres)> make
</tparn></tparn>
<tparn fun=OD cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> source sentence
</tparn></tparn>
<tparn fun=CO cat=AJP att=(comp,prd)>
<tparn fun=AJHD cat=COORD att=(comp,prd)>
<tparn fun=CJ cat=ADJ att=(comp,prd)> longer
</tparn>
<tparn fun=COOR cat=CONJN att=(coord)> or
</tparn>
<tparn fun=CJ cat=ADJ att=(comp,prd)> shorter
</tparn></tparn>
<tparn fun=AJPO cat=CL att=(act,comp,indic,red,sub)>
<tparn fun=SUB cat=SUBP>
<tparn fun=SBHD cat=CONJN att=(subord)> than
</tparn></tparn>
<tparn fun=SU cat=NP>
<tparn fun=DT cat=DTP>
<tparn fun=DTCE cat=ART att=(def)> the
</tparn></tparn>
<tparn fun=NPHD cat=N att=(com,sing)> TM sentence
</tparn></tparn></tparn></tparn></tparn>
<tparn fun=PUNC cat=PM att=(per)> .
</tparn></tparn>
Appendix III
Collated References

Aarts, F., & Aarts, J., (1982). English Syntactic Structures. Oxford,
UK: Pergamon.
Aarts, J. (1991). Intuition-based and observation-based grammars. In
K. Aijmer & B. Altenberg (Eds.) English Corpus Linguistics (pp.
44-62). London, UK: Longman.
Aarts, J., de Haan, P., & Oostdijk, N., (Eds.) (1993). English lan-
guage corpora: design, analysis and exploitation. Amsterdam, The
Netherlands: Rodopi.
Aarts, J., van Halteren, H., & Oostdijk, N., (1996). The TOSCA anal-
ysis system. In C. Koster & E. Oltmans (Eds.) Proceedings of the
First AGFL Workshop (Technical Report CSI-R9604, pp. 181-191).
Nijmegen, The Netherlands: University of Nijmegen, Computing Sci-
ence Institute.
Abney, S. (1991). Parsing by Chunks. In R. Berwick, S. Abney & C.
Tenny (Eds.) Principle-Based Parsing. Dordrecht, The Netherlands:
Kluwer Academic Publishers.
Aijmer, K. & Altenberg, B., (Eds.) (1991). English Corpus Linguistics.
London, UK: Longman.
Alshawi, H. (Ed.) (1992). The CORE Language Engine. Cambridge,
MA: MIT Press.
AMALGAM. (1996). WWW home page for AMALGAM.
http://agora.leeds.ac.uk/amalgam/
Atwell, E. S. (1983). Constituent Likelihood Grammar ICAME Journal,
7, 34-67. Bergen, Norway: Norwegian Computing Centre for the
Humanities.
Atwell, E. S. (1988). Transforming a Parsed Corpus into a Corpus
Parser. In M. Kyto, O. Ihalainen & M. Risanen (Eds.) Corpus Lin-
guistics, hard and soft: Proceedings of the ICAME 8th International
Conference (pp. 61-70). Amsterdam, The Netherlands: Rodopi.
Atwell, E. S. (1996). Machine Learning from Corpus Resources for
Speech And Handwriting Recognition. In J. Thomas & M. Short
(Eds.) Using Corpora for Language Research: Studies in the Honour
of Geoffrey Leech (pp. 151-166). Harlow, UK: Longman.
Atwell, E. S., Hughes, J. S., & Souter, D. C. (1994a). AMALGAM:
Automatic Mapping Among Lexico-Grammatical Annotation Models.
In J. Klavans (Ed.) Proceedings of ACL workshop on The Balancing
Act: Combining Symbolic and Statistical Approaches to Language (pp.
21-28). Somerset, NJ: Association for Computational Linguistics.
Atwell, E. S., Hughes, J. S., & Souter, D. C. (1994b). A Unified Multi-
Corpus for Training Syntactic Constraint Models. In L. Evett & T.
Rose (Eds.) Proceedings of AISB workshop on Computational Lin-
guistics for Speech and Handwriting Recognition. Leeds, UK: Leeds
University, School of Computer Studies.
Billot, S., & Lang, B. (1989). The structure of shared forests in ambigu-
ous parsing. Proceedings of ACL-89, Vancouver, Canada, June 1989,
143-151.
Black, E. (1993). Parsing English By Computer: The State Of The
Art (Internal report). Kansai Science City, Japan: ATR Interpreting
Telecommunications Research Laboratories.
Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Har-
rison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman,
M., Marcus, M., Roukos, S., Santorini, B., & Strzalkowski, T. (1991).
A procedure for quantitatively comparing the syntactic coverage of
English grammars. Proceedings of the Speech and Natural Language
Workshop, DARPA, February 1991, 306-311.
Black, E., Garside, R. G., & Leech, G. N. (Eds.) (1993). Statistically-
driven Computer Grammars of English: the IBM / Lancaster Ap-
proach. Amsterdam, The Netherlands: Rodopi.
Black, E., Lafferty, J., & Roukos, S. (1992). Development and evaluation
of a broad-coverage probabilistic grammar of English-language com-
puter manuals. Proceedings of ACL-92, Newark, Delaware, 185-192.
Bouma, G. (1992). Feature Structures and Nonmonotonicity. Computa-
tional Linguistics, 18(2). (Special Issue on Inheritance I.)
Brehony, T. (1994). Francophone Stylistic Grammar Checking using
Link Grammars. Computer Assisted Language Learning, 7(3). Lisse,
The Netherlands: Swets and Zeitlinger.
Brill, E. (1993). A Corpus-Based Approach to Language Learning. Ph.D.
Dissertation, Department of Computer and Information Science, Uni-
versity of Pennsylvania.
Briscoe, T. (1994). Parsing (with) Punctuation etc. (Technical Report).
Grenoble, France: Rank Xerox Research Centre.
Briscoe, T., & Carroll, J. (1991). Generalised Probabilistic LR Parsing of
Natural Language (Corpora) with Unification-based Grammars (Tech-
nical Report Number 224). Cambridge, UK: University of Cambridge,
Computer Laboratory.
Burns, A., Duffy, D., MacNish, C., McDermid, J., & Osborne, M.
(1995). An Integrated Framework for Analysing Changing Require-
ments (PROTEUS Deliverable 3.2). York, UK: University of York,
Department of Computer Science.
Carroll, G., & Charniak, E. (1992). Two Experiments On Learning
Probabilistic Dependency Grammars From Corpora (TR CS-92-16).
Providence, RI: Brown University, Department of Computer Science.
Carroll, J. (1993). Practical Unification-based Parsing of Natural Lan-
guage. Ph.D. Thesis, University of Cambridge, March 1993.
Chanod, J. P. (1996). Rules and Constraints in a French Finite-State
Grammar (Technical Report). Meylan, France: Rank Xerox Research
Centre, January.
Charniak, E., Hendrickson, C., Jacobson, N., & Perkowitz, M. (1993).
Equations for Part-of-Speech Tagging. Proceedings of AAAI'93, 784-
789.
Chomsky, N. (1981). Lectures on Government and Binding. Cinnamin-
son, NJ: Foris Publications.
Chomsky, N. (1986). Barriers. Cambridge, MA: MIT Press, Linguistic
Inquiry Monographs.
COMPASS (1995). Adapting Bilingual Dictionaries for online Compre-
hension Assistance (Deliverable, LRE Project 62-080). Luxembourg,
Luxembourg: Commission of the European Communities.
Cunningham, H., Wilks, Y., & Gaizauskas, R. (1996). GATE - a Gen-
eral Architecture for Text Engineering. Proceedings of the 16th Con-
ference on Computational Linguistics (COLING-96).
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A Practi-
cal Part-of-Speech Tagger. Proceedings of the Third Conference on
Applied Natural Language Processing, April, 1992.
Debili, F. (1982). Analyse Syntaxico-Semantique Fondee sur une Acqui-
sition Automatique de Relations Lexicales-Semantiques. Ph.D. The-
sis, University of Paris XI.
Douglas, S. (1995). Robust PATR for Error Detection and Correction.
In A. Schoter and C. Vogel (Eds.) Edinburgh Working Papers in Cog-
nitive Science: Nonclassical Feature Systems, Volume 10 (pp. 139-
155). Unpublished.
Duffy, D., MacNish, C., McDermid, J., & Morris, P. (1995). A framework
for requirements analysis using automated reasoning. In J. Iivari, K.
Lyytinen and M. Rossi (Eds.) CAiSE*95: Proceedings of the Seventh
Advanced Conference on Information Systems Engineering (pp. 68-
81). New York, NY: Springer-Verlag, Lecture Notes in Computer
Science.
Dynix (1991). Dynix Automated Library Systems Searching Manual.
Evanston, Illinois: Ameritech Inc.
EAGLES. (1996). WWW home page for EAGLES.
http://www.ilc.pi.cnr.it/EAGLES/home.html
Fain, J., Carbonell, J. G., Hayes, P. J., & Minton, S. N. (1985). MUL-
TIPAR: A Robust Entity Oriented Parser. Proceedings of the 7th
Annual Conference of the Cognitive Science Society, 1985.
Forney, D. (1973). The Viterbi Algorithm. Proceedings of the IEEE, 61,
268-278.
Francis, W. N., & Kucera, H. (1982). Frequency Analysis of English.
Boston, MA: Houghton Mifflin Company.
Gazdar, G., Klein, E., Pullum, G. K., & Sag, I. A. (1985). General-
ized Phrase Structure Grammar. Cambridge, MA: Harvard University
Press.
Gibson, E., & Pearlmutter, N. (1993). A Corpus-Based Analysis of
Constraints on PP Attachments to NPs (Report). Pittsburgh, PA:
Carnegie Mellon University, Department of Philosophy.
Grefenstette, G. (1994). Light Parsing as Finite State Filtering. Pro-
ceedings of the Workshop `Extended finite state models of language',
European Conference on Artificial Intelligence, ECAI'96, Budapest
University of Economics, Budapest, Hungary, 11-12 August, 1996.
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discov-
ery. Boston, MA: Kluwer Academic Press.
Grefenstette, G., & Schulze, B. M. (1995). Designing and Evaluating
Extraction Tools for Collocations in Dictionaries and Corpora (De-
liverable D-3a, MLAP Project 93-19: Prototype Tools for Extracting
Collocations from Corpora). Luxembourg, Luxembourg: Commission
of the European Communities.
Grefenstette, G., & Tapanainen, P. (1994). What is a Word, What
is a Sentence? Problems of Tokenization. Proceedings of the 3rd
Conference on Computational Lexicography and Text Research, COM-
PLEX'94, Budapest, Hungary, 7-10 July.
Grover, C., Briscoe, T., Carroll, J., & Boguraev, B. (1993). The Alvey
Natural Language Tools Grammar (4th Release) (Technical report).
Cambridge, UK: University of Cambridge, Computer Laboratory.
Hayes, P. J. (1981). Flexible Parsing. Computational Linguistics, 7(4),
232-241.
Hellwig, P. (1980). PLAIN - A Program System for Dependency Anal-
ysis and for Simulating Natural Language Inference. In L. Bolc (Ed.)
Representation and Processing of Natural Language (271-376). Mu-
nich, Germany, Vienna, Austria, London, UK: Hanser & Macmillan.
Hindle, D. (1993). A Parser for Text Corpora. In B. T. S. Atkins & A.
Zampolli (Eds.) Computational Approaches to the Lexicon. Oxford,
UK: Clarendon Press.
HLT Survey. (1995). WWW home page for the NSF/EC Survey of the
State of the Art in Human Language Technology.
http://www.cse.ogi.edu/CSLU/HLTsurvey/
Hughes, J. S., & Atwell, E. S. (1994). The Automated Evaluation of
Inferred Word Classifications. In A. Cohn (Ed.) Proceedings of Euro-
pean Conference on Artificial Intelligence (ECAI'94) (pp. 535-539).
Chichester, UK: John Wiley.
Hughes, J. S., Souter, D. C., & Atwell, E. S. (1995). Automatic Ex-
traction of Tagset Mappings from Parallel-Annotated Corpora. In
E. Tzoukerman & S. Armstrong (Eds.) Proceedings of Dublin ACL-
SIGDAT workshop `From text to tags: issues in multilingual language
analysis'. Somerset, NJ: Association for Computational Linguistics.
Hyland, P., Koch, H.-D., Sutcliffe, R. F. E., & Vossen, P. (1996). Se-
lecting Information from Text (SIFT) Final Report (LRE-62030 De-
liverable D61). Luxembourg, Luxembourg: Commission of the Euro-
pean Communities, DGXIII/E5. Also available as a Technical Report.
Limerick, Ireland: University of Limerick, Department of Computer
Science and Information Systems.
ICAME. (1996). WWW home page for ICAME.
http://www.hd.uib.no/icame.html
Jarvinen, T. (1994). Annotating 200 million words: the Bank of English
project. Proceedings of COLING-94, Kyoto, Japan, Vol. 1.
Johansson, S., Atwell, E. S., Garside, R. G., & Leech, G. N. (1986). The
Tagged LOB Corpus. Bergen, Norway: Norwegian Computing Centre
for the Humanities.
Jones, B. E. M. (1994). Can Punctuation Help Parsing? 15th Interna-
tional Conference on Computational Linguistics, Kyoto, Japan.
Karlsson, F. (1990). Constraint Grammar as a Framework for Parsing
Running Text. Proceedings of COLING-90, Helsinki, Finland, Vol.
3.
Karlsson, F., Voutilainen, A., Anttila, A., & Heikkila, J. (1991). Con-
straint Grammar: a Language-Independent System for Parsing Unre-
stricted Text, with an Application to English. In Natural Language
Text Retrieval: Workshop Notes from the Ninth National Conference
on Artificial Intelligence (AAAI-91). Anaheim, CA: American Asso-
ciation for Artificial Intelligence.
Karlsson, F., Voutilainen, A., Heikkila, J., & Anttila, A. (Eds.) (1995).
Constraint Grammar. A Language-Independent System for Parsing
Unrestricted Text. Berlin, Germany, New York, NY: Mouton de
Gruyter.
Karttunen, L. (1994). Constructing Lexical Transducers. Proceedings
of the 15th International Conference on Computational Linguistics,
COLING'94, Kyoto, Japan.
Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992). Two-Level Morphol-
ogy with Composition. Proceedings of the 14th International Con-
ference on Computational Linguistics, COLING'92, Nantes, France,
23-28 August, 1992, 141-148.
Koskenniemi, K. (1983). Two-level Morphology. A General Computa-
tional Model for Word-form Production and Generation (Publication
No. 11). Helsinki, Finland: University of Helsinki, Department of
General Linguistics.
Kyto, M., & Voutilainen, A. (1995). Backdating the English Constraint
Grammar for the analysis of English historical texts. Proc. 12th Inter-
national Conference on Historical Linguistics, ICHL, 13-18 August.
1995, University of Manchester, UK.
Leech, G. N., Barnett, R., & Kahrel, P. (1995). EAGLES Final Report
and Guidelines for the Syntactic Annotation of Corpora (EAGLES
Document EAG-TCWG-SASG/1.5, see EAGLES WWW page). Pisa,
Italy: Istituto di Linguistica Computazionale.
Liberman, M. (1993). How Hard Is Syntax. Talk given at Taiwan.
Lin, D. (1992). Obvious Abduction. Ph.D. thesis, Department of Com-
puting Science, University of Alberta, Edmonton, Alberta, Canada.
Lin, D. (1993). Principle-based parsing without overgeneration. Pro-
ceedings of ACL-93, Columbus, Ohio, 112-120.
Lin, D. (1994). Principar - an efficient, broad-coverage, principle-based
parser. Proceedings of COLING-94, Kyoto, Japan, 482-488.
Lin, D. (1995). A dependency-based method for evaluating broad-
coverage parsers. Proceedings of IJCAI-95, Montreal, Canada, 1420-
1425.
Lin, D., & Goebel, R. (1990). A minimal connection model of abductive
diagnostic reasoning. Proceedings of the 1990 IEEE Conference on
Arti cial Intelligence Applications, Santa Barbara, California, 16-22.
Lin, D., & Goebel, R. (1991). A message passing algorithm for plan
recognition. Proceedings of IJCAI-91, Sydney, Australia, 280-285.
Lin, D., & Goebel, R. (1993). Context-free grammar parsing by message
passing. Proceedings of the First Conference of the Pacific Association
for Computational Linguistics, Vancouver, British Columbia, 203-211.
Lotus (1992). Lotus Ami Pro for Windows User's Guide Release Three.
Atlanta, Georgia: Lotus Development Corporation.
Magerman, D. (1994). Natural Language Parsing As Statistical Pattern
Recognition. Ph.D. Thesis, Stanford University.
Magerman, D., & Weir, C. (1992). Efficiency, Robustness and Accuracy
in Picky Chart Parsing. Proceedings of the 30th ACL, University of
Delaware, Newark, Delaware, 40-47.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building
a Large Annotated Corpus of English: The Penn Treebank. Compu-
tational Linguistics, 19, 313-330.
Mel'cuk I. A. (1987). Dependency syntax: theory and practice. Albany,
NY: State University of New York Press.
Mellish, C. S. (1989). Some chart-based techniques for parsing ill-formed
input. ACL Proceedings, 27th Annual Meeting, 102-109.
Merialdo, B. (1994). Tagging English Text With A Probabilistic Model.
Computational Linguistics, 20, 155-171.
Nederhof, M. & Koster, K., (1993). A customized grammar workbench.
In J. Aarts, P. de Haan & N. Oostdijk, (Eds.) English language cor-
pora: design, analysis and exploitation (pp. 163-179). Amsterdam,
The Netherlands: Rodopi.
Nunberg, G. (1990). The linguistics of punctuation (Technical Report).
Stanford, CA: Stanford University, Center for the Study of Language
and Information.
O'Donoghue, T. (1993). Reversing the process of generation in systemic
grammar. Ph.D. Thesis. Leeds, UK: Leeds University, School of
Computer Studies.
Osborne, M. (1994). Learning Unification-based Natural Language Gram-
mars. Ph.D. Thesis, University of York, September 1994.
Osborne, M. (1995). Can Punctuation Help Learning? IJCAI95 Work-
shop on New Approaches to Learning for Natural Language Process-
ing, Montreal, Canada, August 1995.
Osborne, M., & Bridge, D. (1994). Learning unification-based gram-
mars using the Spoken English Corpus. In Grammatical Inference
and Applications (pp. 260-270). New York, NY: Springer Verlag.
Peh, L. S., & Ting, C. (1995) Disambiguation of the Roles of Commas
and Conjunctions in Natural Language Processing (Proceedings of the
NUS Inter-Faculty Seminar). Singapore, Singapore: National Univer-
sity of Singapore.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J., (1972). A Gram-
mar of Contemporary English. London, UK: Longman.
Sampson, G. (1995). English for the Computer: the SUSANNE Corpus
and Analytic Scheme. Oxford, UK: Clarendon Press.
Shieber, S. M. (1986). An Introduction to Unification-Based Approaches
to Grammar (Technical Report). Stanford, CA: Stanford University,
Center for the Study of Language and Information.
Sleator, D. D. K., & Temperley, D. (1991). Parsing English with a
Link Grammar (Technical Report CMU-CS-91-196). Pittsburgh, PA:
Carnegie Mellon University, School of Computer Science.
Souter, C. (1989). A Short Handbook to the Polytechnic of Wales Corpus
(Technical Report). Bergen, Norway: Bergen University, ICAME,
Norwegian Computing Centre for the Humanities.
Souter, D. C., & Atwell, E. S. (1994). Using Parsed Corpora: A review
of current practice. In N. Oostdijk & P. de Haan (Eds.) Corpus-based
Research Into Language (pp. 143-158). Amsterdam, The Netherlands:
Rodopi.
Sundheim, B. M. (1991). Third Message Understanding Evaluation and
Conference (MUC-3): Methodology and Test Results. Natural Lan-
guage Processing Systems Evaluation Workshop, 1-12.
Sutcliffe, R. F. E., Brehony, T., & McElligott, A. (1994). Link Grammars
and Structural Ambiguity: A Study of Technical Text. Technical Note
UL-CSIS-94-15, Department of Computer Science and Information
Systems, University of Limerick, December 1994.
Sutcliffe, R. F. E., Koch, H.-D., & McElligott, A. (Eds.) (1995). Proceed-
ings of the International Workshop on Industrial Parsing of Software
Manuals, 4-5 May 1995, University of Limerick, Ireland (Technical
Report). Limerick, Ireland: University of Limerick, Department of
Computer Science and Information Systems, 3 May, 1995.
Tapanainen, P., & Jarvinen, T. (1994). Syntactic analysis of natural
language using linguistic rules and corpus-based patterns. Proceedings
of COLING-94, Kyoto, Japan, Vol. 1.
Tapanainen, P., & Voutilainen, A. (1994). Tagging accurately | Don't
guess if you know. Proceedings of Fourth ACL Conference on Applied
Natural Language Processing, Stuttgart, Germany.
Taylor, L. C., Grover, C., & Briscoe, E. J. (1989). The Syntactic Reg-
ularity of English Noun Phrases. Proceedings, 4th European Associa-
tion for Computational Linguistics, 256-263.
Taylor, L. J., & Knowles. G. (1988). Manual of information to accom-
pany the SEC corpus: The machine readable corpus of spoken English
(Technical Report). Lancaster, UK: University of Lancaster, Unit for
Computer Research on the English Language.
Ting, C. (1995a). Hybrid Approach to Natural Language Processing
(Technical Report). Singapore, Singapore: DSO.
Ting, C. (1995b). Parsing Noun Phrases with Enhanced HMM (Pro-
ceedings of the NUS Inter-Faculty Seminar). Singapore, Singapore:
National University of Singapore.
Tomita, M. (1986). Efficient Parsing for Natural Language. Norwell,
Massachusetts: Kluwer Academic Publishers.
Trados (1995). Trados Translator's Workbench for Windows User's
Guide. Stuttgart, Germany: Trados GmbH.
van Halteren, H. & van den Heuvel, T., (1990). Linguistic Exploitation
of Syntactic Databases. The use of the Nijmegen Linguistic DataBase
Program. Amsterdam, The Netherlands: Rodopi.
Vogel, C., & Cooper, R. (1995). Robust Chart Parsing with Mildly
Inconsistent Feature Structures. In A. Schoter and C. Vogel (Eds.)
Edinburgh Working Papers in Cognitive Science: Nonclassical Feature
Systems, Volume 10 (pp. 197-216). Unpublished.
Voutilainen, A. (1993). NPtool, a Detector of English Noun Phrases.
Proceedings of the Workshop on Very Large Corpora, Ohio State Uni-
versity, Ohio, USA.
Voutilainen, A. (1994). A noun phrase parser of English. In Robert
Eklund (Ed.) Proceedings of `9:e Nordiska Datalingvistikdagarna',
Stockholm, Sweden, 3-5 June 1993. Stockholm, Sweden: Stockholm
University, Department of Linguistics and Computational Linguistics.
Voutilainen, A. (1995a). A syntax-based part of speech analyser. Pro-
ceedings of the Seventh Conference of the European Chapter of the
Association for Computational Linguistics, Dublin, Ireland, 1995.
Voutilainen, A. (1995b). Experiments with heuristics. In F. Karlsson, A.
Voutilainen, J. Heikkila & A. Anttila (Eds.) Constraint Grammar. A
Language-Independent System for Parsing Unrestricted Text. Berlin,
Germany, New York, NY: Mouton de Gruyter.
Voutilainen, A., & Heikkila, J. (1994). An English constraint grammar
(ENGCG): a surface-syntactic parser of English. In U. Fries, G. Tottie
& P. Schneider (Eds.) Creating and using English language corpora.
Amsterdam, The Netherlands: Rodopi.
Voutilainen, A., Heikkila, J., & Anttila, A. (1992). Constraint Gram-
mar of English. A Performance-Oriented Introduction (Publication
21). Helsinki, Finland: University of Helsinki, Department of Gen-
eral Linguistics.
Voutilainen, A., Heikkila, J., & Anttila, A. (1992). A Lexicon and Con-
straint Grammar of English. Proceedings of the 14th International
Conference on Computational Linguistics, COLING'92, Nantes,
France, 23-28 August, 1992.
Voutilainen, A. & Jarvinen, T. (1995). Specifying a shallow grammatical
representation for parsing purposes. Proceedings of the Seventh Con-
ference of the European Chapter of the Association for Computational
Linguistics, Dublin, Ireland, 1995.
Voutilainen, A. & Jarvinen, T. (1996). Using English Constraint Gram-
mar to Analyse a Software Manual Corpus. In R. F. E. Sutcliffe,
H.-D. Koch & A. McElligott (Eds.) Industrial Parsing of Software
Manuals. Amsterdam, The Netherlands: Editions Rodopi.
Weischedel, R. M. (1983). Meta-rules as a Basis for Processing Ill-formed
Input. Computational Linguistics, 9, 161-177.
Wood, M. M. (1993). Categorial Grammars. London, UK: Routledge.
comp.speech. (1996). WWW home page for comp.speech Frequently
Asked Questions. http://svr-www.eng.cam.ac.uk/comp.speech/
Index

Aarts, F., 186
Aarts, J., 179, 180, 186
Abney, S., 13, 140
Active Chart parsing, 48
Affix Grammar over Finite Lattices (AGFL), 185
Algorithm
   dynamic context, 161
   Viterbi, 165
ALICE, 47
   analysis compared to other parsers, 33
   chart parsing algorithm in, 48
   extraction of parsed fragments from chart, 49
   origins in CRISTAL LRE project, 47
   postprocessing in, 49
   relationship to financial domain, 49
   use of Categorial Grammar in, 47
Alshawi, H., 119
Alvey Natural Language Toolkit (ANLT), 120
AMALGAM, 26, 28, 29, 38
Analysis, contextually appropriate, 188
Analysis trees vs. parse trees in TOSCA, 182
ANLT, see Alvey Natural Language Toolkit
Anttila, A., 140
Applications of parsing, 27, 43
Applied Linguistics, 43
Apposition, 64
Argument/Adjunct distinction, 149
Attachment, prepositional phrase, 148, 152
Atwell, E. S., 25, 26, 27, 28, 36
Automatic parsing, 181
Auxiliary verbs, 148
Awk (Unix utility), use for name recognition, 142
Axioms, dependency, 161
Barnett, R., 26, 35
Billot, S., 107
Black, E., 13, 25, 119, 159
Black, W. J., 47
Boguraev, B., 120
Bouma, G., 126
Bracketing, recommended use of in EAGLES Draft Report, 36
Brehony, T., 89, 101
Bridge, D., 125
Brill, E., 92, 143
Brill Tagger, 92, 143
Briscoe, A. D., iii
Briscoe, E. J., 120, 122, 125, 129
British National Corpus (BNC), 30
Brown Corpus, 30, 165
   tagset for, 144
Burns, A., 120
Capitalisation, problems of, 94
Carbonell, J. G., 126
Carroll, G., 120, 121, 125, 159
Categorial Grammar, 47
Cencioni, R., 11
Chain
   noun, 145
   verb, 145
Chanod, J. P., 140
Charniak, E., 159, 161, 165
Chomsky, N., 103
Chomsky Normal Form, 49
Clause
   elliptical, 188
   verbless, 188
Cleft sentence, 188
Collins English Dictionary (CED), 109
COMPASS, 140
Complement, preposed, 188
Conjoin marker, 192
Conjunctions, problems of, 155
Connective, 188
Constituency structure, 25
   vs. dependency structure, 41
Constituent structure, immediate, 186
Constraints, percolation, 109
Contextually appropriate analysis, 188
Cooper, R., 126
Coordination, 192
   processing of in ENGCG, 64
   strength of LPARSER in, 101
Corpora, ICAME parsed, 29
Corpus
   annotated, 159
   British National (BNC), 30
   Brown, 30, 165
   Lancaster Oslo Bergen (LOB), 28, 122
   Polytechnic of Wales (POW), 29
   Spoken English (SEC), 28, 29
   SUSANNE, 28
   use of to define parsing scheme, 28
   Wall Street Journal (WSJ), 165, 169
Corpus-based parsing approach, 164
CRISTAL LRE Project, 47
Crossing rate metric, 127
Cutting, D., 124, 143, 151
DCG, see Definite Clause Grammar
Debili, F., 140
Default unification, use of to achieve relaxation in parsing, 125
Definite Clause Grammar (DCG), 121
Definition of parsing scheme, 28
Delicacy
   in grammatical classification, 26
   issue of in comparing parsers, 35
Dependency axioms, 161
Dependency parsing, 160
Dependency relations, 156
   binary, 140, 153
Dependency structure, 36
   vs. constituency structure, 41
DESPAR, 159
   analysis compared to other parsers, 35
   evaluation of at word and sentence level, 170
   lack of grammar formalism underlying, 160
   parsing noun phrases in, 169
   processing unknown words in, 167
   production of parse forest by, 160
   relationship between dependency parsing and tagging in, 161
   use of enhanced Hidden Markov Model (eHMM) in, 160, 161
Dictionary
   Collins English (CED), 109
   Oxford Advanced Learner's (OALD), 109
Direct speech, 188
Discourse element, 188
Douglas, S., 126
Duffy, D., 120
Dynamic context algorithm, 161
EAGLES, see Expert Advisory Group on Language Engineering Standards
eHMM, see Enhanced Hidden Markov Model
Elliptical clause, 188
Enclitic forms, 188
End of noun phrase postmodifier marker, 192
ENGCG, 57
   analysis compared to other parsers, 33
   availability of grammars for different languages, 58
   compounds in lexicon of, 63
   processing of coordination in, 64
   strategy for evaluating, 62
   tokenisation in, 59
English Constraint Grammar (ENGCG), 58, 140
English language teaching, 43
ENGTWOL lexical analyser, 61
Enhanced Hidden Markov Model (eHMM), 160
ENTWOL lexicon, 63
Error correction, 125
Error detection, 27
Existential sentence, 188
Expert Advisory Group on Language Engineering Standards (EAGLES), 26, 30, 36, 40, 41
   Draft Report on parsing schemes, 35
   hierarchy of syntactic annotation layers in Draft Report, 39
Expression, formulaic, 188
Extraposed sentence, 188
Fain, J., 126
Ferris, M. C., 11
FIDDITCH, 153
Filters, finite-state, 140, 146
Financial domain, vocabulary of, 49
Flickenger, D., 13
Forest, shared, use of in TOSCA, 182
Formal grammar underlying TOSCA, 180
Formulaic expression, 188
Forney, D., 165
Francis, W. N., 144
Functional labels, recommended indication of in EAGLES Draft Report, 37
Gallen, S., 11
Garland, C., 11
Garside, R. G., 25, 28, 119
Gazdar, G., 121
Gdaniec, C., 13
Generalised Phrase Structure Grammar (GPSG), 121
Gibson, E., 148
Goebel, R., 104, 109
Government-Binding Theory, 103
GPSG, see Generalised Phrase Structure Grammar
Grammar
272 Index

Ax over Finite Lattices Hypertags, use of to demarcate


(AGFL), 185 phrase boundaries, 30
Categorial, 47 ICAME, see International Com-
De nite Clause (DCG), 121 puter Archive of Modern En-
English Constraint (ENGCG), glish
58, 140 Idiomatic phrases, problems of,
Generalised Phrase Structure 30
(GPSG), 121 Ill-formed input, strategy for pars-
Link, 89 ing, 125
machine learning of, 127 Immediate constituent structure,
Object, 121 186
underlying TOSCA, 180 Imperative sentence, 188
wide-coverage, 121 Ingria, R., 13
Grammar-based parsing, 179 International Computer Archive
Grammar formalism, necessary of Modern English (ICAME),
for parsing or not, 160 26
Grammar Workbench (GWB), Interrogative sentence, 188
185 Inversion, subject-verb, 188
Grammars, principle-based vs. IPSM Utterance Corpus
rule-based, 103 breakdown by utterance type,
Greenbaum, G., 186 5
Grefenstette, G., 139, 140, 141, selection of 60 utterance sub-
142 set, 5
Grishman, R., 13 selection of 600 utterances,
Grover, C., 120 5
Harrison, P., 13 IPSM'95 Workshop, di erence
Hayes, P. J., 126 between workshop proceed-
Heikkila, J., 140 ings and this volume, 2
Hendrickson C., 161, 165 Island parsing, 149
Heyn, M., 11 Jacobson, N., 161, 165
Hickey, D., 2, 11, 89 Jarvinen, T., 48, 57
Hidden Markov Model (HMM), Jelinek, F., 13
160 Johansson, S., 28
enhanced (eHMM), 160, 169 Jones, B. E. M., 129
Hidden Markov Model (HMM) Kahrel, P., 26, 35
tagger, use in parsing, 143 Kaplan, R. M., 142
Hindle, D., 13, 153 Karlsson, F., 57
HMM, see Hidden Markov Model Karttunen, L., 142
Horizontal format, use of to rep- Klavans, J., 13
resent phrase structure, 36 Klein, E., 121
Hughes, J. S., 26 Knowledge extraction, 27
Human Language Technology Koch, H.-D., 1
Survey, 27 Koskenniemi, K., 59
Hybrid approach to parsing, 160 Koster, K., 185
Index 273

Kupiec, J., 124, 143, 151 linking requirements used by,


Kucera, H., 144 91
Labelled tree, use of in TOSCA, preprocessing required by, 94
186 problems with topicalisation
Labelling, recommended use of in, 101
in EAGLES Draft Report, syntagmatic lexicon as basis
36 of, 90
La erty, J., 13 treatment of capitalised to-
Lancaster Oslo Bergen (LOB) kens by, 94
Corpus, 28, 122 use of Brill Tagger in evalu-
Lang, B., 107 ating, 92
Language Engineering, de ni- Machine learning
tion of, 2 of grammar, 127
Leech, G. N., 25, 26, 28, 35, of language models, 27
119, 186 MacNish, C., 120
Lemmatisation, 143 Magerman, D. M., 13, 19, 119,
Lex (Unix utility), 141 125, 159
Lexical analysis, see Tokenisa- Marcinkiewicz, M. A., 165
tion Marcus, M. P., 13, 165
Lexicography, 27 Marker
Lexicon conjoin, 192
nite-state, 142 end of noun phrase
in TOSCA, 184 postmodi er, 192
McDermid, J., 120
problem of creating, 135 McElligott, A., 1, 89, 101
Liberman, M., 13, 159, 160, 161 Mel'cuk, I. A., 15, 161
Lin, D., 13, 14, 15, 19, 103, Merialdo, B., 161
104, 109, 110, 202 Metric, crossing rate, 127
Linguistic DataBase (software Minton, S. N., 126
package), 182 Module, unknown word, 167
Link Parser, 89 Molloy, T., 2, 11, 89
Link, syntagmatic, 90 Morphological analysis, 142
Lists in a text, recognition of, Morphology, nite-state, 142
146, 154 MultiTreebank, 29, 43
LOB Corpus, see Lancaster Oslo Murphy, B., 11
Bergen Corpus Name recognition, 142
Lookahead, 189 use of Awk for, 142
LPARSER, 89 National Software Directorate
ability of to analyse coordi- of Ireland (NSD), 2, 11, 89
nation, 101 Neal, P., 47
compared to other parsers, Nederhof, M., 185
34 N-gram Model, 27, 30
e ect of rogue utterances on Normal Form, Chomsky, 49
overall performance of, 95 Notational di erences, 26
274 Index

Noun chain, 145 low-level, 139


Noun phrases preprocessing in, 141
identi cation of heads in, 146 purpose of, 139
longest match of in SEXTANT, robustness of, 139
145 Parsing
parsing in DESPAR, 169 active chart, 48
NSD, see National Software Di- applications of, 27, 43
rectorate of Ireland as tagging, 161
Nunberg, G., 129 automatic, 181
O'Brien, R., 2, 11, 89 corpus-based approach to, 164
O'Donoghue, T. F., 31 de nition of, 159
Oostdijk, N., 179 dependency, 160
Osborne, M., 119, 120, 125, 128 e ect of mis-tagging on ac-
Overspeci cation, consequences curacy of, 155
of in TOSCA, 197 error correction in, 125
Oxford Advanced Learner's Dic- grammar-based, 179
tionary (OALD), 109 hybrid, 160
Parse forest island, 149
in DESPAR, 160 message passing in, 109
in SEXTANT, 153, 156 percolation constraints in, 109
Parse selection in TOSCA, 180, problems of punctuation in,
189 129
Parse tree, conversion to depen-
dency notation, 156 robustness in, 124
Parse trees vs. analysis trees in skeletal, 27
TOSCA, 182 statistical approach to, 164
Parser tokenisation prior to, 128
ALICE, 47 unrestricted language, 119
ANLT, 121 with unknown words, 120
DESPAR, 159 Parsing ill-formed input, 125
ENGCG, 57 Parsing noun phrases in
FIDDITCH, 153 DESPAR, 169
Link, 89 Parsing scheme, de nition of us-
LPARSER, 89 ing a corpus, 28
PLAIN, 35 Parsing schemes, EAGLES
RANLT, 119 Draft Report on, 35
SEXTANT, 139, 140, 144 Pearlmutter, N., 148
TOSCA, 186 Pedersen, J., 124, 143, 151
Parsers Peh, L. S., 159, 160, 169
chunking, 140 Penn Treebank, 165
constraint grammar, 140 Percolation constraints, 109
evaluation of, 149, 151 Perkowitz M., 161, 165
nite-state, 140 Phrase boundaries, identi ca-
Helsinki, 140 tion of in SEXTANT, 152
Index 275

PLAIN, analysis compared to RANLT, 119


other parsers, 35 analysis compared to other
Plan recognition, 109 parsers, 35
Pola, L., 11 problems addressed by, 124
Polytechnic of Wales (POW) Cor- processing of unknown words
pus, 29 in, 120
POW Corpus, see Polytechnic relationship of Alvey Natu-
of Wales Corpus ral Language Toolkit
Predicate-argument data, extrac- (ANLT) to, 120
tion in SEXTANT, 149 resource bound placed upon,
Preposed complement, 188 125
Preposed object, 188 use of default uni cation for
Prepositional phrase attachment, handling ill-formed input
148, 152 to, 125
Prepositional phrase heads, iden- use of part-of-speech tagger
ti cation of, 146 in, 124
PRINCIPAR use of punctuation to seg-
analysis compared to other ment input to, 128
parsers, 34 Reaction signal, 188
automatic method used for Resource bounds, 120
evaluation of, 110 Robustness in parsing, 124, 139
broad coverage of, 103 Roukos, S., 13
percolation constraints in, 109 Rule-based component in
relevance of Government- TOSCA, 184
Binding Theory to, 103 Rule-based parsing, 103
use of machine readable dic- Sag, I. A., 121
tionaries to create lexicon Sampson, G. R., 14, 28
for, 109 Santorini, B., 13, 165
use of message-passing in, 104 Schulze, B. M., 140
Principle-based parsing, 103 SEC Corpus, see Spoken En-
Probability, 27 glish Corpus
Proper names, problems of, 30 Segmentation
Proteus Project, 120 prior to parsing, 128
Pullum, G. K., 121 problems of, 30
Punctuation Segments, recommended brack-
relationship to parsing, 129 eting and labelling of in EA-
treatment of by SEXTANT, GLES Draft Report, 36
155 Selecting Information from Text
used for sentence segmenta- (SIFT), 2, 89
tion, 128, 133 Semantic component, 27
Quirk, R., 186 Sentence
Raising, 35, 38, 40, 188 cleft, existential, extraposed,
Rank of syntactic unit in EA- imperative & interrogative,
GLES Draft Report, 38 188
276 Index

distinguished from utterance, Spoken language, special char-


5 acteristics of, 38
Sentence boundaries, 141 Statistical parsing approach, 164
Sentence level, evaluation of Strzalkowski, T., 13
DESPAR at, 170 Subcategorisation, 33, 37
SEXTANT, 140 Subclassi cation, 37
analysis compared to other Subject-verb inversion, 188
parsers, 35 Sundheim, B. M., 127
comparison of philosophy to SUSANNE Corpus, 28
that of ENGCG, 140 Sutcli e, R. F. E., 1, 89, 101
extraction of predicate- Svartvik, J., 186
argument data in, 149 Syntactic annotation, EAGLES
identi cation of phrase bound- vs. IPSM, 39
aries in, 152 Syntagmatic relation, 90
list recognition in, 146 Tag correction, 191
longest match of noun phrases Tag selection in TOSCA, 180
in, 145 Tagger
name recognition in, 142 Brill, 92, 143
parse forest returned by, 156 building from Brown and Wall
prepositional attachment in, Street Journal Corpora, 165
148 part-of-speech, 160
tokenisation in, 141 stochastic, 124
treatment of punctuation by, Taggers
155 accuracy of, 151, 155
use of for automatic thesaurus ftp sites for, 143
construction, 140 Hidden Markov Model (HMM),
use of Hidden Markov Model 143
part-of-speech tagger in, lexical probabilities in, 155
143 Xerox, 143
Shared forest, use in TOSCA, Tagging, automatic, 181
182 Tagset
Shieber, S. M., 126 Brown Corpus, 144
Shiuan, P. L., 159, 160, 169 layered, 143, 144
Sibun, P., 143, 151 simpli cation of, 144
Signal, reaction, 188 Tapanainen, P., 61, 141
Silbun, P., 124 Taylor, L. C., 122
Skeletal parsing, 27 Temperley, D., 89
Sleator, D. D. K., 89 Text, technical vs. informal,
Souter, D. C., 26, 27 152
Speech recognition, 27 Thesaurus construction, 140
Speech recognition systems, sur- Ting, C. H. A., 159, 160, 169
veys of, 27 Tokenisation
Spoken English Corpus (SEC), in SEXTANT, 141
28, 29 in TOSCA, 181, 183
Index 277

prior to parsing, 128 Weighted scores, use of to com-


problems of, 30 pare parsing schemes, 42
with Unix lex, 141 Weischedel, R. M., 126
Tomita, M., 107 Wier, C., 125
TOSCA, 186 Wood, M. M., 47
analysis compared to other Word level, evaluation of
parsers, 35 DESPAR at, 170
grammar underlying, 180 Word tagging, problems of, 30
parse selection in, 180 Words, unknown, 142
parse trees vs. analysis trees WSJ Corpus, see Wall Street
in, 182 Journal Corpus
rule-based component in, 184 Wybrants, H. J., 11
tag selection in, 180 Xerox
tokenisation in, 181, 183 Palo Alto Research Center
use of labelled tree in, 186 (PARC), 140
use of shared forest in, 182 Rank Xerox Research Cen-
TOSCA Research Group, 179 tre, 140
Trace, 38 Xerox Taggers, 143
Transducers, nite-state, 142 XSoft, 143
Tree, labelled, 186 Zaenen, A., 142
Treebank, 29
Two-level grammar, 185
Two-level rules, 142
Unknown words
parsing sentences with, 120
processing in DESPAR, 167
Utterance, distinguished from
sentence, 5
van den Heuvel, T., 183
van Halteren, H., 179, 183
Verb chain, 145
identi cation of, 148
Verbless clause, 188
Verbs, auxiliary, 148
Vertical format, use of to rep-
resent phrase structure, 36
Vertical Strip Grammar, 31, 41
Viterbi algorithm, 165
Vocabulary of nancial domain,
49
Vogel, C., 126
Voutilainen, A., 48, 57, 140
Wall Street Journal (WSJ) Cor-
pus, 165, 169