THE PROIEL CORPUS AS A SOURCE TO OLD CHURCH SLAVIC: A
PRACTICAL INTRODUCTION
4 http://files.morphgnt.org/tischendorf/
(1) [OCS example text not recoverable]
While they were listening to this, he went on to tell them a parable, because he was
near Jerusalem and the people thought that the kingdom of God was going to appear
at once. (Lk. 19:11)
(2) [OCS example text not recoverable]
He said, I will judge you by your own words, you wicked servant! (Lk. 19:22)
Our solution was to split the text into chunks by punctuation, and then to group the
chunks into sentences by comparing them to the Greek. Using the text structure (chapter
and verse) for orientation, the Gale-Church algorithm tries to find the grouping whose
length best matches that of the corresponding Greek sentence. The method was fairly
successful, but left some adjustments to be made manually: both sentence splits and
sentence mergers have been necessary, and are done by the annotators/reviewers with the
annotation tool.
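The length-based grouping can be illustrated with a small sketch. This is not the project's implementation and it is much simpler than the Gale-Church algorithm proper, which uses a probabilistic cost and also allows merges on the Greek side. It assumes that one verse has already been split into chunks, and it picks the grouping of consecutive chunks whose character lengths best match the Greek sentence lengths. The function name `group_chunks` and the squared-difference cost are assumptions made for illustration.

```python
from functools import lru_cache


def group_chunks(chunks, target_lengths):
    """Group consecutive chunks into len(target_lengths) sentences.

    Picks the grouping that minimises the summed squared difference
    between each group's character length and the (rescaled) length of
    the corresponding target sentence.
    """
    n, k = len(chunks), len(target_lengths)
    if k == 0 or n < k:
        raise ValueError("need at least one chunk per target sentence")

    lengths = [len(c) for c in chunks]
    total_tgt = sum(target_lengths)
    # Rescale target lengths so that both texts have the same total length.
    scale = sum(lengths) / total_tgt if total_tgt else 1.0

    prefix = [0]
    for length in lengths:
        prefix.append(prefix[-1] + length)

    @lru_cache(maxsize=None)
    def best(i, j):
        """Cheapest way to group chunks[i:] into sentences j..k-1."""
        if j == k:
            return (0.0, ()) if i == n else (float("inf"), ())
        best_cost, best_cuts = float("inf"), ()
        # Leave at least one chunk for every remaining sentence.
        for end in range(i + 1, n - (k - j - 1) + 1):
            seg_len = prefix[end] - prefix[i]
            cost = (seg_len - scale * target_lengths[j]) ** 2
            rest_cost, rest_cuts = best(end, j + 1)
            if cost + rest_cost < best_cost:
                best_cost, best_cuts = cost + rest_cost, (end,) + rest_cuts
        return best_cost, best_cuts

    _, cuts = best(0, 0)
    groups, start = [], 0
    for end in cuts:
        groups.append(" ".join(chunks[start:end]))
        start = end
    return groups
```

Groupings proposed in this way would still need the manual splits and mergers described above.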
4 Annotation workflow
The annotation was done manually by student annotators, aided in various
ways by the annotation tool. First they adjusted the sentence division manually where
necessary (cf. section 3). The next step was lemmatisation and morphological
disambiguation. For the Greek text we had good morphological analysis and lemmatisation
from the MorphGNT source text, and the morphological annotation was a matter of brief
control and some disambiguation. For the Codex Marianus, on the other hand, we had only
a set of closed-class rules extracted from the UCS parsed corpus version. The lemmata
therefore had to be entered manually, following the lemmatisation in Cejtlin (1999) closely.
Also, full morphological analysis had to be done for most items. Reviewed annotations
were then stored and used for morphological guesses in future annotation.
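The reuse of reviewed annotations for later guesses can be sketched as a simple frequency table. This is only an illustration under our own assumptions, not the annotation tool's actual code: the class name `MorphGuesser`, its methods and the transliterated forms are hypothetical, and the analysis is reduced to a lemma plus a part-of-speech tag.

```python
from collections import Counter, defaultdict


class MorphGuesser:
    """Suggest analyses for a form from previously reviewed annotations."""

    def __init__(self):
        # form -> Counter of (lemma, tag) analyses seen in reviewed data
        self._seen = defaultdict(Counter)

    def add_reviewed(self, form, lemma, tag):
        """Record one reviewed annotation of a form."""
        self._seen[form][(lemma, tag)] += 1

    def guess(self, form):
        """Return known analyses of the form, most frequent first."""
        counts = self._seen.get(form)
        if counts is None:
            return []
        return [analysis for analysis, _ in counts.most_common()]


# Hypothetical transliterated data: a form attested both as a
# conjunction (C-) and as a personal pronoun (Pp).
guesser = MorphGuesser()
guesser.add_reviewed("i", "i", "C-")
guesser.add_reviewed("i", "i", "C-")
guesser.add_reviewed("i", "i", "Pp")
suggestions = guesser.guess("i")
```

The annotator would still choose among the suggestions; the table only orders them by how often each analysis has been confirmed.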
The next step was manual syntactic annotation with a simple clicking tool
which provided good guesses from a set of morphologically based rules. Finally, the
whole analysis was reviewed by project members and, when found correct, published
on our website, accessible to all registered users.
5 The PROIEL web application
The PROIEL web application was originally intended as an annotation tool, but is
convenient for studying the text as well. The page structure reflects the database
structure. After login, the user enters an index of source divisions. Each source
division (i.e. NT chapter) page displays the presentation-form text (i.e. the text with
contractions undissolved and including such things as deletions, lacuna marks, end
marks etc.) and an index of all the sentences in the source division.
The sentence page is the main source of information. Each sentence in the corpus has
such a page. It displays the morphological annotation, the syntactic analysis and tokenisation.
The morphological annotation also has links to the lemma page of each annotated token. The
lemma page gives information on part of speech, lemma-level semantic tags, and also some
frequency statistics and an index of all sentences containing that particular lemma. Likewise,
the tokenisation index has links to the token page of each individual token. The token page
contains a lot of low-level information, including the positional morphological tag, which is
necessary for morphological queries, as we shall see in section 8.
Hanne Eckhoff, Dag Haug. THE PROIEL CORPUS AS A SOURCE TO OLD CHURCH SLAVIC...
The web application does not include a query interface for morphological and
syntactic queries, but there are several ways of exporting data and querying
them in external applications (section 8). It does, however, offer simple
wildcard searches in tokens and lemmata.
6 Annotation
In this section we give a brief survey of the annotation layers in the PROIEL corpus.
6.1 Lemmatisation and morphological annotation
The lemmatisation follows part of speech. A single form may therefore belong to several
lemmata. For example, there are no fewer than four lemmata with the form [...]: a
subjunction, a relative adverb and two regular adverbs that are deemed to have
sufficiently different functions to be separated (one meaning 'as, like', and the other
serving as an introductory 'for').
Tag Part of speech
Nb common noun
Ne proper noun
Pp personal pronoun
Pr relative pronoun
Pd demonstrative pronoun
Ps possessive pronoun
Pk personal reflexive pronoun
Pt possessive reflexive pronoun
Pc reciprocal pronoun
Pi interrogative pronoun
Px indefinite pronoun
Ma cardinal numeral
Mo ordinal numeral
A- adjective
S- article
Dq relative adverb
Du interrogative adverb
Df other adverb
R- preposition
C- conjunction
G- subjunction
F- foreign word
I- interjection
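For scripted work with exported data, the tags listed above can be kept in a simple lookup table. The sketch below copies only the tags shown in the list. The helper `tags_in_class` is hypothetical and rests on our observation that, in this list, the first character groups related classes (N for nouns, P for pronouns, M for numerals, D for adverbs).

```python
# Part-of-speech tags exactly as listed above.
POS_TAGS = {
    "Nb": "common noun",
    "Ne": "proper noun",
    "Pp": "personal pronoun",
    "Pr": "relative pronoun",
    "Pd": "demonstrative pronoun",
    "Ps": "possessive pronoun",
    "Pk": "personal reflexive pronoun",
    "Pt": "possessive reflexive pronoun",
    "Pc": "reciprocal pronoun",
    "Pi": "interrogative pronoun",
    "Px": "indefinite pronoun",
    "Ma": "cardinal numeral",
    "Mo": "ordinal numeral",
    "A-": "adjective",
    "S-": "article",
    "Dq": "relative adverb",
    "Du": "interrogative adverb",
    "Df": "other adverb",
    "R-": "preposition",
    "C-": "conjunction",
    "G-": "subjunction",
    "F-": "foreign word",
    "I-": "interjection",
}


def tags_in_class(first_char):
    """Return all listed tags that start with the given character."""
    return sorted(tag for tag in POS_TAGS if tag.startswith(first_char))


adverb_tags = tags_in_class("D")
```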
Tag Function
PRED predicate
SUB subject
OBJ object
COMP complement
AG agent
ADV adverbial
ATR attribute
APOS apposition
AUX auxiliary
VOC vocative
(4) [OCS example text not recoverable]
and he answered them saying (Mark 3:33)
Example (5) illustrates how we analyse syntactically required open complements (i.e.
complements without an internal subject), such as infinitives dependent on an
auxiliary verb. In this case, the infinitive again depends on the matrix verb via the
relation XOBJ, and has a secondary dependency to the SUB of the matrix verb.
(5) [OCS example text not recoverable]
How can Satan drive out Satan? (Mark 3:23)
The subject of the XOBJ or XADV does not have to be the SUB of the matrix verb,
as we see in (6).
(6) [OCS example text not recoverable]
Let me first go and bury my father (Matthew 8:21)
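The combination of a primary relation and a secondary dependency can be made concrete with a small data structure. The sketch below is not the PROIEL database schema: the class `Node`, its field names and the helper `subject_of` are hypothetical, and English glosses of (5) stand in for the OCS tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: int
    gloss: str
    relation: str
    head: Optional[int] = None       # id of the primary head, None for the root
    secondary: Optional[int] = None  # id of the node supplying an external subject


def subject_of(nodes, verb_id):
    """Return the subject of a verb: its own SUB dependent if it has one,
    otherwise the node reached through its secondary dependency."""
    by_id = {node.id: node for node in nodes}
    for node in nodes:
        if node.head == verb_id and node.relation == "SUB":
            return node
    target = by_id[verb_id].secondary
    return by_id[target] if target is not None else None


# English glosses for 'How can Satan drive out Satan?' (Mark 3:23).
example_5 = [
    Node(1, "how", "ADV", head=2),
    Node(2, "can", "PRED"),
    Node(3, "Satan", "SUB", head=2),
    Node(4, "drive-out", "XOBJ", head=2, secondary=3),
    Node(5, "Satan", "OBJ", head=4),
]
infinitive_subject = subject_of(example_5, 4)
```

In an analysis like (6), the secondary dependency would simply point to a node that is not the SUB of the matrix verb.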
These analyses can be contrasted with the analysis of the dative absolute, which
comes with its own subject. As we see in (7), the participle is taken as an ADV on the
matrix verb, whereas the dative NP is taken as a SUB on the participle itself.
(7) [OCS example text not recoverable]
When he came they spread their robes on the road (Luke 19:36)
There can sometimes be doubt as to which of two relations a node should have. In
particular, it can be difficult to distinguish between objects and obliques, and between
obliques and adverbials. For these cases we employ a small set of supertags.
Tag Function
ARG argument (object or oblique)
PER peripheral (oblique or adverbial)
NONSUB non-subject (object, oblique or adverbial)
REL apposition or attribute
ADNOM adnominal
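For queries it matters that a supertag stands for a set of possible relations: a search for objects may also need to retrieve nodes tagged ARG or NONSUB. A minimal sketch of this lookup follows. The names `SUPERTAGS` and `tags_possibly` are hypothetical, and ADNOM is left out because the table above does not list the relations it covers.

```python
# Supertags and the relations they cover, as given in the table above.
SUPERTAGS = {
    "ARG": {"OBJ", "OBL"},
    "PER": {"OBL", "ADV"},
    "NONSUB": {"OBJ", "OBL", "ADV"},
    "REL": {"APOS", "ATR"},
}


def tags_possibly(relation):
    """Return every tag under which a node with this relation may be
    annotated: the relation itself plus all supertags covering it."""
    tags = {relation}
    for supertag, members in SUPERTAGS.items():
        if relation in members:
            tags.add(supertag)
    return tags


object_tags = tags_possibly("OBJ")
```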
(8) [OCS example text not recoverable]
The Father loves the Son and has placed everything in his hands (John 3:35)
We get the same analysis when we have a negated transitive verb with an animate
object, as in (9). In queries for indisputable genitive-accusatives all negated sentences
have to be filtered out in any case.
(9) [OCS example text not recoverable]
I have no husband (John 4:17)
In (10) we see the verb [...], which consistently requires the genitive. In this
case, the argument is again taken as a genitive in the morphology, but
as an OBL in the syntax.
(10) [OCS example text not recoverable]
Why do we need more witnesses? (Mark 14:63)
(11) [OCS example text not recoverable]
and having seen Jesus, he fell on his face and begged him, saying (Luke 5:12)
A query for indisputable instantiations of the genitive-accusative would have
to include only human masculine singular genitive-marked OBJs, excluding negated
examples, examples in supine constructions and a-stem nouns. For complements of
prepositions, the genitive-accusatives must be identified preposition by preposition.
All this information is available in different layers of the PROIEL corpus: the lemma,
morphology and syntax layers, and animacy status is available from the semantic
layer (next section).
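The criteria for such a query can be spelled out as a filter over tokens. The sketch below is an illustration only: the class `Token` and all its field names and values are hypothetical and do not reproduce the PROIEL export format. It merely combines the conditions named above for direct objects and does not cover complements of prepositions.

```python
from dataclasses import dataclass


@dataclass
class Token:
    gloss: str
    relation: str        # syntactic relation, e.g. "OBJ"
    case: str
    number: str
    gender: str
    animacy: str         # semantic tag, e.g. "HUMAN"
    stem_class: str      # e.g. "o-stem", "a-stem"
    clause_negated: bool = False
    in_supine_construction: bool = False


def is_clear_genitive_accusative(token):
    """True for human masculine singular genitive-marked OBJs, excluding
    negated clauses, supine constructions and a-stem nouns."""
    return (
        token.relation == "OBJ"
        and token.case == "genitive"
        and token.number == "singular"
        and token.gender == "masculine"
        and token.animacy == "HUMAN"
        and token.stem_class != "a-stem"
        and not token.clause_negated
        and not token.in_supine_construction
    )


# Hypothetical tokens modelled on the translations of (8) and (9).
son = Token("son", "OBJ", "genitive", "singular", "masculine", "HUMAN", "u-stem")
husband = Token("husband", "OBJ", "genitive", "singular", "masculine",
                "HUMAN", "o-stem", clause_negated=True)
hits = [t.gloss for t in (son, husband) if is_clear_genitive_accusative(t)]
```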
HUMAN things that look and act like humans, including deities and
spirits
ORG a collectivity of humans with some degree of group identity
We have also tagged for e.g. verb semantics and preposition semantics in the
Greek. As an example of a non-semantic use of the annotation layer, we have also
tagged OCS denominal adjectives for their derivational suffixes, so that we can pull
out e.g. possessive -ov- and -j- adjectives separately.
6.4 Information structure
Since the core interest of the PROIEL project is information-structuring
grammatical devices, we are in the process of annotating the texts for information structure.
This is a difficult field, where many of the core terms are under heated discussion. We have
tried to opt for annotation that can be implemented with high intersubjective agreement, and
have so far been annotating for givenness and anaphoric distance.
In the givenness annotation we try to answer the question 'How can the hearer
establish the referent of an NP?' Our tags are based on which context the hearer uses
to establish reference.
Context Tag
Discourse (anaphora) OLD
Situation (deixis) ACC-sit
Scenarios (inferences) ACC-inf
Encyclopedic knowledge ACC-gen
No context (no extra-NP information) NEW
Thus, a referent is only OLD if it has been explicitly mentioned in the previous
discourse. It can be accessible in different ways: pointed at in the situation (in dialogue,
ACC-sit), inferred from something that is already mentioned (e.g. an arm inferred from a
man, ACC-inf) or generally known to any 1st century Hellenised Jew (Moses, Jerusalem,
ACC-gen). Only if the NP is unavailable from any of these contexts is it tagged as NEW.
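The tagging decision can be summarised as a lookup that falls back to NEW. In the sketch below the context labels and the function name `givenness_tag` are hypothetical. The order in which contexts are checked when several are available is our assumption, taken from the order of the table.

```python
# Contexts and their tags, in the order of the table above.
GIVENNESS_TAGS = [
    ("discourse", "OLD"),
    ("situation", "ACC-sit"),
    ("scenario", "ACC-inf"),
    ("encyclopedic", "ACC-gen"),
]


def givenness_tag(available_contexts):
    """Return the tag for the first context (in table order) from which
    the hearer can establish the referent, or NEW if there is none."""
    for context, tag in GIVENNESS_TAGS:
        if context in available_contexts:
            return tag
    return "NEW"


# 'an arm' inferred from an already mentioned man.
inferred = givenness_tag({"scenario"})
```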
When a referent is annotated as OLD, and thus has been explicitly mentioned in
the previous discourse, we also link it to its antecedent. This is a relatively objective
measure: annotators almost always agree on a referent's antecedent. Thus we have a
quite reliable point of departure for calculating e.g. context saliency, which can be
stated in terms of the length and density of an anaphoric chain: a referent in a long
and dense anaphoric chain must be expected to be very salient in the context, which
may affect its grammatical behaviour.
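How chain length and density might be computed from the antecedent links can be sketched as follows. The text does not define density precisely, so the measure used here (mentions per sentence spanned by the chain) is an assumption, as are the function names and the input format: a mapping from each OLD mention to its antecedent, and a mapping from mentions to sentence numbers.

```python
def anaphoric_chain(mention_id, antecedent):
    """Follow antecedent links back from a mention to its first mention."""
    chain, seen = [mention_id], {mention_id}
    while chain[-1] in antecedent:
        previous = antecedent[chain[-1]]
        if previous in seen:
            raise ValueError("antecedent links form a cycle")
        chain.append(previous)
        seen.add(previous)
    return chain


def chain_stats(mention_id, antecedent, sentence_of):
    """Return (length, density) of the chain ending in mention_id.

    Length is the number of mentions; density is taken to be the number
    of mentions divided by the number of sentences the chain spans.
    """
    chain = anaphoric_chain(mention_id, antecedent)
    sentences = [sentence_of[m] for m in chain]
    span = max(sentences) - min(sentences) + 1
    return len(chain), len(chain) / span


# Hypothetical data: mention 3 refers back to 2, which refers back to 1.
antecedent = {3: 2, 2: 1}
sentence_of = {1: 10, 2: 11, 3: 14}
length, density = chain_stats(3, antecedent, sentence_of)
```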
al. 2009a, Haug et al. 2009b, Eckhoff et al. forthcoming and Eckhoff & Haug
forthcoming (the two latter specifically on OCS). The downside is that complex data also
make for complex queries, and a simple query interface would simply not do the data
justice. Therefore, the PROIEL corpus so far does not have an integrated query interface.
However, PROIEL data can be freely exported in several XML formats at
http://foni.uio.no:3000/site/public_data
Most usefully, the data can be downloaded in TigerXML. The XML file may then be
read into TIGERSearch, a search engine specialised for syntactic queries with an intuitive
graphic user interface, which allows the user to draw partial syntactic trees and query the
syntactic relations and various properties of the lexical nodes of the tree.6 The search engine
makes it possible to retrieve information from most layers of the PROIEL corpus.
In querying for part of speech and morphological features, it is important to
know what the entry for a lexical node looks like. The form (Mark 2:18)
currently has a representation that looks like this:
<t tense_mood_voice="---" pos="A-" case_number="np" strength="s" lemma=""
word="" inflection="i" person_number="-p" id="w539659" degree="p"/>
We see that the database's original positional tag is split into various (bundles of)
features, so that lemma, part of speech, tense/mood/voice, person/number, case/number,
degree etc. may be queried separately. The TigerXML export currently includes lemma
information and morphological and syntactic annotation. By the end of the project period
all other layers of annotation and the token alignments will also be published.
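Besides TIGERSearch, the exported file can also be queried with a short script, since each lexical node is an XML element with one attribute per feature bundle. The sketch below is a minimal illustration using Python's standard library. The surrounding element structure of the sample, the placeholder values for `word` and `lemma`, and the function name `find_terminals` are assumptions; only the attribute names and the values of the first `<t>` element follow the entry shown above.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for an exported file; word and lemma are placeholders.
SAMPLE = """
<corpus>
  <body>
    <s id="s1">
      <graph>
        <terminals>
          <t id="w1" word="FORM1" lemma="LEMMA1" pos="A-"
             tense_mood_voice="---" case_number="np" strength="s"
             inflection="i" person_number="-p" degree="p"/>
          <t id="w2" word="FORM2" lemma="LEMMA2" pos="Nb"
             tense_mood_voice="---" case_number="np" strength="-"
             inflection="i" person_number="-p" degree="-"/>
        </terminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def find_terminals(root, **features):
    """Return all <t> elements whose attributes equal the given values."""
    return [
        t for t in root.iter("t")
        if all(t.get(name) == value for name, value in features.items())
    ]


root = ET.fromstring(SAMPLE.strip())
adjectives = [t.get("id") for t in find_terminals(root, pos="A-", case_number="np")]
```

For a real export, `ET.parse(path).getroot()` would replace the sample string. Queries over syntactic structure are better left to TIGERSearch.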
9 Conclusions
This paper has presented some of the possibilities offered by the PROIEL
corpus to linguists working on OCS. The corpus currently includes only one OCS
text, but is unique in offering detailed, multilayered annotation of morphology,
syntax, information structure and semantics. It is also possible for individual scholars
to add customised annotation. A crucial feature for OCS, where almost the entire text
canon is translated from Greek, is the token alignment with the Greek original, which
enables scholars to do sophisticated contrastive work, combining information from
various annotation layers. Thus, the PROIEL corpus is a very useful electronic
linguistic resource for the study of OCS. The reviewed morphological and syntactic analyses in the
corpus are already freely available to all registered users, and more data will be made
available within the project period. The data can be downloaded in various formats.
The PROIEL application so far offers limited query possibilities, but the data can be
downloaded in TigerXML and queried in TIGERSearch.
References
Cejtlin, R. M. (ed.) (1999): Staroslavjanskij slovar'. Moscow: Russkij jazyk.
Eckhoff, H. & D. Haug (forthcoming): Aligning syntax in early New Testament texts: the
PROIEL corpus, Wiener Slawistischer Almanach.
Eckhoff, H., D. Haug & M. Majer (forthcoming): Making the Most of the Data: Old Church Slavic
and the PROIEL Corpus of Old Indo-European Bible Translations. PALC proceedings.
Haug, D. T. T., M. L. Jøhndal, H. M. Eckhoff, E. Welo, M. J. B. Hertzenberg & A. Müth (2009a):