THE PROIEL CORPUS AS A SOURCE TO OLD CHURCH SLAVIC: A
PRACTICAL INTRODUCTION
Hanne Eckhoff, Dag Haug (Oslo, Norway)
1 The PROIEL project and corpus

Pragmatic Resources in Old Indo-European Languages (PROIEL) is a research
project at the University of Oslo, Norway, devoted to the study of the morphosyntactic
means of expressing information structure in Greek, Latin, Gothic, Classical Armenian
and Old Church Slavic (OCS), more specifically word order, pronominal reference,
definiteness, participle use as a means of backgrounding and discourse particles. A core
concern in the project is to create a parallel treebank of the Greek New Testament (NT)
and its earliest translations into the other project languages. The corpus is essentially a
corpus for linguists, so the focus is on making the most of a limited data set for the
purposes of linguistic research. Therefore it features rich and many-layered annotation not
only for morphology and syntax, but also for information structure and a series of
semantic and other features. Each of the translations is automatically aligned with the
Greek original at token level.
Considering the state of the earliest sources for Slavic, such a treebank is a
unique resource for OCS. The OCS text canon consists almost entirely of translations
from Greek, and the Gospel translations are at the core of the canon. Therefore any
serious study of OCS grammar must always have a contrastive eye on Greek, and the
PROIEL corpus's version of the Codex Marianus is a good tool in this respect. This
paper is a practical introduction to the structure and applicability of the corpus.
2 Prior electronic resources to OCS
There are no similar resources for OCS available. What is available online is
mostly based on Jouko Lindstedt's digitisations of the central texts of the OCS canon
in 7-bit ASCII in the Corpus Cyrillo-Methodianum Helsingiense (CCMH).1 The same
texts are also available from the TITUS project.2 Neither CCMH nor TITUS offers any
annotation of the texts. The USC Parsed Corpus of Old South Slavic Texts,3 also
mostly based on Lindstedt's digitisations, has morphological annotation and
glossing, but is not lemmatised and does not offer syntactic annotation.
3 Text processing
The main goal of the PROIEL corpus is to amass linguistic knowledge. We
therefore do not aim at representing the manuscripts as such, but base ourselves on
editions. We also limited ourselves to using already available electronic versions. For
the Greek, we use the Tischendorf (1869-1872) edition of the Greek NT, as prepared
by Ulrik Sandborg-Petersen.4 For the OCS, we use Jagić's (1883) edition of the
Codex Marianus in the CCMH electronic version, and have made some use of the
morphological analysis in the USC Parsed Corpus of Old South Slavic Texts.
1 http://www.helsinki.fi/slaavilaiset/ccmh/ (the texts are Codex Assemanianus, Codex Marianus, Codex
Suprasliensis, Codex Zographensis, Savvina kniga, and outside the canon, Vita Constantini and Vita Methodii).
2 http://titus.fkidg1.uni-frankfurt.de/framee.htm?/texte/texte2.htm#aksl (TITUS also has the Kiev Folia
and the Prague fragments).
3 http://www-rcf.usc.edu/~pancheva/ParsedCorpus.html (the corpus contains Codex Marianus, Codex
Zographensis B, the Sluck Psalter, Vita Constantini, Vita Methodii and seven inscriptions. The corpus is
not online, but can be obtained from Roumyana Pancheva at the USC).
Even though the USC corpus is based on the CCMH text, there are
discrepancies and inconsistencies between them and numerous deviations from the
printed text, mostly mistypings. We therefore harmonised the two texts and checked
them against the printed edition. We used the USC corpus annotation mainly to
extract closed-class morphological rules, and also made use of annotators' comments
to correct errors from the CCMH. The whole text was then encoded in Unicode 5.1.
The resulting text is a faithful presentation of Jagić's edition using TEI
concepts, though leaving out some of the details relating to manuscript structure, such
as page numbers and line breaks. For the linguistic analysis of the texts, we needed to
divide the text into sentences and tokens. At the level of linguistic analysis some units
also had to be left out, such as incipits, end marks and deleted words. To still preserve
a maximally faithful rendition of the edition, we chose a three-layered structure:
• Each source division (e.g. chapter in the New Testament texts) contains the
  TEI-light encoding of the edition we use.
• Each sentence contains the relevant part of the source division text, but with
  possible further information (expansion of abbreviation) and a more
  structured markup which reflects the tokenisation.
• Each token just has a single, possibly normalised form.
The sentence and token representations are kept synchronised, whereas the
source division representation reflects the printed edition as closely as possible.
The tokenisation was relatively straightforward, as spacing between words
mostly indicates a token boundary. The main complication was contractions, mostly of
prepositions and nouns or nouns and demonstrative pronouns. In those cases, the
contractions were split into two tokens and in some cases normalised. Thus, the
source division text of Luke 4:27 has [...], whereas the tokenised version
dissolves and normalises [...] and has [...]. Apart
from such cases, we generally do not dissolve abbreviations.
The division into sentences was more challenging. It should be remarked that due
to the NT style such segmentation will to some extent be arbitrary, since the text abounds
in introductory i/kai's, which cannot really be reliably distinguished from real
coordinations. However, for finding units with a matrix verb and its dependents, neither
the printed nor the electronic editions provide much help. The previous electronic editions
are not segmented into sentences, and all editions use the original punctuation, which is
not a reliable guide. The full stop is virtually the only punctuation mark in use, and
typically (but inconsistently) marks out smaller units than the sentence, often separating
participial constructions, prepositional phrases and dependent clauses from the matrix
verb on the one hand (1), but not necessarily marking obvious sentence boundaries, such
as between speech verb and direct speech (2).
4 http://files.morphgnt.org/tischendorf/
(1) [...]
"While they were listening to this, he went on to tell them a parable, because he was
near Jerusalem and the people thought that the kingdom of God was going to appear
at once." (Lk. 19:11)
(2) [...]
"He said, 'I will judge you by your own words, you wicked servant!'" (Lk. 19:22)
Our solution was to split the text into chunks by punctuation, and then group the
chunks into sentences by comparing them to the Greek: orienting by text structure (chapter
and verse) the Gale-Church algorithm tries to make the best match with the length of the
corresponding Greek sentence. The method was fairly successful, but left some adjustments
to be made manually: both manual sentence splits and sentence mergers have been
necessary, and are done by the annotators/reviewers with the annotation tool.
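The grouping step can be illustrated with a deliberately simplified sketch. It is not the Gale-Church algorithm itself, which scores candidate pairings with a probabilistic length model; it only shows the underlying idea of choosing chunk boundaries so that the grouped lengths match the Greek sentence lengths within one verse. The function name `group_chunks`, the squared-difference cost and the sample lengths are assumptions made for illustration.

```python
from functools import lru_cache


def group_chunks(chunk_lengths, greek_lengths):
    """Split consecutive chunks into len(greek_lengths) groups.

    Returns (cost, cuts), where cuts[j] is the index just past the last
    chunk of group j. The cost is a plain sum of squared length
    differences, a stand-in for the Gale-Church length model.
    """
    n, k = len(chunk_lengths), len(greek_lengths)
    prefix = [0]
    for length in chunk_lengths:
        prefix.append(prefix[-1] + length)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest grouping of chunks i..n-1 into Greek sentences j..k-1.
        if j == k:
            return (0.0, ()) if i == n else (float("inf"), ())
        best_cost, best_cuts = float("inf"), ()
        for end in range(i + 1, n + 1):
            size = prefix[end] - prefix[i]
            rest_cost, rest_cuts = best(end, j + 1)
            cost = (size - greek_lengths[j]) ** 2 + rest_cost
            if cost < best_cost:
                best_cost, best_cuts = cost, (end,) + rest_cuts
        return best_cost, best_cuts

    return best(0, 0)


# Hypothetical character lengths for one verse: five punctuation-delimited
# OCS chunks and two Greek sentences.
cost, cuts = group_chunks([10, 12, 30, 5, 6], [22, 41])
print(cuts)
```

With these invented lengths the first two chunks are grouped into one sentence and the remaining three into the other. Cases that such a length-only criterion gets wrong are the ones left for manual splitting and merging.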
4 Annotation workflow
The annotation was done manually by student annotators, but aided in different
ways by the annotation tool. First they made manual sentence division adjustments if
necessary (cf. section 3). The next step was lemmatisation and morphological
disambiguation. For the Greek text we had good morphological analysis and lemmatisation
from the MorphGNT source text, and the morphological annotation was a matter of brief
control and some disambiguation. For the Codex Marianus, on the other hand, we had only
a set of closed-class rules extracted from the USC parsed corpus version. The lemmata
therefore had to be entered manually, following the lemmatisation in Cejtlin (1999) closely.
Also, full morphological analysis had to be done for most items. Reviewed annotations
were then stored and used for morphological guesses in future annotation.
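How stored analyses can feed later guesses may be sketched as a simple frequency lookup. This is only an illustration of the idea, not the tool's actual implementation; the class name `MorphGuesser`, its methods and the placeholder forms, lemmata and tags are all hypothetical.

```python
from collections import Counter, defaultdict


class MorphGuesser:
    """Suggest (lemma, tag) analyses for a form from reviewed annotations."""

    def __init__(self):
        # form -> Counter of (lemma, positional tag) analyses seen so far
        self._seen = defaultdict(Counter)

    def store(self, form, lemma, tag):
        """Record one reviewed analysis of a form."""
        self._seen[form][(lemma, tag)] += 1

    def guess(self, form):
        """Return known analyses of the form, most frequent first."""
        counts = self._seen.get(form, Counter())
        return [analysis for analysis, _ in counts.most_common()]


# Placeholder data, not taken from the corpus.
guesser = MorphGuesser()
guesser.store("form_a", "lemma_x", "-pppaqg--wi")
guesser.store("form_a", "lemma_x", "-pppaqg--wi")
guesser.store("form_a", "lemma_y", "-s---mg---i")
print(guesser.guess("form_a"))
```

An unseen form simply yields an empty list, which corresponds to the case where the annotator has to enter the full analysis by hand.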
The next step was manual syntactic annotation with a simple clicking tool
which provided good guesses from a set of morphologically based rules. Finally, the
whole analysis was reviewed by project members and, when found correct, published
on our website, accessible to all registered users.
5 The PROIEL web application
The PROIEL web application was originally intended as an annotation tool, but is
convenient for studying the text as well. The page structure reflects the database
structure. After login, the user enters an index of source divisions. Each source
division (i.e. NT chapter) page displays the presentation-form text (i.e. the text with
contractions undissolved and including such things as deletions, lacuna marks, end
marks etc.) and an index of all the sentences in the source division.
The sentence page is the main source of information. Each sentence in the corpus has
such a page. It displays the morphological annotation, the syntactic analysis and tokenisation.
The morphological annotation also has links to the lemma page of each annotated token. The
lemma page gives information on part of speech, lemma-level semantic tags, and also some
frequency statistics and an index of all sentences containing that particular lemma. Likewise,
the tokenisation index has links to the token page of each individual token. The token page
contains a lot of low-level information, including the positional morphological tag, which is
necessary for morphological queries, as we shall see in section 8.
The web application does not include a query interface with the possibility to
do morphological and syntactic queries, but there are several opportunities for
exporting data and doing queries in external applications (section 8). However, it does
offer the possibility of simple searching in tokens and lemmata with wildcards.
6 Annotation
In this section we give a brief survey of the annotation layers in the PROIEL corpus.
6.1 Lemmatisation and morphological annotation
The lemmatisation follows part of speech. A single form may therefore belong to several
lemmata. For example, there are no fewer than four lemmata with the form [...]: a
subjunction, a relative adverb and two regular adverbs that are deemed to have
sufficiently different functions to be separated (one meaning "as, like", and the other
serving as an introductory "for").
Nb common noun
Ne proper noun
Pp personal pronoun
Pr relative pronoun
Pd demonstrative pronoun
Ps possessive pronoun
Pk personal reflexive pronoun
Pt possessive reflexive pronoun
Pc reciprocal pronoun
Pi interrogative pronoun
Px indefinite pronoun
Ma cardinal numeral
Mo ordinal numeral
A- adjective
S- article
Dq relative adverb
Du interrogative adverb
Df other adverb
R- preposition
C- conjunction
G- subjunction
F- foreign word
I- interjection
Table 1. Part-of-speech tags

In the PROIEL application, the morphology is represented with a full
morphological description, e.g. a form like [...] is displayed as
"participle, present, active, genitive, plural, masculine, feminine or neuter, weak". In
the database, however, the actual morphological tag is a positional tag, -pppaqg--wi.
The full set of positions and tags is found in Table 2, with adjustments for the
individual languages. In the TigerXML export format, the positional tag is split to
simplify queries (see section 8).
index  feature     values (tag in parentheses)
1      person      (1), (2), (3)
2      number      (s)ingular, (d)ual, (p)lural
3      tense       (p)resent, (i)mperfect, p(l)uperfect, (a)orist, (f)uture, pe(r)fect, fu(t)ure perfect, re(s)ultative, (u) past
4      mood        (i)ndicative, (s)ubjunctive, i(m)perative, (o)ptative, i(n)finitive, (p)articiple, gerun(d), (g)erundive, s(u)pine
5      voice       (a)ctive, middl(e) or passive, (m)iddle, (p)assive
6      gender      (m)asc., (f)em., (n)eut., (o) masculine or neuter, (p) masculine or feminine, (r) feminine or neuter, (q) masculine, feminine or neuter
7      case        (n)ominative, (v)ocative, (a)ccusative, (g)enitive, (d)ative, a(b)lative, (i)nstrumental, (l)ocative, (c) genitive or dative
8      degree      (p)ositive, (c)omparative, (s)uperlative
9      animacy     (i)nanimate, (a)nimate
10     strength    (w)eak, (s)trong, (t) weak or strong
11     inflection  (i)nflecting, (n)on-inflecting
Table 2. Morphological tags
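To make the positional encoding concrete, the following sketch decodes a tag into a feature dictionary using the positions and values of Table 2. The names `FIELDS`, `VALUES` and `decode_tag` are hypothetical and not part of the PROIEL software; the sketch also assumes that a hyphen marks a feature that does not apply, as in the tag -pppaqg--wi quoted above.

```python
FIELDS = ["person", "number", "tense", "mood", "voice", "gender",
          "case", "degree", "animacy", "strength", "inflection"]

VALUES = {
    "person": {"1": "first", "2": "second", "3": "third"},
    "number": {"s": "singular", "d": "dual", "p": "plural"},
    "tense": {"p": "present", "i": "imperfect", "l": "pluperfect",
              "a": "aorist", "f": "future", "r": "perfect",
              "t": "future perfect", "s": "resultative", "u": "past"},
    "mood": {"i": "indicative", "s": "subjunctive", "m": "imperative",
             "o": "optative", "n": "infinitive", "p": "participle",
             "d": "gerund", "g": "gerundive", "u": "supine"},
    "voice": {"a": "active", "e": "middle or passive",
              "m": "middle", "p": "passive"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter",
               "o": "masculine or neuter", "p": "masculine or feminine",
               "r": "feminine or neuter",
               "q": "masculine, feminine or neuter"},
    "case": {"n": "nominative", "v": "vocative", "a": "accusative",
             "g": "genitive", "d": "dative", "b": "ablative",
             "i": "instrumental", "l": "locative",
             "c": "genitive or dative"},
    "degree": {"p": "positive", "c": "comparative", "s": "superlative"},
    "animacy": {"i": "inanimate", "a": "animate"},
    "strength": {"w": "weak", "s": "strong", "t": "weak or strong"},
    "inflection": {"i": "inflecting", "n": "non-inflecting"},
}


def decode_tag(tag):
    """Turn an 11-character positional tag into a feature dictionary."""
    if len(tag) != len(FIELDS):
        raise ValueError("expected %d positions, got %d" % (len(FIELDS), len(tag)))
    features = {}
    for name, code in zip(FIELDS, tag):
        if code != "-":  # hyphen: feature not applicable
            features[name] = VALUES[name][code]
    return features


decoded = decode_tag("-pppaqg--wi")
print(decoded)
```

For the tag quoted in the text this yields plural, present, participle, active, "masculine, feminine or neuter", genitive, weak and inflecting, which matches the full description displayed by the application.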
We disambiguate the morphology as far as possible based on the syntax and
context. An important exception is gender, which we do not disambiguate in
adjectival usages. In a noun phrase such as [...], the gender of the head noun will of
course be masculine, but the gender of the adjective will be tagged as
masculine/neuter. The reason for this is purely practical: gender disambiguation
proved to spawn inconsistent annotations because of wavering agreement patterns.
We therefore have a variety of gender supertags.
In the annotation of OCS we resolve some difficult issues by resorting to the
interplay between morphology and syntax. In particular this is important in our
annotation of genitive-accusatives. We do not attempt to distinguish between real
genitives and genitive-accusatives in the morphology. Instead, we tag all genitive-shaped
nominals as morphological genitives, and leave the distinction to be made in
the syntactic annotation (see section 6.2).
6.2 Syntactic annotation

The syntactic annotation scheme is based on dependency grammar, which is
close to both Western and Eastern traditional grammar. As can be seen from Table 3,
most of the syntactic relation labels are immediately familiar.5
Tag Function
PRED predicate
SUB subject
OBJ object
OBL oblique (of verb or preposition)
COMP complement
AG agent
ADV adverbial
XADV open adverbial complement
XOBJ open objective complement
ATR attribute
APOS apposition
NARG nominal argument
PART adnominal partitive
AUX auxiliary
VOC vocative
Table 3. Syntactic relation labels
Dependency syntax has several advantages: it relies on overt elements, which
makes it computationally attractive. It also makes it possible to keep word order and
syntactic analysis in separate layers, which is essential in dealing with free-word-order
languages such as OCS and Greek. However, it also has some inherent problems.
5 Our guidelines for syntactic annotation can be found at http://folk.uio.no/daghaug/syntactic_guidelines.pdf
The reliance on overt elements becomes problematic when heads are missing:
in dependency grammar every node must have a head. However, both asyndetic
coordinations (with no conjunction) and elided verbs are quite common in our texts.
We therefore allow very limited use of empty conjunction and verb nodes.
In (3) we find three empty verbs coordinated by a null conjunction.
(3) [...]
"some (say) John the Baptist, others Elijah, others again Jeremiah or one of
the prophets" (Matthew 16:14)
Dependency grammar is also unable to express structure sharing, i.e. structures
where an element in the matrix clause also serves as the subject of a nonfinite verb, since
each node can have only one head. To make our annotation more expressive, we have
added secondary dependencies to represent this. Example (4) illustrates how we analyse
sentences with predicative participles. The participle is dependent on the matrix verb by
the relation XADV (open adverbial complement), but also has a secondary dependency
(marked XSUB) to the SUB of the matrix verb, which it shares.
(4) [...]
"and he answered them saying" (Mark 3:33)
Example (5) illustrates how we analyse syntactically required open complements (i.e.
complements without an internal subject), such as infinitives dependent on an
auxiliary verb. In this case, the infinitive again depends on the matrix verb via the
relation XOBJ, and has a secondary dependency to the SUB of the matrix verb.
(5) [...]
"How can Satan drive out Satan?" (Mark 3:23)
The subject of the XOBJ or XADV does not have to be the SUB of the matrix verb,
as we see in (6).
(6) [...]
"Let me first go and bury my father" (Matthew 8:21)
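The combination of one primary head per node with additional secondary dependencies can be sketched as a small data structure. The sketch below models example (5) with English glosses instead of the OCS tokens; the class `Node`, the helper `shared_subject` and the attachment of the object to the infinitive are assumptions made for illustration, not the PROIEL database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    gloss: str                      # English gloss standing in for the token
    relation: str                   # primary relation to the head
    head: Optional["Node"] = None   # exactly one primary head (None for the root)
    # Secondary dependencies as (label, target) pairs, e.g. ("XSUB", subject).
    secondary: List[Tuple[str, "Node"]] = field(default_factory=list)


def shared_subject(node):
    """Return the node that an open complement shares as its subject, if any."""
    for label, target in node.secondary:
        if label == "XSUB":
            return target
    return None


# "How can Satan drive out Satan?" (the adverb is omitted for brevity)
can = Node("can", "PRED")
satan_subj = Node("Satan", "SUB", head=can)
drive_out = Node("drive out", "XOBJ", head=can)
satan_obj = Node("Satan", "OBJ", head=drive_out)
drive_out.secondary.append(("XSUB", satan_subj))

print(shared_subject(drive_out).relation)
```

Because the secondary edge points at a node rather than at a relation label, the same mechanism covers cases like (6), where the shared element is not the SUB of the matrix verb.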
These analyses can be contrasted with the analysis of the dative absolute, which
comes with its own subject. As we see in (7), the participle is taken as an ADV on the
matrix verb, whereas the dative NP is taken as a SUB on the participle itself.
(7) [...]
"When he came they spread their robes on the road" (Luke 19:36)
There can sometimes be doubt whether a node should have the one or the other
relation, in particular it can be difficult to distinguish between objects and obliques, and
obliques and adverbials. For these cases we employ a small set of supertags.
Tag Function
ARG argument (object or oblique)
PER peripheral (oblique or adverbial)
NONSUB non-subject (object, oblique or adverbial)
REL apposition or attribute
ADNOM adnominal
Table 4. Syntactic supertags
Of these supertags, ARG is used systematically in the rendering of the OCS
genitive-accusative. As we saw in section 6.1, there is no separate morphological tag
for the genitive-accusative; such forms are all tagged as morphological genitives. The
motivation behind this is that there are so many cases where we cannot really tell
whether a given genitive-shaped nominal is really in the genitive-accusative. The
complicating factor is mainly negation, but also the fact that some verbs have
uncertain valency and may require either the accusative or the genitive. Our solution
is to tag all genitive-shaped arguments of regular transitive verbs as OBJs, whether
the genitive is due to animacy, negation or partitivity. Verbs that regularly require the
genitive take OBLs under all circumstances. Genitive-shaped animate arguments of
verbs with uncertain valency, on the other hand, are tagged with the supertag ARG,
and so are all genitive-shaped arguments of the same group of verbs when negated.
In (8), we see a regular genitive-accusative [...] with the transitive verb [...].
It is tagged as a genitive in the morphology, but in the syntactic analysis, it is an OBJ.
(8) [...]
"The Father loves the Son and has placed everything in his hands" (John 3:35)
We get the same analysis when we have a negated transitive verb with an animate
object, as in (9). In queries for indisputable genitive-accusatives all negated sentences
have to be filtered out in any case.
(9) [...]
"I have no husband" (John 4:17)
In (10) we see the verb [...], which consistently requires the genitive. In this
case, the argument [...] is again taken as a genitive in the morphology, but
as an OBL in the syntax.
(10) [...]
"Why do we need more witnesses?" (Mark 14:63)
In (11), we cannot strictly determine whether the genitive-marking on [...] is a
genitive-accusative or triggered by the verb [...]. The genitive-marked argument
therefore gets the supertag ARG.
(11) [...]
"and having seen Jesus, he fell on his face and begged him, saying" (Luke 5:12)
A query for indisputable instantiations of the genitive-accusative would have
to include only human masculine singular genitive-marked OBJs, excluding negated
examples, examples in supine constructions and a-stem nouns. For complements of
prepositions, the genitive-accusatives must be identified preposition by preposition.
All this information is available in different layers of the PROIEL corpus: the lemma,
morphology and syntax layers, and animacy status is available from the semantic
layer (next section).
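The conditions for such a query can be spelled out as a filter over token records. The sketch below assumes that each token has already been exported as a dictionary; the field names (`relation`, `morph`, `head_morph`, `animacy`, `negated`, `a_stem`) and the sample records are hypothetical, and only the tag positions follow Table 2.

```python
def is_clear_genitive_accusative(token):
    """True for human masculine singular genitive-marked OBJs, excluding
    negated clauses, supine constructions and a-stem nouns."""
    tag = token["morph"]             # 11-position tag of the token itself
    head_tag = token["head_morph"]   # 11-position tag of the governing verb
    return (
        token["relation"] == "OBJ"
        and tag[1] == "s"                  # position 2: singular
        and tag[5] == "m"                  # position 6: masculine
        and tag[6] == "g"                  # position 7: genitive
        and token["animacy"] == "HUMAN"    # from the semantic layer
        and not token["negated"]
        and head_tag[3] != "u"             # position 4: head is not a supine
        and not token["a_stem"]
    )


# Invented records for illustration only.
tokens = [
    {"id": 1, "relation": "OBJ", "morph": "-s---mg---i",
     "head_morph": "3spia-----i", "animacy": "HUMAN",
     "negated": False, "a_stem": False},
    {"id": 2, "relation": "OBJ", "morph": "-s---mg---i",
     "head_morph": "3spia-----i", "animacy": "HUMAN",
     "negated": True, "a_stem": False},
    {"id": 3, "relation": "ARG", "morph": "-s---mg---i",
     "head_morph": "3spia-----i", "animacy": "HUMAN",
     "negated": False, "a_stem": False},
    {"id": 4, "relation": "OBJ", "morph": "-s---mg---i",
     "head_morph": "---u------i", "animacy": "HUMAN",
     "negated": False, "a_stem": False},
]
clear_ids = [t["id"] for t in tokens if is_clear_genitive_accusative(t)]
print(clear_ids)
```

Only the first record passes: the second is negated, the third carries the supertag ARG, and the fourth depends on a supine.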
6.3 Free tags

The corpus has a separate layer for user-defined tagging of any phenomenon,
on lemma and/or token level. The idea behind this layer is that a researcher working
on the corpus should also be able to give something back to the corpus, and add
customised annotation in his/her area of expertise. In this way we can ensure
replicability of the research in question, and also allow the annotation to be re-used
for other purposes by other scholars, thus avoiding duplication of effort.
So far this layer has mostly been used for semantic features. Very relevantly
to the Slavic, all Greek lemmata were tagged for animacy, and the animacy annotation
has since been transferred to the OCS by way of the token alignments (see section 7)
and controlled for errors. We used a slightly simplified version of Zaenen et al.'s
(2004) annotation scheme, as described in Table 5.
Animacy tag  Brief description
HUMAN        things that look and act like humans, including deities and spirits
ORG          a collectivity of humans with some degree of group identity
ANIMAL       non-human animates
CONCRETE     prototypical concrete objects or substances, excluding intangibles
VEH          vehicles
NONCONC      anything that is not prototypically concrete but clearly inanimate, e.g. events
PLACE        nominals that will normally serve as locations for human actions
TIME         expressions referring to periods of time
Table 5. Animacy tags
We have also tagged for e.g. verb semantics and preposition semantics in the
Greek. As an example of a non-semantic use of the annotation layer, we have also
tagged OCS denominal adjectives for their derivational suffixes, so that we can pull
out e.g. possessive -ov- and -j- adjectives separately.
6.4 Information structure
Since the core interest of the PROIEL project is information-structuring
grammatical devices, we are in the process of annotating the texts for information structure.
This is a difficult field, where many of the core terms are under heated discussion. We have
tried to opt for annotation that can be implemented with high intersubjective agreement, and
have so far been annotating for givenness and anaphoric distance.
In the givenness annotation we try to answer the question "How can the hearer
establish the referent of an NP?" Our tags are based on which context the hearer uses
to establish reference.
Context                                Tag
Discourse (anaphora)                   OLD
Situation (deixis)                     ACC-sit
Scenarios (inferences)                 ACC-inf
Encyclopedic knowledge                 ACC-gen
No context (no extra-NP information)   NEW
Thus, a referent is only OLD if it has been explicitly mentioned in the previous
discourse. It can be accessible in different ways: pointed at in the situation (in dialogue,
ACC-sit), inferred from something that is already mentioned (e.g. an arm inferred from a
man, ACC-inf) or generally known to any 1st century Hellenised Jew (Moses, Jerusalem,
ACC-gen). Only if the NP is unavailable from any of these contexts is it tagged as NEW.
When a referent is annotated as OLD, and thus has been explicitly mentioned in
the previous discourse, we also link it to its antecedent. This is a relatively objective
measure: annotators almost always agree on a referent's antecedent. Thus we have a
quite reliable point of departure for calculating e.g. context saliency, which can be
stated in terms of the length and density of an anaphoric chain: a referent in a long
and dense anaphoric chain must be expected to be very salient in the context, which
may affect its grammatical behaviour.
Example of information structure annotation. Red indicates OLD, green ACC-gen.
The blue line indicates an anaphoric link.
7 Alignment
The texts are aligned at the token level, i.e. each Slavic word has a pointer to
the Greek word of which it is the translation. This alignment can be performed
automatically with good results. First, a dictionary is created which, for each lemma
in the target language, ranks the potential originals in the source language using a
maximum likelihood measure based on co-occurrences in the same Bible verse.
Candidate translation pairs within the same Bible verse are then scored using the
dictionary as well as the linearisation numbers within the sentence, and the morphological
and syntactic information available. The process is repeated, with each iteration accepting
worse equivalents, but penalising alignments that imply a transposition of word order.
Because the translators have aimed at keeping the original word order, this approach gives
good results. Controls of the results for the Slavic translations show well over 90% success.
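The first step, building a ranked translation dictionary from verse-level co-occurrences, can be sketched as follows. The exact measure used in the project is not specified here, so the sketch uses a simple relative-frequency estimate (the share of verses containing the target lemma in which the source lemma also occurs) as a stand-in. The function name `build_dictionary` and the toy verse data are assumptions.

```python
from collections import Counter, defaultdict


def build_dictionary(verse_pairs):
    """Rank source-language lemmata for each target-language lemma.

    verse_pairs: iterable of (target_lemmas, source_lemmas), one pair per verse.
    Returns {target: [(source, score), ...]}, best candidate first.
    """
    cooccurrences = defaultdict(Counter)
    verses_with_target = Counter()
    for target_lemmas, source_lemmas in verse_pairs:
        for target in set(target_lemmas):
            verses_with_target[target] += 1
            for source in set(source_lemmas):
                cooccurrences[target][source] += 1
    ranked = {}
    for target, counts in cooccurrences.items():
        scored = [(source, n / verses_with_target[target])
                  for source, n in counts.items()]
        # Highest score first; ties broken alphabetically for stable output.
        ranked[target] = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy data in transliteration: three "verses" of (OCS lemmata, Greek lemmata).
verses = [
    (["i", "rešti"], ["kai", "lego"]),
    (["i", "iti"], ["kai", "erchomai"]),
    (["rešti"], ["lego"]),
]
dictionary = build_dictionary(verses)
print(dictionary["rešti"][0])
```

Such a dictionary only proposes candidates. The scoring of concrete token pairs within a verse additionally uses linear position and the morphological and syntactic annotation, as described above.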
The token alignments are obviously useful for researchers, as they allow
filtering of queries based on the source constructions, distinguishing e.g. participles
that are translations of Greek participles from those that are not. In this way, the
linguist can more easily distinguish translation effects from genuine Slavic syntax.
The token alignments will be made public at the end of the PROIEL project period.
The alignments are also helpful during the corpus creation, since it is often
possible to save work by tagging the original Greek text and porting the tags to the
translation languages, since functional features such as animacy, information structure
and others will often remain constant.
8 Queries
The PROIEL corpus contains complex and sophisticated data, which can be used
fruitfully by linguists for many purposes, see e.g. the preliminary applications in Haug et
al. 2009a, Haug et al. 2009b, Eckhoff et al. forthcoming and Eckhoff & Haug
forthcoming (the two latter specifically on OCS). The downside is that complex data also
make for complex queries, and a simple query interface would simply not do the data
justice. Therefore, the PROIEL corpus so far does not have an integrated query interface.
However, PROIEL data can be freely exported in several xml formats at
http://foni.uio.no:3000/site/public_data
Most usefully, the data can be downloaded in TigerXML. The xml file may then be
read into TIGERSearch, a search engine specialised for syntactic queries with an intuitive
graphic user interface, which allows the user to draw partial syntactic trees and query the
syntactic relations and various properties of the lexical nodes of the tree.6 The search engine
makes it possible to retrieve information from most layers of the PROIEL corpus.
In querying for part of speech and morphological features, it is important to
know what the entry for a lexical node looks like. The form [...] (Mark 2:18)
currently has a representation that looks like this:
<t tense_mood_voice="---" pos="A-" case_number="np" strength="s" lemma=""
word="" inflection="i" person_number="-p" id="w539659" degree="p"/>
We see that the database's original positional tag is split into various (bundles of)
features, so that lemma, part of speech, tense/mood/voice, person/number, case/number,
degree etc. may be queried separately. The TigerXML export currently includes lemma
information, morphological and syntactic annotation. By the end of the project period all
other layers of annotation and the token alignments will also be published.
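Outside TIGERSearch, the split attributes can also be queried with a few lines of code. The sketch below uses Python's standard xml.etree.ElementTree module on the terminal element quoted above, wrapped in a `terminals` element so that the fragment is well-formed; the helper `find_terminals` is hypothetical, and the lemma and word attributes are left empty as in the fragment shown.

```python
import xml.etree.ElementTree as ET

# The <t> element from the text, wrapped so that it parses on its own.
SAMPLE = """<terminals>
<t tense_mood_voice="---" pos="A-" case_number="np" strength="s" lemma=""
   word="" inflection="i" person_number="-p" id="w539659" degree="p"/>
</terminals>"""


def find_terminals(xml_text, **wanted):
    """Return all <t> elements whose attributes equal the given values."""
    root = ET.fromstring(xml_text)
    return [t for t in root.iter("t")
            if all(t.get(name) == value for name, value in wanted.items())]


# Adjectives in the nominative plural.
hits = find_terminals(SAMPLE, pos="A-", case_number="np")
hit_ids = [t.get("id") for t in hits]
print(hit_ids)
```

For a full export, the same function can be applied to the contents of the downloaded file. Queries over syntactic relations need the nonterminal part of the TigerXML graph and are better left to TIGERSearch.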
9 Conclusions
This paper has presented some of the possibilities offered by the PROIEL
corpus to linguists working on OCS. The corpus currently includes only one OCS
text, but is unique in offering detailed, multilayered annotation of morphology,
syntax, information structure and semantics. It is also possible for individual scholars
to add customised annotation. A crucial feature for OCS, where almost the entire text
canon is translated from Greek, is the token alignment with the Greek original, which
enables scholars to do sophisticated contrastive work, combining information from
various annotation layers. Thus, the PROIEL corpus is a very useful electronic
linguistic resource to OCS. The reviewed morphological and syntactic analyses in the
corpus are already freely available to all registered users, and more data will be made
available within the project period. The data can be downloaded in various formats.
The PROIEL application so far offers limited query possibilities, but the data can be
downloaded in TigerXML and queried in TIGERSearch.
6 TIGERSearch can be downloaded freely from http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/download/
References
Cejtlin, R. M. (ed.) (1999): Staroslavjanskij slovar'. Moscow: Russkij jazyk.
Eckhoff, H. & D. Haug (forthcoming): Aligning syntax in early New Testament texts: the
PROIEL corpus, Wiener Slawistischer Almanach.
Eckhoff, H., D. Haug & M. Majer (forthcoming): Making the Most of the Data: Old Church Slavic
and the PROIEL Corpus of Old Indo-European Bible Translations. PALC proceedings.
Haug, D. T. T., M. L. Jøhndal, H. M. Eckhoff, E. Welo, M. J. B. Hertzenberg & A. Müth (2009a):
Computational and linguistic issues in designing a syntactically annotated parallel corpus
of Indo-European languages, Traitement automatique des langues 50:2, 17-45.
Haug, D. T. T., H. M. Eckhoff, M. Majer & E. Welo (2009b): Breaking down and putting
back together: analysis and synthesis of New Testament Greek, Journal of Greek
Linguistics 9, 56-92.
Jagić, V. (1883): Quattuor Evangeliorum versionis palaeoslovenicae Codex Marianus
Glagoliticus, Berlin: Weidmann.
Tischendorf, C. V. (1869-1872): Novum Testamentum Graece, 8th edn, Leipzig: Hinrichs.
Zaenen, A., J. Carletta, G. Garretson, J. Bresnan, A. Koontz-Garboden, T. Nikitina, M. C.
O'Connor & T. Wasow (2004): Animacy Encoding in English: why and how,
in B. Webber, D. K. Byron (eds), ACL 2004 Workshop on Discourse Annotation,
Association for Computational Linguistics, Barcelona, Spain, pp. 118-125.
(abstract)
[Russian-language abstract on the PROIEL corpus as a resource for OCS (http://foni.uio.no:3000); the text is not recoverable.]
