3.2 Annotation Agreement
Reliability is a key property of good-quality corpora used for training and testing NLP tools. The current version of Inforex enables simultaneous and independent annotation of the same text sample by more than one annotator. Moreover, the annotation process coordinator may keep track of inter-annotator agreement between two raters thanks to the Agreement module, which uses the Positive Specific Agreement (PSA) measure (Hripcsak and Rothschild, 2005) to calculate reliability (see Figure 5a). The view configuration makes it possible to define the annotation layers, subsets or categories, users, and set of documents to be analysed. The coordinator may also specify a comparison mode: whether the system should take into account annotation boundaries only, or boundaries and categories. The comparison may also include annotation lemmas. Inter-annotator agreement is a very important indicator of the clarity and coherence of the annotation guidelines. Keeping track of changes in inter-annotator agreement between subsequent annotation iterations helps to improve the quality of the annotation guidelines. The Agreement module makes that process easier and faster.
The Inforex system also supports the curation of the annotation process (see Figure 5b). The curator can choose between the differing choices of two annotators, or even reject annotations that are consistent but incorrect. Thanks to that module, several Gold Standard projects were carried out, e.g. the Polish Coreference Corpus (Ogrodniczuk et al., 2015) for definite descriptions annotation and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions.
4 Applications
In the following sections we present several practical applications of the Inforex system.
4.1 KPWr
KPWr (Polish Corpus of Wrocław University of Technology) (Broda et al., 2012) is a corpus of written and spoken documents available under a Creative Commons license, intended primarily as training and testing material for NLP tools being developed at Wrocław University of Science and Technology. It is being successively enriched with annotation layers. Inforex recently supported manual text annotation within layers such as temporal expressions and their normalizations, events (with descriptions of event attributes), spatial expressions and semantic roles. In order to prepare the temporal expressions annotation (Kocoń et al., 2015), a new annotation scheme based on TimeML was added. These categories refer to a
date, time of day, duration and frequency of an event. The annotation lemmas perspective was used to provide normalized temporal expressions, which shows that the term 'lemma' in Inforex may function as a broad concept. The Annotator perspective of the system also supports event annotation (Marcińczuk et al., 2015). There are seven coarse-grained categories of events, i.e. action, state, reporting, perception, aspectual, intensional action and intensional state. The categorization was based on the TimeML guidelines with some modifications. It also involved the creation of a new annotation scheme. The flexibility in adding new annotation layers (defining new annotation categories) is one of the most important features. The possibility of establishing relations between annotated fragments is no less relevant. It was crucial e.g. for spatial expressions annotation, whose main goal was to extract the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects (Marcińczuk et al., 2016).
[Figure panel (a): Facebook conversation.]
4.2 European Legal Texts
As practice shows, although Inforex was primarily developed for the Polish language, it can also be used to work with documents written in other languages. Inforex features and functionalities are useful e.g. in examining current EU official literature related to territorial development and urban planning. The authors of this analysis first uploaded the EU Territorial Policy Documents 2007-2016 (see footnote 8) to the CLARIN-PL DSpace repository and then imported them into the Inforex system. The corpus was divided into 4 subcorpora and prepared for qualitative and quantitative analysis. The review of the key strands enabled the identification of its 8 core values (or principles) for further statistical and contextual analysis. After ascribing textual triggers (word forms) to each category, a quantitative analysis using word frequency lists generated by Inforex was performed. Manual annotation with a newly defined set of annotations and the Annotation Browser, with its option of exporting data, were a great support for qualitative analysis: a detailed contextual analysis of the corpus focused on two crucial categories, Participation and Communication.
8 http://hdl.handle.net/11321/316
4.3 Hebrew Corpus
Inforex supports manual annotation even if the text is written using a non-Latin alphabet and a right-to-
left notation. One of the applications of the system was related to a corpus of Hebrew gravestone inscriptions. It also involved the creation of a new annotation schema. The categories referred mainly to the pragmatic level of communication (e.g. initial and final expressions, laudations, death circumstances). The annotation lemmas perspective was used to enter Polish translations of annotated fragments, which also showed that the lemma attribute may be a broad term, especially in practical applications of the system.
[Figure panel (a): Inforex layout before modernization.]
4.4 Other Corpora
Inforex was used to prepare the training data for participation in the BSNLP 2017 shared task on multilingual named entity recognition, aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching (Marcińczuk et al., 2017). The system also supported the annotation of corpora constructed specifically for particular natural language processing tasks, e.g. the Polish Coreference Corpus for definite descriptions annotation and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions. This involved the creation of dedicated annotation layers. Importantly, in these tasks the new module of the system (Annotation Agreement and "2+1" annotation) was used for the first time, which significantly shortened the time needed to prepare annotated training and testing corpora.
5 Summary
The Inforex system, as a part of the CLARIN-PL infrastructure, is being gradually developed. Although its initial role was to construct high-quality linguistic resources for various natural language processing tasks, it has recently also been used by scientists for other purposes. We received important and constructive feedback from users during and after workshops related to CLARIN-PL tools and resources. As users have different needs, we identify the common functionalities and implement them as soon as possible in order to support their research tasks and provide new possibilities. We also faced the fact that many researchers from the field of digital humanities are not experienced users of such systems, so we made Inforex as easy and intuitive as possible.
Acknowledgments
Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.
References
Dominik Bas, Bartosz Broda, and Maciej Piasecki. 2008. Towards Word Sense Disambiguation of Polish. In Proceedings of the International Multiconference on Computer Science and Information Technology, IMCSIT 2008, Wisla, Poland, 20-22 October 2008. IEEE, pages 73–78. https://doi.org/10.1109/IMCSIT.2008.4747220.
Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation 47(4):1007–1029.
Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of LREC'12. ELRA, Istanbul, Turkey.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). http://tinyurl.com/gatebook.
Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) at COLING 2016. pages 76–84.
George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. J. of Am. Medical Informatics Association 12(3):296–298.
Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography.
Jan Kocoń, Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, and Michał Wolski. 2015. Temporal expressions in Polish corpus KPWr. Cognitive Studies | Études cognitives (15):293–317.
LDC. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Technical report, Linguistic Data Consortium.
M. Marcińczuk and M. Ptak. 2012. Preliminary study on automatic induction of rules for recognition of semantic relations between proper names in Polish texts, volume 7499 LNAI.
Michał Marcińczuk, Jan Kocoń, and Marcin Oleksy. 2017. Liner2 – a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, pages 86–91. http://www.aclweb.org/anthology/W17-1413.
Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, Jan Kocoń, and Michał Wolski. 2015. Towards an event annotated corpus of Polish. Cognitive Studies | Études cognitives (15):253–267.
Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał. 2011. Rich Set of Features for Proper Name Recognition in Polish Texts. In SIIS 2011. Springer.
Michał Mirosław Marcińczuk, Marcin Oleksy, and Jan Wieczorek. 2016. Towards recognition of spatial relations between entities for Polish. Cognitive Studies | Études cognitives (16):119–132.
Małgorzata Marciniak, Agnieszka Mykowiecka, and Katarzyna Głowińska. 2010. Anotowany korpus dialogów telefonicznych. In Małgorzata Marciniak, editor, Anotowany korpus dialogów telefonicznych, Akademicka Oficyna Wydawnicza EXIT, Warsaw, chapter Anotacja korpusu LUNA–WOZ.PL, pages 217–230.
Michał Marcińczuk, Jan Kocoń, and Maciej Janicki. 2013. Liner2 – a customizable framework for proper names recognition for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, pages 231–253.
Michał Marcińczuk, Monika Zaśko-Zielińska, and Maciej Piasecki. 2011. Structure annotation in the Polish corpus of suicide notes. In Ivan Habernal and Václav Matoušek, editors, Text, Speech and Dialogue, Springer Berlin Heidelberg, volume 6836 of Lecture Notes in Computer Science, pages 419–426.
Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference in Polish: Annotation, Resolution and Evaluation. Walter De Gruyter. http://www.degruyter.com/view/product/428667.
Adam Radziszewski and Maciej Piasecki. ????. A Preliminary Noun Phrase Chunker for Polish. Information Systems pages 169–180.
Pavel Rychlý. 2007. Manatee/bonito - a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masarykova univerzita, Brno, pages 65–70.
M. Zaśko-Zielińska. 2013. Listy pożegnalne: w poszukiwaniu lingwistycznych wyznaczników autentyczności tekstu. Quaestio. https://books.google.pl/books?id=QG60ngEACAAJ.
[Figure 5, panel (a): Summary of annotation agreement for a set of documents.]
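To make the agreement summary concrete, the following is a minimal sketch of the Positive Specific Agreement measure named in Section 3.2, following the definition in Hripcsak and Rothschild (2005): PSA = 2a / (2a + b + c), where a is the number of annotations made by both annotators and b and c are the numbers made by only one of them. It is not the Inforex implementation. The `Annotation` fields, the mode names and the empty-input convention are assumptions introduced here for illustration. The three modes mirror the comparison modes described in the text (boundaries only, boundaries and categories, optionally lemmas).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated span; field names are illustrative, not Inforex's schema."""
    start: int
    end: int
    category: str
    lemma: str = ""


def comparison_key(ann, mode):
    """Project an annotation onto the fields compared in the given mode."""
    if mode == "borders":
        return (ann.start, ann.end)
    if mode == "borders+categories":
        return (ann.start, ann.end, ann.category)
    if mode == "borders+categories+lemmas":
        return (ann.start, ann.end, ann.category, ann.lemma)
    raise ValueError(f"unknown comparison mode: {mode}")


def positive_specific_agreement(annotator_a, annotator_b, mode="borders"):
    """Return PSA = 2a / (2a + b + c) for two annotators' annotation sets."""
    keys_a = {comparison_key(x, mode) for x in annotator_a}
    keys_b = {comparison_key(x, mode) for x in annotator_b}
    both = len(keys_a & keys_b)    # a: annotations made by both annotators
    only_a = len(keys_a - keys_b)  # b: annotations made only by annotator A
    only_b = len(keys_b - keys_a)  # c: annotations made only by annotator B
    denominator = 2 * both + only_a + only_b
    if denominator == 0:
        # Assumed convention: two empty annotation sets agree fully.
        return 1.0
    return 2 * both / denominator


# Hypothetical annotations of the same text by two annotators.
ann_a = [
    Annotation(0, 5, "date"),
    Annotation(10, 18, "duration"),
    Annotation(25, 30, "time"),
]
ann_b = [
    Annotation(0, 5, "date"),
    Annotation(10, 18, "frequency"),
    Annotation(40, 45, "date"),
]

psa_borders = positive_specific_agreement(ann_a, ann_b, "borders")
psa_categories = positive_specific_agreement(ann_a, ann_b, "borders+categories")
print(f"boundaries only: {psa_borders:.3f}")
print(f"boundaries and categories: {psa_categories:.3f}")
```

In this toy example the two annotators agree on two of the spans but on the category of only one of them, so the stricter mode yields a lower score. A drop of this kind between modes suggests that the category definitions in the guidelines, rather than the span boundaries, are the source of disagreement.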