
Inforex — a collaborative system for text corpora annotation and analysis

Michał Marcińczuk Marcin Oleksy Jan Kocoń


G4.19 Research Group
Department of Computational Intelligence
Faculty of Computer Science and Management
Wrocław University of Technology, Wrocław, Poland
{michal.marcinczuk,marcin.oleksy,jan.kocon}@pwr.edu.pl

Abstract

We report the first major upgrade of Inforex, a web-based system for qualitative and collaborative annotation and analysis of text corpora. Inforex is part of the Polish CLARIN infrastructure¹. It is integrated with a digital repository for storing and publishing language resources² and allows users to visualize, browse and annotate text corpora stored in the repository. As a result of a series of workshops for researchers in the humanities and social sciences, we improved the graphical interface to make the system friendlier and more readable for inexperienced users. We also implemented new functionality for gold standard annotation, which includes private annotations and annotation agreement by a super-annotator.

1 Introduction

Digital humanities (DH) create new demands and challenges for the development of new and existing tools and systems for the manipulation, processing, analysis and visualization of text documents. CLARIN-PL, the Polish part of the CLARIN infrastructure, tries to rise to the challenges associated with DH for the Polish language. Among many other issues, there is a need for an intuitive and easy-to-use system for qualitative text corpora management, annotation, analysis and visualization. To fulfill these needs we are developing such a system, called Inforex. In this article we present the current state of its development.

The decision to create a system for text corpora annotation was taken in 2009, when there were no systems that supported collaborative work. At that time the only existing tools were desktop applications for individual work, such as GATE (Cunningham et al., 2011) or Manufakturzysta Luna (Marciniak et al., 2010). Since 2010 several systems have emerged, such as WebAnno 3 (Eckart de Castilho et al., 2016) and GATE Teamware (Bontcheva et al., 2013).

The first version of Inforex was released in 2010. Its initial role was to support the construction of corpus-based linguistic resources for various natural language processing tasks, including named entity recognition (Marcińczuk et al., 2011), shallow parsing (Radziszewski and Piasecki), word sense disambiguation (Bas et al., 2008) and recognition of semantic relations between named entities (Marcińczuk and Ptak, 2012). It was used to develop two resources for Polish that were major at the time: the Corpus of Wrocław University of Technology, called KPWr (Broda et al., 2012) (within the NEKST³ project), and the Corpus of Economic News (CEN) (Marcińczuk et al., 2013) (within the SyNaT project⁴). Later, in 2013, Inforex was used to construct another major resource, the Polish Corpus of Suicide Notes (PCSN)⁵ (Marcińczuk et al., 2011), under the guidance of Monika Zaśko-Zielińska (2013). The system is still used to access this corpus. Access is granted on request, after obtaining permission from Wrocław University.

In 2013 Poland joined CLARIN, the European Research Infrastructure for Language Resources and Technology. The goal of CLARIN is to make language technologies more accessible to researchers from the humanities and social sciences, who in most cases do not have the technical skills to use many of the tools on their own. At that time we decided to make Inforex a part of the Polish CLARIN infrastructure. In 2015–2017 we organized several workshops for researchers in the humanities and social sciences. The workshops revealed several user experience issues. First, the system GUI turned out not to be intuitive enough for inexperienced users and needed to be simplified. The second problem was connected with methodology. Researchers use various tools for corpus analysis (including spreadsheets), and Inforex may be treated as a kind of pre-processing tool that prepares a corpus for further analysis. Data export was possible, but it was complicated and required access to a database. User feedback showed that an easy form of data export is one of the crucial needs. After the workshops we gathered more information (also through questionnaires) about other important needs, such as the definition of custom annotation schemas and data visualisation. Some of these have already been implemented and the others are under construction.

¹ http://clarin-pl.eu
² http://clarin-pl.eu/dspace
³ http://nekst.ipipan.waw.pl/
⁴ http://www.synat.pl/
⁵ http://pcsn.uni.wroc.pl/

2 Inforex Features Overview

In the following sections we present the main functionalities and features of the Inforex system.

2.1 Web-based Access

Inforex is a web-based tool which does not require installation. It can be accessed with any web browser that supports JavaScript. Although Inforex is built on several universal JavaScript libraries and frameworks (jQuery, jQuery extensions and Bootstrap), we suggest using Chrome or Firefox. These two web browsers are used to test the system on a daily basis. Other browsers can be used as well; however, we are not able to validate all functions in every available web browser, so some minor issues might occur.

2.2 Authorized and Public Access

Corpora stored in Inforex can be accessed by authorized and unauthorized users. The manager of a corpus (the owner or a user with specific privileges) decides what type of information from the corpus is publicly available. For instance, only authorized users may have access to document content and modify the corpus annotations, while unauthorized users may have access to some statistics or annotation frequency lists.

2.3 Integration with DSpace as a Part of Polish CLARIN Infrastructure

The Inforex system is available at http://inforex.clarin-pl.eu and is part of the Polish CLARIN infrastructure. This installation is integrated with the official repository for language resources in Polish CLARIN⁶. The repository runs on the DSpace system⁷. When a user registers at https://clarin-pl.eu/dspace/, they also gain access to Inforex. At this stage the accounts are synchronized automatically. In the future both systems will use unified federated authorization.

2.4 Collaboration

Inforex offers several ways of working collaboratively on a single corpus. The first is access to the same corpora for different authorized users. The second is selective, task-oriented access to the same document. For instance, different groups of users can have access to a document's metadata. The third is the "2+1" annotation, in which two or more users annotate the same set of documents independently and a super-annotator creates the final set of annotations based on their input. This type of collaboration is described in more detail in Section 3.2.

2.5 Qualitative Document Annotation

Inforex was designed for qualitative document annotation. This means that it does not offer fast and robust search functions over large corpora with millions of documents. Such functionality is provided by other existing tools designed for that purpose, for instance Sketch Engine (Kilgarriff et al., 2014) or NoSketch Engine (Rychlý, 2007). Inforex is suited to medium-sized corpora (containing thousands of small documents) and to manually describing documents in terms of their metadata, annotations (types of phrases organized in a hierarchy), annotation attributes, relations between annotations and annotation frames.

2.6 Language-independent

Inforex is language-independent in the sense that it can handle documents in any natural language. So far it has been used to annotate Polish, English and Hebrew texts (see Section 3.2).

⁶ https://clarin-pl.eu/dspace/
⁷ https://github.com/ufal/clarin-dspace
Figure 1: Corpus overview

2.7 Document Visualisation

Inforex can handle documents in two formats: plain text and XML. XML documents can be displayed in a visually formatted way. This makes it possible to highlight the document structure, which improves the user experience while browsing and annotating documents. Sample visualizations of different types of documents are presented in Figure 3.

2.8 Document Description

Inforex supports four types of information units which can be used to describe document content:

1. Metadata: an information unit which is assigned to a whole document (author name, document creation time, source, etc.).

2. Annotation: an information unit which is assigned to a sequence of words in the document content. Each annotation is described with a category (categories can be organized in a hierarchy) and a set of attributes. The set of attributes depends on the semantic interpretation of the annotation category. For instance, for named entities it can be a lemma, for temporal expressions a normalized value of the expression, and for event mentions an event modality.

3. Relation: an information unit which is assigned to a pair of annotations. It is a directed link of some category between two annotations.

4. Frame: an information unit which is assigned to a set of annotations. A frame consists of a set of annotations with roles assigned to them. This type of structure can be used for event annotations (LCD, 2005).

3 Recent Improvements

In the following sections we present the recent major improvements to the Inforex system.

3.1 Modern Layout

A set of workshops carried out from 2015 to 2017 showed that the user interface needed to be adjusted to a new group of users: researchers in the humanities and social sciences who are not involved in the development of NLP tools. New users reported being confused by the large amount of information and the number of available functions. The interface needed to be simplified while the functionality of the system remained unchanged. The Inforex layout has therefore been upgraded and modernized. This involved not only a facelift of the user interface but also changes to the navigation panels. A comparison of the old and new layouts is presented in Figure 4.
Figure 2: Document annotation view
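
The four information units listed in Section 2.8 (metadata, annotation, relation, frame) can be pictured as a small data model. The sketch below is only an illustration in Python: the class names, field names and the token-index convention are assumptions made for this example and are not taken from the Inforex code base or its database schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A categorized span over a sequence of words, with attributes."""
    start: int               # index of the first token (hypothetical convention)
    end: int                 # index of the last token, inclusive
    category: str            # e.g. "named_entity", "temporal_expression"
    attributes: tuple = ()   # (name, value) pairs, e.g. (("lemma", "Wrocław"),)


@dataclass(frozen=True)
class Relation:
    """A directed, categorized link between two annotations."""
    source: Annotation
    target: Annotation
    category: str


@dataclass(frozen=True)
class Frame:
    """A set of annotations, each filling a named role (e.g. for events)."""
    category: str
    roles: tuple             # (role name, Annotation) pairs


@dataclass
class Document:
    """Document content plus the four kinds of descriptive units."""
    tokens: list
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)
    relations: list = field(default_factory=list)
    frames: list = field(default_factory=list)

    def text_of(self, ann: Annotation) -> str:
        """Return the words covered by an annotation."""
        return " ".join(self.tokens[ann.start:ann.end + 1])


# Illustrative data only; the categories and attribute names are made up.
doc = Document(tokens="Inforex was released in Wrocław in 2010".split(),
               metadata={"source": "example"})
place = Annotation(4, 4, "named_entity", (("lemma", "Wrocław"),))
date = Annotation(6, 6, "temporal_expression", (("value", "2010"),))
event = Annotation(2, 2, "event", (("modality", "factual"),))
doc.annotations += [place, date, event]
doc.relations.append(Relation(event, place, "located_in"))
doc.frames.append(Frame("release", (("trigger", event), ("time", date))))
```

In this sketch the attribute set is left open (a tuple of name and value pairs) because, as described above, it depends on the annotation category: a lemma for named entities, a normalized value for temporal expressions, a modality for events.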

3.2 Annotation Agreement

Reliability is a key value in the creation of good-quality corpora for training and testing NLP tools. The current version of Inforex enables simultaneous and independent annotation of the same text sample by more than one annotator. Moreover, the coordinator of the annotation process may keep track of inter-annotator agreement between two raters thanks to the Agreement module, which uses the Positive Specific Agreement (PSA) measure (Hripcsak and Rothschild, 2005) to calculate the reliability (see Figure 5a). The view configuration makes it possible to define the annotation layers, subsets or categories, the users and the set of documents that are to be analysed. The coordinator may also specify a comparison mode: whether the system takes into consideration the annotation boundaries only, or both boundaries and categories. The comparison may also include annotation lemmas. Inter-annotator agreement is a very important indicator of the clarity and cohesion of the annotation guidelines. Keeping track of changes in inter-annotator agreement between subsequent annotation iterations helps to improve the quality of the annotation guidelines. The Agreement module makes that process easier and faster.

Inforex also supports the curation of the annotation process (see Figure 5b). The curator can choose between the differing decisions of two annotators, or even reject annotations that are consistent but incorrect. Thanks to that module several gold standard projects have been carried out, e.g. the Polish Coreference Corpus (Ogrodniczuk et al., 2015) for the annotation of definite descriptions and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions.

4 Applications

In the following sections we present several practical applications of the Inforex system.

4.1 KPWr

KPWr (Polish Corpus of Wrocław University of Technology) (Broda et al., 2012) is a corpus of written and spoken documents available under a Creative Commons license. It is intended primarily as training and testing material for NLP tools being developed at Wrocław University of Science and Technology. It is successively enriched with annotation layers. Inforex has recently supported manual text annotation within such layers as temporal expressions and their normalizations, events (and descriptions of event attributes), spatial expressions and semantic roles. In order to prepare the annotation of temporal expressions (Kocoń et al., 2015), a new annotation scheme based on TimeML was added. Its categories refer to a date, a time of day, a duration and the frequency of an event. The annotation lemmas perspective was used to provide normalized temporal expressions, which shows that the term 'lemma' in Inforex may function as a broad concept. The Annotator perspective also supports event annotation (Marcińczuk et al., 2015). There are seven coarse-grained categories of events: action, state, reporting, perception, aspectual, intensional action and intensional state. The categorization was based on the TimeML guidelines with some modifications, and it also involved the creation of a new annotation scheme. The flexibility in adding new annotation layers (defining new annotation categories) is one of the most important features of the system. The possibility of establishing relations between annotated fragments is no less relevant. It was crucial, for example, for the annotation of spatial expressions, whose main goal was to extract different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects (Marcińczuk et al., 2016).

Figure 3: Sample document visualizations: (a) Facebook conversation, (b) Wikipedia article, (c) Hebrew document.

4.2 European Legal Texts

As practice shows, although Inforex was primarily developed for the Polish language, it can also be used to work with documents written in other languages. Inforex features and functionalities are useful, for example, in examining current official EU literature related to territorial development and urban planning. The authors of this analysis first uploaded EU Territorial Policy Documents 2007-2016⁸ to the CLARIN-PL DSpace repository and then imported the corpus into Inforex. The corpus was divided into 4 subcorpora and prepared for qualitative and quantitative analysis. A review of the key strands enabled the identification of 8 core values (or principles) for further statistical and contextual analysis. After each category had been assigned its textual triggers (word forms), a quantitative analysis was performed using word frequency lists generated by Inforex. Manual annotation with a newly defined set of annotations and the Annotation Browser with its data export option were a great support for the qualitative analysis: a detailed contextual analysis of the corpus focused on two crucial categories, Participation and Communication.

⁸ http://hdl.handle.net/11321/316

4.3 Hebrew Corpus

Inforex supports manual annotation even if the text is written in a non-Latin alphabet and in a right-to-left script. One application of the system was related to a corpus of Hebrew gravestone inscriptions. It also involved the creation of a new annotation schema. The categories referred mainly to the pragmatic level of communication (e.g. initial and final expressions, laudations, death circumstances). The annotation lemmas perspective was used to enter Polish translations of annotated fragments, which again showed that the lemma attribute may be a broad term, especially in practical applications of the system.

Figure 4: Comparison of Inforex layouts: (a) before modernization, (b) after modernization.

4.4 Other Corpora

Inforex was used to prepare the training data for participation in the BSNLP 2017 shared task on multilingual named entity recognition, which was aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching (Marcińczuk et al., 2017). The system also supported the annotation of corpora constructed for specific natural language processing tasks, e.g. the Polish Coreference Corpus for the annotation of definite descriptions and the Polish Spatial Texts corpus for the annotation of dynamic spatial expressions. This involved the creation of dedicated annotation layers. Importantly, in these tasks the new module of the system (Annotation Agreement and "2+1" annotation) was used for the first time, which significantly reduced the time needed to prepare annotated training and testing corpora.

5 Summary

Inforex, as a part of the CLARIN-PL infrastructure, is being developed gradually. Although its initial role was to construct qualitative linguistic resources for various natural language processing tasks, it has recently also been used by scientists for other purposes. We received important and constructive feedback from users during and after workshops related to CLARIN-PL tools and resources. As users have different needs, we identified the common functionalities and are implementing them as soon as possible in order to support their research tasks and provide new possibilities. We also faced the fact that many researchers from the field of digital humanities are not experienced users of such systems, so we made Inforex as easy and intuitive as possible.

Acknowledgments

Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.

References

Dominik Bas, Bartosz Broda, and Maciej Piasecki. 2008. Towards Word Sense Disambiguation of Polish. In Proceedings of the International Multiconference on Computer Science and Information Technology, IMCSIT 2008, Wisla, Poland, 20-22 October 2008. IEEE, pages 73–78. https://doi.org/10.1109/IMCSIT.2008.4747220.

Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation 47(4):1007–1029.

Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of LREC'12. ELRA, Istanbul, Turkey.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). http://tinyurl.com/gatebook.

Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) at COLING 2016. pages 76–84.

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. J. of Am. Medical Informatics Association 12(3):296–298.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography.

Jan Kocoń, Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, and Michał Wolski. 2015. Temporal expressions in Polish corpus KPWr. Cognitive Studies | Études cognitives (15):293–317.

LCD. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Technical report, Linguistic Data Consortium.

M. Marcińczuk and M. Ptak. 2012. Preliminary study on automatic induction of rules for recognition of semantic relations between proper names in Polish texts, volume 7499 LNAI.

Michał Marcińczuk, Jan Kocoń, and Marcin Oleksy. 2017. Liner2 – a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, pages 86–91. http://www.aclweb.org/anthology/W17-1413.

Michał Marcińczuk, Marcin Oleksy, Tomasz Bernaś, Jan Kocoń, and Michał Wolski. 2015. Towards an event annotated corpus of Polish. Cognitive Studies | Études cognitives (15):253–267.

Michał Marcińczuk, Michał Stanek, Maciej Piasecki, and Adam Musiał. 2011. Rich Set of Features for Proper Name Recognition in Polish Texts. In SIIS 2011. Springer.

Michał Mirosław Marcińczuk, Marcin Oleksy, and Jan Wieczorek. 2016. Towards recognition of spatial relations between entities for Polish. Cognitive Studies | Études cognitives (16):119–132.

Małgorzata Marciniak, Agnieszka Mykowiecka, and Katarzyna Głowińska. 2010. Anotowany korpus dialogów telefonicznych. In Małgorzata Marciniak, editor, Anotowany korpus dialogów telefonicznych, Akademicka Oficyna Wydawnicza EXIT, Warsaw, chapter Anotacja korpusu LUNA–WOZ.PL, pages 217–230.

Michał Marcińczuk, Jan Kocoń, and Maciej Janicki. 2013. Liner2 – a customizable framework for proper names recognition for Polish. In Robert Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform, pages 231–253.

Michał Marcińczuk, Monika Zaśko-Zielińska, and Maciej Piasecki. 2011. Structure annotation in the Polish corpus of suicide notes. In Ivan Habernal and Václav Matoušek, editors, Text, Speech and Dialogue, Springer Berlin Heidelberg, volume 6836 of Lecture Notes in Computer Science, pages 419–426.

Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference in Polish: Annotation, Resolution and Evaluation. Walter De Gruyter. http://www.degruyter.com/view/product/428667.

Adam Radziszewski and Maciej Piasecki. ???? A Preliminary Noun Phrase Chunker for Polish. Information Systems pages 169–180.

Pavel Rychlý. 2007. Manatee/Bonito - a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masarykova univerzita, Brno, pages 65–70.

M. Zaśko-Zielińska. 2013. Listy pożegnalne: w poszukiwaniu lingwistycznych wyznaczników autentyczności tekstu. Quaestio. https://books.google.pl/books?id=QG60ngEACAAJ.
Figure 5: Annotation agreement: (a) summary of annotation agreement for a set of documents, (b) user agreement verification for a single document.
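
As a worked illustration of the Positive Specific Agreement measure referred to in Section 3.2: if a annotations were made by both annotators, and b and c annotations were made by only the first and only the second annotator respectively, then PSA = 2a / (2a + b + c) (Hripcsak and Rothschild, 2005). The Python sketch below computes this value under the two comparison modes mentioned in the text (boundaries only, or boundaries and categories). The `(start, end, category)` tuple layout, the function name and the value returned for two empty inputs are assumptions made for this example; this is not the Inforex implementation.

```python
def positive_specific_agreement(ann_a, ann_b, with_categories=True):
    """Return PSA = 2a / (2a + b + c) for two collections of annotations.

    Each annotation is a (start, end, category) tuple (hypothetical layout).
    With with_categories=False only the boundaries (start, end) are compared.
    """
    def key(ann):
        return ann if with_categories else ann[:2]

    set_a = {key(ann) for ann in ann_a}
    set_b = {key(ann) for ann in ann_b}
    both = len(set_a & set_b)      # a: annotations made by both annotators
    only_a = len(set_a - set_b)    # b: made only by the first annotator
    only_b = len(set_b - set_a)    # c: made only by the second annotator
    total = 2 * both + only_a + only_b
    # Assumption: two empty annotation sets count as full agreement.
    return 2 * both / total if total else 1.0


# Made-up example data for two annotators working on the same document.
annotator_1 = [(0, 1, "person"), (5, 5, "city"), (9, 10, "date")]
annotator_2 = [(0, 1, "person"), (5, 5, "country"), (12, 12, "date")]

print(round(positive_specific_agreement(annotator_1, annotator_2), 2))         # 0.33
print(round(positive_specific_agreement(annotator_1, annotator_2, False), 2))  # 0.67
```

Agreement on boundaries alone is higher in this example because the two annotators mark the same span (5, 5) but assign it different categories.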
