
Ciprian-Virgil Gerstenberger

Programmer/Computational Linguist, Diploma Linguist


University of Tromsø, Norway
Department of Language and Linguistics
Giellatekno - Center for Sámi Language Technology
Software for language teaching for minority languages: the case of Sámi
Introduction
The rapid development of mass media, electronic devices and the internet
came along with a radical change in human communication: globalization,
multiplication of channels, and speed of information exchange. In turn, this led
to written communication becoming more and more prominent in daily use.
Moreover, the most radical development in linguistics was the crystallization of
Computational Linguistics (CL), which paved the way for Natural Language
Processing (NLP) as reflected in any type of Language Technology application:
from simple tools such as spellcheckers in a text editor to
sophisticated robots that interact with humans via spoken language. It is a fact
that a vital language cannot ignore the deployment of a modern society's
universal tool for creating, changing, exchanging, and storing data -- the
personal computer -- as well as various information and communication
channels related to it such as internet, mobile phones, PDAs, etc. What does that
mean for minority languages?
In this article, we give an overview of the use of free software for
implementing free specialized software for language learning and language
teaching of Sámi languages at Giellatekno, the Sámi Center for Language
Technology at the University of Tromsø, Norway. Giellatekno is devoted to working
with and for Sámi languages. Using advanced language technology, the group
develops a wide range of resources and tools: it collects and linguistically
annotates text corpora, develops special tools for language analysis both at
word level (morphology) and at sentence level (syntax), compiles
dictionaries and spellcheckers, and builds more complex language
tools such as those for Machine Translation (MT), Computer-Assisted Translation
(CAT), and Computer-Assisted Language Learning (CALL).
Open resources at Giellatekno
Giellatekno is a small team of linguists, computational linguists, and
programmers who work on highly specialized tasks: collecting,
analyzing, transforming, and generating language data. Every piece of software
developed at Giellatekno is free and open source. However, because of the focus
on language processing, most of it is highly specialized and cannot be used
without detailed knowledge of the tools.

Building text corpora


One of the basic tasks is the collection of texts and word lists for Sámi,
that is, building a text corpus that can subsequently be analyzed, annotated
linguistically, and deployed for developing and improving the
language software on each level of language description: morphology, syntax,
and semantics. The raw data can be Sámi text in different electronic formats: PDF,
HTML, TXT, DOC, etc. Each input file is processed by a pipeline of scripts that
convert the raw data into XML, a format that allows easy access for
annotation, search, and extraction of lexical data (cf. Huhmarniemi et al., 2007).
The lexical data is used in developing different kinds of language software:
electronic dictionaries, lexical databases for developing morphological tools,
and proofing tools for Sámi languages.
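The conversion step described above can be illustrated with a minimal sketch. It assumes the text has already been extracted from its original format (PDF, HTML, etc.) into a plain string. The element names (`document`, `header`, `source_format`, `body`, `p`) and both function names are hypothetical and do not reproduce the actual corpus schema or scripts used at Giellatekno.

```python
import xml.etree.ElementTree as ET


def text_to_xml(raw_text: str, lang: str, source_format: str) -> ET.Element:
    """Wrap plain text in a small XML document, one <p> per paragraph.

    The element and attribute names are illustrative assumptions.
    """
    document = ET.Element("document", attrib={"lang": lang})
    header = ET.SubElement(document, "header")
    ET.SubElement(header, "source_format").text = source_format
    body = ET.SubElement(document, "body")
    # Paragraphs are assumed to be separated by blank lines in the raw text.
    for block in raw_text.split("\n\n"):
        paragraph = " ".join(block.split())  # normalise internal whitespace
        if paragraph:
            ET.SubElement(body, "p").text = paragraph
    return document


def extract_word_list(document: ET.Element) -> list[str]:
    """Collect a sorted list of distinct lower-cased tokens from all <p> elements."""
    words = set()
    for p in document.iter("p"):
        for token in (p.text or "").split():
            cleaned = token.strip(".,;:!?()\"'").lower()
            if cleaned:
                words.add(cleaned)
    return sorted(words)


if __name__ == "__main__":
    sample = "First paragraph of a text.\n\nSecond paragraph,\nwrapped over two lines."
    doc = text_to_xml(sample, lang="und", source_format="txt")
    print(ET.tostring(doc, encoding="unicode"))
    print(extract_word_list(doc))
```

Once the text is in a uniform XML structure, later steps such as annotation or word-list extraction only need to know that one structure, regardless of the original file format.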
Morphological tools
The morphological analysis and generation tools developed at Giellatekno
are based on Finite-State Morphology (see Beesley & Karttunen, 2003). For this
purpose, both free tools such as the Xerox tools twolc, lexc, xfst, lookup (see
http://www.stanford.edu/~laurik/.book2software) and open source tools such as
Voikko (see http://voikko.sourceforge.net) are employed. Based on these
linguistic tools for morphological analysis and generation, more complex language
tools are developed: proofing tools, electronic dictionaries, and parsers for
syntactic analysis.
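To make the idea of finite-state morphology concrete, the following is a toy sketch of a transducer that maps between an analysis string (lemma plus tags) and a surface word form, and can be run in both directions. It is not the API of the Xerox tools or of Voikko. The class, the function names, the tag names (`+N`, `+Sg`, `+Pl`), and the English lexicon are all assumptions chosen only for illustration.

```python
from collections import defaultdict

EPS = ""  # epsilon: a symbol that consumes or emits nothing


class Transducer:
    """A tiny acyclic transducer; state 0 is the start state."""

    def __init__(self):
        self.arcs = defaultdict(list)  # state -> [(upper, lower, next_state)]
        self.finals = set()
        self._next_state = 1

    def add_path(self, pairs):
        """Add one path of (upper, lower) symbol pairs, sharing common prefixes."""
        state = 0
        for upper, lower in pairs:
            for u, l, nxt in self.arcs[state]:
                if (u, l) == (upper, lower):
                    state = nxt
                    break
            else:
                new_state = self._next_state
                self._next_state += 1
                self.arcs[state].append((upper, lower, new_state))
                state = new_state
        self.finals.add(state)

    def walk(self, symbols, side):
        """Match `symbols` on one side and return the strings on the other side."""
        results = []
        stack = [(0, 0, [])]  # (state, position in symbols, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.append("".join(out))
            for upper, lower, nxt in self.arcs[state]:
                inp, outp = (upper, lower) if side == "upper" else (lower, upper)
                if inp == EPS:
                    stack.append((nxt, pos, out + [outp]))
                elif pos < len(symbols) and symbols[pos] == inp:
                    stack.append((nxt, pos + 1, out + [outp]))
        return sorted(results)


def build_toy_lexicon():
    """Toy English noun lexicon: the plural is formed by adding 's'."""
    fst = Transducer()
    for stem in ("cat", "dog"):
        letters = [(ch, ch) for ch in stem]
        fst.add_path(letters + [("+N", EPS), ("+Sg", EPS)])
        fst.add_path(letters + [("+N", EPS), ("+Pl", "s")])
    return fst


def generate(fst, lemma, tags):
    """Analysis side to surface side, e.g. cat +N +Pl."""
    return fst.walk(list(lemma) + list(tags), side="upper")


def analyze(fst, wordform):
    """Surface side to analysis side."""
    return fst.walk(list(wordform), side="lower")


if __name__ == "__main__":
    fst = build_toy_lexicon()
    print(generate(fst, "cat", ["+N", "+Pl"]))
    print(analyze(fst, "dogs"))
```

The point of the sketch is that a single network serves both analysis and generation; real systems compile much larger lexicons and phonological rules into such networks.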
Proofing tools
Among the first language tools for end users developed within the Divvun
project were the spellcheckers for North Sámi and Lule Sámi, and now also for South
Sámi (cf. Gaup et al., 2006). To cover the most widely used editing programs, the
proofing tools are compiled for MS Office (Windows and MacOS X),
OpenOffice (all operating systems), and InDesign (MacOS X), and can be freely
downloaded at http://divvun.no.
Electronic dictionaries
Electronic dictionaries are basic aids for teaching and learning
languages. For Sámi languages, a large number of bilingual dictionaries have been
compiled, both as offline resources and as online dictionaries (see
http://dicts.uit.no). The offline versions are available for use with dictionary
clients on different platforms, such as Macdict for MacOS X or StarDict for
Windows and Linux. There are even smartphone dictionary applications using
the StarDict dictionary format, such as WeDict for iPod/iPhone. Owing to
morphological tools for word form generation, the offline dictionaries for North
Sámi and South Sámi provide the user not only with base forms, as in a
traditional dictionary, but also with all word forms of a given lemma, and for
specific word classes also with key forms of the paradigm. In that respect, one
can claim that dictionaries of this kind possess a kind of intelligence (see
Antonsen et al., 2009a).
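A dictionary that also accepts inflected forms can be sketched as an index from every generated word form back to its lemma. In the sketch below the generated forms are hard-coded and stand in for the output of a morphological generator. The entries, the English example words, and all function names are assumptions for illustration only and are not taken from the actual dictionaries.

```python
from collections import defaultdict

# Hypothetical toy entries; English stands in for real dictionary data.
ENTRIES = {
    "walk": {"pos": "V", "gloss": "to move on foot"},
    "goose": {"pos": "N", "gloss": "a large water bird"},
}

# Stand-in for the output of a morphological generator: lemma -> word forms.
GENERATED_FORMS = {
    "walk": ["walk", "walks", "walked", "walking"],
    "goose": ["goose", "geese"],
}


def build_form_index(generated_forms):
    """Map each lower-cased word form to the set of lemmas it belongs to."""
    index = defaultdict(set)
    for lemma, forms in generated_forms.items():
        for form in forms:
            index[form.lower()].add(lemma)
    return index


def look_up(word, index, entries):
    """Return (lemma, entry) pairs for a base form or any inflected form."""
    lemmas = sorted(index.get(word.lower(), ()))
    return [(lemma, entries[lemma]) for lemma in lemmas]


if __name__ == "__main__":
    index = build_form_index(GENERATED_FORMS)
    print(look_up("geese", index, ENTRIES))
    print(look_up("walked", index, ENTRIES))
```

With such an index, a learner who meets an inflected form in a text can still find the entry, which a dictionary listing only base forms cannot offer.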
Syntax tools
Building on the preceding morphological analysis, the syntactic analysis of
natural language sentences for a language with morphology as complex as that of
Sámi is carried out as rule-based dependency parsing. For this task, the VISLCG
parser for Constraint Grammar, developed by the VISL group at the University
of Southern Denmark, is employed (see VISL, 2008).
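The core idea of Constraint Grammar, discarding readings that do not fit their context, can be sketched in a few lines. The example below shows only the disambiguation step, not dependency parsing, and it is not the VISLCG implementation or its rule syntax. The `Cohort` class, the rule builder, the tag names, and the English example are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One word form together with its remaining candidate readings."""
    wordform: str
    readings: list  # each reading is a frozenset of tags


def make_remove_rule(target_tag, offset, context_tag):
    """Build a rule that removes readings containing `target_tag` when every
    reading of the cohort at relative position `offset` contains `context_tag`.
    """
    def rule(sentence, i):
        j = i + offset
        if not 0 <= j < len(sentence):
            return False
        if not all(context_tag in r for r in sentence[j].readings):
            return False
        kept = [r for r in sentence[i].readings if target_tag not in r]
        # Never remove the last remaining reading; report no change if
        # nothing would be removed.
        if not kept or len(kept) == len(sentence[i].readings):
            return False
        sentence[i].readings = kept
        return True
    return rule


def disambiguate(sentence, rules):
    """Apply all rules to all positions until no rule changes anything."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for i in range(len(sentence)):
                if rule(sentence, i):
                    changed = True
    return sentence


if __name__ == "__main__":
    sentence = [
        Cohort("the", [frozenset({"the", "Det"})]),
        Cohort("walk", [frozenset({"walk", "N"}), frozenset({"walk", "V"})]),
    ]
    # Remove a verb reading directly after an unambiguous determiner.
    disambiguate(sentence, [make_remove_rule("V", -1, "Det")])
    for cohort in sentence:
        print(cohort.wordform, [sorted(r) for r in cohort.readings])
```

Because each successful rule application removes at least one reading and the last reading is always kept, the loop is guaranteed to terminate.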
Computer-Assisted Language Learning tools
Oahpa! is a web-based Sámi language learning program suite. It was initially
implemented for North Sámi, the language of the largest of the Sámi-speaking
minority groups in Northern Europe. The aim of the project was to build
a language teaching system that goes beyond simple multiple-choice exercises (see
Antonsen et al., 2009b). A lot of effort and a large amount of linguistic and
pedagogical knowledge have been put into the sophisticated learning units.
These language-learning games are becoming more and more popular both for other
Sámi languages (e.g., South Sámi, Kildin Sámi) and for different groups of
users: school children, university students (both native speakers and learners of
Sámi as a second language), and even adult native speakers of Sámi who want to
improve their writing skills. Accessible at http://oahpa.uit.no, the CALL suite
for North Sámi contains six individual learning units with different grades of
difficulty and complexity.
The first learning unit is Numra, a game for practicing numbers both from
numeral to string representation and vice versa. Recently, exercises for time and
date expressions have been added to this game. Leksa, the second unit, is a
vocabulary trainer both from Sámi to Norwegian or Finnish and vice versa.
When using Leksa, different options for configuring the set of words to train are
available: thematic domains (e.g., family, nature, food) or textbooks used
for teaching North Sámi (e.g., Davvin, lgu). MorfaS, the third unit, is a unit for
drilling word forms, i.e., inflected words. The options to choose from are
textbook, word class (e.g., noun, verb, adjective) and, depending on this
choice, features to practice (e.g., stem, case, tense). MorfaC, the fourth unit, is
also a game for practicing word forms. The difference between MorfaS and
MorfaC is that the word forms required by MorfaC are embedded in a sentential
context, so that only the correct form renders the sentence
grammatical. The fifth learning unit, Vasta, is a game in which the user is
supposed to build natural language sentences as reasonable answers to given
questions. Finally, Sahka, the last language-learning unit, is the most elaborate
program. It consists of a set of dialogue games on specific topics. As with other
Oahpa! units, users get context-sensitive corrective feedback if needed. The
word-level games Numra, MorfaS, and MorfaC are based on Finite-State
Morphology for word form generation, while Leksa is implemented by use of a
relational database. The syntactic analysis of the free natural language input in the
sentence-level games Vasta and Sahka is carried out by means of rule-based
dependency parsing with Constraint Grammar. To trigger topic navigation in the
dialogue game Sahka, the grammar rules for parsing individual sentences are
extended for dialogue analysis with special rules that recognize the topic of users'
answers. For a more detailed description of the CALL programs, see Antonsen et
al., 2009b or Antonsen et al., 2009c.
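How a vocabulary trainer such as Leksa might draw on a relational database can be sketched with Python's built-in `sqlite3` module. The table layout, the column names, the domain labels assigned to each word, and the function names are assumptions made for this sketch. The few North Sámi to Norwegian word pairs are included purely as illustrative data, and nothing here reproduces the actual Oahpa! database schema.

```python
import sqlite3


def create_word_db():
    """Create an in-memory database with a hypothetical single-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word ("
        " id INTEGER PRIMARY KEY,"
        " lemma TEXT NOT NULL,"
        " translation TEXT NOT NULL,"
        " domain TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO word (lemma, translation, domain) VALUES (?, ?, ?)",
        [
            ("eadni", "mor", "family"),
            ("áhčči", "far", "family"),
            ("beana", "hund", "nature"),
            ("guolli", "fisk", "nature"),
        ],
    )
    conn.commit()
    return conn


def pick_words(conn, domain, limit=5):
    """Select up to `limit` random (lemma, translation) pairs from one domain."""
    cursor = conn.execute(
        "SELECT lemma, translation FROM word"
        " WHERE domain = ? ORDER BY RANDOM() LIMIT ?",
        (domain, limit),
    )
    return cursor.fetchall()


def check_answer(conn, lemma, answer):
    """Accept the answer if it matches any stored translation of the lemma."""
    rows = conn.execute(
        "SELECT translation FROM word WHERE lemma = ?", (lemma,)
    ).fetchall()
    accepted = {translation.lower() for (translation,) in rows}
    return answer.strip().lower() in accepted


if __name__ == "__main__":
    conn = create_word_db()
    for lemma, _ in pick_words(conn, "family"):
        print(lemma)
    print(check_answer(conn, "beana", "hund"))
```

Storing the thematic domain as a column makes the configuration options described above a matter of adding a condition to the query; a textbook column could be handled in the same way.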
A further CALL program suite, developed in collaboration with
Aajege, the Sámi Language and Competence Center in Røros, is now also available
for South Sámi. However, due to the complexity of Vasta and Sahka, these
two learning units are still under development. Furthermore, implementations of
simple demo units for Inari Sámi and Kildin Sámi grew out of collaboration
with language activists and researchers collecting data and documenting those
languages.
Experiences with free software
Using free software in teaching is extremely beneficial both for teachers
and for students. Yet the use of free software does not always proceed without
problems. In the following, we briefly report on some of Giellatekno's
experiences with free software.
Since the software implemented at Giellatekno is specific to language
teaching and learning, issues such as correct input representation, correct
handling in the processing pipeline, and, naturally, correct display of the
characters of a specific alphabet are crucial. Unfortunately, despite the
consistent use of Unicode, low-level problems of this kind with the different Sámi
alphabets still exist in some free software, such as the dictionary client
StarDict.
A further problem with the use of free software is the documentation. It is
true that writing software documentation is not an easy job, and this is even more
the case for specialized software. As long as everything works as expected, one
might not need to consult the documentation. When something fails, however, the
documentation may turn out to be insufficient or not understandable to
everybody. Unfortunately, this might also be the case for some of the software
implemented at Giellatekno.
Giellatekno has also had good experiences when reporting
software problems to groups it works closely with, such as
the VISLCG group, the Apertium group, and the Autshumato group: in these cases the
problems were remedied within a very short time.
To conclude, when working with minority languages such as the
Sámi languages, a major task is not only to document them but also to try to
keep them alive. At Giellatekno, the Sámi Center for Language Technology, free
software for language teaching for Sámi languages is implemented in close
collaboration with universities, language centers, and language revitalization
initiatives. In this respect, Giellatekno takes a special role in language
technology work with endangered languages.
References
1. Antonsen, L., Gerstenberger, C.-V., Moshagen, S. & Trosterud, T.
(2009a). Ei intelligent elektronisk ordbok for samisk. LexicoNordica.
2. Antonsen, L., Huhmarniemi, S. & Trosterud, T. (2009b). Constraint
Grammar in Dialogue Systems. Proceedings of the 17th Nordic
Conference of Computational Linguistics.
3. Antonsen, L., Huhmarniemi, S. & Trosterud, T. (2009c). Interactive
pedagogical programs based on Constraint Grammar. Proceedings of the
17th Nordic Conference of Computational Linguistics.
4. Beesley, K. R. & Karttunen, L. (2003). Finite State Morphology. CSLI
publications in Computational Linguistics.
5. Gaup, B., Moshagen, S., Omma, Th., Palismaa, M., Pieski, T. &
Trosterud, T. (2006). From Xerox to Aspell: A First Prototype of a North
Sámi Speller Based on TWOL Technology. In Finite-State Methods and
Natural Language Processing. Springer-Verlag.
6. Huhmarniemi, S., Moshagen, S. & Trosterud, T. (2007). Usage of XSL
Stylesheets for the annotation of the Sámi language corpora. Proceedings
of the Linguistic Annotation Workshop.
7. VISL-Group University of Southern Denmark. (2008). Constraint
Grammar. http://beta.visl.sdu.dk/constraint_grammar.html.
