
Part VI Cross-Cutting Issues

Cooperative GMS ICT Framework


Case study: Construction of Lao Lexical Resources
Vincent Berment¹, Houmphanh Thongvilu²

¹ GETA, CLIPS, Joseph Fourier University
385 avenue de la Bibliothèque, Saint-Martin-d'Hères, France, 38400
vincent.berment@imag.fr

² LaoNet - Lao Human Resources Network
Melbourne, Australia
pan@Lao.Net

Abstract

The case study of Lao resources development offers a window into a process by which a GMS organization can play an active role in regional ICT development, enhancing cooperation through common platforms and tools, for the benefit of its member countries and to reduce the digital and information divide. Most of the languages spoken in the GMS region (e.g. Khmer, Lanna, Lao, Lü, Tai Dam) are still weakly computerized or neglected. Among the different steps of computerization, one of the heaviest tasks is probably the development of lexical resources. In this paper, we present Montaigne-Lao, an approach for building miscellaneous Lao resources based on a cooperative effort on the web. We then discuss how it can contribute to a collaborative network for the GMS languages.

Keywords: Natural Language Processing, Generalized Linguistic Contribution, Minority language, Internet, Word Processing, Virtual Keyboard, Dictionary, Machine Translation, Unsegmented Writing

Generalized linguistic contribution

During the last ten years, significant efforts have been made to produce efficient linguistic tools for previously unsupported languages. However, this is still limited to several tens of languages (less than 1% of all languages; Berment 2002b). In particular, most of the languages in the Greater Mekong Subregion (GMS) are still poorly computerized, even if the most affluent GMS countries such as China, Thailand and Vietnam have greatly benefited from the computerization of their languages. The developing nations in the GMS have limited resources and more immediate priorities, so they cannot afford significant research and development in computational linguistics. Profit-driven research and development also often leaves these needs unanswered.

Our point of view is that computational linguistic resources can be efficiently obtained by collaborative work on the web (Boitet 1999), replacing a local development team with a free and potentially much larger distributed team. This idea of a "Generalized Linguistic Contribution" (GLC) on the web, already present in an early Montaigne project (1996), has been applied by Oki to the Japanese language¹ (Shimohata 2001), by NII and NECTEC to the Saikam Japanese-Thai dictionary project² and by miscellaneous contributors to the Papillon multilingual dictionary project³. Before that, in 1995, the LaoNet cyber-team had already achieved the development of the so-called Lao Romanization for Transliteration (LRT)⁴.

In this paper, we first present Montaigne-Lao, a cooperative project for developing Lao language resources, whose initial framework is formed by a discussion forum, a dictionary with a translation support environment and an open source virtual keyboard, as well as two ongoing projects, text to speech and Lao New Coding. In a second part, we discuss the perspective of a Montaigne-GMS project that would be based on a principle of exchange.

The efficiency of the GLC concept presented here is shown by the fact that the authors of this paper had never met before the Regional Conference on Digital GMS in 2003, though they had worked together for more than one year and achieved significant progress thanks to their exchanges.

¹: http://www.yakushite.net/
²: http://saikam.nii.ac.jp/
³: http://www.papillon-dictionary.org/ The Papillon project already includes six languages: English, French, Japanese, Malay, Thai and Vietnamese.
⁴: http://www.lao.net/html/LRT.html

342 Berment, V. and Thongvilu, H.



A case study: the Montaigne-Lao project⁵

1. General description

Basically, the idea of the Montaigne project is to offer a free cooperative work facility on the web for developing linguistic resources and machine translation tools. The Montaigne-Lao project is the first implemented step of this concept and has Lao as its target language.

One important question is how to go about developing a collaborative effort targeting specific needs or, in other terms, how to harvest the goodwill and channel the energies towards producing workable solutions. The cyber-forum described in the next section is the way we propose for fostering such participation and for guiding it. It is the basis of an ICT cyber-cooperation.

More generally, we describe here three already existing components of the Montaigne-Lao project:
- Forum,
- Dictionary,
- Virtual keyboard,
and two projects:
- Text to speech,
- Lao New Coding.

We would like to emphasize that, although these existing or prospective developments were initially driven separately⁶, our discussions on the web allowed synergies and technical emulation to be achieved.

⁵: http://sabaidi.imag.fr/
⁶: Mostly in France: dictionary and LaoWord/virtual keyboard; and in Australia: forum, Lao New Coding and text to speech.

2. Forum & mailing list: a method to emulate the cooperation

It is often said that the Internet has modernized the workgroup relationship across cyberspace. However, a forum does not necessarily become a project team. What are the ingredients that would attract like minds to consolidate their efforts, instead of projects being worked on in isolation, duplicating efforts and resources?

The keys are: a central contact point, shared goals, commitment, and participants. Obviously participants are very important, as they are the "how" and the fuel that keep the central contact forum going. Even if some people think it is a waste of effort, such a forum can quickly expand and sometimes replace the traditional networking of peer groups.

To create such a forum is to enlist a group of people with the same interest; Internet newsgroups are a good place to start recruiting, as are private mailing lists. If such a contact point does not exist, then individuals or organizations should allocate resources to set it up, to foster dialogue among peers, to encourage people to take ownership of the issues and problems and thereafter to search for and identify ways and means of realizing the goals.

The initial step requires coaching and mentorship by an experienced teamwork overseer to manage the evolution of the group and keep the group focused.

Like any community, cyber-communities suffer from a weak "signal to noise ratio" and from weak effectiveness. It is often the case that groups of people break away into more compatible teams working on some common goals.

The driving force is not always money or fame, but the desire for the common good, so the 'why' would be answered by the 'why not'. Often the effort of participants, mostly in time, can expand their horizons and bring new challenges, as well as unforeseen solutions.

The resourcefulness and contacts of the mentors, who can see to the group's needs, can steer the group up to some level of self-management. "If there is a will there is a way", aided by good resources, would make it easy.

Once the means and tools are available for wider participation, the virtual workgroup can be replicated in other fields. Trust in the process would bring peer groups to work on common problems for mutual benefit.

The LaoNet forum was modeled on such cyber-cooperation. With limited resources, and despite many difficulties and a lack of funding, we have managed to achieve some milestone results: Lao Romanization for Transliteration (LRT), 1995; STEA Lao IT Technology seminar, Vientiane (VTE), August 1995, leading to the introduction of the Internet into Laos; first Lao History Symposium at Berkeley, USA, in March 2003; STEA PAN-Laos ICT affairs seminar, Vientiane, March 2003. The forum has brought a great many people to exchange ideas across cyberspace without ever having needed to meet physically to work together.

3. Dictionary and Translation Support Environment (DTSE)

3.1. Functional architecture

The following three tasks form the initial step of the Montaigne-Lao dictionary project:
a) A group of skilled persons builds lexical entries, partly starting from existing paper dictionaries.
b) A panel of specialists derives a reference on-line dictionary from the gathered lexical entries.
c) Various tools and applications are developed to utilize the resource.

An original point is that, in parallel with this work of specialists, lexical contributions coming from a worldwide public are also compiled. This process relies on a translation support environment provided on the web site. In particular, a Lao-French bilingual editor is offered to visitors together with a word-for-word translation service.


[Figure: Bilingual editor]

[Figure: Word for word translation]

To help them work out their personal translations, users can thus ask for a word-for-word translation of their Lao text. If a word is missing or is not correctly translated, they can also add new or improved entries to a personal dictionary, which will be used in priority for their later translations. These personal dictionaries, as well as the personal bilingual texts themselves, are stored on the server and can only be seen by their owner. Then, in order to improve the reference on-line dictionary (and therefore the word-for-word translation service), users can propose entries of their personal dictionaries for transfer into the main dictionary. This transfer is done under the control of the linguists' panel.

[Figure: Functional architecture]

The Montaigne-Lao dictionary is planned to be exported to the Papillon dictionary⁷.

⁷: http://www.papillon-dictionary.org/

3.2. Lexicographic issues

From a lexicographic point of view, the project follows the concepts of explanatory and combinatorial lexicology (ECL)⁸, so the lexicographic content of an entry includes the lexical item itself and its part of speech, optional examples and idioms, as well as a semantic formula, the government pattern and lexical functions.

[Figure: Lexical items input page]

However, fields have been added in order to derive other applications from the database (paper dictionaries or machine translation tools). The initial lexicographic effort has been handled by a group of lecturers and students of the Institut National des Langues et des Civilisations Orientales⁹ (Paris, France), and the entries are currently being reviewed by the panel of linguists.

⁸: On this matter, see André Clas, Igor Mel'cuk and Alain Polguère's book, Introduction à la lexicographie explicative et combinatoire, Duculot 1995.
⁹: http://www.inalco.fr

3.3. Technical architecture

The dictionary is stored as a MySQL database table, as are the contributors' profiles. C code is used for segmenting Lao texts into words¹⁰ and for sorting the dictionary¹¹. The segmentation relies on a syllable recognition technology (Berment 1998) and a longest matching algorithm (e.g. Meknavin et al. 1997). Text input is possible with the two currently used Lao keyboard layouts thanks to JavaScript and to TextArea or Input HTML form controls. The LaoUniKey input tool presented in the next section will be used once text input is done directly in Unicode.

¹⁰: Lao is written from left to right with an alphabet deriving from Indian scripts. A major characteristic of Lao writing is that words are not separated by spaces, as in Khmer, Thai or Burmese writing.
¹¹: Another important characteristic of Lao writing is that some vowels are placed before the consonant. This contributes to making the automatic sorting of Lao dictionaries more complex.
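
The longest matching step mentioned in section 3.3 can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the project itself uses C code that is not reproduced in this paper. The function name `segment_longest_match` and the tiny sample lexicon are hypothetical, and the sketch matches character by character: it leaves out the syllable recognition stage that the project combines with longest matching.

```python
def segment_longest_match(text, lexicon):
    """Split unsegmented text into words by greedy longest matching."""
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it until it is a known word.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # No known word starts here: emit one character and move on.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical sample lexicon (not taken from the Montaigne-Lao dictionary).
HELLO = "ສະບາຍດີ"
LANGUAGE = "ພາສາ"
LAO = "ລາວ"
LEXICON = {HELLO, LANGUAGE, LAO, LANGUAGE + LAO}

if __name__ == "__main__":
    # The compound LANGUAGE + LAO is preferred over its two shorter parts.
    print(segment_longest_match(HELLO + LANGUAGE + LAO, LEXICON))
```

Greedy matching of this kind always commits to the longest dictionary word at the current position, so it can choose a wrong boundary when a long entry overlaps the start of the next word. Restricting candidate boundaries to syllable boundaries, as the syllable recognition stage presumably does, is one way to reduce such errors.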


4. Virtual Keyboard (VK)

Derived from the Lao word processor LaoWord, which is an add-in for Microsoft Word, a virtual keyboard was developed to provide a general Unicode input device for Lao: LaoUniKey¹². LaoUniKey is part of the Montaigne project both because it is available on-line and because its source code is available under the GPL license.

Its code has already been used as the basis of an ongoing Montaigne-Lao project addressing an improved Lao keyboard evaluator. Because operating systems lack native technology for handling foreign language scripts, the enhanced VK is designed as a smart pre-emptive keyboard working across applications and platforms. In the first place, it simply interprets the keystrokes and maps them to character codes (8 bits, or 16 bits for Unicode). By buffering keystrokes, keystroke history recall could be implemented. By adding a lookup in a keystroke translation table, macro programming of keystrokes is achieved: a single macro keystroke can be expanded into a stream of keystrokes, paving the way for the multi-element vowel sounds of a phonetic virtual keyboard.

By enhancing the keystroke buffering with some intelligent rules or a dictionary lookup, words could be corrected pre-emptively.

The virtual keyboard is written in C and is limited to Windows platforms at the moment. Later on, by reimplementing it in a platform-independent language such as Java, it will be possible to create a smart keyboard across platforms.

¹²: LaoUniKey's technical principles are described in (Berment 2002a), section 4: "An input method using hooks".

5. Text To Speech (TTS)

A text to speech synthesizer (TTS) is currently being worked on to broaden the appeal and usefulness of the resource and to attract an audience that would sustain and contribute to the growth of the online dictionary. The service is planned as an extension of the dictionary server, to which a client application could send a word-to-sound lookup request. Such an approach would remove the need for a voice synthesizer engine at the client end. It would also foster simple voice-capable front-end applications, e.g. a language teaching web page.

The voice synthesizer is based on the freely available Festival system (http://www.festvox.org/). Although a full native Lao word-to-sound library is yet to be constructed, an intermediate solution has been adopted to demonstrate the concept. With Thai TTS already being more advanced, we hope that a future GMS forum could see Lao TTS benefiting from NECTEC's TTS work.

In the current implementation, the Lao text is first transcribed into roman characters using the earlier romanization for transliteration work (LRT). The transcription engine was coded closely to Lao grammar so as to guarantee a reversible transcription.

Once transcribed, the romanized Lao text is fed into the Festival voice synthesizer engine. Minor modifications of the sound rules of the Standard English sound library are sufficient for the engine to produce a workable Lao text-to-voice system.

Future Lao TTS work could share the same basis in terms of word construction, word break, sentence formation and utterance rules.

6. Lao New Coding (LNC)

6.1 Lao writing basics

Lao writing is derived from Indian Sanskrit script, and many words are borrowed to build Lao vocabulary. A word can be made of a single syllable or of multiple syllables. The syllable is formed by: [consonant/consonant cluster][vowel/derived vowel][consonant modifier (optional)][tone (optional)]; this structure is referred to as a word-cluster hereafter.

Lao vowels are composed of multiple sub-elements; each element occupies a predefined position within the word-cluster, and some elements precede and displace consonant elements when invoked. Such symbol compounding is not supported for Lao within any operating system, and the lack of standards has bred many incoherent implementations that have confused the computerization of the Lao language.

It is interesting to note that the term "Lao grammar" is classically defined as the rules for Lao speech and writing and for word construction, with little on the inter-linkage of words. Multi-word words are built by joining a number of words. Lao grammar does not have extensive word identifiers to explicitly define word relationships in the multi-word construction.

Traditionally, a Lao text is not punctuated, other than being organized in paragraphs of long continuous streams of characters.

So when 'Lao grammar' is referred to in this text, we are referring to the word-cluster construction rules.

6.2 Current technology

In current technology, Lao words require some five bytes on average. The keystrokes or byte stream of a Lao word must follow a correct sequence order to correctly position the elements within the word-cluster. A possible solution is to assign extra bytes that would preserve the elements' attributes and position information. The recent font standard OpenType¹³, developed by Adobe and Microsoft, combines TrueType and PostScript technologies to provide new typographic features such as the desirable capability of element compounding, and promises a possible solution to Lao script problems. Reliance on keystroke sequence alone could complicate lexical search and matching of words.

¹³: http://www.adobe.com/type/opentype/main.html

6.3 New encoding

LNC offers a new way of correctly storing the symbol elements' positions at bit level within the word-cluster, in two to three bytes, compared with the current technology
of some five bytes on average. By efficiently using fewer bytes for data representation and storage, LNC can also be regarded as a compression algorithm and technique that is tightly related to Lao grammar.

LNC is pure research work and has not yet been proven in any product. It is not designed to replace character encodings such as MS code pages or Unicode, but rather to complement them so that data representation is compact and fast to transmit. A cross-conversion engine or driver program could transparently translate the data. However, raw LNC data would not be readily readable. LNC would make optimized use of memory in applications, e.g. virtual keyboard buffers, with fast word search and recall. LNC can guarantee a unique representation of the entered words regardless of the sequence of keystrokes.

7. Available on-line resources

Before developing dictionaries, text-to-speech or other more sophisticated resources, a written language first needs the basic tools that allow it to be written on computers. For languages such as Lao, which has its own alphabet, those primary tools are fonts and virtual keyboards.

Lao language computerization has been industry driven, implemented with varying degrees of success and acceptance. It is mainly based on the Microsoft Windows platform, with two cross-application products dominating the Lao MS Windows system: LaoWord¹⁴ and LSWIN¹⁵. On the Unix platform, the project of adding foreign language support, i.e. internationalization (I18N)¹⁶ or localization for the Lao language at the level of commands and messages, is approaching fruition. Thereafter, applications that are compiled with I18N awareness will have full support of the language.

The MS Windows LaoWord and LSWIN solutions are mainly based on solving the keyboard input of the Lao language with their respective keyboard translation applications: LaoKey and Keyman. Both input applications have been updated to allow Unicode entry (LaoUniKey is the version for LaoWord).

Lao TrueType fonts have been developed since 1992, mainly thanks to Fontographer. Most of them are free and can be downloaded from the Internet. Credit goes to the many font designers (such as John Durdin, author of LSWIN) who have made serious attempts at making the Lao language work across applications, and for their artful skill in rendering practical fonts despite the small Lao market. Useful Lao language computerization resources can be found at http://park.lao.net/references/.

¹⁴: http://www.LaoSoftware.com
¹⁵: http://www.tavultesoft.com/
¹⁶: http://www.kde.org/international/

A scheme for computerizing GMS unsegmented languages

1. Introduction

In the Greater Mekong Subregion, there are many writing systems deriving from Indian scripts and sharing a common feature: they are unsegmented, in the sense that word boundaries are not marked.

Having the same origin makes many cross-adaptations of already developed solutions possible. In font development, for instance, various existing fonts could be cloned for another character set.

The cyber-synergy described above for the Lao language computerization (forum, lexical resources, open source, downloads) may already exist for other languages and could be generalized. This section discusses several possible topics where synergies could be found. The implementation phase could be supported by specific project grants and sponsorship. We propose to generalize an existing Lao word processing facility to other unsegmented languages of the GMS. We further highlight issues and problems relating to implementing an effective online lexical resource.

2. GMSWord platform proposal

In this section, we propose to reuse the LaoWord¹⁷ software and to build a GMSWord from its features. LaoWord is an add-in for Microsoft Word that performs word processing functions specific to Lao¹⁸:

1. Virtual keyboards:
   Miscellaneous virtual keyboards are available, both adapted to national physical keyboards (e.g. French, US, Danish) and with different layouts (traditional typewriter or intuitive layouts).
2. Non-standardized font encoding management:
   Several encodings have been used by font designers, making text input and font changes quite complex. LaoWord contains the necessary software for providing the user with a virtual encoding standard: users do not have to care which font they use, either during text input or when changing fonts.
3. Text selection by entire syllables:
   The absence of spaces between Lao words makes Microsoft Word selections rather chaotic with the Lao unsegmented writing. The LaoWord syllabic selection mode makes the selection of Lao text more natural (syllable by syllable), for example before copying text to the clipboard.
4. A small Lao-French / Lao-English lexicon:
   The French or English translation of a Lao word can be obtained by selecting the word and clicking on the translation button.
5. Several romanizations:
   The automatic romanization of a Lao sample can be obtained by selecting the text and clicking on the romanization button. Three different romanizations are proposed: a free one (i.e. only making use of the Latin alphabet), and a phonetic and a reversible transcription, both using the International Phonetic Alphabet.
6. Lao table sort:
   Tables with text in Lao cannot be directly sorted with Microsoft Word's own means. LaoWord allows Lao tables to be sorted. It can use any of the different rules found in the dictionaries (selectable).
7. Miscellaneous Lao text finishing functions:
   - Lao page make-up,
   - Height selection (high/low) for accents,
   - Height selection (high/low) for "sala u",
   - Ligature of accented vowels.

While Microsoft Windows still represents the majority of the available platforms, GMSWord would offer a suitable and effective tool for the GMS languages with unsegmented writing. With the same methodology implemented for the different languages, it would be easier to define common communication tools.

Note that a BurmaWord has already been prototyped for testing the Burmese Unicode capability.

¹⁷: See http://www.LaoSoftware.com/ or http://sabaidi.imag.fr/.
¹⁸: See the LaoWord user's manual for a more detailed description. It is included in the LaoWord package (http://www.LaoSoftware.com/).

3. Character encoding standard, data format

3.1 Reducing the brakes on development

Within the past ten years, despite neglect and small commercial viability, a language such as Lao has received enough interest and support to emerge with a workable set of tools. However, due to uncoordinated efforts, many diverging technologies and conventions exacerbate the problems. Several issues have been identified as hampering the development of this Lao lexical on-line resource:
- Lack of a workable Lao standard character coding.
- Lack of Lao script support native to current operating systems, affecting keyboard input.
- Resources: human and financial.

We present hereafter some solutions for Lao that we are currently discussing in the forum or already use, and that could also be applied to other languages¹⁹.

3.2 Lao standard character coding

The code-page scheme used by MS Windows to display foreign character sets, combined with the lack of a workable Lao character standard, has created a situation where an exchange of Lao data does not always convey the relevant information. To address this issue, the Lao lexical online resource has to develop some standard or means of conversion:
- Adopt Unicode,
- Provide utilities for conversion,
- Develop a convention to embed font information, to achieve some data uniformity.

3.3 Lao script support, keyboard input

As previously described in section 4 (Virtual Keyboard), the implementation already exists for Lao and is fully adaptable to all GMS languages.

3.4 Resource: human and financial

Currently most of the work on Lao lexical development is driven by a group of volunteers. Fast-tracking its development would require the support of resourceful hosts who share our vision. The goodwill and the human resources do exist, ready for an initiative to gather the efforts.

3.5 Data format

The data record format has to be carefully designed as a compromise between compactness and flexible information fields. As Lao text is made of a continuous stream of characters, it is hard for an unfamiliar reader to find the word boundaries. This is further complicated by the lack of clear rules for multi-word construction. Recording all Lao words would waste resources by duplicating many word entries. Retrieving words from the dictionary would then be pure string search rather than rule based.

A rule-based database could yield a more concise and smarter dictionary. Data records therefore need to be selective in how words and relevant information are stored and in how multiple words are linked. It is seen to be necessary to introduce word identifiers and to derive word dependencies and relationships. Another aspect of Lao words is dialectal variants, with different intonation and consonants used in words.

Incorporating all these issues, a possible data record format is:
1) Single word, Unicode encoded,
2) International phonetic notation,
3) Romanized form,
4) Variant spellings (intonation and consonant),
5) Word identifier/type (e.g. adj., noun, etc.),
6) Links to multiple-word entries,
7) Description,
8) Comment.

To ensure quality word retrieval, a client-server architecture is a key feature, as it allows a uniform and centralized database to be maintained. A centralized database receiving a significant number of hits from remote requests could suffer from poor performance. To improve availability, it is possible to distribute the traffic load over multiple slave servers that replicate from, and update to, a single master database.

Centralizing the database would also make it much easier to collect statistical data on word access and usage. Such data is relevant for further enhancing the design of more intelligent lexical rules.

The implementation of a client-server architecture would achieve a cross-platform solution, favoring HTML, SSI, PHP, JavaScript and compiled C code used as CGI technology.
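
As an illustration of the eight-field record proposed in section 3.5 (Data format), the following sketch shows one way such a record could be represented and exchanged between client and server. It is only a sketch under stated assumptions: the class name `LexicalRecord`, the field names, the choice of JSON as exchange format and the sample entry are all hypothetical and are not part of the Montaigne-Lao implementation.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class LexicalRecord:
    """One dictionary record; field numbers follow the list in section 3.5."""
    word: str                                           # 1) single word, Unicode encoded
    ipa: str = ""                                       # 2) international phonetic notation
    romanized: str = ""                                 # 3) romanized form
    variants: List[str] = field(default_factory=list)   # 4) variant spellings
    word_type: str = ""                                 # 5) word identifier/type
    links: List[str] = field(default_factory=list)      # 6) links to multiple-word entries
    description: str = ""                               # 7) description
    comment: str = ""                                   # 8) comment


def to_json(record):
    """Serialize a record for transport (JSON is an assumption of this sketch)."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return LexicalRecord(**json.loads(text))


if __name__ == "__main__":
    # Hypothetical sample entry; the phonetic field is deliberately left empty.
    entry = LexicalRecord(word="ລາວ", romanized="lao", word_type="noun",
                          links=["ພາສາລາວ"], description="Lao")
    print(to_json(entry))
```

Storing only links from a single word to the multiple-word entries that contain it, rather than a full copy of each compound, is one possible reading of the requirement that records be "selective" and avoid duplicated word entries.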

¹⁹: See also (Berment 2002b) for a general approach.

Conclusion


Computerization of a language is a source of pride for its speakers, and it is a natural right for all linguistic groups to possess the same minimum set of tools for effective communication in their time. Dialogue breaks down the barriers of prejudice and ignorance. It establishes a working relationship built on trust. Trust and understanding ensure cooperation for bilateral benefits.

Though the Montaigne project started with an application to the Lao language, its principles can be applied to any of the GMS languages.

References

Berment V. 1998. Prolégomènes graphotaxiques du laotien. DEA dissertation, Inalco, Paris, France.

Berment V. 2002a. Several Technical Issues for Building New Lexical Bases. Papillon seminar 2002.

Berment V. 2002b. Several directions for minority languages computerization. COLING 2002.

Boitet C. 1999. A research perspective on how to democratize machine translation and translation aids aiming at high quality final output. MT Summit 1999.

Meknavin S., Charoenpornsawat P. and Kijsirikul B. 1997. Feature-based Thai word segmentation. Natural Language Processing Pacific Rim Symposium 1997.

Shimohata S., Kitamura M., Sukehiro T. and Murata T. 2001. Collaborative Translation Environment on the Web. MT Summit 2001.

Thongvilu H. (LaoNet) 1995. Lao Romanisation for Transliteration. STEA Lao IT technology seminar, Vientiane, August 1995.

Thongvilu H. 2003. Lao New Coding (LNC). PAN-Laos National ICT affairs seminar, Vientiane, 10th March 2003 (to be published).
