Ulrich Kampffmeyer
Technology and Applications PROJECT CONSULT GmbH
Abstract
This paper on multilingual information and retrieval systems with optical mass
storage describes the technical principles of software design. The different layers
and modules from the user interface via transformation modules, thesaurus modules
and fulltext interpretation to database management are explained in detail. Two
examples of multilingual document imaging systems are presented:
- wfBase, a multilingual press and commerce information system based on
four ISDN nodes in Switzerland;
- HEMIS, a multilingual information system for CD-ROM distribution on
environmental institutions, projects and programmes of the
UN Environment Programme, UNEP/HEM.
Contents
1. The Importance of Multilingual Software Systems With Optical Storage
Media for the European Economic Region
2. Software Design
2.1 Structural and Other Requirements for Multilingual Software
2.2 User Interface and Application
2.3 Transformation Modules
2.4 Selection Lists
2.5 Thesauri
2.6 Fulltext Translation
3. Sample Applications
3.1 wfBase
3.2 HEMIS
4. Outlook and Summary
2. Software Design
Like the ability to load modular program segments and functions separately,
multilingualism must be designed in from the start. It is well-nigh impossible to
modify finished software to support multilingual operation. In such cases, it makes
more sense to completely redesign the software using modern tools.
Change of keyboard setting and screen display during runtime via external program
Requirements
The user interface has to support several functions to enable change of
language during runtime
Dynamic data and message interchange with operating system and user interface,
application program and database
The text components are kept in separate files, called "resource libraries" (Fig. 2).
The resource libraries can be loaded on the fly according to the language selection
variable Lx. For language resources to be usable, every text in a program that is
going to be displayed or printed must be referenced by an unambiguous key variable
in the appropriate library.
Resource libraries must exist for:
a) All static texts in dialogue boxes and masks. These are texts which are
associated with a given dialogue box and do not change.
b) Dynamic texts in dialogue boxes and masks. These are texts which change,
appear, or disappear according to status (messages). This includes the graying
out of inoperative or unavailable functions on menus and buttons.
c) Help texts which appear automatically or interactively.
d) Error messages, system messages and other operation-related messages.
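The resource-library principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key names, the two languages, the in-memory dictionaries (standing in for separate resource files) and the class name LanguageResources are all hypothetical and are not taken from any of the systems described here.

```python
# Hypothetical resource libraries: one per language, all sharing the same
# unambiguous keys. In a real system each would be a separate file.
RESOURCE_LIBRARIES = {
    "en": {
        "DLG_SEARCH_TITLE": "Search",           # static dialogue text
        "MSG_NO_HITS": "No documents found",    # dynamic message
        "BTN_BOLD": "B",                        # text on a button
    },
    "de": {
        "DLG_SEARCH_TITLE": "Suche",
        "MSG_NO_HITS": "Keine Dokumente gefunden",
        "BTN_BOLD": "F",
    },
}


class LanguageResources:
    """Resolves text keys against the library selected by a language variable."""

    def __init__(self, libraries, language):
        self._libraries = libraries
        self.set_language(language)

    def set_language(self, language):
        # Switching the language at runtime only changes which library is read.
        if language not in self._libraries:
            raise KeyError(f"no resource library for language {language!r}")
        self._language = language

    def text(self, key):
        # The program refers to texts by key only, never by literal string.
        return self._libraries[self._language][key]


resources = LanguageResources(RESOURCE_LIBRARIES, "en")
english_label = resources.text("BTN_BOLD")
resources.set_language("de")
german_label = resources.text("BTN_BOLD")
```

Because the program itself holds only keys, adding a language in this sketch means adding one more library with the same keys; no program code changes.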
Language Resources
Requirements
Language resources are used for displaying texts related to the unique keys in
the application
Many applications use icons and buttons to simplify option selection. If they bear text
or abbreviations (such as "B" for bold), these must be converted when the language
is changed (thus, in German, "F" for "Fett" = bold). For this reason these icons and
buttons should likewise be kept in dedicated resource libraries instead of being
managed directly in the program. The same applies to icons with graphics where the
graphics do not convey the same meaning in a different language area or
country.
Application
Characteristics
Numeric keys for every text entry related to the screen display and database fields
Direct control of database and user interface
Object oriented message driven program
Transformation Modules
Types
Transformation modules are used for the display transformation
of numeric values of the database
d) Units of measure
This is an important requirement for international trade and manufacturing
companies. For example, in the oil industry large quantities of different types of
oil and petroleum products are transported and handled daily. Measurement
values and with them customs and tax rates constantly fluctuate, depending on
the type of product and its specific weight and even on the ambient temperature.
In cross-border trade the units of measure as well as of currency must be
automatically converted.
The most important categories are units of currency, distance, weight, and
volume.
e) EDI data
Standardised electronic data interchange (EDI), such as EDIFACT, allows entire
business transactions to be handled electronically, without paper originals. The
data is archived digitally. For display and printouts, EDI codes are converted into
text. This conversion can be made language-specific through a language control
variable. With EDI data it is necessary to know what version of a given EDI
application the data will be converted with.
Further transformation modules can be added to cover other requirements for
specific industries and applications, for example converting product codes into text.
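A transformation module for units of measure might look roughly like the following Python sketch. It assumes that the database stores each value once in a fixed base unit and that conversion happens only for display. The profile names "metric" and "us", the unit codes and the function name transform are hypothetical; the conversion factors are the standard definitions of the mile, the pound and the US gallon. Currency is left out because exchange rates change constantly and would have to come from an external source.

```python
# Base unit -> display unit conversion factors (standard definitions).
FACTORS = {
    ("km", "mi"): 1 / 1.609344,        # international mile
    ("kg", "lb"): 1 / 0.45359237,      # avoirdupois pound
    ("l", "gal_us"): 1 / 3.785411784,  # US liquid gallon
}

# Hypothetical display profiles: which unit to show for each stored base unit.
DISPLAY_UNITS = {
    "metric": {"km": "km", "kg": "kg", "l": "l"},
    "us": {"km": "mi", "kg": "lb", "l": "gal_us"},
}


def transform(value, base_unit, profile):
    """Convert a stored value to the unit expected by the display profile."""
    target = DISPLAY_UNITS[profile][base_unit]
    if target == base_unit:
        return value, target
    return value * FACTORS[(base_unit, target)], target


stored_volume_litres = 3785.411784
shown_value, shown_unit = transform(stored_volume_litres, "l", "us")
```

The stored value is never overwritten in this sketch; only its presentation depends on the selected profile, in the same way that text presentation depends on the language variable.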
Selection Lists
Characteristics
Selection lists are an easy way to translate information
and to save storage capacity
Every entry in a selection list refers to a value which is related to a database field
Every entry in the different language versions of a list refers to the same value
The database has to store only the numeric value of the entry
Selection lists offer several advantages over regular text-entry fields in database
applications:
a) No typing mistakes
b) Selection lists keep the database uniform and ensure that entries can be easily
found again. Since the user must decide from among a set of given expressions,
entries are standardized.
c) The database stores only a reference number which refers to a text resource.
This keeps space requirements low, and different text resources can be accessed
depending on the language variable. Retrieval is faster, since the system must
search only through predefined numbers instead of text sequences.
d) Multiple selection lists facilitate the multiple allocation of a document and allow
the user to select a number of related items if he/she is unsure about the
allocation to a single one.
c) above is the most important factor for multilingual applications. The use of
reference numbers allows linkage to multiple lists in different languages. The
reference numbers can also be used to limit access, so that only cleared items
are shown in a search. Selection lists also facilitate data entry through the use of
presettings for recurring entries.
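The reference-number mechanism can be illustrated with a short Python sketch. The field name document_type, the list entries and the sample records are invented for illustration. The point is that each record holds only a number and that every language version of the list maps the same number to its own text.

```python
# Hypothetical selection list for one database field, in three languages.
# Every language version maps the same reference number to its own text.
SELECTION_LISTS = {
    "document_type": {
        "en": {1: "Press article", 2: "Periodical", 3: "Internal document"},
        "de": {1: "Presseartikel", 2: "Zeitschrift", 3: "Internes Dokument"},
        "fr": {1: "Article de presse", 2: "Périodique", 3: "Document interne"},
    }
}

# The database stores only the reference number, never the text.
records = [
    {"doc_id": 101, "document_type": 1},
    {"doc_id": 102, "document_type": 3},
    {"doc_id": 103, "document_type": 1},
]


def choices(field, language):
    """Entries offered to the user, in the selected language."""
    return sorted(SELECTION_LISTS[field][language].items())


def display(record, field, language):
    """Text shown for a stored reference number."""
    return SELECTION_LISTS[field][language][record[field]]


def find(records, field, value):
    """Retrieval compares numbers only, independent of any language."""
    return [r["doc_id"] for r in records if r[field] == value]


german_text = display(records[0], "document_type", "de")
press_articles = find(records, "document_type", 1)
```

A user working in French and a user working in German select different texts but store and search for the same number, so each finds documents entered by the other.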
Selection lists can be created with standard text editors. However, this should be
done only by authorised persons, since changes to, and especially deletions of,
characteristics (entries in a selection list) can compromise the consistency of
the database. Strict update and maintenance rules are a must for distributed
systems and resources.
Selection lists with restricted vocabulary are the ideal medium for standardising
terminology within a company and for creating multilingual software systems.
Multilingual systems should avoid free text entry wherever possible and use
selection lists whenever feasible.
2.5 Thesauri
This term has widely differing meanings. In its original meaning it refers to a defined
specialist terminology, broken down hierarchically from the general down to the
precise. The terms differ clearly from one another and are distributed over several
hierarchical levels. A generic term at one level branches into a number of more
precise terms on the level below it. All terms at a given level should be at a similar
level of detail.
However, in many word-processing programs the "thesaurus" is simply a utility
showing possible synonyms. This familiar kind of thesaurus is completely unrelated
to the structured terminology system described above, as defined, for example, by the
International Organization for Standardization (ISO) for monolingual and multilingual
thesauri.
Thesauri
Thesauri offer a hierarchically structured and crosslinked nomenclature
Similar to a selection list, the thesaurus displays the text that corresponds to
a value stored in the database
Thesauri: Hierarchy and Crosslinks (figure). Two panels: "The Hierarchical View
of the Thesaurus", in which every term carries a unique identifier and a position
in the hierarchical view (for example identifier 3 at position 1200, identifier 6 at
position 1210, identifier 7 at position 1220), and "The Network Structure of the
Thesaurus" (crosslinks independent of the hierarchical position).
same but have different meanings), synonyms, plural forms, explanatory notes, etc.
This information is also accessed during a global search.
The "language slices" need not necessarily contain foreign languages; they can also
contain different aspects of a single language. This is particularly useful for
specialist languages. Thus, one slice can contain the regular colloquial language,
with only two or three levels and accessible to everyone, while another slice can
contain the terminology for a specialist field broken down into more levels and
accessible only to those working in that field. This allows control of the extent, depth,
and accessibility of information.
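The slice idea can be made concrete with a small Python sketch. All terms, identifiers, position codes and slice names below are invented for illustration. The structure (identifier, position, parent, crosslinks) exists once, and each slice only supplies texts for the identifiers it chooses to expose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    ident: int              # unique identifier stored in the database
    position: int           # position code in the hierarchical view
    parent: Optional[int]   # identifier of the broader term, None at the top
    crosslinks: tuple = ()  # related terms, independent of the hierarchy


# One language-independent structure (hypothetical terms).
TERMS = {
    1: Term(1, 1000, None),
    2: Term(2, 1100, 1),
    3: Term(3, 1200, 1, crosslinks=(2,)),
    4: Term(4, 1110, 2),
}

# Slices supply texts per identifier. "en_general" is a shallower slice of the
# same language that leaves out the most specialised level.
SLICES = {
    "en": {1: "Economy", 2: "Trade", 3: "Finance", 4: "Foreign trade"},
    "de": {1: "Wirtschaft", 2: "Handel", 3: "Finanzen", 4: "Aussenhandel"},
    "en_general": {1: "Economy", 2: "Trade", 3: "Finance"},
}


def identifier(label, slice_name):
    """Find the unique identifier behind a displayed term."""
    for ident, text in SLICES[slice_name].items():
        if text == label:
            return ident
    raise KeyError(label)


def narrower(ident, slice_name):
    """Narrower terms visible in a slice, ordered by hierarchical position."""
    children = [t for t in TERMS.values()
                if t.parent == ident and t.ident in SLICES[slice_name]]
    children.sort(key=lambda t: t.position)
    return [SLICES[slice_name][t.ident] for t in children]


def translate(label, source, target):
    """Switch slices by going through the shared identifier."""
    return SLICES[target][identifier(label, source)]


english_term = translate("Handel", "de", "en")
specialist_view = narrower(2, "en")
general_view = narrower(2, "en_general")
```

In this sketch the general slice simply shows nothing below "Trade", while the full English slice shows "Foreign trade"; both address the same identifiers.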
Thesauri
"Slice" Model of a Multilingual Thesaurus
searched and evaluated. The thesaurus then converts the unique identifiers for
display.
g) Thesaurus databases can be run on a PC LAN, thus reducing the workload on
the central database and information resources management (IRS; see below
and Fig. 3).
If the system includes optical-systems management software in addition to the
thesaurus database and the database proper, it has a three-level database hierarchy
(compare Figs. 3 and 26):
a) Database for one or more thesauri (local or central)
b) Database for managing unique identifiers to selection lists and thesauri and for
managing database entries (numerical, alphanumeric, date, time, Boolean
variables, etc.)
c) Information retrieval and access system (IRAS). As a rule this is a non-standard
database for managing WORM (write-once) media, erasable, rewritable and M/O
optical media, or read-only media (CD-ROM).
A standard database (preferably relational) can be used for the thesaurus database
as well as for the database proper. Full-text databases are not suitable for this type
of application (Fig. 13).
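How a standard relational database can carry both the thesaurus and the database proper is sketched below using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for this example. The media_ref column merely stands in for whatever reference the retrieval and access system for the optical media would resolve; that third level is outside the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Level a): thesaurus texts, one row per identifier and language
    CREATE TABLE thesaurus_term (
        term_id INTEGER NOT NULL,
        lang    TEXT    NOT NULL,
        label   TEXT    NOT NULL,
        PRIMARY KEY (term_id, lang)
    );
    -- Level b): database proper, holding identifiers and other field values
    CREATE TABLE document (
        doc_id     INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL,  -- unique identifier from the thesaurus
        entered_on TEXT    NOT NULL,
        media_ref  TEXT    NOT NULL   -- resolved by level c), not modelled here
    );
""")
conn.executemany(
    "INSERT INTO thesaurus_term VALUES (?, ?, ?)",
    [(1, "en", "Trade"), (1, "de", "Handel"),
     (2, "en", "Finance"), (2, "de", "Finanzen")],
)
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?, ?)",
    [(101, 1, "1992-03-01", "jukebox-A/0001"),
     (102, 2, "1992-03-02", "jukebox-A/0002"),
     (103, 1, "1992-03-05", "jukebox-B/0007")],
)


def find_media_refs(conn, label, lang):
    """Resolve a term in the user's language, then search by identifier."""
    rows = conn.execute(
        """
        SELECT d.media_ref
        FROM document AS d
        JOIN thesaurus_term AS t ON t.term_id = d.subject_id
        WHERE t.lang = ? AND t.label = ?
        ORDER BY d.doc_id
        """,
        (lang, label),
    ).fetchall()
    return [row[0] for row in rows]


german_hits = find_media_refs(conn, "Handel", "de")
english_hits = find_media_refs(conn, "Trade", "en")
```

Both queries return the same media references, because the document table contains only the identifier and never the language-specific text.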
Database
Characteristics
Support of optical disk information retrieval system for mass data management
Language Interpreter
Characteristics
The language interpreter contains different modules which allow
translation and interpretation of fulltext databases.
Language Interpreter: Structure (figure). Elements shown: user interface (query
entry and display), comparison module, database.
a) Dictionaries contain the individual words in their different forms (plural, singular,
declined, conjugated, irregular verb forms, etc.). As a rule the dictionary will
constitute a database application of its own. However, it is completely different in
structure, makeup and content from the thesaurus discussed above.
b) Statistics modules analyse the occurrence and composition of words and
combinations of words.
c) Linguistic and grammatical-analysis modules are the most difficult part. They
must contain all the rules and comparative examples required to analyse syntax.
Pattern recognition and fuzzy logic techniques are often used for this purpose.
d) The results of a), b) and c) above are combined, evaluated and interpreted in a
comparison module. The comparison module is designed so that intermediate
results of one module can be returned to another module for evaluation. This
gives rise to an iterative process with a relatively high rate of recognition in texts
on specific subjects for which there are electronic dictionaries containing the
subject terminology.
e) Due to their architecture, traditional databases are not very effective at time-
consuming text analysis. To speed things up, special cache and inverted file
modules are often used as intermediaries.
f) Presentation modules handle the correct on-screen presentation of the translated
text. They work with information from the dictionary module, the evaluated text
from the database, and the inverted file system.
The running text interpretation system we have described can be used to evaluate
queries in regular text. Fig. 15 shows the processing path for a query. The system
goes through the modules from bottom to top in the same way to convert a text out of
the database. The system shown here is just one possible configuration. Since this
technology is very new, many other approaches are being investigated. This
particular approach has the advantage that different modules with differing
evaluation strategies can be consulted simultaneously. Furthermore, each module
can be dedicated to certain languages or vocabularies, and accessed automatically
by the comparison module as needed. The interpretation and translation of a text is
very time-consuming, and usually possible only on very fast dialogue computers.
Complex systems such as the one described should not be confused with simple
translation aids.
Traditional full-text databases are seldom suitable for such systems. Standard
database software uses a strategy of leaving out filler words, adjectives, adverbs,
etc. in order to save memory space and increase database speed. However, a
language interpretation system needs all of the information contained in the text,
since otherwise coherent, context-adequate translation is not possible.
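The difference can be seen in a small Python sketch of an inverted file that keeps every word together with its position. This is an illustrative assumption about how such an intermediary module could be organised, not a description of any particular product; the function names and sample texts are invented.

```python
import re
from collections import defaultdict


def build_inverted_file(texts):
    """Map every word, filler words included, to its (document, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in texts.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


def phrase_hits(index, phrase):
    """Find exact word sequences; this needs the positions of all words."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return []
    hits = []
    for doc_id, start in index.get(words[0], []):
        if all((doc_id, start + offset) in index.get(word, [])
               for offset, word in enumerate(words)):
            hits.append((doc_id, start))
    return hits


texts = {
    1: "The price of the oil is rising",
    2: "Oil prices",
}
index = build_inverted_file(texts)
hits = phrase_hits(index, "of the oil")
```

An index that discarded "of" and "the" could still locate "oil", but it could no longer reconstruct the sentence or match the phrase, which is exactly the information an interpretation system depends on.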
"Language Interpreter" database systems have enjoyed initial successes at the
United Nations and the European Commission.
The choice of a system for multilingual database applications is still simple at this
point:
a) For document-oriented (facsimile) systems, applications with controlled
vocabularies, and systems intended to bring about a standardization of use, the
transformation, selection list and thesaurus approach is the right choice.
b) For full-text applications which will not go into full use within the next three to four
years, the approach described in this section should be attempted or at least
examined.
At present there is no commercial software immediately available for either
application, nor are off-the-shelf solutions likely to become available in the future,
since the nature of the application and the vocabulary will be subject to constant
change.
However, in my opinion an approach as shown in Fig. 3 is ideal. It combines the
different transformation and interpretation components in one level where they work
in parallel. They link the user interface with the database proper. This integrative
approach combines the advantages of all of the techniques named, which can then
be used individually or in combination as needed.
3. Sample Applications
We will now look at multilingual information and retrieval systems from the user's
point of view, using two examples.
Application Examples
3.1 wfBase
wfBase was developed specially for the Swiss Institute for Commercial Development
(German "Wirtschaftsförderung", hence "wf"). It has been in operational use since
1992.
The Swiss Institute for Commercial Development is located in Zürich, with offices in
Geneva, Bern and Lugano. Prior to the introduction of wfBase, dossiers on political
events, economic data, and the like were kept independently at all four locations.
The goal of wfBase is to enable access by all Institute users to all press articles,
periodicals, and Institute documents, independent of the language of data entry
(Figs. 17 and 18).
wf Schweizer Wirtschaftsförderung
wfBase Swiss Institute for Commercial Development
Zürich - Geneva - Bern - Lugano
The wf owns one of the largest archives on commercial and political topics in
Switzerland. It provides information to politicians, journalists and its commercial
members representing all major companies of Switzerland.
wfBase: Storage and Communications Layout (diagram). Elements shown: wf users in
Zürich, Lugano, Geneva and Bern, each with a hard disk cache for images, files and
descriptors (access rights Read / Write / Create and Read / Create); Novell
NetWare in Zürich; DB server (addresses, library, dossiers); archive server and
communications server in Zürich; two jukeboxes, one for internal and one for
external use; ISDN links via SwissNet 2; external users connected by ISDN
telephone and modem.
Fig. 18: wfBase - System configuration with internal and external users and information
management in two jukeboxes (Zürich)
wfBase also integrates other applications besides document management under its
graphical user interface, such as word processing and spreadsheet applications,
address and library management, billing for outside users, electronic faxing and
mailboxes, etc. The wfBase system makes use of some HYPARCHIV modules, but is
otherwise an independent application with client-server architecture and a relational
database on an OS/2 server. The MS Windows workplaces are linked together in a
Novell network. Outside users can access wfBase by modem, query documents
("subsets"), and display and print them locally or have wfBase fax the documents to
them.
The four wfBase locations are linked by SwissNet2 (ISDN). This powerful network
allows compressed scanned facsimile transmission. Two jukeboxes store scanned
facsimiles, locally-generated data, and incoming faxes. The system is highly error-
tolerant and largely fail-safe.
At the heart of wfBase is the database with a quadrilingual (German, French, Italian,
English) thesaurus for subject-area classification. The thesaurus includes over 2000
subject areas, organized hierarchically and in linked structure over four levels.
wfBase
Multilingual Thesaurus
The two images show different views of the thesaurus for thematic
keywords (here in German). The thesaurus supports the user in navigation,
jump-functions, short-key-entries, synonym-retrieval and other techniques for
easy-to-use access.
Screen I and Screen II (from the Online '92 presentation): thesaurus mask
("Thesaurus-Maske") for the subject area ("Sachgebiet").
Fig. 19: wfBase multilingual thesaurus, showing two windows of the thesaurus screen. The left
shows the branching from a broad term to a list of narrower terms. The thesaurus contains
the subject areas covered in the dossiers.
In addition to the thesaurus, there are selection lists for other fields and fields for text
and data entry. The database enables the user to locate documents regardless of
the language in which they were entered. However, the system displays documents
only in their language of origin; in a multilingual country like Switzerland it is not
necessary to translate the contents of documents, as users are expected to be
multilingual as a matter of course. Instead, the objective of wfBase is to improve
communication between office locations, standardize addresses and documentation,
eliminate redundancies, and provide third parties (members of the wf's supporting
organizations) with a simple, time-saving and cost-effective means of access.
3.2 HEMIS
Within the United Nations Environment Programme, or UNEP, there is an
organization called UNEP/HEM (Harmonization of Environmental Measurement)
which is responsible for the harmonisation of environmental monitoring methods,
plans, projects and information. Since 1990 a project has been underway at the
Munich UNEP/HEM office to implement an information and meta-database system
for the UNEP/HEM, called HEMIS (= HEM Information System). HEMIS is intended
to provide an overview of:
a) Current global and national environmental projects by the UN and other
international and world organizations
b) Institutions, research emphases, periodicals, and key personnel
c) Methodology, reference materials, etc.
d) Databases, data formats, data quality, access, etc.
The information contained in HEMIS is meta-data compiled from widely varying
sources (Figs. 17 and 24).
Hyperlinks, guided tours and global search facilities, together with the thesaurus,
enable easy access to the information independent of the language of entry
Fig. 20: HEMIS - Information and meta-database system of the UN environmental organization
UNEP/HEM
HEMIS: High Level Data Model (diagram). Information categories shown: institutions,
programmes, databases, classification systems, methods/models, persons.
Organizations shown: EARTHWATCH, INFOTERRA, UNEP, ESA, UN, EEA-TF, WMO, NGOs,
GEMS, governments, IAEA, others.
Fig. 21: Information harmonisation and distribution by HEMIS. Data on paper, diskette and CD
is read into the stationary HEMIS, selected and formatted, classified semi-automatically or
manually following a defined nomenclature (thesauri), and finally distributed in the form of
printed reports on specific subjects or on CD-ROM. This figure shows only a representative
sample of the participating organizations.
1 At this writing (late 1992) HEMIS is still at the design and prototype stage. Not all components have
been implemented as yet.
IMC Congress, Brussels
The major components of both the stationary and the CD-ROM HEMIS systems are
a number of electronic thesauri, structured as shown in Figs. 11, 12, and 22.
One entry | Up to 64 entries | Up to 255 entries | One entry | One entry | Up to 255 entries | One entry
Fig. 22: Structure of the HEMIS thesaurus for geographical units, climate zones, subject areas, and
other hierarchically structured reference keys. For an explanation of the entries in the first
row see Section 2.5 and Fig. 12.
The thesauri and selection lists are part of both HEMIS systems. In the stationary
system they are used to assign key words to data sets, documents, graphics,
images etc., and for searching and compiling data. If information is supplied on
computer media in pre-agreed formats, some of the key-word creation process can
be done by the system automatically. In the CD-ROM version the thesauri, selection
lists and all other entries are used only for researching and compiling information.
The HEMIS CD-ROM version has a multi-layer modular structure (see Fig. 23).
Fig. 23: HEMIS system layout with multilingual user guidance and search. The user interfaces in the
various languages make up the first layer. The next layer is composed of modules for
different search and navigation strategies, likewise language-specific. In addition to a
database, HEMIS has prearranged "guided tours" and "links". The information and
documents on the CD-ROM are managed by an Information Retrieval and Access System
(IRAS).
In addition to searching for certain key words or terms, HEMIS also offers navigation
assistance in the form of prearranged "guided tours" and individual links. A global
database search takes a certain amount of time, but it does allow the user to use the
system without prior knowledge of what contents lie behind a given field in the
search mask. The user interface can be toggled among different loadable languages,
as can the thesauri, selection lists, links and guided tours. Free text input and
scanned-in documents are not translated. HEMIS is intended to provide the initial
information; the user can then consult the source institutions, databases, or
publications for more in-depth information.
Fig. 24 shows the proposed starting screen of the HEMIS prototype with the button
fields for moving to the main subject-area screens.
HEMIS Environmental Information System (start screen). Buttons: Institutions,
Databases, Methods, Guided Tours, Location, Region, EXIT. Language selection:
Choose English / Choisir Francais / Wählen Sie Deutsch.
Conclusions
MultiLingual Information and Retrieval Software
Fig. 25: Summary of the most important arguments for multilingual software
In this article, the following arguments have been advanced (Fig. 25):
a) Multilingual software is a necessity for all organizations with Europe-wide or
world-wide activities, for which a single "company language" is undesirable or
impracticable.
b) Multilingual software is available in its basic features as standard software, but
as a rule it must be modified for the specific application before it can be used to
full benefit (compare wfBase, 3.1, and HEMIS, 3.2).
c) Multilingual retrieval software can be used for accessing large quantities of data
or documents on digital optical storage media.
d) Multilingual thesauri encourage standardization in document classification,
enable clear and structured access to documents, and support searches for
documents not in the user's own language.
e) Multilingual fulltext retrieval and translating systems are in use in prototype form.
Combined with other techniques, such as thesauri, they will make easy-to-use
information systems feasible in the future.
f) Multilingual software is a market opportunity for European software and systems
firms.
g) Multilingual retrieval and information systems can be used to advantage in
almost all areas of business and administration which extend beyond national
and cultural boundaries.