(EN) "Multilingual Information and Retrieval Systems Technology and Applications" - Dr. Ulrich Kampffmeyer - IMC Congress 1993 - Brussels

Multilingual Information and Retrieval Systems

Technology and Applications
IMC Congress, Brussels 1993
Dr. Ulrich Kampffmeyer

· VOI Verband Optische Informationssysteme, Roßdorf / Darmstadt
German Association of Manufacturers and Resellers of Digital Optical Media,
Systems and Software (Chairman of the Board)
· PROJECT CONSULT Unternehmensberatung Dr. Ulrich Kampffmeyer GmbH
Wachenheim, Hamburg, Darmstadt
Abstract
This paper on multilingual information and retrieval systems with optical mass
storage describes the technical principles of software design. The different layers
and modules from the user interface via transformation modules, thesaurus modules
and fulltext interpretation to database management are explained in detail. Two
examples of multilingual document imaging systems are presented:
- wfBase multilingual press and commerce information system based on
four ISDN nodes in Switzerland;
- HEMIS multilingual information system for CD-ROM distribution on
environmental institutions, projects and programmes of the
UN Environmental Programme UNEP/HEM.
Contents
1. The Importance of Multilingual Software Systems With Optical Storage
Media for the European Economic Region
2. Software Design
2.1 Structural and Other Requirements for Multilingual Software
2.2 User Interface and Application
2.3 Transformation Modules
2.4 Selection Lists
2.5 Thesauri
2.6 Fulltext Translation
3. Sample Applications
3.1 wfBase
3.2 HEMIS
4. Outlook and Summary

IMC Congress, Brussels
© Copyright PROJECT CONSULT GmbH 1993
1. The Importance of Multilingual Software Systems With Optical Storage
Media for the European Economic Region
Europe 1993 is a catch-phrase that is often heard. But opening the borders and
removing trade barriers will not eliminate the cultural and language differences
between countries. These differences are a concern for all firms and organizations
that operate in more than one country.
Overcoming the language barrier is not simply a matter of lexical comprehension
and translation. It involves many levels of differing interpretations, meanings in
various contexts, and adaptation of specialized vocabulary. In business and
commerce, mere translation is not enough; the unwritten laws of the target specialist
language must be adhered to.
In addition, the organization working across national boundaries must take into
account differing units of measure, currency, and conventions (date formats,
addresses, orthography).
Multilingual software is a requirement wherever users require access to the same
information regardless of the nature of the source. This is particularly the case for:
- Trading firms
- Service firms
- International authorities and institutions
- Manufacturers with suppliers and subcontractors in more than one
country
- Communications firms
- Banks
- Insurance companies
- Authorities and bodies such as police, air-traffic control, disaster
relief organizations, environmental monitoring agencies, etc.
- Others
English is often used as a de facto communications standard. However, the use of a
language that is foreign to its speakers can lead to misinterpretations and
misunderstandings when the user is not familiar with the exact meaning,
interrelationships, and contextual significance of terms and phrases. A "working
knowledge" of a language is not enough.
As software and the information underlying it become more complex, the support
provided by the software must become friendlier and more comprehensive. This is
especially true for the user interface, information on current actions, status
messages, context-sensitive information (especially with user mistakes or critical
program branchings) and help screens. The latter must be available in index form as
well as context-sensitive.
Modern "Windows"-oriented programs generally include these features. However,
as with most programs, the information they contain is available in only one user
language.
Most standard software today comes from a few leading software houses in the
United States. Consequently the software and documentation are available in English,
or American English, first. The various language versions are then translated from
the original English version. The translations become available with more or less
delay in various release standards, depending on the relative importance of the
national market. In such standard software, the screens, associated texts, etc. are
contained in the main body of the program, making translation with adaptation of the
screens and texts a very complex undertaking.
Even when different users access identical information, they cannot change the user
language while the program is running. Instead, the complete target language
version must be started, at considerable cost in time. In addition, most standard
software lacks the integrated database or resource management components that
enable the administration of different language and function modules, not to speak of
the creation and maintenance of such modules.
Thus, "traditional" standard software has a built-in bias against multilingual use.
This article will examine database information systems that are suitable for
multilingual applications.
2. Software Design
Like the ability to load modular program segments and functions separately,
multilingualism must be designed in from the start. It is well-nigh impossible to
modify finished software to support multilingual operation. In such cases, it makes
more sense to completely redesign the software using modern tools.
2.1 Structural and Other Requirements for Multilingual Software

Multilingual software is subject to the following design criteria (Fig. 1):
a) Modular design with clear logical and software partitioning of the various levels
(user interface, main program, resources, transformation modules, database,
etc.). Interaction is controlled through messages and global variables.

b) No text components may be contained in the program segments responsible for
execution; they must be referenced by variables. The user can switch from one
language to another using a global variable.
c) Texts are kept in resource libraries and accessed by variables. The libraries must
be simple to maintain and the texts must be accessible and loadable in the
application during runtime.
d) All parts of an application must have defined interfaces. This is particularly the
case for the user interface, the actual application itself, the operating system and
all additional application modules.
e) The application, user interface and operating system must support variable text
field lengths and positions, since these can vary greatly from language to
language.
f) The application, operating system, screen and printer drivers, and the database
must support a variety of fonts, character sets, sort orders, data formats, etc.
The underlying operating system must provide the corresponding support.
Multilingual Software - Design Principles
- Modular design with clear separation of user interface, operating system and
application (database)
- Every text component has to be referenced by a key variable in the application
- Resource libraries that are easy to link and to maintain (e.g. with a text editor)
- Defined interfaces between the user interfaces, operating system and
application modules
- Variable text field positions and field lengths in the user interface modules
of the application
- Support of different sets of fonts, language-specific characters, keyboard
layouts, date formats, etc. by the underlying operating system
Fig. 1: Multilingual Software Design Criteria
Such a multilingual application is thus divided into several inter-communicating
modules and levels (Fig. 2). The actual application program, which can be part of the
database, uses messages and global variables to control the language selection,
display and printing, and the search and conversion functions.

Principles of Language Display during Runtime
- Language resources: German, English, French, Spanish
- Screen with text fields (e.g. "Please select ...")
- Transformation modules, thesauri, selection lists
- Application, which passes the language selector Lx to the resources, the
screen and the modules
- Database holding the data
The language selector (Lx) in the application defines which
resource is used for display and how the information in the
datafield is represented
Fig. 2: Language display during runtime of a multilingual program
The variable "Lx" ("Language Resource") determines which texts will be displayed,
and which transformation modules and selection lists will be used to control an entry
or search in a selected language. The information in the database itself is not
changed, but only the screen display and printout.
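The separation between stored values and displayed texts can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the systems described here: the names RESOURCES, COUNTRY_LIST and render_field, the resource keys, and the sample texts are all hypothetical.

```python
# Hypothetical resource libraries: one dictionary of display texts per language.
RESOURCES = {
    "de": {"PROMPT_SELECT": "Bitte wählen ...", "FIELD_COUNTRY": "Land"},
    "en": {"PROMPT_SELECT": "Please select ...", "FIELD_COUNTRY": "Country"},
    "fr": {"PROMPT_SELECT": "Veuillez choisir ...", "FIELD_COUNTRY": "Pays"},
}

# Hypothetical selection list: every language version maps the same
# numeric value to its own display text.
COUNTRY_LIST = {
    "de": {1: "Schweiz", 2: "Belgien"},
    "en": {1: "Switzerland", 2: "Belgium"},
    "fr": {1: "Suisse", 2: "Belgique"},
}


def render_field(record, lx):
    """Build the display line for one record in the language selected by lx.

    The record itself is only read, never modified.
    """
    label = RESOURCES[lx]["FIELD_COUNTRY"]
    value = COUNTRY_LIST[lx][record["country"]]
    return f"{label}: {value}"


# The database record stores only the numeric value.
record = {"country": 1}

for lx in ("de", "en", "fr"):
    print(render_field(record, lx))
```

Switching the language only changes the value of lx passed to the display routine; the record in the database stays the same.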
Figure 3 shows the levels of a multilingual application.
Layers and Modules of Multi-Lingual Software
Level 1: User interface (application); user interface (Windows, Presentation
Manager, X-Windows, etc.)
Level 2: Language resources, transformation modules, selection lists, thesauri,
language interpreter; operating system
Level 3: Application; database
Level 4: IRS Information Resources Management; driver
Fig. 3: Multilingual software levels (1-4) and modules

Level 1 essentially handles the presentation of the information, Level 2 converts
the information from one language to another, Level 3 manages the access
information and handles searches, and Level 4 manages the "documents" (datasets,
images, graphics, etc.) on optical storage media. This article will not go into Level 4,
the "IRS" Information Resources Management Program, in greater detail (compare
Kampffmeyer, Ulrich: "Combined WORM and Magneto-optical Mass Storage Devices
and Procedure-Oriented Information Processing Systems", GI Gesellschaft für
Informatik, Arbeitskreis "Datenbanken", Conference at the University of Oldenburg,
Germany, on Feb. 19, 1990). Levels 1-3 and their components will be explained
below.
2.2 User Interface and Application

The user interface depends to a large extent on the underlying operating system.
Many operating systems are not up to the demands of multilingual software, since
they do not allow for reconfiguration during runtime and do not support international
character sets and formats. Operating systems with graphic user interfaces, like
Microsoft Windows and OS/2 Presentation Manager, and operating systems based
on X-Windows (OSF Motif, OpenLook, etc.) are suitable. These systems allow control
of the screen largely independently of the actual operating system itself.
There is a fundamental difference between
a) The standard Windows interface, and
b) The application-specific interface implemented on the basis of this interface. The
application-specific interface uses the tools provided by the standard interface to
represent the functions of the application.
A graphic user interface has numerous advantages: a gentler learning curve,
integrated help functions, and simple operation by mouse, menus, or key
combinations. Another advantage of Windows interfaces is that users can freely
resize windows and other displays.
It would be impracticable to give all displays of a multilingual application their own
user interface, since this would severely limit the number of compatible screen and
printer drivers. The application's user interface should use standard Windows
interface routines wherever possible.
The user interface (Windows as well as application) of a multilingual application
should include the following (see Fig. 4 and 5):
a) Change of key assignments for differing language keyboard layouts during
runtime by the application
b) Change of screen display during runtime by the application
c) Display of language-specific character sets (e.g. German: ä, ö, ü, ß; French: é, è,
ê, ç; Spanish: Í, ñ, ¿, ¡; Danish: å, æ; Hungarian: ő, ű; Greek: α, β, γ, etc.)
d) Change of formats, as for date, currency, time, etc. during runtime
e) Automatic adaptation of screens and fields to differing text lengths, special
symbols, fonts, etc., under the given monitor resolution
f) Language-specific context-sensitive help based on the cursor position, current
program status and the feasible or just completed action.
g) Modules loadable during runtime without leaving the program

Operating System and User Interface
Requirements for European Software
The operating system and window interface must support several features
to enable switching the language during runtime:
- Change of keyboard setting and screen display during runtime via external program
- Enhanced keyboard setting with special characters of European languages
(ç, ê, æ, å, ø, ä, etc.)
- Support and change during runtime of date and time formats
- Graphic interface with virtual window architecture to allow different sizes
of screens and fields while changing the language
- Context-sensitive help in relation to the current position of the cursor
Fig. 4: Operating system and user interface
User Interface (application)
Requirements
The user interface has to support several functions to enable change of
language during runtime:
- Object-oriented software
- Change of screens, settings and styles during runtime
- Dynamic positioning of fields
- Automatic adaptation to different field lengths
- Controllable by the application program
- Modules for messages, windows and help texts loadable during runtime
- Dynamic data and message interchange with operating system and user interface,
application program and database
Fig. 5: User interface
The most important feature of a multilingual application is convertibility during
runtime, without having to load and start another program and without changing the
screen and screen information content (Fig. 4).

The text components are kept in separate files, called "resource libraries" (Fig. 2).
The resource libraries can be loaded on the fly via the language selection variable Lx.
For language resources to be usable, all texts in a program that are going to be
displayed or printed must be referenced by an unambiguous key variable in the
appropriate library.
Resource libraries must exist for:
a) All static texts in dialogue boxes and masks. These are texts which are
associated with a given dialogue box and do not change.
b) Dynamic texts in dialogue boxes and masks. These are texts which change,
appear, or disappear according to status (messages). This includes the graying
out of inoperative or unavailable functions on menus and buttons.
c) Help texts which appear automatically or interactively.
d) Error messages, system messages and other operation-related messages.
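The four kinds of resource library listed above, and their loading during runtime, can be sketched as follows. This is only an illustration under stated assumptions: the JSON layout, the class name TextResources, the category names and all keys and texts are hypothetical, and the RESOURCE_FILES dictionary stands in for separate resource files on disk.

```python
import json

# Stand-ins for separate resource files, one per language (hypothetical content).
RESOURCE_FILES = {
    "en": json.dumps({
        "static": {"DLG_SEARCH_TITLE": "Search"},
        "dynamic": {"MSG_HITS": "{count} documents found"},
        "help": {"HLP_SEARCH": "Enter a term and press Enter."},
        "error": {"ERR_NO_MEDIUM": "Optical disk not mounted."},
    }),
    "de": json.dumps({
        "static": {"DLG_SEARCH_TITLE": "Suche"},
        "dynamic": {"MSG_HITS": "{count} Dokumente gefunden"},
        "help": {"HLP_SEARCH": "Begriff eingeben und Eingabetaste drücken."},
        "error": {"ERR_NO_MEDIUM": "Optische Platte nicht eingelegt."},
    }),
}


class TextResources:
    """Holds the resource library of the currently selected language."""

    def __init__(self, lx):
        self.switch(lx)

    def switch(self, lx):
        """Load another language's library without restarting the program."""
        self.lx = lx
        self._texts = json.loads(RESOURCE_FILES[lx])

    def text(self, category, key, **values):
        """Look up a text by category and unique key and fill in dynamic values."""
        return self._texts[category][key].format(**values)


res = TextResources("en")
print(res.text("dynamic", "MSG_HITS", count=3))
res.switch("de")  # language change during runtime
print(res.text("dynamic", "MSG_HITS", count=3))
```

The program code contains only the keys (for example MSG_HITS), so a translator can maintain the libraries without touching the executable parts.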
Language Resources
Requirements
Language resources are used for displaying texts related to the unique keys in
the application:
- Loadable modules for each language
- Every entry in the language resource is referenced by a unique key
which may be used by different applications and the database itself
- Language resources are needed for every text on an entry or search screen
form, every message, every help text, and icons adapted for each country
- Editor or tools for translation support
Fig. 6: Language resources
Many applications use icons and buttons to simplify option selection. If they bear text
or abbreviations (such as "B" for bold), these must be converted when the language
is changed (thus, in German "F" for "Fett" = bold). For this reason these icons and
buttons should likewise be kept in dedicated resource libraries instead of being
managed directly in the program. The same applies to icons with graphics, where the
graphics do not convey the same meaning in a different language area or
country.

Object-oriented programming languages and databases often support the use of
loadable resources, making them preferable to traditional programming tools.
The right choice of tools is important for the creation of applications based on a
programming language or database. The application is the overarching, integrating
component of the system as a whole (compare Figs. 2 and 7). The application
contains not only the usual data-processing algorithms and input/output modules,
but also the control and selection of language resources (transformation modules,
selection lists, thesauri, help texts, messages, screen layout and display, etc.).
Application
Characteristics
- Numeric keys for every text entry related to the screen display and database fields
- Direct control of database and user interface
- Object-oriented, message-driven program
- Transformation modules, selection lists, thesauri, language interpreters and
language resources as loadable modules
- Database as loadable module or client-server communication via SQL
Fig. 7: Components of the application
Object-oriented programs with a "message" concept, such as Microsoft Windows,
allow continuous control of the resources used and the condition of the screen.
Direct communication should be set up for control of the modules on level 2 (Fig. 3).
SQL can be used as a standardized interface for communication with the database
in which the actual information is kept and managed. All modules on levels 2, 3, and
4 (Fig. 3) should be directly accessible or loadable during runtime.
2.3 Transformation Modules

The numerical information in the database is stored in a format that can be
converted as needed for the currently loaded language resource. This conversion is
controlled by the variable "Lx" (Fig. 2). Transformation modules are considerably
easier to implement than text translators, since they work by exact rules and with
numeric values only (Fig. 8).
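Because a transformation module applies fixed rules to a stored value, it can be very small. The following sketch shows a date transformer under stated assumptions: the language codes, the DATE_FORMATS table and the function name display_date are hypothetical, and dates are assumed to be stored with a four-digit year so that the change of century causes no ambiguity.

```python
from datetime import date

# Hypothetical display rules per language resource.
DATE_FORMATS = {
    "en_US": "%m/%d/%Y",  # American: month-day-year
    "en_GB": "%d/%m/%Y",  # European: day-month-year
    "de": "%d.%m.%Y",
}


def display_date(stored, lx):
    """Convert a stored date into the display format selected by lx."""
    return stored.strftime(DATE_FORMATS[lx])


# Stored values carry the full year, so 1993 and 2001 stay distinct.
stored_dates = [date(2001, 3, 4), date(1993, 12, 31)]

# Sorting is done on the stored values, never on the display strings.
for d in sorted(stored_dates):
    print(display_date(d, "en_US"), "|", display_date(d, "de"))
```

Sorting the formatted strings instead would put "03/04/2001" before "12/31/1993", which is why the sketch sorts before formatting.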

Transformation Modules
Types
Transformation modules are used for the display transformation
of numeric values of the database:
- Transformation of date formats (supported by operating system)
- Transformation of time formats (supported by operating system)
- Transformation of addresses (position of postal codes, etc.)
- Transformation of units of measure (litre to gallon, km to mile, etc.)
- Transformation of internationally standardized nomenclature (country and
city names, etc.)
- Transformation of user-defined values (see selection lists, etc.)
Fig. 8: Transformation modules
The most important standard transformation modules are:

a) Date formats
This module toggles the display format of dates between American (month-day-
year) and European (day-month-year). This function is often supported by the
operating system directly, and allows use of either the months' full names or their
abbreviations. The transformation module should be designed to cope with the
conversion of pre-2000 dates into the next century. This is important for all data
which must be retained for several years. The date transformer module must also
ensure the proper sorting during display.
b) Time formats
The same applies to time-display formats. For firms active on an international
scale, data is best stored in "Coordinated Universal Time" format (UTC).
Date and time transformation modules can be set up to check whether the
system's internal time setting is correct (the current date and time must always be
later than that of the last document to be saved; calibration with standard working
hours and days, etc., in order to be able to determine system down time if
necessary).
c) Address conversion
Address-format conversion affects printouts more than it does on-screen
displays. Addresses in Europe are not standardized, and use a variety of
sequences of street, house number, and postal code. This transformer module
recognizes the country of the addressee and selects the appropriate address
format for printouts.

d) Units of measure
This is an important requirement for international trade and manufacturing
companies. For example, in the oil industry large quantities of different types of
oil and petroleum products are transported and handled daily. Measurement
values and with them customs and tax rates constantly fluctuate, depending on
the type of product and its specific weight and even on the ambient temperature.
In cross-border trade the units of measure as well as of currency must be
automatically converted.
The most important categories are units of currency, distance, weight, and
volume.
e) EDI data
Standardised electronic data interchange (EDI), such as EDIFACT, allows entire
business transactions to be handled electronically, without paper originals. The
data is archived digitally. For display and printouts, EDI codes are converted into
text. This conversion can be made language-specific through a language control
variable. With EDI data it is necessary to know what version of a given EDI
application the data will be converted with.
Further transformation modules can be added to cover other requirements for
specific industries and applications, for example converting product codes into text.
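Three of the modules described above (units of measure, address conversion, and the plausibility check on the system clock) can be sketched together. This is a minimal illustration, not a complete module: the table and function names, the two address layouts and the sample data are assumptions, and real address formats are considerably more varied. The conversion factors for the US gallon and the mile are the exact defined values.

```python
from datetime import datetime, timezone

LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# Values are stored in metric base units; the factor converts for display.
UNIT_RULES = {
    "metric": {"volume": ("l", 1.0), "distance": ("km", 1.0)},
    "us": {
        "volume": ("gal", 1.0 / LITRES_PER_US_GALLON),
        "distance": ("mi", 1.0 / KM_PER_MILE),
    },
}

# Hypothetical, simplified address layouts selected by the addressee's country.
ADDRESS_FORMATS = {
    "DE": "{name}\n{street} {number}\n{postal_code} {city}",
    "GB": "{name}\n{number} {street}\n{city}\n{postal_code}",
}


def display_quantity(value, quantity, system):
    """Convert a stored metric value into the requested unit system."""
    label, factor = UNIT_RULES[system][quantity]
    return f"{value * factor:.2f} {label}"


def format_address(addr):
    """Pick the print layout from the country code stored with the address."""
    return ADDRESS_FORMATS[addr["country"]].format(**addr)


def clock_is_plausible(now_utc, last_saved_utc):
    """The current time must be later than that of the last saved document."""
    return now_utc > last_saved_utc


print(display_quantity(100.0, "distance", "us"))
print(format_address({"country": "DE", "name": "Beispiel GmbH", "street": "Hauptstrasse",
                      "number": "5", "postal_code": "20095", "city": "Hamburg"}))
last_saved = datetime(1993, 5, 1, 12, 0, tzinfo=timezone.utc)
print(clock_is_plausible(datetime.now(timezone.utc), last_saved))
```

Timestamps are kept in UTC here, in line with the recommendation for internationally active firms, so the comparison does not depend on local time zones.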
2.4 Selection Lists

Graphic interfaces like Microsoft Windows support single and multiple selection lists
(Fig. 9). With single selection lists only one item on a given list can be marked and
processed. With multiple selection lists, one or more items can be selected.
Selection Lists
Characteristics
Selection lists are an easy way to translate information
and to save storage capacity:
- The list displays a text on the screen related to a database value
- Every entry in a selection list refers to a value which is related to a database field
- Every entry in the different language versions of a list refers to the same value
- The database has to store only the numeric value of the entry
- Selection lists can be used as single and multiple-choice lists
- Selection lists help to standardize nomenclature in multinational and
multilingual organizations
Fig. 9: Selection Lists

Selection lists offer several advantages over regular text-entry fields in database
applications:
a) No typing mistakes
b) Selection lists keep the database uniform and ensure that entries can be easily
found again. Since the user must decide from among a set of given expressions,
entries are standardized.
c) The database stores only a reference number which refers to a text resource.
This keeps space requirements low, and different text resources can be accessed
depending on the language variable. Retrieval is faster, since the system must
search only through predefined numbers instead of text sequences.
d) Multiple selection lists facilitate the multiple allocation of a document and allow
the user to select a number of related items if he/she is unsure about the
allocation to a single one.
Point c) above is the most important factor for multilingual applications. The use of
reference numbers allows linkage to multiple lists in different languages. The
reference numbers can also be used to limit access, so that only cleared items
are shown in a search. Selection lists also facilitate data entry through the use of
default values for recurring entries.
Selection lists can be created with standard text editors. However, this should be
done only by authorised persons, since changes to, and especially deletions of,
characteristics (entries in a selection list) can compromise the consistency of
the database. Strict update and maintenance rules are a must for distributed
systems and resources.
Selection lists with restricted vocabulary are the ideal medium for standardising
terminology within a company and for creating multilingual software systems.
Multilingual systems should avoid free text entry wherever possible and use
selection lists whenever feasible.
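The properties of selection lists discussed in this section (restricted vocabulary, storage of reference numbers only, multiple selection, and limiting the displayed items) can be sketched as follows. The list contents, the reference numbers and the function names are hypothetical and serve only as an illustration.

```python
# Hypothetical selection list: the same reference numbers in every language.
SELECTION_LIST = {
    "en": {10: "Press release", 20: "Annual report", 30: "Price list"},
    "de": {10: "Pressemitteilung", 20: "Geschäftsbericht", 30: "Preisliste"},
}


def choices(lx, cleared=None):
    """Return (number, text) pairs for display; optionally only cleared items."""
    items = SELECTION_LIST[lx]
    return [(n, t) for n, t in sorted(items.items())
            if cleared is None or n in cleared]


def store_selection(numbers):
    """Validate a single or multiple selection and return what the database stores."""
    unknown = set(numbers) - set(SELECTION_LIST["en"])
    if unknown:
        raise ValueError(f"not in selection list: {sorted(unknown)}")
    return sorted(set(numbers))


def search(documents, number):
    """Retrieval compares reference numbers only, never text."""
    return [doc_id for doc_id, numbers in documents.items() if number in numbers]


documents = {
    "doc-1": store_selection([10]),
    "doc-2": store_selection([30, 10]),  # multiple allocation
}
print(choices("de", cleared={10, 20}))
print(search(documents, 10))
```

A document entered through the German list is found by a user working with the English list, because both lists resolve to the same stored number.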
2.5 Thesauri
The term "thesaurus" has widely differing meanings. In its original meaning it refers to a defined
specialist terminology, broken down hierarchically from the general down to the
precise. The terms differ clearly from one another and are distributed over several
hierarchical levels. A generic term at one level branches into a number of more
precise terms on the level below it. All terms at a given level should be at a similar
level of detail.
However, in many word-processing programs the "thesaurus" is simply a utility
showing possible synonyms. This familiar kind of thesaurus is completely unrelated
to the structured terminology system described above, as for example defined by the
International Organization for Standardization (ISO) for monolingual and multilingual thesauri.

Thesauri
- Thesauri offer a hierarchically structured and crosslinked nomenclature
- One field on the screen may be represented by a structured hierarchical thesaurus
- Similar to a selection list, the thesaurus displays a text related to a database value
- The thesaurus offers navigation and interpretation tools
- The thesaurus is a database in itself which relates numeric values to texts
and provides additional structure by hierarchic order and crosslinks
- The structure of thesauri is standardized by ISO
- The same thesaurus may be used by different applications
Fig. 10: Thesauri
Seen from the outside, the thesauri we are discussing here for multilingual systems
act similarly to selection lists (Fig. 10, compare also Fig. 9). First a list of generic
terms is displayed (ISO Top Term; TT). Once a top term has been selected, the more
precise terms subordinated to it are shown on a second list (ISO Narrower Term;
NT). When one of these is selected it forms the new generic term (ISO Broader
Term; BT) for the next level of narrower terms (Fig. 11). This strict hierarchy is fully
applicable to only a few subject areas. Therefore, the ISO standard provides for
crosslinks. These link terms from different levels and branchings independently of
their position in the hierarchy. This is easier to follow in a program than it is to
describe in print.
An electronic thesaurus is referenced by numbers in the program just as a
selection list is (see Section 2.4). However, unlike a selection-list entry, a thesaurus entry
includes not only a "unique identifier" number in the database, but also flags which
specify its display position (level and branching in the hierarchy) and the type (and if
necessary direction) of links. The links allow a term to be associated with more than
one top or broader term in other branchings, as well as the linkage of a broader
term to several narrower terms in other branchings, regardless of the position in the
hierarchy. The use of different links (unidirectional, bidirectional, broad-to-narrow,
narrow-to-broad, additional reference, synonym, etc.) makes it easier to navigate in
such a system. In principle the electronic thesaurus is an entire database
application, which stands between the user interface and the database proper.
The database proper stores only the unique identifier. If this identifier belongs to a
"narrower term", all associated broader terms up to the top term can be found
using its links and hierarchical position.
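How the broader terms can be recovered from a stored identifier may be sketched as follows. The identifiers and position codes are taken from Fig. 11; the parent relations derived from the position codes, the single crosslink, and the names Entry, broader_chain and narrower_terms are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    ident: int                      # unique identifier stored in the database
    position: str                   # position code in the hierarchical view
    broader: Optional[int] = None   # identifier of the broader term, None for a top term
    crosslinks: List[int] = field(default_factory=list)  # additional broader terms


# Parent relations are assumed from the position codes (1110 below 1100 below 1000).
THESAURUS = {e.ident: e for e in [
    Entry(1, "1000"),
    Entry(2, "1100", broader=1),
    Entry(3, "1200", broader=1),
    Entry(4, "1110", broader=2),
    Entry(5, "1120", broader=2, crosslinks=[3]),  # hypothetical crosslink
    Entry(6, "1210", broader=3),
    Entry(7, "1220", broader=3),
]}


def broader_chain(ident):
    """Follow the hierarchy from a term up to its top term."""
    chain = []
    current = THESAURUS[ident].broader
    while current is not None:
        chain.append(current)
        current = THESAURUS[current].broader
    return chain


def narrower_terms(ident):
    """Terms listed below a broader term: direct children plus crosslinked terms."""
    direct = [e.ident for e in THESAURUS.values() if e.broader == ident]
    linked = [e.ident for e in THESAURUS.values() if ident in e.crosslinks]
    return sorted(set(direct + linked))


print(broader_chain(4))
print(narrower_terms(3))
```

The database proper would hold only the value 4 for a document; the chain of broader terms is derived by the thesaurus when needed.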

Thesauri
Hierarchy and Crosslinks
The Hierarchical View of the Thesaurus
(Top Term, Broader Term, Narrower Term)
Each entry: unique identifier, position in hierarchical view
1 1000
    2 1100
        4 1110
        5 1120
    3 1200
        6 1210
        7 1220
The Network Structure of the Thesaurus
(Crosslinks independent of the hierarchical position)
1 1000
    2 1100
        4 1110
        8 1120
    3 1200
        6 1210
        7 1220
Fig. 11: Hierarchy and virtual linkages (crosslinks)
An electronic thesaurus is represented internally as a network (relational system),
but to the outside as a hierarchy. Thus, the composition of a list of terms depends
not only on the broader term, but also on the links and the route taken to get to the
broader term. Unlike with a selection list, the lists displayed by an electronic
thesaurus can differ from situation to situation.
In addition to assisting in navigating by displaying the selection lists specific to a
broader term selected previously, a database-supported thesaurus can also be used
in "specialist" or "beginner" mode. When entering information, a specialist mode is
best which allows entry of a narrower term or an abbreviation directly, with the
system determining the associated broader terms without having to go through the
hierarchy. However, users who are inexperienced with hierarchical selection lists or
with the subject content of the thesaurus are better off doing their searches in
beginner mode, whereby the system analyses users' text input, looks for a match in
the thesaurus, and if in doubt shows a synonym list and help text suggesting a
repeat attempt or a more closely defined query. Such a "global search" can also
include further fields or other resources of the thesaurus.
Fig. 12 shows how a number of "slices" are assigned to the reference keys of the
thesaurus database. Each of the language slices contains all information on the
hierarchical and network structure of the terms, since this will differ from language to
language (narrower or broader terms, different semantic fields). However, regardless
of the differences among the languages the same information must be clearly
accessible in all. Therefore, the unique identifier is assigned not only the term itself
(main keyword), but also acronyms (e.g. "NASA"), homonyms (words that sound the
same but have different meanings), synonyms, plural forms, explanatory notes, etc.
This information is also accessed during a global search.
The "language slices" need not necessarily contain foreign languages; they can also
contain different aspects of a single language. This is particularly useful for
specialist languages. Thus, one slice can contain the regular colloquial language,
with only two or three levels and accessible to everyone, while another slice can
contain the terminology for a specialist field broken down into more levels and
accessible only to those working in that field. This allows control of the extent, depth,
and accessibility of information.
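The slice model and the global search can be sketched with a small data structure. The identifiers follow the A1 ... An scheme of Fig. 12; the terms, synonyms and function names are illustrative assumptions, and the sketch only matches complete terms, whereas a real beginner mode would also analyse partial input.

```python
# Hypothetical slices: every unique identifier has one record per language.
SLICES = {
    "en": {
        "A1": {"main": "environment", "synonyms": ["surroundings"], "acronyms": []},
        "A2": {"main": "waste water", "synonyms": ["sewage"], "acronyms": []},
        "A3": {"main": "United Nations Environment Programme",
               "synonyms": [], "acronyms": ["UNEP"]},
    },
    "de": {
        "A1": {"main": "Umwelt", "synonyms": ["Umgebung"], "acronyms": []},
        "A2": {"main": "Abwasser", "synonyms": ["Schmutzwasser"], "acronyms": []},
        "A3": {"main": "Umweltprogramm der Vereinten Nationen",
               "synonyms": [], "acronyms": ["UNEP"]},
    },
}


def global_search(text, lx):
    """Match user input against main keywords, synonyms and acronyms of one slice."""
    needle = text.casefold()
    hits = []
    for ident, entry in SLICES[lx].items():
        candidates = [entry["main"], *entry["synonyms"], *entry["acronyms"]]
        if any(needle == c.casefold() for c in candidates):
            hits.append(ident)
    return hits


def display_term(ident, lx):
    """Show the main keyword of an identifier in the selected language."""
    return SLICES[lx][ident]["main"]


for ident in global_search("Schmutzwasser", "de"):
    print(ident, "->", display_term(ident, "en"))
```

A query entered with a German synonym resolves to the identifier A2, which can then be displayed or searched in any other slice.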
Thesauri
"Slice" Model of a Multilingual Thesaurus
Language slices: German, French, English, ...
Each unique ID (A1, A2, A3, ..., An) has one record in every language slice, containing:
- IDs of predecessors
- IDs of successors
- Position in hierarchy
- Main keyword
- Synonyms, homonyms, etc.
- Help text
Fig. 12: Slice structure of a multilingual thesaurus
In addition to the modular slice structure, an electronic thesaurus database offers
many advantages:
a) A standardized, controlled vocabulary ensures unambiguous and complete
retrieval of all correctly entered information.
b) Entry errors are prevented.
c) Selection lists and help functions assist the user in finding his or her way through
extensive, many-layered specialist vocabularies.
d) Functions like "global search" enable searches to include synonyms, homonyms,
acronyms and other references as well as the help text itself.
e) The organization and structure of thesauri are internationally standardized.
f) A thesaurus database acts as a pre-processor, saving time in searches in the
database proper, since only short, unambiguous numerical references need be
searched and evaluated. The thesaurus then converts the unique identifiers for
display.
g) Thesaurus databases can be run on a PC LAN, thus reducing the workload on
the central database and information resources management (IRS; see below
and Fig. 3).
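Points d) and f) can be illustrated with a small Python sketch. It assumes a hypothetical in-memory thesaurus and document index; the names `THESAURUS`, `DOCUMENTS`, `global_search` and `find_documents` and all sample entries are invented for the example. The global search looks through keywords, synonyms, acronyms and help text and returns only numeric IDs; the database proper is then searched by those IDs alone.

```python
# Hypothetical thesaurus: unique ID -> descriptive fields.
THESAURUS = {
    101: {
        "keyword": "Water quality",
        "synonyms": ["Water purity"],
        "acronyms": ["WQ"],
        "help": "Chemical and biological condition of water bodies.",
    },
    102: {
        "keyword": "Air quality",
        "synonyms": [],
        "acronyms": ["AQ"],
        "help": "Concentration of pollutants in ambient air.",
    },
}

# Hypothetical database proper: documents are indexed by numeric IDs only.
DOCUMENTS = {
    "doc-1": {101},
    "doc-2": {102},
    "doc-3": {101, 102},
}


def global_search(text):
    """Return the IDs of all entries whose keyword, synonyms,
    acronyms or help text contain the search text."""
    needle = text.casefold()
    hits = set()
    for concept_id, entry in THESAURUS.items():
        fields = [entry["keyword"], *entry["synonyms"],
                  *entry["acronyms"], entry["help"]]
        if any(needle in value.casefold() for value in fields):
            hits.add(concept_id)
    return hits


def find_documents(concept_ids):
    """Search the database proper using numeric references only."""
    return sorted(doc for doc, keys in DOCUMENTS.items() if keys & concept_ids)


ids = global_search("pollutants")          # matched via the help text
print(ids, find_documents(ids))
print([THESAURUS[i]["keyword"] for i in ids])  # IDs converted back for display
```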
If the system includes optical-systems management software in addition to the
thesaurus database and the database proper, it has a three-level database hierarchy
(compare Figs. 3 and 26):
a) Database for one or more thesauri (local or central)
b) Database for managing unique identifiers to selection lists and thesauri and for
managing database entries (numerical, alphanumeric, date, time, Boolean
variables, etc.)
c) Information retrieval and access system (IRAS). As a rule this is a non-standard
database for managing WORM (write-once) media, erasable and rewritable (M/O)
optical media, or read-only media (CD-ROM).
A standard database (preferably relational) can be used for the thesaurus database
as well as for the database proper. Full-text databases are not suitable for this type
of application (Fig. 13).
Database
Characteristics
Standard relational databases may be used to manage data
(except for language interpretation)
Support of an optical disk information retrieval system for mass data management
Standard fulltext databases are not usable
Fig. 13: Database characteristics
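The first two levels of the hierarchy can be sketched with a standard relational database. The following Python example uses the `sqlite3` module from the standard library purely for illustration; the table and column names, the sample rows and the format of `storage_ref` are assumptions, and the third level (IRAS) is represented only by the opaque storage reference that would be handed to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Level a): thesaurus, one row per unique ID and language
CREATE TABLE thesaurus (
    concept_id INTEGER NOT NULL,
    lang       TEXT    NOT NULL,
    keyword    TEXT    NOT NULL,
    PRIMARY KEY (concept_id, lang)
);
-- Level b): database proper, holding unique identifiers and other entries
CREATE TABLE descriptor (
    doc_id      INTEGER PRIMARY KEY,
    subject_id  INTEGER NOT NULL,   -- numeric reference into the thesaurus
    entry_date  TEXT    NOT NULL,
    storage_ref TEXT    NOT NULL    -- passed on to level c), the IRAS
);
""")
conn.executemany("INSERT INTO thesaurus VALUES (?, ?, ?)", [
    (7, "de", "Umwelt"), (7, "en", "Environment"), (7, "fr", "Environnement"),
])
conn.executemany("INSERT INTO descriptor VALUES (?, ?, ?, ?)", [
    (1, 7, "1992-11-03", "jukebox-1/platter-12/0004711"),
    (2, 7, "1992-12-01", "jukebox-2/platter-03/0000815"),
])


def locate(term, lang):
    """Resolve a term in the given language and return the matching
    documents with their storage references."""
    return conn.execute(
        """SELECT d.doc_id, d.storage_ref
           FROM thesaurus AS t
           JOIN descriptor AS d ON d.subject_id = t.concept_id
           WHERE t.lang = ? AND t.keyword = ?
           ORDER BY d.doc_id""",
        (lang, term),
    ).fetchall()


print(locate("Environnement", "fr"))
```

The same documents are found whichever language the term is entered in, because the descriptor table stores only the numeric identifier.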
2.6 Fulltext Translation

The electronic interpretation and translation of running text requires very different
strategies from those described up until now. Transformation modules, selection lists
and thesauri can be combined in a system as desired, since they all work by the
same rules: Numerical identifiers are transformed into predefined expressions in
defined ways.
A system capable of analysing running text is difficult to combine with these
modules. It is an independent and complex software system made up of many
component parts (Figs. 14 and 15).

Language Interpreter
Characteristics
The language interpreter contains different modules which allow
translation and interpretation of fulltext databases.
Dictionaries provide information for the direct translation of nouns
(singular, plural, conjunctions, etc.)
Statistical modules support the interpretation of the noun inside a text
Linguistic modules support the interpretation of the grammatical context
Comparison modules combine the different strategies of interpretation
Presentation modules display the answer to a query in the chosen language
as translated fulltext
Inverted file and cache modules optimize access
Fig. 14: Components of a language translation system
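[Figure: Structure of the language interpreter. The user interface (entry, query,
display) is at the top. Below it are the dictionary, statistics, linguistic and
presentation modules, which are linked by the comparison module. The inverted file
lies between these modules and the language interpreter database at the bottom.]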
Fig. 15: Structure of a language translation system

a) Dictionaries contain the individual words in their different forms (plural, singular,
declined, conjugated, irregular verb forms, etc.). As a rule the dictionary will
constitute a database application of its own. However, it is completely different in
structure, makeup and content from the thesaurus discussed above.
b) Statistics modules analyse the occurrence and composition of words and
combinations of words.
c) Linguistic and grammatical-analysis modules are the most difficult part. They
must contain all the rules and comparative examples required to analyse syntax.
Pattern recognition and fuzzy logic techniques are often used for this purpose.
d) The results of a), b) and c) above are combined, evaluated and interpreted in a
comparison module. The comparison module is designed so that intermediate
results of one module can be returned to another module for evaluation. This
gives rise to an iterative process with a relatively high rate of recognition in texts
on specific subjects for which there are electronic dictionaries containing the
subject terminology.
e) Due to their architecture, traditional databases are not very effective at time-
consuming text analysis. To speed things up, special cache and inverted file
modules are often used as intermediaries.
f) Presentation modules handle the correct on-screen presentation of the translated
text. They work with information from the dictionary module, the evaluated text
from the database, and the inverted file system.
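The iterative interplay described in d) can be illustrated with a deliberately small Python sketch. It is a toy under stated assumptions, not a description of any real system: the dictionary, the co-occurrence counts and the function names are invented, and no linguistic module is included. The comparison module asks the dictionary module for candidate translations, lets the statistics module score each candidate against the current choices for the other words, and repeats until the result no longer changes.

```python
# Hypothetical German -> English dictionary with ambiguous entries.
DICTIONARY = {
    "bank": ["bank", "bench"],
    "kredit": ["loan"],
    "park": ["park"],
}

# Hypothetical counts of how often two English words occur together.
COOCCURRENCE = {
    frozenset({"bank", "loan"}): 9,
    frozenset({"bench", "park"}): 7,
}


def dictionary_module(word):
    """Return all candidate translations; unknown words pass through."""
    return DICTIONARY.get(word.lower(), [word])


def statistics_module(candidate, context):
    """Score a candidate by its co-occurrence with the context words."""
    return sum(COOCCURRENCE.get(frozenset({candidate, other}), 0)
               for other in context)


def comparison_module(words, max_rounds=5):
    """Combine both modules iteratively until the choices are stable."""
    chosen = [dictionary_module(word)[0] for word in words]
    for _ in range(max_rounds):
        updated = []
        for i, word in enumerate(words):
            context = chosen[:i] + chosen[i + 1:]
            best = max(dictionary_module(word),
                       key=lambda c: statistics_module(c, context))
            updated.append(best)
        if updated == chosen:
            break
        chosen = updated
    return chosen


print(comparison_module(["Bank", "Kredit"]))
print(comparison_module(["Bank", "Park"]))
```

In the second call the first pass starts from the default translation and the statistics module then corrects it, which is the kind of feedback between modules that the text describes.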
The running text interpretation system we have described can be used to evaluate
queries in regular text. Fig. 15 shows the processing path for a query. The system
goes through the modules from bottom to top in the same way to convert a text out of
the database. The system shown here is just one possible configuration. Since this
technology is very new, many other approaches are being investigated. This
particular approach has the advantage that different modules with differing
evaluation strategies can be consulted simultaneously. Furthermore, each module
can be dedicated to certain languages or vocabularies, and accessed automatically
by the comparison module as needed. The interpretation and translation of a text is
very time-consuming, and usually possible only on very fast dialogue computers.
Complex systems such as the one described should not be confused with simple
translation aids.
Traditional full-text databases are seldom suitable for such systems. Standard
database software uses a strategy of leaving out filler words, adjectives, adverbs,
etc. in order to save memory space and increase database speed. However, a
language interpretation system needs all of the information contained in the text,
since otherwise coherent, context-adequate translation is not possible.
"Language Interpreter" database systems have enjoyed initial successes at the
United Nations and the European Commission.
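Why dropping filler words harms interpretation can be shown with a short Python sketch of an inverted file. The stop-word list, the sample sentences and the function name `build_inverted_file` are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical filler-word list of the kind a full-text database might drop.
STOP_WORDS = {"the", "is", "not", "a", "of"}

TEXTS = {
    1: "the water is safe",
    2: "the water is not safe",
}


def build_inverted_file(texts, drop_stop_words):
    """Map each word to the set of (document, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for position, word in enumerate(text.lower().split()):
            if drop_stop_words and word in STOP_WORDS:
                continue
            index[word].add((doc_id, position))
    return index


reduced = build_inverted_file(TEXTS, drop_stop_words=True)
full = build_inverted_file(TEXTS, drop_stop_words=False)

print(sorted(reduced))   # only the content words remain
print(full["not"])       # the negation survives only in the full index
```

In the reduced index both sentences consist of the same two words, so the negation that reverses the meaning of the second sentence can no longer be recovered. A language interpretation system therefore needs the full index.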
The choice of a system for multilingual database applications is still simple at this
point:
a) For document-oriented (facsimile) systems, applications with controlled
vocabularies, and systems intended to bring about a standardization of use, the
transformation, selection list and thesaurus approach is the right choice.

b) For full-text applications which will not go into full use within the next three to four
years, the approach described in this section should be attempted or at least
examined.
At present there is no commercial software immediately available for either
application, nor are off-the-shelf solutions likely to become available in the future,
since the nature of the application and the vocabulary will be subject to constant
change.
However, in my opinion an approach as shown in Fig. 3 is ideal. It combines the
different transformation and interpretation components in one level where they work
in parallel. They link the user interface with the database proper. This integrative
approach combines the advantages of all of the techniques named, which can then
be used individually or in combination as needed.
3. Sample Applications
We will now look at multilingual information and retrieval systems from the user's
point of view, using three examples.
Application Examples
HYPARCHIV  Standard optical filing software
           for Microsoft Windows in 9 languages
wfBase     Distributed press and commercial information system
           in 4 languages based on ISDN nodes
           (wf, Switzerland)
HEMIS      Meta-database and information system for
           environmental data:
           information, programmes, methods, etc.
           for CD-ROM distribution
           (UNEP/HEM, worldwide)
Fig. 16: Application examples
a) wfBase Press and economics information in a distributed document-imaging system
b) HEMIS Environmental information on CD-ROM
3.1 wfBase

wfBase was developed specially for the Swiss Institute for Commercial Development
(German "Wirtschaftsförderung", hence "wf"). It has been in operational use since
1992.
The Swiss Institute for Commercial Development is located in Zürich, with offices in
Geneva, Bern and Lugano. Prior to the introduction of wfBase, dossiers on political
events, economic data, and the like were kept independently at all four locations.
The goal of wfBase is to enable access by all Institute users to all press articles,
periodicals, and Institute documents, independent of the language of data entry
(Figs. 17 and 18).
wf Schweizer Wirtschaftsförderung
wfBase Swiss Institute for Commercial Development
Zürich - Geneva - Bern - Lugano
The wf owns one of the largest archives on commercial and political topics in
Switzerland. It provides information to politicians, journalists and its commercial
members, which represent all major companies in Switzerland.
Optical filing system for press and commercial documents
(scanned and created via word processor, spreadsheet, etc.)
Distributed system linked via SwissNet 2 (ISDN)
Access for wf employees and third-party partners via multilingual graphical
user interface (ISDN and telephone modem)
Database with quadrilingual thesaurus
Access to information independent of the language in which it was entered
Several million documents stored on M/O jukeboxes (2 times 50 gigabytes)
Integrated office communication with word processing, spreadsheet, fax,
library management, electronic mail, accounting, address database, etc.
Fig. 17: wfBase - Features

[Figure: wfBase storage and communications layout. wf users in Zürich work on a Novell
NetWare network with read / write / create access to images, files and descriptors.
wf users in Lugano, Geneva and Bern, each with a hard-disk cache, have read / create
access via ISDN (SwissNet 2). External users connect by ISDN or telephone modem.
The DB server (addresses, library, dossiers), the archive server, the communications
server and two jukeboxes with hard-disk caches, one for internal and one for external
use, are located in Zürich.]
Fig. 18: wfBase - System configuration with internal and external users and information
management in two jukeboxes (Zürich)

wfBase also integrates other applications besides document management under its
graphical user interface, such as word processing and spreadsheet applications,
address and library management, billing for outside users, electronic faxing and
mailboxes, etc. The wfBase system makes use of some HYPARCHIV modules, but is
otherwise an independent application with client-server architecture and a relational
database on an OS/2 server. The MS Windows workplaces are linked together in a
Novell network. Outside users can access wfBase by modem, query documents
("subsets"), and display and print them locally or have wfBase fax the documents to
them.
The four wfBase locations are linked by SwissNet2 (ISDN). This powerful network
allows compressed scanned facsimile transmission. Two jukeboxes store scanned
facsimiles, locally-generated data, and incoming faxes. The system is highly error-
tolerant and largely fail-safe.
At the heart of wfBase is the database with a quadrilingual (German, French, Italian,
English) thesaurus for subject-area classification. The thesaurus includes over 2000
subject areas, organized hierarchically and in linked structure over four levels.
wfBase
Multilingual Thesaurus
The two images show different views of the thesaurus for thematic
keywords (here in German). The thesaurus supports the user in navigation,
jump functions, short-key entries, synonym retrieval and other techniques for
easy-to-use access.
[Screenshots not reproduced: Screen I and Screen II of the thesaurus mask for the
subject area ("Sachgebiet"), taken from the Online '92 presentation.]
Fig. 19: wfBase multilingual thesaurus, showing two windows of the thesaurus screen. The left
shows the branching from a broad term to a list of narrower terms. The thesaurus contains
the subject areas covered in the dossiers.
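The branching from a broad term to its narrower terms can be sketched as a walk over the successor links. The following Python example is an illustration only: the subject IDs, the labels and the names `NARROWER`, `LABELS` and `expand` are assumptions, and the limit of four levels mirrors the depth stated for the wfBase thesaurus.

```python
# Hypothetical successor links: subject ID -> IDs of narrower terms.
NARROWER = {
    1: [11, 12],
    11: [111],
    12: [],
    111: [],
}

# Hypothetical labels for two of the four languages.
LABELS = {
    "de": {1: "Wirtschaft", 11: "Handel", 12: "Industrie", 111: "Aussenhandel"},
    "fr": {1: "Économie", 11: "Commerce", 12: "Industrie", 111: "Commerce extérieur"},
}


def expand(subject_id, max_levels=4):
    """Return the subject and all narrower terms, level by level,
    descending at most max_levels levels in total."""
    found = [subject_id]
    frontier = [subject_id]
    for _ in range(max_levels - 1):
        frontier = [child
                    for parent in frontier
                    for child in NARROWER.get(parent, [])]
        if not frontier:
            break
        found.extend(frontier)
    return found


ids = expand(1)
print(ids)
print([LABELS["fr"][i] for i in ids])
```

A search on a broad term can then include every narrower term by querying with the whole list of IDs, independent of the language in which the user browses the hierarchy.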

In addition to the thesaurus, there are selection lists for other fields and fields for text
and data entry. The database enables the user to locate documents regardless of
the language in which they were entered. However, the system displays documents
only in their language of origin; in a multilingual country like Switzerland it is not
necessary to translate the contents of documents, as users are expected to be
multilingual as a matter of course. Instead, the objective of wfBase is to improve
communication between office locations, standardize addresses and documentation,
eliminate redundancies, and provide third parties (members of the wf's supporting
organizations) with a simple, time-saving and cost-effective means of access.
3.2 HEMIS
Within the United Nations Environmental Programme, or UNEP, there is an
organization called UNEP/HEM (Harmonization of Environmental Measurement)
which is responsible for the harmonization of environmental monitoring methods,
plans, projects and information. Since 1990 a project has been underway at the
Munich UNEP/HEM office to implement an information and meta-database system
for UNEP/HEM, called HEMIS (= HEM Information System). HEMIS is intended
to provide an overview of:
a) Current global and national environmental projects by the UN and other
international and world organizations
b) Institutions, research emphases, periodicals, and key personnel
c) Methodology, reference materials, etc.
d) Databases, data formats, data quality, access, etc.
The information contained in HEMIS is meta-data compiled from widely varying
sources (Figs. 17 and 24).
UNITED NATIONS ENVIRONMENTAL PROGRAMME
HEMIS HARMONIZATION OF ENVIRONMENTAL MEASUREMENT
UNEP / HEM, Nairobi / Munich
The UNEP / HEM Office harmonizes nomenclature, measurements and other
information used worldwide in environmental projects. This task will be
supported in the future by the HEMIS meta-database and information system,
a multilingual PC system using CD-ROM.
Multilingual thesauri for scientific nomenclature, countries, climates, etc.
with references, links, synonyms, homonyms, acronyms and wildcard functionality
Harmonization of nomenclature by standardized access to information
Hyperlinks, guided tours, global search facilities together with the thesaurus
enable easy access to the information independent of the language of entry
CD-ROM based worldwide distribution
Fig. 20: HEMIS - Information and meta-database system of the UN environmental organization
UNEP/HEM

The goal is to harmonize access to heterogeneous information of varying quality and
extent from varying sources.
HEMIS is made up of two component systems:
a) One system, to be installed in Munich, will be used to collect, process and
manage all information and to make its contents readily accessible. The
system is intended to be able to create reports (printouts) selected from its
database and to create CD-ROM databases.
b) The other will handle worldwide distribution of extracts from HEMIS in Munich by
CD-ROM in regularly updated editions.
The two component systems will have differing user interfaces, databases, etc.¹
System a) is a production system that will generally be used only by UNEP/HEM
employees. System b) is designed to provide information internationally on
environmental projects, prevent parallel developments, and supply basic project and
database data, even if the information is not available in the user's own language.
The HEMIS CD-ROM will be made as attractive as possible so that it is widely used,
and so that other institutions not associated with the UN will be motivated to supply
data for the system (Fig. 21).
[Figure: Harmonization and distribution of information via HEMIS. Sectoral, regional
and specialized sources of environmental meta-data and examples of users
(EARTHWATCH, INFOTERRA, UNEP, ESA, UN, EEA-TF, WMO, NGOs, GEMS, governments,
IAEA and others) surround HEMIS. Its high-level data model covers institutions,
programmes, databases, classification systems, methods/models and persons.]
Fig. 21: Information harmonisation and distribution by HEMIS. Data on paper, diskette and CD
is read into the stationary HEMIS, selected and formatted, classified semi-automatically or
manually following a defined nomenclature (thesauri), and finally distributed in the form of
printed reports on specific subjects or on CD-ROM. This figure shows only a representative
sample of the participating organizations.
¹ At this writing (late 1992) HEMIS is still at the design and prototype stage. Not all components have
been implemented as yet.

The major components of both the stationary and the CD-ROM HEMIS systems are
a number of electronic thesauri, structured as shown in Figs. 11, 12, and 22.
The Internal Structure of the HEMIS Thesaurus
A  Content: Unique identifier (ID)
   Type:    Numeric
   Entries: One entry
   Length:  8 digits
   Use:     Unique reference key for the descriptor database
B  Content: IDs of predecessors (ISO TT, BT, links)
   Type:    Numeric
   Entries: Up to 64 entries
   Length:  8 digits; sequence of digits
   Use:     Internal management; bi-directional
C  Content: IDs of followers (ISO NT, links)
   Type:    Numeric
   Entries: Up to 255 entries
   Length:  8 digits; sequence of digits
   Use:     Internal management; uni-directional
D  Content: Main descriptor for display in the hierarchy of the thesaurus
   Type:    Alphanumeric
   Entries: One entry
   Length:  Up to 20 characters (due to display restrictions)
   Use:     Retrievable via hierarchical selection list and global search
E  Content: Position of "D" in the hierarchy
   Type:    Numeric
   Entries: One entry
   Length:  Up to 8 digits (max. of 8 hierarchy levels)
   Use:     For screen display in the hierarchical thesaurus only
F  Content: Synonyms, acronyms, homonyms, interpretations, etc. of "D"
   Type:    Numeric
   Entries: Up to 255 entries
   Length:  Up to 40 characters each; sequence of texts
   Use:     Retrievable via global search
G  Content: Explanation
   Type:    Alphanumeric
   Entries: One entry
   Length:  Up to 255 characters
   Use:     Available as context-sensitive help function
Fig. 22: Structure of the HEMIS thesaurus for geographical units, climate zones, subject areas, and
other hierarchically structured reference keys. For an explanation of the content of each
field (A to G) see Section 2.5 and Fig. 12.
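The field limits given in Fig. 22 lend themselves to a simple record check. The Python sketch below is an assumption-laden illustration, not the HEMIS implementation: the class name `HemisEntry`, the attribute names, the choice to hold the hierarchy position as a string of digits and the reading of "8 digits" as a value between 0 and 99,999,999 are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List

MAX_ID = 99_999_999  # assumption: "8 digits" means at most eight decimal digits


def is_valid_id(value: int) -> bool:
    return 0 <= value <= MAX_ID


@dataclass
class HemisEntry:
    entry_id: int                                            # field A
    descriptor: str                                          # field D
    position: str                                            # field E, one digit per level
    predecessors: List[int] = field(default_factory=list)   # field B
    followers: List[int] = field(default_factory=list)      # field C
    alternatives: List[str] = field(default_factory=list)   # field F
    explanation: str = ""                                    # field G

    def problems(self) -> List[str]:
        """Return a list of violated limits; an empty list means the entry fits."""
        found = []
        if not is_valid_id(self.entry_id):
            found.append("A: ID must have at most 8 digits")
        if len(self.predecessors) > 64:
            found.append("B: more than 64 predecessors")
        if len(self.followers) > 255:
            found.append("C: more than 255 followers")
        if not all(is_valid_id(i) for i in self.predecessors + self.followers):
            found.append("B/C: referenced ID must have at most 8 digits")
        if not 1 <= len(self.descriptor) <= 20:
            found.append("D: descriptor must have 1 to 20 characters")
        if not (self.position.isdigit() and len(self.position) <= 8):
            found.append("E: position must be 1 to 8 digits")
        if len(self.alternatives) > 255:
            found.append("F: more than 255 alternatives")
        if any(len(text) > 40 for text in self.alternatives):
            found.append("F: alternative longer than 40 characters")
        if len(self.explanation) > 255:
            found.append("G: explanation longer than 255 characters")
        return found


ok = HemisEntry(12345678, "Climate zones", "12", predecessors=[10000001])
too_long = HemisEntry(12345679, "Atmospheric chemistry", "121")
print(ok.problems())
print(too_long.problems())
```

The second entry is rejected because its descriptor has 21 characters, one more than the display restriction allows.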
The thesauri and selection lists are part of both HEMIS systems. In the stationary
system they are used in making key words for data sets, documents, graphics,
images etc., and for searching and compiling data. If information is supplied on
computer media in pre-agreed formats, some of the key-word creation process can
be done by the system automatically. In the CD-ROM version the thesauri, selection
lists and all other entries are used only for researching and compiling information.
The HEMIS CD-ROM version has a multi-layer modular structure (see Fig. 23).

[Figure: HEMIS system layout with multilingual user interfaces. Top layer: user
interface (e.g. English) with additional user interfaces in different languages.
Second layer: query by example, global search, thesauri and selection lists, guided
tours, links, language translator. Third layer: descriptor database (field-oriented)
with standard variables (alphanumeric, numeric, etc.) and numeric keys related to
thesauri and selection lists, alongside the database of guided-tour links and the
hyperlinks (part of the stored objects). Bottom layer: information retrieval and
access system (IRAS) managing the objects: texts, images, datasets.]
Fig. 23: HEMIS system layout with multilingual user guidance and search. The user interfaces in the
various languages make up the first layer. The next layer is composed of modules for
different search and navigation strategies, likewise language-specific. In addition to a
database, HEMIS has prearranged "guided tours" and "links". The information and
documents on the CD-ROM are managed by an Information Retrieval and Access System
(IRAS).
In addition to searching for certain key words or terms, HEMIS also offers navigation
assistance in the form of prearranged "guided tours" and individual links. A global
database search takes a certain amount of time, but it does allow the user to use the
system without prior knowledge of what contents lie behind a given field in the
search mask. The user interface can be toggled among different loadable languages,
as can the thesauri, selection lists, links and guided tours. Free text input and
scanned-in documents are not translated. HEMIS is intended to provide the initial
information; the user can then consult the source institutions, databases, or
publications for more in-depth information.
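Toggling the interface among loadable languages can be sketched as a lookup of numeric keys in per-language resource tables. The Python example below is a minimal illustration; the `RESOURCES` table, its keys and translations and the `Interface` class are assumptions and do not describe the HEMIS prototype itself. Free text is returned unchanged, reflecting that it is not translated.

```python
# Hypothetical loadable language resources: language -> key -> label.
RESOURCES = {
    "en": {1: "Institutions", 2: "Programmes", 3: "Guided Tours", 4: "Help"},
    "fr": {1: "Institutions", 2: "Programmes", 3: "Visites guidées", 4: "Aide"},
    "de": {1: "Institutionen", 2: "Programme", 3: "Geführte Touren", 4: "Hilfe"},
}


class Interface:
    """User interface whose labels are resolved at runtime from the
    resource table of the currently selected language."""

    def __init__(self, language):
        self.language = None
        self.switch(language)

    def switch(self, language):
        """Toggle to another loaded language without restarting."""
        if language not in RESOURCES:
            raise ValueError(f"language not loaded: {language}")
        self.language = language

    def label(self, key):
        return RESOURCES[self.language][key]

    def show_free_text(self, text):
        """Free text input is displayed in its language of entry."""
        return text


ui = Interface("en")
print(ui.label(3), "|", ui.label(4))
ui.switch("de")
print(ui.label(3), "|", ui.label(4))
print(ui.show_free_text("Rapport annuel 1992"))
```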
Fig. 24 shows the proposed starting screen of the HEMIS prototype with the button
fields for moving to the main subject-area screens.

[Figure: Start screen of the HEMIS prototype. Menu bar: Institutions, Programmes,
Databases, Methods, Ref. Mat., Guided Tour, Thesaurus, Location, Region, ?.
Title: HEMIS Environmental Information System. Buttons: Institutions, Programmes,
Databases, Methods, Ref. Material, Guided Tours, Subject Thesaurus, Location,
Region, Help, EXIT. Language selection: Choose English, Choisir Français,
Wählen Sie Deutsch.]
Fig. 24: HEMIS starting screen (suggested CD-ROM version)
4. Outlook and Summary

The development of multilingual information and retrieval systems has only just
begun.
Conclusions
Multilingual Information and Retrieval Software
The European Challenge for 1993
Multilingual software is a must for all companies and organizations working
in different European countries
The American software industry is presently unable to supply multilingual software -
this is a window of opportunity for European software companies
Multilingual software helps to bridge the national barriers within Europe
Multilingual software is intelligent object-oriented programming
using databases and information management systems as a framework
for huge masses of coded and non-coded information
Fig. 25: Summary of the most important arguments for multilingual software

In this article, the following arguments have been advanced (Fig. 25):
a) Multilingual software is a necessity for all organizations with Europe-wide or
world-wide activities, for which a single "company language" is undesirable or
impracticable.
b) Multilingual software is available in its basic features as standard software, but
as a rule it must be modified for the specific application before it can be used to
full benefit (compare wfBase, 3.1, and HEMIS, 3.2).
c) Multilingual retrieval software can be used for accessing large quantities of data
or documents on digital optical storage media.
d) Multilingual thesauri encourage standardization in document classification,
enable clear and structured access to documents, and support searches for
documents not in the user's own language.
e) Multilingual fulltext retrieval and translating systems are in use in prototype form.
Combined with other techniques, such as thesauri, they will make easy-to-use
information systems feasible in the future.
f) Multilingual software is a market opportunity for European software and systems
firms.
g) Multilingual retrieval and information systems can be used to advantage in
almost all areas of business and administration which extend beyond national
and cultural boundaries.
