Académique Documents
Professionnel Documents
Culture Documents
ABSTRACT
To cut-across the language barrier and to encourage the language pluralism of morphologically complex
languages [Sproat 1991], especially South-Asian languages [Krishnamurti et al. 1986] in India, a
consortium mode robust Machine Translation system (MTS) that is able to raise the accuracy of generation
is developed jointly by C-DAC, Pune and DIT, GOI. In Natural Language Processing (NLP) and Natural
Language Understanding (NLU), Machine Translation plays a vital role in todays India for any sort of elanguage processing and understanding by machine. In each of the quarter of electronic era of a multilingual community machine translation, information retrieval or speech processing becomes obligatory.
This paper proposes to describe a hybrid based machine translation system from English to Indian
languages. This paper also proposes the TAG based memory managed Machine Translation System [Joshi
et al. 1981] aligning with other rule based, example based and statistical based Machine Translation System
for English-Hindi, English-Urdu, English-Oriya, English-Bangla, English-Marathi and English-Tamil.
EILMT has especially been designed to translate in platform independent modules. This is a proposed
hybrid based thin-client/thick-server design; where users (clients) of this system use a standard browser to
access the translation services of the server. We call this as a Pan-Indian perspective on Machine
Translation. In this paper, we will explain the challenges faced and solution drawn at the various levels of
architecture, language and linguistic computation. While building the Machine Translation System, we
have taken care of the speed and accuracy of syntactically and morphologically diversified languages at
modular and phases of EILMT system.
1
1.0 Introduction to Machine Translation
In this present paper, we will explain the challenges encountered to cope with the speed and accuracy of
syntactically and morphologically diversified languages tested and developed for Machine Translation
system based on consortium mode for English to Indian Languages in collaboration with C-DAC, Pune and
DIT, Govt. of India.
In 1629 the idea of Machine Translation evolved, when Rene Descartes proposed Universal Language. In
1954, the Georgetown experiment (1954) involved fully-automatic translation of sixty Russian sentences
into English. In late 1980s, machine translation inclined to statistical models and example based models
evolved gradually. And the Machine Translation system like Systran used by AltaVista search engine,
METEO used at the Canadian Meteorological Centre, Example-based machine translation proposed by
Makoto Nagao and several other Hybrid based Machine Translation system came into existence. During the
year 1990-91, DIT (Department of Information Technology) of Government of India initiated the TDIL
(Technology for Development of Indian languages) project to encourage the Indian language processing in
the area of IT. The institutions namely, C-DAC, Pune (MANTRA); NCST (now C-DAC, Mumbai;
MATRA); IIIT-Hyderabad (Anusaaraka, and SHAKTI) and IIT-Kanpur (Anglabharati) have taken the
Machine Translation System from English to Hindi to greater height by developing applications using
cutting edge technology.
AnalGen (%)
EBMT (%)
SMT (%)
TAG (%)
Copula
88.75
42.50
66.25
87.50
Simple
58.75
60.00
57.50
95.00
Appositional
71.50
47.50
67.50
91.25
Relative Clause
75.00
53.75
55.00
95.00
That-Clause
62.50
51.25
60.00
92.50
Wh-Clause
56.66
38.33
53.75
63.33
Co-ordinate
65.00
32.50
78.75
95.00
Conditional
48.50
56.25
63.75
77.50
PP Initial
60.00
43.75
57.50
93.75
Adverb Initial
80.00
53.33
53.33
95.00
Gerundial
36.00
31.60
75.00
83.33
Participle
81.25
25.00
81.25
91.25
Infinitive
50.00
46.25
60.00
93.75
75.00
75.00
Discourse Connector
70.00
55.00
Table 1: Engine wise Translation output accuracy
Similarly, language pair wise translation accuracy on TAG translation engine was evaluated, whose
approximate translation accuracy is as follow: for English-Hindi pair the translation accuracy is
2
approximately 85%; for English-Urdu is approximately 75%; for English-Oriya is approximately 80%; for
English-Bangla is approximately 70%; for English-Marathi is approximately 65%; and for English-Tamil is
approximately 70%.
Basic system facilitations of EILMT consortium are: User Log module; Pre-Processing module; four
Translation Engines: AnalGen, EBMT, SMT & TAG for six language pairs; Post-Processing module;
Collation and Ranking module; a compatible system with W3C; and Browser compatibility for IE, Mozilla,
Firefox, Google Chrome, Apple Safari & Opera. (See Annexure 1 B for detailed EILMT system
specifications).
3
the incremental parser that has given modularity, extensionality and speed in the translation process of TAG
engine. Probability of adjoining the parent derivations to a nearest probable child derivation is given by the
following equation:
Y = {c(X)} ; Where, X = Number of Child derivations, Y = Number of Parser Derivations, c = Combination ,
Consider the sentence The 18th century Bharatpur-Bird-Sanctuary, which is also known as the Keoladeo-Ghana-National Park,
is famous as the most important bird breeding and feeding habitat of the world.
The present paper represents our internal analysis and testing of tourism corpus on TAG translation engine
through an accuracy graph based on the grade scale of 4 points subject to syntactically and semantically
well-formedness and choice of proper lexical selection.
4
rule. The stylistic trend observed in EILMT tourism corpus is: simple sentence (14.94% frequency of
occurrence), copulative construction (3.49% frequency of occurrence), co-ordinate sentences (20.10%
frequency of occurrence), appositional sentences (11.33% frequency of occurrence), various declarative
clause structures (22% frequency of occurrence), gerundial constructions (.35% frequency of occurrence),
conditional sentences (1% frequency of occurrence), discourse connector (0.77% frequency of occurrence)
and infinitival sentences (9.03% frequency of occurrence). Thus, the parallel corpus created for all 6
language-pairs and features such as intelligibility, comprehensibility and fluency in translation are maintained
to set a reference to the machine output, as E.M. Enquest has said very correctly that Proper words in proper
places creates styles.
In Natural Language Processing (NLP) and Natural Language Understanding (NLU) [Terry Patten, 1985],
Machine Translation plays a vital role in Indian sub-context for e-language processing by machine. In EngHindi EILMT system, the localization of linguistic peculiarities of Hindi such as oblique formation,
ergativity, marked-gender system, case-marking, direct-oblique pluralization etc. are handled in a controlled
environment through morph-synthesizer, finite and non-finite generators, POS conversion rule etc. Similarly,
for other Indo-Aryan language pairs i.e., Eng-Urdu, Eng-Oriya, Eng-Bangla and Eng-Marathi the linguistic
features such as, Perso-Arabic and Indic pluralization system, lexico-semantic peculiarities, copula drop,
dropping of existential subject, post-position synthesis, synthesis of case-marking, emphatic clitic formation,
usage of classifier, verb root alteration, strong and three level gender system, gender based noun synthesis,
and compounding etc., [Bhattacharya, T et. al 1996; Krishnamurti, Bh. et al 1986; Selkirk 1982; Williams
1981] are incorporated through feature-based lexicon, ordered rule-based normalizer etc. Approximately
37,000 bilingual lexicon, 2000 phrasal lexicon, 97 TAG tree disambiguation rule, 125-150 source TAG trees,
215-230 target TAG trees, 800 transfer grammar mapping and 70-75 morph-synthesis rule are developed for
each Indo-Aryan language-pairs.
Following section will explain the linguistic challenges faced and language computing solutions drawn in
EILMT system for syntactically and morphologically diversified and complex languages to raise the speed
and translation accuracy:
Apart from POS tagging, emotion and sense tagging is necessary in Machine Translation to capture the
semantic anomaly of the natural language.
b). Chunking is an important part of shallow parsing level. It minimizes the number of tokens to be sent to the
core parser, thus reducing the number of possible adjunctions and effected the translation time as well as the
translation quality. We perform noun phrase chunking and verb group collation. Consider the following
example at level-1 chunking,
[The-Prince/NNP of/IN Wales/NNP Museum]/NNP ,/, [the-Jahangir-art-Gallery]/NNP ,/, [the-various-churches]/NNS ,/, temples/NNS
and/CC shrines/NNS including/VBG [the/DT one/CD]/NNP of/IN [Haji-Ali]/NNP out/IN on/IN [an-island]/NN linked/VBN by/IN [acauseway]/NN ,/, are/VBP worth/JJ [a-glimpse]/NN
5
c). We use a TAG (Tree Adjoining Grammar) [Joshi et al., 1975] parser, and for that we have created a
number of trees to represent structure of source and target languages. In this formalism each token is tagged
with a POS tag/category, on the basis of which a set of possible tree tags are assigned to the token. This
process is called tree tagging. A sentence as a string of tree-tagged tokens, are then sent to the parser. When a
token in a sentence is tagged with a number of trees the parser is liable to produce multiple derivations, most
of them being inappropriate. This reduces accuracy and speed. To eliminate this spurious derivation, or at
least minimize them, we adopted the technique of TAG tree pruning. Our pruning module further
disambiguates according to the syntactic context and helps in selection of TAG tree in more precise way.
Accuracy and speed of the system, thus, was substantially improved.
d). To handle the synthesis of constructions in Indian Languages, morphological complexities of nouns and
verbs and their inter-relationships, and the kaaraka formalism [Gangopadhyay, M. 1990] in a defined context
plays a major role in noun or verb synthesis. Henceforth various categories at adjoining positions as an
adpositional words like adjectives, post-positions (parasargas), and various particles like the avyayas etc. are
also within this defined context. Basically post-positions, avyayas have modified-modifier [Aronoff, M.
1976] function adding more linguistic information to the end-users of the target language. The feature
embedded morphological rules (and also sometimes gender agreement) written for the synthesizer can be
seen through the synthesized output. Verbs in the language demand the karaka identities and the nouns fulfil
the demands according to the yogyataa. And, in a defined context, nouns demand parasargas or post-position
on a semantic account. Following diagram explains the synthesis process in EILMT system :
Above mentioned points from 5 a) to d), the linguistic variations and complexities that are handled through
pre-processing or post-processing generative modules have escalated translation accuracy and speed in a
considerable way. Following graph represents the comparison of translation speed between old and new
version of EILMT system (i.e., speed of translation before and after pruning and context disambiguation of
the POS tagsets, TAG tree tagging and noun-verb synthesis). The above development at parsing and
generation stages has raised the speed of translation in the latter (or new) version of EILMT:
6
In English-Tamil EILMT system, special attention to Tamil morphological system has been given. As Tamil
roots to Dravidian language family, being agglutinative language, the synthesis of finite and non-finite forms,
synthesis of noun or noun group and gender based system has been catered through feature based lexicon, and
noun and verb morph-synthesizer. In modern Tamil three types of words noun, verb and itaiccol or particles
are found. The noun indicates animate and inanimate categories (tiNai, is classified into uyartiNai and
akRiNai). There are three genders in Tamil - masculine and feminine and neuter where masculine and
feminine indicates singular number and neuter gender indicates plural number. There are three persons in
Tamil (first, second and third person). Case inflexion is prominent with suffixes in Tamil. Tamil being
agglutinative in nature [See Varadarajan, Mu. 1988] is found to be different to parse and generate than the
manner in which Indo-Aryan languages are generating in EILMT system. Approximately 35,000 bilingual
lexicon, 92 phrasal lexicon, 97 TAG tree disambiguation rule, 125 source TAG trees, 127 target TAG trees,
147 transfer grammar mapping and 100 morph-synthesis rule are developed for English-Tamil version.
Consider the following example from tourism domain with EILMT TAG output in Tamil:
English: Mother Earth' is kind in return. / Tamil:
Following diagram represents the English-Tamil User Interface (with Tamil output):
7
a). Re-framing and enhancing the Noun Collation module on the basis of Phrase Tagging and agglutinating
character of Tamil.
b). The process of new Tree set creation/ existing Tree set modification / deletion of Redundant Tree set to
be improved and target tree set pruning, and purification of transfer grammar to be enhanced.
c). Enhancement of feature based linguistic rule-set in the verb generator and noun generator module and
Bilingual lexicon correction and purification and feature attachment to be completed.
6.0 Conclusion
All these above findings, research and implementations to EILMT system give a more productive and
evolutionary ground in e-corpus in Indian subcontinent. And this ground will definitely raise some critical
questioning and re-analyzing on machine translation, text mining, data pruning, information extraction and
retrieval, speech pathology and technology in IL to IL information exchange and access. Thus the research
and study on EILMT for Indian languages should be guided and formalized as following:
a). Standardization of Indian tagset and considering the factor of morphologically rich language families
and formal tagging, sense tagging and emotion tagging of the e-corpora available in Indian languages.
b). Memory based parsing management to organize the multiple language with multiple domain.
c). Further feature-based development of morphology based modules for morphologically rich Indian
languages.
d). Further, memory managed MT will increase the system efficiency 15-20% more. The scope of this
analysis and synthesis can be extended for reverse translation as-well.
7.0 Reference
Aronoff, Mark. 1976. Word Formation in Generative Grammar. Cambridge: MA: MIT Press
Bhattacharya, T. and P. Dasgupta 1996. Classifiers, word order and definiteness in Bangla. In V.S. Lakshmi
and A. Mukherjee, ed. Word order in Indian languages. 73-94. Hyderabad: Booklinks.
Gangopadhyay, Malaya. (1990). The Noun Phrase in Bengali: Assignment of Role and the
Kaaraka Theory. Delhi: Motilal Banarsidass.
Joshi, Arvind, Bonnie Weber and Ivan Sag. 1981, Elements of Discourse Understanding. Cambridge
University Press, New York.
Krishnamurti, Bh., C.P. Masica and A. K. Sinha (eds). (1986). South Asian Languages: Structure,
Convergence and Diglossia. Delhi: Motilal Benarsidass.
Kroch, T. and A. Joshi (1985). The Linguistic Relevance of Tree Adjoining Grammar. University of
Pennsylvania. Department of Computer and Information.
Patten, Terry 1985. A problem solving approach to generating text from systematic grammars. Proceedings
of 2nd Conference on European chapter of Association for Computational Linguistics. Geneva. Switzerland.
Sinclair, J. 1991. Corpus, concordance, collocation. Tuscan Word Centre, Oxford: Oxford University Press
___________. 2004. Developing Linguistic Corpora: A Guide to good practice. Oxford: Oxford University
Press
Selkirk (1982). The Syntax of Word. MIT Press.
Vardarajan, Mu. 1988. A History of Tamil Literature. Translated from Tamil by E. Sa. Viswanathan 1-17.
Sahitya Academy. New Delhi.
ANNEXURE 1 A
(EILMT system specifications)
Server Machine(s)
Server Machine 1 (Main EILMT Server)
Hardware:
HP DL rack Mount Server, 8 core xeon 3.0GHZ, 32GBRAM, 148GB*6 SAS Disk.
Operating System:
Windows Server 2008
Software Required:
Java platform:
jdk1.5.0_01 or jre1.5.0_01.
Server version:
jboss-4.0.2(Application server).
Database version:
MySQL Server 6.0, MySQL Tools for 5.0
Server Machine 2 (AnalGen Server)
Hardware:
HP DL rack Mount Server, 8 core xeon 3.0GHZ, 32GBRAM, 148GB*6 SAS Disk.
Operating System:
Linux Fedora 8 (For Consortia Engine Name: AnalGen).
Client Machine
Hardware:
PC with 486MHz or Higher (Pentium processor recommended).
16 MB RAM minimum.
Operating System:
Windows 98 & Above
Linux Fedora 8
Browser support:
Internet Explorer IE6, IE7
Mozilla FireFox 3.0.11-13,3.5.2-6
Google Chrome 2.0.172.28, 2.0.172.37, 3.0.195.24, 3.0.195.27
Apple Safari 3.2.2, 4.0.2, 4.0.3(531.9.1), 4.0.4(531.21.10)
Opera 9.64
ANNEXURE I B
(Evaluation of EILMT system Translation output for English-Tamil language pair)
Version 5.0
English Sentence
Simple + Copula
Vrindavan is a pilgrimage.
Analysis of Translation
POS= = 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 75%
Phrase-Marking= 100%
Synthesizer= 100%
Simple + Copula (co-ordinate)
The Mehrangarh Fort has seven
gates and provides wonderful
views of the city.
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 100%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
9
English Sentence
December to February
Analysis of Translation
P syntax = 80%
G-syntax= 80%
Lexicon= 100%
Phrase Marking= 80%
Synthesizer= 100%
POS= 95
P syntax = 100%
G-syntax= 100%
Lexicon= 100%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 100%
Phrase Marking= 100%
Synthesizer= 100%
POS= 95%
P syntax = 70%
G-syntax= 70%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 90%
Appositional(complement + initial)
Balsamand Lake & Palace, an & ,
artificial lake, is a splendid spot
,
and was built in 1159 AD.
1159 D
That Complement
Madurai is the oldest city in
Tamil Nadu and was home to the
ancient Tamil Sangam, the
literary conclave that produced
the first epic, Silappathikaram.
,
Tamil Sangam ,
Co-ordinates
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 60%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax =100%
G-syntax= 100%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 70%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 95%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 70%
G-syntax= 70%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 50%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 100%
10
English Sentence
Analysis of Translation
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%
POS= 80%
P syntax = 80%
G-syntax= 80%
Lexicon= 75
Phrase Marking= 100%
Synthesizer= 75
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 100%
,
,
POS= 80%
P syntax = 70%
G-syntax= 70%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%
POS= 80%
P syntax = 70%
G-syntax= 70%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 70%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 100%
Gerundial
Visitors found the villagers
dancing to the tune of folk music.
Discourse connector
Udaipur is known for its beautiful
lakes, well structured palaces,
lush green gardens and temples
but the major attractions of this
place are the Lake Palace and the
City Palace.
Conditional
Unless you are accustomed to
horse riding, a daylong camel
ride will be tiring.
POS= 100%
P syntax = 90%
G-syntax= 90%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 60%
G-syntax= 60%
Lexicon= 95
Phrase Marking= 100%
Synthesizer= 100%
POS= 95%
P syntax = 95%
G-syntax= 70%
Lexicon= 75%
Phrase Marking= 100%
Synthesizer= 80%
11
English Sentence
Analysis of Translation
POS= 90%
P syntax = 90%
G-syntax= 90%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 90%
POS= 100%
P syntax = 90%
G-syntax= 90%
Lexicon= 80%
Phrase Marking= 100%
Synthesizer= 100%
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 90%
Phrase Marking= 100%
Synthesizer= 100%
PP initial
At the end of the Kullu Valley, 32
km from Manali is the famous
Rohtang Pass (3978 m) which
offers some of the most
spectacular views of the awesome
Himalayas.
, 32
km
<
3978 >
POS= 100%
P syntax = 100%
G-syntax= 100%
Lexicon= 70%
Phrase Marking= 100%
Synthesizer= 100%