(IJCSIS) International Journal of Computer Science and Information Security,

Vol. 8, No. 2, May 2010

Evaluation of English-Telugu and English-Tamil
Cross Language Information Retrieval System
using Dictionary Based Query Translation Method

P. Sujatha
Department of Computer Science
Pondicherry Central University
Pondicherry-605014, India.
spothula@gmail.com

P. Dhavachelvan
Department of Computer Science
Pondicherry Central University
Pondicherry-605014, India.
dhavachelvan@gmail.com

V. Narasimhulu
Department of Computer Science
Pondicherry Central University
Pondicherry-605014, India.
narasimhavasi@gmail.com


Abstract: A Cross Lingual Information Retrieval (CLIR) system lets users pose a query in one language and retrieve documents in another language. We developed a CLIR system for the computer science domain that retrieves documents in Telugu and Tamil for a given English query. We chose query translation for the English-Tamil and English-Telugu language pairs, using bilingual dictionaries. Transliteration is also performed for the named entities present in the query. Finally, the translation and transliteration results are combined, and the resulting query is passed to the searching module to retrieve target language documents. For Telugu, we achieve a Mean Average Precision (MAP) of 0.3835 and for Tamil, we achieve a MAP of 0.3665.

Keywords: Cross Lingual Information Retrieval; Translation; Transliteration; Ranking.

I. INTRODUCTION

CLIR can be defined as a subfield of Information Retrieval (IR) that deals with searching and retrieving information written or recorded in a language different from the language of the user's query. The major purpose of a CLIR system is thus to facilitate finding relevant documents written in one natural language with automated systems that accept queries expressed in other language(s). The process is bilingual when dealing with a language pair, that is, one source language and one target or document language. In multilingual information retrieval the target collection is multilingual, and topics are expressed in one language [1]. In either case, CLIR is expected to support queries in one language against a collection in another language(s) [2].

According to Peters and Sheridan [3], CLIR is a complex multidisciplinary research area in which methodologies and tools developed in the fields of IR and natural language processing converge. IR is traditionally based on matching the words of a query with the words of the document collection. Because the query and the document collection are in different languages, this kind of direct matching is impossible in CLIR. Translation is needed: either the query has to be translated into the language of the documents or the documents have to be translated into the language of the query. Translating the whole document collection is more demanding, as it requires scarce resources such as a full-fledged Machine Translation (MT) system, which is not available for a number of languages in developing countries. Hence query translation techniques are more feasible and more common in the development and implementation of CLIR systems. The present paper discusses a CLIR system based on query translation.

The paper is organized as follows. Section II describes related work on CLIR systems for Indian languages. Section III gives a brief overview of the CLIR system architecture. Evaluation results are described in Section IV. The conclusion and future enhancements are given in Section V.

II. RELATED WORK

Many organizations in India are working on CLIR systems for different Indian languages [13]. IIIT, Hyderabad has developed a Hindi and Telugu to English CLIR system [4]. They used a vector based ranking model with a bilingual lexicon, using word translations combined with a set of heuristics for query refinement after translation. Jagadeesh and Kumaran [5] built a CLIR system with the help of a word alignment table learned from a parallel corpus, primarily for statistical machine translation. They participated in the Cross Language Evaluation Forum (CLEF) competition, in the Indian language sub-task of the main Ad-Hoc monolingual and bilingual track. This track tests the performance of systems in retrieving relevant documents in response to a query in the same language as the document set or in a different one. In the Indian context, documents are provided in English (corpus) and queries are specified in different languages including Hindi, Telugu, Bengali, Marathi and Tamil on the CLEF dataset. A cross-language query focused

314 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
multi-document summarization system for the Telugu-English language pair was described in [6]. The authors used a cross-lingual relevance based language modeling approach to generate an extraction based summary. The approach provides a syntactically well formed set of sentences in the summary, which eases machine translation. Another benefit is that the output is easily translatable content (minimizing ambiguities). Mandal et al. [7] described two cross-lingual runs and one monolingual English text retrieval run at CLEF in the Ad-Hoc track. The cross-language task involves retrieving English documents in response to queries in the two most widely spoken Indian languages, Hindi and Bengali. The authors adopted automatic query generation and a machine translation approach to develop the system.

An Indian Language Information Retrieval System [8] exploits the significant overlap in vocabulary across the Indian languages. Cognates are identified using well-known similarity measures and are combined with the traditional bilingual dictionary approach. The effectiveness of the retrieval system was compared across various models. The results show that using cognates with the existing dictionary approach leads to a significant increase in the performance of the system. Language independent information retrieval is one of the major issues in web access for regional populations. Language Independent Information Retrieval from Web (LIIRW) was described in [9]. It gives the user the freedom to type the query in any language of their choice and to get results in any language or combination of languages, and it is intended to make the multilingual content of the web easily available and more visible. The work addresses the implementation of the LIIRW concept for Indian languages (Hindi and Tamil). A Tamil-English CLIR system was developed in [10]. This system is mainly developed for the farmers of Tamilnadu in the agriculture domain. It helps them to specify their information need in Tamil and retrieve documents in English (corpus). The Tamil query is translated syntactically and semantically to English using a statistical machine translation approach, which is reported to give better results. The system exhibits a dynamic learning approach.

This paper presents a CLIR system which translates English queries into Tamil and into Telugu using bilingual dictionaries for the computer domain. It also transliterates the named entities present in the query, which are handled separately from the words that can be translated.

III. SYSTEM ARCHITECTURE

An overview of the CLIR system is shown in Fig. 1. It mainly contains the following modules: Text Processing, Verification, Translation, Transliteration, and Retrieval and Ranking.

Text Processing: For a given source (user) query, this module preprocesses the query. That is, before the source query is translated into the target query, some text processing steps are needed: tokenization, stop word removal, morphological analysis and stemming. Tokenization is the task of dividing the query into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are referred to as terms or words. After tokenization, stop words, which do not help in retrieving target documents, are removed from the source query. Examples of such words are: the, is, was, that, etc. The morphological analyzer analyzes the structure of the words in the query and identifies what kind of word each one is, for example a verb, adverb or adjective. Stemming is the process of reducing inflected words to their base or root form. For example, fishing, fished and fisher are inflected words that a stemmer reduces to the root form fish. After text processing, the source query (SQ) becomes the preprocessed source language query (PSQ), which consists of the preprocessed source language query words {PSW1, PSW2…PSWn}.

Verification Module: This module checks whether source language words occur in Machine Readable Dictionaries (MRDs), here bilingual (source to target language) dictionaries. MRDs are electronic versions of printed dictionaries, and may be general dictionaries, domain specific dictionaries, or a combination of both. After the text processing module, the verification module accepts the input query words {PSW1, PSW2…PSWn} and performs a database lookup to check whether each query word is directly present in the bilingual dictionary. The words found in the dictionary {PSW1, PSW2…PSWi} are passed to the Translation Module, and the words not found in the dictionary {PSW1, PSW2…PSWj} are passed to the Transliteration Module.

Translation Module: In this paper, the language of the user query is English and the documents considered for retrieval are in Tamil and Telugu. These documents are a set of articles based on computer science terminology; hence we have concentrated on queries with computer terminology. This module, also called the machine translation module, follows the dictionary based translation method, which translates the query words using the bilingual dictionaries. Words found in the dictionaries are called vocabulary words. We have developed English-Tamil and English-Telugu bilingual dictionaries that contain most of the words related to the computer science domain. The dictionaries had to be built from scratch as no resource was available for this domain. After each intermediate step in the Morphological Analyzer, the extracted word is looked up in the bilingual dictionary to check whether it is a root word. If it is found, the meaning of the word is returned. If not, the word is passed on to the subsequent stages of the Morphological Analyzer. The words which are not found in the dictionary are called Out Of Vocabulary (OOV) words. This module takes {PSW1, PSW2…PSWi} as input and translates them into target

language query words {TDW1, TDW2…TDWi} using the bilingual dictionary. The output of this module is the translated dictionary words {TDW1, TDW2…TDWi}.

[Figure 1 (flow diagram): Source Query -> Text Processing -> Processed Source Query -> Verification Module, which checks each word against the Bilingual Dictionary (Source to Target). Words included in the dictionary go to Translation Using Bilingual Dictionary; words not included go to the Transliteration Process. Both outputs feed Target Query Formation -> Searching and Retrieving over the Target Documents Collection -> Ranking Topmost Documents -> Target Result (TR).]

Input Information Flow
I1 = {SQ} = {Source Language Query}
I2 = {PSQ} = {PSW1, PSW2…PSWn} = {Processed Source Language Query Words}
I3 = {PSW1, PSW2…PSWi} = {Words Found in the Bilingual Dictionary}
I4 = {PSW1, PSW2…PSWj} = {Words Not Found in the Bilingual Dictionary}
I5 = {TQ} = {Target Language Query}
I6 = {TD1, TD2…TDk} = {Retrieved Relevant Target Documents}

Output Information Flow
O1 = {PSQ} = {Processed Source Query}
O2 = {PSW1, PSW2…PSWi} = {Processed Source Language Query Words}
O3 = {TDW1, TDW2…TDWi} = {Translated Words Found in the Bilingual Dictionary}
O4 = {TDW1, TDW2…TDWj} = {Transliterated Words Not Found in the Bilingual Dictionary}
O5 = {TD1, TD2…TDk} = {Retrieved Relevant Target Documents}
O6 = {TR} = {TRD1, TRD2…TRDm} = {Topmost m Relevant Ranked Documents}

Figure 1. System Architecture

Transliteration Module: The translation module can handle only vocabulary words, not OOV words. Previous studies suggest that OOV words must be handled properly; otherwise the retrieval performance of a CLIR system can drop by up to 60% [11]. OOV terms can be of many types: newly formed words, loan words, abbreviations, domain specific terms, etc. One possible and effective way of handling OOV terms is transliteration. Transliteration is the process of transforming a word in a source language into a target language without the aid of a resource like a bilingual dictionary.

This work follows the grapheme based transliteration model [12], which is one of the major transliteration techniques. A grapheme is the basic unit of written language, or its smallest contrastive unit. In the grapheme based model, the spelling of the original string is the basis for transliteration. It is referred to as the direct method because it transforms source language graphemes directly into target language graphemes without any phonetic knowledge of the source language words. This module takes {PSW1, PSW2…PSWj} as input. It is designed with the following steps: dividing

each word into characters, character level alignment, mapping with an intermediate target scheme, and generation of each target word. First, it divides each processed word not found in the dictionary {PSW1, PSW2…PSWj} into individual characters. Next, it applies character level alignments to {PSW1, PSW2…PSWj}, which include level wise alignments. After alignment, it uses an intermediate (Roman) mapping scheme, suited to the current character level alignments, for mapping to target language characters. Using the intermediate scheme, it maps the individual characters of the source word to individual target characters and generates the complete target word. Each target word is generated in the same fashion. The output of this module is the transliterated target words for the words not found in the dictionary {TDW1, TDW2…TDWj}. For example, the transliteration of the English term "data mining" into Telugu is shown in Table I, and the transliteration of the English word "program" into Tamil is shown in Table II. The English word is scanned from left to right and divided according to character level alignments based on the target language (Telugu or Tamil). The final Telugu or Tamil word is generated based on the Romanization scheme. That is, "data mining" is transliterated into "డేట మైనింగ్" in Telugu and "program" is transliterated into "ப்ரரொக்ரொம்" in Tamil.

TABLE I. ENGLISH TO TELUGU TRANSLITERATION EXAMPLE

English word    Transliteration in Telugu
Da              డే
Ta              డేట
Mi              డేట మై
Ni              డేట మైని
N               డేట మైనిన్
G               డేట మైనింగ్

TABLE II. ENGLISH TO TAMIL TRANSLITERATION EXAMPLE

English word    Transliteration in Tamil
p               ப்
ro              ப்ரரொ
g               ப்ரரொக்
ra              ப்ரரொக்ர
m               ப்ரரொக்ரம்

Retrieval and Ranking Module: This module searches for and retrieves relevant target documents for a given query. The translated words {TDW1, TDW2…TDWi} and the transliterated words {TDW1, TDW2…TDWj} together form the target language query (TQ). To merge them we use a simple array-based technique: each word in the query is numbered, and once translation and transliteration are complete the resulting words are arranged in the original order. Using TQ, target language documents can be retrieved. Target language documents are collected online and stored in the database. For a given TQ, the search process retrieves the relevant target documents {TD1, TD2…TDk} from the target document collection. The search process uses an index, which is a simple and fast way of retrieving relevant documents. The retrieved documents are passed to the ranking method [13] for final ranking, described below.

Given a pair of cross lingual queries $(q_e, q_{te})$ and $(q_e, q_{ta})$, we can extract the set of corresponding cross lingual document pairs and their click counts $\{(e_i, te_j), (C(e_i), C(te_j))\}$ and $\{(e_i, ta_j), (C(e_i), C(ta_j))\}$, where $i = 1, \ldots, N$ and $j = 1, \ldots, n$. Based on that, we produce a set of cross lingual ranking instances $S = \{\Phi_{ij}, z_{ij}\}$, where each $\Phi_{ij} = \{x_i; y_j; s_{ij}\}$ is the feature vector of $(e_i, te_j)$ (respectively $(e_i, ta_j)$) consisting of three components: $x_i = f(q_e, e_i)$ is the vector of monolingual relevancy features of $e_i$; $y_j = f(q_{te}, te_j)$ (respectively $f(q_{ta}, ta_j)$) is the vector of monolingual relevancy features of $te_j$ (respectively $ta_j$); and $s_{ij} = \mathrm{sim}(e_i, te_j)$ (respectively $\mathrm{sim}(e_i, ta_j)$) is the vector of cross-lingual similarities between $e_i$ and $te_j$ (respectively $e_i$ and $ta_j$). $z_{ij} = (C(e_i), C(te_j))$ (respectively $(C(e_i), C(ta_j))$) is the corresponding pair of click counts. The task is to select the optimal function that minimizes a given loss with respect to the order of ranked cross lingual document pairs. For each pair of cross lingual queries $(q_e, q_{te})$ and $(q_e, q_{ta})$, the documents were ranked using Lucene's BM25 algorithm (see footnote 1) as the similarity metric.

This ranking method selects the topmost m relevant documents {TRD1, TRD2…TRDm} from a given set of k documents {TD1, TD2…TDk}. {TRD1, TRD2…TRDm} are shown to the user as the final target result.

IV. EVALUATION RESULTS

We have used two cross lingual runs: E->Te and E->Ta. In this paper, the official English topics are used to retrieve Telugu and Tamil documents. The English topics are translated into Telugu and Tamil by the model described in Fig. 1. The target collection contains both Tamil and Telugu documents. Most of the documents (Tamil and Telugu) were collected from electronic news articles and cover only computer science topics. The remaining documents were collected from native language websites over a period of three to four months. The number of documents, total number of terms, number of unique terms and average document length of the collection are given in Table III.
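The query translation flow of Fig. 1 (text processing, verification against the dictionary, translation of vocabulary words, transliteration of OOV words, and order-preserving merging) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the stop word list, the toy dictionary with placeholder target strings, and the function names `preprocess` and `translate_query` are all assumptions, stemming and morphological analysis are omitted, and the transliteration step is passed in as a stub.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stop word list; the paper only gives "the, is, was, that" as examples.
STOP_WORDS = {"the", "is", "was", "that", "a", "an", "of", "in", "for"}


def preprocess(query: str) -> List[str]:
    """Tokenize, lowercase and drop stop words (no stemming in this sketch)."""
    tokens = re.findall(r"[A-Za-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def translate_query(
    query: str,
    dictionary: Dict[str, str],
    transliterate: Callable[[str], str],
) -> List[str]:
    """Build the target query, keeping the original word order."""
    words = preprocess(query)                 # {PSW1 ... PSWn}
    target: List[str] = [""] * len(words)     # one numbered slot per query word
    for position, word in enumerate(words):
        if word in dictionary:                # verification: word found
            target[position] = dictionary[word]      # translation module
        else:                                 # OOV word
            target[position] = transliterate(word)   # transliteration module
    return target


# Toy dictionary with placeholder targets instead of real Telugu entries.
toy_dictionary = {"network": "<te:network>", "security": "<te:security>"}
target_query = translate_query(
    "The security of the Kerberos network",
    toy_dictionary,
    lambda w: f"<translit:{w}>",              # stand-in for grapheme-based transliteration
)
print(target_query)
# ['<te:security>', '<translit:kerberos>', '<te:network>']
```

Writing each result into the slot of its source word is one simple way to realize the numbered, array-based merging described in the Retrieval and Ranking Module.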

1 http://nlp.uned.es/~jperezi/Lucene-BM25/
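Footnote 1 points to a BM25 implementation for Lucene. For readers unfamiliar with the scoring function, the sketch below shows the standard Okapi BM25 formula on tokenized documents. It is an assumption-laden illustration rather than the code used in the experiments: the parameter values k1 = 1.2 and b = 0.75, the smoothed non-negative IDF, and the helper names `bm25_scores` and `top_m` are choices made here, not settings reported by the authors.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq: Counter = Counter()
    for d in docs:
        doc_freq.update(set(d))               # number of documents containing each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays non-negative for very frequent terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_m(query: List[str], docs: List[List[str]], m: int) -> List[int]:
    """Return the indices of the m highest-scoring documents."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:m]
```

Here `top_m` plays the role of selecting the topmost m documents {TRD1, TRD2…TRDm} from the k retrieved documents.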

TABLE III. DETAILS OF DOCUMENTS COLLECTION

Number of documents        18659
Number of terms            5848964
Number of unique terms     84594
Average document length    85
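Retrieval quality on this collection is reported with MAP, P@10 and R-Precision. As a reference for how these metrics are commonly computed from a ranked result list and a set of relevant documents, a minimal sketch follows. The function names and the document identifiers in the example are hypothetical, and the definitions are the usual textbook ones, which may differ in detail from the evaluation tool the authors used.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for d in ranked[:r] if d in relevant) / r


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical ranked list for one query, with two relevant documents.
ranked_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
print(average_precision(ranked_docs, relevant_docs))   # (1/1 + 2/3) / 2
print(r_precision(ranked_docs, relevant_docs))          # 1 of the top 2 is relevant
```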

The test set for evaluating performance consists of 100 English queries: 50 queries are used for retrieving Telugu documents and the remaining 50 for retrieving Tamil documents. The documents in either language are ranked by relevance to the specified query. Performance is measured with Recall, MAP, P@10 and R-Precision, as shown in Table IV. The results show that the CLIR system retrieves relevant documents in either language for a given English query.

TABLE IV. PERFORMANCE EVALUATION RESULTS OF E-Te AND E-Ta EXPERIMENTS

Cross lingual runs    Recall    MAP       P@10      R-Precision
E-Te                  86%       0.3835    0.4631    0.3820
E-Ta                  84%       0.3665    0.4270    0.3647

The performance curves of the E-Te and E-Ta runs are depicted in Fig. 2 and Fig. 3 respectively. There is little difference between the two runs. The E-Te run outperforms the E-Ta run because the entries in the E-Te dictionary are better than those in the E-Ta dictionary; that is, the quality of the dictionary affects the performance of the system. For Telugu, we achieve a MAP of 0.3835 and for Tamil, a MAP of 0.3665. The recall level is 86% for Telugu and 84% for Tamil.

Figure 2. Average Precision vs Recall for E-Te run

Figure 3. Average Precision vs Recall for E-Ta run

V. CONCLUSION AND FUTURE WORK

A CLIR system was developed for the computer science domain. The system uses a dictionary-based approach for E-Te and E-Ta query translation. Transliteration is done using a simple grapheme based transliteration model. In future we plan to compare the performance of this query translation method with the document translation method. The system can be further extended to exhibit a dynamic learning approach, wherein any new word encountered in the transliteration process is added to the database by allowing the user to insert it dynamically along with its corresponding Tamil or Telugu transliteration.

REFERENCES

[1] T. Hedlund, E. Airio, H. Keskustalo, R. Lehtokangas, A. Pirkola, and K. Järvelin, "Dictionary-based cross-language information retrieval: Learning experiences from CLEF 2000-2002," In Information Retrieval, 2004.
[2] E. Waterhouse, "Building translation lexicons for proper names from the web," In Thesis, Department of Computer Science, University of Sheffield, 2003.
[3] C. Peters and P. Sheridan, "Multilingual information access," In ESSIR '00: Proceedings of the Third European Summer-School on Lectures on Information Retrieval-Revised Lectures, pages 51-80, London, UK, 2001. Springer-Verlag.
[4] P. Pingali and V. Varma, "Hindi and Telugu to English cross language information retrieval at CLEF 2006," In Working Notes for the CLEF 2006 Workshop (Cross Language Adhoc Task), 20-22 September, Alicante, Spain.
[5] J. Jagadeesh and K. Kumaran, "Cross-lingual information retrieval system for Indian languages," Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, pages 80-87.
[6] P. Pingali and V. Varma, "Experiments in cross language query focused multi-document summarizations," Workshop on Cross Language Information Access CLIA-2007, International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[7] D. Mandal, S. Dandapat, M. Gupta, P. Banerjee, and S. Sarkar, "Bengali and Hindi to English cross-language text retrieval under limited resources," In the working notes of CLEF 2007.
[8] M. Ranbeer, P. Nikita, P. Prasad, and V. Vasudeva, "Experiments in cross-lingual IR among Indian languages," International Workshop on Cross Language Information Processing (CLIP-2007), 2007.

[9] R. Seethalaksmi, A. Ankur, and R. Ranjit, "Language independent information retrieval from web," ULIB Conference 2007.
[10] D. Thenmozhi and C. Aravindan, "Tamil-English cross lingual information retrieval system for agriculture society," International Forum for Information Technology in Tamil (INFITT), Tamil International Conference 2009.
[11] D. Demner-Fushman and D. W. Oard, "The effect of bilingual term list size on dictionary-based cross-language information retrieval," In 36th Annual Hawaii International Conference on System Science (HICSS'03), pp. 108-118, 2003.
[12] P. Majumder, M. M. Swapan parui, and P. Bhattacharyya, "Initiative for Indian Language IR Evaluation," Invited paper in EVIA 2007 Online Proceedings.
[13] W. Gao, J. Blitzer, M. Zhou, and K. F. Wong, "Exploiting Bilingual Information to Improve Web Search," Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1075-1083, Suntec, Singapore, 2-7 August 2009.
[14] J. H. Oh and K. S. Choi, "An ensemble of transliteration models for information retrieval," Information Processing and Management: an International Journal, v.42 n.4, pp. 980-1002, July 2006.
