
Cross-Language Information Retrieval
Presented by:
Namita Singh
B.Tech 3rd year CS
GLA University

What is CLIR?
Cross-language information retrieval (CLIR) is a
subfield of information retrieval dealing with
retrieving information written in a language
different from the language of the user's query. For
example, a user may pose their query in English
but retrieve relevant documents written in French.

Multilingual Collections
There are 6,703 languages listed in the Ethnologue
Digital libraries
The OCLC Online Computer Library Center serves more
than 17,000 libraries in 52 countries and holds over
30 million bibliographic records, with over 500 million
ownership records attached, in more than 370 languages

World Wide Web


Around 40% of Internet users do not speak English,
yet 80% of Web sites are still in English

The General Problem


Find documents written in any language

Using queries expressed in a single language


Traditional IR identifies relevant documents in the same
language as the query (monolingual IR)

Cross-language information retrieval (CLIR) tries to identify
relevant documents in a language different from that of the query

This problem is increasingly acute for IR on the Web, because
the Web is a truly multilingual environment

Why is CLIR important?

Global Internet User Population

[Figure: pie charts of Internet users by language, 2000 and 2005. Categories: English, Chinese, Spanish, Japanese, German, French, Korean, Italian, Portuguese, Scandinavian, Dutch, Other.]

Source: Global Reach

Importance of CLIR
CLIR research is becoming more and more
important for global information exchange and
knowledge sharing.
National Security
Foreign Patent Information Access
Medical Information Access for Patients

CLIR is Multidisciplinary
CLIR involves researchers from the following fields:
information retrieval, natural language processing,
machine translation and summarization, speech processing,
document image understanding, and human-computer interaction

Why Do Cross-Language IR?


When users can read several languages

Eliminates multiple queries


Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images

CLIR Experimental System


Two systems:

SMART is an information retrieval system modified to
work with 11 European languages (Danish, Dutch,
English, Finnish, French, German, Italian, Norwegian,
Portuguese, Spanish, Swedish)

TAPIR is a language model IR system written by M.
Srikanth. It has been adapted to work with 12 different
European languages (Danish, Dutch, English, Finnish,
French, German, Italian, Norwegian, Portuguese,
Russian, Spanish, Swedish)

Approaches to CLIR


Design Decisions
What to index?

Free text or controlled vocabulary


What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary, ontology, training corpus


Cross-Language Text Retrieval


Query Translation

Controlled Vocabulary

Dictionary-based

Document Translation

Free Text

Corpus-based

Problems with CLIR


Morphological processing difficult for some languages (e.g. Arabic)
Many different encodings for Arabic
Windows Arabic (e.g. dictionaries)
Unicode (UTF-8) (e.g. corpus)
Macintosh Arabic (e.g. queries)
Normalization
Remove diacritics
to Arabic (language)
Standardize spellings for foreign names
vs Kleentoon vs Klntoon for Clinton
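
The encoding and diacritic points above can be illustrated with a minimal Python sketch. It assumes the pairing of sources and encodings given in the list (dictionary in Windows Arabic, corpus in UTF-8, queries in Macintosh Arabic). The names `SOURCE_ENCODINGS`, `decode_source`, `strip_diacritics`, and `normalize` are hypothetical and are not taken from any CLIR system described in these slides.

```python
import unicodedata

# Python codec names for the three encodings listed above.
# The source-to-encoding pairing is an assumption taken from the slide.
SOURCE_ENCODINGS = {
    "dictionary": "cp1256",     # Windows Arabic
    "corpus": "utf-8",          # Unicode (UTF-8)
    "queries": "mac_arabic",    # Macintosh Arabic
}


def decode_source(raw: bytes, source: str) -> str:
    """Decode bytes from a known source into a Unicode string."""
    return raw.decode(SOURCE_ENCODINGS[source])


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which include the Arabic
    short-vowel marks and shadda. This is a blunt rule: it removes
    every combining mark, not only Arabic ones."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def normalize(raw: bytes, source: str) -> str:
    """Bring text from any of the three sources into one comparable form."""
    return strip_diacritics(decode_source(raw, source))


# A vocalized spelling of the word for "Arabic", written with escapes
# so the example does not depend on the file's own encoding.
vocalized = "\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651"
plain = "\u0639\u0631\u0628\u064a"

for source, encoding in SOURCE_ENCODINGS.items():
    raw = vocalized.encode(encoding)
    print(source, normalize(raw, source) == plain)
```

Once every source is decoded to Unicode and stripped of diacritics, dictionary entries, corpus tokens, and query terms can be compared directly. Spelling variants of foreign names are a separate problem that this sketch does not address.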

Problems with CLIR (contd)


The problem of translation ambiguity
The second problem that CLIR tasks have to face
is inflection, especially in Western languages.
The out-of-vocabulary (OOV) word refers to a
word or a phrase that cannot be found in a
dictionary. Cross language information retrieval
tasks are significantly affected by OOV
words/terms.
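
To make the ambiguity, inflection, and OOV problems concrete, here is a small Python sketch of dictionary-based query translation. The English-to-French dictionary `TOY_DICT` and the function `translate_query` are hypothetical and purely illustrative. A real system would use a full bilingual lexicon and, most likely, a stemmer.

```python
# Hypothetical toy English-to-French dictionary (illustrative entries only).
TOY_DICT = {
    "bank": ["banque", "rive"],   # ambiguous: financial bank vs. river bank
    "money": ["argent"],
}


def translate_query(query, dictionary):
    """Look up each query term; return (translations, oov_terms).

    A term with several dictionary entries keeps all of them, which is
    where translation ambiguity enters. A term with no entry is OOV.
    """
    translations = {}
    oov_terms = []
    for term in query.lower().split():
        if term in dictionary:
            translations[term] = dictionary[term]
        else:
            oov_terms.append(term)
    return translations, oov_terms


translations, oov_terms = translate_query("bank money Clinton", TOY_DICT)
print(translations)  # "bank" carries two candidate translations
print(oov_terms)     # the proper name is not in the dictionary

# Without stemming, an inflected form is also treated as OOV.
print(translate_query("banks", TOY_DICT))
```

In this sketch "bank" yields two candidates, the name falls out as OOV, and the plural "banks" is missed entirely because nothing reduces it to its dictionary form.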

Problems with CLIR (contd)


Correct phrase translation is another problem in CLIR:
a phrase often cannot be translated word by word.
Correct recognition of named entities (NEs) plays
an important role in improving the performance of
CLIR.
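
The phrase problem can be sketched in Python with a greedy longest-match lookup, which is one simple option among several and is not claimed to be the method used by any system in these slides. The lexicons `PHRASES` and `WORDS` and the function `translate_tokens` are hypothetical. Passing unknown tokens through unchanged is an assumed fallback for names, not a substitute for real named-entity recognition.

```python
# Hypothetical toy English-to-French lexicons (illustrative entries only).
PHRASES = {("hot", "dog"): "hot-dog"}
WORDS = {"hot": "chaud", "dog": "chien", "eat": "manger"}


def translate_tokens(tokens, phrases, words, max_len=3):
    """Translate a token list, preferring the longest known phrase.

    Tokens with no entry at all are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            chunk = tuple(tokens[i:i + n])
            if n > 1 and chunk in phrases:
                out.append(phrases[chunk])
                i += n
                break
            if n == 1:
                out.append(words.get(chunk[0], chunk[0]))
                i += 1
    return out


tokens = ["eat", "hot", "dog"]
print(translate_tokens(tokens, PHRASES, WORDS))  # phrase-aware
print(translate_tokens(tokens, {}, WORDS))       # word by word
```

With the phrase table the result keeps "hot-dog" as one unit. With an empty phrase table the same input becomes the literal "chaud chien", which is the word-by-word failure described above.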


CLIR better than IR?


How can cross-language beat within-language?
We know there are translation errors
Surely those errors should hurt performance
Hypothesis is that translation process may disambiguate some query terms
Words that are ambiguous in Arabic may not be ambiguous in English
Expansion during translation from English to Arabic prevents the ambiguity from re-appearing
It has been proposed that CLIR is a model for IR
Translate the query into another language and then back to the original
Given the hypothesis, this should yield an improved query
Should be reasonable to do this across many different languages
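
The round-trip idea can be sketched as follows in Python. The toy dictionaries `EN_TO_FR` and `FR_TO_EN` and the function `round_trip_expand` are hypothetical. The sketch uses an English and French pair only for readability, and it shows the mechanics of the expansion, not evidence that retrieval improves.

```python
# Hypothetical toy dictionaries (illustrative entries only).
EN_TO_FR = {"car": ["voiture", "automobile"]}
FR_TO_EN = {"voiture": ["car"], "automobile": ["car", "automobile"]}


def round_trip_expand(query_terms, forward, backward):
    """Translate each term to the other language and back again,
    collecting any new terms that return, in first-seen order."""
    expanded = list(query_terms)
    for term in query_terms:
        for foreign in forward.get(term, []):
            for back in backward.get(foreign, []):
                if back not in expanded:
                    expanded.append(back)
    return expanded


print(round_trip_expand(["car"], EN_TO_FR, FR_TO_EN))
```

Here the original term comes back together with a synonym picked up during translation. Terms with no dictionary entry are left as they were.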
