
Information Processing and Management 45 (2009) 438–451


A lemmatization method for Mongolian and its application to indexing for information retrieval

Badam-Osor Khaltar, Atsushi Fujii *

Graduate School of Library, Information and Media Studies, University of Tsukuba, 1-2 Kasuga, Tsukuba 305-8550, Japan

Article info

Article history:
Received 19 May 2008
Received in revised form 9 September 2008
Accepted 24 January 2009
Available online 5 March 2009

Keywords:
Lemmatization
Mongolian language
Natural language processing
Information retrieval

Abstract

In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

In Mongolian, two different alphabets are used, Cyrillic and Mongolian. While the Cyrillic alphabet is mainly used in Mongolia, the Mongolian alphabet is mainly used in the Inner Mongolian Autonomous Region of China. Depending on the alphabet used, the writing system is also different in Mongolian.
In the Mongolian writing system for the Cyrillic alphabet, each sentence is segmented on a phrase-by-phrase basis. A phrase consists of a content word, such as a noun or a verb, and one or more suffixes, such as postpositional particles. Content words can potentially be inflected when concatenated with suffixes. However, in the Mongolian writing system for the Mongolian alphabet, each sentence is segmented on a word-by-word basis and content words are not inflected.
Identifying the original forms of content words is crucial for natural language processing and information retrieval. In the
Cyrillic alphabet system, identifying the original forms of content words is more complex than in the Mongolian alphabet
system, because content words are concatenated with suffixes, and can be inflected. Therefore, in this paper we focus solely
on the Mongolian language that uses the Cyrillic alphabet, which will simply be termed ‘‘Mongolian” hereafter.
In information retrieval, normalizing index terms can involve either lemmatization or stemming. Lemmatization identi-
fies the original form of an inflected word, whereas stemming identifies a stem, which is not necessarily a word.
Currently, Web pages in Mongolian that include only inflected forms of a query term cannot be retrieved. This implies
that existing search engines do not perform lemmatization or stemming when indexing Mongolian Web pages.
In this paper, we propose a lemmatization method for Mongolian and apply our method to indexing for information retrieval. Existing lemmatization methods for Mongolian use predefined content word dictionaries. However, reflecting the rapid growth of science and technology, new words, such as loanwords and technical terms, are continually created. Because of

* Corresponding author. Tel.: +81 29 859 1401; fax: +81 29 859 1093.
E-mail addresses: khab23@slis.tsukuba.ac.jp (B.-O. Khaltar), fujii@slis.tsukuba.ac.jp (A. Fujii).

0306-4573/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2009.01.008

the limitations of manual compilation, new words are often out-of-dictionary words. New words are often nouns, and we
therefore propose a lemmatization method that does not rely on noun dictionaries.

2. Inflection types in Mongolian phrases

In Mongolian, nouns, adjectives, numerals, and verbs can be concatenated with suffixes. Nouns and adjectives are usually concatenated with a sequence comprising a plural suffix, case suffix, and reflexive possessive suffix. Numerals are concatenated with either a case suffix or a reflexive possessive suffix. Verbs are concatenated with various suffixes, such as aspect suffixes, participle suffixes, and mood suffixes.
Fig. 1 shows the inflection types for content words in Mongolian phrases. In (a), there is no inflection in the content word ‘‘yov (book)”, concatenated with the suffix ‘‘sy (the genitive case)”. In (b)–(d), the content words are inflected. In (b), either a consonant or a vowel can be inserted. In (c), the character ‘‘m” is replaced with the character ‘‘b”. In (d), a vowel is eliminated.
Loanwords, which can be nouns, adjectives, or verbs in Mongolian, can also be concatenated with suffixes. Because most loanwords are linguistically different from conventional Mongolian words, the suffix concatenation is also different from that for conventional Mongolian words. Therefore, exceptional rules are required for loanwords. For example, if the loanword ‘‘rovgm⁄Tep (computer)” is to be concatenated with an ablative case suffix, ‘‘oop” would normally be selected from the four ablative case suffixes (i.e., aap, oop, ''p, and p) according to Mongolian grammar. However, because ‘‘rovgm⁄Tep (computer)” is a loanword, the ablative case ‘‘''p” is selected instead of ‘‘oop”, resulting in the noun phrase ‘‘rovgm⁄Tep''p (by computer)”. In addition, inflection (d) of Fig. 1 never occurs for noun and adjective loanwords.

3. Related work

Ehara, Hayata, and Kimura (2004) and Jaimai, Zundui, Chagnaa, and Ock (2005) independently proposed morphological analysis methods for Mongolian that use manually produced dictionaries and rules.
Sanduijav, Utsuro, and Sato (2005) proposed a morphological analysis method for noun and verb phrases in Mongolian.
They manually produced inflection rules and concatenation rules for nouns and verbs. Then, they automatically produced a
phrase dictionary by aligning nouns or verbs with suffixes. Morphological analysis for phrases is performed by consulting
this dictionary.
Because these three lemmatization methods rely on predefined dictionaries, they cannot lemmatize out-of-dictionary
words, which are often loanwords or technical terms. Sanduijav et al. (2005) used predefined nouns and verbs in producing
a phrase dictionary.
Arguably, lemmatization and stemming are effective for indexing in information retrieval (Hull, 1996; Porter, 1980).
Stemming methods have been proposed for a number of agglutinative languages, including Malay (Tai, Ong, & Abdullah,
2000), Indonesian (Vega & Bressan, 2001), Finnish (Laurikkala, Järvelin, & Juhola, 2004), Arabic (Larkey, Ballesteros, & Connel,
2002), Swedish (Carlberger, Dalianis, Hassel, & Knutsson, 2001), Slovene (Popovic & Willett, 1992), and Turkish (Ekmekçioglu,
Lynch, & Willett, 1996).
Xu and Croft (1998) and Melucci and Orio (2003) independently proposed a language-independent method for stemming,
which analyzes a corpus in a target language and identifies an equivalent class comprising an original form, inflected forms,
and derivations. However, because none of these methods can identify the original form in each class, they cannot be used
for natural language applications where word occurrences must be standardized in terms of their original form.
To date, no attempt has been made to apply lemmatization or stemming to information retrieval for Mongolian. Our research is the first significant effort to address this problem.

Fig. 1. Inflection types for content words and suffixes in Mongolian.



4. Methodology

4.1. Overview

Fig. 2 shows an overview of our lemmatization method for Mongolian. Our method consists of two segments, identified
by the dashed boxes in Fig. 2, namely ‘‘lemmatization for verb phrases” and ‘‘lemmatization for noun phrases”.
A problem arises when we target both noun and verb phrases. There are suffixes that can concatenate with either verbs or
nouns, but the inflection type can be different depending on the part of speech. As a result, it is possible for verb phrases to be
incorrectly lemmatized as noun phrases, and vice versa.
Because new verbs are created less frequently than nouns, we use a verb dictionary, but not a noun dictionary. We first
lemmatize an input phrase as a verb phrase, and check if the extracted content word is defined in our verb dictionary. If the
content word is not defined in the verb dictionary, we lemmatize the input phrase as a noun phrase.
The suffixes concatenated with adjectives are the same as those with nouns, and the suffixes concatenated with numerals are also concatenated with nouns. Therefore, the lemmatization method for noun phrases can also be used for adjective and numeral phrases without any modifications. We use the term ‘‘lemmatization for noun phrases” to refer to the lemmatizations for noun, adjective, and numeral phrases.
We briefly explain our lemmatization process using Fig. 2. We consult a ‘‘verb suffix dictionary” and perform backward
partial matching to check whether a suffix is concatenated at the end of a phrase. If a suffix is detected, we use a ‘‘verb suffix
segmentation rule” to remove the suffix and extract the content word. We use a ‘‘vowel insertion rule” to check whether
vowel elimination occurred in the content word, and insert the eliminated vowel if necessary. These processes are repeated
until the residue of the phrase does not match any of the entries in the verb suffix dictionary. If the resultant content word is
defined in a ‘‘verb dictionary”, we output the content word as a verb and terminate the lemmatization process.
The following (1)–(9) are example processes for the input verb phrase ‘‘adxx-aal (take + complete suffix + perfect suffix)”.

(1) ‘‘aal” at the end of ‘‘adxxaal” is detected as a suffix.


(2) The suffix ‘‘aal” is removed and ‘‘adxx” is extracted.
(3) The eliminated vowel ‘‘b” is inserted and ‘‘adxx” becomes ‘‘adxbx”.
(4) ‘‘xbx” at the end of ‘‘adxbx” is detected as a suffix.

Fig. 2. Overview of our lemmatization method.



(5) The suffix ‘‘xbx” is removed and ‘‘ad” is extracted.


(6) No vowel is inserted into ‘‘ad”.
(7) No suffix is detected in ‘‘ad”.
(8) ‘‘ad (take)” is found in the verb dictionary.
(9) Output ‘‘ad” as the original form of a verb.

If the content word is not defined in the verb dictionary, we return to the input phrase and perform lemmatization for
noun phrases. We consult a ‘‘noun suffix dictionary” to determine if one or more suffixes are concatenated at the end of
the target phrase.
We use a ‘‘loanword identification rule” to check if the input phrase includes a loanword before using a ‘‘noun suffix segmentation rule” and ‘‘vowel insertion rule”. If the phrase does not include a loanword, we use a ‘‘noun suffix segmentation rule” to remove the suffixes and extract the content word. However, if the phrase includes a loanword, we use exceptional segmentation rules to extract the loanword.
As in the lemmatization for verb phrases, we use the ‘‘vowel insertion rule” to check if vowel elimination has occurred in
the content word, and insert the eliminated vowel if necessary. We use the same vowel insertion rule for both noun and verb
phrases. However, as explained in Section 2, vowel elimination never occurs in noun loanwords. Therefore, if the input
phrase includes a loanword, we do not use the vowel insertion rule.
If the input phrase does not match any of the entries in the noun suffix dictionary, we determine that a suffix is not concatenated, and we output the phrase as is.
The inflection types (b) and (c) of Fig. 1 are processed by the verb suffix segmentation rule and noun suffix segmentation
rule. The inflection type (d) of Fig. 1 is processed by the vowel insertion rule.
We elaborate on the above dictionaries and rules in Sections 4.2–4.8.
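The verb-first, noun-fallback control flow of Fig. 2 can be sketched as follows. This is a toy illustration rather than the authors' implementation: the suffix inventories and the verb dictionary are Latin-alphabet placeholders, and the vowel-insertion and loanword steps are omitted for brevity.

```python
VERB_SUFFIXES = {"ed", "ing"}   # placeholder suffixes, not real Mongolian
VERB_DICT = {"take", "renew"}   # placeholder verb dictionary
NOUN_SUFFIXES = {"s", "es"}     # placeholder noun suffixes

def strip_suffixes(word, suffixes):
    """Repeatedly remove any suffix matching at the end of the word
    (backward matching, longest suffix first), until none matches."""
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                word = word[:-len(suf)]
                changed = True
                break
    return word

def lemmatize(phrase):
    # Stage 1: lemmatize as a verb phrase and consult the verb dictionary.
    candidate = strip_suffixes(phrase, VERB_SUFFIXES)
    if candidate in VERB_DICT:
        return candidate, "verb"
    # Stage 2: fall back to noun-phrase lemmatization on the original phrase.
    return strip_suffixes(phrase, NOUN_SUFFIXES), "noun"
```

For example, `lemmatize("takeed")` strips the verb suffix, finds the residue in the verb dictionary, and outputs a verb, while `lemmatize("books")` falls through to the noun stage.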

4.2. Verb suffix dictionary

We produced a verb suffix dictionary, which contains 126 orthographic forms of suffixes that can concatenate with verbs. Our verb suffix dictionary does not include derivational suffixes that form a verb stem. Fig. 3 shows a fragment of our verb suffix dictionary on a type-by-type basis, in which suffixes corresponding to the same suffix type represent the same meaning. In Fig. 3, inflected forms of suffixes are shown in parentheses; these are also counted as individual suffixes among the 126 suffixes.

4.3. Verb suffix segmentation rule

For the verb suffix segmentation rule, we produced 179 rules. There are one or more segmentation rules for each of the
126 verb suffixes mentioned in Section 4.2.

Fig. 3. Fragment of the verb suffix dictionary.



Fig. 4. Fragment of the verb suffix segmentation rule.

Fig. 4 shows a fragment of the verb suffix segmentation rule for the suffix ‘‘d (past)”. In the column ‘‘Segmentation rule”,
the condition of each ‘‘If” sentence is a phrase ending. ‘‘V” and ‘‘C” refer to a vowel and a consonant, respectively, and ‘‘*”
refers to any string. ‘‘C9” refers to any of the nine consonants (‘‘w”, ‘‘;”, ‘‘ä”, ‘‘c”, ‘‘l”, ‘‘T”, ‘‘i”, ‘‘x”, and ‘‘x”), and ‘‘C7” refers
to any of the seven consonants (‘‘v”, ‘‘y”, ‘‘u”, ‘‘k”, ‘‘,”, ‘‘d”, and ‘‘p”). If a condition is satisfied, we perform the corresponding
action. For example, because the verb phrase ‘‘iby'xk'd (renew þ past)” satisfies the first condition, we remove the suffix ‘‘d”
and the preceding vowel ‘‘'” to extract ‘‘iby'xk”.
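Each entry of the segmentation rule pairs a condition on the phrase ending with a removal action, applied first-match-first. A minimal sketch, with a Latin-alphabet vowel class standing in for the Cyrillic one; the rule pair below mimics ‘‘remove the suffix and the preceding vowel” versus ‘‘remove the suffix only”:

```python
import re

# Placeholder vowel class; the paper's V, C9, and C7 sets are Cyrillic.
V = "aeiou"

# Each rule: (pattern that the phrase ending must satisfy,
#             number of trailing characters to remove).
RULES_FOR_D = [
    (re.compile(f"[{V}]d$"), 2),  # vowel + 'd': drop suffix and preceding vowel
    (re.compile("d$"), 1),        # bare 'd': drop the suffix only
]

def apply_segmentation(phrase, rules):
    """Apply the first rule whose condition matches the phrase ending."""
    for pattern, n_remove in rules:
        if pattern.search(phrase):
            return phrase[:-n_remove]
    return phrase  # no condition satisfied: leave the phrase unchanged
```

This mirrors the example in the text, where the past-tense suffix and the preceding vowel are removed together when the ending satisfies the first condition.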

4.4. Vowel insertion rule

The vowel insertion rule determines if a vowel should be inserted and, if so, which vowel. First, to determine if a vowel should be inserted, we check whether vowel elimination has occurred in the target content word. To achieve this, we examine the last two characters of the content word. If both of these are consonants, we determine that a vowel has been eliminated.
However, a number of content words inherently end with two consonants, and therefore we referred to a textbook on
Mongolian grammar (Ts, 2002) to produce four rules that determine when to insert a vowel. If the last two characters of
a content word are any of ‘‘C7 + C7”, ‘‘C9 + C7”, ‘‘C9 + C9”, or ‘‘C7 + x”, we insert a vowel between the two consonants.
‘‘C7” and ‘‘C9” are the same sets as those defined in Section 4.3. Second, to determine which vowel should be inserted,
we use the Mongolian vowel harmony rule. According to this rule, the vowels after the first syllable in a content word
are determined by the vowel of the first syllable in the content word. Therefore, we check the vowel of the first syllable
in the content word and insert the appropriate vowel.
Fig. 5 shows the vowels to be inserted by the vowel insertion rule. The position of the inserted vowel is always between
the last two consonants of a content word. For example, if ‘‘a”, ‘‘y”, ‘‘z”, or ‘‘⁄” occur in the first syllable of a word, the vowel
‘‘a” is inserted between the last two consonants of the word. However, there are exceptional rules for the long vowels ‘‘yy”,
‘‘YY”, ‘‘⁄y”, ‘‘⁄Y”, ‘‘zy”, ‘‘ëy”, ‘‘eY”, and ‘‘by”. If one of the long vowels ‘‘yy”, ‘‘⁄y”, ‘‘zy”, ‘‘ëy”, or ‘‘by” is in a word, the vowel ‘‘a” is
inserted, irrespective of the vowel in the first syllable of the word. If one of the long vowels ‘‘YY”, ‘‘⁄y”, or ‘‘eY” is in a word, the
vowel ‘‘'” is inserted, irrespective of the vowel in the first syllable of the word.
In addition, if the last two characters of a word are any of ‘‘; + Consonant”, ‘‘x + Consonant”, or ‘‘i + Consonant”, the vowel ‘‘b” is inserted between the characters. For example, because the last two characters of ‘‘iby'xk”, which was extracted in Section 4.3, are ‘‘x + Consonant”, we insert the vowel ‘‘b” between the last two characters and output the verb ‘‘iby'xbk (renew)”.
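The two decisions above — whether to insert a vowel and which vowel — can be sketched as follows. This is a toy illustration, not the authors' implementation: the consonant classes C7/C9 and the harmony table of Fig. 5 are replaced by small Latin-alphabet stand-ins, and the long-vowel and ‘‘C7 + x”-type exceptions are omitted.

```python
# Latin stand-ins for the Cyrillic consonant classes and harmony table.
C7 = set("mngbdr")
C9 = set("wzsclt")
HARMONY = {"a": "a", "u": "a", "o": "o", "e": "e", "i": "i"}  # toy Fig. 5

def needs_insertion(word):
    """Insert only when the last two characters form one of the
    consonant-class pairs C7+C7, C9+C7, or C9+C9 (simplified rule)."""
    if len(word) < 2:
        return False
    a, b = word[-2], word[-1]
    return (a in C7 and b in C7) or (a in C9 and b in C7) or (a in C9 and b in C9)

def insert_vowel(word):
    if not needs_insertion(word):
        return word
    # Vowel harmony: the inserted vowel is determined by the vowel of the
    # first syllable, i.e. the first vowel occurring in the word.
    first_vowel = next((ch for ch in word if ch in HARMONY), None)
    if first_vowel is None:
        return word
    # The insertion position is always between the last two consonants.
    return word[:-1] + HARMONY[first_vowel] + word[-1]
```

Under these toy classes, a word ending in two C7 consonants receives the harmony vowel of its first syllable between them, while a word ending in a vowel plus consonant is left untouched.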

Fig. 5. Vowels to be inserted by the vowel insertion rule.



4.5. Verb dictionary

Sanduijav et al. (2005) extracted 1254 verbs from a Japanese–Mongolian bilingual dictionary containing 7500 entries and
produced a verb dictionary for Mongolian. We use this dictionary as our verb dictionary in Fig. 2. Fig. 6 shows a fragment of
the verb dictionary.

4.6. Noun suffix dictionary

The noun suffix dictionary contains 35 orthographic forms of suffixes that can be concatenated with nouns. These suffixes are case suffixes, reflexive possessive suffixes, and plural suffixes. There are seven cases in Mongolian. Because the nominative case is not associated with any suffix, we do not consider the nominative case. Fig. 7 shows the entries in this dictionary, in which suffixes corresponding to the same suffix type represent the same meaning. In Fig. 7, inflected forms of suffixes are shown in parentheses; these are also counted as individual suffixes among the 35 suffixes.

4.7. Loanword identification rule

We produced the following three classes of rules for identifying loanwords in Mongolian text.
The first class of rules identifies those loanwords that follow the standard loanword grammar in Mongolian. A word
including any of the consonants ‘‘r”, ‘‘g”, ‘‘a”, or ‘‘o” is extracted as a loanword. These consonants are usually used to spell
out foreign words.
The second class of rules identifies the following words that violate the standard Mongolian grammar as loanwords.

 A word violating the Mongolian vowel harmony rule.

(Because of the vowel harmony rule, a word that includes both feminine and masculine vowels, and which is not based on
the Mongolian phonetic system, is probably a loanword).

 A word beginning with two consonants.

(A conventional Mongolian word does not begin with two consonants).

 A word ending with two particular consonants.

(A word having as its penultimate character one of ‘‘g”, ‘‘,”, ‘‘T”, ‘‘w”, ‘‘x”, ‘‘ä”, or ‘‘i” and its final character a consonant
violates Mongolian grammar, and is probably a loanword).
The third class of rules is based on heuristics that identify the following words as loanwords.

 A word beginning with the consonant ‘‘d”.

Fig. 6. Fragment of the verb dictionary.



Fig. 7. Entries in the noun suffix dictionary.

(In one Modern Mongolian dictionary (Ozawa, 2000), there are 54 words beginning with ‘‘d”, of which 31 are loanwords.
Therefore, a word beginning with ‘‘d” is probably a loanword).

 A word beginning with the consonant ‘‘p”.

(In this Modern Mongolian dictionary, there are 49 words beginning with ‘‘p”, of which 45 are loanwords. Therefore, a
word beginning with ‘‘p” is probably a loanword).
A preliminary study (Khaltar, Fujii, & Ishikawa, 2006) showed that the precision and recall of loanword identification using the above rules were 92.7% and 84.2%, respectively. Although supervised machine-learning methods have been proposed for identifying loanwords in Korean corpora (Myaeng & Joeng, 1999; Oh & Choi, 2001), we leave the application of these methods to Mongolian for future work.
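The three rule classes can be combined into a single predicate, checked in order. The character sets below are illustrative Latin-alphabet stand-ins; the actual rules operate on the Cyrillic letters listed above.

```python
FOREIGN_LETTERS = set("kpfq")   # stand-in: letters used only to spell foreign words
MASCULINE = set("aou")          # toy split of the vowel-harmony classes
FEMININE = set("ei")
VOWELS = MASCULINE | FEMININE
HEURISTIC_INITIALS = set("vj")  # stand-in: initial letters that mostly begin loanwords

def is_loanword(word):
    # Class 1: contains a letter used only in foreign spellings.
    if FOREIGN_LETTERS & set(word):
        return True
    # Class 2a: vowel-harmony violation (both masculine and feminine vowels).
    if MASCULINE & set(word) and FEMININE & set(word):
        return True
    # Class 2b: begins with two consonants.
    if len(word) >= 2 and word[0] not in VOWELS and word[1] not in VOWELS:
        return True
    # Class 3: heuristic initial letters.
    if word and word[0] in HEURISTIC_INITIALS:
        return True
    return False
```

Ordering the cheap, high-precision grammar checks before the heuristics reflects the structure of the rule classes described above.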

4.8. Noun suffix segmentation rule

The noun suffix segmentation rule comprises 196 rules, of which 173 are produced for both conventional Mongolian
words and loanwords. The remaining 23 rules are exceptional rules that are produced only for loanwords.
Fig. 8 shows a fragment of the noun suffix segmentation rule for the locative–dative case suffixes ‘‘l” and ‘‘T”. In Fig. 8, ‘‘V”,
‘‘*”, ‘‘C9”, and ‘‘C7” are the same as those of Fig. 4. ‘‘C4” refers to any of the four consonants: ‘‘u”, ‘‘d”, ‘‘p”, or ‘‘c”. If a condition
is satisfied, we perform the corresponding action. For example, because the noun phrase ‘‘xYYxl'l (child + locative–dative)”
satisfies the last condition for the suffix ‘‘l”, we remove the suffix ‘‘l” and the preceding vowel ‘‘'” to extract ‘‘xYYxl”.
However, the 173 segmentation rules often incorrectly lemmatize loanwords that have different endings from conventional Mongolian words. We analyzed those incorrectly lemmatized loanwords. As a result, a large number of loanwords that
end with ‘‘-awb”, ‘‘-zwb”, or ‘‘-okoub” were incorrectly lemmatized. For example, if the phrases ‘‘'rokoubqH (of ecology)”,
‘‘byneupawbqH (of integration)”, and ‘‘byakzwbac (from inflation)” are independently lemmatized by our method, the resulting

Fig. 8. Fragment of noun suffix segmentation rule.



Fig. 9. Fragment of the suffix segmentation rule for loanwords.

content words will be ‘‘'rokou”, ‘‘byTeupaw”, and ‘‘byakzw”, respectively, whereas the correct words are ‘‘'rokoub (ecology)”,
‘‘byTeupawb (integration)”, and ‘‘byakzwb (inflation)”, respectively.
Therefore, we produced 23 exceptional suffix segmentation rules for loanwords that end with ‘‘-awb”, ‘‘-zwb”, or ‘‘-okoub”. If
the loanword identification rule determines that the target phrase includes a loanword, we apply these 23 exceptional rules.
Fig. 9 shows a fragment of these exceptional rules. For example, for the loanword phrase ‘‘'rokoubqy (ecology + genitive)”,
Fig. 9 suggests that we use the segmentation rule for suffix ‘‘bqy (genitive)”. Therefore, we remove the suffix ‘‘bqy (genitive)”
and add ‘‘b” to the end of the noun. As a result, the noun ‘‘'rokoub (ecology)” is correctly extracted.

5. Experiments

5.1. Evaluation method

We evaluated our lemmatization method using two different corpora in Mongolian: newspaper articles and technical abstracts. Newspaper articles mainly contain general words, whereas technical abstracts frequently include loanwords and technical terms. We collected 183 newspaper articles from ‘‘Olloo”.1 There are 65,900 phrase tokens and 13,827 phrase types in the 183 newspaper articles. We collected 1102 technical abstracts from the ‘‘Mongolian IT Park”.2 There are 178,448 phrase tokens and 17,709 phrase types in the 1102 technical abstracts. Using these corpora, we experimentally evaluated the accuracy of our lemmatization method (Section 5.2) and its effectiveness for information retrieval (Section 5.3).

5.2. Evaluating lemmatization

Two Mongolian graduate students acted as assessors. Neither of the assessors was an author of this paper. The assessors
provided correct answers for lemmatization. The assessors also tagged each word with its part of speech.
Each assessor performed the same tasks independently. Differences can occur between assessors on a task. We measured the agreement between the assessors using the Kappa coefficient, which ranges from 0 to 1 and is calculated by Eq. (1):

Kappa = (Po − Pe) / (1 − Pe)    (1)
Po is the probability of the observed agreement between the two assessors with respect to performing lemmatization or tagging parts of speech. In practice, Po is the ratio of the number of phrase types for which the two assessors annotated the same result to the total number of target phrase types. Pe is the probability of the expected agreement occurring by chance. For each tag type i, we calculate the probability that the tag annotated by each assessor at random is i. Pe is the total of these probabilities for all tag types.
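Eq. (1) can be computed directly from the two assessors' per-phrase annotations; the sketch below is an illustrative implementation with toy tag labels.

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Kappa coefficient of Eq. (1) for two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Po: fraction of items on which the two annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: for each tag, the chance both annotators pick it at random,
    # summed over all tag types.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[t] / n) * (count_b[t] / n) for t in count_a)
    return (p_o - p_e) / (1 - p_e)
```

For instance, with annotations ["n", "n", "v", "v"] and ["n", "n", "v", "n"], the observed agreement is 0.75, the expected agreement is 0.5, and the Kappa coefficient is 0.5.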
For the newspaper articles, the Kappa coefficients for performing lemmatization and tagging parts of speech were 0.95
and 0.92, respectively. For the technical abstracts, the Kappa coefficients for performing lemmatization and tagging parts
of speech were 0.96 and 0.94, respectively. Therefore, the agreement between the two assessors was almost perfect (Landis
& Koch, 1977). However, to enhance the objectivity of the evaluation, we used only the phrases for which the two assessors
agreed with respect to the lemmatization and part of speech. As a result, we targeted 12,279 phrase types in the newspaper
articles and 15,478 phrase types in the technical abstracts. These phrases are noun, adjective, verb, and number phrases.

1. http://www.olloo.mn/ (February, 2008).
2. http://www.itpark.mn/ (February, 2008).

Phrases related to other parts of speech were not targeted. Our experiment is the largest evaluation of Mongolian lemmatization in the literature.
We were able to use the noun and verb dictionaries of Sanduijav et al. (2005). However, we were not able to use their
system. They reported that phrases having content words in their dictionaries could be lemmatized correctly. Therefore,
if a content word in an input phrase is in their dictionaries, we output the content word as their lemmatization result. If
not, we output the input phrase as their lemmatization result. To identify a content word in an input phrase, we used the
lemmatization result given by the assessors.
Consequently, we can compare the effectiveness of our method with that of Sanduijav et al. In addition, to show the effectiveness of applying the loanword identification rule to our lemmatization process, we compared two variations of our method, namely ‘‘the method without loanword identification (WOLWI)” and ‘‘the method with loanword identification (WLWI)”. We compared the lemmatization methods in terms of accuracy. Accuracy is the ratio of the number of phrase types correctly lemmatized by the method under evaluation to the total number of target phrase types.
Tables 1 and 2 show the results of lemmatization for the newspaper articles and the technical abstracts, respectively. Both tables show higher accuracy for WOLWI and WLWI than for Sanduijav et al., except for verbs. Because WOLWI, WLWI, and Sanduijav et al. used the same verb dictionary, the accuracy for verbs is the same for these methods. The accuracy for verbs was low because many target verbs were omitted from the verb dictionary and were therefore mistakenly lemmatized as noun phrases. This problem will be solved in the future by enhancing the verb dictionary. Overall, however, WOLWI and WLWI had higher accuracy than Sanduijav et al. In addition, the accuracy of WLWI was higher than that of WOLWI. Therefore, the loanword identification rule was effective in the lemmatization.
Fig. 10 shows our analysis of the errors produced by WLWI. ‘‘News” and ‘‘Tech” denote the newspaper articles and the
technical abstracts, respectively. In the column ‘‘Example”, the left side and the right side of an arrow denote an error
and the correct answer, respectively.
Error (a) occurred with nouns, adjectives, and numerals, in which the ending of a content word was mistakenly recognized as a suffix and was removed. Error (b) occurred because we did not consider nouns with irregular plural forms. Error (c) occurred for loanword nouns because the second class of the loanword identification rule in Section 4.7 was insufficient. Error (d) occurred because we relied on a verb dictionary. Error (e) occurred because we do not identify the parts of speech of words in context, and words with multiple parts of speech were always lemmatized as verbs if the verbs were in the verb dictionary. Therefore, a number of nouns and adjectives were incorrectly lemmatized as verbs.
For Errors (a)–(c), we have not found solutions. Error (d) can be solved by a future enhancement of the verb dictionary. At the same time, an enhanced verb dictionary will make Error (e) more crucial. However, if we were to use the parts of speech of words in context, we could solve Error (e). There are a number of automatic methods for tagging parts of speech (Brill, 1995; Charniak, Hendrickson, Jacobson, & Perkowitz, 1993). In those methods, words that co-occur with a multiple-part-of-speech word can be used as features for determining the part of speech of a target word in context.
For example, the multiple parts of speech word ‘‘xYp''k'y” can be used as the noun ‘‘xYp''k'y (park)” or as the inflected
verb ‘‘xYp''k + y (to surround + suffix)”, as in the following sentences.

 ”y' ,ok ayxys avmTys xYp''k'y ⁄v.(This is the first zoological park.)
 XYybq Yqk a;bkkauaays ykvaac ,blybqu xYp''k'y ,yq ,aquakm opxby, auaap vaylakl pxk kT op; ,aqya.
(Because of human activities, the environment and atmosphere that surround us have been changed.)

Table 1
Accuracy of lemmatization for the newspaper articles (%).

            #Phrase types   Sanduijav et al.   WOLWI   WLWI
Noun        6,694           69.7               87.4    89.5
Verb        4,842           34.5               34.5    34.5
Adjective   630             85.3               88.5    89.4
Numeral     113             60.1               91.2    91.2
Total       12,279          56.5               66.7    67.9

Table 2
Accuracy of lemmatization for the technical abstracts (%).

            #Phrase types   Sanduijav et al.   WOLWI   WLWI
Noun        13,016          57.6               87.7    92.5
Verb        1,797           24.5               24.5    24.5
Adjective   609             82.6               83.5    83.9
Numeral     56              41.1               80.4    81.2
Total       15,478          54.7               80.2    84.2

Fig. 10. Errors in our lemmatization method.

In the first sentence, ‘‘xYp''k'y” is used with the word ‘‘avmTys (zoological)”. Because ‘‘zoological” and ‘‘park” are often used together, the probability that ‘‘xYp''k'y” is used as a noun is high. In the second sentence, there are ‘‘,aquakm opxby (environment)” and ‘‘auaap vaylak (atmosphere)”, which are frequently used with the verb ‘‘to surround”. In addition, the word ‘‘,yq”, which is an auxiliary verb in Mongolian, often occurs immediately after verbs. Therefore, the possibility that ‘‘xYp''k'y” is used as a verb is high.
To use a statistical method to determine the part of speech in context, multiple-part-of-speech words must be included in a training corpus. At the same time, new words are not always included in the training corpus. However, because verbs are created less frequently than nouns, multiple-part-of-speech words associated with verbs will also not be created frequently. Thus, the out-of-dictionary problem is not crucial for multiple-part-of-speech words associated with verbs. In other words, extra verb or noun dictionaries are not required to resolve Error (e). This issue should be explored in future work.

5.3. Evaluating the effectiveness of lemmatization in information retrieval

We evaluated the effectiveness of our lemmatization methods in information retrieval by performing two experiments. In the first experiment, we evaluated the methods in terms of retrieval accuracy, using the mean average precision (MAP) (Manning, Raghavan, & Schütze, 2008) as the evaluation measure. MAP has frequently been used to evaluate the effectiveness of information retrieval. For a test query, average precision is the average of the precision values at the points at which each relevant document is retrieved. MAP is the average of the average precision values over all test queries.
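The definitions above can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks where each relevant
    document is retrieved."""
    relevant_ids = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: mean of average precision over all test queries.
    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking ["d1", "d2", "d3"] with relevant documents {"d1", "d3"}, the precisions at the relevant ranks are 1/1 and 2/3, giving an average precision of 5/6.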
Because no test collection for Mongolian information retrieval is available to the public, we used the 1102 technical abstracts to produce our test collection. Fig. 11 shows an example of a technical abstract; its title is ‘‘Advanced Albumin Fusion Technology” in English. Each technical abstract contains one or more keywords. In Fig. 11, the keywords include ‘‘wycys bqkl'c (blood serum)” and ‘‘'x'c (placenta)”.

Fig. 11. Example of a technical abstract.
We produced two different types of queries. First, using an individual keyword generates a ‘‘keyword query (KQ)”. Second, using a keyword list generates a ‘‘list query (LQ)”. There were 6.1 keywords per keyword list, on average. For a KQ, we considered an abstract that was annotated with the query keyword in the keyword field to be a relevant document. For an LQ, we considered an abstract that was annotated with all query keywords in the keyword field to be a relevant document. In this way, we were able to avoid relevance judgments. There were 4485 and 1075 queries for KQ and LQ, respectively. The average numbers of relevant documents were 1.50 and 1.00 for KQ and LQ, respectively.
The target documents were the 1102 technical abstracts, from which we extracted content words in the title, abstract,
and result fields as index terms. However, we did not use the keyword field, which was used to produce queries, for indexing
purposes. We used a variation of Okapi BM25 (Robertson, Walker, Jones, Hancock-Beaulieu, & Gatford, 1995) as the retrieval
model, which calculates the relevance score of a document to a query by Eq. (2), below. Instead of the inverse document fre-
quency (IDF) factor used in Okapi BM25, which takes a negative value when n_t is very large, we used the standard IDF.
\[
\sum_{t \in q} f_{t,q} \cdot \frac{(K+1) \cdot f_{t,d}}{K \cdot \left\{ (1-b) + b \cdot \frac{dl_d}{avgdl} \right\} + f_{t,d}} \cdot \log \frac{N}{n_t} \qquad (2)
\]

Here, f_{t,q} and f_{t,d} denote the frequency with which term t appears in query q and document d, respectively. N denotes the total
number of documents in a corpus and n_t denotes the number of documents containing term t. dl_d denotes the length of doc-
ument d in bytes, and avgdl denotes the average length of the documents in the corpus. We empirically set K = 2 and b = 0.8.
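Under these definitions, Eq. (2) can be sketched as follows (a minimal illustration in our own code; the function name and the dictionary representation of term frequencies are assumptions, and the paper's values K = 2 and b = 0.8 are used):

```python
import math

K, B = 2.0, 0.8  # values the paper sets empirically

def bm25_score(query_tf, doc_tf, doc_len, avgdl, N, df):
    """Relevance score of a document to a query, per Eq. (2).

    query_tf, doc_tf: term -> frequency in the query / document.
    doc_len: document length in bytes; avgdl: average document length.
    N: total number of documents; df: term -> document frequency n_t.
    """
    s = 0.0
    for t, f_tq in query_tf.items():
        f_td = doc_tf.get(t, 0)
        n_t = df.get(t, 0)
        if f_td == 0 or n_t == 0:
            continue
        norm = K * ((1 - B) + B * doc_len / avgdl) + f_td
        # Standard IDF, log(N / n_t), replaces the Okapi IDF that can go negative.
        s += f_tq * ((K + 1) * f_td / norm) * math.log(N / n_t)
    return s
```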
We also performed pseudo-relevance feedback (PRF) (Manning et al., 2008) for query expansion purposes, in which top
documents were collected in the initial retrieval and from those documents highly ranked terms are added to the original
query. The rank of each term was determined by the value of Eq. (3), which is a product of term frequency (TF) and IDF fac-
tors in Eq. (2).
\[
\frac{(K+1) \cdot f_{t,d}}{K \cdot \left\{ (1-b) + b \cdot \frac{dl_d}{avgdl} \right\} + f_{t,d}} \cdot \log \frac{N}{n_t} \qquad (3)
\]

Because additional terms can potentially introduce noise in the query, we multiply the value of Eq. (3) by a relative weight
against the terms in the original query. The relative weight is a parametric constant common to all additional terms. If the
relative weight is set to 1, the additional terms are considered to be as important as the original terms. However, the impor-
tance of additional terms decreases as the value of the relative weight decreases. In summary, PRF is associated with three
parameters: the number of top documents in the initial retrieval (nd), the number of terms added to the original query (nt),
and the relative weight of an additional term (tw). We determined the optimal values for these parameters experimentally.
Details of these values will be explained later.
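The PRF procedure described above can be sketched as follows (our own simplified code; aggregating Eq. (3) weights over the feedback documents by taking the maximum is one plausible reading, not a detail the paper specifies, and all names are ours):

```python
import math

K, B = 2.0, 0.8  # same constants as in Eq. (2)

def eq3_weight(f_td, doc_len, avgdl, N, n_t):
    """TF-IDF weight of Eq. (3) for one term in one feedback document."""
    norm = K * ((1 - B) + B * doc_len / avgdl) + f_td
    return ((K + 1) * f_td / norm) * math.log(N / n_t)

def expand_query(query_tf, ranked_docs, nd, nt_terms, tw, avgdl, N, df):
    """Add the nt_terms highest-weighted new terms from the top nd
    documents of the initial retrieval, each with relative weight tw.

    ranked_docs: list of (doc_tf, doc_len) pairs in retrieval order.
    """
    candidates = {}
    for doc_tf, doc_len in ranked_docs[:nd]:
        for t, f_td in doc_tf.items():
            if t in query_tf:
                continue  # only terms absent from the original query
            w = eq3_weight(f_td, doc_len, avgdl, N, df[t])
            # Aggregate over feedback documents by taking the maximum weight.
            candidates[t] = max(candidates.get(t, 0.0), w)
    expanded = dict(query_tf)
    for t, _ in sorted(candidates.items(), key=lambda kv: -kv[1])[:nt_terms]:
        expanded[t] = tw  # additional terms carry the relative weight tw
    return expanded
```

Setting tw = 1 makes the additional terms count as much as the original terms; smaller values down-weight them.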
In the first experiment, queries were limited to the keywords annotated in the documents. In addition, because newspa-
per articles do not contain keywords, we cannot perform the first experiment on the newspaper articles.
Therefore, to complement the first experiment, we performed the second experiment. If a query exists in a document in
its original form, the document can be retrieved without lemmatization. However, if a query exists in a document only in an
inflected form, the document cannot be retrieved without lemmatization. In the second experiment, we used ‘‘Query Cov-
erage (QC)”, proposed in this paper, as the evaluation measure. To calculate the QC for a corpus, we consider each doc-
ument in the corpus and calculate the ratio of the number of content word types correctly lemmatized to the total number of
content word types in the document. We calculate the QC for the corpus as the average of this ratio over all corpus docu-
ments. The value of QC, which ranges from 0 to 1, increases as the accuracy of the lemmatization improves. We used the
183 newspaper articles and the 1102 technical abstracts as two corpora.
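The QC computation reduces to a simple average of per-document ratios; a minimal sketch in our own representation, where each document is summarized by two counts:

```python
def query_coverage(documents):
    """QC for a corpus: the per-document ratio of correctly lemmatized
    content-word types to all content-word types, averaged over documents.

    documents: one (correctly_lemmatized_types, total_types) pair per document.
    """
    ratios = [correct / total for correct, total in documents]
    return sum(ratios) / len(ratios)
```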
Table 3 shows the MAP results of the first experiment for the different lemmatization methods. ‘‘NOPRF” and ‘‘PRF” de-
note results of the experiments without PRF and with PRF, respectively. In the ‘‘PRF” columns of Table 3, we show the best
results of the PRF experiment. The parameter values for PRF were optimized for no lemmatization. While we changed the

Table 3
MAPs for lemmatization methods.

KQ LQ
NOPRF PRF NOPRF PRF
No lemmatization 0.2260 0.2374 0.5243 0.5303
Sanduijav et al. A 0.2414 0.2520 0.5608 0.5644
Sanduijav et al. B 0.2410 0.2516 0.5618 0.5636
WOLWI 0.2731 0.2834 0.6032 0.6001
WLWI 0.2759 0.2863 0.6098 0.6048
Correct lemmatization A 0.2793 0.2902 0.6034 0.6057
Correct lemmatization B 0.2790 0.2894 0.6097 0.6129

values of nd and nt from 1 to 10 in steps of 1, we changed the value of tw from 0.1 to 1 in steps of 0.1. The best result for KQ
was obtained with nd = 1, nt = 10, and tw = 0.2, and the best result for LQ was obtained with nd = 9, nt = 1, and tw = 0.3.
We used these parameter values for all methods.
In Table 3, ‘‘Correct lemmatization A” and ‘‘Correct lemmatization B” were performed by assessors A and B, respectively.
Because the lemmatization results for Sanduijav et al. rely on assessor lemmatizations, ‘‘Sanduijav et al. A” and ‘‘Sanduijav
et al. B” are based on correct lemmatizations A and B, respectively.
Comparing no lemmatization with other methods, any lemmatization method improved the MAP for all cases, irrespec-
tive of the query type (KQ or LQ) and whether PRF was used (NOPRF or PRF). Comparing WOLWI and WLWI with Sanduijav
et al. (A and B), WOLWI and WLWI had an improved MAP for all cases. Comparing WOLWI with WLWI, WLWI had an im-
proved MAP for all cases. The MAPs of WLWI were smaller than those of correct lemmatizations A and B for all cases, except
for the combination of LQ and NOPRF.
We used the two-sided paired t-test to investigate whether the differences in Table 3 were meaningful or simply chance
occurrences (Keen, 1992). Table 4 shows the results, in which ‘‘<” and ‘‘≪” indicate that the difference between two results
was significant at the 5% and 1% levels, respectively, and ‘‘–” indicates that the difference between two results was not
significant.
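For reference, a paired t statistic over per-query average precision values can be sketched with the standard library (our own code, not the authors' script; the resulting |t| is compared against critical values of the t distribution with n − 1 degrees of freedom, e.g. 2.776 at the 5% level for n = 5):

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Two-sided paired t statistic for matched per-query scores.
    Compare |t| with t-table critical values at len(x) - 1 degrees
    of freedom to decide 5% or 1% significance."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))
```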
In Table 4, the MAP differences between no lemmatization and both correct lemmatizations were statistically significant
for all cases, irrespective of the query type (KQ or LQ) and whether PRF was used (NOPRF or PRF). Therefore, lemmatization
was effective for information retrieval in Mongolian. The MAP differences between no lemmatization and WOLWI or WLWI
were statistically significant for all cases. The differences between Sanduijav et al. (A and B) and WOLWI or WLWI were sta-
tistically significant for all cases. Therefore, WOLWI and WLWI were both more effective than no lemmatization and Sandu-
ijav et al. (A and B).
The MAP differences between WOLWI and WLWI were significant for all cases, except for the combination of LQ and PRF.
Therefore, we can state that applying the loanword identification rule to a lemmatization was effective in information
retrieval.
The MAP differences between WOLWI and the correct lemmatizations were statistically significant for KQ combined with
either NOPRF or PRF. The MAP differences between WOLWI and the correct lemmatization A were not statistically significant
for LQ combined with either NOPRF or PRF. While the MAP difference between WOLWI and the correct lemmatization B was
not statistically significant for the combination of LQ and NOPRF, the MAP difference was significant for the combination of
LQ and PRF. MAP differences between WLWI and the correct lemmatizations were not statistically significant for all cases,
except for the combination of KQ and PRF. Therefore, we can state that WLWI was as effective as the correct lemmatizations
in information retrieval.
Tables 5 and 6 show the results of the second experiment, giving the QC values for the newspaper articles and technical
abstracts, respectively. The original form of a query can vary depending on the assessor lemmatization. Therefore, we used
two types of correct answers, namely the correct answer from assessor A (Correct answer A) and the correct answer from
assessor B (Correct answer B). The average numbers of content word types in the newspaper articles using Correct answers
A and B were 207.39 and 205.56, respectively. The average numbers of content word types in the technical abstracts using
Correct answers A and B were 101.32 and 103.89, respectively.
The tendency in Tables 5 and 6 was the same for both Correct answers A and B. When using any lemmatization method,
the QC values were higher than those for no lemmatization. The QC values for WOLWI and WLWI were higher than those for
Sanduijav et al., and the QC values for WLWI were higher than those for WOLWI. According to Tables 5 and 6, and using
WLWI for lemmatization, approximately 90% of the terms in a document, on average, can be used as queries to retrieve that
document.

Table 4
t-Test results of the differences between lemmatization methods.

                                                    KQ                LQ
                                               NOPRF    PRF     NOPRF    PRF
No lemmatization vs. Correct lemmatization A     ≪       ≪        ≪       ≪
No lemmatization vs. Correct lemmatization B     ≪       ≪        ≪       ≪
No lemmatization vs. WOLWI                       ≪       ≪        ≪       ≪
No lemmatization vs. WLWI                        ≪       ≪        ≪       ≪
Sanduijav et al. A vs. WOLWI                     ≪       ≪        ≪       ≪
Sanduijav et al. A vs. WLWI                      ≪       ≪        ≪       ≪
Sanduijav et al. B vs. WOLWI                     ≪       ≪        ≪       ≪
Sanduijav et al. B vs. WLWI                      ≪       ≪        ≪       ≪
WOLWI vs. WLWI                                   <       <        <       –
WOLWI vs. Correct lemmatization A                ≪       ≪        –       –
WOLWI vs. Correct lemmatization B                ≪       ≪        –       ≪
WLWI vs. Correct lemmatization A                 –       <        –       –
WLWI vs. Correct lemmatization B                 –       ≪        –       –

Table 5
QC for the newspaper articles.

Correct answer A Correct answer B


No lemmatization 0.31 0.36
Sanduijav et al. 0.81 0.83
WOLWI 0.91 0.88
WLWI 0.93 0.90

Table 6
QC for the technical abstracts.

Correct answer A Correct answer B


No lemmatization 0.38 0.40
Sanduijav et al. 0.79 0.78
WOLWI 0.91 0.90
WLWI 0.93 0.92

In summary, as with the first experiment, the second experiment verified the effectiveness of our lemmatization in infor-
mation retrieval for Mongolian, with WOLWI and WLWI each being more effective than Sanduijav et al. In addition, both
experiments verified the effectiveness of loanword identification in lemmatization. At the same time, to reduce manual cost,
our information retrieval experiments did not use manually created queries and relevance judgments, which have usually
been used for information retrieval evaluations in the literature. This issue should be explored in future work.

6. Conclusion

In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the use of the
Cyrillic alphabet, and the language is simply termed ‘‘Mongolian”.
In Mongolian, each sentence is segmented on a phrase-by-phrase basis. A phrase consists of a content word, such as a
noun or a verb, and one or more suffixes, such as postpositional participles. Content words can potentially be inflected when
concatenated with suffixes.
The process of identifying the original form of content words in Mongolian text, termed ‘‘lemmatization”, is crucial for
natural language processing and information retrieval.
In this paper, we have proposed a lemmatization method for Mongolian, in which we targeted nouns, verbs, adjectives,
and numerals. Loanwords, which can be nouns, adjectives, or verbs, can also be concatenated with suffixes. Because loan-
words are linguistically different from conventional Mongolian words, the suffix concatenation is also different from that
for conventional Mongolian words. Therefore, exceptional lemmatization rules are required for loanwords.
Our lemmatization method does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. Instead
of noun dictionaries, we have produced noun and verb suffix dictionaries, suffix segmentation rules, and a vowel insertion
rule. We have also produced loanword identification rules to identify loanwords. For loanwords, we have produced excep-
tional suffix segmentation rules.
We used newspaper articles and technical abstracts to evaluate our lemmatization method by experiment. The accuracy
of our method was higher than that of existing methods. We also applied our lemmatization method to information retrieval,
and demonstrated that retrieval accuracy and query coverage were improved by our method. Our research is the first sig-
nificant effort in applying lemmatization to information retrieval in Mongolian.
Future work will include improving loanword identification and introducing part-of-speech tagging to improve the lemmati-
zation. In our information retrieval experiments, to reduce manual cost, we produced queries and relevant documents auto-
matically. Therefore, information retrieval experiments using manually produced test collections should also be performed
in future work.

References

Berlian Vega, S. N., & Bressan, S. (2001). Indexing the Indonesian Web: Language identification and miscellaneous issues. In Proceedings of tenth international
world wide web conference (pp. 46–47).
Brill, E. (1995). Unsupervised learning of disambiguation rules for part of speech tagging. In Proceedings of the third workshop on very large corpora (pp. 1–13).
Carlberger, J., Dalianis, H., Hassel, M., & Knutsson, O. (2001). Improving precision in information retrieval for Swedish using stemming. In Proceedings of
NODALIDA 01 – 13th Nordic conference on computational linguistics.
Charniak, E., Hendrickson, C., Jacobson, N., & Perkowitz, M. (1993). Equations for part of speech tagging. In Proceedings of the conference of the American
association for artificial intelligence (pp. 784–789).
Ehara, T., Hayata, S., & Kimura, N. (2004). Mongolian morphological analysis using ChaSen. In Proceedings of the 10th annual meeting of the association for
natural language processing (pp. 709–712), (in Japanese).

Ekmekçioglu, Ç. F., Lynch, M. F., & Willett, P. (1996). Stemming and n-gram matching for term conflation in Turkish texts. Information Research News, 7(1),
2–6.
Hull, D. A. (1996). Stemming algorithms – a case study for detailed evaluation. Journal of the American Society for Information Science and Technology, 47(1),
70–84.
Jaimai, P., Zundui, T., Chagnaa, A., & Ock, C. (2005). PC-KIMMO-based description of Mongolian morphology. International Journal of Information Processing
Systems, 1(1), 41–48.
Keen, M. E. (1992). Presenting results of experimental retrieval comparisons. Information Processing and Management, 28(4), 491–502.
Khaltar, B., Fujii, A., & Ishikawa, T. (2006). Extracting loanwords from Mongolian corpora and producing a Japanese–Mongolian bilingual dictionary. In
Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp.
657–664).
Korenius, T., Laurikkala, J., Järvelin, K., & Juhola, M. (2004). Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the
13th association for computing machinery international conference on information and knowledge management (pp. 625–633).
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Larkey, L. S., Ballesteros, L., & Connel, M. E. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In
Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 275–282).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.
Melucci, M., & Orio, N. (2003). A novel method for stemmer generation based on hidden Markov models. In Proceedings of the 12th international conference on
information and knowledge management (pp. 131–138).
Myaeng, S. H., & Joeng, K. (1999). Back-transliteration of foreign words for information retrieval. Information Processing and Management, 35(4), 523–540.
Oh, J., & Choi, K. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the international conference on
computer processing of oriental languages (pp. 433–438).
Ozawa, S. (2000). Modern Mongolian Dictionary. Tokyo: Daigakushorin.
Popovic, M., & Willett, P. (1992). The effectiveness of stemming for natural-language access to Slovene textual data. Journal of the American Society for
Information Science and Technology, 43(5), 384–390.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1995). Okapi at TREC-3. In Proceedings of the third text retrieval conference, NIST
Special Publication 500-226 (pp. 109–126).
Sanduijav, E., Utsuro, T., & Sato, S. (2005). Mongolian phrase generation and morphological analysis based on phonological and morphological constraints.
Journal of Natural Language Processing, 12(5), 185–205 (in Japanese).
Tai, S. Y., Ong, C. O., & Abdullah, N. A. (2000). On designing an automated Malaysian stemmer for the Malay language. In Proceedings of the 5th international
workshop on information retrieval with Asian languages, Hong Kong (pp. 207–208).
Ts, B. (2002). Mongolian Grammar for Grades I–IV. Ulaanbaatar (in Mongolian).
Xu, J., & Croft, B. W. (1998). Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems, 16(1), 61–81.
