Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 44–47,
Suntec, Singapore, 7 August 2009. © 2009 ACL and AFNLP
Task             Run           Optimal Parameter Set        Accuracy in top-1  Mean F-score  MRR   MAPref  MAP10  MAPsys
English-Hindi    Standard      LM Order: 5, LM Weight: 0.6  0.47               0.86          0.58  0.47    0.18   0.20
English-Hindi    Non-standard  LM Order: 5, LM Weight: 0.6  0.52               0.87          0.62  0.52    0.19   0.21
English-Tamil    Standard      LM Order: 5, LM Weight: 0.3  0.45               0.88          0.56  0.45    0.18   0.18
English-Kannada  Standard      LM Order: 5, LM Weight: 0.3  0.44               0.87          0.55  0.44    0.17   0.18
[Table header only: Task, Run, Accuracy in top-1, Mean F-score, MRR, MAPref, MAP10, MAPsys. The data rows were not recovered in extraction.]
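The accuracy-in-top-1 and MRR figures reported above can be computed from ranked candidate lists as follows. This is an illustrative implementation following the usual shared-task metric definitions, not the organizers' scoring script, and the example data is made up for demonstration.

```python
# Illustrative implementations of two metrics reported above: accuracy
# in top-1 (share of words whose first candidate matches a reference)
# and MRR (mean reciprocal rank of the first correct candidate).

def top1_accuracy(candidates, references):
    """candidates: ranked candidate lists, one per source word.
    references: sets of acceptable transliterations, one per word."""
    hits = sum(1 for cands, refs in zip(candidates, references)
               if cands and cands[0] in refs)
    return hits / len(references)

def mean_reciprocal_rank(candidates, references):
    """Average of 1/rank of the first correct candidate (0 if none)."""
    total = 0.0
    for cands, refs in zip(candidates, references):
        for rank, cand in enumerate(cands, start=1):
            if cand in refs:
                total += 1.0 / rank
                break
    return total / len(references)

# Toy example: the first word is correct at rank 1, the second at rank 2.
cands = [["gandhi", "gandee"], ["dilli", "delhi"]]
refs = [{"gandhi"}, {"delhi"}]
print(top1_accuracy(cands, refs))        # -> 0.5
print(mean_reciprocal_rank(cands, refs)) # -> 0.75
```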
2.2 Improving CSM Performance

In addition to the above-mentioned parameters, we varied the order of the CSM and the monolingual corpus used to estimate the CSM. For each task, we started with a trigram CSM as mentioned above and tuned both the order of the CSM and λCSM on the development set. The optimal set of parameters and the development set results are shown in Figure 1. In addition, we used a monolingual Hindi corpus of around 0.4 million documents called the Guruji corpus. We extracted the 2.6 million unique words from this corpus and trained a CSM on them. This CSM, learnt on the monolingual Hindi corpus, was used for the non-standard Hindi run. We repeated the above procedure of tuning the order of the CSM and λCSM to find the optimal set of parameters for the non-standard run on the development set.

4 Conclusion

We presented the transliteration system we used for our participation in the NEWS 2009 Machine Transliteration Shared Task. We took a phrase-based SMT approach to transliteration in which words are replaced by characters and sentences by words. In addition to the standard SMT parameters, we tuned the CSM-related parameters: the order of the CSM, the weight assigned to the CSM, and the corpus used to estimate the CSM. Our results show that improving the accuracy of the CSM pays off in terms of improved transliteration accuracy.

Acknowledgements

We would like to thank the Indian search-engine company Guruji (http://www.guruji.com) for providing us with the Hindi web content that was used to train the language model for our non-standard Hindi runs.
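The CSM training recipe described in Section 2.2 (extract the unique words from a monolingual corpus, then train an n-gram language model over their characters) can be sketched as below. This is a hypothetical helper, not the authors' actual pipeline, and the SRILM command in the comment is one plausible way to estimate the 5-gram CSM from its output.

```python
# A minimal sketch (not the authors' actual pipeline) of preparing CSM
# training data: each unique word becomes one "sentence" whose tokens
# are its characters, so that a standard n-gram toolkit can estimate a
# character sequence model from the resulting lines, e.g. with SRILM:
#   ngram-count -order 5 -text chars.txt -lm csm.lm

def words_to_char_sentences(words):
    """Map each unique word to a space-separated character sequence,
    preserving first-seen order and dropping duplicates."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(" ".join(word))  # "delhi" -> "d e l h i"
    return lines

# Toy stand-in for the 2.6 million unique words of the Guruji corpus.
for line in words_to_char_sentences(["delhi", "mumbai", "delhi"]):
    print(line)
# d e l h i
# m u m b a i
```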
3 Results and Discussion
The details of the NEWS 2009 dataset for Hindi, Kannada and Tamil are given in [Li et al., 2009a; Kumaran and Kellner, 2007]. The final results of our system on the test set are shown in Figure 2. Figure 3 shows the improvements obtained on the test set by tuning the CSM parameters. The trigram CSM used along with the optimal Moses parameter set tuned on the development set was taken as the baseline for these experiments. The results show that a major improvement (32.43%) was obtained in the non-standard run, where the monolingual Hindi corpus was used to learn the CSM. Because of the use of the monolingual Hindi corpus in the non-standard run, the transliteration accuracy improved by 22.5% when compared to the standard run. The improvements (15.38%) obtained in Tamil are also significant. However, the improvement in the Hindi standard run was not significant, and in Kannada there was no improvement due to tuning of the LM parameters. This needs further investigation.

The above results clearly highlight the importance of improving CSM accuracy, since it helps in improving transliteration accuracy. Moreover, improving the CSM accuracy only requires monolingual language resources, which are easy to obtain compared to parallel transliteration training data.

References

Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, Alexandra Constantin, Marcello Federico, Nicola Bertoldi, Chris Dyer, Brooke Cowan, Wade Shen, Christine Moran, and Ondřej Bojar. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, Demonstration Session, pages 177–180.

Fei Huang. 2005. Cluster-specific Named Entity Transliteration. In HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 435–442, Morristown, NJ, USA. Association for Computational Linguistics.

A. Kumaran and Tobias Kellner. 2007. A Generic Framework for Machine Transliteration. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 721–722, New York, NY, USA. ACM.

Haizhou Li, A. Kumaran, Vladimir Pervouchine, and Min Zhang. 2009a. Report on NEWS 2009 Machine Transliteration Shared Task. In Proceedings of the ACL-IJCNLP 2009 Named Entities Workshop (NEWS 2009).

Haizhou Li, A. Kumaran, Min Zhang, and Vladimir Pervouchine. 2009b. Whitepaper of NEWS 2009 Machine Transliteration Shared Task. In Proceedings of the ACL-IJCNLP 2009 Named Entities Workshop (NEWS 2009).
Franz Josef Och and Hermann Ney. 2003. A System-
atic Comparison of Various Statistical Alignment
Models. Computational Linguistics, 29(1):19–51.
Tarek Sherif and Grzegorz Kondrak. 2007. Substring-Based Transliteration. In Proceedings of ACL 2007. The Association for Computational Linguistics.
Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing.