Académique Documents
Professionnel Documents
Culture Documents
- GOOGLE TRANSLATE -
Yet, the criteria I used in evaluating the accuracy of the translations made by the program
are the following:
The would-be interpreter must have a perfect grasp of the sense of spirit of his author. He must
possess knowledge in depth of the language of the original as well as of his own tongue. He ought
() be faithful to the meaning of the sentence, not to the word order. () Our translator will aim
for a version in plain speech. He will avoid the importation of neologisms, rare terms, and
esoteric flourishes of syntax () The final rule applies to all good writing: the translator must
achieve harmonious cadences (nombres oratoires), he must compose in a sweet and even style so
as to ravish the readers ear and intellect. (Steiner, 2001:302)
According to various online sources based on different surveys, Google Search is, by far,
the most used search engine all across the globe. Regardless of the country one accesses Google
Search from, there are several buttons (links) at the top of its homepage, among which you will
also find Google Translate.
The main reason that justifies this study and article is the spreading and the promotion of
a considerable amount of inaccurate translations from Arabic into Romanian made by a marketleading software (due to the amount of languages for which it provides translations and the
number of users) that is only a click away from the homepage of the most used search engine in
the world Google Search. At the same time, before proceeding, it is necessary for me to
mention that this work does not intend to criticize or minimize the efforts of the program and its
developers, and this is why I havent used any text that would be particularly difficult even for a
human translator.
For a better understanding of the study it is essential to know how the Google Translate
software works. When Google Translate started, it would only translate from English into
French, German and Spanish, and from French, German and Spanish into English. Arabic was
only added later, in the 5th stage (launched in 2006), while Romanian, along with other
languages, was added in the 10th stage (launched in 2008). At this state, translations could be
done between any two languages, going through English. Since any translation from Arabic into
Romanian is made through English, that means that any mistake that appears at the level of
translation from Arabic into English will, unless by accident, also appear, if not actually worsen,
in the translation from English into Romanian.
The history of machine translation actually goes back over 60 years, almost immediately
after the first computers had been used to break encryption codes in the Second World War, as
foreign languages were seen, in fact, as encrypted English. In the 70s the foundations for the
first commercial systems were laid, and ever since, with the appearance of personal computers
and with the users need to benefit from translation tools, the field of machine translation kept
growing. (Koehn, 2010:4)
But as time passed, people realized how complex, how active languages are in the
process of enrichment, and especially how difficult it is to teach a computer all the rules of a
language. This lead to innovating the field of machine translation by creating statistical machine
translation methods, which dont imply teaching the computer anything, as the computer is left to
discover the rules by itself.
According to the official Google Translate website,
Google Translate is a free translation service that provides instant translations between 64
different languages. It can translate words, sentences and web pages between any combination of
our supported languages. With Google Translate, we hope to make information universally
accessible and useful, regardless of the language in which its written. (Google Translate, 2012)
As for how it works, the developers explain how Google Translate generates a translation
after
looking for patterns in hundreds of millions of document to help decide on the best translation for
you. By detecting patterns in documents that have already been translated by human translators,
Google Translate can make intelligent guesses as to what an appropriate translation should be.
This process of seeking patterns in large amounts of text is called statistical machine translation.
(Google Translate, 2012)
These texts come from books, organizations like the United Nations and websites all
around the world. (Google Translate, 2012)
Our computers scan these texts looking for statistically significant patters, that is, patterns
between the translation and the original text that are unlikely to occur by chance. Once the
computer finds a pattern, it can use this pattern to translate similar texts in the future. When you
repeat this process billions of times you end up with billions of patterns and one very smart
computer program. (Google Translate, 2010)
The Defence Advanced Research Projects Agency1 did a study on machine translation in
relation with human translation and the processing of spoken languages. The authors of the study
certify that even tough computational capabilities of machines exceed those of humans in many
ways, even the most advanced of todays computers cannot compare with the language ability
that humans acquire naturally, since for translating and extracting the information conveyed
through language humans take advantage of a variety of cognitive abilities that no computer can
currently emulate. (Olive, 2011: vii)
The linguists working for DARPA state that text input can be problematic mainly due to
the lack of orthographic representation. While Chinese does not indicate word boundaries
orthographically, Arabic does that, but it often does not include explicit vowel marking, thus
creating ambiguities, since it can be uncertain which vowels were intended. (Olive, 2011:viii)
The Arabic alphabet includes twenty eight letters that mark consonants and long vowels.
(Grigore, 2006:9) As for the short vowels, all three of them, a, i and u, are graphically
1
DARPA is an agency of the United States Department of Defense and it is responsible for the development of new
technologies for use by the military.
represented through signs placed above or under the consonants, and can be only found in books
for children, holy books, especially the Qur'n (Grigore, 2006:9), some poems, in order to avoid
reading errors. In regular texts, short vowels only appear in order to avoid confusions between
words that have the same consonant skeleton, but that are different once vocalized. (Grigore,
2006: 10) Vowels dont only influence the meaning of a word, but also the syntactic function, as
vowels carry grammatical case information. (Olive, 2011: 537)
Statistical machine translation is corpus based, that is, it learns from examples of
translations called bilingual/parallel corpora. The program does that by aligning source and
target sides of parallel texts. (Habash, 2010: 119) By doing this, the program learns in an
unsupervised manner what pairs of words in the paired sentences are translations of each other.
(Olive, 2011: 133) The world alignments are used to learn translation models that relate words
and sequences of words in the source language to those in the target language. At the same time,
when translating a source language sentence, a statistical decoder combines the information in
the translation model with a language model of the target language to produce a ranked list of
optimal sentences in the target language. (Habash, 2010: 119)
Besides pairs of words (source language word target language word), translation
models also include various additional statistics reflecting the likelihood of a certain translation
pair to appear. (Olive, 2011: 134)
When it comes strictly to Arabic, the real challenge for the statistical translation method
is nothing else but the morphological complexity of the language, that consist of a large set of
morphological features. These features are realized using both concatenative (affixes and
stems), on the pattern
[CONJUNCTION+[PARTICLE+[ARTICLE+BASE+PRONOUN]]]
and templatic (root and patterns) morphology with a variety of morphological and phonological
adjustments that appear in word orthography and interact with orthographic variations. (Olive,
2011: 135) The main concern gravitates around the clitics, that are distinct from inflectional
features such as gender, person and voice. The clitics are written attached to the word, and thus,
increase its ambiguity. (Olive, 2011:136)
Going back to the pattern listed above, the base can have attached either proclitics (such
as the definite article al, prepositions such as bi- and li-, functional particles such as sa-,
conjunctions such as wa- and fa-), or enclitics (affix pronouns). The base can have a definite
article or a member of the class of pronominal enclitics. Pronominal enclitics can attach to nouns
(as possessives) or verbs and prepositions (as objects). The definite article doesnt apply to verbs
or prepositions. The definite article and pronouns dont coexist on nouns. Particles can attach to
all words. (Olive, 2011: 136)
Of course that in order to obtain proper words to work with base words the program
must remove all the clitics. But getting to the base words can be quite a difficult thing to do since
what the program might select as a particle, for example the preposition bi- [with], might
actually be the first consonant of the pattern. So it is either the program does not properly
identify the clitics and ends up not removing them, or the program removes more than the clitics,
which adds ambiguity and requires higher accuracy of preprocessing tools which, when failed,
introduce errors and noise. (Olive, 2011: 137)
For a given parallel text, the Arabic vocabulary size is significantly larger than the
English one. In the parallel news corpora used in the experiments described by Olive, the
average English sentence length is 33 words compared to 25 words on the Arabic side. The
larger Arabic vocabulary is causing problems in sparsity and variability in estimating translation
models, since Arabic words do not appear as often in the training data as their English
counterparts, and this is a problem when using the statistical machine translation method on
language pairs with significant vocabulary size difference. If a language F has a larger
vocabulary than a language E, problems in analysis dominate when translating from F into E
(due to a relatively larger number of out-of-vocabulary words in the input) whereas, generation
problems dominate when translating from E into F). (Olive, 2011: 137)
Arabic is such a morphologically rich language since its open-class words consist of a
consonantal root interspersed with different vowel patterns, plus various derivational affixes and
particles. A given root form can lead to thousand of different word forms, which may present a
problem for language modeling and vocabulary selection. (Olive, 2011: 522) Also, one must not
forget that some patterns resemble others, regardless of the fact that the consonantic root is
different. In the above mentioned situations only the context can help decide for one pattern
(word form) or another.
The most problematic levels for Google Translate when it comes to translating from
Arabic into Romanian are, by far, the lexical and the syntactical one.
On the morphosyntactic level, in order to translate, one must discover the linguistically
relevant features of both the source and the target languages. The selection of these features can
be different in the meaning that what might be relevant for the source language might be
irrelevant for the target language and vice versa. In this case, the translator must determine which
of the inherent features of the morphosyntactic units in the source language are also relevant for
the target language. (Cristea, 2007: 30)
For example, if number and definiteness should be reflected in the target language
(Romanian) just as they appear in the source language (Arabic), when it comes to case, it is the
translator in this particular case the program who first has to correctly identify the cases in
the source sentence (in the source language) in order to understand the relations between the
words in the sentence, then he has to know how to correctly use the cases in the target language
so he would finally come with a correct translation. Since short vowels are, most of the time, not
marked in Arabic, and since case only reflects in the final vocalization of nouns, adjectives and
pronouns, Google Translate is prone to misunderstanding the relations between the words in a
sentence, no matter how rich in hints the sentence or the piece of text is. If the relations between
the words of the source sentence are not preserved as such in the target sentence the general
meaning of the sentence is affected. The most common mistake Google Translate does in this
respect is the switching of the subject with the direct object and vice versa.
Gender is an inherent feature for nouns, but only overtly appears in adjectives that agree
with nouns. Both Arabic and Romanian work with flectional adjectives, while in English
adjectives are invariable, as seen below.
waladun amlun
bintun amlatun
beautiful boy
beautiful girl
biat frumos
fat frumoas
'awldun amlna
bantun amltun
beautiful boys
beautiful girls
biei frumoi
fete frumoase
Since Google Translate has to go through English when translating from Arabic intro Romanian,
the program actually has to start from a variable adjective (Arabic), go through an invariable one
(English), and finally end up as a variable adjective (Romanian) which agrees in gender, number
and case with the determined noun. Here is where Google Translate fails most of the time, since
no agreement criterion is a good enough argument in front of the statistical method.
As for the number, English and Romanian deal only with singulars and plurals, while
Arabic also has duals for nouns, adjectives, pronouns and verbs. Most of the times, Google
Translate translates duals as singulars, and sometimes also as plurals. The real challenge appears
when adding an affix pronoun to either a dual noun or a dual verb, in which case the software
fails most of the time by not translating these forms and actually rendering them using the Latin
alphabet, first letter capitalized as if it were a proper noun. The same goes for prepositions
added to dual nouns.
1)
Google Translate doesnt deal so much better with personal and relative pronouns either.
First, I must mention that both Arabic and Romanian are pro-drop languages, instead English is
not a pro-drop language. While in both Arabic and Romanian the person of the subject is overt in
the form of the predicate, so it can be omitted, in English the overt subject is compulsory. So if
Google Translate is given to translate a sentence that does not have an overt subject in Arabic, he
will sometimes translate it in English as such, which is incorrect.
3)
When a relative clause in Arabic and Romanian refers back to a noun or noun phrase in
the main clause which is the object of a verb or a preposition (e.g. the book that I read), a
substitute pronoun, resumptive pronoun, must be inserted in the relative clause to serve as the
object of the verb or preposition, referring back to the object noun in the main phrase (e.g. alkitbu lla qara'tuhu, cartea pe care am citit-o, the book that I read [it]). In English,
resumptive pronouns are not used in shallow relative clauses, but are required in certain more
deeply embedded clauses. With resumptive pronouns, Google Translate actually has to go from
pronoun, in the Arabic sentence, through no pronoun, in the intermediate English translation, to
pronoun, the final Romanian translation, which most of the times ends up with incomplete and
incorrect sentences.
As for the relative pronoun, if its presence in Romanian and English (with some
exceptions) is always compulsory, in Arabic, relative pronouns can appear or not: definite
clauses are introduced by a relative pronoun, while indefinite relative clauses do not include a
relative pronoun. Whenever Google Translates has to translate an indefinite relative clause it
actually needs to bring the relative pronoun to light so it ends up overt in both English and
Romanian, which most of the times doesnt happen.
Personal pronouns determine the appearance of a very interesting yet constant
phenomenon when they are subjects in equational sentences, that is Google Translate perceives
and translates the pronoun as the verb to be, leaving, in most cases, the subject position empty,
which is incorrect in English.
hum 'aafu mimm tataawwaru.
4)
Consonantic root
Pattern
Meaning
manal
n, , l
mafalun
apiary
munallun
, l, l
munfalun
dissolved
Once again, Google Translate uses not logics, but the statistical method, so no matter how many
hints that should help the program decide which meaning to pick, it will choose the most used
meaning (in certain word combinations).
5)
This example above also explains why the program chose cream instead of kind, excitat
(sexually aroused, horny) instead of confecionat din corn (horny), Perotty (Argentinian
football player) instead of from Beirut, Gedda (probably the city in Saudi Arabia) instead of
grandmother, so on and so forth.
Sometimes, Google Translate does not translate some words at all, sometimes it
improvises on its own, it inserts words randomly:
talat 'awtu l-aabi mina l-marakati llat arat bayna
6)
Voices of anger Almarkp that took place between the two friends who were old foes accuse each
other.
Voci din Almarkp furie care a avut loc ntre
doi prieteni care au fost dusmani vechi se acuz reciproc.
10)
11)
The purpose of this article was to shortly highlight the flaws in the translations made by
the statistical machine translation, Google Translate, and also to raise awareness over its limits,
given the fact that Google Search is the most used search engine worldwide and that Google
Translate is one click away from its homepage, and even more because, since it is a free service,
more and more websites, blogs, governmental websites have started and will start using it. For
sure that Google Translate proves itself useful for many Internet users, especially for those who
only need it for basic activities, but for those who actually expect a medium to a high level
translation, Google Translate can be, at most, a helpful tool when it comes to understanding the
main idea of a text and why not, a good source of laughter.
Bibliography
Och,
Franz.
2011.
Statistical
machine
translation
live.
http://googleresearch.blogspot.com/2006/04/statistical-machine-translation-live.html (4.
01. 2011).
Google
Translate.
2010.
Inside
Google
Translate.
http://www.youtube.com/watch?v=Rq1dow1vTHY (4. 01. 2011).
Google Translate. 2012. Find out how our translations are created.
http://translate.google.com/about/intl/ro_ALL/
(6. 10. 2012)