0. Introduction
Recent US federal mandates (e.g. White House Executive Order #13166),1 requiring health care providers who are recipients of federal funds to provide language
translation and interpretation for patients with limited English proficiency (LEP),
have brought the long-standing issue of translation quality to a wider audience of
health care professionals (e.g. managers, decision makers, industry stakeholders,
private foundations), who generally feel unprepared to address the topic. A striking example of how challenging quality evaluation can be for health care organizations is illustrated by the experience of Hablamos Juntos, an initiative funded by
the Robert Wood Johnson Foundation to develop practical solutions to language
barriers to health care.
Several healthcare providers (including hospitals) working with the program
identified what they believed were the best translations available. Eighty-seven
different than that envisioned by the writer of the original,3 one can imagine the
difficulties entailed by equating quality with equivalence of response. Finally, as
with many other theoretical approaches, reader-response testing is time-consuming and difficult to apply to actual translations. At a minimum, careful selection
of readers is necessary to make sure that they belong to the intended audience for
the translation.
1.2.2 Textual and pragmatic approaches
Textual and pragmatic approaches have made a significant contribution to the
field of translation evaluation by shifting the focus from counting errors at the
word or sentence level to evaluating texts and translation goals, giving the reader
and communication a much more prominent role. Yet, despite these advances,
none of these approaches can be said to have been widely adopted by either professionals or scholars.
Some models have been criticized because they focus too much on the source
text (Reiss 1971) or on the target text (Skopos) (Reiss and Vermeer 1984, Nord
1997); Reiss argues that the text type and function of the source text are the most important factors in translation and that quality should be assessed with respect to them. For Skopos Theory, it is the text type and function of the translation that are of paramount importance in determining the quality of the translation.
House's (1997, 2001) functional-pragmatic model relies on an analysis of the linguistic-situational features of the source and target texts, a comparison of the two texts, and the resulting assessment of their match. The basic measure of quality is that the textual profile and function of the translation match those of the original, the goal being functional equivalence between the original and the translation. One objection that has been raised against House's functional model is its dependence on the notion of equivalence, often a vague and controversial term in translation studies (Hönig 1997). This is a problem because translations sometimes
are commissioned for a somewhat different function than that of the original; in
addition, a different audience and time may require a slightly different function
than that of the source text (see Hönig 1997 for more on the problematic notion
of equivalence). These scenarios are not contemplated by equivalence-based theories of translation. Furthermore, one can argue that what qualifies as equivalent
is as variegated as the notion of quality itself. Other equivalence-based models of
evaluation are Gerzymisch-Arbogast (2001), Neubert (1985), and Van den Broeck
(1985). In sum, the reliance on an a priori notion of equivalence is problematic
and limiting in descriptive as well as explanatory value.
An additional objection against textual and pragmatic approaches is that they
are not precise about how evaluation is to proceed after the analysis of the source or
the target text is complete or after the function of the translation has been established
as the guiding criterion for making translation decisions. This obviously affects the
ease with which the models can be applied to texts in professional settings. Hönig,
for instance, after presenting some strong arguments for a functionalist approach
to evaluation, does not offer any concrete instantiation of the model, other than in
the form of some general advice for translator trainers. He comes to the conclusion
that "the speculative element will remain at least as long as there are no hard and fast empirical data which serve to prove what a typical reader's responses are like" (1997:32).4 The same criticism regarding the difficulty involved in applying textual
and theoretical models to professional contexts is raised by Lauscher (2000). She
explores possible ways to bridge the gap between theoretical and practical quality
assessment, concluding that translation criticism could move closer to practical
needs "by developing a comprehensive translation tool" (2000:164).
Other textual approaches to quality evaluation are the argumentation-centered approach of Williams (2001, 2004), in which evaluation is based on argumentation and rhetorical structure, and corpus-based approaches (Bowker 2001).
The argumentation-centered approach is also equivalence-based, as a translation "must reproduce the argument structure of ST to meet minimum criteria of adequacy" (Williams 2001:336). Bowker's corpus-based model uses a comparatively
large and carefully selected collection of naturally occurring texts that are stored
in machine-readable form as a benchmark against which to compare and evaluate specialized student translations. Although Bowker (2001) presents a novel,
valuable proposal for the evaluation of students' translations, it does not provide
specific indications as to how translations should be graded (2001:346). In sum,
argumentation and corpus-based approaches, although presenting crucial aspects
of translation evaluation, are also complex and difficult to apply in professional
environments (and one could argue in the classroom as well).
1.3 The functional-componential approach (Colina 2008)
Colina (2008) argues that current translation quality assessment methods have
not achieved a middle ground between theory and applicability; while anecdotal
approaches lack a theoretical framework, the theoretical models often do not contain testable hypotheses (i.e., they are non-verifiable) and/or are not developed
with a view towards application in professional and/or teaching environments. In
addition, she contends that theoretical models usually focus on partial aspects of
translation (e.g., reader response, textual aspects, pragmatic aspects, relationship
to the source, etc.): Perhaps due to practical limitations and the sheer complexity
of the task, some of these approaches overlook the fact that quality in translation
is a multifaceted reality, and that a general comprehensive approach to evaluation
may need to address multiple components of quality simultaneously.
To address these gaps, Colina (2008) proposes a functional-componential approach, together with an evaluation tool that raters can apply without significant training. Pilot testing results indicate good inter-rater
reliability for the tool and the need for further testing. The current paper focuses
on a second experiment designed to further test the approach and tool proposed
in Colina (2008).
2. Second phase of TQA testing: Methods and Results
2.1 Methods
One of the most important limitations of the experiment in Colina (2008) concerns the number and grouping of participants. Given the project objective of ensuring applicability across languages frequently used in the USA, subjects were recruited in three languages: Spanish, Russian, and Chinese. As a result,
resources and time for recruitment had to be shared amongst the languages, with
smaller numbers of subjects per language group. The testing described in the current experiment includes more subjects and additional texts. More specifically, the
study reported in this paper aims:
I. To test the TQA tool again for inter-rater reliability (i.e. to what degree trained
raters use the TQA tool consistently) by answering the following questions:
Question 1. For each text, how consistently do all raters rate the text?
Question 2. How consistently do raters in the first session (Benchmark) rate
the texts?
Question 3. How consistently do raters in the second session (Reliability) rate
the texts?
Question 4. How consistently do raters rate each component of the tool? Are
there some test components where there is higher rater reliability?
II. To compare the rating skills/behavior of translators and teachers: Is there a difference in scoring between translators and teachers? (Question 5, Section 2.2).
Data were collected during two rounds of testing: the first, referred to as the Benchmark Testing, included 9 raters; the second session, the Reliability Testing, included 21 raters. Benchmark and Reliability sessions consisted of a short training session, followed by a rating session. Raters were asked to rate 4–5 translated texts (depending on the language) and had one afternoon and one night to complete the
task. After their evaluation worksheets had been submitted, raters were required
to submit a survey on their experience using the tool. They were paid for their
participation.
2.1.1 Raters
Raters were drawn from the pool used for the pre-pilot and pilot testing sessions
reported in Colina (2008) (see Colina [2008] for selection criteria and additional
details). A call was sent via email to all those raters selected for the pre-pilot and
pilot testing (including those who were initially selected but did not take part). All
raters available participated in this second phase of testing.
As in Colina (2008), it was hypothesized that similar rating results would be obtained among members of the same group. Therefore, raters were recruited according to membership in one of two groups: professional translators and language teachers (language professionals who are not professional translators).
Membership was assigned according to the same criteria as in Colina (2008). All
selected raters exhibited linguistic proficiency equivalent to that of a native (or
near-native) speaker in the source and in one of the target languages.
Professional translators were defined as language professionals whose income
comes primarily from providing translation services. Significant professional experience (5 years minimum; most had 12–20 years of experience), membership in
professional organizations, and education in translation and/or a relevant field were
also needed for inclusion in this group. Recruitment for these types of individuals was primarily through the American Translators Association (ATA). Although
only two applicants were ATA certified, almost all were ATA affiliates (members).
Language teachers were individuals whose main occupation was teaching
language courses, at a university or other educational institution. They may have
had some translation experience, but did not rely on translation as their source
of income. A web search of teaching institutions with known foreign language
programs was used for this recruitment. We reached out to schools throughout
the country at both the community college and university levels. The definition of
teacher did not preclude graduate student instructors.
Potential raters were assigned to the above groups on the basis of the information provided in their resume or curriculum vitae and a language background
questionnaire included in a rater application.
The bilingual group in Colina (2008) was eliminated from the second experiment, as subjects were only available for one of the languages (Spanish). Translation competence models and research suggest that bilingualism is only one component of translation competence (Bell 1991, Cao 1996, Hatim and Mason 1997,
PACTE 2008). Nonetheless, since evaluating translation products is not the same
as translating, it is reasonable to hypothesize that other language professionals,
such as teachers, may have the competence necessary to evaluate translations; this
may be particularly true in cases, such as the current project, in which the object of
evaluation is not translator competence, but translation products. This hypothesis
would be borne out if the ratings provided by translators and teachers are similar.
As mentioned above, data were collected during two rounds of testing: the first one, the Benchmark Testing, included 9 raters (3 Russian, 3 Chinese, 3 Spanish); these raters were asked to evaluate 4–5 texts (per language) that had been previously selected as clearly of good or bad quality by expert consultants in each language. The second session, the Reliability Testing, included 21 raters, distributed
as follows: eight Spanish, seven Chinese, and six Russian raters.
Differences across groups reflect general features of each language group in the US.
Among the translators, the Russians had degrees in Languages, History and Translating, Engineering, and Nursing from Russian and US universities and experience ranging from 12 to 22 years; the Chinese translators' experience ranged from 6 to 30 years, and their education included Chinese language and literature, Philosophy (MA), English (PhD), Neuroscience (PhD) and Medicine (MD), with degrees obtained in China and the US. Their Spanish counterparts' experience varied from 5 to 20 years, and their degrees included areas such as Education, Spanish and English Literature, Latin American Studies (MA), and Creative Writing (MA).
The Spanish and Russian teachers were perhaps the most uniform groups, including college instructors (PhD students) with MAs in Spanish or Slavic Linguistics, Literature, and Communication, and one college professor of Russian. With one
exception, they were all native speakers of Spanish or Russian with formal education in the country of origin. Chinese teachers were college instructors (PhD
students) with MAs in Chinese, one college professor (PhD in Spanish) and an
elementary school teacher and tutor (BA in Chinese). They were all native speakers of Chinese.
2.1.2 Texts
As mentioned above, experienced translators serving as language consultants selected the texts to be used in the rating sessions. Three consultants were instructed to identify health education texts translated from English into their language. Texts were to be publicly available on the Internet; half were to be very good and the other half very poor, as judged on reading the text. Those texts were used for the Benchmark session of testing, during which they were rated by the consultants and two additional expert translators. The texts on which there was the most agreement in rating were selected for the Reliability Testing. The Reliability texts comprised five Spanish texts (three good and two bad), four Russian texts, and four Chinese texts (for each of these two languages, two texts of good quality and two of bad quality), making up a total of thirteen texts.
2.1.3 Tool
The tool tested in Colina (2008) was modified to include a cover sheet consisting of two parts. Part I is to be completed by the person requesting the evaluation (i.e. the Requester) and read by the rater before he/she starts work. It contains the Translation Brief, relative to which the evaluation must always take place, and the Quality Criteria, clarifying requester priorities among components. The TQA Evaluation Tool included in Appendix 1 contains a sample Part I, as specified by Hablamos Juntos (the Requester), for the evaluation of a set of health education materials. The Quality Criteria section reflects the weights assigned to the four components in the Scoring Worksheet at the end of the tool. Part II of the Cover Sheet is to be filled in by the raters after the rating is complete. An Assessment Summary and Recommendation section was included to allow raters the opportunity to offer an action recommendation on the basis of their ratings, i.e., what should the requester do now with this translation? Edit it? Make minor edits? Redo it entirely? An additional modification to the tool consisted of eliminating or adding descriptors so that each category would have an equal number of descriptors (four for each component) and revising the scores assigned so that the maximum number of points possible would be 100. Some minor stylistic changes were made in the language of the descriptors.
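To make the revised scoring arithmetic concrete, the following sketch (in Python) tallies a completed worksheet using the descriptor point values given in the Scoring Worksheet of Appendix 1. The values for the lowest ("a") descriptors are not recoverable from this copy of the tool, so the value of 5 used for them here is an assumption, as are the function and variable names.

    # Sketch of the TQA scoring arithmetic. Descriptor point values come from
    # the Scoring Worksheet in Appendix 1, except the "a" values, which are
    # not recoverable here and are assumed to be 5.
    POINTS = {
        "TL":   {"a": 5, "b": 15, "c": 25, "d": 30},  # Target Language
        "FTA":  {"a": 5, "b": 10, "c": 20, "d": 25},  # Functional and Textual Adequacy
        "MEAN": {"a": 5, "b": 10, "c": 20, "d": 25},  # Non-Specialized Content
        "TERM": {"a": 5, "b": 10, "c": 15, "d": 20},  # Specialized Content and Terminology
    }

    def total_score(selections):
        """Sum the points of the descriptor checked for each component."""
        return sum(POINTS[component][descriptor]
                   for component, descriptor in selections.items())

    # Checking the top descriptor in every category yields the maximum
    # score of 100 points (30 + 25 + 25 + 20).
    assert total_score({"TL": "d", "FTA": "d", "MEAN": "d", "TERM": "d"}) == 100

The maxima also encode the weighting mentioned above: Target Language carries the largest share of the 100 points (30), and Specialized Content and Terminology the smallest (20).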
2.1.4 Rater Training
The Benchmark and Reliability sessions included training and rating sessions. The
training provided was substantially the same as that offered in the pilot testing and described in Colina (2008): it focused on the features and use of the tool, and it consisted of PDF materials (delivered via email), a PowerPoint presentation based on the contents of the PDF materials, and a question-and-answer session delivered online via an Internet and phone conferencing system.
Some revisions to the training reflect changes to the tool (including instructions on the new Cover Sheet), a few additional textual examples in Chinese, and a
scored, completed sample worksheet for the Spanish group. Samples were not included for the other languages due to time and personnel constraints. The training
served as a refresher for those raters who had already participated in the previous
pilot training and rating (Colina 2008).5
2.2 Results
The results of the data collection were submitted to statistical analysis to determine to what degree trained raters use the TQA tool consistently.
Table 1 and Figures 1a and 1b show the overall score of each text rated and the standard deviation of the individual rater scores around that overall score.
Table 1. Average scores and standard deviations per text

Text     # of raters   Average Score   Standard Deviation
Spanish
210      11            91.8             8.1
214      11            89.5            11.3
215      11            86.8            15.0
228      11            48.6            19.2
235      11            56.4            18.5
Avg.                                   14.42
Chinese
410      10            88.0            10.3
413      10            63.0            21.0
415      10            96.0             5.7
418      10            76.0            21.2
Avg.                                   14.55
Russian
312       9            59.4            16.1
314       9            82.8            15.6
315       9            75.6            22.1
316       9            67.8            29.0
Avg.                                   20.7
200-series texts are Spanish texts, 400s are Chinese and 300s are Russian. The standard deviations range from 8.1 to 19.2 for Spanish, from 5.7 to 21.2 for Chinese
and from 16.1 to 29.0 for Russian.
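As an illustration of how the per-text figures in Table 1 can be derived, the short sketch below computes the average score and standard deviation from a set of individual rater scores. The eleven scores shown are invented for the example, and the article does not state whether the population or the sample standard deviation was used; the population form is assumed here.

    import statistics

    # Invented scores from 11 raters for one text (11 raters rated each
    # Spanish text); the article's actual data appear in Table 1.
    scores = [95, 88, 72, 100, 85, 92, 64, 89, 94, 91, 90]

    average = statistics.mean(scores)    # the text's overall score
    spread = statistics.pstdev(scores)   # population standard deviation

    print(f"average = {average:.1f}, SD = {spread:.1f}")  # average = 87.3, SD = 10.0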
Question 1. For each text, how consistently do all raters rate the text?
The standard deviations in Table1 and Figures 1a and 1b offer a good measure of
how consistently individual texts are rated. A large standard deviation suggests
that there was less rater agreement (or that the raters differed more in their assessments). Figure 1b shows the average standard deviations per language. According to this, the Russian raters had the highest average standard deviation and were the least consistent in their ratings. This is in agreement with the reliability coefficients shown below (Table 5), as the Russian raters have the lowest inter-rater reliability. Table 2 shows average scores, standard deviations, and average standard deviations for each component of the tool, per text and per language. Figure 2
represents average standard deviations per component and per language.
Figure 1a. Average scores and standard deviations per text.

Figure 1b. Average standard deviations per language.
Table 2. Average scores and standard deviations for four components, per text and per language

                          TL            FTA           MEAN          TERM
Text     Raters       Mean    SD    Mean    SD    Mean    SD    Mean    SD
Spanish
210      11           27.7   2.6    23.6   2.3    22.7   2.6    17.7   3.4
214      11           27.3   4.7    20.9   7.0    23.2   2.5    18.2   3.4
215      11           28.6   2.3    22.3   4.7    18.2   6.8    17.7   3.4
228      11           15.0   7.7    11.4   6.0    10.9   6.3    11.4   4.5
235      11           15.9   8.3    12.3   6.5    13.6   6.4    14.5   4.7
Avg. SD                      5.12          5.3           4.92          3.88
Chinese
410      10           27.0   4.8    22.0   4.8    21.0   4.6    18.0   2.6
413      10           18.0   9.5    16.5   5.8    14.0   5.2    14.5   3.7
415      10           28.5   2.4    25.0   0.0    23.5   2.4    19.0   2.1
418      10           22.5   6.8    21.0   4.6    16.0   7.7    16.5   4.1
Avg. SD                      5.875         3.8           4.975         3.125
Russian
312       9           18.3   7.1    15.0   6.1    13.3   6.6    12.8   4.4
314       9           25.6   6.3    21.7   5.0    19.4   3.9    16.1   4.2
315       9           23.3   9.4    18.3   7.9    17.8   4.4    16.1   4.2
316       9           20.0  10.3    16.7   7.9    17.2   7.1    13.9   6.5
Avg. SD                      8.275         6.725         5.5           4.825
All languages Avg. SD        6.3           5.3           5.1           3.9
There does not appear to be an obvious connection between standard deviations and components. Although the components Target Language (TL) and Functional and Textual Adequacy (FTA) generally have higher standard deviations (i.e., ratings are less consistent), this is not always the case, as seen in the Chinese data (FTA). One would in fact expect the FTA category to exhibit the highest standard deviations, given its more holistic nature; yet the data do not bear out this hypothesis, as the TL component also shows standard deviations that are higher than those of Non-Specialized Content (MEAN) and Specialized Content and Terminology (TERM).
Question 2. How consistently do raters in the first session (Benchmark) rate the
texts?
The inter-rater reliability for the Spanish and the Chinese raters is remarkably high; however, the inter-rater reliability for the Russian raters is too low (Table 3).
Figure 2. Average standard deviations per tool component and per language.
Table 3. Reliability coefficients for benchmark ratings

Language   Reliability coefficient
Spanish    .953
Chinese    .973
Russian    .128
This, in conjunction with the Reliability Testing results, leads us to believe in the
presence of other unknown factors, unrelated to the tool, responsible for the low
reliability of the Russian raters.
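The article does not name the statistic behind these reliability coefficients; one common choice for this design is Cronbach's alpha computed over a texts-by-raters score matrix, treating each rater as an item. The sketch below, with invented scores, assumes that choice.

    import numpy as np

    def cronbach_alpha(ratings):
        """Cronbach's alpha for a (texts x raters) score matrix,
        treating each rater as an 'item'."""
        k = ratings.shape[1]                         # number of raters
        rater_vars = ratings.var(axis=0, ddof=1)     # variance of each rater's scores
        total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the per-text sums
        return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

    # Invented example: 5 texts rated by 3 raters who largely agree,
    # so alpha comes out close to 1 (about .99 here).
    ratings = np.array([
        [92, 90, 95],
        [88, 85, 90],
        [50, 55, 48],
        [75, 70, 78],
        [60, 64, 58],
    ])
    print(f"alpha = {cronbach_alpha(ratings):.3f}")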
Question 3. How consistently do raters in the second session (Reliability) rate the
texts? How do the reliability coefficients compare for the Benchmark and the Reliability Testing?
The results of the reliability raters mirror those of the benchmark raters: the Spanish raters achieve a very good inter-rater reliability coefficient and the Chinese raters an acceptable one, but the inter-rater reliability for the Russian raters is very low (Table 4).
Table 5 (see also Tables 3 and 4) shows that there was a slight drop in inter-rater reliability for the Chinese raters (from the benchmark rating to the reliability rating), but the Spanish raters achieved remarkable inter-rater reliability at both rating sessions. The slight drop among the Russian raters from the first to the second session is negligible; in any case, the inter-rater reliability is too low.
Table 4. Reliability coefficients for reliability ratings

Language   Reliability coefficient
Spanish    .934
Chinese    .780
Russian    .118

Table 5. Reliability coefficients for the Benchmark and the Reliability Testing

Language   Benchmark   Reliability
Spanish    .953        .934
Chinese    .973        .780
Russian    .128        .118
Question 4. How consistently do raters rate each component of the tool? Are there
some test components where there is higher rater reliability?
The coefficients for the Spanish raters show very good reliability, with excellent coefficients for the first three components; the numbers for the Chinese raters are also very good, but the coefficients for the Russian raters are once again low, although some consistency can be identified for the FTA and MEAN components (Table 6).
Table 6. Reliability coefficients for the four components of the tool (all raters per language group)

Language   TL     FTA    MEAN   TERM
Spanish    .952   .929   .926   .848
Chinese    .844   .844   .864   .783
Russian    .367   .479   .492   .292
In sum, very good reliability was obtained for Spanish and Chinese raters, for the
two testing sessions (Benchmark and Reliability Testing) as well as for all components of the tool. Reliability scores for the Russian raters are low. These results are
in agreement with the standard deviation data presented in Tables 1–2, Figures 1a and 1b, and Figure 2. All of this leads us to believe that whatever the cause
for the Russian coefficients, it was not related to the tool itself.
Question 5. Is there a difference in scoring between translators and teachers?
Table 7a and Table 7b show the scoring, in terms of average scores and standard deviations, for the translators and the teachers for all texts. Figures 3 and 4 show the mean scores and times for Spanish raters, comparing teachers and translators.
Table 7a. Average scores and standard deviations for consultants and translators

            Score            Time
Text     Mean    SD      Mean     SD
210      93.3    7.5     75.8    59.4
214      93.3   12.1     94.2   101.4
215      85.0   17.9     36.3    18.3
228      46.7   20.7     37.5    22.3
235      46.7   18.6     49.5    38.9
410      91.4    7.5     46.0    22.1
413      62.9   21.0     40.7    13.7
415      96.4    4.8     26.1    15.4
418      69.3   22.1     52.4    22.2
312      52.5   15.1     26.7     2.6
314      88.3   10.3     22.5     4.2
315      74.2   26.3     28.7     7.8
316      63.3   32.7     25.8     6.6
Table 7b. Average scores and standard deviations for teachers

            Score            Time
Text     Mean    SD      Mean     SD
210      90.0    9.4     63.6    39.7
214      85.0    9.4     67.0    41.8
215      89.0   12.4     36.0    30.5
228      51.0   19.5     38.0    31.7
235      68.0   10.4     57.6    40.2
410      80.0   13.2     61.0    27.7
413      63.3   25.7     71.0    24.6
415      95.0    8.7     41.0    11.5
418      91.7    5.8     44.0     6.6
312      73.3    5.8     55.0    56.7
314      71.7   20.8     47.7    62.7
315      78.3   14.4     37.7    45.5
316      76.7   22.5     46.7    63.5
Figure 3. Mean scores for Spanish raters.
The corresponding data for Chinese appear in Figures 5 and 6, and for Russian in Figures 7 and 8.

Spanish teachers tend to rate somewhat higher than translators (3 out of 5 texts) and spend more time rating (all texts).
As with the Spanish raters, it is interesting to note that Chinese teachers rate either higher than or similarly to translators (Figure 5): only one text obtained lower ratings from teachers than from translators. Timing results also mirror those found for the Spanish subjects: teachers take longer to rate than translators (Figure 6).
Despite the low inter-rater reliability among Russian raters, the same trend found for the Chinese and the Spanish emerged when comparing Russian translators and teachers: Russian teachers rate similarly to or slightly higher than translators, and they clearly spend more time on the rating task than the translators do (Figures 7 and 8). This also mirrors the findings of the pre-pilot and pilot testing (Colina 2008).5
In order to investigate the irregular behavior of the Russian raters and to try to explain their low inter-rater reliability, the correlation between the total score and the recommendation (the field "rec") issued by each rater was considered. This is explored in Table 8. One would expect a relatively high (negative) correlation because of the inverse relationship between a high score and a low recommendation. As illustrated in the three sub-tables below, all Spanish raters, with the exception of SP02PB, show a strong correlation between the recommendation and the total score, ranging from 0.854 (SP01VS) to 0.981 (SP02MC).
Figure 4. Time for Spanish raters.

Figure 5. Mean scores for Chinese raters.

Figure 6. Time for Chinese raters.

Figure 7. Mean scores for Russian raters.

Figure 8. Time for Russian raters.
The results are similar for the Chinese raters, all of whom correlate very highly between the recommendation and the total score, ranging from 0.867 (CH01BJ) to a perfect 1.00 (CH02JG). The results are different for the Russian raters, however.
It appears that three raters (RS01EM, RS02MK, and RS01NM) do not show high correlations between their recommendations and their total scores. A closer look at these raters in particular is warranted, as is a closer look at RS02LB, who was excluded from the correlation analysis due to a lack of variability (the rater uniformly recommended a 2 for all texts, regardless of the total score he or she assigned). The other Russian raters exhibited strong correlations. This result suggests some unusual behavior in the Russian raters, independent of the tool design and tool features, as their scores and overall recommendations do not correlate highly, as expected.
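The check reported in Table 8 can be reproduced with Pearson's correlation between each rater's total scores and recommendations. The sketch below uses invented data; it produces a negative coefficient, matching the expected inverse relationship (the values in Table 8 are presumably magnitudes), and it guards against the RS02LB case, where a constant recommendation leaves the correlation undefined.

    import statistics

    def score_rec_correlation(totals, recs):
        """Pearson correlation between a rater's total scores and
        recommendations; undefined if either series is constant."""
        if len(set(totals)) < 2 or len(set(recs)) < 2:
            return None  # e.g. a rater who recommends 2 for every text
        return statistics.correlation(totals, recs)  # Python 3.10+

    # Invented rater: high totals go with low recommendation codes,
    # so the correlation is strongly negative (about -0.99 here).
    totals = [92, 88, 49, 56, 87]
    recs = [1, 1, 3, 3, 1]
    print(score_rec_correlation(totals, recs))

    # A rater with a constant recommendation has no defined correlation.
    print(score_rec_correlation(totals, [2, 2, 2, 2, 2]))  # None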
Table 8 (3 sub-tables). Correlation between recommendation and total score

8.1 Spanish raters
SP04AR  SP01JC  SP01VS  SP02JA  SP02LA  SP02PB  SP02AB  SP01PC  SP01CC  SP02MC  SP01PS
0.923   0.958   0.854   0.938   0.966   0.421   0.942   0.975   0.913   0.981

8.2 Chinese raters
CH01CK  CH01FL
0.935   1.000   0.955   0.943   0.980   0.996   0.894   0.980   0.867

8.3 Russian raters
RS02LB  RS02MK  RS01SM  RS01NM  RS01RW
0.998   0.115   n/a     0.500   0.500   0.933   1.000   0.982   0.993   0.926   0.938
3. Conclusions
As in Colina (2008), testing showed that the TQA tool exhibits good inter-rater
reliability for all language groups and texts, with the exception of Russian. It was
also shown that the low reliability of the Russian raters' scores is probably due to
factors unrelated to the tool itself. At this point, it is not possible to determine
what these factors may have been; yet further research with Russian teachers and
translators may provide insights about the reasons for the low inter-rater reliability
obtained for this group in the current study. In addition, the findings are in line
with those of Colina (2008) with regard to the rating behavior of translators and
teachers: Although translators and teachers exhibit similar behavior, teachers tend
to spend more time rating and their scores are slightly higher than those of translators. While, in principle, it may appear that translators would be more efficient
raters, one would have to consider the context of evaluation to select an ideal rater
for a particular evaluation task. Because they spent more time rating (and one assumes reflecting on their rating), teachers may be more apt evaluators in a formative context, where feedback is expected from the rater. Teachers may also be better
at reflecting on the nature of the developmental process and therefore better able
to offer more adequate evaluation of a process and/or a translator (versus evaluation of a product). However, when rating involves a product and no feedback is
expected (e.g. industry, translator licensing exams, etc.), a more efficient translator
rater may be more suitable to the task. In sum, the current findings suggest that
professional translators and language teachers could be similarly qualified to assess
translation quality by means of the TQA tool. Which of the two types of professionals is better suited to a specific rating task will probably depend on the
purpose and goal of evaluation. Further research comparing the skills of these two
groups in different evaluation contexts is necessary to confirm this view.
In summary, the results of empirical tests of the functional-componential tool
continue to offer evidence for the proposed approach and to warrant additional
testing and research. Future research needs to focus on testing on a larger scale,
with more subjects and various text types.
Notes
* The research described here was funded by the Robert Wood Johnson Foundation. It was part
of Phase II of the Translation Quality Assessment project of the Hablamos Juntos National
Program. I would like to express my gratitude to the Foundation, to the Hablamos Juntos National Program, and to the Program Director, Yolanda Partida, for their support of translation in
the USA. I owe much gratitude to Yolanda Partida and Felicia Batts for comments, suggestions
References
Bell, Roger T. 1991. Translation and Translating. London: Longman.
Bowker, Lynne. 2001. "Towards a Methodology for a Corpus-Based Approach to Translation Evaluation". Meta 46:2. 345–364.
Cao, Deborah. 1996. "A Model of Translation Proficiency". Target 8:2. 325–340.
Carroll, John B. 1966. "An Experiment in Evaluating the Quality of Translations". Mechanical Translation 9:3–4. 55–66.
Colina, Sonia. 2003. Teaching Translation: From Research to the Classroom. New York: McGraw
Hill.
Colina, Sonia. 2008. "Translation Quality Evaluation: Empirical Evidence for a Functionalist Approach". The Translator 14:1. 97–134.
Gerzymisch-Arbogast, Heidrun. 2001. "Equivalence Parameters and Evaluation". Meta 46:2. 227–242.
Hatim, Basil and Ian Mason. 1997. The Translator as Communicator. London and New York:
Routledge.
Hönig, Hans. 1997. "Positions, Power and Practice: Functionalist Approaches and Translation Quality Assessment". Current Issues in Language and Society 4:1. 6–34.
House, Juliane. 1997. Translation Quality Assessment: A Model Revisited. Tübingen: Narr.
House, Juliane. 2001. "Translation Quality Assessment: Linguistic Description versus Social Evaluation". Meta 46:2. 243–257.
Lauscher, Susanne. 2000. "Translation Quality Assessment: Where Can Theory and Practice Meet?" The Translator 6:2. 149–168.
Neubert, Albrecht. 1985. Text und Translation. Leipzig: Enzyklopädie.
Nida, Eugene. 1964. Toward a Science of Translation. Leiden: Brill.
Nida, Eugene and Charles Taber. 1969. The Theory and Practice of Translation. Leiden: Brill.
Nord, Christiane. 1997. Translating as a Purposeful Activity: Functionalist Approaches Explained. Manchester: St. Jerome.
PACTE. 2008. "First Results of a Translation Competence Experiment: Knowledge of Translation and Efficacy of the Translation Process". John Kearns, ed. Translator and Interpreter Training: Issues, Methods and Debates. London and New York: Continuum. 104–126.
Reiss, Katharina. 1971. Möglichkeiten und Grenzen der Übersetzungskritik. München: Hueber.
Reiss, Katharina and Hans Vermeer. 1984. Grundlegung einer allgemeinen Translationstheorie. Tübingen: Niemeyer.
Van den Broeck, Raymond. 1985. "Second Thoughts on Translation Criticism: A Model of its Analytic Function". Theo Hermans, ed. The Manipulation of Literature: Studies in Literary Translation. London and Sydney: Croom Helm. 54–62.
Williams, Malcolm. 2001. "The Application of Argumentation Theory to Translation Quality Assessment". Meta 46:2. 326–344.
Williams, Malcolm. 2004. Translation Quality Assessment: An Argumentation-Centered Approach. Ottawa: University of Ottawa Press.
Résumé

Colina (2008) proposes a componential and functional approach to the evaluation of translation quality and reports on the results of a pilot test of a tool designed for this approach. The results show a high rate of inter-rater reliability and justify continued testing. This article presents an experiment designed to test the approach as well as the tool. Data were collected during two rounds of testing. A group of 30 raters, composed of Spanish, Chinese, and Russian translators and teachers, evaluated 4 or 5 translated texts each. The results show that the tool provides a good rate of inter-rater reliability for all language and text groups, with the exception of Russian; they also suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. These findings confirm those of Colina (2008).
Appendix 1: Tool

Benchmark Rating Session

Time Rating Starts:
Delivery Date:

TRANSLATION BRIEF
Source Language:
Target Language: Spanish, Russian, Chinese
Text Type:
Text Title:
Target Audience:
Purpose of Document:
Date Completed:
Contact Information:
Date Received:
Total Score:
Notes/Recommended Edits:
RATING INSTRUCTIONS:
1. Carefully read the instructions for the review of the translated text. Your decisions and evaluation should be
based on these instructions only.
2. Check the description that best fits the text given in each one of the categories.
3. It is recommended that you read the target text without looking at the English and score the Target
Language and Functional categories.
4. Examples or comments are not required, but they can be useful to help support your decisions or to provide
rationale for your descriptor selection.
1. TARGET LANGUAGE (check one box)

1.a The translation reveals serious language proficiency issues: ungrammatical use of the target language, spelling mistakes. The translation is written in some sort of third language (neither the source nor the target). The structure of the source language dominates to the extent that it cannot be considered a sample of target language text. The amount of transfer from the source cannot be justified by the purpose of the translation. The text is extremely difficult to read, bordering on being incomprehensible.

1.b The text contains some unnecessary transfer of elements/structure from the source text. The structure of the source language shows up in the translation and affects its readability. The text is hard to comprehend.

1.c Although the target text is generally readable, there are problems and awkward expressions resulting, in most cases, from unnecessary transfer from the source text.

1.d The translated text reads similarly to texts originally written in the target language that respond to the same purpose, audience and text type as those specified for the translation in the brief. Problems/awkward expressions are minimal, if existent at all.

Examples/Comments:
2. FUNCTIONAL AND TEXTUAL ADEQUACY (check one box)

2.a Disregard for the goals, purpose, function and audience of the text. The text was translated without considering textual units, textual purpose, genre, or the needs of the audience (cultural, linguistic, etc.). Cannot be repaired with revisions.

2.b The translated text gives some consideration to the intended purpose and audience for the translation, but misses some important aspect/s of it (e.g. level of formality, some aspect of its function, needs of the audience, cultural considerations, etc.). Repair requires effort.

2.c The translated text approximates to the goals, purpose (function) and needs of the intended audience, but it is not as efficient as it could be, given the restrictions and instructions for the translation. Can be repaired with suggested edits.

2.d The translated text accurately accomplishes the goals, purpose (function: informative, expressive, persuasive) set for the translation and intended audience (including level of formality). It also attends to the cultural needs and characteristics of the audience. Minor or no edits needed.

Examples/Comments:
3. NON-SPECIALIZED CONTENT-MEANING (check one box)

3.a The translation reflects or contains important unwarranted deviations from the original. It contains inaccurate renditions and/or important omissions and additions that cannot be justified by the instructions. Very defective comprehension of the original text.

3.b There have been some changes in meaning, omissions or/and additions that cannot be justified by the translation instructions. Translation shows some misunderstanding of the original and/or translation instructions.

3.c Minor alterations in meaning, additions or omissions.

3.d

Examples/Comments:
4. SPECIALIZED CONTENT AND TERMINOLOGY (check one box)

4.a

4.b Serious/frequent mistakes involving terminology and/or specialized content.

4.c A few terminological errors, but the specialized content is not seriously affected.

4.d Accurate and appropriate rendition of the terminology. It reflects a good command of terms and content specific to the subject.

Examples/Comments:
TOTAL SCORE:
SCORING WORKSHEET

Component: Target Language
Category   Value   Score
1.a
1.b        15
1.c        25
1.d        30

Component: Functional and Textual Adequacy
Category   Value   Score
2.a
2.b        10
2.c        20
2.d        25

Component: Non-Specialized Content
Category   Value   Score
3.a
3.b        10
3.c        20
3.d        25

Component: Specialized Content and Terminology
Category   Value   Score
4.a
4.b        10
4.c        15
4.d        20
Tally Sheet

Component                                 Category Rating   Score Value
Target Language
Functional and Textual Adequacy
Non-Specialized Content
Specialized Content and Terminology
Author's address
Sonia Colina
Department of Spanish and Portuguese
The University of Arizona
Modern Languages 545
Tucson, AZ 85721-0067
United States of America
scolina@email.arizona.edu