0. Introduction
Recent US federal mandates (e.g. White House Executive Order #13166),1 requiring health care providers who are recipients of federal funds to provide language
translation and interpretation for patients with limited English proficiency (LEP),
have brought the long-standing issue of translation quality to a wider audience of
health care professionals (e.g. managers, decision makers, industry stakeholders,
private foundations), who generally feel unprepared to address the topic. A striking example of how challenging quality evaluation can be for health care organizations is illustrated by the experience of Hablamos Juntos, an initiative funded by
the Robert Wood Johnson Foundation to develop practical solutions to language
barriers to health care.
Several healthcare providers (including hospitals) working with the program
identified what they believed were the best translations available. Eighty-seven
different than that envisioned by the writer of the original,3 one can imagine the
difficulties entailed by equating quality with equivalence of response. Finally, as
with many other theoretical approaches, reader-response testing is time-consuming and difficult to apply to actual translations. At a minimum, careful selection
of readers is necessary to make sure that they belong to the intended audience for
the translation.
1.2.2 Textual and pragmatic approaches
Textual and pragmatic approaches have made a significant contribution to the
field of translation evaluation by shifting the focus from counting errors at the
word or sentence level to evaluating texts and translation goals, giving the reader
and communication a much more prominent role. Yet, despite these advances,
none of these approaches can be said to have been widely adopted by either professionals or scholars.
Some models have been criticized because they focus too much on the source
text (Reiss 1971) or on the target text (Skopos) (Reiss and Vermeer 1984, Nord
1997); Reiss argues that the text type and function of the source text are the most important factors in translation and that quality should be assessed with respect to them. For Skopos Theory, it is the text type and function of the translation that are of paramount importance in determining the quality of the translation.
House's (1997, 2001) functional-pragmatic model relies on an analysis of the linguistic-situational features of the source and target texts, a comparison of the two texts, and the resulting assessment of their match. The basic measure of quality is that the textual profile and function of the translation match those of the original, the goal being functional equivalence between the original and the translation. One objection that has been raised against House's functional model is its dependence on the notion of equivalence, often a vague and controversial term in translation studies (Hönig 1997). This is a problem because translations sometimes
are commissioned for a somewhat different function than that of the original; in
addition, a different audience and time may require a slightly different function
than that of the source text (see Hönig 1997 for more on the problematic notion
of equivalence). These scenarios are not contemplated by equivalence-based theories of translation. Furthermore, one can argue that what qualifies as equivalent
is as variegated as the notion of quality itself. Other equivalence-based models of
evaluation are Gerzymisch-Arbogast (2001), Neubert (1985), and Van den Broeck
(1985). In sum, the reliance on an a priori notion of equivalence is problematic
and limiting in descriptive as well as explanatory value.
An additional objection against textual and pragmatic approaches is that they
are not precise about how evaluation is to proceed after the analysis of the source or
the target text is complete or after the function of the translation has been established
as the guiding criterion for making translation decisions. This obviously affects the
ease with which the models can be applied to texts in professional settings. Hönig,
for instance, after presenting some strong arguments for a functionalist approach
to evaluation, does not offer any concrete instantiation of the model, other than in
the form of some general advice for translator trainers. He comes to the conclusion
that "the speculative element will remain at least as long as there are no hard and fast empirical data which serve to prove what a typical reader's responses are like" (1997:32).4 The same criticism regarding the difficulty involved in applying textual
and theoretical models to professional contexts is raised by Lauscher (2000). She
explores possible ways to bridge the gap between theoretical and practical quality
assessment, concluding that translation criticism could move closer to practical
needs "by developing a comprehensive translation tool" (2000:164).
Other textual approaches to quality evaluation are the argumentation-centered approach of Williams (2001, 2004), in which evaluation is based on argumentation and rhetorical structure, and corpus-based approaches (Bowker 2001).
The argumentation-centered approach is also equivalence-based, as a translation "must reproduce the argument structure of ST to meet minimum criteria of adequacy" (Williams 2001:336). Bowker's corpus-based model uses a comparatively
large and carefully selected collection of naturally occurring texts that are stored
in machine-readable form as a benchmark against which to compare and evaluate specialized student translations. Although Bowker (2001) presents a novel,
valuable proposal for the evaluation of students' translations, it does not provide
specific indications as to how translations should be graded (2001:346). In sum,
argumentation and corpus-based approaches, although presenting crucial aspects
of translation evaluation, are also complex and difficult to apply in professional
environments (and one could argue in the classroom as well).
1.3 The functional-componential approach (Colina 2008)
Colina (2008) argues that current translation quality assessment methods have
not achieved a middle ground between theory and applicability; while anecdotal
approaches lack a theoretical framework, the theoretical models often do not contain testable hypotheses (i.e., they are non-verifiable) and/or are not developed
with a view towards application in professional and/or teaching environments. In
addition, she contends that theoretical models usually focus on partial aspects of
translation (e.g., reader response, textual aspects, pragmatic aspects, relationship
to the source, etc.): Perhaps due to practical limitations and the sheer complexity
of the task, some of these approaches overlook the fact that quality in translation
is a multifaceted reality, and that a general comprehensive approach to evaluation
may need to address multiple components of quality simultaneously.
To address these gaps, Colina (2008) proposes a functional-componential approach, together with an evaluation tool that raters can apply without significant training. Pilot testing results indicate good inter-rater
reliability for the tool and the need for further testing. The current paper focuses
on a second experiment designed to further test the approach and tool proposed
in Colina (2008).
2. Second phase of TQA testing: Methods and Results
2.1 Methods
One of the most important limitations of the experiment in Colina (2008) concerns the number and grouping of participants. Given the project objective of ensuring applicability across languages frequently used in the USA, subjects were recruited in three languages: Spanish, Russian, and Chinese. As a result,
resources and time for recruitment had to be shared amongst the languages, with
smaller numbers of subjects per language group. The testing described in the current experiment includes more subjects and additional texts. More specifically, the
study reported in this paper aims:
I. To test the TQA tool again for inter-rater reliability (i.e. to what degree trained
raters use the TQA tool consistently) by answering the following questions:
Question 1. For each text, how consistently do all raters rate the text?
Question 2. How consistently do raters in the first session (Benchmark) rate
the texts?
Question 3. How consistently do raters in the second session (Reliability) rate
the texts?
Question 4. How consistently do raters rate each component of the tool? Are
there some test components where there is higher rater reliability?
II. To compare the rating skills/behavior of translators and teachers: Is there a difference in scoring between translators and teachers? (Question 5, Section 2.2).
Data were collected during two rounds of testing: the first, referred to as the Benchmark Testing, included 9 raters; the second session, the Reliability Testing, included 21 raters. Benchmark and Reliability sessions consisted of a short training session, followed by a rating session. Raters were asked to rate 4–5 translated texts (depending on the language) and had one afternoon and one night to complete the
task. After their evaluation worksheets had been submitted, raters were required
to submit a survey on their experience using the tool. They were paid for their
participation.
2.1.1 Raters
Raters were drawn from the pool used for the pre-pilot and pilot testing sessions
reported in Colina (2008) (see Colina [2008] for selection criteria and additional
details). A call was sent via email to all those raters selected for the pre-pilot and
pilot testing (including those who were initially selected but did not take part). All
raters available participated in this second phase of testing.
As in Colina (2008), it was hypothesized that similar rating results would be obtained among members of the same group. Therefore, raters were recruited according to membership in one of two groups: professional translators and language teachers (language professionals who are not professional translators).
Membership was assigned according to the same criteria as in Colina (2008). All
selected raters exhibited linguistic proficiency equivalent to that of a native (or
near-native) speaker in the source and in one of the target languages.
Professional translators were defined as language professionals whose income
comes primarily from providing translation services. Significant professional experience (5 years minimum; most had 12–20 years of experience), membership in
professional organizations, and education in translation and/or a relevant field were
also needed for inclusion in this group. Recruitment for these types of individuals was primarily through the American Translators Association (ATA). Although
only two applicants were ATA certified, almost all were ATA affiliates (members).
Language teachers were individuals whose main occupation was teaching
language courses, at a university or other educational institution. They may have
had some translation experience, but did not rely on translation as their source
of income. A web search of teaching institutions with known foreign language
programs was used for this recruitment. We reached out to schools throughout
the country at both the community college and university levels. The definition of
teacher did not preclude graduate student instructors.
Potential raters were assigned to the above groups on the basis of the information provided in their resume or curriculum vitae and a language background
questionnaire included in a rater application.
The bilingual group in Colina (2008) was eliminated from the second experiment, as subjects were only available for one of the languages (Spanish). Translation competence models and research suggest that bilingualism is only one component of translation competence (Bell 1991, Cao 1996, Hatim and Mason 1997,
PACTE 2008). Nonetheless, since evaluating translation products is not the same
as translating, it is reasonable to hypothesize that other language professionals,
such as teachers, may have the competence necessary to evaluate translations; this
may be particularly true in cases, such as the current project, in which the object of
evaluation is not translator competence, but translation products. This hypothesis
would be borne out if the ratings provided by translators and teachers are similar.
As mentioned above, data were collected during two rounds of testing: the first one, the Benchmark Testing, included 9 raters (3 Russian, 3 Chinese, 3 Spanish); these raters were asked to evaluate 4–5 texts (per language) that had been previously selected as clearly of good or bad quality by expert consultants in each language. The second session, the Reliability Testing, included 21 raters, distributed
as follows: eight Spanish, seven Chinese, and six Russian raters.
Differences across groups reflect general features of each language group in the US.
Among the translators, the Russians had degrees in Languages, History and Translating, Engineering, and Nursing from Russian and US universities and experience ranging from 12 to 22 years; the Chinese translators' experience ranged from 6 to 30 years, and their education included Chinese language and literature, Philosophy (MA), English (PhD), Neuroscience (PhD) and Medicine (MD), with degrees obtained in China and the US. Their Spanish counterparts' experience varied from 5 to 20 years, and their degrees included areas such as Education, Spanish and English Literature, Latin American Studies (MA), and Creative Writing (MA).
The Spanish and Russian teachers were perhaps the most uniform groups, including college instructors (PhD students) with MAs in Spanish or Slavic Linguistics, Literature, and Communication, and one college professor of Russian. With one
exception, they were all native speakers of Spanish or Russian with formal education in the country of origin. Chinese teachers were college instructors (PhD
students) with MAs in Chinese, one college professor (PhD in Spanish) and an
elementary school teacher and tutor (BA in Chinese). They were all native speakers of Chinese.
2.1.2 Texts
As mentioned above, experienced translators serving as language consultants selected the texts to be used in the rating sessions. Three consultants were instructed to identify health education texts translated from English into their language. Texts were to be publicly available on the Internet; half were to be very good and the other half very poor, as judged on reading the text. Those texts were used for the Benchmark session of testing, during which they were rated by the consultants and two additional expert translators. The texts on which there was the most agreement in rating were selected for the Reliability Testing. The Reliability texts comprised five Spanish texts (three good and two bad), four Russian texts, and four Chinese texts (for each of these two languages, two texts of good quality and two of bad quality), making up a total of thirteen texts.
2.1.3 Tool
The tool tested in Colina (2008) was modified to include a cover sheet consisting of two parts. Part I is to be completed by the person requesting the evaluation (i.e. the Requester) and read by the rater before he/she starts work. It contains the Translation Brief, relative to which the evaluation must always take place, and the Quality Criteria, clarifying requester priorities among components. The TQA Evaluation Tool included in Appendix 1 contains a sample Part I, as specified by Hablamos Juntos (the Requester), for the evaluation of a set of health education materials. The Quality Criteria section reflects the weights assigned to the four components in the Scoring Worksheet at the end of the tool. Part II of the Cover Sheet is to be filled in by the raters after the rating is complete. An Assessment Summary and Recommendation section was included to allow raters the opportunity to offer an action recommendation on the basis of their ratings, i.e., what should the requester do now with this translation? Edit it? Make minor edits? Redo it entirely? An additional modification to the tool consisted of eliminating or adding descriptors so that each category would have an equal number of descriptors (four for each component) and revising the scores assigned so that the maximum number of points possible would be 100. Some minor stylistic changes were made in the language of the descriptors.
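To make the revised scoring arithmetic concrete, the following sketch (in Python) tallies a completed worksheet using the descriptor point values given in the Scoring Worksheet of Appendix 1. The values for the lowest ("a") descriptors are not recoverable from this copy of the tool, so the value of 5 used for them here is an assumption, as are the function and variable names.

    # Sketch of the TQA scoring arithmetic. Descriptor point values come from
    # the Scoring Worksheet in Appendix 1, except the "a" values, which are
    # not recoverable here and are assumed to be 5.
    POINTS = {
        "TL":   {"a": 5, "b": 15, "c": 25, "d": 30},  # Target Language
        "FTA":  {"a": 5, "b": 10, "c": 20, "d": 25},  # Functional and Textual Adequacy
        "MEAN": {"a": 5, "b": 10, "c": 20, "d": 25},  # Non-Specialized Content
        "TERM": {"a": 5, "b": 10, "c": 15, "d": 20},  # Specialized Content and Terminology
    }

    def total_score(selections):
        """Sum the points of the descriptor checked for each component."""
        return sum(POINTS[component][descriptor]
                   for component, descriptor in selections.items())

    # Checking the top descriptor in every category yields the maximum
    # score of 100 points (30 + 25 + 25 + 20).
    assert total_score({"TL": "d", "FTA": "d", "MEAN": "d", "TERM": "d"}) == 100

The maxima also encode the weighting mentioned above: Target Language carries the largest share of the 100 points (30), and Specialized Content and Terminology the smallest (20).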
2.1.4 Rater Training
The Benchmark and Reliability sessions included training and rating sessions. The
training provided was substantially the same as that offered in the pilot testing and described in Colina (2008): it focused on the features and use of the tool, and it consisted of PDF materials (delivered via email), a PowerPoint presentation based on the contents of the PDF materials, and a question-and-answer session delivered online via an Internet and phone conferencing system.
Some revisions to the training reflect changes to the tool (including instructions on the new Cover Sheet), a few additional textual examples in Chinese, and a
scored, completed sample worksheet for the Spanish group. Samples were not included for the other languages due to time and personnel constraints. The training
served as a refresher for those raters who had already participated in the previous
pilot training and rating (Colina 2008).5
2.2 Results
The results of the data collection were submitted to statistical analysis to determine to what degree trained raters use the TQA tool consistently.
Table 1 and Figures 1a and 1b show the overall score of each text rated and the standard deviation of the individual rater scores around that overall score.
Table 1. Average scores and standard deviations per text

Text     # of raters   Average Score   Standard Deviation
Spanish
210      11            91.8             8.1
214      11            89.5            11.3
215      11            86.8            15.0
228      11            48.6            19.2
235      11            56.4            18.5
Avg.                                   14.42
Chinese
410      10            88.0            10.3
413      10            63.0            21.0
415      10            96.0             5.7
418      10            76.0            21.2
Avg.                                   14.55
Russian
312       9            59.4            16.1
314       9            82.8            15.6
315       9            75.6            22.1
316       9            67.8            29.0
Avg.                                   20.7
200-series texts are Spanish texts, 400s are Chinese and 300s are Russian. The standard deviations range from 8.1 to 19.2 for Spanish, from 5.7 to 21.2 for Chinese
and from 16.1 to 29.0 for Russian.
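As an illustration of how the per-text figures in Table 1 can be derived, the short sketch below computes the average score and standard deviation from a set of individual rater scores. The eleven scores shown are invented for the example, and the article does not state whether the population or the sample standard deviation was used; the population form is assumed here.

    import statistics

    # Invented scores from 11 raters for one text (11 raters rated each
    # Spanish text); the article's actual data appear in Table 1.
    scores = [95, 88, 72, 100, 85, 92, 64, 89, 94, 91, 90]

    average = statistics.mean(scores)    # the text's overall score
    spread = statistics.pstdev(scores)   # population standard deviation

    print(f"average = {average:.1f}, SD = {spread:.1f}")  # average = 87.3, SD = 10.0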
Question 1. For each text, how consistently do all raters rate the text?
The standard deviations in Table1 and Figures 1a and 1b offer a good measure of
how consistently individual texts are rated. A large standard deviation suggests
that there was less rater agreement (or that the raters differed more in their assessments). Figure 1b shows the average standard deviations per language. According to this, the Russian raters had the highest average standard deviation and were the least consistent in their ratings. This is in agreement with the reliability coefficients shown below (Table 5), as the Russian raters have the lowest inter-rater reliability. Table 2 shows average scores, standard deviations, and average standard deviations for each component of the tool, per text and per language. Figure 2
represents average standard deviations per component and per language.
Figure 1a. Average scores and standard deviations per text.

Figure 1b. Average standard deviations per language.
Table 2. Average scores and standard deviations for four components, per text and per language

                          TL            FTA           MEAN          TERM
Text     Raters       Mean    SD    Mean    SD    Mean    SD    Mean    SD
Spanish
210      11           27.7   2.6    23.6   2.3    22.7   2.6    17.7   3.4
214      11           27.3   4.7    20.9   7.0    23.2   2.5    18.2   3.4
215      11           28.6   2.3    22.3   4.7    18.2   6.8    17.7   3.4
228      11           15.0   7.7    11.4   6.0    10.9   6.3    11.4   4.5
235      11           15.9   8.3    12.3   6.5    13.6   6.4    14.5   4.7
Avg. SD                      5.12          5.3           4.92          3.88
Chinese
410      10           27.0   4.8    22.0   4.8    21.0   4.6    18.0   2.6
413      10           18.0   9.5    16.5   5.8    14.0   5.2    14.5   3.7
415      10           28.5   2.4    25.0   0.0    23.5   2.4    19.0   2.1
418      10           22.5   6.8    21.0   4.6    16.0   7.7    16.5   4.1
Avg. SD                      5.875         3.8           4.975         3.125
Russian
312       9           18.3   7.1    15.0   6.1    13.3   6.6    12.8   4.4
314       9           25.6   6.3    21.7   5.0    19.4   3.9    16.1   4.2
315       9           23.3   9.4    18.3   7.9    17.8   4.4    16.1   4.2
316       9           20.0  10.3    16.7   7.9    17.2   7.1    13.9   6.5
Avg. SD                      8.275         6.725         5.5           4.825
All languages Avg. SD        6.3           5.3           5.1           3.9
There does not appear to be an obvious connection between standard deviations and components. Although the components Target Language (TL) and Functional and Textual Adequacy (FTA) generally have higher standard deviations (i.e., ratings are less consistent), this is not always the case, as seen in the Chinese data (FTA). One would in fact expect the FTA category to exhibit the highest standard deviations, given its more holistic nature; yet the data do not bear out this hypothesis, as the TL component also shows standard deviations that are higher than those of Non-Specialized Content (MEAN) and Specialized Content and Terminology (TERM).
Question 2. How consistently do raters in the first session (Benchmark) rate the
texts?
The inter-rater reliability for the Spanish and the Chinese raters is remarkably high; however, the inter-rater reliability for the Russian raters is too low (Table 3).
Figure 2. Average standard deviations per tool component and per language.
Table 3. Reliability coefficients for benchmark ratings

Language   Reliability coefficient
Spanish    .953
Chinese    .973
Russian    .128
This, in conjunction with the Reliability Testing results, leads us to believe in the
presence of other unknown factors, unrelated to the tool, responsible for the low
reliability of the Russian raters.
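The article does not name the statistic behind these reliability coefficients; one common choice for this design is Cronbach's alpha computed over a texts-by-raters score matrix, treating each rater as an item. The sketch below, with invented scores, assumes that choice.

    import numpy as np

    def cronbach_alpha(ratings):
        """Cronbach's alpha for a (texts x raters) score matrix,
        treating each rater as an 'item'."""
        k = ratings.shape[1]                         # number of raters
        rater_vars = ratings.var(axis=0, ddof=1)     # variance of each rater's scores
        total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the per-text sums
        return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

    # Invented example: 5 texts rated by 3 raters who largely agree,
    # so alpha comes out close to 1 (about .99 here).
    ratings = np.array([
        [92, 90, 95],
        [88, 85, 90],
        [50, 55, 48],
        [75, 70, 78],
        [60, 64, 58],
    ])
    print(f"alpha = {cronbach_alpha(ratings):.3f}")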
Question 3. How consistently do raters in the second session (Reliability) rate the
texts? How do the reliability coefficients compare for the Benchmark and the Reliability Testing?
The results of the reliability raters mirror those of the benchmark raters: the Spanish raters achieve a very good inter-rater reliability coefficient and the Chinese raters an acceptable one, but the inter-rater reliability for the Russian raters is very low (Table 4).
Table 5 (see also Tables 3 and 4) shows that there was a slight drop in inter-rater reliability for the Chinese raters (from the benchmark rating to the reliability rating), but the Spanish raters achieved remarkable inter-rater reliability at both rating sessions. The slight drop among the Russian raters from the first to the second session is negligible; in any case, the inter-rater reliability is too low.
Table 4. Reliability coefficients for reliability ratings

Language   Reliability coefficient
Spanish    .934
Chinese    .780
Russian    .118

Table 5. Reliability coefficients for the Benchmark and the Reliability Testing

Language   Benchmark   Reliability
Spanish    .953        .934
Chinese    .973        .780
Russian    .128        .118
Question 4. How consistently do raters rate each component of the tool? Are there
some test components where there is higher rater reliability?
The coefficients for the Spanish raters show very good reliability, with excellent coefficients for the first three components; the numbers for the Chinese raters are also very good, but the coefficients for the Russian raters are once again low, although some consistency can be identified for the FTA and MEAN components (Table 6).
Table 6. Reliability coefficients for the four components of the tool (all raters per language group)

Language   TL     FTA    MEAN   TERM
Spanish    .952   .929   .926   .848
Chinese    .844   .844   .864   .783
Russian    .367   .479   .492   .292
In sum, very good reliability was obtained for Spanish and Chinese raters, for the
two testing sessions (Benchmark and Reliability Testing) as well as for all components of the tool. Reliability scores for the Russian raters are low. These results are
in agreement with the standard deviation data presented in Tables 1–2, Figures 1a and 1b, and Figure 2. All of this leads us to believe that whatever the cause
for the Russian coefficients, it was not related to the tool itself.
Question 5. Is there a difference in scoring between translators and teachers?
Table 7a and Table 7b show the scoring, in terms of average scores and standard deviations, for the translators and the teachers for all texts. Figures 3 and 4 show the mean scores and times for Spanish raters, comparing teachers and translators.
Table 7a. Average scores and standard deviations for consultants and translators

            Score            Time
Text     Mean    SD      Mean     SD
210      93.3    7.5     75.8    59.4
214      93.3   12.1     94.2   101.4
215      85.0   17.9     36.3    18.3
228      46.7   20.7     37.5    22.3
235      46.7   18.6     49.5    38.9
410      91.4    7.5     46.0    22.1
413      62.9   21.0     40.7    13.7
415      96.4    4.8     26.1    15.4
418      69.3   22.1     52.4    22.2
312      52.5   15.1     26.7     2.6
314      88.3   10.3     22.5     4.2
315      74.2   26.3     28.7     7.8
316      63.3   32.7     25.8     6.6
Table 7b. Average scores and standard deviations for teachers

            Score            Time
Text     Mean    SD      Mean     SD
210      90.0    9.4     63.6    39.7
214      85.0    9.4     67.0    41.8
215      89.0   12.4     36.0    30.5
228      51.0   19.5     38.0    31.7
235      68.0   10.4     57.6    40.2
410      80.0   13.2     61.0    27.7
413      63.3   25.7     71.0    24.6
415      95.0    8.7     41.0    11.5
418      91.7    5.8     44.0     6.6
312      73.3    5.8     55.0    56.7
314      71.7   20.8     47.7    62.7
315      78.3   14.4     37.7    45.5
316      76.7   22.5     46.7    63.5
Figure 3. Mean scores for Spanish raters.
The corresponding data for Chinese appear in Figures 5 and 6, and for Russian in Figures 7 and 8.

Spanish teachers tend to rate somewhat higher than translators (3 out of 5 texts) and spend more time rating (all texts).
As with the Spanish raters, it is interesting to note that Chinese teachers rate either higher than or similarly to translators (Figure 5): only one text obtained lower ratings from teachers than from translators. Timing results also mirror those found for the Spanish subjects: teachers take longer to rate than translators (Figure 6).
Despite the low inter-rater reliability among Russian raters, the same trend found for the Chinese and the Spanish emerged when comparing Russian translators and teachers: Russian teachers rate similarly to or slightly higher than translators, and they clearly spend more time on the rating task than the translators do (Figures 7 and 8). This also mirrors the findings of the pre-pilot and pilot testing (Colina 2008).5
In order to investigate the irregular behavior of the Russian raters and to try to explain their low inter-rater reliability, the correlation between the total score and the recommendation (the field "rec") issued by each rater was considered. This is explored in Table 8. One would expect a relatively high (negative) correlation because of the inverse relationship between a high score and a low recommendation. As illustrated in the three sub-tables below, all Spanish raters, with the exception of SP02PB, show a strong correlation between the recommendation and the total score, ranging from 0.854 (SP01VS) to 0.981 (SP02MC).
Figure 4. Time for Spanish raters.

Figure 5. Mean scores for Chinese raters.

Figure 6. Time for Chinese raters.

Figure 7. Mean scores for Russian raters.

Figure 8. Time for Russian raters.
The results are similar for the Chinese raters, all of whom correlate very highly between the recommendation and the total score, ranging from 0.867 (CH01BJ) to a perfect 1.00 (CH02JG). The results are different for the Russian raters, however.
It appears that three raters (RS01EM, RS02MK, and RS01NM) do not show high correlations between their recommendations and their total scores. A closer look at these raters in particular is warranted, as is a closer look at RS02LB, who was excluded from the correlation analysis due to a lack of variability (the rater uniformly recommended a 2 for all texts, regardless of the total score he or she assigned). The other Russian raters exhibited strong correlations. This result suggests some unusual behavior in the Russian raters, independent of the tool design and tool features, as their scores and overall recommendations do not correlate highly, as expected.
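The check reported in Table 8 can be reproduced with Pearson's correlation between each rater's total scores and recommendations. The sketch below uses invented data; it produces a negative coefficient, matching the expected inverse relationship (the values in Table 8 are presumably magnitudes), and it guards against the RS02LB case, where a constant recommendation leaves the correlation undefined.

    import statistics

    def score_rec_correlation(totals, recs):
        """Pearson correlation between a rater's total scores and
        recommendations; undefined if either series is constant."""
        if len(set(totals)) < 2 or len(set(recs)) < 2:
            return None  # e.g. a rater who recommends 2 for every text
        return statistics.correlation(totals, recs)  # Python 3.10+

    # Invented rater: high totals go with low recommendation codes,
    # so the correlation is strongly negative (about -0.99 here).
    totals = [92, 88, 49, 56, 87]
    recs = [1, 1, 3, 3, 1]
    print(score_rec_correlation(totals, recs))

    # A rater with a constant recommendation has no defined correlation.
    print(score_rec_correlation(totals, [2, 2, 2, 2, 2]))  # None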
Table 8 (3 sub-tables). Correlation between recommendation and total score

8.1 Spanish raters
SP04AR  SP01JC  SP01VS  SP02JA  SP02LA  SP02PB  SP02AB  SP01PC  SP01CC  SP02MC  SP01PS
0.923   0.958   0.854   0.938   0.966   0.421   0.942   0.975   0.913   0.981

8.2 Chinese raters
CH01CK  CH01FL
0.935   1.000   0.955   0.943   0.980   0.996   0.894   0.980   0.867

8.3 Russian raters
RS02LB  RS02MK  RS01SM  RS01NM  RS01RW
0.998   0.115   n/a     0.500   0.500   0.933   1.000   0.982   0.993   0.926   0.938
3. Conclusions
As in Colina (2008), testing showed that the TQA tool exhibits good inter-rater
reliability for all language groups and texts, with the exception of Russian. It was
also shown that the low reliability of the Russian raters' scores is probably due to
factors unrelated to the tool itself. At this point, it is not possible to determine
what these factors may have been; yet further research with Russian teachers and
translators may provide insights about the reasons for the low inter-rater reliability
obtained for this group in the current study. In addition, the findings are in line
with those of Colina (2008) with regard to the rating behavior of translators and
teachers: Although translators and teachers exhibit similar behavior, teachers tend
to spend more time rating and their scores are slightly higher than those of translators. While, in principle, it may appear that translators would be more efficient
raters, one would have to consider the context of evaluation to select an ideal rater
for a particular evaluation task. Because they spent more time rating (and one assumes reflecting on their rating), teachers may be more apt evaluators in a formative context, where feedback is expected from the rater. Teachers may also be better
at reflecting on the nature of the developmental process and therefore better able
to offer more adequate evaluation of a process and/or a translator (versus evaluation of a product). However, when rating involves a product and no feedback is
expected (e.g. industry, translator licensing exams, etc.), a more efficient translator
rater may be more suitable to the task. In sum, the current findings suggest that
professional translators and language teachers could be similarly qualified to assess
translation quality by means of the TQA tool. Which of the two types of professionals is better suited to a specific rating task will probably depend on the
purpose and goal of evaluation. Further research comparing the skills of these two
groups in different evaluation contexts is necessary to confirm this view.
In summary, the results of empirical tests of the functional-componential tool
continue to offer evidence for the proposed approach and to warrant additional
testing and research. Future research needs to focus on testing on a larger scale,
with more subjects and various text types.
Notes
* The research described here was funded by the Robert Wood Johnson Foundation. It was part
of Phase II of the Translation Quality Assessment project of the Hablamos Juntos National
Program. I would like to express my gratitude to the Foundation, to the Hablamos Juntos National Program, and to the Program Director, Yolanda Partida, for their support of translation in
the USA. I owe much gratitude to Yolanda Partida and Felicia Batts for comments, suggestions
References
Bell, Roger T. 1991. Translation and Translating. London: Longman.
Bowker, Lynne. 2001. "Towards a Methodology for a Corpus-Based Approach to Translation Evaluation". Meta 46:2. 345–364.
Cao, Deborah. 1996. "A Model of Translation Proficiency". Target 8:2. 325–340.
Carroll, John B. 1966. "An Experiment in Evaluating the Quality of Translations". Mechanical Translation 9:3–4. 55–66.
Colina, Sonia. 2003. Teaching Translation: From Research to the Classroom. New York: McGraw
Hill.
Colina, Sonia. 2008. "Translation Quality Evaluation: Empirical Evidence for a Functionalist Approach". The Translator 14:1. 97–134.
Gerzymisch-Arbogast, Heidrun. 2001. "Equivalence Parameters and Evaluation". Meta 46:2. 227–242.
Hatim, Basil and Ian Mason. 1997. The Translator as Communicator. London and New York:
Routledge.
Hönig, Hans. 1997. "Positions, Power and Practice: Functionalist Approaches and Translation Quality Assessment". Current Issues in Language and Society 4:1. 6–34.
House, Juliane. 1997. Translation Quality Assessment: A Model Revisited. Tübingen: Narr.
House, Juliane. 2001. "Translation Quality Assessment: Linguistic Description versus Social Evaluation". Meta 46:2. 243–257.
Lauscher, Susanne. 2000. "Translation Quality Assessment: Where Can Theory and Practice Meet?" The Translator 6:2. 149–168.
Neubert, Albrecht. 1985. Text und Translation. Leipzig: Enzyklopädie.
Nida, Eugene. 1964. Toward a Science of Translation. Leiden: Brill.
Nida, Eugene and Charles Taber. 1969. The Theory and Practice of Translation. Leiden: Brill.
Nord, Christiane. 1997. Translating as a Purposeful Activity: Functionalist Approaches Explained. Manchester: St. Jerome.
PACTE. 2008. "First Results of a Translation Competence Experiment: Knowledge of Translation and Efficacy of the Translation Process". John Kearns, ed. Translator and Interpreter Training: Issues, Methods and Debates. London and New York: Continuum. 104–126.
Reiss, Katharina. 1971. Möglichkeiten und Grenzen der Übersetzungskritik. München: Hueber.
Reiss, Katharina and Hans Vermeer. 1984. Grundlegung einer allgemeinen Translationstheorie. Tübingen: Niemeyer.
Van den Broeck, Raymond. 1985. "Second Thoughts on Translation Criticism: A Model of its Analytic Function". Theo Hermans, ed. The Manipulation of Literature: Studies in Literary Translation. London and Sydney: Croom Helm. 54–62.
Williams, Malcolm. 2001. "The Application of Argumentation Theory to Translation Quality Assessment". Meta 46:2. 326–344.
Williams, Malcolm. 2004. Translation Quality Assessment: An Argumentation-Centered Approach. Ottawa: University of Ottawa Press.
Résumé

Colina (2008) proposes a componential and functional approach to the evaluation of translation quality and reports on the results of a pilot test of a tool designed for this approach. The results show a high rate of inter-rater reliability and justify continued testing. This article presents an experiment designed to test the approach as well as the tool. Data were collected during two rounds of testing. A group of 30 raters, composed of Spanish, Chinese, and Russian translators and teachers, evaluated 4 or 5 translated texts each. The results show that the tool provides a good rate of inter-rater reliability for all language and text groups, with the exception of Russian; they also suggest that the low reliability of the Russian raters' scores is unrelated to the tool itself. These findings confirm those of Colina (2008).
Appendix 1: Tool

Benchmark Rating Session

Time Rating Starts:
Delivery Date:

TRANSLATION BRIEF
Source Language:
Target Language: Spanish, Russian, Chinese
Text Type:
Text Title:
Target Audience:
Purpose of Document:
Date Completed:
Contact Information:
Date Received:
Total Score:
Notes/Recommended Edits:
RATING INSTRUCTIONS:
1. Carefully read the instructions for the review of the translated text. Your decisions and evaluation should be
based on these instructions only.
2. Check the description that best fits the text given in each one of the categories.
3. It is recommended that you read the target text without looking at the English and score the Target
Language and Functional categories.
4. Examples or comments are not required, but they can be useful to help support your decisions or to provide
rationale for your descriptor selection.
1. TARGET LANGUAGE (check one box)

1.a The translation reveals serious language proficiency issues: ungrammatical use of the target language, spelling mistakes. The translation is written in some sort of third language (neither the source nor the target). The structure of the source language dominates to the extent that it cannot be considered a sample of target language text. The amount of transfer from the source cannot be justified by the purpose of the translation. The text is extremely difficult to read, bordering on being incomprehensible.

1.b The text contains some unnecessary transfer of elements/structure from the source text. The structure of the source language shows up in the translation and affects its readability. The text is hard to comprehend.

1.c Although the target text is generally readable, there are problems and awkward expressions resulting, in most cases, from unnecessary transfer from the source text.

1.d The translated text reads similarly to texts originally written in the target language that respond to the same purpose, audience and text type as those specified for the translation in the brief. Problems/awkward expressions are minimal, if existent at all.

Examples/Comments:
2. FUNCTIONAL AND TEXTUAL ADEQUACY (check one box)

2.a Disregard for the goals, purpose, function and audience of the text. The text was translated without considering textual units, textual purpose, genre, or the needs of the audience (cultural, linguistic, etc.). Cannot be repaired with revisions.

2.b The translated text gives some consideration to the intended purpose and audience for the translation, but misses some important aspect/s of it (e.g. level of formality, some aspect of its function, needs of the audience, cultural considerations, etc.). Repair requires effort.

2.c The translated text approximates to the goals, purpose (function) and needs of the intended audience, but it is not as efficient as it could be, given the restrictions and instructions for the translation. Can be repaired with suggested edits.

2.d The translated text accurately accomplishes the goals, purpose (function: informative, expressive, persuasive) set for the translation and intended audience (including level of formality). It also attends to the cultural needs and characteristics of the audience. Minor or no edits needed.

Examples/Comments:
3. NON-SPECIALIZED CONTENT-MEANING (check one box)

3.a The translation reflects or contains important unwarranted deviations from the original. It contains inaccurate renditions and/or important omissions and additions that cannot be justified by the instructions. Very defective comprehension of the original text.

3.b There have been some changes in meaning, omissions or/and additions that cannot be justified by the translation instructions. Translation shows some misunderstanding of the original and/or translation instructions.

3.c Minor alterations in meaning, additions or omissions.

3.d

Examples/Comments:
4. SPECIALIZED CONTENT AND TERMINOLOGY (check one box)

4.a

4.b Serious/frequent mistakes involving terminology and/or specialized content.

4.c A few terminological errors, but the specialized content is not seriously affected.

4.d Accurate and appropriate rendition of the terminology. It reflects a good command of terms and content specific to the subject.

Examples/Comments:
TOTAL SCORE:
SCORING WORKSHEET

Component: Target Language
Category   Value   Score
1.a
1.b        15
1.c        25
1.d        30

Component: Functional and Textual Adequacy
Category   Value   Score
2.a
2.b        10
2.c        20
2.d        25

Component: Non-Specialized Content
Category   Value   Score
3.a
3.b        10
3.c        20
3.d        25

Component: Specialized Content and Terminology
Category   Value   Score
4.a
4.b        10
4.c        15
4.d        20
Tally Sheet

Component                                 Category Rating   Score Value
Target Language
Functional and Textual Adequacy
Non-Specialized Content
Specialized Content and Terminology
Author's address
Sonia Colina
Department of Spanish and Portuguese
The University of Arizona
Modern Languages 545
Tucson, AZ 85721-0067
United States of America
scolina@email.arizona.edu